Monitoring.
admin
2,930 Posts
November 30, 2018, 9:31 pm
Is this happening all the time? Is it consistent when joining from a specific node?
msalem
87 Posts
December 1, 2018, 10:37 am
Well, it keeps happening, but I noticed that when I add the backend IPs it takes some time for them to be reflected, and that is why it fails. If I wait for a minute or so, it accepts them.
https://ibb.co/pXRG8GY
Another issue I keep noticing: adding the disks in the backend always gives an error while running the dd command.
https://ibb.co/NF6RY66
admin
2,930 Posts
December 1, 2018, 11:47 am
The network delay does not sound normal. Even if we bypassed the connection check, the node would still fail during deployment since it cannot connect. Waiting a minute may be masking a real networking issue; for example, it may surface as a problem during path failover or node boot. Also, even after the ping responded, the latencies are higher than normal.
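If you want to sanity-check the backend latency yourself, a quick test from one node to another looks roughly like this (the address is just a placeholder for one of your backend IPs):
ping -c 20 -i 0.2 10.0.2.12   # 20 pings, 200 ms apart; on a healthy backend link the average rtt should be well under 1 ms
ping -c 5 -s 8972 -M do 10.0.2.12   # only if you use jumbo frames: 9000-byte MTU payload with the don't-fragment bit set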
For the disk issue, if the disk gets added correctly as an OSD, I would ignore this. The dd command is known to output warning messages as errors, and the disk is being wiped several times over while the OS may be updating partition table info as dd is running. So if it adds fine, just ignore this message.
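For reference, a wipe along these lines is what typically produces those messages; this is only an illustration, not necessarily the exact commands PetaSAN runs internally, and /dev/sdX is a placeholder for your disk:
dd if=/dev/zero of=/dev/sdX bs=1M count=20 oflag=direct   # zero the first 20 MiB, clearing the partition table and any old metadata
partprobe /dev/sdX   # re-read the partition table; if this overlaps with the dd you can see busy/warning messages
If the OSD shows up and comes online afterwards, those warnings did no harm.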
msalem
87 Posts
December 2, 2018, 6:01 am
We only noticed this with the new 2.2; we installed the old version many times for our POC on the same servers and it works fine.
Now I have managed to add the nodes; however, I randomly get OSDs dropping and I cannot re-add them. We will install CentOS or another OS, wipe out all the disks again, and retry the installation, hoping it will fix this issue.
Dear PetaSAN user;
Cluster has one or more osd failures, please check the following osd(s):
- osd.20/srocceph2
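In case it helps, these are the kind of commands I can run on srocceph2 to gather more detail on the failed OSD (osd.20 from the alert above; I am assuming the standard Ceph systemd service names here):
ceph osd tree | grep -i down   # which OSDs the cluster currently sees as down
systemctl status ceph-osd@20   # state of the osd.20 daemon on this node
journalctl -u ceph-osd@20 --since "-1h"   # recent log lines from the OSD daemon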
admin
2,930 Posts
December 2, 2018, 12:19 pm
Since pings also have a problem, this is a low-level network issue: either something with the networking itself (NICs/cables/switches/switch setup) or, less likely, if it only happens with 2.2, the NIC driver in the new kernel (v4.12, based on SUSE SLE 15).
Can you try the setup using a different switch that is isolated from your production network, with only the PetaSAN nodes connected to it? Check the cables, make sure the switch works when connected to something else, then do a fresh 2.2 setup with no jumbo frames/bonding/VLANs and see if you still have issues. If you do, do you see any errors in the kernel log via dmesg?
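Something along these lines is usually enough to spot driver or link problems; eth0 is just an example, substitute your backend interface:
dmesg -T | grep -iE "link|reset|timeout|firmware|error"   # kernel messages about the NIC, link flaps, driver resets
ethtool -S eth0 | grep -iE "err|drop"   # per-driver error and drop counters
ip -s link show eth0   # RX/TX error counters as the kernel sees them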
Last edited on December 2, 2018, 12:20 pm by admin · #15
msalem
87 Posts
February 7, 2019, 11:10 am
Sure thing, I would like to start with the networking layer. But for us to be 100% sure, I will need the steps from you to clean the cluster again, so I can reproduce the issue on both the old and the new switch, please.
Thank you
admin
2,930 Posts
February 7, 2019, 12:05 pm
I am not sure what you mean by clean the cluster. If you want to start from scratch, you can re-install and select the option to install a new cluster. Else, if you want to retain the cluster but wipe the data, you can just delete the pools and add new pools.
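The pool route is normally done from the UI, but for reference the underlying Ceph commands look roughly like this (the pool name and PG counts are just examples, and the monitors must be configured to allow pool deletion):
ceph osd pool delete rbd rbd --yes-i-really-really-mean-it   # requires mon_allow_pool_delete=true; destroys all data in the pool
ceph osd pool create rbd 256 256 replicated   # recreate the pool with example pg_num/pgp_num values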