Monitoring.
admin
2,930 Posts
November 30, 2018, 9:31 pm
Is this happening all the time? Is it consistent when joining from a specific node?
msalem
87 Posts
December 1, 2018, 10:37 am
Well, it keeps happening, but I noticed that when I add the backend IPs it takes some time for them to be reflected, and that is why it fails. If I wait for a minute or so, it accepts them.
https://ibb.co/pXRG8GY
Another issue I keep noticing: adding the disks in the backend always gives an error while running the dd command.
https://ibb.co/NF6RY66
admin
2,930 Posts
December 1, 2018, 11:47 am
The network delay does not sound normal. Even if we bypassed the connection check, the node would still fail during deployment since it cannot connect. Waiting a minute may be masking a real networking issue; for example, it may surface as a problem during path failover or node boot. Also, even after the ping responded, the latencies are higher than normal.
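If you want to sanity-check the backend latency yourself, a quick test from one node to another looks roughly like this (the address is just a placeholder for one of your backend IPs):
ping -c 20 -i 0.2 10.0.2.12   # 20 pings, 200 ms apart; on a healthy backend link the average rtt should be well under 1 ms
ping -c 5 -s 8972 -M do 10.0.2.12   # only if you use jumbo frames: 9000-byte MTU payload with the don't-fragment bit set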
For the disk issue, if the disk gets added correctly as an OSD, I would ignore this. The dd command is known to output warning messages as errors, and the disk is being wiped several times over while the OS may be updating partition table info as dd is running. So if it adds fine, just ignore this message.
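For reference, a wipe along these lines is what typically produces those messages; this is only an illustration, not necessarily the exact commands PetaSAN runs internally, and /dev/sdX is a placeholder for your disk:
dd if=/dev/zero of=/dev/sdX bs=1M count=20 oflag=direct   # zero the first 20 MiB, clearing the partition table and any old metadata
partprobe /dev/sdX   # re-read the partition table; if this overlaps with the dd you can see busy/warning messages
If the OSD shows up and comes online afterwards, those warnings did no harm.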
msalem
87 Posts
December 2, 2018, 6:01 am
We only noticed this with the new 2.2; we installed the old version many times for our POC on the same servers and it works fine.
Now I have managed to add the nodes; however, I randomly get OSDs dropping and I cannot re-add them. We will install CentOS or another OS, wipe out all the disks again, and retry the installation, hoping it will fix this issue.
Dear PetaSAN user;
Cluster has one or more osd failures, please check the following osd(s):
- osd.20/srocceph2
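In case it helps, these are the kind of commands I can run on srocceph2 to gather more detail on the failed OSD (osd.20 from the alert above; I am assuming the standard Ceph systemd service names here):
ceph osd tree | grep -i down   # which OSDs the cluster currently sees as down
systemctl status ceph-osd@20   # state of the osd.20 daemon on this node
journalctl -u ceph-osd@20 --since "-1h"   # recent log lines from the OSD daemon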
admin
2,930 Posts
December 2, 2018, 12:19 pm
Since pings also have a problem, this is a low-level network issue: either something with the networking itself (NICs/cables/switches/switch setup) or, less likely, if it only happens with 2.2, the NIC driver in the new kernel (v4.12, based on SUSE SLE 15).
Can you try the setup using a different switch that is isolated from your production network, with only the PetaSAN nodes connected to it? Check the cables, make sure the switch works when connected to something else, then do a fresh 2.2 setup with no jumbo frames/bonding/VLANs and see if you still have issues. If you do, do you see any errors in the kernel log via dmesg?
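Something along these lines is usually enough to spot driver or link problems; eth0 is just an example, substitute your backend interface:
dmesg -T | grep -iE "link|reset|timeout|firmware|error"   # kernel messages about the NIC, link flaps, driver resets
ethtool -S eth0 | grep -iE "err|drop"   # per-driver error and drop counters
ip -s link show eth0   # RX/TX error counters as the kernel sees them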
Last edited on December 2, 2018, 12:20 pm by admin · #15
msalem
87 Posts
February 7, 2019, 11:10 am
Sure thing, I would like to start with the networking layer. But for us to be 100% sure, I will need the steps from you to clean the cluster again, so I can reproduce the issue on both the old and the new switch, please.
Thank you
admin
2,930 Posts
February 7, 2019, 12:05 pm
I am not sure what you mean by clean the cluster. If you want to start from scratch, you can re-install and select the option to install a new cluster. Else, if you want to retain the cluster but wipe the data, you can just delete the pools and add new pools.
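The pool route is normally done from the UI, but for reference the underlying Ceph commands look roughly like this (the pool name and PG counts are just examples, and the monitors must be configured to allow pool deletion):
ceph osd pool delete rbd rbd --yes-i-really-really-mean-it   # requires mon_allow_pool_delete=true; destroys all data in the pool
ceph osd pool create rbd 256 256 replicated   # recreate the pool with example pg_num/pgp_num values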