Adding Node to Cluster ALWAYS fails the 1st time when using VLANs
ghbiz
76 Posts
February 16, 2019, 10:19 pm
Steps to reproduce:
- Node1: Create a NEW cluster. Use VLANs 111 and 222 (or whatever) for the backend networks.
- Node2: Add to the existing cluster by entering the management IP of Node1 and the password.
At this point, the setup for Node2, 3, 4, 5, 6, 7, 8, 9, ... will FAIL 100% of the time to communicate across the backend networks on the VLAN interfaces.
To fix the above issue, simply go through the SAME setup steps a second time and the node will succeed in joining the cluster.
Final thoughts: it looks like PetaSAN tries to communicate before the dot1q tagged interface is brought up.
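To illustrate the theory (this is just a rough Python sketch, not PetaSAN's actual code; the interface name eth1.111 is only an example for VLAN 111), a check like the following would have to pass before any backend communication is attempted:

#!/usr/bin/env python3
# Rough sketch, not PetaSAN code: verify that a dot1q sub-interface is
# actually up before trying to reach the other nodes over it.
import time

def wait_for_link(iface, timeout=60):
    """Poll /sys/class/net/<iface>/operstate until it reports 'up'."""
    deadline = time.time() + timeout
    path = "/sys/class/net/{}/operstate".format(iface)
    while time.time() < deadline:
        try:
            with open(path) as f:
                if f.read().strip() == "up":
                    return True
        except FileNotFoundError:
            pass  # sub-interface not created yet
        time.sleep(1)
    return False

if __name__ == "__main__":
    if not wait_for_link("eth1.111"):  # example name for VLAN 111
        raise SystemExit("VLAN interface never came up")
    print("eth1.111 is up, backend communication should now be possible")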
admin
2,930 Posts
February 17, 2019, 3:23 pm
We cannot reproduce this. Can you give more detail on which step/page it fails at on Node 2: is it immediately after you add the IP/password? After you add the host backend IPs? Or at the very end, after you define the node roles and disks?
If it happens at the end, can you try again, SSH or open a shell on the node after you set the backend IPs, and before proceeding try to ping Node 1 using both the management and backend IPs? It could be a timing issue: right after the interfaces are set, connections/pings may not work immediately, but perhaps if you wait a few seconds and then proceed, things work. Do the pings show good latency?
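Something along these lines would show whether it is a timing issue (a rough sketch only; the two IPs are placeholders, substitute Node 1's actual management and backend addresses):

#!/usr/bin/env python3
# Sketch of the manual check suggested above: from the node being added,
# ping node 1 on the management and backend IPs and report how long each
# takes to start answering. IP addresses are placeholders.
import subprocess, time

TARGETS = {"management": "10.0.0.1", "backend": "192.168.111.1"}

def ping_once(ip):
    """Return True if a single ping gets a reply within 1 second."""
    return subprocess.call(["ping", "-c", "1", "-W", "1", ip],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0

for name, ip in TARGETS.items():
    start = time.time()
    ok = ping_once(ip)
    while not ok and time.time() - start < 90:
        time.sleep(1)
        ok = ping_once(ip)
    print("{} ({}): {} after ~{:.0f}s".format(
        name, ip, "reachable" if ok else "still unreachable",
        time.time() - start))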
Last edited on February 17, 2019, 3:36 pm by admin · #2
ghbiz
76 Posts
February 18, 2019, 3:48 pm
This happens when the cluster is testing communication from node X to Node 1 over the backend IPs.
I agree this may simply be a timing issue, where the setup needs to wait for the sub-interfaces to come up on their respective VLANs.
There is another factor that may be contributing: STP (Spanning Tree) on the switches. Once the interface comes up, STP still keeps the dot1q interface in blocking mode for 30 seconds to make sure there are no layer-2 STP loops.
Forcing a 45-second WAIT may reduce how often this happens on STP-aware switches.
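The wait would not have to be a fixed 45-second sleep; a bounded poll that proceeds as soon as the peer answers would only pay the full delay when STP is actually blocking. A minimal sketch of the idea (the peer IP and the 45-second bound are just examples, not anything PetaSAN currently does):

#!/usr/bin/env python3
# Sketch: poll the backend peer and continue as soon as it replies,
# giving up after the worst-case STP forwarding delay.
import subprocess, time

def wait_for_peer(ip, max_wait=45):
    """Ping once per second; return True as soon as the peer replies."""
    deadline = time.time() + max_wait
    while time.time() < deadline:
        if subprocess.call(["ping", "-c", "1", "-W", "1", ip],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0:
            return True
        time.sleep(1)
    return False

if wait_for_peer("192.168.111.1"):  # example backend IP of Node 1
    print("backend reachable -- safe to continue joining the cluster")
else:
    print("still unreachable after 45s -- report the error as today")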
admin
2,930 Posts
February 18, 2019, 4:48 pm
That is too much of a delay. We do a lot of IP failover when a node fails, and such a delay will surely cause I/O to fail.
ghbiz
76 Posts
February 18, 2019, 9:58 pm
Hello, maybe you misunderstood me.
The issue above occurs ONLY during the initial setup, when adding the node to a cluster.
STP is inherent to all VLAN-based switches. Our observation is simply that when we add a node and use VLAN interfaces for the backend, we see an error that the node can NOT communicate with Node 1 or any of the management nodes. When that happens, we simply reload the page, start the addition over again, and then it goes through. Over SSH we see that, the second time around, the backend interfaces on the node being added are already pinging and working.
admin
2,930 Posts
February 18, 2019, 10:19 pm
My concern is: do you need 45 seconds after configuring an IP address on a VLAN subnet for it to work? If so, this could be a problem even after deployment, during production: if a host fails for any reason, we re-assign its iSCSI IPs to the other nodes (IP failover), and we cannot afford a 45-second delay before those IPs become functional.
Last edited on February 18, 2019, 10:20 pm by admin · #6
ghbiz
76 Posts
February 20, 2019, 4:50 am
This is only an issue while adding a new machine to the cluster. Once the cluster is up, there is no issue.
As for STP, the 30-second check is standard and only occurs when the server/interface first comes up. It does not affect production traffic beyond that.