LACP bug

Hello,

Today one of the SFP+ modules in our backend1/backend2 LACP channel went down. By design nothing should happen, because there is still one link left, but the whole backend network went down and the node logged a lot of errors. The cluster was not usable until we rebooted this node.

The node came back up after the reboot with only one LACP link in the bond, but it now runs without any problem.

We have since tested on other nodes what happens when we remove one link from the backend LACP, and it is always the same: the whole backend connection is lost and Consul shuts down the server in these tests. (We don't know why Consul did not shut down the server during the real incident.)

We can reproduce this behavior on all 3 tested nodes.
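
For reference, the link loss can be simulated roughly like this from the node itself (a sketch only; assuming the backend bond is bond1 and one of its slaves is eth1, adjust to your setup):

# take one LACP member down and watch how the bond reacts
ip link set eth1 down
cat /proc/net/bonding/bond1

# bring the link back when done
ip link set eth1 up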

Any ideas?

 

 

Aug 27 12:49:41 ps02-node01 kernel: [2205628.519611] i40e 0000:17:00.1 eth1: NIC Link is Down

Aug 27 12:49:41 ps02-node01 kernel: [2205628.519833] i40e 0000:17:00.1 eth1: speed changed to 0 for port eth1

Aug 27 12:49:41 ps02-node01 kernel: [2205628.580495] bond1: link status definitely down for interface eth1, disabling it

Aug 27 12:49:41 ps02-node01 kernel: [2205628.684325] bond1: link status up again after 0 ms for interface eth3

Aug 27 12:49:43 ps02-node01 consul[2950]: memberlist: Suspect ps02-node03 has failed, no acks received

Aug 27 12:49:43 ps02-node01 consul[2950]: memberlist: Refuting a suspect message (from: ps02-node02)

Aug 27 12:49:43 ps02-node01 consul[2950]: raft: Failed to contact 192.168.52.13:8300 in 2.500323239s

Aug 27 12:49:43 ps02-node01 consul[2950]: raft: Failed to contact 192.168.52.12:8300 in 2.508541769s

Aug 27 12:49:43 ps02-node01 consul[2950]: raft: Failed to contact 192.168.52.13:8300 in 2.510603821s

Aug 27 12:49:43 ps02-node01 consul[2950]: raft: Failed to contact quorum of nodes, stepping down

Aug 27 12:49:43 ps02-node01 consul[2950]: raft: Node at 192.168.52.11:8300 [Follower] entering Follower state (Leader: "")

Aug 27 12:49:43 ps02-node01 consul[2950]: consul.coordinate: Batch update failed: leadership lost while committing log

Aug 27 12:49:43 ps02-node01 consul[2950]: consul: cluster leadership lost

Aug 27 12:49:43 ps02-node01 consul[2950]: raft: aborting pipeline replication to peer {Voter 192.168.52.13:8300 192.168.52.13:8300}

Aug 27 12:49:43 ps02-node01 consul[2950]: raft: aborting pipeline replication to peer {Voter 192.168.52.12:8300 192.168.52.12:8300}

Aug 27 12:49:46 ps02-node01 consul[2950]: memberlist: Suspect ps02-node02 has failed, no acks received

Aug 27 12:49:48 ps02-node01 consul[2950]: memberlist: Marking ps02-node03 as failed, suspect timeout reached (0 peer confirmations)

Aug 27 12:49:48 ps02-node01 consul[2950]: serf: EventMemberFailed: ps02-node03 192.168.52.13

Aug 27 12:49:48 ps02-node01 consul[2950]: consul: Removing LAN server ps02-node03 (Addr: tcp/192.168.52.13:8300) (DC: petasan)

Aug 27 12:49:51 ps02-node01 consul[2950]: memberlist: Marking ps02-node02 as failed, suspect timeout reached (0 peer confirmations)

Aug 27 12:49:51 ps02-node01 consul[2950]: serf: EventMemberFailed: ps02-node02 192.168.52.12

Aug 27 12:49:51 ps02-node01 consul[2950]: memberlist: Suspect ps02-node03 has failed, no acks received

Aug 27 12:49:51 ps02-node01 consul[2950]: consul: Removing LAN server ps02-node02 (Addr: tcp/192.168.52.12:8300) (DC: petasan)

Aug 27 12:49:51 ps02-node01 consul[2950]: raft: Failed to heartbeat to 192.168.52.13:8300: read tcp 192.168.52.11:33572->192.168.52.13:8300: i/o timeout

Aug 27 12:49:51 ps02-node01 consul[2950]: raft: Failed to heartbeat to 192.168.52.12:8300: read tcp 192.168.52.11:52040->192.168.52.12:8300: i/o timeout

Aug 27 12:49:52 ps02-node01 kernel: [2205639.535978] ABORT_TASK: Found referenced iSCSI task_tag: 27614526

Aug 27 12:49:52 ps02-node01 consul[2950]: http: Request GET /v1/kv/PetaSAN/Sessions/9fc222a8-1e78-4a03-ad03-e38e8c35f183/_exp, error: No cluster leader from=127.0.0.1:49848

Aug 27 12:49:53 ps02-node01 consul[2950]: raft: Heartbeat timeout from "" reached, starting election

Aug 27 12:49:53 ps02-node01 consul[2950]: raft: Node at 192.168.52.11:8300 [Candidate] entering Candidate state in term 1183

Aug 27 12:49:54 ps02-node01 consul[2950]: agent: coordinate update error: No cluster leader

Aug 27 12:49:55 ps02-node01 consul[2950]: yamux: keepalive failed: i/o deadline reached

Aug 27 12:49:55 ps02-node01 consul[2950]: consul.rpc: multiplex conn accept failed: keepalive timeout from=192.168.52.13:39120

Aug 27 12:49:58 ps02-node01 ceph-osd[6583]: 2018-08-27 12:49:58.150273 7f753e9ec700 -1 osd.33 1229 heartbeat_check: no reply from 192.168.52.13:6815 osd.0 since back 2018-08-27 12:49:37.974797 front 2018-08-27 12:49:37.974797 (cutoff 2018-08-27 12:49:38.150271)

 

 

 

We have tested this on our hardware and it works. It could be many things aside from a bug: switch settings, NIC hardware (sometimes bonding NICs of different hardware does not work well), and lastly the i40e NIC driver itself.

Can you try to configure bonding on the nodes manually yourself and see if this solves the issue? If it does not, then it is probably one of the above issues.
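
For example, a minimal manual test bond with iproute2 could look roughly like this (a sketch only: the bond name is a placeholder, eth1/eth3 are taken from your logs, and this is a manual test outside what PetaSAN configures for you):

# build a test LACP (802.3ad) bond by hand
modprobe bonding
ip link add bond_test type bond mode 802.3ad miimon 100 lacp_rate fast
ip link set eth1 down
ip link set eth1 master bond_test
ip link set eth3 down
ip link set eth3 master bond_test
ip link set bond_test up

# then move the backend IP onto it, e.g. the address from your logs
ip addr add 192.168.52.11/24 dev bond_test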

You can also check the current settings done via PetaSAN:

# check your IPs are mapped to bonds via
ip addr

# list the nics assigned to bonds:
cat /sys/class/net/BOND_NAME/bonding/slaves
cat /proc/net/bonding/BOND_NAME

# check the mode is set to 802.3ad (LACP)
cat /sys/class/net/BOND_NAME/bonding/mode
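
If the mode is right, it is also worth comparing the other bond parameters with what the switch side expects (a sketch; BOND_NAME is a placeholder as above):

# miimon, LACP rate and hash policy as seen by the bonding driver
cat /sys/class/net/BOND_NAME/bonding/miimon
cat /sys/class/net/BOND_NAME/bonding/lacp_rate
cat /sys/class/net/BOND_NAME/bonding/xmit_hash_policy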

 

Then it must be the driver: the config looks good, and the switch config is also OK. Will you release 2.1.0 with the latest i40e driver?

BonsaiJoe, 

While I don't disregard your LACP methodology, as I too was looking at using LACP on an Arista MLAG / Cisco vPC topology, the MPIO concept built into iSCSI warrants simplicity over complexity.

In other words, as the admin has proposed, there are multiple modes to bond interfaces; LACP (802.3ad) is simply one of them. No matter which bonding method you use, make sure that the switch and the server agree on it.
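
For example, on the Linux side you can confirm that the bond really negotiated LACP with the switch (a sketch, assuming the bond is bond1; the exact proc output layout varies by kernel):

# mode should report 802.3ad, and the 802.3ad section of the proc file
# should show real partner (switch) LACP details, not all-zero values
cat /sys/class/net/bond1/bonding/mode
cat /proc/net/bonding/bond1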