LACP bug
BonsaiJoe
53 Posts
August 27, 2018, 3:35 pm
Hello,
Today one of the SFP+ modules in our backend1 and backend2 LACP channel went down. By design nothing should happen, because there is still one link left, but the whole backend network went down and the node ran into a lot of errors. The cluster was not usable until we rebooted this node.
After the reboot the node came up with only 1 LACP link in the bond, but it now runs without any problem.
We have since tested on other nodes what happens when we remove one link from the backend LACP bond, and it is always the same: the node loses the whole backend connection and Consul shuts down the server in these tests. (We don't know why Consul did not shut down the server during the real incident.)
We can reproduce this behavior on the 3 nodes we tested.
Any idea?
Aug 27 12:49:41 ps02-node01 kernel: [2205628.519611] i40e 0000:17:00.1 eth1: NIC Link is Down
Aug 27 12:49:41 ps02-node01 kernel: [2205628.519833] i40e 0000:17:00.1 eth1: speed changed to 0 for port eth1
Aug 27 12:49:41 ps02-node01 kernel: [2205628.580495] bond1: link status definitely down for interface eth1, disabling it
Aug 27 12:49:41 ps02-node01 kernel: [2205628.684325] bond1: link status up again after 0 ms for interface eth3
Aug 27 12:49:43 ps02-node01 consul[2950]: memberlist: Suspect ps02-node03 has failed, no acks received
Aug 27 12:49:43 ps02-node01 consul[2950]: memberlist: Refuting a suspect message (from: ps02-node02)
Aug 27 12:49:43 ps02-node01 consul[2950]: raft: Failed to contact 192.168.52.13:8300 in 2.500323239s
Aug 27 12:49:43 ps02-node01 consul[2950]: raft: Failed to contact 192.168.52.12:8300 in 2.508541769s
Aug 27 12:49:43 ps02-node01 consul[2950]: raft: Failed to contact 192.168.52.13:8300 in 2.510603821s
Aug 27 12:49:43 ps02-node01 consul[2950]: raft: Failed to contact quorum of nodes, stepping down
Aug 27 12:49:43 ps02-node01 consul[2950]: raft: Node at 192.168.52.11:8300 [Follower] entering Follower state (Leader: "")
Aug 27 12:49:43 ps02-node01 consul[2950]: consul.coordinate: Batch update failed: leadership lost while committing log
Aug 27 12:49:43 ps02-node01 consul[2950]: consul: cluster leadership lost
Aug 27 12:49:43 ps02-node01 consul[2950]: raft: aborting pipeline replication to peer {Voter 192.168.52.13:8300 192.168.52.13:8300}
Aug 27 12:49:43 ps02-node01 consul[2950]: raft: aborting pipeline replication to peer {Voter 192.168.52.12:8300 192.168.52.12:8300}
Aug 27 12:49:46 ps02-node01 consul[2950]: memberlist: Suspect ps02-node02 has failed, no acks received
Aug 27 12:49:48 ps02-node01 consul[2950]: memberlist: Marking ps02-node03 as failed, suspect timeout reached (0 peer confirmations)
Aug 27 12:49:48 ps02-node01 consul[2950]: serf: EventMemberFailed: ps02-node03 192.168.52.13
Aug 27 12:49:48 ps02-node01 consul[2950]: consul: Removing LAN server ps02-node03 (Addr: tcp/192.168.52.13:8300) (DC: petasan)
Aug 27 12:49:51 ps02-node01 consul[2950]: memberlist: Marking ps02-node02 as failed, suspect timeout reached (0 peer confirmations)
Aug 27 12:49:51 ps02-node01 consul[2950]: serf: EventMemberFailed: ps02-node02 192.168.52.12
Aug 27 12:49:51 ps02-node01 consul[2950]: memberlist: Suspect ps02-node03 has failed, no acks received
Aug 27 12:49:51 ps02-node01 consul[2950]: consul: Removing LAN server ps02-node02 (Addr: tcp/192.168.52.12:8300) (DC: petasan)
Aug 27 12:49:51 ps02-node01 consul[2950]: raft: Failed to heartbeat to 192.168.52.13:8300: read tcp 192.168.52.11:33572->192.168.52.13:8300: i/o timeout
Aug 27 12:49:51 ps02-node01 consul[2950]: raft: Failed to heartbeat to 192.168.52.12:8300: read tcp 192.168.52.11:52040->192.168.52.12:8300: i/o timeout
Aug 27 12:49:52 ps02-node01 kernel: [2205639.535978] ABORT_TASK: Found referenced iSCSI task_tag: 27614526
Aug 27 12:49:52 ps02-node01 consul[2950]: http: Request GET /v1/kv/PetaSAN/Sessions/9fc222a8-1e78-4a03-ad03-e38e8c35f183/_exp, error: No cluster leader from=127.0.0.1:49848
Aug 27 12:49:53 ps02-node01 consul[2950]: raft: Heartbeat timeout from "" reached, starting election
Aug 27 12:49:53 ps02-node01 consul[2950]: raft: Node at 192.168.52.11:8300 [Candidate] entering Candidate state in term 1183
Aug 27 12:49:54 ps02-node01 consul[2950]: agent: coordinate update error: No cluster leader
Aug 27 12:49:55 ps02-node01 consul[2950]: yamux: keepalive failed: i/o deadline reached
Aug 27 12:49:55 ps02-node01 consul[2950]: consul.rpc: multiplex conn accept failed: keepalive timeout from=192.168.52.13:39120
Aug 27 12:49:58 ps02-node01 ceph-osd[6583]: 2018-08-27 12:49:58.150273 7f753e9ec700 -1 osd.33 1229 heartbeat_check: no reply from 192.168.52.13:6815 osd.0 since back 2018-08-27 12:49:37.974797 front 2018-08-27 12:49:37.974797 (cutoff 2018-08-27 12:49:38.150271)
admin
2,930 Posts
August 27, 2018, 5:55 pm
We have tested this on our hardware and it works. It could be many things aside from a bug: switch settings, NIC hardware (sometimes bonding NICs of different hardware does not work well), and lastly the i40e NIC driver itself.
Can you try configuring bonding on the nodes manually and see if this solves the issue? If it does not, then it is probably one of the above.
You can also check the current settings created by PetaSAN:
# check your IPs are mapped to bonds
ip addr
# list the NICs assigned to a bond
cat /sys/class/net/BOND_NAME/bonding/slaves
cat /proc/net/bonding/BOND_NAME
# check the mode is set to 802.3ad (LACP)
cat /sys/class/net/BOND_NAME/bonding/mode
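For a manual bond to compare against, a minimal sketch with iproute2 could look like the following. The names bond1/eth1/eth3 and the /24 netmask are examples (the IP 192.168.52.11 is taken from your logs); adjust everything to your environment and try it on a test node, since PetaSAN manages these interfaces itself and the bond may already exist:
# create an 802.3ad bond by hand (example only, run as root)
modprobe bonding
ip link add bond1 type bond mode 802.3ad miimon 100 lacp_rate fast
ip link set eth1 down
ip link set eth3 down
ip link set eth1 master bond1
ip link set eth3 master bond1
ip link set bond1 up
ip addr add 192.168.52.11/24 dev bond1
# pull one link and watch whether the bond keeps running on the remaining slave
cat /proc/net/bonding/bond1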
BonsaiJoe
53 Posts
August 27, 2018, 7:55 pm
Then it is the driver; the config looks good and the switch config is also ok... Will you release 2.1.0 with the latest i40e driver?
Last edited on August 27, 2018, 7:56 pm by BonsaiJoe
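For reference, the i40e driver version currently in use can be checked with standard tools (the interface name eth1 here is just an example from the logs above):
# driver name, version and NIC firmware as reported by the kernel
ethtool -i eth1
# version of the i40e module available on disk
modinfo i40e | grep -i ^version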
ghbiz
76 Posts
February 24, 2019, 3:34 am
BonsaiJoe,
While I don't disregard your LACP methodology, as I too was looking at using LACP on an Arista MLAG / Cisco VPC topology, the MPIO support inherently built into iSCSI favors simplicity over complexity.
In other words, as the admin has pointed out, there are multiple modes for bonding interfaces; LACP (802.3ad) is simply one of them. Whatever bonding method you use, make sure the switch and the server agree with each other, as in the check below.
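On the server side, the kernel's bonding status file shows what was actually negotiated with the switch. A sketch, assuming the backend bond is named bond1 (field names can vary slightly between kernel versions):
# 802.3ad negotiation details for the bond and each slave
cat /proc/net/bonding/bond1
# each slave's "details partner lacp pdu" section should show the switch's
# system MAC and port; all-zero partner details usually mean the switch port
# is not actually part of an LACP port-channel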