Nodes shutting down
admin
2,930 Posts
June 27, 2018, 6:05 pm
- Can you give more info on your setup: how many OSDs per node, how much RAM/cores per node? Are you using SSDs or HDDs? For HDDs, do you have a separate wal/db? Do you use a write-back cache controller?
- In the Maintenance tab, can you disable fencing and scrub/deep-scrub for a couple of days and see if it has an effect? If it fixes the issue, can you re-enable fencing and keep scrub disabled?
- Can you view your historical stats (%CPU, RAM, disk busy): do you see any spikes reaching above 90%? Do you have any client load at the time of the problems?
khopkins
96 Posts
June 27, 2018, 6:55 pm
Hello, the setup is three Dell R410s with two Ethernet ports per server and 16 GB of memory each. Disks are arranged as a single 2 TB drive and two 2 TB drives for the second drive. The hard drives are 7200 RPM 2 TB SATA. The PERC 6 is set up with drive 0 as a single drive and drives 2 and 3 as RAID 0; I don't see anything on cache (no battery on it). Stats are all good as there is nothing using it, and the history may have reached 1% in the last 7 days. Only 1 OSD per node. Turned off fencing and scrub to see what happens. All Ethernet goes to a single switch. Should I rebuild, taking out the RAID?
khopkins
96 Posts
June 27, 2018, 7:59 pm
Just noticed that running "ceph status" produces errors.
ceph status
2018-06-27 14:53:35.018233 7feabd2dc700 -1 Errors while parsing config file!
2018-06-27 14:53:35.018238 7feabd2dc700 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2018-06-27 14:53:35.018239 7feabd2dc700 -1 parse_file: cannot open ~/.ceph/ceph.conf: (2) No such file or directory
2018-06-27 14:53:35.018240 7feabd2dc700 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
There is no file called "ceph.conf" in the directory; it only has "XenStorage.client.admin.keyring" and "XenStorage.conf". This is the case on all nodes.
admin
2,930 Posts
June 27, 2018, 9:36 pm
Your cluster name is XenStorage, so many ceph commands need the extra parameter --cluster XenStorage. So for status:
ceph status --cluster XenStorage
I would test the fencing and scrub/deep-scrub first. What we think is happening is that, for some reason (hardware/scrub/load), a node is not able to communicate with the rest of the cluster (in your case the other 2 nodes) and is not able to respond to heartbeats, so it clears all its resources and gets killed via fencing.
After this you can try to break up the RAID into single-disk RAID 0 volumes. Do you see any kernel messages in dmesg that could point to hardware/driver issues? Do you have any BIOS settings that cause power save to switch off hardware?
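The same flag applies to other ceph commands as well; a couple of illustrative examples (just examples, assuming the config lives in the default /etc/ceph directory mentioned above):
ceph health detail --cluster XenStorage
ceph osd tree --cluster XenStorage
Alternatively you can point the tool at the config file directly:
ceph -c /etc/ceph/XenStorage.conf status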
Last edited on June 27, 2018, 9:37 pm by admin · #14
khopkins
96 Posts
June 28, 2018, 6:40 pm
Think we might have found something. In the syslog before the shutdown, this appeared:
Jun 26 9:39:40 PM PS-Node-2 kernel: [734541.229868] bnx2 0000:01:00.0 eth0: NIC Copper Link is Down
Jun 26 9:39:40 PM PS-Node-2 kernel: [734541.229941] bnx2 0000:01:00.1 eth1: NIC Copper Link is Down
Jun 26 9:38:36 PM PS-Node-2 snmpd[925]: message repeated 8 times: [ Connection from UDP: [172.16.14.58]:64952->[172.16.14.31]:161]
Jun 26 9:39:42 PM PS-Node-2 consul[1229]: memberlist: Suspect PS-Node-1 has failed, no acks received
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: Deleting interface #19 eth0, 172.16.14.31#123, interface stats: received=2734, sent=2742, dropped=0, active_time=130435 secs
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: Deleting interface #19 eth0, 172.16.14.31#123, interface stats: received=2734, sent=2742, dropped=0, active_time=130435 secs
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: 172.16.14.30 local addr 172.16.14.31 -> <null>
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: Deleting interface #20 eth0, 10.0.4.31#123, interface stats: received=0, sent=0, dropped=0, active_time=130435 secs
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: Deleting interface #21 eth1, 10.0.5.31#123, interface stats: received=0, sent=0, dropped=0, active_time=130424 secs
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: Deleting interface #22 eth1, fe80::7a2b:cbff:fe26:2fd2%3#123, interface stats: received=0, sent=0, dropped=0, active_time=130424 secs
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: Deleting interface #23 eth0, 172.16.14.27#123, interface stats: received=0, sent=0, dropped=0, active_time=130396 secs
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: Deleting interface #24 eth0, 172.16.14.25#123, interface stats: received=0, sent=0, dropped=0, active_time=130378 secs
So the interface was shutting down, likely causing the back-end network to fail as you suggested. Started looking into why, and found some interesting topics on the Broadcom NIC losing connection. Came across a few articles and found out the driver for this chip did have a bug with Linux; the fix was to disable MSI. How would you do this on your system? Want to make sure before making any changes.
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
DeviceName: Embedded NIC 2
Subsystem: Dell PowerEdge R410 BCM5716 Gigabit Ethernet
Flags: bus master, fast devsel, latency 0, IRQ 29
Memory at dc000000 (64-bit, non-prefetchable) [size=32M]
Capabilities: [48] Power Management version 3
Capabilities: [50] Vital Product Data
Capabilities: [58] MSI: Enable- Count=1/16 Maskable- 64bit+
Capabilities: [a0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [ac] Express Endpoint, MSI 00
Capabilities: [100] Device Serial Number d4-ae-52-ff-fe-9d-7f-7e
Capabilities: [110] Advanced Error Reporting
Capabilities: [150] Power Budgeting <?>
Capabilities: [160] Virtual Channel
Kernel driver in use: bnx2
Kernel modules: bnx2
cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
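As a side check, here is how one could confirm which interrupt mode the NICs are actually using at runtime (generic Linux commands, nothing specific to this setup; the eth0-0 style vector names are only the typical pattern):
# MSI-X vectors usually show up as per-queue entries such as eth0-0, eth1-0, ...
grep -i eth /proc/interrupts
# lspci shows whether MSI / MSI-X is currently enabled on the device
lspci -vv -s 01:00.1 | grep -i msi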
admin
2,930 Posts
June 28, 2018, 8:48 pm
To disable MSI while loading the bnx2 driver, create a file:
nano /etc/modprobe.d/bnx2.conf
with the line:
options bnx2 disable_msi=1
then reboot.
What is the output of:
ethtool -i ethX
Do you see any other kernel errors in:
dmesg | grep bnx2
dmesg | grep firmware
The "NIC Copper Link is Down" could also be caused by other factors: cables, switch
Last edited on June 28, 2018, 8:52 pm by admin · #16
khopkins
96 Posts
June 29, 2018, 2:15 pm
ethtool -i eth0
driver: bnx2
version: 2.2.6
firmware-version: 6.2.15 bc 5.2.3 NCSI 2.0.11
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
dmesg | grep bnx2
[ 1.203530] bnx2: QLogic bnx2 Gigabit Ethernet Driver v2.2.6 (January 29, 2014)
[ 1.204312] bnx2 0000:01:00.0 eth0: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found at mem da000000, IRQ 28, node addr d4:ae:52:9d:7f:7d
[ 1.205056] bnx2 0000:01:00.1 eth1: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found at mem dc000000, IRQ 29, node addr d4:ae:52:9d:7f:7e
[ 35.488525] bnx2 0000:01:00.0 eth0: using MSIX
[ 37.839123] bnx2 0000:01:00.0 eth0: NIC Copper Link is Up, 1000 Mbps full duplex
[ 37.839125] bnx2:
[ 39.884288] bnx2 0000:01:00.1 eth1: using MSIX
[ 42.465421] bnx2 0000:01:00.1 eth1: NIC Copper Link is Up, 1000 Mbps full duplex
[ 42.465424] bnx2:
dmesg | grep firmware
[ 0.431078] acpi PNP0A08:00: PCIe AER handled by firmware
[ 1.059801] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
modinfo bnx2
filename: /lib/modules/4.4.92-09-petasan/kernel/drivers/net/ethernet/broadcom/bnx2.ko
firmware: bnx2/bnx2-rv2p-09ax-6.0.17.fw
firmware: bnx2/bnx2-rv2p-09-6.0.17.fw
firmware: bnx2/bnx2-mips-09-6.2.1b.fw
firmware: bnx2/bnx2-rv2p-06-6.0.15.fw
firmware: bnx2/bnx2-mips-06-6.2.3.fw
version: 2.2.6
license: GPL
description: QLogic BCM5706/5708/5709/5716 Driver
author: Michael Chan <mchan@broadcom.com>
srcversion: DAEFDB682746C4E3AE27475
alias: pci:v000014E4d0000163Csv*sd*bc*sc*i*
alias: pci:v000014E4d0000163Bsv*sd*bc*sc*i*
alias: pci:v000014E4d0000163Asv*sd*bc*sc*i*
alias: pci:v000014E4d00001639sv*sd*bc*sc*i*
alias: pci:v000014E4d000016ACsv*sd*bc*sc*i*
alias: pci:v000014E4d000016AAsv*sd*bc*sc*i*
alias: pci:v000014E4d000016AAsv0000103Csd00003102bc*sc*i*
alias: pci:v000014E4d0000164Csv*sd*bc*sc*i*
alias: pci:v000014E4d0000164Asv*sd*bc*sc*i*
alias: pci:v000014E4d0000164Asv0000103Csd00003106bc*sc*i*
alias: pci:v000014E4d0000164Asv0000103Csd00003101bc*sc*i*
depends:
intree: Y
vermagic: 4.4.92-09-petasan SMP mod_unload modversions
parm: disable_msi:Disable Message Signaled Interrupt (MSI) (int)
All cables and the switch have been replaced. Everything has been normalized, and we will see how it does over the weekend. Will let you know how it goes, and I appreciate the help with this; we really want to use this for our systems, thanks.
khopkins
96 Posts
July 11, 2018, 1:55 pm
Well, I have reloaded the stack and put in the options for the bnx2 driver, and it crashed one of the servers again. Turned off fencing and ran it a while and it's staying up, so the driver glitch is triggering the shutdown. Below is the kern log concerning bnx2.
dmesg | grep bnx2
[ 1.303861] bnx2: QLogic bnx2 Gigabit Ethernet Driver v2.2.6 (January 29, 2014)
[ 1.304627] bnx2 0000:01:00.0 eth0: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found at mem da000000, IRQ 28, node addr d4:ae:52:9d:7f:7d
[ 1.305295] bnx2 0000:01:00.1 eth1: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found at mem dc000000, IRQ 29, node addr d4:ae:52:9d:7f:7e
[ 28.244185] bnx2 0000:01:00.0 eth0: using MSIX
[ 30.594935] bnx2 0000:01:00.0 eth0: NIC Copper Link is Up, 1000 Mbps full duplex
[ 30.594937] bnx2:
[ 32.972159] bnx2 0000:01:00.1 eth1: using MSIX
[ 35.344049] bnx2 0000:01:00.1 eth1: NIC Copper Link is Up, 1000 Mbps full duplex
[ 35.344052] bnx2:
[109192.923660] bnx2 0000:01:00.0 eth0: NIC Copper Link is Down
[109195.604253] bnx2 0000:01:00.1 eth1: NIC Copper Link is Down
[109196.239101] bnx2 0000:01:00.0 eth0: NIC Copper Link is Up, 1000 Mbps full duplex
[109196.239104] bnx2: , receive bnx2: & transmit bnx2: flow control ONbnx2:
[109198.723257] bnx2 0000:01:00.1 eth1: NIC Copper Link is Up, 1000 Mbps full duplex
[109198.723261] bnx2: , receive bnx2: & transmit bnx2: flow control ONbnx2:
The NICs don't seem to be down for long (only a few seconds, going by the timestamps), so what is the default fencing timeout, and can it be extended a little longer?
admin
2,930 Posts
July 11, 2018, 3:32 pm
Can you double check that the driver read the parameter correctly:
cat /sys/module/bnx2/parameters/disable_msi
khopkins
96 Posts
July 11, 2018, 3:48 pm
Looks like you might have hit on it. There is a file in "modprobe.d" called "bnx2.conf" with "options bnx2 disable_msi=1", but if cat /sys/module/bnx2/parameters/disable_msi is run, the return is "0", if I'm reading it right. The nodes were rebooted after the config was made.
root@PS-NODE-1:/etc/modprobe.d# ls
blacklist-ath_pci.conf blacklist-framebuffer.conf blacklist-watchdog.conf bnx2.conf iwlwifi.conf
blacklist-firewire.conf blacklist-rare-network.conf blacklist.conf fbdev-blacklist.conf mlx4.conf
root@PS-NODE-1:/etc/modprobe.d# cat bnx2.conf
options bnx2 disable_msi=1
cat /sys/module/bnx2/parameters/disable_msi
0
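One possible cause to check (only an assumption, not confirmed here): if the bnx2 module is loaded from the initramfs at early boot, the modprobe.d option may not be applied until the initramfs is regenerated. A quick check and rebuild:
# see whether bnx2 is packed into the current initramfs
lsinitramfs /boot/initrd.img-$(uname -r) | grep bnx2
# if so, rebuild it so it picks up /etc/modprobe.d/bnx2.conf, then reboot
update-initramfs -u
reboot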