Nodes shutting down
admin
2,930 Posts
June 27, 2018, 6:05 pm
- Can you give more info on your setup: how many OSDs per node, how much RAM/cores per node? Are you using SSDs or HDDs? For HDDs, do you have a separate wal/db? Do you use a write-back cache controller?
- In the Maintenance tab, can you disable fencing and scrub/deep-scrub for a couple of days and see if it has an effect? If it fixes the issue, can you re-enable fencing and keep scrub disabled?
- Can you view your historical stats (%CPU, RAM, disk busy): do you see any spikes reaching above 90%? Do you have any client load at the time of the problems?
khopkins
96 Posts
June 27, 2018, 6:55 pm
Hello, the setup is three Dell R410s with two Ethernet ports per server and 16 GB of memory each. Disks are arranged as a single 2 TB drive and two 2 TB drives for the second drive. The hard drives are 7200 RPM 2 TB SATA. The PERC 6 is set up with drive 0 as a single drive and drives 2 and 3 as RAID 0; I don't see anything on cache (no battery on it). Stats are all good as there is nothing using it, and the history may have reached 1% in the last 7 days. Only 1 OSD per node. Turned off fencing and scrub to see what happens. All Ethernet goes to a single switch. Should I rebuild, taking out the RAID?
khopkins
96 Posts
June 27, 2018, 7:59 pm
Just noticed that running "ceph status" produces errors.
ceph status
2018-06-27 14:53:35.018233 7feabd2dc700 -1 Errors while parsing config file!
2018-06-27 14:53:35.018238 7feabd2dc700 -1 parse_file: cannot open /etc/ceph/ceph.conf: (2) No such file or directory
2018-06-27 14:53:35.018239 7feabd2dc700 -1 parse_file: cannot open ~/.ceph/ceph.conf: (2) No such file or directory
2018-06-27 14:53:35.018240 7feabd2dc700 -1 parse_file: cannot open ceph.conf: (2) No such file or directory
Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
There is no file called "ceph.conf" in the directory; it only has "XenStorage.client.admin.keyring" and "XenStorage.conf". This is the case on all nodes.
admin
2,930 Posts
June 27, 2018, 9:36 pm
Your cluster name is XenStorage, so many ceph commands need the extra parameter --cluster XenStorage. So for status:
ceph status --cluster XenStorage
I would test the fencing and scrub/deep-scrub first. What we think is happening is that, for some reason (hardware/scrub/load), a node is not able to communicate with the rest of the cluster (in your case the other 2 nodes) and is not able to respond to heartbeats, so it clears all its resources and gets killed via fencing.
After this you can try to break up the RAID into single-disk RAID 0 volumes. Do you see any kernel messages in dmesg that could point to hardware/driver issues? Do you have any BIOS settings that cause power save to switch off hardware?
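The same flag applies to other ceph commands as well; a couple of illustrative examples (just examples, assuming the config lives in the default /etc/ceph directory mentioned above):
ceph health detail --cluster XenStorage
ceph osd tree --cluster XenStorage
Alternatively you can point the tool at the config file directly:
ceph -c /etc/ceph/XenStorage.conf status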
Last edited on June 27, 2018, 9:37 pm by admin · #14
khopkins
96 Posts
June 28, 2018, 6:40 pm
Think we might have found something. In the syslog before the shutdown, this appeared:
Jun 26 9:39:40 PM PS-Node-2 kernel: [734541.229868] bnx2 0000:01:00.0 eth0: NIC Copper Link is Down
Jun 26 9:39:40 PM PS-Node-2 kernel: [734541.229941] bnx2 0000:01:00.1 eth1: NIC Copper Link is Down
Jun 26 9:38:36 PM PS-Node-2 snmpd[925]: message repeated 8 times: [ Connection from UDP: [172.16.14.58]:64952->[172.16.14.31]:161]
Jun 26 9:39:42 PM PS-Node-2 consul[1229]: memberlist: Suspect PS-Node-1 has failed, no acks received
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: Deleting interface #19 eth0, 172.16.14.31#123, interface stats: received=2734, sent=2742, dropped=0, active_time=130435 secs
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: Deleting interface #19 eth0, 172.16.14.31#123, interface stats: received=2734, sent=2742, dropped=0, active_time=130435 secs
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: 172.16.14.30 local addr 172.16.14.31 -> <null>
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: Deleting interface #20 eth0, 10.0.4.31#123, interface stats: received=0, sent=0, dropped=0, active_time=130435 secs
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: Deleting interface #21 eth1, 10.0.5.31#123, interface stats: received=0, sent=0, dropped=0, active_time=130424 secs
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: Deleting interface #22 eth1, fe80::7a2b:cbff:fe26:2fd2%3#123, interface stats: received=0, sent=0, dropped=0, active_time=130424 secs
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: Deleting interface #23 eth0, 172.16.14.27#123, interface stats: received=0, sent=0, dropped=0, active_time=130396 secs
Jun 26 9:39:42 PM PS-Node-2 ntpd[1219]: 26 Jun 21:39:42 ntpd[1219]: Deleting interface #24 eth0, 172.16.14.25#123, interface stats: received=0, sent=0, dropped=0, active_time=130378 secs
So the interface was shutting down, likely causing the back-end network to fail as you suggested. Started looking into why, and found some interesting topics on the Broadcom NIC losing connection. Came across a few articles and found out the driver for this chip did have a bug with Linux; the fix was to disable MSI. How would you do this on your system? Want to make sure before making any changes.
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5716 Gigabit Ethernet (rev 20)
DeviceName: Embedded NIC 2
Subsystem: Dell PowerEdge R410 BCM5716 Gigabit Ethernet
Flags: bus master, fast devsel, latency 0, IRQ 29
Memory at dc000000 (64-bit, non-prefetchable) [size=32M]
Capabilities: [48] Power Management version 3
Capabilities: [50] Vital Product Data
Capabilities: [58] MSI: Enable- Count=1/16 Maskable- 64bit+
Capabilities: [a0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [ac] Express Endpoint, MSI 00
Capabilities: [100] Device Serial Number d4-ae-52-ff-fe-9d-7f-7e
Capabilities: [110] Advanced Error Reporting
Capabilities: [150] Power Budgeting <?>
Capabilities: [160] Virtual Channel
Kernel driver in use: bnx2
Kernel modules: bnx2
cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
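As a side check, here is how one could confirm which interrupt mode the NICs are actually using at runtime (generic Linux commands, nothing specific to this setup; the eth0-0 style vector names are only the typical pattern):
# MSI-X vectors usually show up as per-queue entries such as eth0-0, eth1-0, ...
grep -i eth /proc/interrupts
# lspci shows whether MSI / MSI-X is currently enabled on the device
lspci -vv -s 01:00.1 | grep -i msi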
admin
2,930 Posts
June 28, 2018, 8:48 pm
To disable MSI while loading the bnx2 driver, create a file:
nano /etc/modprobe.d/bnx2.conf
with the line:
options bnx2 disable_msi=1
then reboot.
What is the output of:
ethtool -i ethX
Do you see any other kernel errors in:
dmesg | grep bnx2
dmesg | grep firmware
The "NIC Copper Link is Down" could also be caused by other factors: cables, switch
Last edited on June 28, 2018, 8:52 pm by admin · #16
khopkins
96 Posts
June 29, 2018, 2:15 pm
ethtool -i eth0
driver: bnx2
version: 2.2.6
firmware-version: 6.2.15 bc 5.2.3 NCSI 2.0.11
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
dmesg | grep bnx2
[ 1.203530] bnx2: QLogic bnx2 Gigabit Ethernet Driver v2.2.6 (January 29, 2014)
[ 1.204312] bnx2 0000:01:00.0 eth0: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found at mem da000000, IRQ 28, node addr d4:ae:52:9d:7f:7d
[ 1.205056] bnx2 0000:01:00.1 eth1: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found at mem dc000000, IRQ 29, node addr d4:ae:52:9d:7f:7e
[ 35.488525] bnx2 0000:01:00.0 eth0: using MSIX
[ 37.839123] bnx2 0000:01:00.0 eth0: NIC Copper Link is Up, 1000 Mbps full duplex
[ 37.839125] bnx2:
[ 39.884288] bnx2 0000:01:00.1 eth1: using MSIX
[ 42.465421] bnx2 0000:01:00.1 eth1: NIC Copper Link is Up, 1000 Mbps full duplex
[ 42.465424] bnx2:
dmesg | grep firmware
[ 0.431078] acpi PNP0A08:00: PCIe AER handled by firmware
[ 1.059801] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
modinfo bnx2
filename: /lib/modules/4.4.92-09-petasan/kernel/drivers/net/ethernet/broadcom/bnx2.ko
firmware: bnx2/bnx2-rv2p-09ax-6.0.17.fw
firmware: bnx2/bnx2-rv2p-09-6.0.17.fw
firmware: bnx2/bnx2-mips-09-6.2.1b.fw
firmware: bnx2/bnx2-rv2p-06-6.0.15.fw
firmware: bnx2/bnx2-mips-06-6.2.3.fw
version: 2.2.6
license: GPL
description: QLogic BCM5706/5708/5709/5716 Driver
author: Michael Chan <mchan@broadcom.com>
srcversion: DAEFDB682746C4E3AE27475
alias: pci:v000014E4d0000163Csv*sd*bc*sc*i*
alias: pci:v000014E4d0000163Bsv*sd*bc*sc*i*
alias: pci:v000014E4d0000163Asv*sd*bc*sc*i*
alias: pci:v000014E4d00001639sv*sd*bc*sc*i*
alias: pci:v000014E4d000016ACsv*sd*bc*sc*i*
alias: pci:v000014E4d000016AAsv*sd*bc*sc*i*
alias: pci:v000014E4d000016AAsv0000103Csd00003102bc*sc*i*
alias: pci:v000014E4d0000164Csv*sd*bc*sc*i*
alias: pci:v000014E4d0000164Asv*sd*bc*sc*i*
alias: pci:v000014E4d0000164Asv0000103Csd00003106bc*sc*i*
alias: pci:v000014E4d0000164Asv0000103Csd00003101bc*sc*i*
depends:
intree: Y
vermagic: 4.4.92-09-petasan SMP mod_unload modversions
parm: disable_msi:Disable Message Signaled Interrupt (MSI) (int)
All cables and the switch have been replaced. Everything has been normalized, and we will see how it does over the weekend. Will let you know how it goes, and I appreciate the help with this; we really want to use this for our systems, thanks.
khopkins
96 Posts
July 11, 2018, 1:55 pm
Well, I have reloaded the stack and put in the options for the bnx2 driver, and it crashed one of the servers again. Turned off fencing and ran it a while and it's staying up, so the driver glitch is triggering the shutdown. Below is the kern log concerning bnx2.
dmesg | grep bnx2
[ 1.303861] bnx2: QLogic bnx2 Gigabit Ethernet Driver v2.2.6 (January 29, 2014)
[ 1.304627] bnx2 0000:01:00.0 eth0: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found at mem da000000, IRQ 28, node addr d4:ae:52:9d:7f:7d
[ 1.305295] bnx2 0000:01:00.1 eth1: Broadcom NetXtreme II BCM5716 1000Base-T (C0) PCI Express found at mem dc000000, IRQ 29, node addr d4:ae:52:9d:7f:7e
[ 28.244185] bnx2 0000:01:00.0 eth0: using MSIX
[ 30.594935] bnx2 0000:01:00.0 eth0: NIC Copper Link is Up, 1000 Mbps full duplex
[ 30.594937] bnx2:
[ 32.972159] bnx2 0000:01:00.1 eth1: using MSIX
[ 35.344049] bnx2 0000:01:00.1 eth1: NIC Copper Link is Up, 1000 Mbps full duplex
[ 35.344052] bnx2:
[109192.923660] bnx2 0000:01:00.0 eth0: NIC Copper Link is Down
[109195.604253] bnx2 0000:01:00.1 eth1: NIC Copper Link is Down
[109196.239101] bnx2 0000:01:00.0 eth0: NIC Copper Link is Up, 1000 Mbps full duplex
[109196.239104] bnx2: , receive bnx2: & transmit bnx2: flow control ONbnx2:
[109198.723257] bnx2 0000:01:00.1 eth1: NIC Copper Link is Up, 1000 Mbps full duplex
[109198.723261] bnx2: , receive bnx2: & transmit bnx2: flow control ONbnx2:
The NICs don't seem to be down for long (only a few seconds, going by the timestamps), so what is the default fencing timeout, and can it be extended a little longer?
admin
2,930 Posts
July 11, 2018, 3:32 pm
Can you double check that the driver read the parameter correctly:
cat /sys/module/bnx2/parameters/disable_msi
khopkins
96 Posts
July 11, 2018, 3:48 pm
Looks like you might have hit on it. There is a file in "modprobe.d" called "bnx2.conf" with "options bnx2 disable_msi=1", but if cat /sys/module/bnx2/parameters/disable_msi is run, the return is "0", if I'm reading it right. The nodes were rebooted after the config was made.
root@PS-NODE-1:/etc/modprobe.d# ls
blacklist-ath_pci.conf blacklist-framebuffer.conf blacklist-watchdog.conf bnx2.conf iwlwifi.conf
blacklist-firewire.conf blacklist-rare-network.conf blacklist.conf fbdev-blacklist.conf mlx4.conf
root@PS-NODE-1:/etc/modprobe.d# cat bnx2.conf
options bnx2 disable_msi=1
cat /sys/module/bnx2/parameters/disable_msi
0
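One possible cause to check (only an assumption, not confirmed here): if the bnx2 module is loaded from the initramfs at early boot, the modprobe.d option may not be applied until the initramfs is regenerated. A quick check and rebuild:
# see whether bnx2 is packed into the current initramfs
lsinitramfs /boot/initrd.img-$(uname -r) | grep bnx2
# if so, rebuild it so it picks up /etc/modprobe.d/bnx2.conf, then reboot
update-initramfs -u
reboot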