
OSDs stay up, can't fail them

Doing some testing to understand how things work on disk failures and hit my first snag.

If I physically remove one OSD's disk from a node, the cluster never sees that OSD go offline. All OSDs stay "up" and online, and ceph status reports HEALTH_OK.

In dmesg, there's a bug trace logged:

[75107.115834] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70

[75107.115836] Leftover inexact backtrace:

[75107.115838] [<ffffffff81099fd0>] ? kthread_park+0x50/0x50
[75107.115839] Code: c6 55 75 a4 81 e8 a9 74 02 00 48 89 df 5b e9 e0 30 ec ff 48 85 ff 74 16 53 f6 47 3c 01 48 89 fb 74 14 48 8d 7b 38 f0 83 6b 38 01 <74> 03 5b f3 c3 5b e9 bd 00 00 00 48 8b 0f 49 89 f8 48 c7 c2 00
[75135.115245] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/u80:1:389893]
[75135.115749] Modules linked in: af_packet(N) fuse(N) target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) xfs(N) libcrc32c(N) bonding(N) intel_rapl(N) sb_edac(N) edac_core(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) ipmi_ssif(N) lpc_ich(N) ipmi_devintf(N) aesni_intel(N) aes_x86_64(N) lrw(N) gf128mul(N) glue_helper(N) joydev(N) ablk_helper(N) cryptd(N) sg(N) mfd_core(N) mei_me(N) mei(N) ioatdma(N) ipmi_si(N) ipmi_msghandler(N) shpchp(N) processor(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) ses(N) sd_mod(N) enclosure(N)
[75135.115783] crc32c_intel(N) mgag200(N) ttm(N) drm_kms_helper(N) syscopyarea(N) ixgbe(N) sysfillrect(N) sysimgblt(N) vxlan(N) fb_sys_fops(N) ip6_udp_tunnel(N) ehci_pci(N) udp_tunnel(N) ehci_hcd(N) mdio(N) ahci(N) isci(N) igb(N) dca(N) libsas(N) libahci(N) mpt3sas(N) ptp(N) raid_class(N) pps_core(N) usbcore(N) drm(N) i2c_algo_bit(N) libata(N) usb_common(N) scsi_transport_sas(N) scsi_mod(N) wmi(N) fjes(N) button(N)
[75135.115800] Supported: No, Unsupported modules are loaded
[75135.115803] CPU: 2 PID: 389893 Comm: kworker/u80:1 Tainted: G W L N 4.4.92-09-petasan #1
[75135.115804] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[75135.115813] Workqueue: fw_event_mpt2sas0 _firmware_event_work [mpt3sas]
[75135.115814] task: ffff8820242841c0 ti: ffff88202703c000 task.ti: ffff88202703c000
[75135.115815] RIP: 0010:[<ffffffff81608be1>] [<ffffffff81608be1>] _raw_spin_unlock_irqrestore+0x11/0x20
[75135.115820] RSP: 0018:ffff88202703fbc8 EFLAGS: 00000206
[75135.115821] RAX: ffff881039ee9180 RBX: ffff881039f02c00 RCX: 0000000000000001
[75135.115822] RDX: 0000000000000004 RSI: 0000000000000206 RDI: 0000000000000206
[75135.115823] RBP: ffff88103b326000 R08: 0000000000000101 R09: 0000000000000001
[75135.115824] R10: 0000000000000aa8 R11: 000000000000d76d R12: ffff88103b326010
[75135.115824] R13: ffff88103bcccc00 R14: ffff88103b326000 R15: ffff881039ee9000
[75135.115826] FS: 0000000000000000(0000) GS:ffff88103f880000(0000) knlGS:0000000000000000
[75135.115827] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75135.115827] CR2: 00007feae6870720 CR3: 0000000001e0b000 CR4: 00000000001406e0
[75135.115828] Stack:
[75135.115829] ffffffffa003d598 0000000000000206 ffff881039ee9180 ffff88103bcccc00
[75135.115831] ffff88103b3267f0 ffff881039ed7800 ffff881039ee6a88 ffff881039ee6a88
[75135.115832] 000000000000000c ffffffffa001701a ffff88103bcccc00 ffffffffa0018c8a
[75135.115834] Call Trace:
[75135.115847] [<ffffffffa003d598>] scsi_remove_target+0x168/0x1f0 [scsi_mod]
[75135.115853] [<ffffffffa001701a>] sas_rphy_remove+0x5a/0x70 [scsi_transport_sas]
[75135.115857] [<ffffffffa0018c8a>] sas_port_delete+0x2a/0x160 [scsi_transport_sas]
[75135.115862] [<ffffffffa0346d41>] mpt3sas_transport_port_remove+0x1b1/0x1d0 [mpt3sas]
[75135.115868] [<ffffffffa03392f9>] _scsih_remove_device+0x1f9/0x300 [mpt3sas]
[75135.115873] [<ffffffffa033ae47>] _scsih_device_remove_by_handle.part.27+0x67/0xb0 [mpt3sas]
[75135.115877] [<ffffffffa0341775>] _firmware_event_work+0x1595/0x1cf0 [mpt3sas]
[75135.115881] [<ffffffff810940c1>] process_one_work+0x161/0x4a0
[75135.115884] [<ffffffff8109444a>] worker_thread+0x4a/0x4c0
[75135.115887] [<ffffffff8109a097>] kthread+0xc7/0xe0
[75135.115889] [<ffffffff816094bf>] ret_from_fork+0x3f/0x70
[75135.117193] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70

[75135.117195] Leftover inexact backtrace:

[75135.117197] [<ffffffff81099fd0>] ? kthread_park+0x50/0x50
[75135.117198] Code: c3 48 89 df c6 07 00 0f 1f 40 00 fb 66 0f 1f 44 00 00 eb d6 31 c0 eb da 90 90 0f 1f 44 00 00 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 c6

There is no new info logged to PetaSAN.log.

Any suggestions on what I should be looking for? My nodes use LSI 9211-8i HBAs flashed to IT mode. Is this not ideal?

Thanks!

Hi there,

We do similar tests (failing hosts, NICs, disks), so this should be handled. In this case, Ceph should report the OSD as down within about 25 seconds.

How many OSDs do you have? Do they all show as up? Do you have maintenance mode running? Is the OSD daemon for the failed disk running?

systemctl status ceph-osd@OSD_ID --cluster CLUSTER_NAME
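As a side note, the ~25 second figure is tied to the OSD heartbeat settings (grace period plus heartbeat interval). If you want to inspect the values in effect, something along these lines should work via the admin socket, run on the node hosting that OSD, with OSD_ID and CLUSTER_NAME as placeholders:

ceph --cluster CLUSTER_NAME daemon osd.OSD_ID config show | grep -E 'osd_heartbeat_grace|osd_heartbeat_interval'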

The cluster has been sitting inactive for about 2 hours with the drive removed, and all 16 OSDs are still showing as up/online and healthy, so it seems the daemon was never reported as down. I also can't load the node disk list in the PetaSAN web UI on any of the nodes; it just keeps spinning.

How would I go about identifying which OSD corresponds to each disk from the CLI? Then I can send the status output from the command above. Since they're all showing as "online" in the web UI and in ceph status, I'm not sure how to tell which one it is.

Maintenance mode is off.

EDIT: it seems that trying to load the node disk lists from the web UI made SOMETHING happen, but I don't know what. Now this is what Ceph is reporting. Not sure why, as I pulled only one disk...

ceph> status
  cluster:
    id:     23546039-9602-49b8-a106-ba8487035584
    health: HEALTH_WARN
            4 osds down
            1 host (4 osds) down
            4/410 objects unfound (0.976%)
            Reduced data availability: 14 pgs inactive, 14 pgs down, 48 pgs stale
            Degraded data redundancy: 250/820 objects degraded (30.488%), 626 pgs unclean, 612 pgs degraded, 612 pgs undersized
            8 slow requests are blocked > 32 sec

  services:
    mon: 3 daemons, quorum bd-ceph-sd1,bd-ceph-sd2,bd-ceph-sd3
    mgr: bd-ceph-sd1(active), standbys: bd-ceph-sd2
    osd: 16 osds: 9 up, 13 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 410 objects, 1579 MB
    usage:   274 GB used, 36312 GB / 36587 GB avail
    pgs:     1.367% pgs not active
             250/820 objects degraded (30.488%)
             4/410 objects unfound (0.976%)
             561 active+undersized+degraded
             398 active+clean
             48  stale+active+undersized+degraded
             14  down
             3   active+recovery_wait+undersized+degraded

ceph> osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 43.97583 root default
-5 10.99396 host bd-ceph-sd1
4 hdd 2.74849 osd.4 up 1.00000 1.00000
5 hdd 2.74849 osd.5 up 1.00000 1.00000
6 hdd 2.74849 osd.6 up 1.00000 1.00000
7 hdd 2.74849 osd.7 up 1.00000 1.00000
-7 10.99396 host bd-ceph-sd2
8 hdd 2.74849 osd.8 up 1.00000 1.00000
9 hdd 2.74849 osd.9 up 1.00000 1.00000
10 hdd 2.74849 osd.10 up 1.00000 1.00000
11 hdd 2.74849 osd.11 up 1.00000 1.00000
-3 10.99396 host bd-ceph-sd3
0 hdd 2.74849 osd.0 down 1.00000 1.00000
1 hdd 2.74849 osd.1 down 1.00000 1.00000
2 hdd 2.74849 osd.2 down 1.00000 1.00000
3 hdd 2.74849 osd.3 down 1.00000 1.00000
-9 10.99396 host bd-ceph-sd4
12 hdd 2.74849 osd.12 down 0 1.00000
13 hdd 2.74849 osd.13 down 0 1.00000
14 hdd 2.74849 osd.14 up 1.00000 1.00000
15 hdd 2.74849 osd.15 down 0 1.00000

If the SAS backplane is mapping disks by ID correctly, then I believe it was osd.0 that I pulled from node 3.

Tried to run systemctl status ceph-osd@OSD_0 --cluster BD-Ceph-Cl1, but it says --cluster is an invalid option. When running just systemctl status ceph-osd@OSD_0, this is the output:

root@bd-ceph-sd1:~# systemctl status ceph-osd@OSD_0
● ceph-osd@OSD_0.service - Ceph object storage daemon osd.OSD_0
   Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
  Drop-In: /etc/systemd/system/ceph-osd@.service.d
           └─override.conf
   Active: inactive (dead)

This seems to be the same output for any OSD I try to run that command for, regardless of whether it is online or not.

It feels like this is related to the original dmesg bug trace I posted. I rebooted both nodes 3 and 4. Node 4's OSDs came back and synced up, and all but the pulled disk came back for node 3. I removed the OSD from PetaSAN, re-added a new disk, and waited for Ceph to do its thing; however, I have 4 PGs that remain stale+active+undersized+degraded.

I decided to start over yet again from scratch and re-test this whole scenario, as the hands that pulled the disk couldn't guarantee me that they only pulled the one disk originally. Ugh.

Assuming the hands pulled a disk in both nodes, it seems like when a disk is removed, the whole HBA (LSI 9211-8i IT) freaks out and the OSD daemon never gets a chance to be reported down to the monitors. Will confirm this after rebuilding again.

"This seems to be the same output for any OSD I try to run that command for, regardless of whether it is online or not."

If you mean that systemctl status ceph-osd@OSD_ID (without the --cluster, as you mentioned) shows the service as inactive (dead) for all OSDs on all nodes(!), then we likely have something unrelated to the disk you pulled on node 3. It may be related to the earlier Ceph error of having most of the PGs down.

On bd-ceph-sd1, try to start OSD 4:

systemctl start ceph-osd@4
systemctl status ceph-osd@4

If it does start successfully, do this for all OSDs on all nodes.
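For example, a small loop on bd-ceph-sd1 (assuming its OSD IDs are 4 through 7, as in your earlier osd tree) would be:

for id in 4 5 6 7; do systemctl start ceph-osd@$id; systemctl status --no-pager ceph-osd@$id; done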

If it does not start, try running it manually:

/usr/lib/ceph/ceph-osd-prestart.sh --cluster CLUSTER_NAME --id OSD_ID
/usr/bin/ceph-osd -f --cluster CLUSTER_NAME --id OSD_ID --setuser ceph --setgroup ceph

Do you get any console errors? Do you have any errors in the log?

cat /var/log/ceph/CLUSTER_NAME-osd.OSD_ID.log

If you run the following commands on different nodes, do you get the same output?

ceph status --cluster CLUSTER_NAME
rbd ls --cluster CLUSTER_NAME

Lastly, to answer your question on mapping OSDs to disks via the CLI:

ceph-disk list
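Another cross-check that may help is to ask the monitors for an OSD's recorded metadata, which includes the hostname and backing device information, for example:

ceph osd metadata OSD_ID --cluster CLUSTER_NAME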

Can you double-check that all nodes can ping each other on the backend network? It is not normal for Ceph to fail on all nodes in this way.
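For example, from each node (BACKEND_IP_1, BACKEND_IP_2, ... being placeholders for the other nodes' backend addresses):

for ip in BACKEND_IP_1 BACKEND_IP_2 BACKEND_IP_3; do ping -c 3 $ip; done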

I centuple checked networking is not an issue here. Free comms between all IPs on all nodes, and no conflicts.

Using systemctl status (or start/stop) ceph-osd@[number] works on the new install. It must have been incorrect syntax I was using before.

I proceeded to re-test pulling a disk from node 4, which is storage only (ONLY one disk for sure this time!).

Again, the Ceph monitors DO NOT see the OSD as down. From any of the nodes, they all still show 16 OSDs online. Again, in dmesg on the node the disk was pulled from, I see the bug trace originally posted.

Running ceph status on any of the 4 nodes shows everything still online:

root@bd-ceph-sd4:~# ceph status --cluster BD-Ceph-Cl1
  cluster:
    id:     07997f7b-910e-4f3f-985f-594b0d837a2a
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum bd-ceph-sd1,bd-ceph-sd2,bd-ceph-sd3
    mgr: bd-ceph-sd1(active), standbys: bd-ceph-sd2
    osd: 16 osds: 16 up, 16 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 2 objects, 19 bytes
    usage:   336 GB used, 44694 GB / 45030 GB avail
    pgs:     1024 active+clean

Running ceph-disk list on the node that the disk was removed from just hangs. It never produces output; I killed the command after waiting ~20 minutes. The command took only a few seconds to run on the other nodes.

Going to hold off on doing anything further or making any other changes until I hear back from you.

There are two issues: one is the kernel crash when you pull a disk on node 4; the other is the state of the Ceph cluster when this happens.

On any of the cluster nodes, if you run

ceph status --cluster CLUSTER_NAME
rbd ls --cluster CLUSTER_NAME

Is it responsive?

If you run systemctl status ceph-osd@OSD_ID for the node 4 OSDs, are they also running?
If you start writing IO (you can use the benchmark page to do this), does it run successfully? Does the cluster status still show all OSDs as up after the test, or does it detect the failed OSDs?
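If the benchmark page is not convenient, a short rados bench run against your pool should also generate enough IO for this test (POOL_NAME and CLUSTER_NAME being placeholders):

rados bench -p POOL_NAME 30 write --cluster CLUSTER_NAME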

The kernel crash when you pull a disk is probably more difficult to solve. It could be a firmware issue, or it could be a bug in the kernel itself.
On a good node, can you run

lspci -v
dmesg | grep firmware
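It may also be worth capturing the controller firmware and driver versions that the mpt3sas driver reports, for example:

dmesg | grep -i mpt3sas
modinfo mpt3sas | grep -i version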

Would it be possible to perform an install of SUSE Linux Enterprise 12 SP3
https://www.suse.com/download-linux/
on the same hardware and perform the disk pull test to see if you hit the same issue? If it works, can you also run the same two commands above? If it fails, it may be an upstream kernel issue, which will take time to address.

There is a relatively recent patch in the mainline kernel that we suspect may be related:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/drivers/scsi/scsi_sysfs.c?h=v4.15.16&id=81b6c999897919d5a16fedc018fe375dbab091c5

We will apply it and send you an updated kernel. We will also add some log tracing around the crash area, so that if it is not solved we will have more info.

In the meantime, if you can get back to us on the other issues, that would be great.

Sorry for the delay. The cluster sat all night in the state I left it after removing the one disk. When I came back to it, all 4 OSDs on node 4 were down and out in the dashboard, so evidently pulling one disk is causing all disks on the node to drop.

Since all four on that node are now reporting down, I can't tell you whether the OSD status was up or down at the time.

Currently, both ceph status and rbd ls are responsive, as the cluster has caught up with itself and noticed the OSDs offline.

I'm confident this is not a Ceph or PetaSAN issue and is tied to the kernel panic. When I pull the disk, I can't even soft-reboot the node; the CPUs hang and it has to be hard-reset.

Here's the HBA entry from lspci -v:

03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
Subsystem: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
Flags: bus master, fast devsel, latency 0, IRQ 30
I/O ports at 8000 [size=256]
Memory at df600000 (64-bit, non-prefetchable) [size=16K]
Memory at df580000 (64-bit, non-prefetchable) [size=256K]
Expansion ROM at df100000 [disabled] [size=512K]
Capabilities: [50] Power Management version 3
Capabilities: [68] Express Endpoint, MSI 00
Capabilities: [d0] Vital Product Data
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [c0] MSI-X: Enable+ Count=15 Masked-
Capabilities: [100] Advanced Error Reporting
Capabilities: [138] Power Budgeting <?>
Capabilities: [150] Single Root I/O Virtualization (SR-IOV)
Capabilities: [190] Alternative Routing-ID Interpretation (ARI)
Kernel driver in use: mpt3sas
Kernel modules: mpt3sas

and dmesg | grep firmware after a disk pull:

root@bd-ceph-sd4:~# dmesg | grep firmware
[ 1.656114] GHES: APEI firmware first mode is enabled by WHEA _OSC.
[ 1.821745] isci 0000:05:00.0: [0]: invalid oem parameters detected, falling back to firmware
[ 1.821764] isci 0000:05:00.0: OEM SAS parameters (version: 1.3) loaded (firmware)
[ 906.217403] workqueue: WQ_MEM_RECLAIM fw_event_mpt2sas0:_firmware_event_work [mpt3sas] is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
[ 906.217503] Workqueue: fw_event_mpt2sas0 _firmware_event_work [mpt3sas]
[ 906.217630] [<ffffffffa034f775>] _firmware_event_work+0x1595/0x1cf0 [mpt3sas]
[ 932.567756] Workqueue: fw_event_mpt2sas0 _firmware_event_work [mpt3sas]
[ 932.567830] [<ffffffffa034f775>] _firmware_event_work+0x1595/0x1cf0 [mpt3sas]

I'll give pulling a disk a try on SUSE and Ubuntu 16.04.4, as I believe I have a few of these HBAs in other Ubuntu nodes without this issue. Will report back once tested.

Also happy to try anything you send me 🙂

This issue is not present with Ubuntu 16.04.4. Trying SUSE next.

EDIT: also not an issue on SUSE 12 SP3.

Have you ever tested updating PetaSAN with apt-get? Wondering if that would be the easiest way out here.
