OSDs stay up, can't fail them
admin
2,930 Posts
April 12, 2018, 4:51 pm
Regarding SUSE SLE12 SP3, what is the kernel version (4.4.x; use uname -a)?
Can you please download and try the following 2 files:
https://drive.google.com/drive/folders/1oUsFSEVMoCX7KqcdFl-fmC71B5d8ZXlC?usp=sharing
scsi_mod.ko
This has the kernel patch mentioned earlier, so it may fix the issue. If it does not, we have added a lot of log traces to it to help identify the problem further. Place it in:
/lib/modules/4.4.92-09-petasan/kernel/drivers/scsi/scsi_mod.ko
replacing the original module, then reboot.
If you still have the issue, please send us the dmesg output so we can look at the extra logs.
firmware.tar.gz
If the problem persists, please replace the firmware in /lib/firmware by deleting the existing folder in /lib and untarring the firmware.tar.gz file, then reboot.
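The module swap described above could be sketched as the following shell helper. It is only an illustration under stated assumptions: the function name swap_file and the .orig backup suffix are made up for this sketch, and the source path of the downloaded module is a placeholder. Only the destination path comes from the post.

```shell
#!/bin/sh
# Hypothetical helper (not part of PetaSAN): replace a file with a new copy,
# keeping the original next to it with a .orig suffix so it can be restored.
swap_file() {
    src="$1"   # downloaded replacement, e.g. ./scsi_mod.ko (placeholder path)
    dest="$2"  # file to replace
    [ -f "$src" ] || { echo "missing source: $src" >&2; return 1; }
    # Back up the original only once, so a second run does not overwrite it.
    if [ -f "$dest" ] && [ ! -f "$dest.orig" ]; then
        cp -p "$dest" "$dest.orig" || return 1
    fi
    cp "$src" "$dest"
}

# Intended use on the affected node (as root), followed by a reboot:
#   swap_file ./scsi_mod.ko /lib/modules/4.4.92-09-petasan/kernel/drivers/scsi/scsi_mod.ko
#   reboot
```

Keeping the backup makes it easy to copy the stock module back if the patched one does not help.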
protocol6v
85 Posts
April 12, 2018, 8:23 pm
Tried both of those independently and together; still have the issue. Here's the dmesg output with both in place:
[ 113.120150] ------------[ cut here ]------------
[ 113.120156] WARNING: CPU: 0 PID: 1954 at kernel/workqueue.c:2462 check_flush_dependency+0x112/0x130()
[ 113.120168] workqueue: WQ_MEM_RECLAIM fw_event_mpt2sas0:_firmware_event_work [mpt3sas] is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
[ 113.120168] Modules linked in: target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) sb_edac(N) edac_core(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel(N) aes_x86_64(N) ipmi_ssif(N) ipmi_devintf(N) lrw(N) gf128mul(N) glue_helper(N) mei_me(N) ablk_helper(N) joydev(N) lpc_ich(N) ipmi_si(N) cryptd(N) mei(N) sg(N) mfd_core(N) ioatdma(N) ipmi_msghandler(N) shpchp(N) processor(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) ses(N) enclosure(N) sd_mod(N) crc32c_intel(N)
[ 113.120207] mgag200(N) ttm(N) ixgbe(N) drm_kms_helper(N) vxlan(N) syscopyarea(N) ip6_udp_tunnel(N) sysfillrect(N) isci(N) ehci_pci(N) udp_tunnel(N) sysimgblt(N) fb_sys_fops(N) mdio(N) ehci_hcd(N) ahci(N) igb(N) libsas(N) libahci(N) mpt3sas(N) dca(N) raid_class(N) ptp(N) usbcore(N) pps_core(N) drm(N) libata(N) i2c_algo_bit(N) scsi_transport_sas(N) usb_common(N) scsi_mod(N) wmi(N) fjes(N) button(N)
[ 113.120226] Supported: No, Unsupported modules are loaded
[ 113.120228] CPU: 0 PID: 1954 Comm: kworker/u80:3 Tainted: G N 4.4.92-09-petasan #1
[ 113.120229] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[ 113.120233] Workqueue: fw_event_mpt2sas0 _firmware_event_work [mpt3sas]
[ 113.120234] 0000000000000000 ffffffff8131f665 ffff88103383f928 ffffffff81a1c167
[ 113.120237] ffffffff8107c65d ffff88203bd47400 ffff88103383f978 ffffffff8119bd90
[ 113.120239] ffffe8f000001600 ffffffff81f5da40 ffffffff8107c6dc ffffffff81a0c418
[ 113.120241] Call Trace:
[ 113.120250] [<ffffffff81018a0e>] dump_trace+0x5e/0x310
[ 113.120253] [<ffffffff81018dbc>] show_stack_log_lvl+0xfc/0x160
[ 113.120255] [<ffffffff81019a91>] show_stack+0x21/0x40
[ 113.120259] [<ffffffff8131f665>] dump_stack+0x5c/0x77
[ 113.120263] [<ffffffff8107c65d>] warn_slowpath_common+0x7d/0xb0
[ 113.120266] [<ffffffff8107c6dc>] warn_slowpath_fmt+0x4c/0x50
[ 113.120268] [<ffffffff810916e2>] check_flush_dependency+0x112/0x130
[ 113.120270] [<ffffffff81094f95>] flush_work+0x65/0x190
[ 113.120273] [<ffffffff8119c08a>] lru_add_drain_all+0x13a/0x180
[ 113.120276] [<ffffffff8123aeab>] invalidate_bdev+0x3b/0x50
[ 113.120279] [<ffffffff8123c1c7>] __invalidate_device+0x47/0x60
[ 113.120283] [<ffffffff813007ab>] invalidate_partition+0x2b/0x40
[ 113.120285] [<ffffffff813017cb>] del_gendisk+0xab/0x240
[ 113.120289] [<ffffffffa0014fbc>] sd_remove+0x5c/0xc0 [sd_mod]
[ 113.120296] [<ffffffff8146963a>] __device_release_driver+0x9a/0x140
[ 113.120299] [<ffffffff814696fe>] device_release_driver+0x1e/0x30
[ 113.120301] [<ffffffff81468d19>] bus_remove_device+0xf9/0x170
[ 113.120304] [<ffffffff81465497>] device_del+0x127/0x250
[ 113.120313] [<ffffffffa00643c6>] __scsi_remove_device+0xc6/0xd0 [scsi_mod]
[ 113.120320] [<ffffffffa00643f1>] scsi_remove_device+0x21/0x30 [scsi_mod]
[ 113.120325] [<ffffffffa00645a0>] scsi_remove_target+0x170/0x1f0 [scsi_mod]
[ 113.120329] [<ffffffffa002c01a>] sas_rphy_remove+0x5a/0x70 [scsi_transport_sas]
[ 113.120333] [<ffffffffa002dc8a>] sas_port_delete+0x2a/0x160 [scsi_transport_sas]
[ 113.120337] [<ffffffffa033cd41>] mpt3sas_transport_port_remove+0x1b1/0x1d0 [mpt3sas]
[ 113.120342] [<ffffffffa032f2f9>] _scsih_remove_device+0x1f9/0x300 [mpt3sas]
[ 113.120347] [<ffffffffa0330e47>] _scsih_device_remove_by_handle.part.27+0x67/0xb0 [mpt3sas]
[ 113.120350] [<ffffffffa0337775>] _firmware_event_work+0x1595/0x1cf0 [mpt3sas]
[ 113.120354] [<ffffffff810940c1>] process_one_work+0x161/0x4a0
[ 113.120356] [<ffffffff8109444a>] worker_thread+0x4a/0x4c0
[ 113.120358] [<ffffffff8109a097>] kthread+0xc7/0xe0
[ 113.120361] [<ffffffff816094bf>] ret_from_fork+0x3f/0x70
[ 113.121666] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
[ 113.121667] Leftover inexact backtrace:
[ 113.121669] [<ffffffff81099fd0>] ? kthread_park+0x50/0x50
[ 113.121670] ---[ end trace 5537a1ae12f446c1 ]---
[ 113.751451] XFS (sdc1): metadata I/O error: block 0x19093 ("xlog_iodone") error 5 numblks 64
[ 113.751454] XFS (sdc1): xfs_do_force_shutdown(0x2) called from line 1197 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa06e835d
[ 113.751459] XFS (sdc1): Log I/O Error Detected. Shutting down filesystem
[ 113.751460] XFS (sdc1): Please umount the filesystem and rectify the problem(s)
The SUSE kernel version was 4.4.73-5-default.
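To pick the relevant lines out of a long dmesg capture like the one above, a filter along these lines may help. This is a sketch, not a PetaSAN tool: the function name trace_summary is invented here, and the patterns are simply strings that appear in this trace.

```shell
#!/bin/sh
# Hypothetical filter: reads dmesg text on stdin and keeps only the kernel
# WARNING header, the workqueue dependency message, XFS lines and
# mpt2sas/mpt3sas device-removal messages.
trace_summary() {
    grep -E 'WARNING: CPU|workqueue: WQ_MEM_RECLAIM|XFS \(|mpt[23]sas_cm[0-9]+: removing'
}

# Example use on a live node:
#   dmesg | trace_summary
```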
admin
2,930 Posts
April 12, 2018, 9:10 pm
The trace is a bit different than before. Do you get any logs if you do:
dmesg | grep -i petasan
PetaSAN boots in BIOS mode. Does it make a difference if you boot the SLE kernel in BIOS vs UEFI mode?
I will try to get you a PetaSAN build based on the SLE 4.4.73 kernel within the next couple of days; hopefully this should work.
If you have time to install PetaSAN v1.4 on a node and test this, it is based on the 4.4.38 kernel, so this will also help. Otherwise you can wait until I send you the 4.4.73-based one.
protocol6v
85 Posts
April 13, 2018, 12:43 pm
Tried SLE in both BIOS and UEFI: no difference, no issue.
Can't seem to find a download link for 1.4. Where do I find it?
admin
2,930 Posts
April 13, 2018, 1:10 pm
It is on our download page:
http://www.petasan.org/downloads/
I will have a 4.4.73-based release ready for your testing by Monday. It should work like the SUSE kernel. Thanks for your help in making PetaSAN better.
Last edited on April 13, 2018, 1:10 pm by admin · #15
protocol6v
85 Posts
April 13, 2018, 1:41 pm
Ha! Total rookie here. Blew right by the "download" link and clicked into the info. Downloading now, will follow up.
protocol6v
85 Posts
April 13, 2018, 3:07 pm
Issue is NOT present with v1.4.
Looking forward to testing the other kernel.
Thanks and have a good weekend!
Last edited on April 13, 2018, 3:13 pm by protocol6v · #17
admin
2,930 Posts
April 16, 2018, 11:22 am
We prepared 3 kernel images and would very much appreciate it if you could test them.
https://drive.google.com/drive/folders/12XrHstPa0LwxYa252WD2bJYwhSuLD4c6?usp=sharing
You do not need to re-install PetaSAN, you just need to install the kernel deb package, one at a time, on any existing node:
dpkg -i linux-image-4.4.XX
then
reboot
then perform the physical disk removal.
When done please repeat the test with the other 2 kernel packages in the same way.
Thanks again for your help.
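The test cycle above (install one package, reboot, pull the disk) could be sketched as follows. The helper name running_kernel_is and the example release string are assumptions for illustration; the real package file names are whatever the shared folder contains.

```shell
#!/bin/sh
# Hypothetical check: succeed only if the running kernel release (uname -r)
# equals the release expected after installing a package and rebooting.
running_kernel_is() {
    [ "$(uname -r)" = "$1" ]
}

# Per-kernel cycle, run as root on one node:
#   dpkg -i linux-image-4.4.XX      # XX left as in the post; use the real file name
#   reboot
#   running_kernel_is 4.4.73-01-petasan || echo "booted $(uname -r) instead"
#   ...then perform the physical disk removal and capture dmesg.
```

Checking the booted release before each disk pull avoids attributing a result to the wrong kernel when the boot loader still defaults to an older entry.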
protocol6v
85 Posts
April 16, 2018, 11:28 am
Awesome! Will test all and follow up. If you don't hear from me today, you will tomorrow.
Thanks for going out of your way with this!
protocol6v
85 Posts
April 17, 2018, 3:31 pm
OK, tested all three.
4.4.73-01: bug report still issued to dmesg. OSD is still not downed in Ceph. Am now able to soft-reboot to get the system to see the down OSD, and when plugging drive back in after reboot, OSD syncs back up and health goes green.
Here's the dmesg from this kernel:
[ 3571.753792] sd 0:0:3:0: device_block, handle(0x000d)
[ 3573.753356] sd 0:0:3:0: device_unblock and setting to running, handle(0x000d)
[ 3573.755064] ------------[ cut here ]------------
[ 3573.755071] WARNING: CPU: 0 PID: 18760 at kernel/workqueue.c:2462 check_flush_dependency+0x112/0x130()
[ 3573.755084] workqueue: WQ_MEM_RECLAIM fw_event_mpt2sas0:_firmware_event_work [mpt3sas] is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
[ 3573.755084] Modules linked in: target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) sb_edac(N) edac_core(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel(N) aes_x86_64(N) lrw(N) ipmi_ssif(N) ipmi_devintf(N) joydev(N) gf128mul(N) glue_helper(N) ablk_helper(N) cryptd(N) sg(N) mei_me(N) mei(N) lpc_ich(N) ioatdma(N) mfd_core(N) ipmi_si(N) ipmi_msghandler(N) processor(N) shpchp(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) ses(N) enclosure(N) sd_mod(N) crc32c_intel(N)
[ 3573.755144] mgag200(N)
[ 3573.755144] ttm(N)
[ 3573.755145] ixgbe(N) drm_kms_helper(N) syscopyarea(N) vxlan(N) sysfillrect(N) ip6_udp_tunnel(N) sysimgblt(N) udp_tunnel(N) fb_sys_fops(N) mdio(N) ehci_pci(N) isci(N) ehci_hcd(N) igb(N) ahci(N) dca(N) libsas(N) libahci(N) ptp(N) drm(N) pps_core(N) mpt3sas(N) usbcore(N) raid_class(N) i2c_algo_bit(N) libata(N) usb_common(N) scsi_transport_sas(N) scsi_mod(N) wmi(N) fjes(N) button(N)
[ 3573.755165] Supported: No, Unsupported modules are loaded
[ 3573.755167] CPU: 0 PID: 18760 Comm: kworker/u80:2 Tainted: G N 4.4.73-01-petasan #1
[ 3573.755168] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[ 3573.755172] Workqueue: fw_event_mpt2sas0 _firmware_event_work [mpt3sas]
[ 3573.755173] 0000000000000000 ffffffff8131e3f5 ffff8810178b3930 ffffffff81a1ba38
[ 3573.755176] ffffffff8107c4bd ffff88203bd47400 ffff8810178b3980 ffffffff8119b640
[ 3573.755178] ffffe8f000001600 ffffffff81f558c0 ffffffff8107c53c ffffffff81a0bd98
[ 3573.755180] Call Trace:
[ 3573.755189] [<ffffffff81018a0e>] dump_trace+0x5e/0x310
[ 3573.755192] [<ffffffff81018dbc>] show_stack_log_lvl+0xfc/0x160
[ 3573.755195] [<ffffffff81019a91>] show_stack+0x21/0x40
[ 3573.755198] [<ffffffff8131e3f5>] dump_stack+0x5c/0x77
[ 3573.755202] [<ffffffff8107c4bd>] warn_slowpath_common+0x7d/0xb0
[ 3573.755204] [<ffffffff8107c53c>] warn_slowpath_fmt+0x4c/0x50
[ 3573.755207] [<ffffffff810913d2>] check_flush_dependency+0x112/0x130
[ 3573.755209] [<ffffffff81094c85>] flush_work+0x65/0x190
[ 3573.755212] [<ffffffff8119b92a>] lru_add_drain_all+0x13a/0x180
[ 3573.755215] [<ffffffff81239fbb>] invalidate_bdev+0x3b/0x50
[ 3573.755219] [<ffffffff8123b2b7>] __invalidate_device+0x47/0x60
[ 3573.755222] [<ffffffff812ff63b>] invalidate_partition+0x2b/0x40
[ 3573.755225] [<ffffffff8130060b>] del_gendisk+0xab/0x210
[ 3573.755229] [<ffffffffa006dfdc>] sd_remove+0x5c/0xc0 [sd_mod]
[ 3573.755239] [<ffffffff8146846a>] __device_release_driver+0x9a/0x140
[ 3573.755242] [<ffffffff8146852e>] device_release_driver+0x1e/0x30
[ 3573.755244] [<ffffffff81467b49>] bus_remove_device+0xf9/0x170
[ 3573.755247] [<ffffffff81464307>] device_del+0x127/0x250
[ 3573.755258] [<ffffffffa003d3d6>] __scsi_remove_device+0xc6/0xd0 [scsi_mod]
[ 3573.755265] [<ffffffffa003d401>] scsi_remove_device+0x21/0x30 [scsi_mod]
[ 3573.755270] [<ffffffffa003d597>] scsi_remove_target+0x157/0x1d0 [scsi_mod]
[ 3573.755274] [<ffffffffa001701a>] sas_rphy_remove+0x5a/0x70 [scsi_transport_sas]
[ 3573.755278] [<ffffffffa0018c8a>] sas_port_delete+0x2a/0x160 [scsi_transport_sas]
[ 3573.755283] [<ffffffffa022ad41>] mpt3sas_transport_port_remove+0x1b1/0x1d0 [mpt3sas]
[ 3573.755288] [<ffffffffa021d2f9>] _scsih_remove_device+0x1f9/0x300 [mpt3sas]
[ 3573.755293] [<ffffffffa021ee47>] _scsih_device_remove_by_handle.part.27+0x67/0xb0 [mpt3sas]
[ 3573.755298] [<ffffffffa0225775>] _firmware_event_work+0x1595/0x1cf0 [mpt3sas]
[ 3573.755301] [<ffffffff81093db1>] process_one_work+0x161/0x4a0
[ 3573.755304] [<ffffffff8109413a>] worker_thread+0x4a/0x4c0
[ 3573.755306] [<ffffffff81099d67>] kthread+0xc7/0xe0
[ 3573.755308] [<ffffffff816082bf>] ret_from_fork+0x3f/0x70
[ 3573.756607] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
[ 3573.756609] Leftover inexact backtrace:
[ 3573.756611] [<ffffffff81099ca0>] ? kthread_park+0x50/0x50
[ 3573.756612] ---[ end trace 178e66e781443fe0 ]---
[ 3573.777449] mpt2sas_cm0: removing handle(0x000d), sas_addr(0x5000c50056ba5a99)
[ 3573.777451] mpt2sas_cm0: removing : enclosure logical id(0x5003048001bea4ff), slot(3)
4.4.73-02: same symptoms as above. Here's the dmesg:
[ 323.633153] sd 0:0:1:0: device_block, handle(0x000b)
[ 325.882687] sd 0:0:1:0: device_unblock and setting to running, handle(0x000b)
[ 325.884347] ------------[ cut here ]------------
[ 325.884354] WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2462 check_flush_dependency+0x112/0x130()
[ 325.884367] workqueue: WQ_MEM_RECLAIM fw_event_mpt2sas0:_firmware_event_work [mpt3sas] is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
[ 325.884367] Modules linked in: target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) sb_edac(N) edac_core(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel(N) aes_x86_64(N) lrw(N) ipmi_ssif(N) gf128mul(N) ipmi_devintf(N) joydev(N) glue_helper(N) ablk_helper(N) cryptd(N) sg(N) ipmi_si(N) ipmi_msghandler(N) lpc_ich(N) mfd_core(N) ioatdma(N) mei_me(N) mei(N) processor(N) shpchp(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) ses(N) enclosure(N) sd_mod(N) crc32c_intel(N)
[ 325.884427] mgag200(N) ixgbe(N) ttm(N) vxlan(N) ip6_udp_tunnel(N) isci(N) udp_tunnel(N) mdio(N) drm_kms_helper(N) ahci(N) syscopyarea(N) sysfillrect(N) sysimgblt(N) ehci_pci(N) libsas(N) fb_sys_fops(N) libahci(N) ehci_hcd(N) igb(N) dca(N) mpt3sas(N) drm(N) ptp(N) raid_class(N) usbcore(N) libata(N) pps_core(N) i2c_algo_bit(N) scsi_transport_sas(N) usb_common(N) scsi_mod(N) wmi(N) fjes(N) button(N)
[ 325.884447] Supported: No, Unsupported modules are loaded
[ 325.884449] CPU: 0 PID: 6 Comm: kworker/u80:0 Tainted: G N 4.4.73-02-petasan #1
[ 325.884450] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[ 325.884454] Workqueue: fw_event_mpt2sas0 _firmware_event_work [mpt3sas]
[ 325.884455] 0000000000000000 ffffffff8131e3c5 ffff88018c51f930 ffffffff81a1ba38
[ 325.884458] ffffffff8107c4bd
[ 325.884459] ffff88203bd47400 ffff88018c51f980 ffffffff8119b610
[ 325.884461] ffffe8f000001600
[ 325.884462] ffffffff81f558c0 ffffffff8107c53c ffffffff81a0bd98
[ 325.884463] Call Trace:
[ 325.884472] [<ffffffff81018a0e>] dump_trace+0x5e/0x310
[ 325.884476] [<ffffffff81018dbc>] show_stack_log_lvl+0xfc/0x160
[ 325.884478] [<ffffffff81019a91>] show_stack+0x21/0x40
[ 325.884481] [<ffffffff8131e3c5>] dump_stack+0x5c/0x77
[ 325.884485] [<ffffffff8107c4bd>] warn_slowpath_common+0x7d/0xb0
[ 325.884488] [<ffffffff8107c53c>] warn_slowpath_fmt+0x4c/0x50
[ 325.884490] [<ffffffff810913a2>] check_flush_dependency+0x112/0x130
[ 325.884494] [<ffffffff81094c55>] flush_work+0x65/0x190
[ 325.884497] [<ffffffff8119b8fa>] lru_add_drain_all+0x13a/0x180
[ 325.884500] [<ffffffff81239f8b>] invalidate_bdev+0x3b/0x50
[ 325.884504] [<ffffffff8123b287>] __invalidate_device+0x47/0x60
[ 325.884507] [<ffffffff812ff60b>] invalidate_partition+0x2b/0x40
[ 325.884510] [<ffffffff813005db>] del_gendisk+0xab/0x210
[ 325.884514] [<ffffffffa0014fdc>] sd_remove+0x5c/0xc0 [sd_mod]
[ 325.884523] [<ffffffff8146843a>] __device_release_driver+0x9a/0x140
[ 325.884526] [<ffffffff814684fe>] device_release_driver+0x1e/0x30
[ 325.884529] [<ffffffff81467b19>] bus_remove_device+0xf9/0x170
[ 325.884532] [<ffffffff814642d7>] device_del+0x127/0x250
[ 325.884543] [<ffffffffa00643d6>] __scsi_remove_device+0xc6/0xd0 [scsi_mod]
[ 325.884551] [<ffffffffa0064401>] scsi_remove_device+0x21/0x30 [scsi_mod]
[ 325.884557] [<ffffffffa0064597>] scsi_remove_target+0x157/0x1d0 [scsi_mod]
[ 325.884561] [<ffffffffa00d101a>] sas_rphy_remove+0x5a/0x70 [scsi_transport_sas]
[ 325.884565] [<ffffffffa00d2c8a>] sas_port_delete+0x2a/0x160 [scsi_transport_sas]
[ 325.884570] [<ffffffffa032dd41>] mpt3sas_transport_port_remove+0x1b1/0x1d0 [mpt3sas]
[ 325.884576] [<ffffffffa03202f9>] _scsih_remove_device+0x1f9/0x300 [mpt3sas]
[ 325.884581] [<ffffffffa0321e47>] _scsih_device_remove_by_handle.part.27+0x67/0xb0 [mpt3sas]
[ 325.884585] [<ffffffffa0328775>] _firmware_event_work+0x1595/0x1cf0 [mpt3sas]
[ 325.884588] [<ffffffff81093d81>] process_one_work+0x161/0x4a0
[ 325.884591] [<ffffffff8109410a>] worker_thread+0x4a/0x4c0
[ 325.884593] [<ffffffff81099d37>] kthread+0xc7/0xe0
[ 325.884595] [<ffffffff816082bf>] ret_from_fork+0x3f/0x70
[ 325.886002] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
[ 325.886003] Leftover inexact backtrace:
[ 325.886006] [<ffffffff81099c70>] ? kthread_park+0x50/0x50
[ 325.886007] ---[ end trace 7abe0a50a8247368 ]---
[ 325.906049] mpt2sas_cm0: removing handle(0x000b), sas_addr(0x5000cca03e999ab5)
[ 325.906051] mpt2sas_cm0: removing : enclosure logical id(0x5003048001bea4ff), slot(1)
4.4.120-1: no bug report issued to dmesg. OSD still does not go down in Ceph. Again, rebooting will bring the OSD down and plugging it back in syncs everything back up OK.
Is there a "supported" HBA I should be looking at purchasing instead of going through this? I haven't really seen many HBAs other than the LSI ones. I'm currently using an LSI 9211-8i in IT mode; maybe I should try flashing the IR firmware? Should I upgrade to the newer SAS3 HBAs? That said, I do want to help get this figured out, as I'm sure other people will be using the 9211s. They're extremely common, especially among people coming from ZFS or FreeNAS.
Anyway, the notable difference when using the three new kernels was the fact that the system did not hard lock, requiring a power off and power on. I was able to soft-reboot the system to get everything flowing again.
Let me know what you think! Thanks!
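Since the symptom here is that Ceph keeps reporting the OSD as up after the disk is pulled, it may help to check the OSD state explicitly after each test. The sketch below assumes the usual `ceph osd tree` layout, where each OSD row has its name (osd.N) followed directly by its up/down status. The helper name down_osds is made up for this example, and the OSD id 3 is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: reads `ceph osd tree` output on stdin and prints the
# names of OSDs whose status column (the field right after osd.N) is "down".
down_osds() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") print $i
    }'
}

# Example use on a node with the ceph CLI:
#   ceph osd tree | down_osds
# If the pulled disk's OSD is not listed, it can be marked manually, e.g.:
#   ceph osd down 3     # 3 is a placeholder OSD id
#   ceph osd out 3
```

Note that an OSD daemon that is still running may mark itself up again after a manual `ceph osd down`, so this is a way to observe the problem rather than a fix for it.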
OK, tested all three.
4.4.73-01: bug report still issued to dmesg. OSD is still not downed in Ceph. Am now able to soft-reboot to get the system to see the down OSD, and when plugging drive back in after reboot, OSD syncs back up and health goes green.
Here's the dmesg from this kernel:
[ 3571.753792] sd 0:0:3:0: device_block, handle(0x000d)
[ 3573.753356] sd 0:0:3:0: device_unblock and setting to running, handle(0x000d)
[ 3573.755064] ------------[ cut here ]------------
[ 3573.755071] WARNING: CPU: 0 PID: 18760 at kernel/workqueue.c:2462 check_flush_dependency+0x112/0x130()
[ 3573.755084] workqueue: WQ_MEM_RECLAIM fw_event_mpt2sas0:_firmware_event_work [mpt3sas] is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
[ 3573.755084] Modules linked in: target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) sb_edac(N) edac_core(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel(N) aes_x86_64(N) lrw(N) ipmi_ssif(N) ipmi_devintf(N) joydev(N) gf128mul(N) glue_helper(N) ablk_helper(N) cryptd(N) sg(N) mei_me(N) mei(N) lpc_ich(N) ioatdma(N) mfd_core(N) ipmi_si(N) ipmi_msghandler(N) processor(N) shpchp(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) ses(N) enclosure(N) sd_mod(N) crc32c_intel(N)
[ 3573.755144] mgag200(N)
[ 3573.755144] ttm(N)
[ 3573.755145] ixgbe(N) drm_kms_helper(N) syscopyarea(N) vxlan(N) sysfillrect(N) ip6_udp_tunnel(N) sysimgblt(N) udp_tunnel(N) fb_sys_fops(N) mdio(N) ehci_pci(N) isci(N) ehci_hcd(N) igb(N) ahci(N) dca(N) libsas(N) libahci(N) ptp(N) drm(N) pps_core(N) mpt3sas(N) usbcore(N) raid_class(N) i2c_algo_bit(N) libata(N) usb_common(N) scsi_transport_sas(N) scsi_mod(N) wmi(N) fjes(N) button(N)
[ 3573.755165] Supported: No, Unsupported modules are loaded
[ 3573.755167] CPU: 0 PID: 18760 Comm: kworker/u80:2 Tainted: G N 4.4.73-01-petasan #1
[ 3573.755168] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[ 3573.755172] Workqueue: fw_event_mpt2sas0 _firmware_event_work [mpt3sas]
[ 3573.755173] 0000000000000000 ffffffff8131e3f5 ffff8810178b3930 ffffffff81a1ba38
[ 3573.755176] ffffffff8107c4bd ffff88203bd47400 ffff8810178b3980 ffffffff8119b640
[ 3573.755178] ffffe8f000001600 ffffffff81f558c0 ffffffff8107c53c ffffffff81a0bd98
[ 3573.755180] Call Trace:
[ 3573.755189] [<ffffffff81018a0e>] dump_trace+0x5e/0x310
[ 3573.755192] [<ffffffff81018dbc>] show_stack_log_lvl+0xfc/0x160
[ 3573.755195] [<ffffffff81019a91>] show_stack+0x21/0x40
[ 3573.755198] [<ffffffff8131e3f5>] dump_stack+0x5c/0x77
[ 3573.755202] [<ffffffff8107c4bd>] warn_slowpath_common+0x7d/0xb0
[ 3573.755204] [<ffffffff8107c53c>] warn_slowpath_fmt+0x4c/0x50
[ 3573.755207] [<ffffffff810913d2>] check_flush_dependency+0x112/0x130
[ 3573.755209] [<ffffffff81094c85>] flush_work+0x65/0x190
[ 3573.755212] [<ffffffff8119b92a>] lru_add_drain_all+0x13a/0x180
[ 3573.755215] [<ffffffff81239fbb>] invalidate_bdev+0x3b/0x50
[ 3573.755219] [<ffffffff8123b2b7>] __invalidate_device+0x47/0x60
[ 3573.755222] [<ffffffff812ff63b>] invalidate_partition+0x2b/0x40
[ 3573.755225] [<ffffffff8130060b>] del_gendisk+0xab/0x210
[ 3573.755229] [<ffffffffa006dfdc>] sd_remove+0x5c/0xc0 [sd_mod]
[ 3573.755239] [<ffffffff8146846a>] __device_release_driver+0x9a/0x140
[ 3573.755242] [<ffffffff8146852e>] device_release_driver+0x1e/0x30
[ 3573.755244] [<ffffffff81467b49>] bus_remove_device+0xf9/0x170
[ 3573.755247] [<ffffffff81464307>] device_del+0x127/0x250
[ 3573.755258] [<ffffffffa003d3d6>] __scsi_remove_device+0xc6/0xd0 [scsi_mod]
[ 3573.755265] [<ffffffffa003d401>] scsi_remove_device+0x21/0x30 [scsi_mod]
[ 3573.755270] [<ffffffffa003d597>] scsi_remove_target+0x157/0x1d0 [scsi_mod]
[ 3573.755274] [<ffffffffa001701a>] sas_rphy_remove+0x5a/0x70 [scsi_transport_sas]
[ 3573.755278] [<ffffffffa0018c8a>] sas_port_delete+0x2a/0x160 [scsi_transport_sas]
[ 3573.755283] [<ffffffffa022ad41>] mpt3sas_transport_port_remove+0x1b1/0x1d0 [mpt3sas]
[ 3573.755288] [<ffffffffa021d2f9>] _scsih_remove_device+0x1f9/0x300 [mpt3sas]
[ 3573.755293] [<ffffffffa021ee47>] _scsih_device_remove_by_handle.part.27+0x67/0xb0 [mpt3sas]
[ 3573.755298] [<ffffffffa0225775>] _firmware_event_work+0x1595/0x1cf0 [mpt3sas]
[ 3573.755301] [<ffffffff81093db1>] process_one_work+0x161/0x4a0
[ 3573.755304] [<ffffffff8109413a>] worker_thread+0x4a/0x4c0
[ 3573.755306] [<ffffffff81099d67>] kthread+0xc7/0xe0
[ 3573.755308] [<ffffffff816082bf>] ret_from_fork+0x3f/0x70
[ 3573.756607] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
[ 3573.756609] Leftover inexact backtrace:
[ 3573.756611] [<ffffffff81099ca0>] ? kthread_park+0x50/0x50
[ 3573.756612] ---[ end trace 178e66e781443fe0 ]---
[ 3573.777449] mpt2sas_cm0: removing handle(0x000d), sas_addr(0x5000c50056ba5a99)
[ 3573.777451] mpt2sas_cm0: removing : enclosure logical id(0x5003048001bea4ff), slot(3)
4.4.73-02: same symptoms as above. Here's the dmesg:
[ 323.633153] sd 0:0:1:0: device_block, handle(0x000b)
[ 325.882687] sd 0:0:1:0: device_unblock and setting to running, handle(0x000b)
[ 325.884347] ------------[ cut here ]------------
[ 325.884354] WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2462 check_flush_dependency+0x112/0x130()
[ 325.884367] workqueue: WQ_MEM_RECLAIM fw_event_mpt2sas0:_firmware_event_work [mpt3sas] is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
[ 325.884367] Modules linked in: target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) sb_edac(N) edac_core(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel(N) aes_x86_64(N) lrw(N) ipmi_ssif(N) gf128mul(N) ipmi_devintf(N) joydev(N) glue_helper(N) ablk_helper(N) cryptd(N) sg(N) ipmi_si(N) ipmi_msghandler(N) lpc_ich(N) mfd_core(N) ioatdma(N) mei_me(N) mei(N) processor(N) shpchp(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) ses(N) enclosure(N) sd_mod(N) crc32c_intel(N)
[ 325.884427] mgag200(N) ixgbe(N) ttm(N) vxlan(N) ip6_udp_tunnel(N) isci(N) udp_tunnel(N) mdio(N) drm_kms_helper(N) ahci(N) syscopyarea(N) sysfillrect(N) sysimgblt(N) ehci_pci(N) libsas(N) fb_sys_fops(N) libahci(N) ehci_hcd(N) igb(N) dca(N) mpt3sas(N) drm(N) ptp(N) raid_class(N) usbcore(N) libata(N) pps_core(N) i2c_algo_bit(N) scsi_transport_sas(N) usb_common(N) scsi_mod(N) wmi(N) fjes(N) button(N)
[ 325.884447] Supported: No, Unsupported modules are loaded
[ 325.884449] CPU: 0 PID: 6 Comm: kworker/u80:0 Tainted: G N 4.4.73-02-petasan #1
[ 325.884450] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[ 325.884454] Workqueue: fw_event_mpt2sas0 _firmware_event_work [mpt3sas]
[ 325.884455] 0000000000000000 ffffffff8131e3c5 ffff88018c51f930 ffffffff81a1ba38
[ 325.884458] ffffffff8107c4bd
[ 325.884459] ffff88203bd47400 ffff88018c51f980 ffffffff8119b610
[ 325.884461] ffffe8f000001600
[ 325.884462] ffffffff81f558c0 ffffffff8107c53c ffffffff81a0bd98
[ 325.884463] Call Trace:
[ 325.884472] [<ffffffff81018a0e>] dump_trace+0x5e/0x310
[ 325.884476] [<ffffffff81018dbc>] show_stack_log_lvl+0xfc/0x160
[ 325.884478] [<ffffffff81019a91>] show_stack+0x21/0x40
[ 325.884481] [<ffffffff8131e3c5>] dump_stack+0x5c/0x77
[ 325.884485] [<ffffffff8107c4bd>] warn_slowpath_common+0x7d/0xb0
[ 325.884488] [<ffffffff8107c53c>] warn_slowpath_fmt+0x4c/0x50
[ 325.884490] [<ffffffff810913a2>] check_flush_dependency+0x112/0x130
[ 325.884494] [<ffffffff81094c55>] flush_work+0x65/0x190
[ 325.884497] [<ffffffff8119b8fa>] lru_add_drain_all+0x13a/0x180
[ 325.884500] [<ffffffff81239f8b>] invalidate_bdev+0x3b/0x50
[ 325.884504] [<ffffffff8123b287>] __invalidate_device+0x47/0x60
[ 325.884507] [<ffffffff812ff60b>] invalidate_partition+0x2b/0x40
[ 325.884510] [<ffffffff813005db>] del_gendisk+0xab/0x210
[ 325.884514] [<ffffffffa0014fdc>] sd_remove+0x5c/0xc0 [sd_mod]
[ 325.884523] [<ffffffff8146843a>] __device_release_driver+0x9a/0x140
[ 325.884526] [<ffffffff814684fe>] device_release_driver+0x1e/0x30
[ 325.884529] [<ffffffff81467b19>] bus_remove_device+0xf9/0x170
[ 325.884532] [<ffffffff814642d7>] device_del+0x127/0x250
[ 325.884543] [<ffffffffa00643d6>] __scsi_remove_device+0xc6/0xd0 [scsi_mod]
[ 325.884551] [<ffffffffa0064401>] scsi_remove_device+0x21/0x30 [scsi_mod]
[ 325.884557] [<ffffffffa0064597>] scsi_remove_target+0x157/0x1d0 [scsi_mod]
[ 325.884561] [<ffffffffa00d101a>] sas_rphy_remove+0x5a/0x70 [scsi_transport_sas]
[ 325.884565] [<ffffffffa00d2c8a>] sas_port_delete+0x2a/0x160 [scsi_transport_sas]
[ 325.884570] [<ffffffffa032dd41>] mpt3sas_transport_port_remove+0x1b1/0x1d0 [mpt3sas]
[ 325.884576] [<ffffffffa03202f9>] _scsih_remove_device+0x1f9/0x300 [mpt3sas]
[ 325.884581] [<ffffffffa0321e47>] _scsih_device_remove_by_handle.part.27+0x67/0xb0 [mpt3sas]
[ 325.884585] [<ffffffffa0328775>] _firmware_event_work+0x1595/0x1cf0 [mpt3sas]
[ 325.884588] [<ffffffff81093d81>] process_one_work+0x161/0x4a0
[ 325.884591] [<ffffffff8109410a>] worker_thread+0x4a/0x4c0
[ 325.884593] [<ffffffff81099d37>] kthread+0xc7/0xe0
[ 325.884595] [<ffffffff816082bf>] ret_from_fork+0x3f/0x70
[ 325.886002] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
[ 325.886003] Leftover inexact backtrace:
[ 325.886006] [<ffffffff81099c70>] ? kthread_park+0x50/0x50
[ 325.886007] ---[ end trace 7abe0a50a8247368 ]---
[ 325.906049] mpt2sas_cm0: removing handle(0x000b), sas_addr(0x5000cca03e999ab5)
[ 325.906051] mpt2sas_cm0: removing : enclosure logical id(0x5003048001bea4ff), slot(1)
4.4.120-1: no bug report issued to dmesg. The OSD still does not go down in Ceph. Again, rebooting brings the OSD down, and plugging the drive back in syncs everything back up OK.
Is there a "supported" HBA I should look at purchasing instead of going through this? I haven't really seen many HBAs other than the LSI ones. I'm currently using an LSI 9211-8i in IT mode; maybe I should try flashing the IR firmware? Or should I upgrade to the newer SAS3 HBAs? I do want to help get this figured out, though, as I'm sure other people will be using 9211s. They're extremely common, especially among people coming from ZFS or FreeNAS.
Anyway, the notable difference with the three new kernels was that the system did not hard lock and need a power off and power on. I was able to soft-reboot the system to get everything flowing again.
Let me know what you think! Thanks!
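For anyone comparing captures like the ones above across kernels, here is a minimal Python sketch (not part of the original post) that pulls out the three things that varied between tests: the kernel release from the "Tainted:" line, whether the check_flush_dependency warning fired, and which device handles mpt2sas reported as removed. The function name `summarize_dmesg` and the regular expressions are assumptions based only on the log lines quoted in this thread; other kernel or mpt3sas versions may format these lines differently.

```python
import re

# Patterns derived only from the dmesg lines quoted in this thread (assumption:
# other driver or kernel versions may print these lines differently).
RELEASE_RE = re.compile(r"Tainted: .*?(\d+\.\d+\.\d+\S*)\s+#\d+")
WARNING_RE = re.compile(r"WARNING: .*check_flush_dependency")
REMOVED_RE = re.compile(
    r"removing handle\((0x[0-9a-fA-F]+)\), sas_addr\((0x[0-9a-fA-F]+)\)"
)


def summarize_dmesg(text):
    """Summarize a saved dmesg capture (hypothetical helper).

    Returns a dict with the kernel release seen in the trace header,
    whether the workqueue flush warning appeared, and a list of
    (handle, sas_addr) pairs for devices the HBA driver removed.
    """
    release = None
    warned = False
    removed = []
    for line in text.splitlines():
        match = RELEASE_RE.search(line)
        if match and release is None:
            release = match.group(1)
        if WARNING_RE.search(line):
            warned = True
        match = REMOVED_RE.search(line)
        if match:
            removed.append((match.group(1), match.group(2)))
    return {"kernel": release, "flush_warning": warned, "removed": removed}


if __name__ == "__main__":
    import sys

    # Usage: dmesg > capture.txt, then: python summarize.py < capture.txt
    print(summarize_dmesg(sys.stdin.read()))
```

Running it over one saved capture per kernel makes it easier to line up results such as "warning present, handle removed" against "no warning" without reading each trace by eye.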
admin
2,930 Posts
Quote from admin on April 12, 2018, 9:10 pm
The trace is a bit different than before. Do you get any logs if you run:
dmesg | grep -i petasan
PetaSAN boots in BIOS mode. Does it make a difference if you boot the SLE kernel in BIOS vs. UEFI mode?
I will try to get you a PetaSAN build based on the SLE 4.4.73 kernel within the next couple of days; hopefully this will work.
If you have time to install PetaSAN v1.4 on a node and test it, that will also help; it is based on the 4.4.38 kernel. Otherwise you can wait until I send you the 4.4.73-based build.
protocol6v
85 Posts
Quote from protocol6v on April 13, 2018, 12:43 pm
Tried SLE in both BIOS and UEFI modes: no difference, and no issue in either.
Can't seem to find a download link for 1.4. Where do I find it?
admin
2,930 Posts
Quote from admin on April 13, 2018, 1:10 pm
It is on our download page:
http://www.petasan.org/downloads/
I will have a 4.4.73-based release ready for your testing by Monday. It should work like the SUSE kernel. Thanks for your help in making PetaSAN better.
protocol6v
85 Posts
Quote from protocol6v on April 13, 2018, 1:41 pm
Ha! Total rookie here. Blew right by the "download" link and clicked into the info. Downloading now, will follow up.
protocol6v
85 Posts
Quote from protocol6v on April 13, 2018, 3:07 pm
Issue is NOT present with v1.4.
Looking forward to testing the other kernel.
Thanks and have a good weekend!
admin
2,930 Posts
Quote from admin on April 16, 2018, 11:22 am
We prepared 3 kernel images and would appreciate it very much if you could test them:
https://drive.google.com/drive/folders/12XrHstPa0LwxYa252WD2bJYwhSuLD4c6?usp=sharing
You do not need to re-install PetaSAN; you just need to install the kernel deb packages, one at a time, on any existing node:
dpkg -i linux-image-4.4.XX
then
reboot
then perform the physical disk removal.
When done, please repeat the test with the other 2 kernel packages in the same way.
Thanks again for your help.
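The test loop described above (install one kernel package, reboot, pull a disk, look at Ceph) can be sketched in Python as two small checks. This is only an illustration, not something from the thread: the function names are hypothetical, and the JSON shape is assumed to be what `ceph osd tree -f json` prints, namely a `nodes` list whose OSD entries carry `name`, `type` and `status`. Please verify that against your own cluster before relying on it.

```python
import json
import platform


def running_kernel_matches(expected_release, running_release=None):
    """Return True if the booted kernel is the one under test.

    expected_release is a release string such as "4.4.73-01-petasan"
    (the form that appears in the dmesg traces in this thread).
    running_release defaults to platform.release(), which reports the
    same string as `uname -r`.
    """
    if running_release is None:
        running_release = platform.release()
    return running_release == expected_release


def osds_not_up(osd_tree_json):
    """List OSD names whose status is anything other than "up".

    osd_tree_json is assumed to be the text printed by
    `ceph osd tree -f json`.
    """
    tree = json.loads(osd_tree_json)
    return sorted(
        node["name"]
        for node in tree.get("nodes", [])
        if node.get("type") == "osd" and node.get("status") != "up"
    )
```

After each reboot, `running_kernel_matches` guards against testing the wrong kernel by accident. After pulling the disk, an empty list from `osds_not_up` would correspond to the symptom reported in this thread, where the OSD stays up even though its drive is gone.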
protocol6v
85 Posts
Quote from protocol6v on April 16, 2018, 11:28 am
Awesome! Will test all and follow up. If you don't hear from me today, you will tomorrow.
Thanks for going out of your way with this!
[ 3573.755180] Call Trace:
[ 3573.755189] [<ffffffff81018a0e>] dump_trace+0x5e/0x310
[ 3573.755192] [<ffffffff81018dbc>] show_stack_log_lvl+0xfc/0x160
[ 3573.755195] [<ffffffff81019a91>] show_stack+0x21/0x40
[ 3573.755198] [<ffffffff8131e3f5>] dump_stack+0x5c/0x77
[ 3573.755202] [<ffffffff8107c4bd>] warn_slowpath_common+0x7d/0xb0
[ 3573.755204] [<ffffffff8107c53c>] warn_slowpath_fmt+0x4c/0x50
[ 3573.755207] [<ffffffff810913d2>] check_flush_dependency+0x112/0x130
[ 3573.755209] [<ffffffff81094c85>] flush_work+0x65/0x190
[ 3573.755212] [<ffffffff8119b92a>] lru_add_drain_all+0x13a/0x180
[ 3573.755215] [<ffffffff81239fbb>] invalidate_bdev+0x3b/0x50
[ 3573.755219] [<ffffffff8123b2b7>] __invalidate_device+0x47/0x60
[ 3573.755222] [<ffffffff812ff63b>] invalidate_partition+0x2b/0x40
[ 3573.755225] [<ffffffff8130060b>] del_gendisk+0xab/0x210
[ 3573.755229] [<ffffffffa006dfdc>] sd_remove+0x5c/0xc0 [sd_mod]
[ 3573.755239] [<ffffffff8146846a>] __device_release_driver+0x9a/0x140
[ 3573.755242] [<ffffffff8146852e>] device_release_driver+0x1e/0x30
[ 3573.755244] [<ffffffff81467b49>] bus_remove_device+0xf9/0x170
[ 3573.755247] [<ffffffff81464307>] device_del+0x127/0x250
[ 3573.755258] [<ffffffffa003d3d6>] __scsi_remove_device+0xc6/0xd0 [scsi_mod]
[ 3573.755265] [<ffffffffa003d401>] scsi_remove_device+0x21/0x30 [scsi_mod]
[ 3573.755270] [<ffffffffa003d597>] scsi_remove_target+0x157/0x1d0 [scsi_mod]
[ 3573.755274] [<ffffffffa001701a>] sas_rphy_remove+0x5a/0x70 [scsi_transport_sas]
[ 3573.755278] [<ffffffffa0018c8a>] sas_port_delete+0x2a/0x160 [scsi_transport_sas]
[ 3573.755283] [<ffffffffa022ad41>] mpt3sas_transport_port_remove+0x1b1/0x1d0 [mpt3sas]
[ 3573.755288] [<ffffffffa021d2f9>] _scsih_remove_device+0x1f9/0x300 [mpt3sas]
[ 3573.755293] [<ffffffffa021ee47>] _scsih_device_remove_by_handle.part.27+0x67/0xb0 [mpt3sas]
[ 3573.755298] [<ffffffffa0225775>] _firmware_event_work+0x1595/0x1cf0 [mpt3sas]
[ 3573.755301] [<ffffffff81093db1>] process_one_work+0x161/0x4a0
[ 3573.755304] [<ffffffff8109413a>] worker_thread+0x4a/0x4c0
[ 3573.755306] [<ffffffff81099d67>] kthread+0xc7/0xe0
[ 3573.755308] [<ffffffff816082bf>] ret_from_fork+0x3f/0x70
[ 3573.756607] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70[ 3573.756609] Leftover inexact backtrace:
[ 3573.756611] [<ffffffff81099ca0>] ? kthread_park+0x50/0x50
[ 3573.756612] ---[ end trace 178e66e781443fe0 ]---
[ 3573.777449] mpt2sas_cm0: removing handle(0x000d), sas_addr(0x5000c50056ba5a99)
[ 3573.777451] mpt2sas_cm0: removing : enclosure logical id(0x5003048001bea4ff), slot(3)
4.4.73-02: same symptoms as above. Here's the dmesg:
[ 323.633153] sd 0:0:1:0: device_block, handle(0x000b)
[ 325.882687] sd 0:0:1:0: device_unblock and setting to running, handle(0x000b)
[ 325.884347] ------------[ cut here ]------------
[ 325.884354] WARNING: CPU: 0 PID: 6 at kernel/workqueue.c:2462 check_flush_dependency+0x112/0x130()
[ 325.884367] workqueue: WQ_MEM_RECLAIM fw_event_mpt2sas0:_firmware_event_work [mpt3sas] is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
[ 325.884367] Modules linked in: target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_r bd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) sb_edac(N) edac_core(N) x86_pkg_temp_thermal(N) i ntel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel( N) aes_x86_64(N) lrw(N) ipmi_ssif(N) gf128mul(N) ipmi_devintf(N) joydev(N) glue_helper(N) ablk_helper(N) cryptd(N) sg(N) ipmi_si(N) ipmi_msghandler(N) lpc_ic h(N) mfd_core(N) ioatdma(N) mei_me(N) mei(N) processor(N) shpchp(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) ses(N) enclosure( N) sd_mod(N) crc32c_intel(N)
[ 325.884427] mgag200(N) ixgbe(N) ttm(N) vxlan(N) ip6_udp_tunnel(N) isci(N) udp_tunnel(N) mdio(N) drm_kms_helper(N) ahci(N) syscopyarea(N) sysfillrect(N) s ysimgblt(N) ehci_pci(N) libsas(N) fb_sys_fops(N) libahci(N) ehci_hcd(N) igb(N) dca(N) mpt3sas(N) drm(N) ptp(N) raid_class(N) usbcore(N) libata(N) pps_core(N) i2c_algo_bit(N) scsi_transport_sas(N) usb_common(N) scsi_mod(N) wmi(N) fjes(N) button(N)
[ 325.884447] Supported: No, Unsupported modules are loaded
[ 325.884449] CPU: 0 PID: 6 Comm: kworker/u80:0 Tainted: G N 4.4.73-02-petasan #1
[ 325.884450] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[ 325.884454] Workqueue: fw_event_mpt2sas0 _firmware_event_work [mpt3sas]
[ 325.884455] 0000000000000000 ffffffff8131e3c5 ffff88018c51f930 ffffffff81a1ba38
[ 325.884458] ffffffff8107c4bd
[ 325.884459] ffff88203bd47400 ffff88018c51f980 ffffffff8119b610
[ 325.884461] ffffe8f000001600
[ 325.884462] ffffffff81f558c0 ffffffff8107c53c ffffffff81a0bd98
[ 325.884463] Call Trace:
[ 325.884472] [<ffffffff81018a0e>] dump_trace+0x5e/0x310
[ 325.884476] [<ffffffff81018dbc>] show_stack_log_lvl+0xfc/0x160
[ 325.884478] [<ffffffff81019a91>] show_stack+0x21/0x40
[ 325.884481] [<ffffffff8131e3c5>] dump_stack+0x5c/0x77
[ 325.884485] [<ffffffff8107c4bd>] warn_slowpath_common+0x7d/0xb0
[ 325.884488] [<ffffffff8107c53c>] warn_slowpath_fmt+0x4c/0x50
[ 325.884490] [<ffffffff810913a2>] check_flush_dependency+0x112/0x130
[ 325.884494] [<ffffffff81094c55>] flush_work+0x65/0x190
[ 325.884497] [<ffffffff8119b8fa>] lru_add_drain_all+0x13a/0x180
[ 325.884500] [<ffffffff81239f8b>] invalidate_bdev+0x3b/0x50
[ 325.884504] [<ffffffff8123b287>] __invalidate_device+0x47/0x60
[ 325.884507] [<ffffffff812ff60b>] invalidate_partition+0x2b/0x40
[ 325.884510] [<ffffffff813005db>] del_gendisk+0xab/0x210
[ 325.884514] [<ffffffffa0014fdc>] sd_remove+0x5c/0xc0 [sd_mod]
[ 325.884523] [<ffffffff8146843a>] __device_release_driver+0x9a/0x140
[ 325.884526] [<ffffffff814684fe>] device_release_driver+0x1e/0x30
[ 325.884529] [<ffffffff81467b19>] bus_remove_device+0xf9/0x170
[ 325.884532] [<ffffffff814642d7>] device_del+0x127/0x250
[ 325.884543] [<ffffffffa00643d6>] __scsi_remove_device+0xc6/0xd0 [scsi_mod]
[ 325.884551] [<ffffffffa0064401>] scsi_remove_device+0x21/0x30 [scsi_mod]
[ 325.884557] [<ffffffffa0064597>] scsi_remove_target+0x157/0x1d0 [scsi_mod]
[ 325.884561] [<ffffffffa00d101a>] sas_rphy_remove+0x5a/0x70 [scsi_transport_sas]
[ 325.884565] [<ffffffffa00d2c8a>] sas_port_delete+0x2a/0x160 [scsi_transport_sas]
[ 325.884570] [<ffffffffa032dd41>] mpt3sas_transport_port_remove+0x1b1/0x1d0 [mpt3sas]
[ 325.884576] [<ffffffffa03202f9>] _scsih_remove_device+0x1f9/0x300 [mpt3sas]
[ 325.884581] [<ffffffffa0321e47>] _scsih_device_remove_by_handle.part.27+0x67/0xb0 [mpt3sas]
[ 325.884585] [<ffffffffa0328775>] _firmware_event_work+0x1595/0x1cf0 [mpt3sas]
[ 325.884588] [<ffffffff81093d81>] process_one_work+0x161/0x4a0
[ 325.884591] [<ffffffff8109410a>] worker_thread+0x4a/0x4c0
[ 325.884593] [<ffffffff81099d37>] kthread+0xc7/0xe0
[ 325.884595] [<ffffffff816082bf>] ret_from_fork+0x3f/0x70
[ 325.886002] DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70[ 325.886003] Leftover inexact backtrace:
[ 325.886006] [<ffffffff81099c70>] ? kthread_park+0x50/0x50
[ 325.886007] ---[ end trace 7abe0a50a8247368 ]---
[ 325.906049] mpt2sas_cm0: removing handle(0x000b), sas_addr(0x5000cca03e999ab5)
[ 325.906051] mpt2sas_cm0: removing : enclosure logical id(0x5003048001bea4ff), slot(1)
4.4.120-1: no bug report issued to dmesg. OSD still does not go down in Ceph. Again, rebooting will bring the OSD down and plugging it back in syncs everything back up OK.
Is there a "supported" HBA i should be looking at purchasing instead of going through this? I haven't really seen many other HBAs other than the LSI's (Currently using LSI 9211-8i in IT mode, maybe I should try flashing the IR firmware?) Should I upgrade to the newer SAS3 HBAs? Although I do want to help getting this figured out, as I'm sure other people will be using the 9211's. They're extremely common, especially people coming from ZFS or freenas.
Anyway, the notable difference when using the three new kernels was the fact that the system did not hard lock, requiring a power off and power on. I was able to soft-reboot the system to get everything flowing again.
Let me know what you think! Thanks!