OSD crashed and restarted
admin
2,930 Posts
July 18, 2018, 7:18 am
You can test the 4.4.126-04 build from: https://drive.google.com/open?id=12XrHstPa0LwxYa252WD2bJYwhSuLD4c6
It is the same as the -03 build with the 2 TX-related patches above applied. Hopefully this solves the resets and gives you the same performance as -03.
Can you give us more info on the hardware you used to reach the 800 MB/s disk copy: type of disks, number of nodes/OSDs, and your total cluster speed?
BonsaiJoe
53 Posts
July 18, 2018, 11:50 am
Thanks for the kernel. We have now updated our 3-node cluster to kernel -04.
Each node has:
20x HGST 1.8 TB SAS 10K
4x SSD 400 GB
Areca RAID card in JBOD mode with 8 GB write cache (this makes a big improvement)
128 GB RAM and 12 CPU cores (24 threads)
4x 10G Intel NIC
Cluster speed from the benchmark is 876 MB/s write and 2628 MB/s read.
When starting the first load test we got this kernel error on 2 nodes at the same time. Any idea why this happened?
node 2:
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761040] ------------[ cut here ]------------
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761050] WARNING: CPU: 12 PID: 967 at net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761056] Modules linked in: af_packet(N) target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) skx_edac(N) edac_core(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel(N) aes_x86_64(N) lrw(N) gf128mul(N) glue_helper(N) ablk_helper(N) cryptd(N) ipmi_ssif(N) ipmi_devintf(N) mei_me(N) ioatdma(N) lpc_ich(N) joydev(N) mei(N) mfd_core(N) sg(N) shpchp(N) dca(N) ipmi_si(N) ipmi_msghandler(N) acpi_cpufreq(N) acpi_power_meter(N) acpi_pad(N) processor(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) sd_mod(N) crc32c_intel(N) arcmsr(N) ast(N) i2c_algo_bit(N) ttm(N) ahci(N) drm_kms_helper(N) syscopyarea(N) libahci(N) sysfillrect(N) i40e(N) sysimgblt(N) fb_sys_fops(N) vxlan(N) xhci_pci(N) ip6_udp_tunnel(N) udp_tunnel(N) ptp(N) xhci_hcd(N) libata(N) pps_core(N) drm(N) usbcore(N) scsi_mod(N) usb_common(N) wmi(N) fjes(N) button(N)
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761169] Supported: No, Unsupported modules are loaded
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761174] CPU: 12 PID: 967 Comm: kworker/12:2 Tainted: G N 4.4.126-04-petasan #1
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761178] Hardware name: Supermicro Super Server/X11SPW-TF, BIOS 2.0b 02/26/2018
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761196] Workqueue: ceph-msgr ceph_con_workfn [libceph]
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761197] 0000000000000000 ffffffff81325be5 0000000000000000 ffffffff81a8598d
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761207] ffffffff8107f37d ffff881ff9957840 0000000000004726 00000000b83f2501
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761215] 000000000000000f 0000000000000006 ffffffff815660fd ffffffff8156b72b
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761223] Call Trace:
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761237] [<ffffffff81018aae>] dump_trace+0x5e/0x340
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761243] [<ffffffff81018e8c>] show_stack_log_lvl+0xfc/0x160
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761248] [<ffffffff81019bd1>] show_stack+0x21/0x40
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761255] [<ffffffff81325be5>] dump_stack+0x5c/0x77
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761261] [<ffffffff8107f37d>] warn_slowpath_common+0x7d/0xb0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761267] [<ffffffff815660fd>] tcp_cwnd_reduction+0xcd/0xe0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761272] [<ffffffff8156b72b>] tcp_fastretrans_alert+0x27b/0xab0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761279] [<ffffffff8156c45b>] tcp_ack+0x4fb/0x7f0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761289] [<ffffffff8156e01c>] tcp_rcv_established+0x1ac/0x730
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761296] [<ffffffff81577603>] tcp_v4_do_rcv+0x133/0x200
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761302] [<ffffffff81578f96>] tcp_v4_rcv+0x836/0x9b0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761308] [<ffffffff81554d31>] ip_local_deliver_finish+0x91/0x1d0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761313] [<ffffffff81554ffb>] ip_local_deliver+0x5b/0xc0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761317] [<ffffffff815552cd>] ip_rcv+0x26d/0x380
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761326] [<ffffffff81519060>] __netif_receive_skb_core+0x6f0/0xa30
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761332] [<ffffffff8151a43d>] process_backlog+0x9d/0x130
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761338] [<ffffffff81519bb2>] net_rx_action+0x202/0x340
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761345] [<ffffffff81083aa1>] __do_softirq+0x111/0x300
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761353] [<ffffffff8161813c>] do_softirq_own_stack+0x1c/0x30
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764894] DWARF2 unwinder stuck at do_softirq_own_stack+0x1c/0x30
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764895]
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764899] Leftover inexact backtrace:
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764899]
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764913] <IRQ> <EOI> [<ffffffff810834e3>] ? do_softirq.part.14+0x33/0x40
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764916] [<ffffffff81083568>] ? __local_bh_enable_ip+0x78/0x80
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764923] [<ffffffff815647a3>] ? tcp_sendpage+0x263/0x620
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764928] [<ffffffff8158e9e4>] ? inet_sendpage+0x74/0xd0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764935] [<ffffffff814fd6ba>] ? kernel_sendpage+0x1a/0x30
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764950] [<ffffffffa07e6e41>] ? ceph_tcp_sendpage+0x61/0xc0 [libceph]
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764959] [<ffffffffa07e8c37>] ? try_write+0x137/0xea0 [libceph]
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764967] [<ffffffffa07ea2ef>] ? ceph_con_workfn+0x7df/0x21c0 [libceph]
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764974] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764979] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764983] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764988] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764992] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764996] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765001] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765005] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765009] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765014] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765018] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765025] [<ffffffff81097077>] ? process_one_work+0x167/0x4b0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765030] [<ffffffff8109740a>] ? worker_thread+0x4a/0x4c0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765034] [<ffffffff810973c0>] ? process_one_work+0x4b0/0x4b0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765039] [<ffffffff8109d0f9>] ? kthread+0xc9/0xe0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765043] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765047] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765052] [<ffffffff81615905>] ? ret_from_fork+0x55/0x80
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765056] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765066] ---[ end trace cda239b3a423effb ]---
Jul 18 13:25:01 ps02-node02 CRON[26445]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
root@ps02-node02:~#
node 3:
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616245] ------------[ cut here ]------------
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616260] WARNING: CPU: 10 PID: 59 at net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616266] Modules linked in: af_packet(N) target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) skx_edac(N) edac_core(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel(N) aes_x86_64(N) lrw(N) gf128mul(N) glue_helper(N) ablk_helper(N) cryptd(N) ipmi_ssif(N) ipmi_devintf(N) mei_me(N) joydev(N) lpc_ich(N) sg(N) ioatdma(N) mfd_core(N) mei(N) shpchp(N) dca(N) ipmi_si(N) ipmi_msghandler(N) acpi_cpufreq(N) acpi_power_meter(N) acpi_pad(N) processor(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) sd_mod(N) crc32c_intel(N) ast(N) i2c_algo_bit(N) ttm(N) arcmsr(N) drm_kms_helper(N) i40e(N) syscopyarea(N) ahci(N) sysfillrect(N) libahci(N) sysimgblt(N) xhci_pci(N) fb_sys_fops(N) vxlan(N) ip6_udp_tunnel(N) xhci_hcd(N) udp_tunnel(N) ptp(N) libata(N) pps_core(N) drm(N) usbcore(N) scsi_mod(N) usb_common(N) wmi(N) fjes(N) button(N)
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616410] Supported: No, Unsupported modules are loaded
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616417] CPU: 10 PID: 59 Comm: kworker/10:0 Tainted: G N 4.4.126-04-petasan #1
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616421] Hardware name: Supermicro Super Server/X11SPW-TF, BIOS 2.0b 02/26/2018
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616447] Workqueue: ceph-msgr ceph_con_workfn [libceph]
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616451] 0000000000000000 ffffffff81325be5 0000000000000000 ffffffff81a8598d
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616459] ffffffff8107f37d ffff881d83ce1840 0000000000004526 000000001795bc01
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616468] 0000000000000013 0000000000000007 ffffffff815660fd ffffffff8156b72b
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616477] Call Trace:
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616498] [<ffffffff81018aae>] dump_trace+0x5e/0x340
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616511] [<ffffffff81018e8c>] show_stack_log_lvl+0xfc/0x160
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616520] [<ffffffff81019bd1>] show_stack+0x21/0x40
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616532] [<ffffffff81325be5>] dump_stack+0x5c/0x77
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616542] [<ffffffff8107f37d>] warn_slowpath_common+0x7d/0xb0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616552] [<ffffffff815660fd>] tcp_cwnd_reduction+0xcd/0xe0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616563] [<ffffffff8156b72b>] tcp_fastretrans_alert+0x27b/0xab0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616572] [<ffffffff8156c45b>] tcp_ack+0x4fb/0x7f0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616594] [<ffffffff8156e01c>] tcp_rcv_established+0x1ac/0x730
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616606] [<ffffffff81577603>] tcp_v4_do_rcv+0x133/0x200
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616617] [<ffffffff81578f96>] tcp_v4_rcv+0x836/0x9b0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616627] [<ffffffff81554d31>] ip_local_deliver_finish+0x91/0x1d0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616636] [<ffffffff81554ffb>] ip_local_deliver+0x5b/0xc0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616642] [<ffffffff815552cd>] ip_rcv+0x26d/0x380
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616657] [<ffffffff81519060>] __netif_receive_skb_core+0x6f0/0xa30
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616670] [<ffffffff8151a43d>] process_backlog+0x9d/0x130
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616682] [<ffffffff81519bb2>] net_rx_action+0x202/0x340
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616694] [<ffffffff81083aa1>] __do_softirq+0x111/0x300
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616708] [<ffffffff8161813c>] do_softirq_own_stack+0x1c/0x30
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625095] DWARF2 unwinder stuck at do_softirq_own_stack+0x1c/0x30
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625102]
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625104] Leftover inexact backtrace:
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625104]
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625122] <IRQ> <EOI> [<ffffffff810834e3>] ? do_softirq.part.14+0x33/0x40
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625131] [<ffffffff81083568>] ? __local_bh_enable_ip+0x78/0x80
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625141] [<ffffffff815647a3>] ? tcp_sendpage+0x263/0x620
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625149] [<ffffffff8158e9e4>] ? inet_sendpage+0x74/0xd0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625158] [<ffffffff814fd6ba>] ? kernel_sendpage+0x1a/0x30
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625179] [<ffffffffa07d3e41>] ? ceph_tcp_sendpage+0x61/0xc0 [libceph]
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625196] [<ffffffffa07d5c37>] ? try_write+0x137/0xea0 [libceph]
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625214] [<ffffffffa07d72ef>] ? ceph_con_workfn+0x7df/0x21c0 [libceph]
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625224] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625231] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625238] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625245] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625252] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625259] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625266] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625272] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625279] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625286] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625293] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625303] [<ffffffff81097077>] ? process_one_work+0x167/0x4b0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625310] [<ffffffff8109740a>] ? worker_thread+0x4a/0x4c0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625317] [<ffffffff810973c0>] ? process_one_work+0x4b0/0x4b0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625323] [<ffffffff8109d0f9>] ? kthread+0xc9/0xe0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625330] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625336] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625343] [<ffffffff81615905>] ? ret_from_fork+0x55/0x80
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625349] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625363] ---[ end trace 3186ab5a72fa342c ]---
Jul 18 13:25:01 ps02-node03 CRON[30086]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 18 13:29:44 ps02-node03 iscsi_service.py[3369]: 0 rbd image-00001 - /dev/rbd0
Jul 18 13:29:44 ps02-node03 iscsi_service.py[3369]: message repeated 135 times: [ 0 rbd image-00001 - /dev/rbd0]
Jul 18 13:35:01 ps02-node03 CRON[35939]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
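To compare traces like the two above without reading them line by line, the WARNING header of each trace can be pulled out and checked for the same source location. This is only a minimal sketch: the regular expression assumes the syslog and 4.4-era kernel WARNING layout seen in these pastes, and `warning_sites` is a hypothetical helper name, not part of any PetaSAN or Ceph tool.

```python
import re

# Assumed line layout (taken from the pastes in this thread):
# "<Mon> <day> <time> <host> kernel: [ts] WARNING: CPU: N PID: N at <file>:<line> <func>+0x.../0x...()"
WARNING_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:.*?"
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<func>\w+)\+"
)


def warning_sites(log_text):
    """Return one (host, source file, line, function) tuple per WARNING header."""
    sites = []
    for raw in log_text.splitlines():
        m = WARNING_RE.search(raw)
        if m:
            sites.append((m["host"], m["file"], int(m["line"]), m["func"]))
    return sites


# The two WARNING headers from the traces above, plus one unrelated CRON line.
sample = """\
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761050] WARNING: CPU: 12 PID: 967 at net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616260] WARNING: CPU: 10 PID: 59 at net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()
Jul 18 13:25:01 ps02-node03 CRON[30086]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
"""

sites = warning_sites(sample)
# True when every node hit the warning at the same file, line and function.
same_site = len({site[1:] for site in sites}) == 1
print(sites, same_site)
```

On these two headers both nodes report the same location, `net/ipv4/tcp_input.c:2481` in `tcp_cwnd_reduction`, one second apart.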
admin
2,930 Posts
July 18, 2018, 1:04 pm
The first thing is to revert to -03:
dpkg -r linux-image-4.4.126-03-petasan
dpkg -r linux-image-4.4.126-04-petasan
dpkg -i linux-image-4.4.126-03-petasan_amd64.deb
It does look like a tcp crash, most likely caused by applying the patches.
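Note that dpkg -i only installs the package; a node keeps running its current kernel until it is rebooted into the reinstalled one. A minimal sketch for confirming which build a node is actually running after the revert is shown here. The expected release string is taken from the "Tainted: ... 4.4.126-03-petasan" line in the -03 kernel log, and the helper name is_expected_kernel is made up for illustration.

```python
import platform

# Assumption: the -03 build reports this release string, as printed in the
# "CPU: ... Tainted: G N 4.4.126-03-petasan #1" line of the -03 kernel log.
EXPECTED_RELEASE = "4.4.126-03-petasan"


def is_expected_kernel(release, expected=EXPECTED_RELEASE):
    """Return True when the given kernel release string is the expected build."""
    return release.strip() == expected


if __name__ == "__main__":
    running = platform.release()  # on Linux this is the same value as `uname -r`
    if is_expected_kernel(running):
        print(f"OK: running {running}")
    else:
        print(f"Running {running}, expected {EXPECTED_RELEASE} (reboot pending?)")
```

Running this on each node after the reboot gives a quick check that no node is still on -04.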
Last edited on July 18, 2018, 1:04 pm by admin · #23
BonsaiJoe
53 Posts
July 18, 2018, 1:24 pm
It looks like this also happens with -03. We found the messages below in the kernel log on node 5, about 1 minute after yesterday's NIC reset happened on node 1 (this cluster is still on -03). So far no other node in the 5-node cluster has logged this message.
On the 3-node cluster it has not happened again within the last 2 hours. The strange thing is that Ceph reports everything as healthy, and there are no errors in the Ceph logs.
This is from the kernel.log on node5 of the 5-node cluster:
Jul 17 03:24:07 node05 kernel: [205376.722607] libceph: osd21 down
Jul 17 03:24:07 node05 kernel: [205376.722611] libceph: osd23 down
Jul 17 03:24:07 node05 kernel: [205376.722613] libceph: osd27 down
Jul 17 03:24:07 node05 kernel: [205376.722614] libceph: osd32 down
Jul 17 03:24:07 node05 kernel: [205376.722616] libceph: osd33 down
Jul 17 03:24:07 node05 kernel: [205376.722617] libceph: osd36 down
Jul 17 03:24:07 node05 kernel: [205376.722619] libceph: osd38 down
Jul 17 03:24:08 node05 kernel: [205377.951481] libceph: osd20 down
Jul 17 03:24:08 node05 kernel: [205377.951482] libceph: osd22 down
Jul 17 03:24:08 node05 kernel: [205377.951483] libceph: osd24 down
Jul 17 03:24:08 node05 kernel: [205377.951483] libceph: osd25 down
Jul 17 03:24:08 node05 kernel: [205377.951483] libceph: osd26 down
Jul 17 03:24:08 node05 kernel: [205377.951484] libceph: osd28 down
Jul 17 03:24:08 node05 kernel: [205377.951484] libceph: osd29 down
Jul 17 03:24:08 node05 kernel: [205377.951485] libceph: osd30 down
Jul 17 03:24:08 node05 kernel: [205377.951489] libceph: osd31 down
Jul 17 03:24:08 node05 kernel: [205377.951489] libceph: osd34 down
Jul 17 03:24:08 node05 kernel: [205377.951490] libceph: osd35 down
Jul 17 03:24:08 node05 kernel: [205377.951490] libceph: osd37 down
Jul 17 03:24:08 node05 kernel: [205377.951491] libceph: osd39 down
Jul 17 03:24:08 node05 kernel: [205377.951491] libceph: osd21 up
Jul 17 03:24:08 node05 kernel: [205377.951492] libceph: osd23 up
Jul 17 03:24:08 node05 kernel: [205377.951492] libceph: osd32 up
Jul 17 03:24:08 node05 kernel: [205377.951492] libceph: osd36 up
Jul 17 03:24:08 node05 kernel: [205377.973141] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 2315533312
Jul 17 03:24:08 node05 kernel: [205377.973151] ABORT_TASK: Found referenced iSCSI task_tag: 1090829824
Jul 17 03:24:08 node05 kernel: [205377.973157] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1090829824
Jul 17 03:24:08 node05 kernel: [205377.973162] ABORT_TASK: Found referenced iSCSI task_tag: 3439674880
Jul 17 03:24:08 node05 kernel: [205377.973165] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 3439674880
Jul 17 03:24:08 node05 kernel: [205377.973170] ABORT_TASK: Found referenced iSCSI task_tag: 1107580928
Jul 17 03:24:08 node05 kernel: [205377.973173] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1107580928
Jul 17 03:24:08 node05 kernel: [205377.973178] ABORT_TASK: Found referenced iSCSI task_tag: 2365955584
Jul 17 03:24:08 node05 kernel: [205377.973192] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 2365955584
Jul 17 03:24:08 node05 kernel: [205377.973197] ABORT_TASK: Found referenced iSCSI task_tag: 101033216
Jul 17 03:24:08 node05 kernel: [205377.974639] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 101033216
Jul 17 03:24:11 node05 kernel: [205381.407854] libceph: osd20 up
Jul 17 03:24:11 node05 kernel: [205381.407860] libceph: osd24 up
Jul 17 03:24:11 node05 kernel: [205381.407861] libceph: osd25 up
Jul 17 03:24:11 node05 kernel: [205381.407862] libceph: osd27 up
Jul 17 03:24:11 node05 kernel: [205381.408925] libceph: osd26 up
Jul 17 03:24:11 node05 kernel: [205381.408929] libceph: osd31 up
Jul 17 03:24:17 node05 kernel: [205387.305779] libceph: mon1 192.168.42.11:6789 session lost, hunting for new mon
Jul 17 03:24:25 node05 kernel: [205394.632835] libceph: mon2 192.168.42.13:6789 session established
Jul 17 03:24:25 node05 kernel: [205394.949068] ABORT_TASK: Found referenced iSCSI task_tag: 1275439616
Jul 17 03:24:26 node05 kernel: [205396.346171] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1275439616
Jul 17 03:24:27 node05 kernel: [205397.383376] libceph: osd33 up
Jul 17 03:24:33 node05 kernel: [205403.273717] iSCSI Login timeout on Network Portal 192.168.241.111:3260
Jul 17 03:24:55 node05 kernel: [205425.077825] ABORT_TASK: Found referenced iSCSI task_tag: 906252032
Jul 17 03:25:13 node05 kernel: [205443.006011] Unable to locate ITT: 0xbc049100 on CID: 1
Jul 17 03:25:13 node05 kernel: [205443.006012] Unable to locate RefTaskTag: 0xbc049100 on CID: 1.
Jul 17 03:25:15 node05 kernel: [205444.562587] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:15 node05 kernel: [205444.562608] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205444.562622] ABORT_TASK: Found referenced iSCSI task_tag: 2365857024
Jul 17 03:25:15 node05 kernel: [205444.562625] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 2365857024
Jul 17 03:25:15 node05 kernel: [205444.562630] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 4127547648
Jul 17 03:25:15 node05 kernel: [205444.562636] ABORT_TASK: Found referenced iSCSI task_tag: 285601024
Jul 17 03:25:15 node05 kernel: [205444.562638] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 285601024
Jul 17 03:25:15 node05 kernel: [205444.562644] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205444.562648] ABORT_TASK: Found referenced iSCSI task_tag: 3926185472
Jul 17 03:25:15 node05 kernel: [205444.562650] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 3926185472
Jul 17 03:25:15 node05 kernel: [205444.562655] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 3976468224
Jul 17 03:25:15 node05 kernel: [205444.562659] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258196
Jul 17 03:25:15 node05 kernel: [205444.562683] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205444.562688] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205444.562703] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205444.562713] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205444.562718] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 285502208
Jul 17 03:25:15 node05 kernel: [205444.562725] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205445.129618] iSCSI Login timeout on Network Portal 192.168.241.116:3260
Jul 17 03:25:23 node05 kernel: [205453.257320] libceph: osd20 down
Jul 17 03:25:23 node05 kernel: [205453.257323] libceph: osd21 down
Jul 17 03:25:23 node05 kernel: [205453.257324] libceph: osd23 down
Jul 17 03:25:23 node05 kernel: [205453.257325] libceph: osd24 down
Jul 17 03:25:23 node05 kernel: [205453.257326] libceph: osd25 down
Jul 17 03:25:23 node05 kernel: [205453.257327] libceph: osd26 down
Jul 17 03:25:23 node05 kernel: [205453.257328] libceph: osd27 down
Jul 17 03:25:23 node05 kernel: [205453.257329] libceph: osd31 down
Jul 17 03:25:23 node05 kernel: [205453.257329] libceph: osd32 down
Jul 17 03:25:23 node05 kernel: [205453.257330] libceph: osd36 down
Jul 17 03:25:24 node05 kernel: [205454.365408] libceph: osd20 up
Jul 17 03:25:25 node05 kernel: [205454.501450] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.503070] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.506432] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.508427] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.509152] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.509169] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.509582] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.509746] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.511204] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.512362] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.512565] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.521600] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.522417] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.524542] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.526303] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.526692] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.527982] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.531018] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.531284] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.535175] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.538703] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.538994] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.546172] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.547076] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.547315] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.549319] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.551513] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.552010] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.561022] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.563844] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.565809] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.566184] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.638090] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.640258] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.690356] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.690562] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.693983] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.698110] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205455.317039] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:38 node05 kernel: [205467.625963] ABORT_TASK: Found referenced iSCSI task_tag: 2198133504
Jul 17 03:25:38 node05 kernel: [205467.839916] libceph: osd22 up
Jul 17 03:25:38 node05 kernel: [205467.839921] libceph: osd28 up
Jul 17 03:25:38 node05 kernel: [205467.839922] libceph: osd29 up
Jul 17 03:25:38 node05 kernel: [205467.839923] libceph: osd30 up
Jul 17 03:25:38 node05 kernel: [205467.839924] libceph: osd34 up
Jul 17 03:25:38 node05 kernel: [205467.839925] libceph: osd35 up
Jul 17 03:25:38 node05 kernel: [205467.839926] libceph: osd37 up
Jul 17 03:25:38 node05 kernel: [205467.839928] libceph: osd38 up
Jul 17 03:25:38 node05 kernel: [205467.839935] libceph: osd39 up
Jul 17 03:25:38 node05 kernel: [205467.891932] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:38 node05 kernel: [205467.894517] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 2198133504
Jul 17 03:25:38 node05 kernel: [205467.894844] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:38 node05 kernel: [205467.996844] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 906252032
Jul 17 03:25:38 node05 kernel: [205467.996856] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 3204813312
Jul 17 03:25:38 node05 kernel: [205467.996860] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 184814848
Jul 17 03:25:38 node05 kernel: [205467.996865] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 302374912
Jul 17 03:25:38 node05 kernel: [205468.155352] libceph: osd20 down
Jul 17 03:25:38 node05 kernel: [205468.155355] libceph: osd21 up
Jul 17 03:25:38 node05 kernel: [205468.155356] libceph: osd23 up
Jul 17 03:25:38 node05 kernel: [205468.155356] libceph: osd24 up
Jul 17 03:25:38 node05 kernel: [205468.155357] libceph: osd25 up
Jul 17 03:25:38 node05 kernel: [205468.155358] libceph: osd26 up
Jul 17 03:25:38 node05 kernel: [205468.155358] libceph: osd27 up
Jul 17 03:25:38 node05 kernel: [205468.155359] libceph: osd31 up
Jul 17 03:25:38 node05 kernel: [205468.155363] libceph: osd32 up
Jul 17 03:25:38 node05 kernel: [205468.155375] libceph: osd36 up
Jul 17 03:25:38 node05 kernel: [205468.306612] ABORT_TASK: Found referenced iSCSI task_tag: 1577336576
Jul 17 03:25:38 node05 kernel: [205468.369450] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:39 node05 kernel: [205468.653810] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1577336576
Jul 17 03:25:39 node05 kernel: [205468.653827] ABORT_TASK: Found referenced iSCSI task_tag: 2969832192
Jul 17 03:25:39 node05 kernel: [205468.653832] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 2969832192
Jul 17 03:25:39 node05 kernel: [205468.778610] ABORT_TASK: Found referenced iSCSI task_tag: 956691968
Jul 17 03:25:39 node05 kernel: [205468.785838] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 956691968
Jul 17 03:25:39 node05 kernel: [205468.785973] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:39 node05 kernel: [205469.057341] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:39 node05 kernel: [205469.068293] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:40 node05 kernel: [205469.648594] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:40 node05 kernel: [205469.713882] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:40 node05 kernel: [205469.714041] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:40 node05 kernel: [205470.286678] libceph: osd20 up
Jul 17 03:25:41 node05 kernel: [205470.670242] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:43 node05 kernel: [205472.503129] ------------[ cut here ]------------
Jul 17 03:25:43 node05 kernel: [205472.503139] WARNING: CPU: 0 PID: 1350286 at net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()
Jul 17 03:25:43 node05 kernel: [205472.503144] Modules linked in: af_packet(N) target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) skx_edac(N) edac_core(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel(N) aes_x86_64(N) lrw(N) gf128mul(N) glue_helper(N) ablk_helper(N) cryptd(N) ipmi_ssif(N) ipmi_devintf(N) joydev(N) sg(N) lpc_ich(N) mfd_core(N) mei_me(N) shpchp(N) mei(N) ioatdma(N) dca(N) ipmi_si(N) ipmi_msghandler(N) acpi_cpufreq(N) acpi_power_meter(N) acpi_pad(N) processor(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) sd_mod(N) crc32c_intel(N) ast(N) i2c_algo_bit(N) arcmsr(N) ttm(N) ahci(N) libahci(N) drm_kms_helper(N) syscopyarea(N) sysfillrect(N) i40e(N) xhci_pci(N) sysimgblt(N) vxlan(N) libata(N) fb_sys_fops(N) xhci_hcd(N) ip6_udp_tunnel(N) udp_tunnel(N) ptp(N) pps_core(N) usbcore(N) drm(N) scsi_mod(N) usb_common(N) wmi(N) fjes(N) button(N)
Jul 17 03:25:43 node05 kernel: [205472.503229] Supported: No, Unsupported modules are loaded
Jul 17 03:25:43 node05 kernel: [205472.503234] CPU: 0 PID: 1350286 Comm: kworker/0:1 Tainted: G N 4.4.126-03-petasan #1
Jul 17 03:25:43 node05 kernel: [205472.503237] Hardware name: Supermicro Super Server/X11SPW-TF, BIOS 2.0b 02/26/2018
Jul 17 03:25:43 node05 kernel: [205472.503252] Workqueue: ceph-msgr ceph_con_workfn [libceph]
Jul 17 03:25:43 node05 kernel: [205472.503255] 0000000000000000 ffffffff81325be5 0000000000000000 ffffffff81a8594d
Jul 17 03:25:43 node05 kernel: [205472.503262] ffffffff8107f37d ffff8818835bf040 0000000000004526 0000000048323b01
Jul 17 03:25:43 node05 kernel: [205472.503268] 0000000000000017 0000000000000003 ffffffff815660fd ffffffff8156b72b
Jul 17 03:25:43 node05 kernel: [205472.503275] Call Trace:
Jul 17 03:25:43 node05 kernel: [205472.503291] [<ffffffff81018aae>] dump_trace+0x5e/0x340
Jul 17 03:25:43 node05 kernel: [205472.503301] [<ffffffff81018e8c>] show_stack_log_lvl+0xfc/0x160
Jul 17 03:25:43 node05 kernel: [205472.503306] [<ffffffff81019bd1>] show_stack+0x21/0x40
Jul 17 03:25:43 node05 kernel: [205472.503314] [<ffffffff81325be5>] dump_stack+0x5c/0x77
Jul 17 03:25:43 node05 kernel: [205472.503321] [<ffffffff8107f37d>] warn_slowpath_common+0x7d/0xb0
Jul 17 03:25:43 node05 kernel: [205472.503327] [<ffffffff815660fd>] tcp_cwnd_reduction+0xcd/0xe0
Jul 17 03:25:43 node05 kernel: [205472.503334] [<ffffffff8156b72b>] tcp_fastretrans_alert+0x27b/0xab0
Jul 17 03:25:43 node05 kernel: [205472.503340] [<ffffffff8156c45b>] tcp_ack+0x4fb/0x7f0
Jul 17 03:25:43 node05 kernel: [205472.503355] [<ffffffff8156e01c>] tcp_rcv_established+0x1ac/0x730
Jul 17 03:25:43 node05 kernel: [205472.503365] [<ffffffff81577603>] tcp_v4_do_rcv+0x133/0x200
Jul 17 03:25:43 node05 kernel: [205472.503371] [<ffffffff81578f96>] tcp_v4_rcv+0x836/0x9b0
Jul 17 03:25:43 node05 kernel: [205472.503379] [<ffffffff81554d31>] ip_local_deliver_finish+0x91/0x1d0
Jul 17 03:25:43 node05 kernel: [205472.503384] [<ffffffff81554ffb>] ip_local_deliver+0x5b/0xc0
Jul 17 03:25:43 node05 kernel: [205472.503390] [<ffffffff815552cd>] ip_rcv+0x26d/0x380
Jul 17 03:25:43 node05 kernel: [205472.503401] [<ffffffff81519060>] __netif_receive_skb_core+0x6f0/0xa30
Jul 17 03:25:43 node05 kernel: [205472.503410] [<ffffffff8151a43d>] process_backlog+0x9d/0x130
Jul 17 03:25:43 node05 kernel: [205472.503416] [<ffffffff81519bb2>] net_rx_action+0x202/0x340
Jul 17 03:25:43 node05 kernel: [205472.503426] [<ffffffff81083aa1>] __do_softirq+0x111/0x300
Jul 17 03:25:43 node05 kernel: [205472.503434] [<ffffffff8161813c>] do_softirq_own_stack+0x1c/0x30
Jul 17 03:25:43 node05 kernel: [205472.507010] DWARF2 unwinder stuck at do_softirq_own_stack+0x1c/0x30
Jul 17 03:25:43 node05 kernel: [205472.507013]
Jul 17 03:25:43 node05 kernel: [205472.507015] Leftover inexact backtrace:
Jul 17 03:25:43 node05 kernel: [205472.507015]
Jul 17 03:25:43 node05 kernel: [205472.507019] <IRQ> <EOI> [<ffffffff810834e3>] ? do_softirq.part.14+0x33/0x40
Jul 17 03:25:43 node05 kernel: [205472.507030] [<ffffffff81083568>] ? __local_bh_enable_ip+0x78/0x80
Jul 17 03:25:43 node05 kernel: [205472.507034] [<ffffffff81564c41>] ? tcp_sendmsg+0xe1/0xb50
Jul 17 03:25:43 node05 kernel: [205472.507037] [<ffffffff81572680>] ? tcp_tsq_handler.part.34+0x30/0x30
Jul 17 03:25:43 node05 kernel: [205472.507043] [<ffffffff814fdb46>] ? sock_sendmsg+0x36/0x40
Jul 17 03:25:43 node05 kernel: [205472.507056] [<ffffffffa0788d6b>] ? try_write+0x26b/0xea0 [libceph]
Jul 17 03:25:43 node05 kernel: [205472.507060] [<ffffffff8158e8a3>] ? inet_recvmsg+0x73/0x90
Jul 17 03:25:43 node05 kernel: [205472.507068] [<ffffffffa078a2ef>] ? ceph_con_workfn+0x7df/0x21c0 [libceph]
Jul 17 03:25:43 node05 kernel: [205472.507074] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507078] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507082] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507086] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507089] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507093] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507097] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507101] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507105] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507109] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507112] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507118] [<ffffffff81097077>] ? process_one_work+0x167/0x4b0
Jul 17 03:25:43 node05 kernel: [205472.507122] [<ffffffff8109740a>] ? worker_thread+0x4a/0x4c0
Jul 17 03:25:43 node05 kernel: [205472.507126] [<ffffffff810973c0>] ? process_one_work+0x4b0/0x4b0
Jul 17 03:25:43 node05 kernel: [205472.507129] [<ffffffff8109d0f9>] ? kthread+0xc9/0xe0
Jul 17 03:25:43 node05 kernel: [205472.507133] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507137] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 17 03:25:43 node05 kernel: [205472.507141] [<ffffffff81615905>] ? ret_from_fork+0x55/0x80
Jul 17 03:25:43 node05 kernel: [205472.507144] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 17 03:25:43 node05 kernel: [205472.507171] ---[ end trace 34e5436c0363311d ]---
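To compare traces like the one above across nodes and kernel builds, it can help to pull out just the WARNING site and the kernel release that logged it. The following is a rough sketch, not a general syslog parser: the regular expressions assume the exact line layout seen in these logs, and the name find_warnings is made up for illustration.

```python
import re

# Header line of a kernel warning, e.g.
# "... node05 kernel: [205472.503139] WARNING: CPU: 0 PID: 1350286 at
#  net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()"
WARNING_RE = re.compile(
    r"(?P<host>\S+) kernel: \[\s*(?P<uptime>\d+\.\d+)\] WARNING: .* at "
    r"(?P<site>\S+:\d+) (?P<func>\w+)\+"
)
# The "CPU: ... Tainted: ... <release> #1" line that follows the header.
RELEASE_RE = re.compile(
    r"(?P<host>\S+) kernel: \[\s*\d+\.\d+\] CPU: .* (?P<release>\S+) #\d+"
)


def find_warnings(lines):
    """Return one dict per WARNING header, with the kernel release if it follows."""
    warnings = []
    for line in lines:
        m = WARNING_RE.search(line)
        if m:
            warnings.append({**m.groupdict(), "release": None})
            continue
        m = RELEASE_RE.search(line)
        # Attach the release to the most recent warning from the same host.
        if (m and warnings and warnings[-1]["host"] == m["host"]
                and warnings[-1]["release"] is None):
            warnings[-1]["release"] = m["release"]
    return warnings


# Three lines copied from the node05 log above.
SAMPLE = [
    "Jul 17 03:25:43 node05 kernel: [205472.503139] WARNING: CPU: 0 PID: 1350286 "
    "at net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()",
    "Jul 17 03:25:43 node05 kernel: [205472.503229] Supported: No, Unsupported "
    "modules are loaded",
    "Jul 17 03:25:43 node05 kernel: [205472.503234] CPU: 0 PID: 1350286 Comm: "
    "kworker/0:1 Tainted: G N 4.4.126-03-petasan #1",
]

if __name__ == "__main__":
    for w in find_warnings(SAMPLE):
        print(w["host"], w["release"], w["site"], w["func"])
```

Feeding the kernel.log of every node through find_warnings shows at a glance whether the same tcp_input.c:2481 warning appears on both the -03 and the -04 builds.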
Jul 17 03:25:38 node05 kernel: [205467.891932] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:38 node05 kernel: [205467.894517] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 2198133504
Jul 17 03:25:38 node05 kernel: [205467.894844] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:38 node05 kernel: [205467.996844] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 906252032
Jul 17 03:25:38 node05 kernel: [205467.996856] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 3204813312
Jul 17 03:25:38 node05 kernel: [205467.996860] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 184814848
Jul 17 03:25:38 node05 kernel: [205467.996865] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 302374912
Jul 17 03:25:38 node05 kernel: [205468.155352] libceph: osd20 down
Jul 17 03:25:38 node05 kernel: [205468.155355] libceph: osd21 up
Jul 17 03:25:38 node05 kernel: [205468.155356] libceph: osd23 up
Jul 17 03:25:38 node05 kernel: [205468.155356] libceph: osd24 up
Jul 17 03:25:38 node05 kernel: [205468.155357] libceph: osd25 up
Jul 17 03:25:38 node05 kernel: [205468.155358] libceph: osd26 up
Jul 17 03:25:38 node05 kernel: [205468.155358] libceph: osd27 up
Jul 17 03:25:38 node05 kernel: [205468.155359] libceph: osd31 up
Jul 17 03:25:38 node05 kernel: [205468.155363] libceph: osd32 up
Jul 17 03:25:38 node05 kernel: [205468.155375] libceph: osd36 up
Jul 17 03:25:38 node05 kernel: [205468.306612] ABORT_TASK: Found referenced iSCSI task_tag: 1577336576
Jul 17 03:25:38 node05 kernel: [205468.369450] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:39 node05 kernel: [205468.653810] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1577336576
Jul 17 03:25:39 node05 kernel: [205468.653827] ABORT_TASK: Found referenced iSCSI task_tag: 2969832192
Jul 17 03:25:39 node05 kernel: [205468.653832] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 2969832192
Jul 17 03:25:39 node05 kernel: [205468.778610] ABORT_TASK: Found referenced iSCSI task_tag: 956691968
Jul 17 03:25:39 node05 kernel: [205468.785838] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 956691968
Jul 17 03:25:39 node05 kernel: [205468.785973] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:39 node05 kernel: [205469.057341] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:39 node05 kernel: [205469.068293] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:40 node05 kernel: [205469.648594] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:40 node05 kernel: [205469.713882] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:40 node05 kernel: [205469.714041] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:40 node05 kernel: [205470.286678] libceph: osd20 up
Jul 17 03:25:41 node05 kernel: [205470.670242] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:43 node05 kernel: [205472.503129] ------------[ cut here ]------------
Jul 17 03:25:43 node05 kernel: [205472.503139] WARNING: CPU: 0 PID: 1350286 at net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()
Jul 17 03:25:43 node05 kernel: [205472.503144] Modules linked in: af_packet(N) target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) skx_edac(N) edac_core(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel(N) aes_x86_64(N) lrw(N) gf128mul(N) glue_helper(N) ablk_helper(N) cryptd(N) ipmi_ssif(N) ipmi_devintf(N) joydev(N) sg(N) lpc_ich(N) mfd_core(N) mei_me(N) shpchp(N) mei(N) ioatdma(N) dca(N) ipmi_si(N) ipmi_msghandler(N) acpi_cpufreq(N) acpi_power_meter(N) acpi_pad(N) processor(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) sd_mod(N) crc32c_intel(N) ast(N) i2c_algo_bit(N) arcmsr(N) ttm(N) ahci(N) libahci(N) drm_kms_helper(N) syscopyarea(N) sysfillrect(N) i40e(N) xhci_pci(N) sysimgblt(N) vxlan(N) libata(N) fb_sys_fops(N) xhci_hcd(N) ip6_udp_tunnel(N) udp_tunnel(N) ptp(N) pps_core(N) usbcore(N) drm(N) scsi_mod(N) usb_common(N) wmi(N) fjes(N) button(N)
Jul 17 03:25:43 node05 kernel: [205472.503229] Supported: No, Unsupported modules are loaded
Jul 17 03:25:43 node05 kernel: [205472.503234] CPU: 0 PID: 1350286 Comm: kworker/0:1 Tainted: G N 4.4.126-03-petasan #1
Jul 17 03:25:43 node05 kernel: [205472.503237] Hardware name: Supermicro Super Server/X11SPW-TF, BIOS 2.0b 02/26/2018
Jul 17 03:25:43 node05 kernel: [205472.503252] Workqueue: ceph-msgr ceph_con_workfn [libceph]
Jul 17 03:25:43 node05 kernel: [205472.503255] 0000000000000000 ffffffff81325be5 0000000000000000 ffffffff81a8594d
Jul 17 03:25:43 node05 kernel: [205472.503262] ffffffff8107f37d ffff8818835bf040 0000000000004526 0000000048323b01
Jul 17 03:25:43 node05 kernel: [205472.503268] 0000000000000017 0000000000000003 ffffffff815660fd ffffffff8156b72b
Jul 17 03:25:43 node05 kernel: [205472.503275] Call Trace:
Jul 17 03:25:43 node05 kernel: [205472.503291] [<ffffffff81018aae>] dump_trace+0x5e/0x340
Jul 17 03:25:43 node05 kernel: [205472.503301] [<ffffffff81018e8c>] show_stack_log_lvl+0xfc/0x160
Jul 17 03:25:43 node05 kernel: [205472.503306] [<ffffffff81019bd1>] show_stack+0x21/0x40
Jul 17 03:25:43 node05 kernel: [205472.503314] [<ffffffff81325be5>] dump_stack+0x5c/0x77
Jul 17 03:25:43 node05 kernel: [205472.503321] [<ffffffff8107f37d>] warn_slowpath_common+0x7d/0xb0
Jul 17 03:25:43 node05 kernel: [205472.503327] [<ffffffff815660fd>] tcp_cwnd_reduction+0xcd/0xe0
Jul 17 03:25:43 node05 kernel: [205472.503334] [<ffffffff8156b72b>] tcp_fastretrans_alert+0x27b/0xab0
Jul 17 03:25:43 node05 kernel: [205472.503340] [<ffffffff8156c45b>] tcp_ack+0x4fb/0x7f0
Jul 17 03:25:43 node05 kernel: [205472.503355] [<ffffffff8156e01c>] tcp_rcv_established+0x1ac/0x730
Jul 17 03:25:43 node05 kernel: [205472.503365] [<ffffffff81577603>] tcp_v4_do_rcv+0x133/0x200
Jul 17 03:25:43 node05 kernel: [205472.503371] [<ffffffff81578f96>] tcp_v4_rcv+0x836/0x9b0
Jul 17 03:25:43 node05 kernel: [205472.503379] [<ffffffff81554d31>] ip_local_deliver_finish+0x91/0x1d0
Jul 17 03:25:43 node05 kernel: [205472.503384] [<ffffffff81554ffb>] ip_local_deliver+0x5b/0xc0
Jul 17 03:25:43 node05 kernel: [205472.503390] [<ffffffff815552cd>] ip_rcv+0x26d/0x380
Jul 17 03:25:43 node05 kernel: [205472.503401] [<ffffffff81519060>] __netif_receive_skb_core+0x6f0/0xa30
Jul 17 03:25:43 node05 kernel: [205472.503410] [<ffffffff8151a43d>] process_backlog+0x9d/0x130
Jul 17 03:25:43 node05 kernel: [205472.503416] [<ffffffff81519bb2>] net_rx_action+0x202/0x340
Jul 17 03:25:43 node05 kernel: [205472.503426] [<ffffffff81083aa1>] __do_softirq+0x111/0x300
Jul 17 03:25:43 node05 kernel: [205472.503434] [<ffffffff8161813c>] do_softirq_own_stack+0x1c/0x30
Jul 17 03:25:43 node05 kernel: [205472.507010] DWARF2 unwinder stuck at do_softirq_own_stack+0x1c/0x30
Jul 17 03:25:43 node05 kernel: [205472.507013]
Jul 17 03:25:43 node05 kernel: [205472.507015] Leftover inexact backtrace:
Jul 17 03:25:43 node05 kernel: [205472.507015]
Jul 17 03:25:43 node05 kernel: [205472.507019] <IRQ> <EOI> [<ffffffff810834e3>] ? do_softirq.part.14+0x33/0x40
Jul 17 03:25:43 node05 kernel: [205472.507030] [<ffffffff81083568>] ? __local_bh_enable_ip+0x78/0x80
Jul 17 03:25:43 node05 kernel: [205472.507034] [<ffffffff81564c41>] ? tcp_sendmsg+0xe1/0xb50
Jul 17 03:25:43 node05 kernel: [205472.507037] [<ffffffff81572680>] ? tcp_tsq_handler.part.34+0x30/0x30
Jul 17 03:25:43 node05 kernel: [205472.507043] [<ffffffff814fdb46>] ? sock_sendmsg+0x36/0x40
Jul 17 03:25:43 node05 kernel: [205472.507056] [<ffffffffa0788d6b>] ? try_write+0x26b/0xea0 [libceph]
Jul 17 03:25:43 node05 kernel: [205472.507060] [<ffffffff8158e8a3>] ? inet_recvmsg+0x73/0x90
Jul 17 03:25:43 node05 kernel: [205472.507068] [<ffffffffa078a2ef>] ? ceph_con_workfn+0x7df/0x21c0 [libceph]
Jul 17 03:25:43 node05 kernel: [205472.507074] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507078] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507082] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507086] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507089] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507093] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507097] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507101] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507105] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507109] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507112] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507118] [<ffffffff81097077>] ? process_one_work+0x167/0x4b0
Jul 17 03:25:43 node05 kernel: [205472.507122] [<ffffffff8109740a>] ? worker_thread+0x4a/0x4c0
Jul 17 03:25:43 node05 kernel: [205472.507126] [<ffffffff810973c0>] ? process_one_work+0x4b0/0x4b0
Jul 17 03:25:43 node05 kernel: [205472.507129] [<ffffffff8109d0f9>] ? kthread+0xc9/0xe0
Jul 17 03:25:43 node05 kernel: [205472.507133] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507137] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 17 03:25:43 node05 kernel: [205472.507141] [<ffffffff81615905>] ? ret_from_fork+0x55/0x80
Jul 17 03:25:43 node05 kernel: [205472.507144] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 17 03:25:43 node05 kernel: [205472.507171] ---[ end trace 34e5436c0363311d ]---
Last edited on July 18, 2018, 1:27 pm by BonsaiJoe · #24
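To see how often this warning fires on each node, the kernel log can be summarized per host. This is only a minimal sketch, assuming the messages sit in a syslog-format file like the excerpts above; the function name count_cwnd_warnings and the /var/log/kern.log path are illustrative assumptions, not something from this thread.

```shell
#!/bin/sh
# Count "tcp_cwnd_reduction" kernel WARNING lines per host in a syslog-style file.
# Assumes the layout seen in the logs above: month, day, time, host, "kernel:", ...
# so the hostname is the 4th whitespace-separated field.
count_cwnd_warnings() {
    awk '/WARNING:.*tcp_cwnd_reduction/ { n[$4]++ }
         END { for (h in n) print h, n[h] }' "$1" | sort
}

# Example call (the path is an assumption; adjust to where your kernel log lives):
# count_cwnd_warnings /var/log/kern.log
```

Only the WARNING header line of each trace is counted, so the call-trace frame that also mentions tcp_cwnd_reduction does not inflate the number.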
admin
2,930 Posts
July 18, 2018, 1:45 pm
The downside is that i40e driver support in the 4.4.x kernel lags behind mainline fixes. We will try to get the 4.12 kernel out soon. If there is any chance you could use a different NIC type, that may fix it. I will update you when 4.12 is ready.
BonsaiJoe
53 Posts
July 18, 2018, 2:12 pm
Thanks for the fast update.
We have done a second load test on the 3-node cluster with the -04 kernel.
This time we did not get the kernel message, and cluster speed is again much better than with the 4.4.92 kernel.
The test copied 3 VMs from PetaSAN to PetaSAN (same cluster) at the same time; speed was up to 800 MB/s (RW).
Do you think the kernel error comes from the NIC driver, or is there maybe some other problem with the 4.4.126 kernel?
Last edited on July 18, 2018, 2:30 pm by BonsaiJoe · #26
admin
2,930 Posts
July 18, 2018, 3:35 pm
The best approach for us is to get you a 4.12.x kernel soon, which has many i40e fixes; 4.4 is missing a lot of fixes for this driver. If you are willing to do one more test, we can send you another 4.4.126 test kernel with some more fixes that may or may not solve the issue. Let me know and we can have a shot at it, but we really do not want to spend too much time fixing 4.4 now.
Re the load test: what do you use to test?
Last edited on July 18, 2018, 3:39 pm by admin · #27
BonsaiJoe
53 Posts
July 18, 2018, 4:48 pm
Thanks, I think we will wait until you are ready with 4.12. Can you estimate how long this will take?
We have done just a basic test: copying 3 VMs in parallel (each ~80 GB) in VMware from PetaSAN to PetaSAN. Each job got approx. 270 MB/s (RW), so in total we got up to 800 MB/s.
This is close to the maximum of 876 MB/s write speed from the PetaSAN internal benchmark test.
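The arithmetic behind that comparison can be sketched as follows. The numbers are the ones quoted in the post (3 jobs, about 270 MB/s each, 876 MB/s benchmark write speed); the function names aggregate_mbps and percent_of_max are made up for illustration.

```shell
#!/bin/sh
# Aggregate throughput of N parallel jobs at a given per-job rate (MB/s).
aggregate_mbps() {
    echo $(( $1 * $2 ))
}

# Percentage of a benchmark maximum reached by a measured total, one decimal place.
percent_of_max() {
    awk -v total="$1" -v max="$2" 'BEGIN { printf "%.1f\n", total / max * 100 }'
}

total=$(aggregate_mbps 3 270)      # 3 VM copies at ~270 MB/s each
echo "aggregate: ${total} MB/s"
echo "share of 876 MB/s benchmark: $(percent_of_max "$total" 876)%"
```

Since the per-job figure is approximate, the result is only a rough indication that the copy test ran close to the benchmark write limit.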
admin
2,930 Posts
July 23, 2018, 11:49 am
Download the new kernel and firmware from:
https://drive.google.com/drive/folders/1kZYfW3MAz2fJKBIy57R4dF9h74SCMoNt?usp=sharing
Install:
dpkg -i petasan-firmware_20180416.deb
dpkg -i linux-image-4.12.14-02-petasan_amd64.deb
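After installing the packages and rebooting, it may be worth confirming that the node actually booted the new kernel. A minimal sketch, assuming the kernel release string mirrors the package name (4.12.14-02-petasan); that string and the helper name kernel_matches are assumptions, not confirmed in this thread.

```shell
#!/bin/sh
# Compare a kernel release string against the expected one.
# The second argument defaults to the running kernel reported by `uname -r`.
kernel_matches() {
    expected="$1"
    running="${2:-$(uname -r)}"
    [ "$running" = "$expected" ]
}

# Assumption: the release string mirrors the package name
# linux-image-4.12.14-02-petasan_amd64.deb.
if kernel_matches "4.12.14-02-petasan"; then
    echo "running the new kernel"
else
    echo "still on $(uname -r); reboot or check the boot entry"
fi
```

Running this on each node after the upgrade gives a quick check before starting another load test.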
BonsaiJoe
53 Posts
Quote from BonsaiJoe on July 18, 2018, 11:50 amthanks for the kernel we have now updated our 3 node cluster with kernel -4
each node is
20x HGST 1,8 TB SAS 10K
4x SSD 400 GB
Areca Raid card in Jbod mode with 8GB write cache (this makes a lot improvment)
128 GB Mem and 12 CPU (24threads)
4x 10G intel NICcluster speed from benchmark is 876 MB/s write and 2628 MB/s read
at starting the first load test we got this kernel error on 2 nodes at the same time any idea why this happed ?
node 2:
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761040] ------------[ cut here ]------------
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761050] WARNING: CPU: 12 PID: 967 at net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761056] Modules linked in: af_packet(N) target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) skx_edac(N) edac_core(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel(N) aes_x86_64(N) lrw(N) gf128mul(N) glue_helper(N) ablk_helper(N) cryptd(N) ipmi_ssif(N) ipmi_devintf(N) mei_me(N) ioatdma(N) lpc_ich(N) joydev(N) mei(N) mfd_core(N) sg(N) shpchp(N) dca(N) ipmi_si(N) ipmi_msghandler(N) acpi_cpufreq(N) acpi_power_meter(N) acpi_pad(N) processor(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) sd_mod(N) crc32c_intel(N) arcmsr(N) ast(N) i2c_algo_bit(N) ttm(N) ahci(N) drm_kms_helper(N) syscopyarea(N) libahci(N) sysfillrect(N) i40e(N) sysimgblt(N) fb_sys_fops(N) vxlan(N) xhci_pci(N) ip6_udp_tunnel(N) udp_tunnel(N) ptp(N) xhci_hcd(N) libata(N) pps_core(N) drm(N) usbcore(N) scsi_mod(N) usb_common(N) wmi(N) fjes(N) button(N)
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761169] Supported: No, Unsupported modules are loaded
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761174] CPU: 12 PID: 967 Comm: kworker/12:2 Tainted: G N 4.4.126-04-petasan #1
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761178] Hardware name: Supermicro Super Server/X11SPW-TF, BIOS 2.0b 02/26/2018
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761196] Workqueue: ceph-msgr ceph_con_workfn [libceph]
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761197] 0000000000000000 ffffffff81325be5 0000000000000000 ffffffff81a8598d
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761207] ffffffff8107f37d ffff881ff9957840 0000000000004726 00000000b83f2501
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761215] 000000000000000f 0000000000000006 ffffffff815660fd ffffffff8156b72b
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761223] Call Trace:
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761237] [<ffffffff81018aae>] dump_trace+0x5e/0x340
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761243] [<ffffffff81018e8c>] show_stack_log_lvl+0xfc/0x160
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761248] [<ffffffff81019bd1>] show_stack+0x21/0x40
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761255] [<ffffffff81325be5>] dump_stack+0x5c/0x77
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761261] [<ffffffff8107f37d>] warn_slowpath_common+0x7d/0xb0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761267] [<ffffffff815660fd>] tcp_cwnd_reduction+0xcd/0xe0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761272] [<ffffffff8156b72b>] tcp_fastretrans_alert+0x27b/0xab0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761279] [<ffffffff8156c45b>] tcp_ack+0x4fb/0x7f0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761289] [<ffffffff8156e01c>] tcp_rcv_established+0x1ac/0x730
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761296] [<ffffffff81577603>] tcp_v4_do_rcv+0x133/0x200
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761302] [<ffffffff81578f96>] tcp_v4_rcv+0x836/0x9b0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761308] [<ffffffff81554d31>] ip_local_deliver_finish+0x91/0x1d0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761313] [<ffffffff81554ffb>] ip_local_deliver+0x5b/0xc0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761317] [<ffffffff815552cd>] ip_rcv+0x26d/0x380
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761326] [<ffffffff81519060>] __netif_receive_skb_core+0x6f0/0xa30
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761332] [<ffffffff8151a43d>] process_backlog+0x9d/0x130
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761338] [<ffffffff81519bb2>] net_rx_action+0x202/0x340
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761345] [<ffffffff81083aa1>] __do_softirq+0x111/0x300
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761353] [<ffffffff8161813c>] do_softirq_own_stack+0x1c/0x30
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764894] DWARF2 unwinder stuck at do_softirq_own_stack+0x1c/0x30
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764895]
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764899] Leftover inexact backtrace:
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764899]
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764913] <IRQ> <EOI> [<ffffffff810834e3>] ? do_softirq.part.14+0x33/0x40
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764916] [<ffffffff81083568>] ? __local_bh_enable_ip+0x78/0x80
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764923] [<ffffffff815647a3>] ? tcp_sendpage+0x263/0x620
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764928] [<ffffffff8158e9e4>] ? inet_sendpage+0x74/0xd0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764935] [<ffffffff814fd6ba>] ? kernel_sendpage+0x1a/0x30
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764950] [<ffffffffa07e6e41>] ? ceph_tcp_sendpage+0x61/0xc0 [libceph]
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764959] [<ffffffffa07e8c37>] ? try_write+0x137/0xea0 [libceph]
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764967] [<ffffffffa07ea2ef>] ? ceph_con_workfn+0x7df/0x21c0 [libceph]
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764974] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764979] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764983] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764988] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764992] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.764996] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765001] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765005] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765009] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765014] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765018] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765025] [<ffffffff81097077>] ? process_one_work+0x167/0x4b0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765030] [<ffffffff8109740a>] ? worker_thread+0x4a/0x4c0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765034] [<ffffffff810973c0>] ? process_one_work+0x4b0/0x4b0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765039] [<ffffffff8109d0f9>] ? kthread+0xc9/0xe0
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765043] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765047] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765052] [<ffffffff81615905>] ? ret_from_fork+0x55/0x80
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765056] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 18 13:24:09 ps02-node02 kernel: [ 1260.765066] ---[ end trace cda239b3a423effb ]---
Jul 18 13:25:01 ps02-node02 CRON[26445]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
root@ps02-node02:~#
node 3:
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616245] ------------[ cut here ]------------
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616260] WARNING: CPU: 10 PID: 59 at net/ipv4/tcp_input.c:2481 tcp_cwnd_redu ction+0xcd/0xe0()
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616266] Modules linked in: af_packet(N) target_core_user(N) uio(N) target_c ore_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N ) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) skx_edac(N) edac_core(N) x86_pkg_temp_th ermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_cl mulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel(N) aes_x86_64(N) lrw(N) gf128mul(N) glue_helper(N) ablk_helper(N) cry ptd(N) ipmi_ssif(N) ipmi_devintf(N) mei_me(N) joydev(N) lpc_ich(N) sg(N) ioatdma(N) mfd_core(N) mei(N) shpchp(N) dca(N ) ipmi_si(N) ipmi_msghandler(N) acpi_cpufreq(N) acpi_power_meter(N) acpi_pad(N) processor(N) autofs4(N) ext4(N) crc16( N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) sd_mod(N) crc32c_intel(N) ast(N) i2c_algo_bit(N) ttm(N) arcmsr(N) drm_k ms_helper(N) i40e(N) syscopyarea(N) ahci(N) sysfillrect(N) libahci(N) sysimgblt(N) xhci_pci(N) fb_sys_fops(N) vxlan(N) ip6_udp_tunnel(N) xhci_hcd(N) udp_tunnel(N) ptp(N) libata(N) pps_core(N) drm(N) usbcore(N) scsi_mod(N) usb_common(N) wmi(N) fjes(N) button(N)
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616410] Supported: No, Unsupported modules are loaded
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616417] CPU: 10 PID: 59 Comm: kworker/10:0 Tainted: G N 4.4 .126-04-petasan #1
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616421] Hardware name: Supermicro Super Server/X11SPW-TF, BIOS 2.0b 02/26/2 018
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616447] Workqueue: ceph-msgr ceph_con_workfn [libceph]
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616451] 0000000000000000 ffffffff81325be5 0000000000000000 ffffffff81a8598 d
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616459] ffffffff8107f37d ffff881d83ce1840 0000000000004526 000000001795bc0 1
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616468] 0000000000000013 0000000000000007 ffffffff815660fd ffffffff8156b72 b
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616477] Call Trace:
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616498] [<ffffffff81018aae>] dump_trace+0x5e/0x340
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616511] [<ffffffff81018e8c>] show_stack_log_lvl+0xfc/0x160
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616520] [<ffffffff81019bd1>] show_stack+0x21/0x40
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616532] [<ffffffff81325be5>] dump_stack+0x5c/0x77
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616542] [<ffffffff8107f37d>] warn_slowpath_common+0x7d/0xb0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616552] [<ffffffff815660fd>] tcp_cwnd_reduction+0xcd/0xe0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616563] [<ffffffff8156b72b>] tcp_fastretrans_alert+0x27b/0xab0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616572] [<ffffffff8156c45b>] tcp_ack+0x4fb/0x7f0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616594] [<ffffffff8156e01c>] tcp_rcv_established+0x1ac/0x730
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616606] [<ffffffff81577603>] tcp_v4_do_rcv+0x133/0x200
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616617] [<ffffffff81578f96>] tcp_v4_rcv+0x836/0x9b0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616627] [<ffffffff81554d31>] ip_local_deliver_finish+0x91/0x1d0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616636] [<ffffffff81554ffb>] ip_local_deliver+0x5b/0xc0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616642] [<ffffffff815552cd>] ip_rcv+0x26d/0x380
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616657] [<ffffffff81519060>] __netif_receive_skb_core+0x6f0/0xa30
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616670] [<ffffffff8151a43d>] process_backlog+0x9d/0x130
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616682] [<ffffffff81519bb2>] net_rx_action+0x202/0x340
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616694] [<ffffffff81083aa1>] __do_softirq+0x111/0x300
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.616708] [<ffffffff8161813c>] do_softirq_own_stack+0x1c/0x30
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625095] DWARF2 unwinder stuck at do_softirq_own_stack+0x1c/0x30
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625102]
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625104] Leftover inexact backtrace:
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625104]
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625122] <IRQ> <EOI> [<ffffffff810834e3>] ? do_softirq.part.14+0x33/0x40
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625131] [<ffffffff81083568>] ? __local_bh_enable_ip+0x78/0x80
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625141] [<ffffffff815647a3>] ? tcp_sendpage+0x263/0x620
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625149] [<ffffffff8158e9e4>] ? inet_sendpage+0x74/0xd0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625158] [<ffffffff814fd6ba>] ? kernel_sendpage+0x1a/0x30
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625179] [<ffffffffa07d3e41>] ? ceph_tcp_sendpage+0x61/0xc0 [libceph]
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625196] [<ffffffffa07d5c37>] ? try_write+0x137/0xea0 [libceph]
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625214] [<ffffffffa07d72ef>] ? ceph_con_workfn+0x7df/0x21c0 [libceph]
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625224] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625231] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625238] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625245] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625252] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625259] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625266] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625272] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625279] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625286] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625293] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625303] [<ffffffff81097077>] ? process_one_work+0x167/0x4b0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625310] [<ffffffff8109740a>] ? worker_thread+0x4a/0x4c0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625317] [<ffffffff810973c0>] ? process_one_work+0x4b0/0x4b0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625323] [<ffffffff8109d0f9>] ? kthread+0xc9/0xe0
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625330] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625336] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625343] [<ffffffff81615905>] ? ret_from_fork+0x55/0x80
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625349] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 18 13:24:10 ps02-node03 kernel: [ 1568.625363] ---[ end trace 3186ab5a72fa342c ]---
Jul 18 13:25:01 ps02-node03 CRON[30086]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 18 13:29:44 ps02-node03 iscsi_service.py[3369]: 0 rbd image-00001 - /dev/rbd0
Jul 18 13:29:44 ps02-node03 iscsi_service.py[3369]: message repeated 135 times: [ 0 rbd image-00001 - /dev/rbd0]
Jul 18 13:35:01 ps02-node03 CRON[35939]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
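To check whether other nodes logged the same warning, the kernel log can be scanned for `WARNING:` headers. The following is a minimal Python sketch and not part of PetaSAN: the function name `find_kernel_warnings` is hypothetical, and the regular expression assumes the syslog line layout seen in the traces above.

```python
import re
from collections import Counter

# Matches the syslog-style header line of a kernel WARNING, for example:
# "Jul 18 13:24:09 ps02-node02 kernel: [ 1260.761050] WARNING: CPU: 12 PID: 967
#  at net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()"
# The layout is an assumption taken from the pasted logs.
WARNING_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:\s+\[\s*[\d.]+\]\s+"
    r"WARNING: CPU: \d+ PID: \d+ at (?P<location>\S+) (?P<function>[\w.]+)\+"
)


def find_kernel_warnings(lines):
    """Count WARNING headers per (host, source location, function)."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.match(line)
        if match:
            key = (match["host"], match["location"], match["function"])
            counts[key] += 1
    return counts
```

Passing an open log file (or the lines of several nodes' logs concatenated) to `find_kernel_warnings` gives one counter entry per host and warning site, which makes it easy to see whether every hit points at the same source location.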
admin
2,930 Posts
Quote from admin on July 18, 2018, 1:04 pm
The first thing is to revert to -03:
dpkg -r linux-image-4.4.126-03-petasan
dpkg -r linux-image-4.4.126-04-petasan
dpkg -i linux-image-4.4.126-03-petasan_amd64.deb
It does look like a TCP crash, most likely caused by applying the patches.
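After the revert, a node only runs the -03 kernel once it has been rebooted into it, so it can be worth confirming the running release. A minimal Python sketch, assuming the -03 build reports the release string `4.4.126-03-petasan` (inferred from the `4.4.126-04-petasan` string in the traces); the helper name `running_expected_kernel` is hypothetical.

```python
import platform

# Assumed release string for the -03 build, following the naming in the traces.
EXPECTED_RELEASE = "4.4.126-03-petasan"


def running_expected_kernel(release, expected=EXPECTED_RELEASE):
    """Return True if the given kernel release string is the expected build."""
    return release.strip() == expected


# On Linux, platform.release() reports the same string as `uname -r`.
current = platform.release()
print(f"running kernel: {current}, is -03 build: {running_expected_kernel(current)}")
```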
BonsaiJoe
53 Posts
Quote from BonsaiJoe on July 18, 2018, 1:24 pm
It looks like this is also in -03. We found the messages below, 1 minute after yesterday's NIC reset happened on node 1 (this cluster is still on -03), in the kernel log on node 5. So far no other node in the 5-node cluster has this message.
On the 3-node cluster this has not happened again within the last 2 hours. The strange thing is that Ceph reports everything as healthy, and there are no errors in the Ceph logs.
This is from kernel.log on node5 of the 5-node cluster:
Jul 17 03:24:07 node05 kernel: [205376.722607] libceph: osd21 down
Jul 17 03:24:07 node05 kernel: [205376.722611] libceph: osd23 down
Jul 17 03:24:07 node05 kernel: [205376.722613] libceph: osd27 down
Jul 17 03:24:07 node05 kernel: [205376.722614] libceph: osd32 down
Jul 17 03:24:07 node05 kernel: [205376.722616] libceph: osd33 down
Jul 17 03:24:07 node05 kernel: [205376.722617] libceph: osd36 down
Jul 17 03:24:07 node05 kernel: [205376.722619] libceph: osd38 down
Jul 17 03:24:08 node05 kernel: [205377.951481] libceph: osd20 down
Jul 17 03:24:08 node05 kernel: [205377.951482] libceph: osd22 down
Jul 17 03:24:08 node05 kernel: [205377.951483] libceph: osd24 down
Jul 17 03:24:08 node05 kernel: [205377.951483] libceph: osd25 down
Jul 17 03:24:08 node05 kernel: [205377.951483] libceph: osd26 down
Jul 17 03:24:08 node05 kernel: [205377.951484] libceph: osd28 down
Jul 17 03:24:08 node05 kernel: [205377.951484] libceph: osd29 down
Jul 17 03:24:08 node05 kernel: [205377.951485] libceph: osd30 down
Jul 17 03:24:08 node05 kernel: [205377.951489] libceph: osd31 down
Jul 17 03:24:08 node05 kernel: [205377.951489] libceph: osd34 down
Jul 17 03:24:08 node05 kernel: [205377.951490] libceph: osd35 down
Jul 17 03:24:08 node05 kernel: [205377.951490] libceph: osd37 down
Jul 17 03:24:08 node05 kernel: [205377.951491] libceph: osd39 down
Jul 17 03:24:08 node05 kernel: [205377.951491] libceph: osd21 up
Jul 17 03:24:08 node05 kernel: [205377.951492] libceph: osd23 up
Jul 17 03:24:08 node05 kernel: [205377.951492] libceph: osd32 up
Jul 17 03:24:08 node05 kernel: [205377.951492] libceph: osd36 up
Jul 17 03:24:08 node05 kernel: [205377.973141] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 2315533312
Jul 17 03:24:08 node05 kernel: [205377.973151] ABORT_TASK: Found referenced iSCSI task_tag: 1090829824
Jul 17 03:24:08 node05 kernel: [205377.973157] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1090829824
Jul 17 03:24:08 node05 kernel: [205377.973162] ABORT_TASK: Found referenced iSCSI task_tag: 3439674880
Jul 17 03:24:08 node05 kernel: [205377.973165] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 3439674880
Jul 17 03:24:08 node05 kernel: [205377.973170] ABORT_TASK: Found referenced iSCSI task_tag: 1107580928
Jul 17 03:24:08 node05 kernel: [205377.973173] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1107580928
Jul 17 03:24:08 node05 kernel: [205377.973178] ABORT_TASK: Found referenced iSCSI task_tag: 2365955584
Jul 17 03:24:08 node05 kernel: [205377.973192] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 2365955584
Jul 17 03:24:08 node05 kernel: [205377.973197] ABORT_TASK: Found referenced iSCSI task_tag: 101033216
Jul 17 03:24:08 node05 kernel: [205377.974639] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 101033216
Jul 17 03:24:11 node05 kernel: [205381.407854] libceph: osd20 up
Jul 17 03:24:11 node05 kernel: [205381.407860] libceph: osd24 up
Jul 17 03:24:11 node05 kernel: [205381.407861] libceph: osd25 up
Jul 17 03:24:11 node05 kernel: [205381.407862] libceph: osd27 up
Jul 17 03:24:11 node05 kernel: [205381.408925] libceph: osd26 up
Jul 17 03:24:11 node05 kernel: [205381.408929] libceph: osd31 up
Jul 17 03:24:17 node05 kernel: [205387.305779] libceph: mon1 192.168.42.11:6789 session lost, hunting for new mon
Jul 17 03:24:25 node05 kernel: [205394.632835] libceph: mon2 192.168.42.13:6789 session established
Jul 17 03:24:25 node05 kernel: [205394.949068] ABORT_TASK: Found referenced iSCSI task_tag: 1275439616
Jul 17 03:24:26 node05 kernel: [205396.346171] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1275439616
Jul 17 03:24:27 node05 kernel: [205397.383376] libceph: osd33 up
Jul 17 03:24:33 node05 kernel: [205403.273717] iSCSI Login timeout on Network Portal 192.168.241.111:3260
Jul 17 03:24:55 node05 kernel: [205425.077825] ABORT_TASK: Found referenced iSCSI task_tag: 906252032
Jul 17 03:25:13 node05 kernel: [205443.006011] Unable to locate ITT: 0xbc049100 on CID: 1
Jul 17 03:25:13 node05 kernel: [205443.006012] Unable to locate RefTaskTag: 0xbc049100 on CID: 1.
Jul 17 03:25:15 node05 kernel: [205444.562587] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:15 node05 kernel: [205444.562608] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205444.562622] ABORT_TASK: Found referenced iSCSI task_tag: 2365857024
Jul 17 03:25:15 node05 kernel: [205444.562625] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 2365857024
Jul 17 03:25:15 node05 kernel: [205444.562630] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 4127547648
Jul 17 03:25:15 node05 kernel: [205444.562636] ABORT_TASK: Found referenced iSCSI task_tag: 285601024
Jul 17 03:25:15 node05 kernel: [205444.562638] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 285601024
Jul 17 03:25:15 node05 kernel: [205444.562644] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205444.562648] ABORT_TASK: Found referenced iSCSI task_tag: 3926185472
Jul 17 03:25:15 node05 kernel: [205444.562650] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 3926185472
Jul 17 03:25:15 node05 kernel: [205444.562655] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 3976468224
Jul 17 03:25:15 node05 kernel: [205444.562659] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258196
Jul 17 03:25:15 node05 kernel: [205444.562683] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205444.562688] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205444.562703] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205444.562713] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205444.562718] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 285502208
Jul 17 03:25:15 node05 kernel: [205444.562725] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78258191
Jul 17 03:25:15 node05 kernel: [205445.129618] iSCSI Login timeout on Network Portal 192.168.241.116:3260
Jul 17 03:25:23 node05 kernel: [205453.257320] libceph: osd20 down
Jul 17 03:25:23 node05 kernel: [205453.257323] libceph: osd21 down
Jul 17 03:25:23 node05 kernel: [205453.257324] libceph: osd23 down
Jul 17 03:25:23 node05 kernel: [205453.257325] libceph: osd24 down
Jul 17 03:25:23 node05 kernel: [205453.257326] libceph: osd25 down
Jul 17 03:25:23 node05 kernel: [205453.257327] libceph: osd26 down
Jul 17 03:25:23 node05 kernel: [205453.257328] libceph: osd27 down
Jul 17 03:25:23 node05 kernel: [205453.257329] libceph: osd31 down
Jul 17 03:25:23 node05 kernel: [205453.257329] libceph: osd32 down
Jul 17 03:25:23 node05 kernel: [205453.257330] libceph: osd36 down
Jul 17 03:25:24 node05 kernel: [205454.365408] libceph: osd20 up
Jul 17 03:25:25 node05 kernel: [205454.501450] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.503070] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.506432] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.508427] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.509152] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.509169] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.509582] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.509746] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.511204] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.512362] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.512565] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.521600] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.522417] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.524542] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.526303] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.526692] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.527982] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.531018] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.531284] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.535175] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.538703] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.538994] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.546172] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.547076] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.547315] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.549319] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.551513] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.552010] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.561022] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.563844] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.565809] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.566184] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.638090] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.640258] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.690356] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.690562] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.693983] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205454.698110] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:25 node05 kernel: [205455.317039] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:38 node05 kernel: [205467.625963] ABORT_TASK: Found referenced iSCSI task_tag: 2198133504
Jul 17 03:25:38 node05 kernel: [205467.839916] libceph: osd22 up
Jul 17 03:25:38 node05 kernel: [205467.839921] libceph: osd28 up
Jul 17 03:25:38 node05 kernel: [205467.839922] libceph: osd29 up
Jul 17 03:25:38 node05 kernel: [205467.839923] libceph: osd30 up
Jul 17 03:25:38 node05 kernel: [205467.839924] libceph: osd34 up
Jul 17 03:25:38 node05 kernel: [205467.839925] libceph: osd35 up
Jul 17 03:25:38 node05 kernel: [205467.839926] libceph: osd37 up
Jul 17 03:25:38 node05 kernel: [205467.839928] libceph: osd38 up
Jul 17 03:25:38 node05 kernel: [205467.839935] libceph: osd39 up
Jul 17 03:25:38 node05 kernel: [205467.891932] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:38 node05 kernel: [205467.894517] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 2198133504
Jul 17 03:25:38 node05 kernel: [205467.894844] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:38 node05 kernel: [205467.996844] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 906252032
Jul 17 03:25:38 node05 kernel: [205467.996856] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 3204813312
Jul 17 03:25:38 node05 kernel: [205467.996860] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 184814848
Jul 17 03:25:38 node05 kernel: [205467.996865] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 302374912
Jul 17 03:25:38 node05 kernel: [205468.155352] libceph: osd20 down
Jul 17 03:25:38 node05 kernel: [205468.155355] libceph: osd21 up
Jul 17 03:25:38 node05 kernel: [205468.155356] libceph: osd23 up
Jul 17 03:25:38 node05 kernel: [205468.155356] libceph: osd24 up
Jul 17 03:25:38 node05 kernel: [205468.155357] libceph: osd25 up
Jul 17 03:25:38 node05 kernel: [205468.155358] libceph: osd26 up
Jul 17 03:25:38 node05 kernel: [205468.155358] libceph: osd27 up
Jul 17 03:25:38 node05 kernel: [205468.155359] libceph: osd31 up
Jul 17 03:25:38 node05 kernel: [205468.155363] libceph: osd32 up
Jul 17 03:25:38 node05 kernel: [205468.155375] libceph: osd36 up
Jul 17 03:25:38 node05 kernel: [205468.306612] ABORT_TASK: Found referenced iSCSI task_tag: 1577336576
Jul 17 03:25:38 node05 kernel: [205468.369450] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:39 node05 kernel: [205468.653810] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1577336576
Jul 17 03:25:39 node05 kernel: [205468.653827] ABORT_TASK: Found referenced iSCSI task_tag: 2969832192
Jul 17 03:25:39 node05 kernel: [205468.653832] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 2969832192
Jul 17 03:25:39 node05 kernel: [205468.778610] ABORT_TASK: Found referenced iSCSI task_tag: 956691968
Jul 17 03:25:39 node05 kernel: [205468.785838] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 956691968
Jul 17 03:25:39 node05 kernel: [205468.785973] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:39 node05 kernel: [205469.057341] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:39 node05 kernel: [205469.068293] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:40 node05 kernel: [205469.648594] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:40 node05 kernel: [205469.713882] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:40 node05 kernel: [205469.714041] TARGET_CORE[iSCSI]: Expected Transfer Length: 2048 does not match SCSI CDB Length: 16 for SAM Opcode: 0xa0
Jul 17 03:25:40 node05 kernel: [205470.286678] libceph: osd20 up
Jul 17 03:25:41 node05 kernel: [205470.670242] COMPARE_AND_WRITE: miscompare at offset 0
Jul 17 03:25:43 node05 kernel: [205472.503129] ------------[ cut here ]------------
Jul 17 03:25:43 node05 kernel: [205472.503139] WARNING: CPU: 0 PID: 1350286 at net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()
Jul 17 03:25:43 node05 kernel: [205472.503144] Modules linked in: af_packet(N) target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) skx_edac(N) edac_core(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) drbg(N) ansi_cprng(N) aesni_intel(N) aes_x86_64(N) lrw(N) gf128mul(N) glue_helper(N) ablk_helper(N) cryptd(N) ipmi_ssif(N) ipmi_devintf(N) joydev(N) sg(N) lpc_ich(N) mfd_core(N) mei_me(N) shpchp(N) mei(N) ioatdma(N) dca(N) ipmi_si(N) ipmi_msghandler(N) acpi_cpufreq(N) acpi_power_meter(N) acpi_pad(N) processor(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) sd_mod(N) crc32c_intel(N) ast(N) i2c_algo_bit(N) arcmsr(N) ttm(N) ahci(N) libahci(N) drm_kms_helper(N) syscopyarea(N) sysfillrect(N) i40e(N) xhci_pci(N) sysimgblt(N) vxlan(N) libata(N) fb_sys_fops(N) xhci_hcd(N) ip6_udp_tunnel(N) udp_tunnel(N) ptp(N) pps_core(N) usbcore(N) drm(N) scsi_mod(N) usb_common(N) wmi(N) fjes(N) button(N)
Jul 17 03:25:43 node05 kernel: [205472.503229] Supported: No, Unsupported modules are loaded
Jul 17 03:25:43 node05 kernel: [205472.503234] CPU: 0 PID: 1350286 Comm: kworker/0:1 Tainted: G N 4.4.126-03-petasan #1
Jul 17 03:25:43 node05 kernel: [205472.503237] Hardware name: Supermicro Super Server/X11SPW-TF, BIOS 2.0b 02/26/2018
Jul 17 03:25:43 node05 kernel: [205472.503252] Workqueue: ceph-msgr ceph_con_workfn [libceph]
Jul 17 03:25:43 node05 kernel: [205472.503255] 0000000000000000 ffffffff81325be5 0000000000000000 ffffffff81a8594d
Jul 17 03:25:43 node05 kernel: [205472.503262] ffffffff8107f37d ffff8818835bf040 0000000000004526 0000000048323b01
Jul 17 03:25:43 node05 kernel: [205472.503268] 0000000000000017 0000000000000003 ffffffff815660fd ffffffff8156b72b
Jul 17 03:25:43 node05 kernel: [205472.503275] Call Trace:
Jul 17 03:25:43 node05 kernel: [205472.503291] [<ffffffff81018aae>] dump_trace+0x5e/0x340
Jul 17 03:25:43 node05 kernel: [205472.503301] [<ffffffff81018e8c>] show_stack_log_lvl+0xfc/0x160
Jul 17 03:25:43 node05 kernel: [205472.503306] [<ffffffff81019bd1>] show_stack+0x21/0x40
Jul 17 03:25:43 node05 kernel: [205472.503314] [<ffffffff81325be5>] dump_stack+0x5c/0x77
Jul 17 03:25:43 node05 kernel: [205472.503321] [<ffffffff8107f37d>] warn_slowpath_common+0x7d/0xb0
Jul 17 03:25:43 node05 kernel: [205472.503327] [<ffffffff815660fd>] tcp_cwnd_reduction+0xcd/0xe0
Jul 17 03:25:43 node05 kernel: [205472.503334] [<ffffffff8156b72b>] tcp_fastretrans_alert+0x27b/0xab0
Jul 17 03:25:43 node05 kernel: [205472.503340] [<ffffffff8156c45b>] tcp_ack+0x4fb/0x7f0
Jul 17 03:25:43 node05 kernel: [205472.503355] [<ffffffff8156e01c>] tcp_rcv_established+0x1ac/0x730
Jul 17 03:25:43 node05 kernel: [205472.503365] [<ffffffff81577603>] tcp_v4_do_rcv+0x133/0x200
Jul 17 03:25:43 node05 kernel: [205472.503371] [<ffffffff81578f96>] tcp_v4_rcv+0x836/0x9b0
Jul 17 03:25:43 node05 kernel: [205472.503379] [<ffffffff81554d31>] ip_local_deliver_finish+0x91/0x1d0
Jul 17 03:25:43 node05 kernel: [205472.503384] [<ffffffff81554ffb>] ip_local_deliver+0x5b/0xc0
Jul 17 03:25:43 node05 kernel: [205472.503390] [<ffffffff815552cd>] ip_rcv+0x26d/0x380
Jul 17 03:25:43 node05 kernel: [205472.503401] [<ffffffff81519060>] __netif_receive_skb_core+0x6f0/0xa30
Jul 17 03:25:43 node05 kernel: [205472.503410] [<ffffffff8151a43d>] process_backlog+0x9d/0x130
Jul 17 03:25:43 node05 kernel: [205472.503416] [<ffffffff81519bb2>] net_rx_action+0x202/0x340
Jul 17 03:25:43 node05 kernel: [205472.503426] [<ffffffff81083aa1>] __do_softirq+0x111/0x300
Jul 17 03:25:43 node05 kernel: [205472.503434] [<ffffffff8161813c>] do_softirq_own_stack+0x1c/0x30
Jul 17 03:25:43 node05 kernel: [205472.507010] DWARF2 unwinder stuck at do_softirq_own_stack+0x1c/0x30
Jul 17 03:25:43 node05 kernel: [205472.507013]
Jul 17 03:25:43 node05 kernel: [205472.507015] Leftover inexact backtrace:
Jul 17 03:25:43 node05 kernel: [205472.507015]
Jul 17 03:25:43 node05 kernel: [205472.507019] <IRQ> <EOI> [<ffffffff810834e3>] ? do_softirq.part.14+0x33/0x40
Jul 17 03:25:43 node05 kernel: [205472.507030] [<ffffffff81083568>] ? __local_bh_enable_ip+0x78/0x80
Jul 17 03:25:43 node05 kernel: [205472.507034] [<ffffffff81564c41>] ? tcp_sendmsg+0xe1/0xb50
Jul 17 03:25:43 node05 kernel: [205472.507037] [<ffffffff81572680>] ? tcp_tsq_handler.part.34+0x30/0x30
Jul 17 03:25:43 node05 kernel: [205472.507043] [<ffffffff814fdb46>] ? sock_sendmsg+0x36/0x40
Jul 17 03:25:43 node05 kernel: [205472.507056] [<ffffffffa0788d6b>] ? try_write+0x26b/0xea0 [libceph]
Jul 17 03:25:43 node05 kernel: [205472.507060] [<ffffffff8158e8a3>] ? inet_recvmsg+0x73/0x90
Jul 17 03:25:43 node05 kernel: [205472.507068] [<ffffffffa078a2ef>] ? ceph_con_workfn+0x7df/0x21c0 [libceph]
Jul 17 03:25:43 node05 kernel: [205472.507074] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507078] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507082] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507086] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507089] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507093] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507097] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507101] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507105] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507109] [<ffffffff81610978>] ? thread_return+0x2f/0x527
Jul 17 03:25:43 node05 kernel: [205472.507112] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507118] [<ffffffff81097077>] ? process_one_work+0x167/0x4b0
Jul 17 03:25:43 node05 kernel: [205472.507122] [<ffffffff8109740a>] ? worker_thread+0x4a/0x4c0
Jul 17 03:25:43 node05 kernel: [205472.507126] [<ffffffff810973c0>] ? process_one_work+0x4b0/0x4b0
Jul 17 03:25:43 node05 kernel: [205472.507129] [<ffffffff8109d0f9>] ? kthread+0xc9/0xe0
Jul 17 03:25:43 node05 kernel: [205472.507133] [<ffffffff8161096c>] ? thread_return+0x23/0x527
Jul 17 03:25:43 node05 kernel: [205472.507137] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 17 03:25:43 node05 kernel: [205472.507141] [<ffffffff81615905>] ? ret_from_fork+0x55/0x80
Jul 17 03:25:43 node05 kernel: [205472.507144] [<ffffffff8109d030>] ? kthread_park+0x50/0x50
Jul 17 03:25:43 node05 kernel: [205472.507171] ---[ end trace 34e5436c0363311d ]---
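For anyone triaging a log like the one above, here is a rough Python sketch that counts libceph OSD up/down transitions and kernel WARNING headers. It is not part of PetaSAN; the function name `summarize` and both regular expressions are assumptions based only on the line format shown in this post, and the sample lines are copied from the log.

```python
import re
from collections import Counter

# Assumed format, e.g. "... kernel: [205470.286678] libceph: osd20 up"
OSD_EVENT = re.compile(r"libceph: (osd\d+) (up|down)\s*$")
# Assumed format of the first line of a kernel warning, e.g.
# "WARNING: CPU: 0 PID: 1350286 at net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()"
WARNING = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+) (\w+)\+")


def summarize(lines):
    """Count OSD state changes and kernel warnings in syslog-style lines."""
    osd_events = Counter()   # keyed by (osd name, "up" or "down")
    warnings = Counter()     # keyed by (source location, function name)
    for line in lines:
        match = OSD_EVENT.search(line)
        if match:
            osd_events[(match.group(1), match.group(2))] += 1
            continue
        match = WARNING.search(line)
        if match:
            warnings[(match.group(1), match.group(2))] += 1
    return osd_events, warnings


# Sample lines copied from the kernel log in this post.
sample = [
    "Jul 17 03:25:40 node05 kernel: [205470.286678] libceph: osd20 up",
    "Jul 17 03:25:41 node05 kernel: [205470.670242] COMPARE_AND_WRITE: miscompare at offset 0",
    "Jul 17 03:25:43 node05 kernel: [205472.503139] WARNING: CPU: 0 PID: 1350286 at "
    "net/ipv4/tcp_input.c:2481 tcp_cwnd_reduction+0xcd/0xe0()",
]

osd_events, warnings = summarize(sample)
for key, count in sorted(osd_events.items()):
    print("osd event", key, count)
for key, count in sorted(warnings.items()):
    print("warning", key, count)
```

Running the same function over the full kernel.log of each node would make it easier to compare which nodes saw the tcp_cwnd_reduction warning and how often OSDs flapped around it.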
admin
2,930 Posts
July 18, 2018, 1:45 pm
The downside is that i40e driver support in the 4.4.x kernel lags behind the mainline fixes. We will try to get the 4.12 kernel out soon. If there is any chance you could use a different NIC type, that may fix it. I will update you when 4.12 is ready.
BonsaiJoe
53 Posts
July 18, 2018, 2:12 pm
Thanks for the fast update.
We have run a second load test on the 3-node cluster with -04.
This time we did not get the kernel message again, and cluster speed is again much better than with the 4.4.92 kernel version.
The test copied 3 VMs from PetaSAN to PetaSAN (same cluster) at the same time; speed was up to 800 MB/s RW.
Do you think the kernel error comes from the NIC driver, or is there maybe some other problem with the 4.4.126 kernel?
admin
2,930 Posts
July 18, 2018, 3:35 pm
The best approach for us is to get you a 4.12.x kernel soon, which has many i40e fixes; 4.4 is missing a lot of fixes for this driver. If you are willing to do one more test, we can send you another 4.4.126 test kernel with some more fixes that may or may not solve the issue. Let me know and we can have a shot at it, but really we do not want to spend too much time fixing 4.4 now.
Re the load test, what do you use to test?
BonsaiJoe
53 Posts
July 18, 2018, 4:48 pm
Thanks, I think we will wait until you are ready with 4.12. Can you estimate how long this will take?
We have done just a basic test: copying 3 VMs in parallel (each ~80 GB) in VMware from PetaSAN to PetaSAN. Each job got approx. 270 MB/s (RW), so in total we got up to 800 MB/s.
This is close to the maximum of 876 MB/s write speed from the PetaSAN internal benchmark test.
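As a quick check of the arithmetic in this post, a small Python sketch using only the figures quoted here (3 jobs, roughly 270 MB/s each, 876 MB/s benchmark write speed). The variable names are made up for illustration.

```python
# Figures quoted in this thread; variable names are illustrative only.
jobs = 3                      # VMs copied in parallel
per_job_mb_s = 270            # approximate speed of each copy job
benchmark_write_mb_s = 876    # PetaSAN internal benchmark, write

# Aggregate throughput of the three parallel copy jobs.
aggregate_mb_s = jobs * per_job_mb_s

# Share of the benchmark write speed, for the computed aggregate and for
# the rounded "up to 800 MB/s" figure from the post.
share_aggregate = aggregate_mb_s / benchmark_write_mb_s
share_rounded = 800 / benchmark_write_mb_s

print(f"aggregate: {aggregate_mb_s} MB/s ({share_aggregate:.0%} of benchmark write)")
print(f"rounded:   800 MB/s ({share_rounded:.0%} of benchmark write)")
```

So three jobs at about 270 MB/s come to about 810 MB/s, which is a little over 90% of the benchmark write figure. Since the copy reads from and writes to the same cluster, this is only a rough comparison.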
admin
2,930 Posts
July 23, 2018, 11:49 am
Download the new kernel and firmware from:
https://drive.google.com/drive/folders/1kZYfW3MAz2fJKBIy57R4dF9h74SCMoNt?usp=sharing
Install:
dpkg -i petasan-firmware_20180416.deb
dpkg -i linux-image-4.12.14-02-petasan_amd64.deb
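The new kernel only takes effect after the node has been rebooted into it. A minimal Python sketch for checking which kernel a node is actually running follows. The expected release string is an assumption inferred from the package name and from the "4.4.126-03-petasan" string in the kernel log earlier in this thread, and the helper name `running_expected_kernel` is hypothetical.

```python
import platform

# Assumption: the release string reported by the new kernel follows the
# package name, as "4.4.126-03-petasan" did for the older build.
EXPECTED_RELEASE = "4.12.14-02-petasan"


def running_expected_kernel(running: str, expected: str = EXPECTED_RELEASE) -> bool:
    """Return True if the running kernel release equals the expected one."""
    return running.strip() == expected


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r` on Linux.
    release = platform.release()
    if running_expected_kernel(release):
        print(f"OK: running {release}")
    else:
        print(f"Running {release}, expected {EXPECTED_RELEASE}")
```

Running this on each node after the reboot shows whether any node is still on the old 4.4 kernel.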