OSD crashed and restarted

Thanks for the update. We have updated our test cluster and it looks pretty good so far: no Ceph crash, no NIC reset.

The only thing we can see is:

Jul 24 11:33:18 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:33:18 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:33:18 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:33:19 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:39:26 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:39:59 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:39:59 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:45:40 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:45:40 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:45:40 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:45:40 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:45:40 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:45:40 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
Jul 24 11:45:40 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe

This appears at the time a node is joining the cluster after a reboot. Any idea why this happens?

We have also done some load tests, and they look good so far. We will continue testing today, and if we do not run into trouble we will update our production cluster tomorrow.

thanks again for your help

The broken pipe should be OK if it is temporary during cluster booting. It indicates a connection error, which is expected while machines are booting. If it goes away quickly you can safely ignore it.
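One way to check whether the messages really are confined to the boot window is to look at how many there are and when the first and last ones were logged. The following is only a minimal sketch: the function name `broken_pipe_window` is made up for illustration, and because syslog timestamps carry no year, the `year` parameter is an assumption supplied by the caller.

```python
import re
from datetime import datetime

# Matches classic syslog lines such as:
#   Jul 24 11:33:18 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe
SYSLOG_LINE = re.compile(
    r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$"
)


def broken_pipe_window(lines, year=2018):
    """Return (count, first, last) for 'Broken pipe' lines, or (0, None, None).

    `year` is an assumption: syslog timestamps do not include one.
    """
    stamps = []
    for line in lines:
        match = SYSLOG_LINE.match(line)
        if match and "Broken pipe" in match.group(3):
            stamp = f"{year} {match.group(1)}"
            stamps.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    if not stamps:
        return 0, None, None
    return len(stamps), min(stamps), max(stamps)


# A few of the lines posted above, used as sample input.
sample = [
    "Jul 24 11:33:18 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe",
    "Jul 24 11:39:26 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe",
    "Jul 24 11:45:40 ps02-node02 admin.py[3234]: error: [Errno 32] Broken pipe",
]
count, first, last = broken_pipe_window(sample)
print(count, first, last, (last - first).total_seconds())
```

If the last timestamp keeps moving forward long after the node has rejoined, the errors are probably not just boot noise.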

After 2 days with 4.12.14 we got the same error again:

Jul 25 17:36:51 ps02-node02 kernel: [108755.872975] ------------[ cut here ]------------
Jul 25 17:36:51 ps02-node02 kernel: [108755.872983] WARNING: CPU: 5 PID: 5763 at net/ipv4/tcp_input.c:2503 tcp_cwnd_reduction+0xb9/0xc0
Jul 25 17:36:51 ps02-node02 kernel: [108755.872983] Modules linked in: af_packet(N) target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) intel_rapl(N) s$
Jul 25 17:36:51 ps02-node02 kernel: [108755.873022]  syscopyarea(N) sysfillrect(N) ahci(N) sysimgblt(N) libahci(N) crc32c_intel(N) xhci_pci(N) fb_sys_fops(N) i40e(N) xhci_hcd(N) libata(N) arcmsr(N) ptp(N) drm(N) pps_core(N) drm_panel_orientation_quirks(N) usbcore(N) scsi_mod(N) wmi(N) button(N)
Jul 25 17:36:51 ps02-node02 kernel: [108755.873035] Supported: No, Unsupported modules are loaded
Jul 25 17:36:51 ps02-node02 kernel: [108755.873038] CPU: 5 PID: 5763 Comm: msgr-worker-0 Tainted: G                   4.12.14-02-petasan #1 SLE15
Jul 25 17:36:51 ps02-node02 kernel: [108755.873039] Hardware name: Supermicro Super Server/X11SPW-TF, BIOS 2.0b 02/26/2018
Jul 25 17:36:51 ps02-node02 kernel: [108755.873040] task: ffff9e0eb9240b00 task.stack: ffffc19424828000
Jul 25 17:36:51 ps02-node02 kernel: [108755.873043] RIP: 0010:tcp_cwnd_reduction+0xb9/0xc0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873044] RSP: 0018:ffff9e0ebbd43c38 EFLAGS: 00010246
Jul 25 17:36:51 ps02-node02 kernel: [108755.873045] RAX: 0000000000000008 RBX: ffff9e0b1cbfe7c0 RCX: 0000000000000003
Jul 25 17:36:51 ps02-node02 kernel: [108755.873046] RDX: 0000000000005726 RSI: 0000000000000010 RDI: ffff9e0b1cbfe7c0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873047] RBP: 0000000000000010 R08: ffff9e0ebbd43c88 R09: 0000000000000000
Jul 25 17:36:51 ps02-node02 kernel: [108755.873048] R10: 0000000000000006 R11: 0000000000000000 R12: 000000008c606784
Jul 25 17:36:51 ps02-node02 kernel: [108755.873049] R13: 000000008c5fe59f R14: 000000000000001d R15: 0000000000000000
Jul 25 17:36:51 ps02-node02 kernel: [108755.873050] FS:  00007f65ea594700(0000) GS:ffff9e0ebbd40000(0000) knlGS:0000000000000000
Jul 25 17:36:51 ps02-node02 kernel: [108755.873052] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 25 17:36:51 ps02-node02 kernel: [108755.873053] CR2: 0000557b00c87000 CR3: 0000001e86728006 CR4: 00000000007606e0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873054] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 25 17:36:51 ps02-node02 kernel: [108755.873055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 25 17:36:51 ps02-node02 kernel: [108755.873055] PKRU: 55555554
Jul 25 17:36:51 ps02-node02 kernel: [108755.873056] Call Trace:
Jul 25 17:36:51 ps02-node02 kernel: [108755.873058]  <IRQ>
Jul 25 17:36:51 ps02-node02 kernel: [108755.873061]  tcp_ack+0x727/0x870
Jul 25 17:36:51 ps02-node02 kernel: [108755.873066]  tcp_rcv_established+0x1bf/0x540
Jul 25 17:36:51 ps02-node02 kernel: [108755.873077]  tcp_v4_do_rcv+0x12f/0x1d0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873082]  tcp_v4_rcv+0x88f/0x990
Jul 25 17:36:51 ps02-node02 kernel: [108755.873108]  ip_local_deliver_finish+0x92/0x1d0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873114]  ip_local_deliver+0x5b/0xc0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873121]  ? lock_timer_base+0x6d/0x90
Jul 25 17:36:51 ps02-node02 kernel: [108755.873127]  ip_rcv+0x266/0x380
Jul 25 17:36:51 ps02-node02 kernel: [108755.873135]  __netif_receive_skb_core+0x545/0xab0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873142]  ? process_backlog+0xad/0x160
Jul 25 17:36:51 ps02-node02 kernel: [108755.873145]  process_backlog+0xad/0x160
Jul 25 17:36:51 ps02-node02 kernel: [108755.873149]  net_rx_action+0x25a/0x3a0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873155]  ? tick_nohz_stop_sched_tick+0x7d/0x2c0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873163]  __do_softirq+0xf5/0x296
Jul 25 17:36:51 ps02-node02 kernel: [108755.873171]  do_softirq_own_stack+0x2a/0x40
Jul 25 17:36:51 ps02-node02 kernel: [108755.873175]  </IRQ>
Jul 25 17:36:51 ps02-node02 kernel: [108755.873181]  do_softirq.part.16+0x3d/0x50
Jul 25 17:36:51 ps02-node02 kernel: [108755.873188]  __local_bh_enable_ip+0x49/0x50
Jul 25 17:36:51 ps02-node02 kernel: [108755.873196]  ip_finish_output2+0x193/0x370
Jul 25 17:36:51 ps02-node02 kernel: [108755.873201]  ? ip_output+0x62/0xc0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873204]  ip_output+0x62/0xc0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873206]  ? ip_local_out+0x17/0x40
Jul 25 17:36:51 ps02-node02 kernel: [108755.873208]  tcp_transmit_skb+0x4e5/0x940
Jul 25 17:36:51 ps02-node02 kernel: [108755.873212]  tcp_rcv_established+0x299/0x540
Jul 25 17:36:51 ps02-node02 kernel: [108755.873214]  tcp_v4_do_rcv+0x12f/0x1d0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873218]  __release_sock+0x7c/0xd0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873220]  release_sock+0x2b/0x90
Jul 25 17:36:51 ps02-node02 kernel: [108755.873222]  tcp_recvmsg+0x2b1/0x8c0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873226]  inet_recvmsg+0x40/0xa0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873228]  sock_read_iter+0x89/0xd0
Jul 25 17:36:51 ps02-node02 kernel: [108755.873232]  __vfs_read+0xd9/0x140
Jul 25 17:36:51 ps02-node02 kernel: [108755.873235]  vfs_read+0x8e/0x130
Jul 25 17:36:51 ps02-node02 kernel: [108755.873238]  SyS_read+0x42/0x90
Jul 25 17:36:51 ps02-node02 kernel: [108755.873242]  do_syscall_64+0x74/0x140
Jul 25 17:36:51 ps02-node02 kernel: [108755.873245]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Jul 25 17:36:51 ps02-node02 kernel: [108755.873247] RIP: 0033:0x7f65ed91251d
Jul 25 17:36:51 ps02-node02 kernel: [108755.873247] RSP: 002b:00007f65ea591810 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
Jul 25 17:36:51 ps02-node02 kernel: [108755.873249] RAX: ffffffffffffffda RBX: 0000560587740800 RCX: 00007f65ed91251d
Jul 25 17:36:51 ps02-node02 kernel: [108755.873250] RDX: 000000000001c000 RSI: 000056058db1e000 RDI: 000000000000002c
Jul 25 17:36:51 ps02-node02 kernel: [108755.873251] RBP: 000000000001c000 R08: 0000000000000000 R09: 0000000000000001
Jul 25 17:36:51 ps02-node02 kernel: [108755.873252] R10: 00007f65ea591b10 R11: 0000000000000293 R12: 0000560587740800
Jul 25 17:36:51 ps02-node02 kernel: [108755.873253] R13: 000056058db1e000 R14: 0000560587741be0 R15: 000000000001c000
Jul 25 17:36:51 ps02-node02 kernel: [108755.873255] Code: 4c c6 83 c0 01 44 39 c0 41 0f 4f c0 eb cd 49 0f af c2 44 89 ce 31 d2 44 8b 8f 30 06 00 00 48 8d 44 06 ff 48 f7 f6 44 29 c8 eb b0 <0f> 0b c3 0f 1f 40 00 0f 1f 44 00 00 53 0f b6 87 04 06 00 00 48

There are 2 different errors:
The first was the transmit reset that came from the X710 NIC driver. This is not occurring in the 4.12 or 4.4 kernels we sent you after applying the patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/drivers/net/ethernet/intel/i40e?id=248de22e638f10bd5bfc7624a357f940f66ba137

The issue you see now is a TCP receive issue coming from layers above the NIC driver:
https://vuldb.com/?id.80721

and
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8b8a321ff72c785ed5e8b4cf6eda20b35d427390
It is a warning that a congestion control window size is 0; the check that emits it prevents a divide-by-zero error from occurring. As stated, this could happen under a specific IO sequence or even a deliberate denial of service exploiting the divide-by-zero code. It is also possible the NIC driver is still the problem, but the IO pattern described above needs to be looked into before we do further NIC driver fixing or report this as a bug to Intel, since there is no proof it is coming from the driver.
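To make the "warning instead of divide by zero" idea concrete, here is a small Python model of that kind of guard. It is not the kernel source: the function and parameter names are illustrative only, and the arithmetic is a simplified stand-in for the window reduction calculation, assuming the divisor is the congestion window recorded before the reduction started.

```python
import warnings


def cwnd_reduction_sndcnt(ssthresh, prr_delivered, prr_out, prior_cwnd):
    """Return how many segments may be sent, or None if the guard fires.

    Illustrative model only; names and formula are assumptions, not kernel code.
    """
    if prior_cwnd == 0:
        # Warn and bail out instead of dividing by zero below.
        warnings.warn("prior_cwnd is 0, skipping window reduction")
        return None
    # Ceiling division of (ssthresh * prr_delivered) by prior_cwnd.
    dividend = ssthresh * prr_delivered + prior_cwnd - 1
    return dividend // prior_cwnd - prr_out


print(cwnd_reduction_sndcnt(ssthresh=5, prr_delivered=4, prr_out=1, prior_cwnd=10))
```

With a guard like this the caller only sees a logged warning and a skipped adjustment, which matches the observation that no services go down when the message appears.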

Have you installed any external software that may lead to this? Also, what is the net effect: does this cause any services to go down? Note that the same warning code is in the main PetaSAN 2.0 kernel, and you did not have any warnings before.

What I suggest is trying to see if the new issue is NIC driver related by either:
-Replacing the NICs on 1 or 2 nodes and seeing if this warning comes only from systems with X710 NICs.
or
-Setting up an isolated test cluster with a freshly installed 2.0, applying the new 4.12 kernel using the existing X710 NICs, and seeing if you still have issues.

If the above does prove it is NIC related, then you can report the issue to Intel. I would also recommend you first update the firmware as suggested before: you have 5.05 but the latest is 6.01
https://downloadcenter.intel.com/download/25791/Ethernet-Non-Volatile-Memory-NVM-Update-Utility-for-Intel-Ethernet-Adapters-710-Series-Linux-?product=82947
You can also try to build the 4.12 kernel with the latest i40e from the Intel site:
https://downloadcenter.intel.com/download/24411/Intel-Network-Adapter-Driver-for-PCIe-40-Gigabit-Ethernet-Network-Connections-Under-Linux-?product=75021

Another option is to consider using a NIC other than the X710, if this is possible for you.

Thanks for the info

There are 2 different errors:
The first was the transmit reset that came from the X710 NIC driver. This is not occurring in the 4.12 or 4.4 kernels we sent you after applying the patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/drivers/net/ethernet/intel/i40e?id=248de22e638f10bd5bfc7624a357f940f66ba137

The issue you see now is a TCP receive issue coming from layers above the NIC driver:
https://vuldb.com/?id.80721

and
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8b8a321ff72c785ed5e8b4cf6eda20b35d427390
It is a warning that a congestion control window size is 0; the check that emits it prevents a divide-by-zero error from occurring. As stated, this could happen under a specific IO sequence or even a deliberate denial of service exploiting the divide-by-zero code. It is also possible the NIC driver is still the problem, but the IO pattern described above needs to be looked into before we do further NIC driver fixing or report this as a bug to Intel, since there is no proof it is coming from the driver.

Sorry for the misunderstanding. What I mean is that we had this error with 4.4 as well; you can see my post here:
http://www.petasan.org/forums/?view=thread&id=287&part=3#postid-1901

Also, we have seen the same message again today with FW 6.01:

Aug  1 19:39:54 ps02-node02 kernel: [14944.748940] ------------[ cut here ]------------
Aug  1 19:39:54 ps02-node02 kernel: [14944.748949] WARNING: CPU: 8 PID: 4564 at net/ipv4/tcp_input.c:2503 tcp_cwnd_reduction+0xb9/0xc0
Aug  1 19:39:54 ps02-node02 kernel: [14944.748950] Modules linked in: af_packet(N) target_core_user(N) uio(N) target_core_pscsi(N) target_core_file(N) target_core_iblock(N) iscsi_target_mod(N) target_core_rbd(N) target_core_mod(N) rbd(N) libceph(N) configfs(N) fuse(N) bonding(N) xfs(N) libcrc32c(N) ipmi_ssif(N) intel_rapl(N) skx_edac(N) x86_pkg_temp_thermal(N) intel_powerclamp(N) coretemp(N) kvm_intel(N) kvm(N) irqbypass(N) crct10dif_pclmul(N) crc32_pclmul(N) ghash_clmulni_intel(N) pcbc(N) aesni_intel(N) aes_x86_64(N) crypto_simd(N) lpc_ich(N) mei_me(N) glue_helper(N) ioatdma(N) cryptd(N) joydev(N) sg(N) mei(N) mfd_core(N) shpchp(N) dca(N) ipmi_si(N) ipmi_devintf(N) ipmi_msghandler(N) acpi_power_meter(N) acpi_pad(N) autofs4(N) ext4(N) crc16(N) jbd2(N) mbcache(N) hid_generic(N) usbhid(N) sd_mod(N) crc32c_intel(N) ast(N) i2c_algo_bit(N)
Aug  1 19:39:54 ps02-node02 kernel: [14944.748990]  ttm(N) drm_kms_helper(N) syscopyarea(N) ahci(N) sysfillrect(N) libahci(N) sysimgblt(N) xhci_pci(N) fb_sys_fops(N) i40e(N) arcmsr(N) libata(N) xhci_hcd(N) drm(N) ptp(N) pps_core(N) drm_panel_orientation_quirks(N) usbcore(N) scsi_mod(N) wmi(N) button(N)
Aug  1 19:39:54 ps02-node02 kernel: [14944.749005] Supported: No, Unsupported modules are loaded
Aug  1 19:39:54 ps02-node02 kernel: [14944.749008] CPU: 8 PID: 4564 Comm: msgr-worker-2 Tainted: G                   4.12.14-02-petasan #1 SLE15
Aug  1 19:39:54 ps02-node02 kernel: [14944.749009] Hardware name: Supermicro Super Server/X11SPW-TF, BIOS 2.0b 02/26/2018
Aug  1 19:39:54 ps02-node02 kernel: [14944.749010] task: ffff9ac902ae8b40 task.stack: ffffb801e17f0000
Aug  1 19:39:54 ps02-node02 kernel: [14944.749013] RIP: 0010:tcp_cwnd_reduction+0xb9/0xc0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749014] RSP: 0018:ffff9ac9bbe03c38 EFLAGS: 00010246
Aug  1 19:39:54 ps02-node02 kernel: [14944.749016] RAX: 0000000000000008 RBX: ffff9ac93d8f47c0 RCX: 0000000000000003
Aug  1 19:39:54 ps02-node02 kernel: [14944.749017] RDX: 0000000000005726 RSI: 000000000000000e RDI: ffff9ac93d8f47c0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749018] RBP: 000000000000000e R08: ffff9ac9bbe03c88 R09: 0000000000000000
Aug  1 19:39:54 ps02-node02 kernel: [14944.749019] R10: 0000000000000006 R11: 0000000000000000 R12: 00000000fd905c47
Aug  1 19:39:54 ps02-node02 kernel: [14944.749020] R13: 00000000fd8fda62 R14: 0000000000000019 R15: 0000000000000000
Aug  1 19:39:54 ps02-node02 kernel: [14944.749021] FS:  00007f8593098700(0000) GS:ffff9ac9bbe00000(0000) knlGS:0000000000000000
Aug  1 19:39:54 ps02-node02 kernel: [14944.749022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug  1 19:39:54 ps02-node02 kernel: [14944.749023] CR2: 000055720e46f000 CR3: 0000001f41906001 CR4: 00000000007606e0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug  1 19:39:54 ps02-node02 kernel: [14944.749026] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug  1 19:39:54 ps02-node02 kernel: [14944.749027] PKRU: 55555554
Aug  1 19:39:54 ps02-node02 kernel: [14944.749028] Call Trace:
Aug  1 19:39:54 ps02-node02 kernel: [14944.749030]  <IRQ>
Aug  1 19:39:54 ps02-node02 kernel: [14944.749033]  tcp_ack+0x727/0x870
Aug  1 19:39:54 ps02-node02 kernel: [14944.749038]  tcp_rcv_established+0x1bf/0x540
Aug  1 19:39:54 ps02-node02 kernel: [14944.749042]  tcp_v4_do_rcv+0x12f/0x1d0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749044]  tcp_v4_rcv+0x88f/0x990
Aug  1 19:39:54 ps02-node02 kernel: [14944.749047]  ip_local_deliver_finish+0x92/0x1d0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749049]  ip_local_deliver+0x5b/0xc0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749051]  ip_rcv+0x266/0x380
Aug  1 19:39:54 ps02-node02 kernel: [14944.749056]  ? load_balance+0x158/0x960
Aug  1 19:39:54 ps02-node02 kernel: [14944.749060]  __netif_receive_skb_core+0x545/0xab0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749064]  ? __note_gp_changes+0x2e/0x1c0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749066]  ? process_backlog+0xad/0x160
Aug  1 19:39:54 ps02-node02 kernel: [14944.749068]  process_backlog+0xad/0x160
Aug  1 19:39:54 ps02-node02 kernel: [14944.749070]  net_rx_action+0x25a/0x3a0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749073]  ? rebalance_domains+0xe8/0x280
Aug  1 19:39:54 ps02-node02 kernel: [14944.749078]  __do_softirq+0xf5/0x296
Aug  1 19:39:54 ps02-node02 kernel: [14944.749081]  do_softirq_own_stack+0x2a/0x40
Aug  1 19:39:54 ps02-node02 kernel: [14944.749083]  </IRQ>
Aug  1 19:39:54 ps02-node02 kernel: [14944.749088]  do_softirq.part.16+0x3d/0x50
Aug  1 19:39:54 ps02-node02 kernel: [14944.749090]  __local_bh_enable_ip+0x49/0x50
Aug  1 19:39:54 ps02-node02 kernel: [14944.749092]  ip_finish_output2+0x193/0x370
Aug  1 19:39:54 ps02-node02 kernel: [14944.749095]  ? ip_output+0x62/0xc0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749096]  ip_output+0x62/0xc0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749098]  ? ip_local_out+0x17/0x40
Aug  1 19:39:54 ps02-node02 kernel: [14944.749100]  tcp_transmit_skb+0x4e5/0x940
Aug  1 19:39:54 ps02-node02 kernel: [14944.749104]  tcp_rcv_established+0x299/0x540
Aug  1 19:39:54 ps02-node02 kernel: [14944.749106]  tcp_v4_do_rcv+0x12f/0x1d0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749110]  __release_sock+0x7c/0xd0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749112]  release_sock+0x2b/0x90
Aug  1 19:39:54 ps02-node02 kernel: [14944.749114]  tcp_recvmsg+0x2b1/0x8c0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749119]  inet_recvmsg+0x40/0xa0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749121]  sock_read_iter+0x89/0xd0
Aug  1 19:39:54 ps02-node02 kernel: [14944.749125]  __vfs_read+0xd9/0x140
Aug  1 19:39:54 ps02-node02 kernel: [14944.749128]  vfs_read+0x8e/0x130
Aug  1 19:39:54 ps02-node02 kernel: [14944.749131]  SyS_read+0x42/0x90
Aug  1 19:39:54 ps02-node02 kernel: [14944.749135]  do_syscall_64+0x74/0x140
Aug  1 19:39:54 ps02-node02 kernel: [14944.749138]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Aug  1 19:39:54 ps02-node02 kernel: [14944.749140] RIP: 0033:0x7f859741851d
Aug  1 19:39:54 ps02-node02 kernel: [14944.749141] RSP: 002b:00007f8593095810 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
Aug  1 19:39:54 ps02-node02 kernel: [14944.749143] RAX: ffffffffffffffda RBX: 00005571f2b58800 RCX: 00007f859741851d
Aug  1 19:39:54 ps02-node02 kernel: [14944.749144] RDX: 0000000000013000 RSI: 00005571ffcdb000 RDI: 0000000000000070
Aug  1 19:39:54 ps02-node02 kernel: [14944.749144] RBP: 0000000000013000 R08: 0000000000000000 R09: 0000000000000001
Aug  1 19:39:54 ps02-node02 kernel: [14944.749145] R10: 00007f8593095b10 R11: 0000000000000293 R12: 00005571f2b58800
Aug  1 19:39:54 ps02-node02 kernel: [14944.749146] R13: 00005571ffcdb000 R14: 00005571f2b59be0 R15: 0000000000013000
Aug  1 19:39:54 ps02-node02 kernel: [14944.749148] Code: 4c c6 83 c0 01 44 39 c0 41 0f 4f c0 eb cd 49 0f af c2 44 89 ce 31 d2 44 8b 8f 30 06 00 00 48 8d 44 06 ff 48 f7 f6 44 29 c8 eb b0 <0f> 0b c3 0f 1f 40 00 0f 1f 44 00 00 53 0f b6 87 04 06 00 00 48
Aug  1 19:39:54 ps02-node02 kernel: [14944.749178] ---[ end trace d9590f96a10423a5 ]---
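When collecting evidence for a bug report, it can help to tally how often each distinct warning site shows up across the nodes. The following is a rough sketch, assuming the warning header keeps the `WARNING: CPU: ... PID: ... at file:line function+offset` shape seen in the traces above; the function name `count_kernel_warnings` is made up for illustration.

```python
import re
from collections import Counter

# Header line of a kernel warning as it appears in the traces above.
WARNING_HEADER = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<func>\w+)\+"
)


def count_kernel_warnings(lines):
    """Count warnings per (source file, line number, function)."""
    counts = Counter()
    for line in lines:
        match = WARNING_HEADER.search(line)
        if match:
            key = (match["file"], int(match["line"]), match["func"])
            counts[key] += 1
    return counts


# The two warning headers posted in this thread, used as sample input.
sample = [
    "Jul 25 17:36:51 ps02-node02 kernel: [108755.872983] WARNING: CPU: 5 PID: 5763 "
    "at net/ipv4/tcp_input.c:2503 tcp_cwnd_reduction+0xb9/0xc0",
    "Aug  1 19:39:54 ps02-node02 kernel: [14944.748949] WARNING: CPU: 8 PID: 4564 "
    "at net/ipv4/tcp_input.c:2503 tcp_cwnd_reduction+0xb9/0xc0",
]
for site, hits in count_kernel_warnings(sample).items():
    print(site, hits)
```

Both traces in this thread map to the same site, which supports treating them as one issue rather than two.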

root@ps02-node02:~# ethtool -i eth3
driver: i40e
version: 2.1.14-k
firmware-version: 6.01 0x800036e4 1.1681.0
expansion-rom-version:
bus-info: 0000:17:00.3
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

Have you installed any external software that may lead to this? Also, what is the net effect: does this cause any services to go down? Note that the same warning code is in the main PetaSAN 2.0 kernel, and you did not have any warnings before.

No, there is no external software installed, and no, none of the services are going down. If you tell me that this warning is fine, we can live with it, but if it could lead to errors we have to fix it.

What I suggest is trying to see if the new issue is NIC driver related by either:
-Replacing the NICs on 1 or 2 nodes and seeing if this warning comes only from systems with X710 NICs.
or
-Setting up an isolated test cluster with a freshly installed 2.0, applying the new 4.12 kernel using the existing X710 NICs, and seeing if you still have issues.

Yes, we have a second cluster, a fresh 3-node cluster with PetaSAN 2.0 (installed 3 weeks ago, now with kernel 4.12), and it has the same problem.

If the above does prove it is NIC related, then you can report the issue to Intel. I would also recommend you first update the firmware as suggested before: you have 5.05 but the latest is 6.01
https://downloadcenter.intel.com/download/25791/Ethernet-Non-Volatile-Memory-NVM-Update-Utility-for-Intel-Ethernet-Adapters-710-Series-Linux-?product=82947

We have done the FW update to 6.01 on our new cluster. Now we get a message that we have to update the driver:

[    3.182168] i40e 0000:17:00.1: i40e_ptp_init: PTP not supported on eth1
[    3.191327] i40e 0000:17:00.1: PCI-Express: Speed 8.0GT/s Width x8
[    3.198464] i40e 0000:17:00.1: Features: PF-id[1] VFs: 32 VSIs: 34 QP: 24 RSS FD_ATR FD_SB NTUPLE DCB VxLAN Geneve VEPA
[    3.210793] i40e 0000:17:00.2: fw 6.0.48754 api 1.7 nvm 6.01 0x800036e4 1.1861.0
[    3.210794] i40e 0000:17:00.2: The driver for the device detected a newer version of the NVM image than expected. Please install the most recent version of the network driver.

Also, on all 3 nodes with NIC FW 6.01 we now get these messages:

Aug  1 17:49:05 ps02-node02 consul[3079]: memberlist: Was able to reach ps02-node01 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Aug  1 17:52:38 ps02-node02 consul[3079]: memberlist: Was able to reach ps02-node01 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Aug  1 19:02:35 ps02-node02 consul[3079]: memberlist: Was able to reach ps02-node01 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Aug  1 19:34:28 ps02-node02 consul[3079]: memberlist: Was able to reach ps02-node01 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP
Aug  1 19:34:36 ps02-node02 consul[3079]: memberlist: Was able to reach ps02-node01 via TCP but not UDP, network may be misconfigured and not allowing bidirectional UDP

On the other cluster, with FW 5.05 NICs, we do not have these messages in syslog.

You can also try to build the 4.12 kernel with the latest i40e from the Intel site:
https://downloadcenter.intel.com/download/24411/Intel-Network-Adapter-Driver-for-PCIe-40-Gigabit-Ethernet-Network-Connections-Under-Linux-?product=75021

We do not have the kernel sources. Maybe you can build a kernel with the latest driver?

Another option is to consider using a NIC other than the X710, if this is possible for you.

This will be the last and most expensive option, because in this case we would have to change 14 NICs.

Hi

I recommend you report the issue to Intel to help solve this problem. Based on the data you provided, we do not think the issue is a show stopper, but it still needs to be looked into and resolved.

I uploaded the sources for the 4.12 kernel to the same download location, so you can build it. We can also do it, but not until after a couple of days, since we are crunched for the 2.1 release.
