Production nodes go down randomly

Hi Team,

Recently our PetaSAN nodes went down randomly. We have a 4-node cluster, and the messages below were found in the kernel logs.
We also observed that a few of the SCSI disks restarted, which ideally should not happen because the data is replicated on the other nodes.
So far I have mainly looked into the kernel logs.
Which other logs should we check to identify the issue?

Logs:
Sep 2 14:09:57 ps-node-04 kernel: [81719.499480] libceph: mon2 (1)10.62.0.12:6789 session established
Sep 2 14:09:57 ps-node-04 kernel: [81719.500736] libceph: osd5 down
Sep 2 14:09:57 ps-node-04 kernel: [81719.500736] libceph: osd6 down
Sep 2 14:09:57 ps-node-04 kernel: [81719.500737] libceph: osd7 down
Sep 2 14:09:57 ps-node-04 kernel: [81719.500737] libceph: osd8 down
Sep 2 14:09:57 ps-node-04 kernel: [81719.500737] libceph: osd9 down
Sep 2 14:09:57 ps-node-04 kernel: [81719.500738] libceph: osd20 down
Sep 2 14:09:57 ps-node-04 kernel: [81719.500738] libceph: osd22 down
Sep 2 14:09:57 ps-node-04 kernel: [81719.500739] libceph: osd29 down
Sep 2 14:09:57 ps-node-04 kernel: [81719.500835] libceph: osd27 down
Sep 2 14:10:11 ps-node-04 kernel: [81733.223594] rbd1: p1
Sep 2 14:10:11 ps-node-04 kernel: [81733.223598] rbd1: p1 size 629137409 extends beyond EOD, truncated
Sep 2 14:10:11 ps-node-04 kernel: [81733.223933] rbd: rbd1: capacity 322122547200 features 0x1
Sep 2 14:10:42 ps-node-04 kernel: [81764.116729] Alternate GPT is invalid, using primary GPT.
Sep 2 14:10:42 ps-node-04 kernel: [81764.116734] rbd2: p1
Sep 2 14:10:42 ps-node-04 kernel: [81764.117107] rbd: rbd2: capacity 268435456000 features 0x1
Sep 2 14:11:13 ps-node-04 kernel: [81795.162411] Alternate GPT is invalid, using primary GPT.
Sep 2 14:11:13 ps-node-04 kernel: [81795.162417] rbd3: p1
Sep 2 14:11:13 ps-node-04 kernel: [81795.162793] rbd: rbd3: capacity 4398046511104 features 0x1
Sep 2 14:11:43 ps-node-04 kernel: [81825.825304] rbd4: p1
Sep 2 14:11:43 ps-node-04 kernel: [81825.825308] rbd4: p1 size 2147475457 extends beyond EOD, truncated
Sep 2 14:11:43 ps-node-04 kernel: [81825.825678] rbd: rbd4: capacity 1099511627776 features 0x1
Sep 2 14:12:12 ps-node-04 kernel: [81854.905845] rbd: rbd5: capacity 2199023255552 features 0x1
Sep 2 14:12:41 ps-node-04 kernel: [81883.824586] rbd6: p1
Sep 2 14:12:41 ps-node-04 kernel: [81883.824589] rbd6: p1 size 524279809 extends beyond EOD, truncated
Sep 2 14:12:41 ps-node-04 kernel: [81883.824945] rbd: rbd6: capacity 268435456000 features 0x1
Sep 2 14:13:12 ps-node-04 kernel: [81914.802699] rbd7: p1
Sep 2 14:13:12 ps-node-04 kernel: [81914.802704] rbd7: p1 size 524279809 extends beyond EOD, truncated
Sep 2 14:13:12 ps-node-04 kernel: [81914.803224] rbd: rbd7: capacity 268435456000 features 0x1
Sep 2 14:13:41 ps-node-04 kernel: [81943.510862] rbd8: p1
Sep 2 14:13:41 ps-node-04 kernel: [81943.510867] rbd8: p1 size 629137409 extends beyond EOD, truncated
Sep 2 14:13:41 ps-node-04 kernel: [81943.511261] rbd: rbd8: capacity 322122547200 features 0x1
Sep 2 14:14:12 ps-node-04 kernel: [81974.193978] rbd9: p1
Sep 2 14:14:12 ps-node-04 kernel: [81974.193982] rbd9: p1 size 314564609 extends beyond EOD, truncated
Sep 2 14:14:12 ps-node-04 kernel: [81974.194343] rbd: rbd9: capacity 161061273600 features 0x1
Sep 2 14:14:43 ps-node-04 kernel: [82005.879924] rbd10: p1
Sep 2 14:14:43 ps-node-04 kernel: [82005.879927] rbd10: p1 size 524279809 extends beyond EOD, truncated
Sep 2 14:14:43 ps-node-04 kernel: [82005.880274] rbd: rbd10: capacity 268435456000 features 0x1
Sep 2 14:15:13 ps-node-04 kernel: [82035.425422] rbd11: p1
Sep 2 14:15:13 ps-node-04 kernel: [82035.425426] rbd11: p1 size 314564609 extends beyond EOD, truncated
Sep 2 14:15:13 ps-node-04 kernel: [82035.425793] rbd: rbd11: capacity 161061273600 features 0x1
Sep 2 14:15:44 ps-node-04 kernel: [82066.376598] Alternate GPT is invalid, using primary GPT.
Sep 2 14:15:44 ps-node-04 kernel: [82066.376602] rbd12:
Sep 2 14:15:44 ps-node-04 kernel: [82066.376975] rbd: rbd12: capacity 966367641600 features 0x1
Sep 2 14:16:16 ps-node-04 kernel: [82098.085441] rbd13: p1
Sep 2 14:16:16 ps-node-04 kernel: [82098.085446] rbd13: p1 size 2147475457 extends beyond EOD, truncated
Sep 2 14:16:16 ps-node-04 kernel: [82098.085839] rbd: rbd13: capacity 1099511627776 features 0x1
Sep 2 14:16:44 ps-node-04 kernel: [82126.845416] Alternate GPT is invalid, using primary GPT.
Sep 2 14:16:44 ps-node-04 kernel: [82126.845422] rbd14: p1
Sep 2 14:16:44 ps-node-04 kernel: [82126.845798] rbd: rbd14: capacity 322122547200 features 0x1
Sep 2 14:17:15 ps-node-04 kernel: [82157.251991] rbd15: p1
Sep 2 14:17:15 ps-node-04 kernel: [82157.252362] rbd: rbd15: capacity 2199023255552 features 0x1
Sep 2 14:18:00 ps-node-04 kernel: [82202.144512] rbd16: p1
Sep 2 14:18:00 ps-node-04 kernel: [82202.144516] rbd16: p1 size 1048567809 extends beyond EOD, truncated
Sep 2 14:18:00 ps-node-04 kernel: [82202.144833] rbd: rbd16: capacity 536870912000 features 0x1
Sep 2 14:18:29 ps-node-04 kernel: [82231.742030] sd 0:0:2:0: [sdc] tag#0 Sense Key : Recovered Error [current] [descriptor]
Sep 2 14:18:29 ps-node-04 kernel: [82231.742033] sd 0:0:2:0: [sdc] tag#0 Add. Sense: Defect list not found
Sep 2 14:18:31 ps-node-04 kernel: [82233.448213] rbd17: p1
Sep 2 14:18:31 ps-node-04 kernel: [82233.448217] rbd17: p1 size 1048567809 extends beyond EOD, truncated
Sep 2 14:18:31 ps-node-04 kernel: [82233.448706] rbd: rbd17: capacity 536870912000 features 0x1
Sep 2 14:19:02 ps-node-04 kernel: [82264.784578] rbd18: p1
Sep 2 14:19:02 ps-node-04 kernel: [82264.784582] rbd18: p1 size 2147475457 extends beyond EOD, truncated
Sep 2 14:19:02 ps-node-04 kernel: [82264.784944] rbd: rbd18: capacity 1099511627776 features 0x1
Sep 2 14:19:32 ps-node-04 kernel: [82294.412074] rbd19: p1
Sep 2 14:19:32 ps-node-04 kernel: [82294.412079] rbd19: p1 size 1258283009 extends beyond EOD, truncated
Sep 2 14:19:32 ps-node-04 kernel: [82294.412465] rbd: rbd19: capacity 644245094400 features 0x1
Sep 2 14:19:39 ps-node-04 kernel: [82301.201325] libceph: osd5 weight 0x0 (out)
Sep 2 14:19:39 ps-node-04 kernel: [82301.201327] libceph: osd6 weight 0x0 (out)
Sep 2 14:19:39 ps-node-04 kernel: [82301.201328] libceph: osd7 weight 0x0 (out)
Sep 2 14:19:39 ps-node-04 kernel: [82301.201328] libceph: osd8 weight 0x0 (out)
Sep 2 14:19:39 ps-node-04 kernel: [82301.201328] libceph: osd9 weight 0x0 (out)
Sep 2 14:19:39 ps-node-04 kernel: [82301.201329] libceph: osd20 weight 0x0 (out)
Sep 2 14:19:39 ps-node-04 kernel: [82301.201330] libceph: osd22 weight 0x0 (out)
Sep 2 14:19:39 ps-node-04 kernel: [82301.201330] libceph: osd27 weight 0x0 (out)
Sep 2 14:19:39 ps-node-04 kernel: [82301.201331] libceph: osd29 weight 0x0 (out)
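
To see at a glance which OSDs were reported down and which were later marked out, the libceph lines in an excerpt like the one above can be tallied with a short script. This is only a minimal sketch: the function name summarize_osd_events is made up for illustration, and the pattern covers just the two message forms that appear in this excerpt ("osdN down" and "osdN weight 0x0 (out)").

```python
import re
from collections import defaultdict

# Matches the two libceph message forms seen in the excerpt:
#   "libceph: osd5 down" and "libceph: osd5 weight 0x0 (out)"
OSD_EVENT = re.compile(r"libceph: osd(\d+) (down|weight 0x0 \(out\))")


def summarize_osd_events(lines):
    """Return {"down": [...], "out": [...]} with sorted OSD ids per event."""
    events = defaultdict(set)
    for line in lines:
        match = OSD_EVENT.search(line)
        if match is None:
            continue  # not an OSD state line (rbd, sd, GPT messages, ...)
        osd_id = int(match.group(1))
        key = "out" if match.group(2).startswith("weight") else "down"
        events[key].add(osd_id)
    return {event: sorted(ids) for event, ids in events.items()}


# A few lines copied from the excerpt above.
sample = [
    "Sep 2 14:09:57 ps-node-04 kernel: [81719.500736] libceph: osd5 down",
    "Sep 2 14:09:57 ps-node-04 kernel: [81719.500738] libceph: osd20 down",
    "Sep 2 14:10:11 ps-node-04 kernel: [81733.223594] rbd1: p1",
    "Sep 2 14:19:39 ps-node-04 kernel: [82301.201325] libceph: osd5 weight 0x0 (out)",
]

print(summarize_osd_events(sample))
```

Run against the full excerpt, the same nine OSD ids should appear under both "down" and "out", which suggests one group of OSDs dropped out together rather than scattered single-disk failures.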

Look at the Cluster Status on the dashboard: is it OK, or does it show an error?

Please also post the output of the following commands:

ceph status

ceph health detail
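
If you want to check the health state from a script rather than by eye, the status command can also emit JSON via --format json. The sketch below is an assumption-laden illustration, not a PetaSAN tool: the function names are hypothetical, and the JSON field names ("health" with "status", falling back to "overall_status") vary between Ceph releases, so verify them against your own output first.

```python
import json
import subprocess


def health_from_status_json(text):
    """Extract the overall health string from `ceph status --format json` output."""
    health = json.loads(text).get("health", {})
    # Assumption: newer releases report "status"; some older ones used
    # "overall_status". Check which one your cluster actually prints.
    return health.get("status") or health.get("overall_status") or "UNKNOWN"


def read_cluster_health():
    """Run the ceph CLI on a node that has it installed with an admin keyring."""
    result = subprocess.run(
        ["ceph", "status", "--format", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return health_from_status_json(result.stdout)


# Hand-written sample for illustration only, not output from this cluster.
sample = '{"health": {"status": "HEALTH_WARN"}}'
print(health_from_status_json(sample))
```

On a cluster node you would call read_cluster_health() instead of parsing the sample string. Anything other than HEALTH_OK is worth following up with ceph health detail.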

If your disks are stressed by recovery, disable iSCSI fencing or reduce the backfill speed in the Maintenance tab.