Help with osds marked down

Hi,

On Saturday and again this morning we had an outage of our cluster. Can someone help me understand the logs, especially which mon marked the OSDs down and why?

Logs:

2020-09-21 06:17:35.376063 mon.ceph-node-mru-1 (mon.0) 2022148 : cluster [WRN] message from mon.2 was stamped 8.103277s in the future, clocks not synchronized
2020-09-21 06:18:00.506989 mon.ceph-node-mru-1 (mon.0) 2022150 : cluster [INF] Standby daemon mds.ceph-node-mru-2 is not responding, dropping it
2020-09-21 06:18:00.507234 mon.ceph-node-mru-1 (mon.0) 2022151 : cluster [INF] Standby daemon mds.ceph-node-mru-3 is not responding, dropping it
2020-09-21 06:18:00.533631 mon.ceph-node-mru-1 (mon.0) 2022152 : cluster [DBG] fsmap 1 up:standby
2020-09-21 06:18:05.516132 mon.ceph-node-mru-1 (mon.0) 2022153 : cluster [INF] Manager daemon ceph-node-mru-3 is unresponsive. No standby daemons available.
2020-09-21 06:18:05.517846 mon.ceph-node-mru-1 (mon.0) 2022154 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
2020-09-21 06:18:05.555477 mon.ceph-node-mru-1 (mon.0) 2022155 : cluster [DBG] mgrmap e263: no daemons active (since 0.0393453s)
2020-09-21 06:19:04.972536 mon.ceph-node-mru-1 (mon.0) 2022162 : cluster [WRN] 1 clock skew 8.10321s > max 0.3s
2020-09-21 06:19:04.972573 mon.ceph-node-mru-1 (mon.0) 2022163 : cluster [WRN] 2 clock skew 8.10302s > max 0.3s
2020-09-21 06:19:05.535907 mon.ceph-node-mru-1 (mon.0) 2022164 : cluster [WRN] Health check failed: clock skew detected on mon.ceph-node-mru-2, mon.ceph-node-mru-3 (MON_CLOCK_SKEW)
2020-09-21 06:19:34.972954 mon.ceph-node-mru-1 (mon.0) 2022171 : cluster [WRN] 1 clock skew 8.10321s > max 0.3s
2020-09-21 06:19:34.973002 mon.ceph-node-mru-1 (mon.0) 2022172 : cluster [WRN] 2 clock skew 8.10303s > max 0.3s
2020-09-21 06:20:34.973381 mon.ceph-node-mru-1 (mon.0) 2022182 : cluster [WRN] 1 clock skew 8.10322s > max 0.3s
2020-09-21 06:20:34.973433 mon.ceph-node-mru-1 (mon.0) 2022183 : cluster [WRN] 2 clock skew 8.10301s > max 0.3s
2020-09-21 06:22:04.973827 mon.ceph-node-mru-1 (mon.0) 2022195 : cluster [WRN] 1 clock skew 8.10326s > max 0.3s
2020-09-21 06:22:04.973876 mon.ceph-node-mru-1 (mon.0) 2022196 : cluster [WRN] 2 clock skew 8.10305s > max 0.3s
2020-09-21 06:24:04.974250 mon.ceph-node-mru-1 (mon.0) 2022209 : cluster [WRN] 1 clock skew 8.1033s > max 0.3s
2020-09-21 06:24:04.974298 mon.ceph-node-mru-1 (mon.0) 2022210 : cluster [WRN] 2 clock skew 8.10303s > max 0.3s
2020-09-21 06:26:34.974616 mon.ceph-node-mru-1 (mon.0) 2022225 : cluster [WRN] 1 clock skew 8.10339s > max 0.3s
2020-09-21 06:26:34.974661 mon.ceph-node-mru-1 (mon.0) 2022226 : cluster [WRN] 2 clock skew 8.10313s > max 0.3s
2020-09-21 06:28:05.848097 mon.ceph-node-mru-1 (mon.0) 2022237 : cluster [INF] osd.267 marked down after no beacon for 901.743295 seconds
2020-09-21 06:28:05.858930 mon.ceph-node-mru-1 (mon.0) 2022238 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2020-09-21 06:28:05.910835 mon.ceph-node-mru-1 (mon.0) 2022239 : cluster [DBG] osdmap e244731: 463 total, 462 up, 463 in
2020-09-21 06:28:07.034072 mon.ceph-node-mru-1 (mon.0) 2022240 : cluster [DBG] osdmap e244732: 463 total, 462 up, 463 in
2020-09-21 06:28:10.872838 mon.ceph-node-mru-1 (mon.0) 2022242 : cluster [INF] osd.101 marked down after no beacon for 902.909449 seconds
2020-09-21 06:28:10.872891 mon.ceph-node-mru-1 (mon.0) 2022243 : cluster [INF] osd.132 marked down after no beacon for 901.396088 seconds
2020-09-21 06:28:10.872910 mon.ceph-node-mru-1 (mon.0) 2022244 : cluster [INF] osd.136 marked down after no beacon for 900.153054 seconds
2020-09-21 06:28:10.872927 mon.ceph-node-mru-1 (mon.0) 2022245 : cluster [INF] osd.142 marked down after no beacon for 900.230733 seconds
2020-09-21 06:28:10.872951 mon.ceph-node-mru-1 (mon.0) 2022246 : cluster [INF] osd.167 marked down after no beacon for 901.638850 seconds
2020-09-21 06:28:10.872980 mon.ceph-node-mru-1 (mon.0) 2022247 : cluster [INF] osd.217 marked down after no beacon for 903.405562 seconds
2020-09-21 06:28:10.873003 mon.ceph-node-mru-1 (mon.0) 2022248 : cluster [INF] osd.244 marked down after no beacon for 902.890343 seconds
2020-09-21 06:28:10.873047 mon.ceph-node-mru-1 (mon.0) 2022249 : cluster [INF] osd.328 marked down after no beacon for 902.806242 seconds
2020-09-21 06:28:10.873079 mon.ceph-node-mru-1 (mon.0) 2022250 : cluster [INF] osd.380 marked down after no beacon for 900.014060 seconds
2020-09-21 06:28:10.873112 mon.ceph-node-mru-1 (mon.0) 2022251 : cluster [INF] osd.438 marked down after no beacon for 900.417971 seconds
2020-09-21 06:28:10.873133 mon.ceph-node-mru-1 (mon.0) 2022252 : cluster [INF] osd.460 marked down after no beacon for 901.782327 seconds
2020-09-21 06:28:10.890689 mon.ceph-node-mru-1 (mon.0) 2022253 : cluster [WRN] Health check update: 12 osds down (OSD_DOWN)
2020-09-21 06:28:11.155964 mon.ceph-node-mru-1 (mon.0) 2022254 : cluster [DBG] osdmap e244733: 463 total, 451 up, 463 in
2020-09-21 06:28:12.094895 mon.ceph-node-mru-1 (mon.0) 2022257 : cluster [DBG] osdmap e244734: 463 total, 451 up, 463 in

Are the OSDs still down? Do you still have a clock skew warning?

The OSDs came back with 'wrongly marked me down', but the cluster was offline for a while because a lot of OSDs were affected (more than listed above). I am a bit surprised by this: mon-1 had an NTP server configured, mon-2 used mon-1, and mon-3 used mon-2 as its NTP server (set via your GUI). In theory the servers should have had the same time. Yesterday I manually set two NTP servers on all three mons and restarted the mgr and mon daemons. Today the cluster had no problems (after a crash on Saturday evening and another on Monday morning). Hopefully this does not come back 😉

Regards,

Dennis
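
For anyone reading similar logs: every entry names the monitor that wrote it (here always mon.ceph-node-mru-1, mon.0), and the "marked down after no beacon for N seconds" lines give the stated reason. The values are all just over 900 seconds, which as far as I know matches the default of mon_osd_report_timeout, but please treat that as an assumption and check your own config. Below is a minimal Python sketch that pulls out those lines and estimates when the mon last saw a beacon from each OSD. The regular expressions are only derived from the excerpt in this thread, and the function name summarize_beacon_timeouts is made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Line layout assumed from the log excerpt in this thread:
# <timestamp> <mon name> (<mon rank>) <seq> : cluster [<level>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<mon>mon\.\S+) \((?P<rank>mon\.\d+)\) \d+ : "
    r"cluster \[(?P<level>\w+)\] (?P<msg>.*)$"
)
BEACON_RE = re.compile(
    r"^(?P<osd>osd\.\d+) marked down after no beacon for (?P<secs>[\d.]+) seconds$"
)


def summarize_beacon_timeouts(lines):
    """Return one dict per 'marked down after no beacon' line (hypothetical helper)."""
    rows = []
    for line in lines:
        entry = LINE_RE.match(line.strip())
        if entry is None:
            continue  # not a cluster-log line in the assumed layout
        beacon = BEACON_RE.match(entry.group("msg"))
        if beacon is None:
            continue  # some other message (clock skew, osdmap, ...)
        logged_at = datetime.strptime(entry.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        silent_for = float(beacon.group("secs"))
        rows.append({
            "osd": beacon.group("osd"),
            "reported_by": entry.group("mon"),
            "rank": entry.group("rank"),
            "silent_for": silent_for,
            # Estimated time of the last beacon, on the reporting mon's clock.
            "last_beacon": logged_at - timedelta(seconds=silent_for),
        })
    return rows


if __name__ == "__main__":
    sample = [
        "2020-09-21 06:28:05.848097 mon.ceph-node-mru-1 (mon.0) 2022237 : "
        "cluster [INF] osd.267 marked down after no beacon for 901.743295 seconds",
        "2020-09-21 06:28:10.872838 mon.ceph-node-mru-1 (mon.0) 2022242 : "
        "cluster [INF] osd.101 marked down after no beacon for 902.909449 seconds",
    ]
    for row in summarize_beacon_timeouts(sample):
        print(row["osd"], "marked down by", row["reported_by"],
              "last beacon around", row["last_beacon"].strftime("%H:%M:%S"))
```

Run against the full log, this should show whether the OSDs all went silent at about the same time, which would point to one common cause (such as the time jump behind the clock skew warnings) rather than to individual OSD failures.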