mgr fails to start
kpiti
23 Posts
February 19, 2023, 9:31 pm
Hi,
I have a freshly installed cluster - 3 mgr/osd nodes + 2 osd-only nodes. After the install the GUI wouldn't show the OSD count, Nodes -> Disks and such, so I shut the whole lot down and started bringing the nodes up one by one. Now none of the mgr services seem to start. One error I already fixed is NTP on node1, which was off by 1h..
Status:
~# ceph status
cluster:
id: f789cfce-0cfc-4a21-8c42-2f231aacc56e
health: HEALTH_ERR
1 filesystem is offline
1 filesystem is online with fewer MDS than max_mds
no active mgr
services:
mon: 3 daemons, quorum CEPH03,CEPH01,CEPH02 (age 49m)
mgr: no daemons active
mds: cephfs:0
osd: 30 osds: 12 up (since 35m), 12 in (since 25m); 101 remapped pgs
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs:
If I try to start the mgr service (systemctl restart ceph-mgr@CEPH01) I get these keyring / "start request repeated too quickly" errors:
Feb 19 21:52:57 CEPH01 systemd[1]: Started Ceph cluster manager daemon.
Feb 19 21:52:57 CEPH01 ceph-mgr[2816]: 2023-02-19T21:52:57.666+0100 7fe1d4782040 -1 auth: unable to find a keyring on /var/lib/ceph/mgr/ceph-CEPH01/keyring: (2) No>
Feb 19 21:52:57 CEPH01 ceph-mgr[2816]: 2023-02-19T21:52:57.666+0100 7fe1d4782040 -1 AuthRegistry(0x55c3d6226140) no keyring found at /var/lib/ceph/mgr/ceph-CEPH01/>
Feb 19 21:52:57 CEPH01 ceph-mgr[2816]: 2023-02-19T21:52:57.666+0100 7fe1d4782040 -1 auth: unable to find a keyring on /var/lib/ceph/mgr/ceph-CEPH01/keyring: (2) No>
Feb 19 21:52:57 CEPH01 ceph-mgr[2816]: 2023-02-19T21:52:57.666+0100 7fe1d4782040 -1 AuthRegistry(0x7ffeb5972b10) no keyring found at /var/lib/ceph/mgr/ceph-CEPH01/>
Feb 19 21:52:57 CEPH01 ceph-mgr[2816]: failed to fetch mon config (--no-mon-config to skip)
Feb 19 21:52:57 CEPH01 systemd[1]: ceph-mgr@CEPH01.service: Main process exited, code=exited, status=1/FAILURE
Feb 19 21:52:57 CEPH01 systemd[1]: ceph-mgr@CEPH01.service: Failed with result 'exit-code'.
The status returns:
# systemctl status ceph-mgr@CEPH01
● ceph-mgr@CEPH01.service - Ceph cluster manager daemon
Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2023-02-19 21:53:28 CET; 30min ago
Process: 2838 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id CEPH01 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 2838 (code=exited, status=1/FAILURE)
Feb 19 21:53:28 CEPH01 systemd[1]: ceph-mgr@CEPH01.service: Scheduled restart job, restart counter is at 3.
Feb 19 21:53:28 CEPH01 systemd[1]: Stopped Ceph cluster manager daemon.
Feb 19 21:53:28 CEPH01 systemd[1]: ceph-mgr@CEPH01.service: Start request repeated too quickly.
Feb 19 21:53:28 CEPH01 systemd[1]: ceph-mgr@CEPH01.service: Failed with result 'exit-code'.
Feb 19 21:53:28 CEPH01 systemd[1]: Failed to start Ceph cluster manager daemon.
/var/lib/ceph/mgr/ is empty.. As this is a vanilla install I can do whatever is needed (hopefully not an ISO reinstall, it's a remote location). I did try to find a solution first (i.e. this thread) but it doesn't seem to have helped..
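For the record, before resorting to a reinstall I'd expect something along these lines to recreate the missing mgr keyring by hand (just a sketch based on the generic Ceph docs, assuming cephx is enabled, the client.admin keyring under /etc/ceph still works, and the daemon id is CEPH01 as in the logs above):
mkdir -p /var/lib/ceph/mgr/ceph-CEPH01                    # recreate the empty data dir
ceph auth get-or-create mgr.CEPH01 mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-CEPH01/keyring
chown -R ceph:ceph /var/lib/ceph/mgr/ceph-CEPH01          # the daemon runs as ceph:ceph
systemctl reset-failed ceph-mgr@CEPH01                    # clear the "start request repeated too quickly" state
systemctl restart ceph-mgr@CEPH01
If the mons accept the auth call the daemon should come back up; if they don't, the problem is bigger than a missing keyring.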
Thanks & cheers
p.s. I have the same problem with mds, but let's go one step at a time
# systemctl status ceph-mds@CEPH01
● ceph-mds@CEPH01.service - Ceph metadata server daemon
Loaded: loaded (/lib/systemd/system/ceph-mds@.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2023-02-19 20:56:45 CET; 1h 31min ago
Main PID: 1608 (code=exited, status=1/FAILURE)
Feb 19 20:56:45 CEPH01 systemd[1]: ceph-mds@CEPH01.service: Scheduled restart job, restart counter is at 3.
Feb 19 20:56:45 CEPH01 systemd[1]: Stopped Ceph metadata server daemon.
Feb 19 20:56:45 CEPH01 systemd[1]: ceph-mds@CEPH01.service: Start request repeated too quickly.
Feb 19 20:56:45 CEPH01 systemd[1]: ceph-mds@CEPH01.service: Failed with result 'exit-code'.
Feb 19 20:56:45 CEPH01 systemd[1]: Failed to start Ceph metadata server daemon.
Feb 19 21:24:02 CEPH01 systemd[1]: ceph-mds@CEPH01.service: Start request repeated too quickly.
Feb 19 21:24:02 CEPH01 systemd[1]: ceph-mds@CEPH01.service: Failed with result 'exit-code'.
Feb 19 21:24:02 CEPH01 systemd[1]: Failed to start Ceph metadata server daemon.
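(And presumably the same idea would apply to the mds, with the mds caps from the Ceph docs instead, e.g.:)
mkdir -p /var/lib/ceph/mds/ceph-CEPH01
ceph auth get-or-create mds.CEPH01 mon 'profile mds' mgr 'profile mds' mds 'allow *' osd 'allow *' -o /var/lib/ceph/mds/ceph-CEPH01/keyring
chown -R ceph:ceph /var/lib/ceph/mds/ceph-CEPH01
systemctl reset-failed ceph-mds@CEPH01 && systemctl restart ceph-mds@CEPH01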
admin
2,930 Posts
February 19, 2023, 10:37 pm
Can you try to re-install? Adjust the time and time zone if needed.
kpiti
23 Posts
February 19, 2023, 10:42 pm
Is there a way to reinstall without blanking the disks and installing from ISO?
kpiti
23 Posts
February 25, 2023, 4:38 pm
Possible bug:
As suggested, I reinstalled the whole cluster very slowly and by the book, setting the timezone and everything (same as before). I got the whole cluster (5 nodes) up, not all OSDs configured yet, but ran into similar problems as before. It turns out it happened when I added an NTP server in the admin UI: the new NTP server was only written to /etc/ntp.conf on node1, and on the rest of the nodes the NTP server points to the first node. The main problem, though, is that afterwards ntpd was running only on the first node; on the rest of the nodes ntpd had been killed and they had a 1h drift. The consequence is that you can't see most of the stuff in the GUI (like disks on another node), and the mgr on the first node can't start, just like before..
After I restarted ntpd on the rest of the nodes everything started working again in the GUI (see the checks below). I think this was also the main reason for my initial error, as the first thing I noticed back then was ntp/time not being synced.. Should setting an NTP server write it to /etc/ntp.conf on all nodes, or is it by design that it's set on one node only and the rest use that node as their primary NTP clock?
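For anyone hitting the same thing, these are the plain ntp checks/fixes I ran per node (nothing PetaSAN-specific, just the stock tooling):
systemctl status ntp              # was dead on every node except node1
grep '^server' /etc/ntp.conf      # shows which upstream the node is actually configured to use
ntpq -pn                          # peers, reachability and current offset
systemctl restart ntp             # after this the GUI started behaving again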
The status of ntp at the start of the problems:
# systemctl status ntp
● ntp.service - Network Time Service
Loaded: loaded (/lib/systemd/system/ntp.service; disabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2023-02-23 13:26:58 CET; 21h ago
Docs: man:ntpd(8)
Main PID: 3701 (code=exited, status=255/EXCEPTION)
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listen normally on 7 bond0 [fe80::3eec:efff:fe5c:9c88%7]:123
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listen normally on 8 bond0.1353 [fe80::3eec:efff:fe5c:9c88%8]:123
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listening on routing socket on fd #25 for interface updates
Feb 22 15:11:54 CEPH02 ntpd[3701]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Feb 22 15:11:54 CEPH02 ntpd[3701]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Feb 22 15:17:17 CEPH02 ntpd[3701]: Listen normally on 9 eth2 [2001:1470:fff0:1303:3eec:efff:fe60:8854]:123
Feb 22 15:17:17 CEPH02 ntpd[3701]: new interface(s) found: waking up resolver
Feb 22 15:21:26 CEPH02 ntpd[3701]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Feb 23 13:26:58 CEPH02 systemd[1]: ntp.service: Main process exited, code=exited, status=255/EXCEPTION
Feb 23 13:26:58 CEPH02 systemd[1]: ntp.service: Failed with result 'exit-code'.
The log of the ntp failure on a failed node, in case you can spot the problem:
Feb 22 15:11:52 CEPH02 deploy.py[3668]: 22 Feb 15:11:52 ntpdate[3668]: no server suitable for synchronization found
Feb 22 15:11:52 CEPH02 systemd[1]: Starting Network Time Service...
Feb 22 15:11:52 CEPH02 ntpd[3676]: ntpd 4.2.8p12@1.3728-o (1): Starting
Feb 22 15:11:52 CEPH02 ntpd[3676]: Command line: /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 104:110
Feb 22 15:11:52 CEPH02 systemd[1]: Started Network Time Service.
Feb 22 15:11:52 CEPH02 ntpd[3679]: proto: precision = 0.090 usec (-23)
Feb 22 15:11:52 CEPH02 ntpd[3679]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): good hash signature
Feb 22 15:11:52 CEPH02 ntpd[3679]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): loaded, expire=2022-06-28T00:00:00Z last=2017-01-01T00:00:00Z ofs=37
Feb 22 15:11:52 CEPH02 ntpd[3679]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): expired less than 240 days ago
Feb 22 15:11:52 CEPH02 ntpd[3679]: Listen and drop on 0 v6wildcard [::]:123
Feb 22 15:11:52 CEPH02 ntpd[3679]: Listen and drop on 1 v4wildcard 0.0.0.0:123
Feb 22 15:11:52 CEPH02 ntpd[3679]: Listen normally on 2 lo 127.0.0.1:123
Feb 22 15:11:52 CEPH02 ntpd[3679]: Listen normally on 3 eth2 10.12.13.52:123
Feb 22 15:11:52 CEPH02 ntpd[3679]: Listen normally on 4 bond0.1353 10.12.201.52:123
Feb 22 15:11:52 CEPH02 ntpd[3679]: Listen normally on 5 lo [::1]:123
Feb 22 15:11:52 CEPH02 ntpd[3679]: Listen normally on 6 usb0 [fe80::2806:e2ff:fea7:4724%6]:123
Feb 22 15:11:52 CEPH02 ntpd[3679]: Listen normally on 7 bond0 [fe80::3eec:efff:fe5c:9c88%7]:123
Feb 22 15:11:52 CEPH02 ntpd[3679]: Listen normally on 8 bond0.1353 [fe80::3eec:efff:fe5c:9c88%8]:123
Feb 22 15:11:52 CEPH02 ntpd[3679]: Listening on routing socket on fd #25 for interface updates
Feb 22 15:11:52 CEPH02 ntpd[3679]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Feb 22 15:11:52 CEPH02 ntpd[3679]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Feb 22 15:11:52 CEPH02 ntpd[3679]: ntpd exiting on signal 15 (Terminated)
Feb 22 15:11:52 CEPH02 systemd[1]: Stopping Network Time Service...
Feb 22 15:11:52 CEPH02 systemd[1]: ntp.service: Succeeded.
Feb 22 15:11:52 CEPH02 systemd[1]: Stopped Network Time Service.
Feb 22 15:11:54 CEPH02 systemd[1]: Starting Network Time Service...
Feb 22 15:11:54 CEPH02 ntpd[3698]: ntpd 4.2.8p12@1.3728-o (1): Starting
Feb 22 15:11:54 CEPH02 ntpd[3698]: Command line: /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 104:110
Feb 22 15:11:54 CEPH02 systemd[1]: Started Network Time Service.
Feb 22 15:11:54 CEPH02 ntpd[3701]: proto: precision = 0.090 usec (-23)
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listen and drop on 0 v6wildcard [::]:123
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listen and drop on 1 v4wildcard 0.0.0.0:123
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listen normally on 2 lo 127.0.0.1:123
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listen normally on 3 eth2 10.12.13.52:123
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listen normally on 4 bond0.1353 10.12.201.52:123
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listen normally on 5 lo [::1]:123
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listen normally on 6 usb0 [fe80::2806:e2ff:fea7:4724%6]:123
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listen normally on 7 bond0 [fe80::3eec:efff:fe5c:9c88%7]:123
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listen normally on 8 bond0.1353 [fe80::3eec:efff:fe5c:9c88%8]:123
Feb 22 15:11:54 CEPH02 ntpd[3701]: Listening on routing socket on fd #25 for interface updates
Feb 22 15:11:54 CEPH02 ntpd[3701]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Feb 22 15:11:54 CEPH02 ntpd[3701]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Feb 22 15:11:54 CEPH02 deploy.py[3709]: /opt/petasan/config/tuning/current/post_deploy_script: line 28: echo: write error: Invalid argument
Feb 22 15:11:54 CEPH02 deploy.py[3709]: /opt/petasan/config/tuning/current/post_deploy_script: line 28: echo: write error: Invalid argument
....
Feb 22 15:17:02 CEPH02 CRON[4530]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Feb 22 15:17:17 CEPH02 ntpd[3701]: Listen normally on 9 eth2 [2001:1470:fff0:1303:3eec:efff:fe60:8854]:123
Feb 22 15:17:17 CEPH02 ntpd[3701]: new interface(s) found: waking up resolver
Feb 22 15:21:26 CEPH02 ntpd[3701]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
admin
2,930 Posts
February 25, 2023, 4:55 pm
Thanks for the feedback.
Ceph is very sensitive to clock sync; a fraction of a second of skew will break the system. A 1h time difference is a lot, and the ntp service may take a long time to sync out a 1h difference. NTP is tricky in how it syncs time: it does not simply set the node time to the target time. Just think, if it did and you had written transaction logs, new entries could end up tagged as older than existing ones.
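If you want to see how far off the mons think they are, you can check with something like the following (the drift threshold default is from memory, verify on your version):
ceph time-sync-status                          # per-mon time sync status / skew as seen by the cluster
ceph config get mon mon_clock_drift_allowed    # skew tolerated before a clock skew warning, roughly 0.05 s by default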
Last edited on February 25, 2023, 4:57 pm by admin · #5
kpiti
23 Posts
February 25, 2023, 5:01 pm
Agreed. All clustered systems rely heavily on synced clocks.. The problem was that ntpd was dead on all but the first node.
Is it by design that all other nodes sync from the first node, or should I set the primary NTP server on all nodes to the same external NTP source?
admin
2,930 Posts
February 25, 2023, 5:20 pm
I am not sure why the ntp service was dead; it should retry to sync rather than quit, and if it dies it should be restarted by systemd.
Yes, it is by design that the nodes sync among themselves (node 3 syncs from node 2) rather than all syncing from an external source. The reason is that the latency between the external source and the cluster nodes could be high, so the nodes could end up out of sync with each other.
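In other words, a non-management node ends up with an /etc/ntp.conf roughly like the lines below (the hostname and IP here are only placeholders to illustrate the chaining, not the exact file that gets generated):
# on the management node: sync from the external source set in the UI
server ntp.example.org iburst
# on later nodes: sync from the previous cluster node
server 10.12.201.51 iburst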