
Ceph cluster manager daemon failure (after 2.3.1 upgrade)

Hi,

We did the PetaSAN upgrade from 2.3.0 to 2.3.1.
The upgrade itself went fine, except that the manager daemon won't start on one node.

After restarting the manager (step 3.3 of the upgrade guide), I got the following message:

Job for ceph-mgr@HBPS03.service failed because the control process exited with error code.
See "systemctl status ceph-mgr@HBPS03.service" and "journalctl -xe" for details.

The status of the service says:

ceph-mgr@HBPS03.service - Ceph cluster manager daemon
Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; indirect; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2019-11-01 18:34:50 CET; 4min 0s ago
Process: 13575 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id HBPS03 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 13575 (code=exited, status=1/FAILURE)

Nov 01 18:34:50 HBPS03 systemd[1]: Stopped Ceph cluster manager daemon.
Nov 01 18:34:50 HBPS03 systemd[1]: ceph-mgr@HBPS03.service: Start request repeated too quickly.
Nov 01 18:34:50 HBPS03 systemd[1]: ceph-mgr@HBPS03.service: Failed with result 'exit-code'.
Nov 01 18:34:50 HBPS03 systemd[1]: Failed to start Ceph cluster manager daemon.
Nov 01 18:38:14 HBPS03 systemd[1]: ceph-mgr@HBPS03.service: Start request repeated too quickly.
Nov 01 18:38:14 HBPS03 systemd[1]: ceph-mgr@HBPS03.service: Failed with result 'exit-code'.
Nov 01 18:38:14 HBPS03 systemd[1]: Failed to start Ceph cluster manager daemon.

I also did a complete reboot of this node without success.

ceph-mgr@HBPS03.service - Ceph cluster manager daemon
Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; indirect; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2019-11-01 18:45:54 CET; 1min 33s ago
Process: 3954 ExecStart=/usr/bin/ceph-mgr -f --cluster ${CLUSTER} --id HBPS03 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 3954 (code=exited, status=1/FAILURE)

Nov 01 18:45:54 HBPS03 systemd[1]: ceph-mgr@HBPS03.service: Service hold-off time over, scheduling restart.
Nov 01 18:45:54 HBPS03 systemd[1]: ceph-mgr@HBPS03.service: Scheduled restart job, restart counter is at 3.
Nov 01 18:45:54 HBPS03 systemd[1]: Stopped Ceph cluster manager daemon.
Nov 01 18:45:54 HBPS03 systemd[1]: ceph-mgr@HBPS03.service: Start request repeated too quickly.
Nov 01 18:45:54 HBPS03 systemd[1]: ceph-mgr@HBPS03.service: Failed with result 'exit-code'.
Nov 01 18:45:54 HBPS03 systemd[1]: Failed to start Ceph cluster manager daemon.

What could be causing this problem?

I have some more information. This is the output of the log ceph-mgr.HBPS03.log:

2019-11-01 18:21:56.858395 7f053c25f800 0 set uid:gid to 64045:64045 (ceph:ceph)
2019-11-01 18:21:56.858413 7f053c25f800 0 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable), process ceph-mgr, pid 2434
2019-11-01 18:21:56.860471 7f053c25f800 0 pidfile_write: ignore empty --pid-file
2019-11-01 18:21:56.865317 7f053c25f800 -1 auth: unable to find a keyring on /var/lib/ceph/mgr/ceph-HBPS03/keyring: (2) No such file or directory
2019-11-01 18:21:56.865332 7f053c25f800 -1 monclient: ERROR: missing keyring, cannot use cephx for authentication

And yes, this is the problem. The directory /var/lib/ceph/mgr/ on node HBPS03 is empty: there is no subdirectory ceph-HBPS03 and no keyring like on the other nodes.

How could this happen, and how can it be fixed?
Can I just create the directory ceph-HBPS03 and copy the keyring from another node?
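I assume I could first check whether the cluster still has a key for this manager at all, for example (assuming the mgr ID matches the hostname, as on the other nodes):

ceph auth get mgr.HBPS03

If that returns a key, writing its output to /var/lib/ceph/mgr/ceph-HBPS03/keyring should presumably be enough. The keyrings on the other nodes contain those nodes' own mgr keys, so simply copying one over probably would not work.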

Thanks for your help.

I understand that you ran the Nautilus upgrade script and followed the guide, and all went well except for one manager. If so, can you run the following on the node in question and show the output:

dpkg -s ceph-mgr | grep Version
ceph status | grep mgr

/opt/petasan/scripts/create_mgr.py
tail /opt/petasan/log/PetaSAN.log

Hi,
We also have another problem: I see that this node is now rebooting every 2-3 hours... 🙁

root@HBPS03:~# dpkg -s ceph-mgr | grep Version
Version: 14.2.2-1bionic

root@HBPS03:~# ceph status | grep mgr
mgr: HBPS01(active, since 13h), standbys: HBPS02

root@HBPS03:~# /opt/petasan/scripts/create_mgr.py
updated caps for client.admin

root@HBPS03:~# tail /opt/petasan/log/PetaSAN.log
02/11/2019 07:44:25 INFO Service is starting.
02/11/2019 07:44:25 INFO Cleaning unused configurations.
02/11/2019 07:44:25 INFO Cleaning all mapped disks
02/11/2019 07:44:25 INFO Cleaning unused rbd images.
02/11/2019 07:44:25 INFO Cleaning unused ips.
02/11/2019 07:50:20 INFO create_mgr() fresh install
02/11/2019 07:50:20 INFO create_mgr() started
02/11/2019 07:50:20 INFO create_mgr() cmd: mkdir -p /var/lib/ceph/mgr/ceph-HBPS03
02/11/2019 07:50:20 INFO create_mgr() cmd: ceph --cluster ceph auth get-or-create mgr.HBPS03 mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-HBPS03/keyring
02/11/2019 07:50:21 INFO create_mgr() ended successfully

OK, after this the keyring file now exists and the ceph-mgr service is running.
Hopefully the rebooting will now stop.
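For reference, the two steps the script logged above would look roughly like this if run by hand on the affected node (just the manual equivalent, plus ownership and a restart, which are not in the script output but should be needed since the service runs with --setuser ceph --setgroup ceph):

mkdir -p /var/lib/ceph/mgr/ceph-HBPS03
ceph --cluster ceph auth get-or-create mgr.HBPS03 mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-HBPS03/keyring
chown -R ceph:ceph /var/lib/ceph/mgr/ceph-HBPS03
systemctl restart ceph-mgr@HBPS03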

Thanks for your quick help!

The problem with the rebooting node still persists. I can see that these reboots are somehow time-triggered, but I have no idea where to look for the reason. And it is an actual reboot of the node, as I can see from the uptime.

This problem first occurred after the upgrade.
Can you please help with this?
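In the meantime, this is roughly what I am looking at to narrow down what triggers the reboots (standard commands plus the PetaSAN log):

last -x reboot shutdown | head
journalctl -b -1 -e
tail -n 200 /opt/petasan/log/PetaSAN.log

These show the reboot history, the end of the previous boot's journal, and the PetaSAN log around the reboot times.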

Monitor Status

As it turns out you are a customer, please log this in our support portal.