1/3 Mons down - out of quorum during update

I was in the process of updating the cluster (see topic: Update Issue).

It is a 6-node cluster with nodes named petasan1 through petasan6.

I have already applied the updates to nodes petasan2, petasan3 and petasan6.

Directions say to only perform update when Health is shown as OK.

The system finished fixing misplaced objects after I updated petasan2 this morning, but health is still at Warning because petasan1 is out of quorum. Is this because it has not been updated yet? Is it OK to update this node even though health is Warn?

What is the best way to proceed?

Neil

It should be in quorum, even if not updated yet.

Can you try restarting the mon service on the out-of-quorum node:

systemctl restart ceph-mon@$(hostname)

Do you see any errors in the logs in /var/log/ceph?

I ran status first to see if it was running; it was.

root@petasan1:~# systemctl status ceph-mon@petasan1.service

ceph-mon@petasan1.service - Ceph cluster monitor daemon

     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)

     Active: active (running) since Sat 2023-03-04 16:50:15 EST; 7 months 12 days ago

   Main PID: 29606 (ceph-mon)

      Tasks: 27

     Memory: 2.5G

     CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@petasan1.service

             └─29606 /usr/bin/ceph-mon -f --cluster ceph --id petasan1 --setuser ceph --setgroup ceph

Oct 16 14:52:41 petasan1 ceph-mon[29606]: 2023-10-16T14:52:41.673-0400 7f0ddd136700 -1 mon.petasan1@1(probing) e5 get_health_metrics reporting 172 slow ops, oldest is mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["35"]} v 0)

Oct 16 14:52:46 petasan1 ceph-mon[29606]: 2023-10-16T14:52:46.673-0400 7f0ddd136700 -1 mon.petasan1@1(probing) e5 get_health_metrics reporting 172 slow ops, oldest is mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["35"]} v 0)

Oct 16 14:52:51 petasan1 ceph-mon[29606]: 2023-10-16T14:52:51.673-0400 7f0ddd136700 -1 mon.petasan1@1(probing) e5 get_health_metrics reporting 172 slow ops, oldest is mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["35"]} v 0)

Oct 16 14:52:56 petasan1 ceph-mon[29606]: 2023-10-16T14:52:56.673-0400 7f0ddd136700 -1 mon.petasan1@1(probing) e5 get_health_metrics reporting 172 slow ops, oldest is mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["35"]} v 0)

Oct 16 14:53:01 petasan1 ceph-mon[29606]: 2023-10-16T14:53:01.674-0400 7f0ddd136700 -1 mon.petasan1@1(probing) e5 get_health_metrics reporting 172 slow ops, oldest is mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["35"]} v 0)

Oct 16 14:53:06 petasan1 ceph-mon[29606]: 2023-10-16T14:53:06.674-0400 7f0ddd136700 -1 mon.petasan1@1(probing) e5 get_health_metrics reporting 172 slow ops, oldest is mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["35"]} v 0)

Oct 16 14:53:11 petasan1 ceph-mon[29606]: 2023-10-16T14:53:11.674-0400 7f0ddd136700 -1 mon.petasan1@1(probing) e5 get_health_metrics reporting 172 slow ops, oldest is mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["35"]} v 0)

Oct 16 14:53:16 petasan1 ceph-mon[29606]: 2023-10-16T14:53:16.674-0400 7f0ddd136700 -1 mon.petasan1@1(probing) e5 get_health_metrics reporting 172 slow ops, oldest is mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["35"]} v 0)

Oct 16 14:53:21 petasan1 ceph-mon[29606]: 2023-10-16T14:53:21.674-0400 7f0ddd136700 -1 mon.petasan1@1(probing) e5 get_health_metrics reporting 172 slow ops, oldest is mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["35"]} v 0)

Oct 16 14:53:26 petasan1 ceph-mon[29606]: 2023-10-16T14:53:26.674-0400 7f0ddd136700 -1 mon.petasan1@1(probing) e5 get_health_metrics reporting 172 slow ops, oldest is mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["35"]} v 0)

root@petasan1:~# systemctl restart ceph-mon@petasan1.service


 

So I restarted it:

root@petasan1:~# systemctl status ceph-mon@petasan1.service

ceph-mon@petasan1.service - Ceph cluster monitor daemon

     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)

     Active: active (running) since Mon 2023-10-16 14:53:41 EDT; 1min 57s ago

   Main PID: 1057613 (ceph-mon)

      Tasks: 26

     Memory: 348.1M

     CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@petasan1.service

             └─1057613 /usr/bin/ceph-mon -f --cluster ceph --id petasan1 --setuser ceph --setgroup ceph

Oct 16 14:53:41 petasan1 systemd[1]: Started Ceph cluster monitor daemon.

Oct 16 14:53:48 petasan1 ceph-mon[1057613]: 2023-10-16T14:53:48.802-0400 7f484bd1c580 -1 compacting monitor store ...

Oct 16 14:53:49 petasan1 ceph-mon[1057613]: 2023-10-16T14:53:49.918-0400 7f484bd1c580 -1 done compacting

 


ceph -s still says it is out of quorum:

cluster:

    id:     1da111ec-ffe8-4029-9834-e0988079925b

    health: HEALTH_WARN

            1/3 mons down, quorum petasan3,petasan2

  services:

    mon: 3 daemons, quorum petasan3,petasan2 (age 41h), out of quorum: petasan1

    mgr: petasan2(active, since 41h), standbys: petasan3, petasan1

    mds: 1/1 daemons up, 1 standby

    osd: 155 osds: 155 up (since 41h), 155 in (since 41h)

  data:

    volumes: 1/1 healthy

    pools:   6 pools, 3489 pgs

    objects: 54.87M objects, 208 TiB

    usage:   423 TiB used, 694 TiB / 1.1 PiB avail

    pgs:     3468 active+clean

             13   active+clean+scrubbing

             8    active+clean+scrubbing+deep

  io:

    client:   2.5 MiB/s rd, 629 op/s rd, 0 op/s wr
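
For scripting or alerting, the out-of-quorum monitor can be pulled out of the status text. This is only a sketch: it assumes the plain-text `ceph -s` layout shown above, and the function name `out_of_quorum_mons` is made up for illustration. Where available, `ceph quorum_status --format json` is likely a more robust source than parsing text.

```shell
# Print the names of monitors listed as out of quorum, one per line.
# Reads `ceph -s` style text on stdin and assumes the
# "out of quorum: a,b" wording seen in the status output above.
out_of_quorum_mons() {
    sed -n 's/.*out of quorum: *//p' | tr ',' '\n' | sed 's/^ *//; s/ *$//'
}

# Example (on a cluster node):
#   ceph -s | out_of_quorum_mons
```

An empty result means the status text did not report any monitor as out of quorum.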

 


In ceph-mon.petasan1.log this is repeating over and over:

2023-10-16T14:17:04.199-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.199-0400 7f0ddf13a700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.203-0400 7f0dde939700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.203-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.203-0400 7f0ddf13a700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.203-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.203-0400 7f0dde939700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.203-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.203-0400 7f0ddf13a700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.203-0400 7f0dde939700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.203-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.203-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.203-0400 7f0dde939700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.203-0400 7f0ddf13a700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.207-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.207-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.207-0400 7f0ddf13a700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.207-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.207-0400 7f0dde939700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.207-0400 7f0ddf13a700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.207-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.207-0400 7f0dde939700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.207-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.207-0400 7f0ddf13a700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.211-0400 7f0dde939700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.211-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.211-0400 7f0ddf13a700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.211-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.211-0400 7f0dde939700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.211-0400 7f0ddf13a700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.211-0400 7f0dd992f700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.211-0400 7f0dde939700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.211-0400 7f0dde939700  0 can't decode unknown message type 140 MSG_AUTH=17

2023-10-16T14:17:04.211-0400 7f0ddf13a700  0 can't decode unknown message type 140 MSG_AUTH=17
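
When a log repeats the same line thousands of times, it can help to collapse it into counts per distinct message before posting. A minimal sketch, assuming the line layout seen above (timestamp, thread id, log level, then the message); `summarize_mon_log` is a hypothetical helper name, and the log path in the example is an assumption based on the file name mentioned in this post.

```shell
# Collapse a Ceph log into "count message" lines, most frequent first.
# Assumes each line is: timestamp, thread id, log level, message text.
summarize_mon_log() {
    logfile=$1
    top=${2:-10}   # how many distinct messages to show (default 10)
    # Blank the first three fields, trim the leading spaces left behind,
    # then count identical messages.
    awk '{ $1 = ""; $2 = ""; $3 = ""; sub(/^ +/, ""); print }' "$logfile" \
        | sort | uniq -c | sort -rn | head -n "$top"
}

# Example:
#   summarize_mon_log /var/log/ceph/ceph-mon.petasan1.log 5
```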

 

 

 

It is strange, but I would go ahead and upgrade that node; it should fix it.

I tried the update. It was not successful and I am not sure what I should do at this point.

The full output is long, so I will provide a link below. The start of the update is:

START:

root@petasan1:~# /opt/petasan/scripts/online-updates/update.sh
Hit:2 https://repo.45drives.com/debian focal InRelease
Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal-security InRelease
Hit:1 https://archive.petasan.org/repo_v3 petasan-v3 InRelease
Reading package lists... Done
Building dependency tree
Reading state information... Done
195 packages can be upgraded. Run 'apt list --upgradable' to see them.
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
The following packages were automatically installed and are no longer required:
libleveldb1d liblttng-ust-ctl4 liblttng-ust0
Use 'apt autoremove' to remove them.
The following NEW packages will be installed:
catatonit ceph-volume criu crun dbus-user-session dns-root-data dnsmasq-base
libfuse3-3 libidn11 liblua5.3-0 libnet1 libprotobuf-c1 libsqlite3-mod-ceph
linux-image-5.14.21-04-petasan podman-machine-cni podman-plugins
python3-protobuf
The following packages will be upgraded:
amd64-microcode apache2 apache2-bin apache2-data apache2-utils base-files
bind9-host bind9-libs binutils binutils-common binutils-x86-64-linux-gnu
bsdutils ceph ceph-base ceph-common ceph-mds ceph-mgr ceph-mgr-cephadm
ceph-mgr-dashboard ceph-mgr-modules-core ceph-mon ceph-osd ceph-petasan
cephadm conmon containernetworking-plugins containers-common cpupower ctdb
curl dmidecode fdisk gcc-10-base graphite-web intel-microcode iotop iptables
krb5-user libapparmor1 libaprutil1 libaprutil1-dbd-sqlite3 libaprutil1-ldap
libavahi-client3 libavahi-common-data libavahi-common3 libbinutils libblkid1
libc-ares2 libc-bin libc-dev-bin libc6 libc6-dev libcap2 libcap2-bin
libcephfs2 libctf-nobfd0 libctf0 libcups2 libcurl3-gnutls libcurl4 libdw1
libelf1 libfdisk1 libfreetype6 libgcc-s1 libgcc1 libglib2.0-0 libgpgme11
libgssapi-krb5-2 libgssrpc4 libip4tc2 libip6tc2 libk5crypto3
libkadm5clnt-mit11 libkadm5srv-mit11 libkdb5-9 libkrb5-3 libkrb5support0
libldb2 libmount1 libmysqlclient21 libncurses5 libncurses6 libncursesw5
libncursesw6 libnghttp2-14 libnss-winbind libpam-cap libpam-systemd
libpam-winbind libperl5.30 libprotobuf17 libpython3.8 libpython3.8-minimal
libpython3.8-stdlib librados2 libradosstriper1 librbd1 librgw2
librte-eal20.0 librte-ethdev20.0 librte-kvargs20.0 librte-mbuf20.0
librte-mempool20.0 librte-meter20.0 librte-net20.0 librte-ring20.0
libsmartcols1 libsnmp-base libsnmp-dev libsnmp35 libssh-4 libssl-dev
libssl1.1 libstdc++6 libsystemd0 libtalloc2 libtdb1 libtevent0 libtiff5
libtinfo5 libtinfo6 libudev-dev libudev1 libunwind8 libuuid1 libwbclient0
libwebp6 libx11-6 libx11-data libxml2 libxpm4 libxtables12
linux-image-petasan linux-libc-dev locales mount ncurses-base ncurses-bin
nfs-common openssh-client openssh-server openssh-sftp-server openssl perl
perl-base perl-modules-5.30 petasan petasan-container-images
petasan-firmware petasan-stats-config podman python3-ceph-argparse
python3-ceph-common python3-cephfs python3-django python3-flask python3-ldb
python3-rados python3-rbd python3-requests python3-rgw python3-samba
python3-sqlparse python3-talloc python3-tdb python3-werkzeug python3.8
python3.8-minimal radosgw rbd-fuse rbd-mirror rbd-nbd samba samba-common
samba-common-bin samba-libs samba-vfs-modules slirp4netns snmp snmpd sudo
sysstat systemd systemd-sysv tdb-tools tzdata udev util-linux uuid-runtime
vim vim-common vim-runtime winbind xxd
195 upgraded, 17 newly installed, 0 to remove and 0 not upgraded.
Need to get 594 MB of archives.
After this operation, 49.0 MB of additional disk space will be used.
Get:3 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libc6-dev amd64 2.31-0ubuntu9.12 [2,519 kB]
Get:10 https://repo.45drives.com/debian focal/main amd64 dmidecode amd64 3.3-3 [62.9 kB]
Get:1 https://archive.petasan.org/repo_v3 petasan-v3/updates amd64 rbd-mirror am

 

--

END OF OUTPUT:

cp: cannot create regular file '/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d/blacklist-framebuffer.conf': No such file or directory
mkdir: cannot create directory ‘/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d’: No space left on device
cp: cannot create regular file '/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d/blacklist-rare-network.conf': No such file or directory
mkdir: cannot create directory ‘/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d’: No space left on device
cp: cannot create regular file '/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d/blacklist.conf': No such file or directory
mkdir: cannot create directory ‘/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d’: No space left on device
cp: cannot create regular file '/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d/intel-microcode-blacklist.conf': No such file or directory
mkdir: cannot create directory ‘/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d’: No space left on device
cp: cannot create regular file '/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d/iwlwifi.conf': No such file or directory
mkdir: cannot create directory ‘/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d’: No space left on device
cp: cannot create regular file '/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d/mdadm.conf': No such file or directory
mkdir: cannot create directory ‘/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d’: No space left on device
cp: cannot create regular file '/var/tmp/mkinitramfs_lmxSWL//etc/modprobe.d/owfs-common.conf': No such file or directory
mkdir: cannot create directory ‘/var/tmp/mkinitramfs_lmxSWL//usr/lib/modprobe.d’: No space left on device
cp: cannot create regular file '/var/tmp/mkinitramfs_lmxSWL//usr/lib/modprobe.d/aliases.conf': No such file or directory
mkdir: cannot create directory ‘/var/tmp/mkinitramfs_lmxSWL//usr/lib/modprobe.d’: No space left on device
cp: cannot create regular file '/var/tmp/mkinitramfs_lmxSWL//usr/lib/modprobe.d/fbdev-blacklist.conf': No such file or directory
mkdir: cannot create directory ‘/var/tmp/mkinitramfs_lmxSWL//usr/lib/modprobe.d’: No space left on device
cp: cannot create regular file '/var/tmp/mkinitramfs_lmxSWL//usr/lib/modprobe.d/systemd.conf': No such file or directory
mktemp: failed to create directory via template ‘/var/tmp/mkinitramfs-EFW_XXXXXXXXXX’: No space left on device
E: amd64-microcode: cannot create temporary directory
E: /usr/share/initramfs-tools/hooks/amd64_microcode failed with return 1.
update-initramfs: failed for /boot/initrd.img-4.12.14-28-petasan with 1.
dpkg: error processing package initramfs-tools (--configure):
installed initramfs-tools package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
linux-image-5.14.21-04-petasan
linux-image-petasan
petasan
initramfs-tools
E: Sub-process /usr/bin/dpkg returned an error code (1)
dpkg: error: version '-petasan' has bad syntax: version number is empty
root@petasan1:~#

 

What should I do next?

 

 

Link to full output (2MB)
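
With output this long, the underlying cause is easy to miss among the repeated lines. One way to surface it is to count the distinct error reasons reported by `cp`, `mkdir` and `mktemp`. This is a sketch only; `error_reasons` is a hypothetical helper and `update-output.txt` is an assumed name for wherever the update output was saved.

```shell
# Count the distinct error reasons reported by cp/mkdir/mktemp in a saved log.
# The reason is taken to be the text after the last ": " on each matching line.
error_reasons() {
    grep -E '^(cp|mkdir|mktemp): ' "$1" \
        | sed 's/.*: //' \
        | sort | uniq -c | sort -rn
}

# Example:
#   error_reasons update-output.txt
```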

"No space left on device" suggests the OS disk partition is out of space. Maybe the log directory is full. Can you check with the df or du command and clear any old log files in /var/log? Maybe start with /var/log/ceph.
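
The df and du checks can be wrapped in a few small helpers, for example to confirm there is room before starting an update. This is a rough sketch: the function names and the thresholds in the example are assumptions, not values from PetaSAN, and the `df` parsing assumes mount sources without spaces in their names.

```shell
# Print the free space, in KiB, of the filesystem that holds the given path.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if at least min_mb MiB are free on the filesystem holding path.
require_free_mb() {
    path=$1
    min_mb=$2
    [ "$(avail_kb "$path")" -ge $((min_mb * 1024)) ]
}

# List the largest directories directly under a directory (sizes in KiB).
# The first line of output is the total for the directory itself.
largest_dirs() {
    du -xk -d 1 "$1" 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example (thresholds are illustrative only):
#   largest_dirs /var/log
#   require_free_mb /var/tmp 1024 && require_free_mb /boot 200 \
#       && /opt/petasan/scripts/online-updates/update.sh
```

The errors in this thread came from `/var/tmp` while the initramfs was being built, so that path is a reasonable one to check alongside `/var/log`.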

Ugh! I didn't even notice that.

I have cleared out lots of .gz log files and now have 1.9G of space available.

But now when I try to run an update I get the following:

root@petasan1:/var/log/ceph# /opt/petasan/scripts/online-updates/update.sh 

No current version of PetaSAN installed.

It seems the package install process was left in a broken state by the full device. Can you try fixing it with

apt -f install

Otherwise, can you show the output of

dpkg -l
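
The full `dpkg -l` listing is long, so it can help to filter it down to packages that were left half done. A minimal sketch, assuming the standard two-letter state column at the start of each `dpkg -l` line; `unfinished_packages` is a hypothetical helper name.

```shell
# Filter `dpkg -l` output (stdin) down to packages that are selected for
# install but are not fully installed, for example:
#   iU = unpacked, iF = half-configured, iH = half-installed.
# Prints the state and the package name.
unfinished_packages() {
    awk '/^i[^i]/ { print $1, $2 }'
}

# Example (on the affected node):
#   dpkg -l | unfinished_packages
#   dpkg --configure -a    # ask dpkg to finish configuring unpacked packages
```

After freeing disk space, running `dpkg --configure -a` and then `apt -f install` is a common recovery sequence for an interrupted upgrade. The thread does not say which step resolved it here.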

This is resolved, and everything is up to date on the newest version. I really appreciate the assistance. You guys are great!