
node stays disconnected following power outage


I had high hopes this would be solved in a similarly-named topic, but ...

Let me start with a couple of points:

  • I am not a ceph expert - far from it
  • I have seen this happen 3 times now on different releases of PetaSAN

I have a 3-node PetaSAN 2.2.0 cluster running, with each node running identical SuperMicro kit:

  • Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz (56vcpu)
  • 256GB memory
  • 3x nvme
  • 12x 4TB disk
  • 2x 256GB ssd for os
  • separate physical 10GbE nics, each on its own separate logical network, for:
    • mgmt
    • backend 1
    • backend 2
    • iscsi 1
    • iscsi 2

All nodes can ping each other on all relevant interfaces. This particular cluster has been running well for about 2 months.

Hopefully that covers the needed background ... Now, on to the problem at hand. One node went offline, and was found to be powered off. I powered it back on, and watched 'ceph -w' until the PG_DEGRADED warnings resolved. At this point, ceph status looks like this:

root@pdflvsan01cp003:~# ceph --cluster pdflvsan01 status
  cluster:
    id:     ef401168-163c-4c1f-a301-b5dd67287d7e
    health: HEALTH_WARN
            1/3 mons down, quorum pdflvsan01cp001,pdflvsan01cp002

  services:
    mon: 3 daemons, quorum pdflvsan01cp001,pdflvsan01cp002, out of quorum: pdflvsan01cp003
    mgr: pdflvsan01cp001(active)
    osd: 36 osds: 24 up, 24 in

  data:
    pools:   1 pools, 1024 pgs
    objects: 4483k objects, 17928 GB
    usage:   35873 GB used, 53548 GB / 89422 GB avail
    pgs:     1024 active+clean

Now ... my first clue about what might be going on was that the node failed to resolve the names of the other cluster nodes - and I found that /etc/resolv.conf was empty. That led me to peer into /opt/petasan/config ... among other things that looked wrong, I found that /opt/petasan/config/etc/consul.d contained only a client subdirectory. That fact reminded me of what I had seen in the above-mentioned prior failures, and matches what was seen in another post already mentioned.

To make a longer story a bit shorter, I find that a lot of things (consul, ceph-mon, etc.) are simply not running on this particular node - and it appears to me as if the configuration items shared/replicated between nodes are no longer being shared/replicated to this node, and that this is causing these daemons to fail to start.
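
For reference, this is roughly how I checked which daemons were (not) running - just a quick sketch, the grep pattern is simply what I thought to look for:

pgrep -a -f 'consul|ceph-mon|ceph-osd'
systemctl --failed
systemctl status ceph-mon@pdflvsan01cp003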

Thus far, the PetaSAN documentation appears to only cover the point-and-click stuff in the web interface and none of the under-the-covers stuff, and I have not yet found answers that seem to cover my particular issue. Yet I have seen this same issue at least 3 times now with different versions of PetaSAN. Previously, I simply rebuilt the cluster from the ground up. This time, however, the cluster is in use and taking it down wouldn't be prudent. I would like to get to the bottom of the issue and learn:

  • how this part of PetaSAN works
  • how to solve the issue properly
  • what I may have done wrong which caused this situation

So here I am, asking for assistance from those with more experience than myself. Please ask whatever questions are needed and I'll do my best to follow along. Meanwhile, here are some things I found in the logs which may or may not help:

Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: Started Ceph cluster monitor daemon.
-- Subject: Unit ceph-mon@pdflvsan01cp003.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-mon@pdflvsan01cp003.service has finished starting up.
--
-- The start-up result is done.
Apr 25 07:41:32 pdflvsan01cp003 ceph-mon[55518]: unable to stat setuser_match_path /var/lib/ceph/$type/$cluster-$id: (2) No such file or directory
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Main process exited, code=exited, status=1/FAILURE
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Unit entered failed state.
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Failed with result 'exit-code'.
Apr 25 07:41:42 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Service hold-off time over, scheduling restart.
Apr 25 07:41:42 pdflvsan01cp003 systemd[1]: Stopped Ceph cluster monitor daemon.
-- Subject: Unit ceph-mon@pdflvsan01cp003.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-mon@pdflvsan01cp003.service has finished shutting down.
Apr 25 07:43:18 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Start request repeated too quickly.
Apr 25 07:43:18 pdflvsan01cp003 systemd[1]: Failed to start Ceph cluster monitor daemon.
-- Subject: Unit ceph-mon@pdflvsan01cp003.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-mon@pdflvsan01cp003.service has failed.
--
-- The result is failed.
Apr 25 09:43:25 pdflvsan01cp003 files_sync.py[3484]: HTTPConnectionPool(host='127.0.0.1', port=8500): Max retries exceeded with url: /v1/kv/PetaSAN/Config/Files?recurse=1 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f017
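
That last line suggests the local consul agent is not reachable on port 8500. A quick way to double-check that (sketch; /v1/status/leader is just the standard Consul status endpoint):

ss -tlnp | grep 8500
curl -s http://127.0.0.1:8500/v1/status/leader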

The current state is that one node out of three is not working; the two other nodes created the extra data replicas among themselves, so the PGs recovered from degraded back to active+clean. However, if either of the two remaining nodes fails, the cluster will not respond to I/O, although it will still have one copy of the data.

As to why a power outage would cause a node to fail: it could be many things, and is likely hardware related. Some consumer-grade SSDs are more susceptible to power outages in Ceph, and similarly some controllers with cache but without battery backing can fail in such cases. Search online to see which SSDs are recommended with Ceph and which can get corrupted on power outages. On some models it helps to use the hdparm tool to disable any on-disk caching. It is also possible the failure affected the OS disk, but I would expect that to have made the node non-bootable.
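
For example, something like this will show and then disable the on-disk write cache on a SATA device (just a sketch, /dev/sdX is a placeholder for your actual device):

hdparm -W /dev/sdX     # show current write-cache setting
hdparm -W0 /dev/sdX    # turn the volatile write cache off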

I recommend the following:

First try to see why the services are not starting up:

Does the node have the correct IPs for management, backend 1 and backend 2? Check via

ip addr

If so, can you ping the other nodes on all 3 of these interfaces/subnets?
If yes, then try to start an OSD manually to see what errors you get:

/usr/bin/ceph-osd -f --cluster CLUSTER_NAME --id OSD_ID --setuser ceph --setgroup ceph

What errors do you see on the console? You can also check the Ceph logs in /var/log/ceph.
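
For example (CLUSTER_NAME, OSD_ID and HOSTNAME are placeholders, same as above):

tail -n 100 /var/log/ceph/CLUSTER_NAME-osd.OSD_ID.log
tail -n 100 /var/log/ceph/CLUSTER_NAME-mon.HOSTNAME.log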

If you think the OS disk has gone bad, you can re-install on this node, and during deployment you should select "Replace Management Node" rather than "Join". Leave the OSD disks as they are and do not select them during deployment. If the OSDs are also bad, you may need to delete them and re-add them.

One more thing: do you have a /var/lib/ceph/ path present on the OS disk? Any chance you have another/older PetaSAN OS disk that it is now booting from?
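
A quick way to see which disk you booted from and what is mounted where (sketch):

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
df -h / /var/lib/ceph /opt/petasan/config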

 

I agree on current state.

I do not believe any hardware has failed.

All nodes can ping each other on all relevant interfaces, and all ip addresses are correct.

Starting OSD manually:

root@pdflvsan01cp003:/var/lib/ceph/osd# /usr/bin/ceph-osd -f --cluster pdflvsan01 --id 24 --setuser ceph --setgroup ceph
2019-04-25 11:33:22.781580 7f5f4f9cee00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/pdflvsan01-24: (2) No such file or directory

root@pdflvsan01cp003:/var/lib/ceph/osd# cat /var/log/ceph/pdflvsan01-osd.24.log
2019-04-25 11:33:22.781193 7f5f4f9cee00 0 set uid:gid to 64045:64045 (ceph:ceph)
2019-04-25 11:33:22.781206 7f5f4f9cee00 0 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable), process ceph-osd, pid 64769
2019-04-25 11:33:22.781580 7f5f4f9cee00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/pdflvsan01-24: (2) No such file or directory

I also found in the kernel log, just after boot, many instances of the following, one for each physical disk used for an OSD:

Apr 24 10:34:17 pdflvsan01cp003 kernel: [ 162.946149] XFS (sdl1): Mounting V5 Filesystem
Apr 24 10:34:17 pdflvsan01cp003 kernel: [ 162.966518] XFS (sdl1): Ending clean mount
Apr 24 10:34:17 pdflvsan01cp003 kernel: [ 162.969936] XFS (sdl1): Unmounting Filesystem

 

 

Also: I do not believe this node is booting from an older installation. If it were doing so, would 'ceph status' report properly against the current cluster?
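
One cross-check I can think of (assuming the standard /etc/ceph/<cluster>.conf location implied by the --cluster option) is to compare the fsid in the local config against the cluster id reported by the monitors:

grep fsid /etc/ceph/pdflvsan01.conf
ceph --cluster pdflvsan01 fsid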

 

Perhaps I was wrong in thinking that configurations are not being replicated:

root@pdflvsan01cp003:/var/log/ceph# gluster peer status
Number of Peers: 2

Hostname: 10.205.2.11
Uuid: c7fc65b3-166b-465d-9578-e50de6725461
State: Peer in Cluster (Connected)
Other names:
pdflvsan01cp002

Hostname: 10.205.2.10
Uuid: 91a22f87-4d83-445f-a7d5-1a6e5f555556
State: Peer in Cluster (Connected)
Other names:
pdflvsan01cp001
root@pdflvsan01cp003:/var/log/ceph# gluster volume info

Volume Name: gfs-vol
Type: Replicate
Volume ID: ab2592c5-8c3b-4bf8-8c0f-0296db0254fb
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.205.2.10:/opt/petasan/config/gfs-brick
Brick2: 10.205.2.12:/opt/petasan/config/gfs-brick
Brick3: 10.205.2.11:/opt/petasan/config/gfs-brick
Options Reconfigured:
nfs.disable: true
network.ping-timeout: 5
performance.readdir-ahead: on

It was not clear whether you ran ceph status on that node itself; given that you did, then yes, it seems to be the correct boot disk.

I am not saying you had a hardware failure, but it is possible the SSDs and/or controller are not robust against power failures, causing data corruption; this could be down to how they cache data, as per my earlier post. What disk models do you use? Can you check with hdparm whether they have their cache enabled? Do you use a controller with cache? Does it have battery backing?
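
For example, to list the disk models and current write-cache settings in one go (sketch; hdparm applies to SATA devices, SAS disks may need a tool such as sdparm instead):

lsblk -d -o NAME,MODEL,SIZE,ROTA
for d in /dev/sd?; do echo "== $d"; hdparm -W $d; done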

Can you confirm that partition 1 on all the OSDs is not mounted (via the mount command)? If you try to mount them manually to a temp directory, do they mount? If not, then I would recommend you re-install the node, deploy with "Replace Management Node", and re-create the OSDs. There is no point trying to repair the XFS filesystem if you already have clean copies of the data on the other nodes. This would be a short-term fix; longer term you should check whether your disks can withstand power failures without data corruption.

 

None of the OSDs have partition 1 mounted; they appear to mount on boot, then unmount, as depicted in a previous log snippet. They are definitely mountable, however - here are two examples:

 

root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $1 }'

root@pdflvsan01cp003:~# mount /dev/sdh1 /mnt
root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $1 }'
/dev/sdh1

root@pdflvsan01cp003:~# find /mnt/
/mnt/

root@pdflvsan01cp003:~# df -h /mnt
Filesystem Size Used Avail Use% Mounted on
/dev/sdh1 94M 5.5M 89M 6% /mnt

root@pdflvsan01cp003:~# umount /mnt/

root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $1 }'

root@pdflvsan01cp003:~# mount /dev/sda1 /mnt
root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $1 }'
systemd-1
/dev/sda1

root@pdflvsan01cp003:~# find /mnt/
/mnt/

root@pdflvsan01cp003:~# df -h /mnt
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 94M 5.5M 89M 6% /mnt

root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $0 }'
/dev/sda1 on /mnt type xfs (rw,relatime,attr2,inode64,noquota)

root@pdflvsan01cp003:~# umount /mnt

Clearly these filesystems appear to be blank. What might cause that, exactly? Surely not all of these lost all their data and reverted to a fresh blank filesystem due to the power outage ... I think it more likely that they were somehow purposefully blanked. Here is a quick snippet from one of the surviving nodes, for comparison purposes:

root@pdflvsan01cp001:~# mount | awk '$1 ~ "dev.*1" { print $0 }'
/dev/sdd1 on /var/lib/ceph/osd/pdflvsan01-13 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdc1 on /var/lib/ceph/osd/pdflvsan01-15 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sde1 on /var/lib/ceph/osd/pdflvsan01-12 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdm1 on /var/lib/ceph/osd/pdflvsan01-19 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdl1 on /var/lib/ceph/osd/pdflvsan01-18 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdj1 on /var/lib/ceph/osd/pdflvsan01-20 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdi1 on /var/lib/ceph/osd/pdflvsan01-23 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdk1 on /var/lib/ceph/osd/pdflvsan01-21 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdb1 on /var/lib/ceph/osd/pdflvsan01-14 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sda1 on /var/lib/ceph/osd/pdflvsan01-16 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdn1 on /var/lib/ceph/osd/pdflvsan01-17 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdh1 on /var/lib/ceph/osd/pdflvsan01-22 type xfs (rw,noatime,attr2,inode64,noquota)

root@pdflvsan01cp001:~# find /var/lib/ceph/osd/pdflvsan01-13
/var/lib/ceph/osd/pdflvsan01-13
/var/lib/ceph/osd/pdflvsan01-13/ceph_fsid
/var/lib/ceph/osd/pdflvsan01-13/fsid
/var/lib/ceph/osd/pdflvsan01-13/magic
/var/lib/ceph/osd/pdflvsan01-13/block.db_uuid
/var/lib/ceph/osd/pdflvsan01-13/block.db
/var/lib/ceph/osd/pdflvsan01-13/block_uuid
/var/lib/ceph/osd/pdflvsan01-13/block
/var/lib/ceph/osd/pdflvsan01-13/type
/var/lib/ceph/osd/pdflvsan01-13/keyring
/var/lib/ceph/osd/pdflvsan01-13/whoami
/var/lib/ceph/osd/pdflvsan01-13/activate.monmap
/var/lib/ceph/osd/pdflvsan01-13/kv_backend
/var/lib/ceph/osd/pdflvsan01-13/bluefs
/var/lib/ceph/osd/pdflvsan01-13/mkfs_done
/var/lib/ceph/osd/pdflvsan01-13/ready
/var/lib/ceph/osd/pdflvsan01-13/systemd
/var/lib/ceph/osd/pdflvsan01-13/active

I don't mind rebuilding node3 and having ceph rebalance - but I want to learn as much as possible from this situation first. Given what we know at this point, what are my options? Are there next steps?
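
One non-destructive thing I am tempted to try in the meantime is a read-only XFS consistency check on one of these partitions (sketch; -n means no modifications are made):

# run only while the partition is unmounted
xfs_repair -n /dev/sdh1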

They were not blanked; the filesystem is likely corrupt. If you use

mount | awk '$1 ~ "dev.*1$" { print $0 }'

does it show the mount point and list the type as xfs? Do you see any dmesg messages if you manually mount them?
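
For example (sdX is a placeholder):

mount /dev/sdX1 /mnt
dmesg | tail -n 20     # look for XFS complaints right after the mount
umount /mnt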

To double-check: is the OS disk OK? Are /opt/petasan/config and /var/lib/ceph mounted? Can you touch a file in them?

For the short term, the re-install will get you going with 3 nodes. Longer term, investigate the model of SSD used with your OSDs: does it have power-loss protection? Many consumer-grade SSDs do not, in which case it may be better to replace the OSD disks.

Yes, these filesystems look similar to the ones on the surviving node:

/dev/sdh1 on /mnt type xfs (rw,relatime,attr2,inode64,noquota)

[15512.508162] XFS (sdh1): Mounting V5 Filesystem
[15512.636997] XFS (sdh1): Ending clean mount

The latter shows up on boot as well, but then gets immediately unmounted as shown in an earlier snippet.

The other filesystems are mounted and writable:

root@pdflvsan01cp003:~# mount | grep -e /opt/petasan/config -e /var/lib/ceph
/dev/sdm4 on /var/lib/ceph type ext4 (rw,relatime,data=ordered)
/dev/sdm5 on /opt/petasan/config type ext4 (rw,relatime,data=ordered)
10.205.2.10:gfs-vol on /opt/petasan/config/shared type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

root@pdflvsan01cp003:~# touch /var/lib/ceph/zzfoo /opt/petasan/config/zzfoo

root@pdflvsan01cp003:~# ls -l /var/lib/ceph/zzfoo /opt/petasan/config/zzfoo
-rw-r--r-- 1 root root 0 Apr 25 16:10 /opt/petasan/config/zzfoo
-rw-r--r-- 1 root root 0 Apr 25 16:10 /var/lib/ceph/zzfoo

root@pdflvsan01cp003:~# rm /var/lib/ceph/zzfoo /opt/petasan/config/zzfoo
root@pdflvsan01cp003:~#

I am curious about how the complete consul configuration could have been lost, though ... here is what the affected node has:

root@pdflvsan01cp003:~# find /opt/petasan/config/etc/consul.d -ls
1835032 4 drwxr-xr-x 3 root root 4096 Apr 24 09:54 /opt/petasan/config/etc/consul.d
1835033 4 drwxr-xr-x 2 root root 4096 Apr 24 09:54 /opt/petasan/config/etc/consul.d/client
1835034 4 -rw-r--r-- 1 root root 202 Apr 24 09:54 /opt/petasan/config/etc/consul.d/client/config.json

What writes the consul configuration to that location, and when? Is it somehow synchronized across the cluster, or does it exist statically on every node? Somewhere, somehow, server/config.json got removed and replaced with client/config.json ... From a surviving node:

root@pdflvsan01cp001:~# find /opt/petasan/config/etc/consul.d -ls
6553624 4 drwxr-xr-x 3 root root 4096 Feb 21 02:11 /opt/petasan/config/etc/consul.d
6553625 4 drwxr-xr-x 2 root root 4096 Feb 21 02:11 /opt/petasan/config/etc/consul.d/server
6553626 4 -rw-r--r-- 1 root root 195 Feb 21 02:11 /opt/petasan/config/etc/consul.d/server/config.json

 

As for the disks, the OS uses SSDSC2BB240G7, which has enhanced power loss data protection. And this appears to be where the consul config is stored.

The OSDs are spinning rust and not SSD. Specifically: HGST Ultrastar 4TB 3.5" 7200RPM SAS3 ... not sure about battery or supercaps on those or the LSI controller though.

I am also unsure about the NVMe devices used for the journals ...
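
If it helps, this is how I plan to identify them and check whether they report a volatile write cache (sketch, assuming nvme-cli is available):

nvme list
nvme id-ctrl /dev/nvme0 | grep -i vwc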

Meanwhile, all I have found in the kernel logs relating to the XFS filesystems is that they mounted cleanly before the power incident, and that afterwards they again mounted cleanly and then immediately unmounted.

It could very well be the controller; if caching is on without battery backing, that could be the issue. If you can, test unplugging power on a test cluster with different controller caching settings to see what works and what does not. Failing that, and depending on your environment, you could potentially do this test on the current node after re-installation, but that is a tough call with a production system. Note that turning off caching with spinning disks will drop performance.

The consul setup showing client rather than server on this node is puzzling. The only explanation I see, given you mentioned this is not the first time you have had a power-loss issue, is that you may have re-installed and selected "Join Existing Cluster" rather than "Replace Management Node", so when you re-install, select the latter.
