node stays disconnected following power outage
deweyhylton
14 Posts
April 25, 2019, 1:52 pm
I had high hopes this would be solved in a similarly-named topic, but ...
Let me start with a couple of points:
- I am not a ceph expert - far from it
- I have seen this happen 3 times now on different releases of PetaSAN
I have a 3-node PetaSAN 2.2.0 cluster running, with each node running identical SuperMicro kit:
- Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz (56vcpu)
- 256GB memory
- 3x nvme
- 12x 4TB disk
- 2x 256GB ssd for os
- separate physical 10GbE nics, each on its own separate logical network, for:
- mgmt
- backend 1
- backend 2
- iscsi 1
- iscsi 2
Each node can ping each other on all relevant interfaces. This particular cluster has been running well for about 2 months.
Hopefully that covers the needed background ... Now, on to the problem at hand. One node went offline, and was found to be powered off. I powered it back on, and watched 'ceph -w' until the PG_DEGRADED warnings resolved. At this point, ceph status looks like this:
root@pdflvsan01cp003:~# ceph --cluster pdflvsan01 status
cluster:
id: ef401168-163c-4c1f-a301-b5dd67287d7e
health: HEALTH_WARN
1/3 mons down, quorum pdflvsan01cp001,pdflvsan01cp002
services:
mon: 3 daemons, quorum pdflvsan01cp001,pdflvsan01cp002, out of quorum: pdflvsan01cp003
mgr: pdflvsan01cp001(active)
osd: 36 osds: 24 up, 24 in
data:
pools: 1 pools, 1024 pgs
objects: 4483k objects, 17928 GB
usage: 35873 GB used, 53548 GB / 89422 GB avail
pgs: 1024 active+clean
Now ... my first clue about what might be going on was that the node failed to resolve the names for the other cluster nodes - and I found that /etc/resolv.conf was empty. That led me to peer into /opt/petasan/config ... among other things that looked wrong, I found that /opt/petasan/config/etc/consul.d contained only a client subdirectory. That fact reminded me of what I had seen in the above-mentioned prior failures, and it matches what was seen in another post already mentioned.
To make a longer story a bit shorter, I find that a lot of things (consul, ceph-mon, etc.) are simply not running on this particular node - and it appears to me as if the configuration items shared/replicated between nodes are no longer being shared/replicated to this node and that is causing these daemons to fail to start.
Thus far, the PetaSAN documentation appears to cover only the point-and-click stuff in the web interface and none of the under-the-covers stuff. And I have not yet found answers which seem to cover my particular issue. Yet I have seen this same issue at least 3 times now with different versions of PetaSAN. Previously, I simply rebuilt the cluster from the ground up. This time, however, the cluster is in use and taking it down wouldn't be prudent. I would like to get to the bottom of the issue and learn:
- how this part of PetaSAN works
- how to solve the issue properly
- what I may have done wrong which caused this situation
So here I am, asking for assistance from those with more experience than myself. Please ask whatever questions that are needed and I'll do my best to follow along. Meanwhile, some things I found in the logs which may or may not help:
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: Started Ceph cluster monitor daemon.
-- Subject: Unit ceph-mon@pdflvsan01cp003.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-mon@pdflvsan01cp003.service has finished starting up.
--
-- The start-up result is done.
Apr 25 07:41:32 pdflvsan01cp003 ceph-mon[55518]: unable to stat setuser_match_path /var/lib/ceph/$type/$cluster-$id: (2) No such file or directory
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Main process exited, code=exited, status=1/FAILURE
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Unit entered failed state.
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Failed with result 'exit-code'.
Apr 25 07:41:42 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Service hold-off time over, scheduling restart.
Apr 25 07:41:42 pdflvsan01cp003 systemd[1]: Stopped Ceph cluster monitor daemon.
-- Subject: Unit ceph-mon@pdflvsan01cp003.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-mon@pdflvsan01cp003.service has finished shutting down.
Apr 25 07:43:18 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Start request repeated too quickly.
Apr 25 07:43:18 pdflvsan01cp003 systemd[1]: Failed to start Ceph cluster monitor daemon.
-- Subject: Unit ceph-mon@pdflvsan01cp003.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-mon@pdflvsan01cp003.service has failed.
--
-- The result is failed.
Apr 25 09:43:25 pdflvsan01cp003 files_sync.py[3484]: HTTPConnectionPool(host='127.0.0.1', port=8500): Max retries exceeded with url: /v1/kv/PetaSAN/Config/Files?recurse=1 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f017
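For anyone retracing this, these are the kind of quick checks that confirm those two symptoms on the affected node (the mon data directory path below is assumed from the $cluster-$id pattern in the log above):
pgrep -a consul                                                  # is a consul agent even running locally?
curl -s http://127.0.0.1:8500/v1/status/leader                   # the endpoint files_sync.py keeps retrying on port 8500
ls -ld /var/lib/ceph/mon/pdflvsan01-pdflvsan01cp003              # the mon data dir the failing unit appears to expect
journalctl -u ceph-mon@pdflvsan01cp003 --no-pager | tail -n 20   # recent output from the failing mon unit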
Last edited on April 25, 2019, 1:58 pm by deweyhylton · #1
admin
2,930 Posts
April 25, 2019, 2:50 pm
The current state is that 1 node out of 3 is not working; the 2 other nodes created the extra data replicas among themselves, so the PGs recovered from degraded to active+clean. However, if either of the 2 remaining nodes fails, the cluster will not respond to i/o but will still have 1 copy of the data.
As to why a power outage would cause a node to fail: it could be many things and is likely hardware related. Some consumer-grade SSDs are more susceptible to power outages in Ceph; similarly, some controllers with cache but without battery backing can fail in such cases. Search online to see which SSDs are recommended with Ceph and which can get corrupted on power outages. On some models it helps to use the hdparm tool to disable any on-disk caching. It is also possible the failure affected the OS disk, but I would expect that to have made the node non-bootable.
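As an example, on SATA drives the cache check/disable with hdparm looks roughly like this (a sketch only; /dev/sdX is a placeholder for your OSD device, and SAS drives generally need sdparm instead):
hdparm -W /dev/sdX     # show whether the drive's volatile write cache is enabled
hdparm -W0 /dev/sdX    # disable the write cache (may need to be re-applied after a power cycle)
hdparm -W1 /dev/sdX    # re-enable it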
I recommend the following:
First try to see why the services are not starting up:
Does the node have the correct ips for management, backend 1 & 2 ? check via
ip addr
If so can you ping the other nodes on these all 3 interfaces/subnets ?
If yes then try to start an OSD manually to see what errors you get
/usr/bin/ceph-osd -f --cluster CLUSTER_NAME --id OSD_ID --setuser ceph --setgroup ceph
what errors do you see on the console ? you can also see Ceph logs in /var/log/ceph
If you think the OS disk has gone bad, you can re-install on this node; during deployment you should select "Replace Management Node" rather than "Join", leave the OSD disks as they are, and do not select them during deployment. If the OSDs are also bad you may need to delete them and re-add them.
One more thing: do you have a /var/lib/ceph/ path present on the OS disk? Any chance you have another/older PetaSAN OS disk that it is now booting from?
Last edited on April 25, 2019, 3:00 pm by admin · #2
deweyhylton
14 Posts
April 25, 2019, 3:45 pm
I agree on current state.
I do not believe any hardware has failed.
All nodes can ping each other on all relevant interfaces, and all ip addresses are correct.
Starting OSD manually:
root@pdflvsan01cp003:/var/lib/ceph/osd# /usr/bin/ceph-osd -f --cluster pdflvsan01 --id 24 --setuser ceph --setgroup ceph
2019-04-25 11:33:22.781580 7f5f4f9cee00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/pdflvsan01-24: (2) No such file or directory
root@pdflvsan01cp003:/var/lib/ceph/osd# cat /var/log/ceph/pdflvsan01-osd.24.log
2019-04-25 11:33:22.781193 7f5f4f9cee00 0 set uid:gid to 64045:64045 (ceph:ceph)
2019-04-25 11:33:22.781206 7f5f4f9cee00 0 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable), process ceph-osd, pid 64769
2019-04-25 11:33:22.781580 7f5f4f9cee00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/pdflvsan01-24: (2) No such file or directory
Also found in kernel log many instances of the following, matching each physical disk used for OSD, just after boot:
Apr 24 10:34:17 pdflvsan01cp003 kernel: [ 162.946149] XFS (sdl1): Mounting V5 Filesystem
Apr 24 10:34:17 pdflvsan01cp003 kernel: [ 162.966518] XFS (sdl1): Ending clean mount
Apr 24 10:34:17 pdflvsan01cp003 kernel: [ 162.969936] XFS (sdl1): Unmounting Filesystem
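For completeness, a quick way to cross-check what those partitions carry before going further (blkid is standard, and ceph-disk should be present on this Luminous-based release; output omitted here):
blkid /dev/sdl1                    # confirm the partition still reports an XFS filesystem with a UUID
ceph-disk list 2>/dev/null | head  # if ceph-disk is available, show how it classifies each partition
ls /var/lib/ceph/osd/              # the mount points the OSD units expect (pdflvsan01-24, etc.)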
deweyhylton
14 Posts
April 25, 2019, 3:59 pm
Also: I do not believe this node is booting from an older installation. If it were doing so, would 'ceph status' report properly against the current cluster?
deweyhylton
14 Posts
April 25, 2019, 4:59 pm
Perhaps I was wrong in thinking that configurations are not being replicated:
root@pdflvsan01cp003:/var/log/ceph# gluster peer status
Number of Peers: 2
Hostname: 10.205.2.11
Uuid: c7fc65b3-166b-465d-9578-e50de6725461
State: Peer in Cluster (Connected)
Other names:
pdflvsan01cp002
Hostname: 10.205.2.10
Uuid: 91a22f87-4d83-445f-a7d5-1a6e5f555556
State: Peer in Cluster (Connected)
Other names:
pdflvsan01cp001
root@pdflvsan01cp003:/var/log/ceph# gluster volume info
Volume Name: gfs-vol
Type: Replicate
Volume ID: ab2592c5-8c3b-4bf8-8c0f-0296db0254fb
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.205.2.10:/opt/petasan/config/gfs-brick
Brick2: 10.205.2.12:/opt/petasan/config/gfs-brick
Brick3: 10.205.2.11:/opt/petasan/config/gfs-brick
Options Reconfigured:
nfs.disable: true
network.ping-timeout: 5
performance.readdir-ahead: on
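A couple of additional gluster checks worth capturing as well (standard gluster CLI; output omitted):
gluster volume status gfs-vol     # confirm the brick on this node is online and listening on its port
gluster volume heal gfs-vol info  # list any files pending self-heal between the bricks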
admin
2,930 Posts
April 25, 2019, 5:27 pm
It was not clear whether you ran ceph status on that node, so yes, it seems to be the correct boot disk.
I am not saying you had a hardware failure, but it is possible the SSDs and/or controller are not robust against power failures, causing data corruption; this could be down to how they cache data, as per my earlier post. What disk models do you use? Can you check with hdparm whether they have cache enabled? Do you use a controller with cache? Does it have battery backing?
Can you confirm that partition 1 on all OSDs is not mounted (via the mount command)? If you try to mount them manually to a temp directory, do they mount? If not, then I would recommend you re-install the node, deploy with "Replace Management Node", and re-create the OSDs. There is no point trying to repair the XFS filesystem if you already have clean copies of the data on the other nodes. This would be a short-term fix; later you should check whether your disks can withstand power failures without data corruption.
deweyhylton
14 Posts
April 25, 2019, 7:00 pm
None of the OSDs have partition 1 mounted; they appear to mount on boot, then unmount, as depicted in a previous log snippet. They are definitely mountable, however - here are two examples:
root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $1 }'
root@pdflvsan01cp003:~# mount /dev/sdh1 /mnt
root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $1 }'
/dev/sdh1
root@pdflvsan01cp003:~# find /mnt/
/mnt/
root@pdflvsan01cp003:~# df -h /mnt
Filesystem Size Used Avail Use% Mounted on
/dev/sdh1 94M 5.5M 89M 6% /mnt
root@pdflvsan01cp003:~# umount /mnt/
root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $1 }'
root@pdflvsan01cp003:~# mount /dev/sda1 /mnt
root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $1 }'
systemd-1
/dev/sda1
root@pdflvsan01cp003:~# find /mnt/
/mnt/
root@pdflvsan01cp003:~# df -h /mnt
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 94M 5.5M 89M 6% /mnt
root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $0 }'
/dev/sda1 on /mnt type xfs (rw,relatime,attr2,inode64,noquota)
root@pdflvsan01cp003:~# umount /mnt
Clearly these filesystems appear to be blank. What might cause that, exactly? Surely not all of these lost all their data and reverted to a fresh blank filesystem due to a power outage ... I think it more likely that they were somehow purposefully blanked. Here is a quick snippet from one of the surviving nodes, for comparison purposes:
root@pdflvsan01cp001:~# mount | awk '$1 ~ "dev.*1" { print $0 }'
/dev/sdd1 on /var/lib/ceph/osd/pdflvsan01-13 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdc1 on /var/lib/ceph/osd/pdflvsan01-15 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sde1 on /var/lib/ceph/osd/pdflvsan01-12 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdm1 on /var/lib/ceph/osd/pdflvsan01-19 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdl1 on /var/lib/ceph/osd/pdflvsan01-18 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdj1 on /var/lib/ceph/osd/pdflvsan01-20 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdi1 on /var/lib/ceph/osd/pdflvsan01-23 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdk1 on /var/lib/ceph/osd/pdflvsan01-21 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdb1 on /var/lib/ceph/osd/pdflvsan01-14 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sda1 on /var/lib/ceph/osd/pdflvsan01-16 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdn1 on /var/lib/ceph/osd/pdflvsan01-17 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdh1 on /var/lib/ceph/osd/pdflvsan01-22 type xfs (rw,noatime,attr2,inode64,noquota)
root@pdflvsan01cp001:~# find /var/lib/ceph/osd/pdflvsan01-13
/var/lib/ceph/osd/pdflvsan01-13
/var/lib/ceph/osd/pdflvsan01-13/ceph_fsid
/var/lib/ceph/osd/pdflvsan01-13/fsid
/var/lib/ceph/osd/pdflvsan01-13/magic
/var/lib/ceph/osd/pdflvsan01-13/block.db_uuid
/var/lib/ceph/osd/pdflvsan01-13/block.db
/var/lib/ceph/osd/pdflvsan01-13/block_uuid
/var/lib/ceph/osd/pdflvsan01-13/block
/var/lib/ceph/osd/pdflvsan01-13/type
/var/lib/ceph/osd/pdflvsan01-13/keyring
/var/lib/ceph/osd/pdflvsan01-13/whoami
/var/lib/ceph/osd/pdflvsan01-13/activate.monmap
/var/lib/ceph/osd/pdflvsan01-13/kv_backend
/var/lib/ceph/osd/pdflvsan01-13/bluefs
/var/lib/ceph/osd/pdflvsan01-13/mkfs_done
/var/lib/ceph/osd/pdflvsan01-13/ready
/var/lib/ceph/osd/pdflvsan01-13/systemd
/var/lib/ceph/osd/pdflvsan01-13/active
I don't mind rebuilding node3 and having ceph rebalance - but I want to learn as much as possible from this situation first. Given what we know at this point, what are my options? Are there next steps?
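One thing I can still do on this node is a read-only consistency check of those partitions, along these lines (non-destructive; /dev/sdh1 is one of the OSD partitions shown above and must be unmounted first):
xfs_repair -n /dev/sdh1                            # -n = no-modify mode, report problems only
xfs_db -r -c 'sb 0' -c 'print' /dev/sdh1 | head    # read-only dump of the primary superblock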
admin
2,930 Posts
April 25, 2019, 7:46 pm
They were not blanked; the filesystem is likely corrupt. If you use
mount | awk '$1 ~ "dev.*1$" { print $0 }'
does it show the mount point and list the type as xfs? Do you see any dmesg messages if you manually mount them?
To double check: is the OS disk OK? Are /opt/petasan/config and /var/lib/ceph mounted? Can you touch a file in them?
For the short term the re-install will get you going with 3 nodes. Longer term, investigate the model of the SSD disks used for the OSDs: do they have power-loss protection? Many consumer-grade SSDs do not, in which case it may be better to replace the OSD disks.
Last edited on April 25, 2019, 7:52 pm by admin · #8
deweyhylton
14 Posts
April 25, 2019, 9:18 pm
Yes, these filesystems look similar to the ones on the surviving node:
/dev/sdh1 on /mnt type xfs (rw,relatime,attr2,inode64,noquota)
[15512.508162] XFS (sdh1): Mounting V5 Filesystem
[15512.636997] XFS (sdh1): Ending clean mount
The latter shows up on boot as well, but then gets immediately unmounted as shown in an earlier snippet.
The other filesystems are mounted and writable:
root@pdflvsan01cp003:~# mount | grep -e /opt/petasan/config -e /var/lib/ceph
/dev/sdm4 on /var/lib/ceph type ext4 (rw,relatime,data=ordered)
/dev/sdm5 on /opt/petasan/config type ext4 (rw,relatime,data=ordered)
10.205.2.10:gfs-vol on /opt/petasan/config/shared type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
root@pdflvsan01cp003:~# touch /var/lib/ceph/zzfoo /opt/petasan/config/zzfoo
root@pdflvsan01cp003:~# ls -l /var/lib/ceph/zzfoo /opt/petasan/config/zzfoo
-rw-r--r-- 1 root root 0 Apr 25 16:10 /opt/petasan/config/zzfoo
-rw-r--r-- 1 root root 0 Apr 25 16:10 /var/lib/ceph/zzfoo
root@pdflvsan01cp003:~# rm /var/lib/ceph/zzfoo /opt/petasan/config/zzfoo
root@pdflvsan01cp003:~#
I am curious about how the complete consul configuration could have been lost, though ... here is what the affected node has:
root@pdflvsan01cp003:~# find /opt/petasan/config/etc/consul.d -ls
1835032 4 drwxr-xr-x 3 root root 4096 Apr 24 09:54 /opt/petasan/config/etc/consul.d
1835033 4 drwxr-xr-x 2 root root 4096 Apr 24 09:54 /opt/petasan/config/etc/consul.d/client
1835034 4 -rw-r--r-- 1 root root 202 Apr 24 09:54 /opt/petasan/config/etc/consul.d/client/config.json
What writes the consul configuration to that location, and when? Is it somehow synchronized across the cluster, or does it exist statically on every node? Somewhere, somehow, server/config.json got removed and replaced with client/config.json ... From a surviving node:
root@pdflvsan01cp001:~# find /opt/petasan/config/etc/consul.d -ls
6553624 4 drwxr-xr-x 3 root root 4096 Feb 21 02:11 /opt/petasan/config/etc/consul.d
6553625 4 drwxr-xr-x 2 root root 4096 Feb 21 02:11 /opt/petasan/config/etc/consul.d/server
6553626 4 -rw-r--r-- 1 root root 195 Feb 21 02:11 /opt/petasan/config/etc/consul.d/server/config.json
As for the disks, the OS uses SSDSC2BB240G7, which has enhanced power loss data protection. And this appears to be where the consul config is stored.
The OSDs are spinning rust and not SSD. Specifically: HGST Ultrastar 4TB 3.5" 7200RPM SAS3 ... not sure about battery or supercaps on those or the LSI controller though.
I am also unsure about the nvme devices used for journal ...
Meanwhile, all I have found in the kernel logs relating to the xfs filesystems are that they cleanly mounted before, and after the power incident they cleanly mounted and then immediately unmounted.
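To narrow down the consul question, comparing what each node actually runs seems like a reasonable next step; something along these lines (assuming the consul binary is on PATH and an agent is reachable locally):
consul members                    # on a surviving node: lists each agent and whether it is a server or client
consul info | grep 'server ='     # on each node: whether the local agent runs in server mode
ssh pdflvsan01cp001 cat /opt/petasan/config/etc/consul.d/server/config.json   # compare against a known-good server config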
admin
2,930 Posts
April 25, 2019, 10:13 pm
It could very well be the controller; if caching is on without battery backing, it could be the issue. If you can, test unplugging power on a test cluster with different controller caching settings and see what works and what does not. If not, and depending on your environment, you could potentially do this test on the current node after re-installation, but it is a tough call to do this with a production system. Note that turning off caching with spinning disks will drop performance.
The consul setup showing client rather than server on this node is puzzling. The only explanation I see, since you mentioned this is not the first time you have had a power-loss issue, is that you may have re-installed and selected "Join Existing Cluster" rather than "Replace Management Node", so when you re-install, select the latter.
Last edited on April 25, 2019, 10:15 pm by admin · #10
node stays disconnected following power outage
deweyhylton
14 Posts
Quote from deweyhylton on April 25, 2019, 1:52 pmI had high hopes this would be solved in a similarly-named topic, but ...
Let me start with a couple of points:
- I am not a ceph expert - far from it
- I have seen this happen 3 times now on different releases of PetaSAN
I have a 3-node PetaSAN 2.2.0 cluster running, with each node running identical SuperMicro kit:
- Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz (56vcpu)
- 256GB memory
- 3x nvme
- 12x 4TB disk
- 2x 256GB ssd for os
- separate physical 10GbE nics, each on its own separate logical network, for:
- mgmt
- backend 1
- backend 2
- iscsi 1
- iscsi 2
Each node can ping each other on all relevant interfaces. This particular cluster has been running well for about 2 months.
Hopefully that covers the needed background ... Now, on to the problem at hand. One node went offline, and was found to be powered off. I powered it back on, and watched 'ceph -w' until the PG_DEGRADED warnings resolved. At this point, ceph status looks like this:
root@pdflvsan01cp003:~# ceph --cluster pdflvsan01 status
cluster:
id: ef401168-163c-4c1f-a301-b5dd67287d7e
health: HEALTH_WARN
1/3 mons down, quorum pdflvsan01cp001,pdflvsan01cp002services:
mon: 3 daemons, quorum pdflvsan01cp001,pdflvsan01cp002, out of quorum: pdflvsan01cp003
mgr: pdflvsan01cp001(active)
osd: 36 osds: 24 up, 24 indata:
pools: 1 pools, 1024 pgs
objects: 4483k objects, 17928 GB
usage: 35873 GB used, 53548 GB / 89422 GB avail
pgs: 1024 active+cleanNow ... my first clue about what might be going on was that the node failed to resolve the names for the other cluster nodes - and I found that /etc/resolv.conf was empty. That lead me to peer into /opt/petasan/config ... among other things that looked wrong, I found that /opt/petasan/config/etc/consul.d contained only a client subdirectory. That fact reminded me of what I had seen on the above-mentioned prior failures, and matches what was seen in another post already mentioned.
To make a longer story a bit shorter, I find that a lot of things (consul, ceph-mon, etc.) are simply not running on this particular node - and it appears to me as if the configuration items shared/replicated between nodes are no longer being shared/replicated to this node and that is causing these daemons to fail to start.
Thus far, the PetaSAN documentation appears to only cover point-and-click stuff in the web interface and none of the under-the-covers stuff. And I have not yet found answers which seem to cover my particular issue. Yet I have seen this same issue at least 3 times now with different versions of PetaSAN. Prior, I simply rebuilt the cluster from the ground up. This time, however, the cluster is in use and taking it down would't be prudent. I would like to get to the bottom of the issue and learn:
- how this part of PetaSAN works
- how to solve the issue properly
- what I may have done wrong which caused this situation
So here I am, asking for assistance from those with more experience than myself. Please ask whatever questions that are needed and I'll do my best to follow along. Meanwhile, some things I found in the logs which may or may not help:
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: Started Ceph cluster monitor daemon.
-- Subject: Unit ceph-mon@pdflvsan01cp003.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-mon@pdflvsan01cp003.service has finished starting up.
--
-- The start-up result is done.
Apr 25 07:41:32 pdflvsan01cp003 ceph-mon[55518]: unable to stat setuser_match_path /var/lib/ceph/$type/$cluster-$id: (2) No such file or directory
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Main process exited, code=exited, status=1/FAILURE
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Unit entered failed state.
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Failed with result 'exit-code'.
Apr 25 07:41:42 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Service hold-off time over, scheduling restart.
Apr 25 07:41:42 pdflvsan01cp003 systemd[1]: Stopped Ceph cluster monitor daemon.
-- Subject: Unit ceph-mon@pdflvsan01cp003.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-mon@pdflvsan01cp003.service has finished shutting down.
Apr 25 07:43:18 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Start request repeated too quickly.
Apr 25 07:43:18 pdflvsan01cp003 systemd[1]: Failed to start Ceph cluster monitor daemon.
-- Subject: Unit ceph-mon@pdflvsan01cp003.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-mon@pdflvsan01cp003.service has failed.
--
-- The result is failed.
Apr 25 09:43:25 pdflvsan01cp003 files_sync.py[3484]: HTTPConnectionPool(host='127.0.0.1', port=8500): Max retries exceeded with url: /v1/kv/PetaSAN/Config/Files?recurse=1 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f017
I had high hopes this would be solved in a similarly-named topic, but ...
Let me start with a couple of points:
- I am not a ceph expert - far from it
- I have seen this happen 3 times now on different releases of PetaSAN
I have a 3-node PetaSAN 2.2.0 cluster running, with each node running identical SuperMicro kit:
- Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz (56vcpu)
- 256GB memory
- 3x nvme
- 12x 4TB disk
- 2x 256GB ssd for os
- separate physical 10GbE nics, each on its own separate logical network, for:
- mgmt
- backend 1
- backend 2
- iscsi 1
- iscsi 2
Each node can ping each other on all relevant interfaces. This particular cluster has been running well for about 2 months.
Hopefully that covers the needed background ... Now, on to the problem at hand. One node went offline, and was found to be powered off. I powered it back on, and watched 'ceph -w' until the PG_DEGRADED warnings resolved. At this point, ceph status looks like this:
root@pdflvsan01cp003:~# ceph --cluster pdflvsan01 status
cluster:
id: ef401168-163c-4c1f-a301-b5dd67287d7e
health: HEALTH_WARN
1/3 mons down, quorum pdflvsan01cp001,pdflvsan01cp002services:
mon: 3 daemons, quorum pdflvsan01cp001,pdflvsan01cp002, out of quorum: pdflvsan01cp003
mgr: pdflvsan01cp001(active)
osd: 36 osds: 24 up, 24 indata:
pools: 1 pools, 1024 pgs
objects: 4483k objects, 17928 GB
usage: 35873 GB used, 53548 GB / 89422 GB avail
pgs: 1024 active+clean
Now ... my first clue about what might be going on was that the node failed to resolve the names for the other cluster nodes - and I found that /etc/resolv.conf was empty. That lead me to peer into /opt/petasan/config ... among other things that looked wrong, I found that /opt/petasan/config/etc/consul.d contained only a client subdirectory. That fact reminded me of what I had seen on the above-mentioned prior failures, and matches what was seen in another post already mentioned.
To make a longer story a bit shorter, I find that a lot of things (consul, ceph-mon, etc.) are simply not running on this particular node - and it appears to me as if the configuration items shared/replicated between nodes are no longer being shared/replicated to this node and that is causing these daemons to fail to start.
Thus far, the PetaSAN documentation appears to only cover point-and-click stuff in the web interface and none of the under-the-covers stuff. And I have not yet found answers which seem to cover my particular issue. Yet I have seen this same issue at least 3 times now with different versions of PetaSAN. Prior, I simply rebuilt the cluster from the ground up. This time, however, the cluster is in use and taking it down would't be prudent. I would like to get to the bottom of the issue and learn:
- how this part of PetaSAN works
- how to solve the issue properly
- what I may have done wrong which caused this situation
So here I am, asking for assistance from those with more experience than myself. Please ask whatever questions that are needed and I'll do my best to follow along. Meanwhile, some things I found in the logs which may or may not help:
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: Started Ceph cluster monitor daemon.
-- Subject: Unit ceph-mon@pdflvsan01cp003.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-mon@pdflvsan01cp003.service has finished starting up.
--
-- The start-up result is done.
Apr 25 07:41:32 pdflvsan01cp003 ceph-mon[55518]: unable to stat setuser_match_path /var/lib/ceph/$type/$cluster-$id: (2) No such file or directory
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Main process exited, code=exited, status=1/FAILURE
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Unit entered failed state.
Apr 25 07:41:32 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Failed with result 'exit-code'.
Apr 25 07:41:42 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Service hold-off time over, scheduling restart.
Apr 25 07:41:42 pdflvsan01cp003 systemd[1]: Stopped Ceph cluster monitor daemon.
-- Subject: Unit ceph-mon@pdflvsan01cp003.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-mon@pdflvsan01cp003.service has finished shutting down.
Apr 25 07:43:18 pdflvsan01cp003 systemd[1]: ceph-mon@pdflvsan01cp003.service: Start request repeated too quickly.
Apr 25 07:43:18 pdflvsan01cp003 systemd[1]: Failed to start Ceph cluster monitor daemon.
-- Subject: Unit ceph-mon@pdflvsan01cp003.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-mon@pdflvsan01cp003.service has failed.
--
-- The result is failed.
Apr 25 09:43:25 pdflvsan01cp003 files_sync.py[3484]: HTTPConnectionPool(host='127.0.0.1', port=8500): Max retries exceeded with url: /v1/kv/PetaSAN/Config/Files?recurse=1 (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f017
admin
2,930 Posts
Quote from admin on April 25, 2019, 2:50 pmThe current state is that 1 node out of 3 is not working, the 2 other nodes created the extra data replicas among themselves, so the pgs recovered from being degraded to active-clean. However if any node out of the 2 fails, the cluster will not respond to io but will still have 1 copy of the data.
As to why a power outage will cause a node to fail, could be many things and likely hardware related. Some consumer grade SSDs are more susceptible to power outages in Ceph, similarly some controllers with cache without battery backing can fail in such cases. Search online to see what SSDs are recommended with Ceph and which can get corrupted on power outages. It helps on some models to use the hdparm tool to disable any on-disk caching. It is also possible the failure affected the OS disk, but i would expect this would have made the node non bookable.
I recommend the following:
First try to see why the services are not starting up:
Does the node have the correct ips for management, backend 1 & 2 ? check via
ip addr
If so can you ping the other nodes on these all 3 interfaces/subnets ?
If yes then try to start an OSD manually to see what errors you get/usr/bin/ceph-osd -f --cluster CLUSTER_NAME --id OSD_ID --setuser ceph --setgroup ceph
what errors do you see on the console ? you can also see Ceph logs in /var/log/ceph
If you think the OS disk has gone bad, you can re-install on this node and during deployment you should select "Replace Management Node" rather that "Join", leave the OSD disks as they are and do not select them during deployment. IF the OSDs are also bad you may need to delete them and re-add them.
One more thing : do you have a /var/lib/ceph/ path present on the OS disk ? any chance you have another/older PetaSAN OS disk that is now booting from ?
The current state is that 1 node out of 3 is not working, the 2 other nodes created the extra data replicas among themselves, so the pgs recovered from being degraded to active-clean. However if any node out of the 2 fails, the cluster will not respond to io but will still have 1 copy of the data.
As to why a power outage will cause a node to fail, could be many things and likely hardware related. Some consumer grade SSDs are more susceptible to power outages in Ceph, similarly some controllers with cache without battery backing can fail in such cases. Search online to see what SSDs are recommended with Ceph and which can get corrupted on power outages. It helps on some models to use the hdparm tool to disable any on-disk caching. It is also possible the failure affected the OS disk, but i would expect this would have made the node non bookable.
I recommend the following:
First try to see why the services are not starting up:
Does the node have the correct ips for management, backend 1 & 2 ? check via
ip addr
If so can you ping the other nodes on these all 3 interfaces/subnets ?
If yes then try to start an OSD manually to see what errors you get
/usr/bin/ceph-osd -f --cluster CLUSTER_NAME --id OSD_ID --setuser ceph --setgroup ceph
what errors do you see on the console ? you can also see Ceph logs in /var/log/ceph
If you think the OS disk has gone bad, you can re-install on this node and during deployment you should select "Replace Management Node" rather that "Join", leave the OSD disks as they are and do not select them during deployment. IF the OSDs are also bad you may need to delete them and re-add them.
One more thing : do you have a /var/lib/ceph/ path present on the OS disk ? any chance you have another/older PetaSAN OS disk that is now booting from ?
deweyhylton
14 Posts
Quote from deweyhylton on April 25, 2019, 3:45 pmI agree on current state.
I do not believe any hardware has failed.
All nodes can ping each other on all relevant interfaces, and all ip addresses are correct.
Starting OSD manually:
root@pdflvsan01cp003:/var/lib/ceph/osd# /usr/bin/ceph-osd -f --cluster pdflvsan01 --id 24 --setuser ceph --setgroup ceph
2019-04-25 11:33:22.781580 7f5f4f9cee00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/pdflvsan01-24: (2) No such file or directoryroot@pdflvsan01cp003:/var/lib/ceph/osd# cat /var/log/ceph/pdflvsan01-osd.24.log
2019-04-25 11:33:22.781193 7f5f4f9cee00 0 set uid:gid to 64045:64045 (ceph:ceph)
2019-04-25 11:33:22.781206 7f5f4f9cee00 0 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable), process ceph-osd, pid 64769
2019-04-25 11:33:22.781580 7f5f4f9cee00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/pdflvsan01-24: (2) No such file or directoryAlso found in kernel log many instances of the following, matching each physical disk used for OSD, just after boot:
Apr 24 10:34:17 pdflvsan01cp003 kernel: [ 162.946149] XFS (sdl1): Mounting V5 Filesystem
Apr 24 10:34:17 pdflvsan01cp003 kernel: [ 162.966518] XFS (sdl1): Ending clean mount
Apr 24 10:34:17 pdflvsan01cp003 kernel: [ 162.969936] XFS (sdl1): Unmounting Filesystem
I agree on current state.
I do not believe any hardware has failed.
All nodes can ping each other on all relevant interfaces, and all ip addresses are correct.
Starting OSD manually:
root@pdflvsan01cp003:/var/lib/ceph/osd# /usr/bin/ceph-osd -f --cluster pdflvsan01 --id 24 --setuser ceph --setgroup ceph
2019-04-25 11:33:22.781580 7f5f4f9cee00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/pdflvsan01-24: (2) No such file or directoryroot@pdflvsan01cp003:/var/lib/ceph/osd# cat /var/log/ceph/pdflvsan01-osd.24.log
2019-04-25 11:33:22.781193 7f5f4f9cee00 0 set uid:gid to 64045:64045 (ceph:ceph)
2019-04-25 11:33:22.781206 7f5f4f9cee00 0 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable), process ceph-osd, pid 64769
2019-04-25 11:33:22.781580 7f5f4f9cee00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/pdflvsan01-24: (2) No such file or directory
Also found in kernel log many instances of the following, matching each physical disk used for OSD, just after boot:
Apr 24 10:34:17 pdflvsan01cp003 kernel: [ 162.946149] XFS (sdl1): Mounting V5 Filesystem
Apr 24 10:34:17 pdflvsan01cp003 kernel: [ 162.966518] XFS (sdl1): Ending clean mount
Apr 24 10:34:17 pdflvsan01cp003 kernel: [ 162.969936] XFS (sdl1): Unmounting Filesystem
deweyhylton
14 Posts
Quote from deweyhylton on April 25, 2019, 3:59 pmAlso: I do not believe this node is booting from an older installation. If it were doing so, would 'ceph status' report properly against the current cluster?
Also: I do not believe this node is booting from an older installation. If it were doing so, would 'ceph status' report properly against the current cluster?
deweyhylton
14 Posts
Quote from deweyhylton on April 25, 2019, 4:59 pmPerhaps I was wrong in thinking that configurations are not being replicated:
root@pdflvsan01cp003:/var/log/ceph# gluster peer status
Number of Peers: 2Hostname: 10.205.2.11
Uuid: c7fc65b3-166b-465d-9578-e50de6725461
State: Peer in Cluster (Connected)
Other names:
pdflvsan01cp002Hostname: 10.205.2.10
Uuid: 91a22f87-4d83-445f-a7d5-1a6e5f555556
State: Peer in Cluster (Connected)
Other names:
pdflvsan01cp001
root@pdflvsan01cp003:/var/log/ceph# gluster volume infoVolume Name: gfs-vol
Type: Replicate
Volume ID: ab2592c5-8c3b-4bf8-8c0f-0296db0254fb
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.205.2.10:/opt/petasan/config/gfs-brick
Brick2: 10.205.2.12:/opt/petasan/config/gfs-brick
Brick3: 10.205.2.11:/opt/petasan/config/gfs-brick
Options Reconfigured:
nfs.disable: true
network.ping-timeout: 5
performance.readdir-ahead: on
Perhaps I was wrong in thinking that configurations are not being replicated:
root@pdflvsan01cp003:/var/log/ceph# gluster peer status
Number of Peers: 2Hostname: 10.205.2.11
Uuid: c7fc65b3-166b-465d-9578-e50de6725461
State: Peer in Cluster (Connected)
Other names:
pdflvsan01cp002Hostname: 10.205.2.10
Uuid: 91a22f87-4d83-445f-a7d5-1a6e5f555556
State: Peer in Cluster (Connected)
Other names:
pdflvsan01cp001
root@pdflvsan01cp003:/var/log/ceph# gluster volume infoVolume Name: gfs-vol
Type: Replicate
Volume ID: ab2592c5-8c3b-4bf8-8c0f-0296db0254fb
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.205.2.10:/opt/petasan/config/gfs-brick
Brick2: 10.205.2.12:/opt/petasan/config/gfs-brick
Brick3: 10.205.2.11:/opt/petasan/config/gfs-brick
Options Reconfigured:
nfs.disable: true
network.ping-timeout: 5
performance.readdir-ahead: on
admin
2,930 Posts
Quote from admin on April 25, 2019, 5:27 pmit was not clear if you ran ceph status on that node, so yes it seems the correct boot disk.
i am not saying you had a hardware failure, but it is possible the SSDs and or controller are not robust against power failures causing data corruption, this could be how they cache data as per my post. what disk models do you use ? can you check hdparm if they have cache enabled ? do you use a controller with cache ? does it have battery backing ?
can you confirm that partition 1 on all OSDs are not mounted (via mount command) ? if you try to mount them manually to a temp directory, do they mount ? if not then i would recommend you re-install the node and deploy with "Replace Management Node" and re-create the OSDs. There is no point trying to repair the xfs files system if you already have clean copies of the data on other nodes. This would be a short term fix, later you should check if your disks can withstand power failures without data corruption.
deweyhylton
14 Posts
Quote from deweyhylton on April 25, 2019, 7:00 pm
None of the OSDs have partition 1 mounted; they appear to mount on boot, then unmount, as depicted in a previous log snippet. They are definitely mountable, however - here are two examples:
root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $1 }'
root@pdflvsan01cp003:~# mount /dev/sdh1 /mnt
root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $1 }'
/dev/sdh1
root@pdflvsan01cp003:~# find /mnt/
/mnt/
root@pdflvsan01cp003:~# df -h /mnt
Filesystem Size Used Avail Use% Mounted on
/dev/sdh1 94M 5.5M 89M 6% /mnt
root@pdflvsan01cp003:~# umount /mnt/
root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $1 }'
root@pdflvsan01cp003:~# mount /dev/sda1 /mnt
root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $1 }'
systemd-1
/dev/sda1
root@pdflvsan01cp003:~# find /mnt/
/mnt/
root@pdflvsan01cp003:~# df -h /mnt
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 94M 5.5M 89M 6% /mnt
root@pdflvsan01cp003:~# mount | awk '$1 ~ "dev.*1$" { print $0 }'
/dev/sda1 on /mnt type xfs (rw,relatime,attr2,inode64,noquota)
root@pdflvsan01cp003:~# umount /mnt
These filesystems clearly appear to be blank. What exactly might cause that? Surely they did not all lose their data and revert to fresh, blank filesystems because of a power outage ... I think it more likely that they were somehow purposely blanked. Here is a quick snippet from one of the surviving nodes, for comparison purposes:
root@pdflvsan01cp001:~# mount | awk '$1 ~ "dev.*1" { print $0 }'
/dev/sdd1 on /var/lib/ceph/osd/pdflvsan01-13 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdc1 on /var/lib/ceph/osd/pdflvsan01-15 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sde1 on /var/lib/ceph/osd/pdflvsan01-12 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdm1 on /var/lib/ceph/osd/pdflvsan01-19 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdl1 on /var/lib/ceph/osd/pdflvsan01-18 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdj1 on /var/lib/ceph/osd/pdflvsan01-20 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdi1 on /var/lib/ceph/osd/pdflvsan01-23 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdk1 on /var/lib/ceph/osd/pdflvsan01-21 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdb1 on /var/lib/ceph/osd/pdflvsan01-14 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sda1 on /var/lib/ceph/osd/pdflvsan01-16 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdn1 on /var/lib/ceph/osd/pdflvsan01-17 type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdh1 on /var/lib/ceph/osd/pdflvsan01-22 type xfs (rw,noatime,attr2,inode64,noquota)
root@pdflvsan01cp001:~# find /var/lib/ceph/osd/pdflvsan01-13
/var/lib/ceph/osd/pdflvsan01-13
/var/lib/ceph/osd/pdflvsan01-13/ceph_fsid
/var/lib/ceph/osd/pdflvsan01-13/fsid
/var/lib/ceph/osd/pdflvsan01-13/magic
/var/lib/ceph/osd/pdflvsan01-13/block.db_uuid
/var/lib/ceph/osd/pdflvsan01-13/block.db
/var/lib/ceph/osd/pdflvsan01-13/block_uuid
/var/lib/ceph/osd/pdflvsan01-13/block
/var/lib/ceph/osd/pdflvsan01-13/type
/var/lib/ceph/osd/pdflvsan01-13/keyring
/var/lib/ceph/osd/pdflvsan01-13/whoami
/var/lib/ceph/osd/pdflvsan01-13/activate.monmap
/var/lib/ceph/osd/pdflvsan01-13/kv_backend
/var/lib/ceph/osd/pdflvsan01-13/bluefs
/var/lib/ceph/osd/pdflvsan01-13/mkfs_done
/var/lib/ceph/osd/pdflvsan01-13/ready
/var/lib/ceph/osd/pdflvsan01-13/systemd
/var/lib/ceph/osd/pdflvsan01-13/active
I don't mind rebuilding node3 and having ceph rebalance - but I want to learn as much as possible from this situation first. Given what we know at this point, what are my options? Are there next steps?
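One way to tell whether those small XFS filesystems are corrupt rather than simply empty would be a no-modify check, roughly as sketched below (run only while the partition is unmounted; /dev/sdh1 is used here because it appears in the output above):

# Dry-run filesystem check; -n makes no changes and only reports problems
xfs_repair -n /dev/sdh1

# Cross-check kernel messages emitted during a manual mount attempt
mount /dev/sdh1 /mnt && dmesg | tail -n 20 && umount /mnt

If xfs_repair -n reports a clean filesystem, then the data really is gone rather than unreadable, which points away from simple corruption of that partition.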
admin
2,930 Posts
Quote from admin on April 25, 2019, 7:46 pm
They were not blanked; the filesystem is likely corrupt. If you use
mount | awk '$1 ~ "dev.*1$" { print $0 }'
does it show the mount point and list the type as xfs? Do you see any dmesg messages when you mount them manually?
To double check: is the OS disk OK? Are /opt/petasan/config and /var/lib/ceph mounted? Can you touch a file in them?
For the short term, the re-install will get you going with 3 nodes. Longer term, investigate the model of the SSD disks used for the OSDs: do they have power loss protection? Many consumer grade SSDs do not, in which case it may be better to replace the OSD disks.
deweyhylton
14 Posts
Quote from deweyhylton on April 25, 2019, 9:18 pm
Yes, these filesystems look similar to the ones on the surviving node:
/dev/sdh1 on /mnt type xfs (rw,relatime,attr2,inode64,noquota)
[15512.508162] XFS (sdh1): Mounting V5 Filesystem
[15512.636997] XFS (sdh1): Ending clean mount
The latter shows up on boot as well, but then gets immediately unmounted as shown in an earlier snippet.
The other filesystems are mounted and writable:
root@pdflvsan01cp003:~# mount | grep -e /opt/petasan/config -e /var/lib/ceph
/dev/sdm4 on /var/lib/ceph type ext4 (rw,relatime,data=ordered)
/dev/sdm5 on /opt/petasan/config type ext4 (rw,relatime,data=ordered)
10.205.2.10:gfs-vol on /opt/petasan/config/shared type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
root@pdflvsan01cp003:~# touch /var/lib/ceph/zzfoo /opt/petasan/config/zzfoo
root@pdflvsan01cp003:~# ls -l /var/lib/ceph/zzfoo /opt/petasan/config/zzfoo
-rw-r--r-- 1 root root 0 Apr 25 16:10 /opt/petasan/config/zzfoo
-rw-r--r-- 1 root root 0 Apr 25 16:10 /var/lib/ceph/zzfoo
root@pdflvsan01cp003:~# rm /var/lib/ceph/zzfoo /opt/petasan/config/zzfoo
root@pdflvsan01cp003:~#
I am curious about how the complete consul configuration could have been lost, though ... here is what the affected node has:
root@pdflvsan01cp003:~# find /opt/petasan/config/etc/consul.d -ls
1835032 4 drwxr-xr-x 3 root root 4096 Apr 24 09:54 /opt/petasan/config/etc/consul.d
1835033 4 drwxr-xr-x 2 root root 4096 Apr 24 09:54 /opt/petasan/config/etc/consul.d/client
1835034 4 -rw-r--r-- 1 root root 202 Apr 24 09:54 /opt/petasan/config/etc/consul.d/client/config.json
What writes the consul configuration to that location, and when? Is it somehow synchronized across the cluster, or does it exist statically on every node? Somewhere, somehow, server/config.json got removed and replaced with client/config.json ... From a surviving node:
root@pdflvsan01cp001:~# find /opt/petasan/config/etc/consul.d -ls
6553624 4 drwxr-xr-x 3 root root 4096 Feb 21 02:11 /opt/petasan/config/etc/consul.d
6553625 4 drwxr-xr-x 2 root root 4096 Feb 21 02:11 /opt/petasan/config/etc/consul.d/server
6553626 4 -rw-r--r-- 1 root root 195 Feb 21 02:11 /opt/petasan/config/etc/consul.d/server/config.json
As for the disks, the OS uses SSDSC2BB240G7, which has enhanced power loss data protection. And this appears to be where the consul config is stored.
The OSDs are spinning rust and not SSD. Specifically: HGST Ultrastar 4TB 3.5" 7200RPM SAS3 ... not sure about battery or supercaps on those or the LSI controller though.
I am also unsure about the nvme devices used for journal ...
Meanwhile, all I have found in the kernel logs relating to the xfs filesystems is that they mounted cleanly before the incident, and after the power incident they again mounted cleanly and then immediately unmounted.
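One way to see at a glance which nodes carry the consul server role versus the client role is a cross-node comparison, roughly as sketched below (it assumes root SSH between the management nodes; the hostnames are taken from the outputs earlier in this thread):

# Dump the consul.d tree on each management node for side-by-side comparison
for h in pdflvsan01cp001 pdflvsan01cp002 pdflvsan01cp003; do
    echo "===== $h ====="
    ssh root@$h 'find /opt/petasan/config/etc/consul.d -type f | while read f; do echo "--- $f"; cat "$f"; done'
done

# The running agent also reports its own role (requires consul to be running on that node)
consul info | grep "server ="

This only shows the current state of each node; it does not answer what wrote the client config in the first place, which is the PetaSAN-internal part of the question.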
admin
2,930 Posts
Quote from admin on April 25, 2019, 10:13 pm
It could very well be the controller; if caching is on without battery backing, that could be the issue. If you can, test pulling power on a test cluster with different controller caching settings and see what works and what does not. Failing that, and depending on your environment, you could potentially run this test on the current node after re-installation, but it is a tough call to do that on a production system. Note that turning off caching with spinning disks will hurt performance.
The consul setup showing client rather than server on this node is puzzling. The only explanation I can see, given that this is not the first time you have had a power loss issue, is that you may have re-installed and selected "Join Existing Cluster" rather than "Replace Management Node" - so when you re-install, select the latter.
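As a rough illustration of the disk-level part of this check (a sketch only; sdparm and hdparm are generic Linux tools, not PetaSAN-specific, and /dev/sdX is a placeholder for an OSD disk):

# SAS drives: query and, if needed, disable the on-disk write cache (WCE bit)
sdparm --get=WCE /dev/sdX
sdparm --set=WCE=0 --save /dev/sdX

# SATA drives: the equivalent via hdparm (-W0 turns the write cache off)
hdparm -W /dev/sdX
hdparm -W0 /dev/sdX

Controller-level cache settings on an LSI card would need the vendor tool (storcli or MegaCli), whose exact syntax depends on the card and firmware.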