
3-node cluster recovery in case of one node failure


Hi,

In case of a 3-node cluster where each node is both a mon and an OSD, if one node is down, is Ceph expected to recover on the two surviving nodes?

When I try this scenario, it looks like Ceph gets stuck in a degraded state and doesn't start recovery.

Is this expected?

If you have 3 replicas but only 2 nodes up, the cluster will remain in a degraded state. It is still functioning and serving I/O, but it will be stuck in recovery. Ceph does not put more than one replica of a piece of data on any one server, so that if that server dies you lose no more than one replica. So in this case it needs at least 3 nodes up to distribute the replicas and report itself clean.
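As a quick sanity check on this point, the replica count and minimum replica count per pool can be read back with the standard pool-get commands (a minimal sketch, assuming the pool is named rbd as in the commands further down this thread):

# number of replicas the pool wants, and the minimum it will serve I/O with
ceph osd pool get rbd size
ceph osd pool get rbd min_size
# the CRUSH rule shows the failure domain (typically "host") used to spread replicas
ceph osd crush rule dump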

Sorry, I should have said that I have 2 replicas; that's why I expected it to start recovery.

With 2 replicas it should fully recover with 2 nodes up.

Do you have any OSDs down on the 2 up nodes?

Do you have enough space on your existing disks to store the extra replica?

Can you list ceph status and ceph health?
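For the first two questions above, a quick sketch using standard Ceph commands:

# any OSDs down, and where they sit in the CRUSH tree
ceph osd tree
# overall and per-pool usage
ceph df
# per-OSD utilisation, handy for spotting nearly full disks
ceph osd df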

Here is what I do:

root@hqlonceph1:~# systemctl stop ceph-mon.target
root@hqlonceph1:~# systemctl stop ceph-osd.target

And then it seems to be stuck in:

root@hqlonceph1:~# ceph status
  cluster:
    id:     3ae11f18-2081-480a-942b-1ce6befc8ab7
    health: HEALTH_WARN
            5 osds down
            1 host (5 osds) down
            Degraded data redundancy: 189300/556730 objects degraded (34.002%), 171 pgs unclean, 174 pgs degraded
            1/3 mons down, quorum hqlonceph2,hqlonceph3

  services:
    mon: 3 daemons, quorum hqlonceph2,hqlonceph3, out of quorum: hqlonceph1
    mgr: hqlonceph3(active), standbys: hqlonceph1, hqlonceph2
    osd: 15 osds: 10 up, 15 in

  data:
    pools:   1 pools, 256 pgs
    objects: 271k objects, 1086 GB
    usage:   2176 GB used, 14574 GB / 16750 GB avail
    pgs:     189300/556730 objects degraded (34.002%)
             174 active+undersized+degraded
             82 active+clean

  io:
    client: 1535 B/s rd, 1535 B/s wr, 1 op/s rd, 2 op/s wr

Forgot to add that if I bring the mon back up, but not the OSDs, it does start recovery.

root@hqlonceph1:~# ceph health
HEALTH_WARN 5 osds down; 1 host (5 osds) down; Degraded data redundancy: 189300/556730 objects degraded (34.002%), 174 pgs unclean, 174 pgs degraded, 174 pgs undersized; 1/3 mons down, quorum hqlonceph2,hqlonceph3

root@hqlonceph1:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       16.63734 root default
-3        5.53246     host hqlonceph1
 0   hdd  1.11049         osd.0         down  1.00000 1.00000
 1   hdd  1.11049         osd.1         down  1.00000 1.00000
 2   hdd  1.11049         osd.2         down  1.00000 1.00000
 3   hdd  1.11049         osd.3         down  1.00000 1.00000
12   hdd  1.09050         osd.12        down  1.00000 1.00000
-5        5.55244     host hqlonceph2
 4   hdd  1.11049         osd.4           up  1.00000 1.00000
 5   hdd  1.11049         osd.5           up  1.00000 1.00000
 6   hdd  1.11049         osd.6           up  1.00000 1.00000
 7   hdd  1.11049         osd.7           up  1.00000 1.00000
13   hdd  1.11049         osd.13          up  1.00000 1.00000
-7        5.55244     host hqlonceph3
 8   hdd  1.11049         osd.8           up  1.00000 1.00000
 9   hdd  1.11049         osd.9           up  1.00000 1.00000
10   hdd  1.11049         osd.10          up  1.00000 1.00000
11   hdd  1.11049         osd.11          up  1.00000 1.00000
14   hdd  1.11049         osd.14          up  1.00000 1.00000
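One detail visible in the status above: the stopped OSDs are down but still in (osd: 15 osds: 10 up, 15 in), and Ceph only remaps and re-replicates their PGs once those OSDs are also marked out, which normally happens after a timeout. A hedged sketch of checking the in/out state and, if you don't want to wait, marking the down OSDs out by hand (OSD IDs taken from the tree above):

# show up/down and in/out state per OSD
ceph osd dump | grep '^osd'
# optionally mark the down OSDs out to start re-replication immediately
ceph osd out 0 1 2 3 12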

If you stop and then restart the service on one of the up OSDs, does it trigger recovery?

Can you get the output of:

ceph osd pool get rbd size --cluster CLUSTER_NAME
ceph --show-config --cluster CLUSTER_NAME | grep mon_osd_min_in_ratio
ceph --show-config --cluster CLUSTER_NAME | grep osd_backfill_full_ratio
ceph health detail --cluster CLUSTER_NAME
ceph pg dump_stuck unclean --cluster CLUSTER_NAME
ceph pg STUCK_PG_NUM query --cluster CLUSTER_NAME
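A minimal sketch of the stop/restart test suggested above, assuming the standard per-OSD systemd units (ceph-osd@<id>) that go with the ceph-osd.target used earlier in the thread; osd.4 is just an example ID:

# on the node hosting osd.4
systemctl restart ceph-osd@4
# then watch the cluster log to see whether recovery kicks in
ceph -w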

It looks like I was too impatient 🙂

It did start recovery about 10 minutes after I took the services down...

Is that delay a configurable option in Ceph, do you know?

Excellent 🙂

I believe you can change it via

osd_recovery_delay_start

but I would not recommend changing it.
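For reference, a hedged sketch of how to inspect that option (and change it at runtime, which as noted is usually not advisable). The roughly 10-minute wait reported above also matches the default mon_osd_down_out_interval of 600 seconds, after which down OSDs are marked out and re-replication starts, so that setting may be the one actually at play here:

# current values (--show-config matches the style used earlier in the thread)
ceph --show-config | grep osd_recovery_delay_start
ceph --show-config | grep mon_osd_down_out_interval
# runtime change on all OSDs, e.g. to remove the recovery start delay (illustrative only)
ceph tell osd.* injectargs '--osd_recovery_delay_start 0'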

How is the recovery speed?

I have run into a situation in a small cluster where one of the OSDs went down (not out) and some PGs became degraded.

The degraded object count kept increasing as clients were still writing data to the cluster.

Even after the down OSD came back up, the degraded object count still kept increasing (after most of the degraded objects had been recovered because the OSD was up again).

It seems the recovery speed can't keep up with the rate at which new objects are generated.

I had to stop all the writes so the cluster could recover.

If a PG is in a degraded state, will all new data written into that PG also be degraded?
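For what it's worth, the balance between client I/O and recovery traffic is tunable; a hedged sketch of raising the recovery limits at runtime so recovery can catch up (the values are illustrative, not recommendations, and higher limits will eat into client performance):

# check current limits on one OSD (run on the node hosting it, via the admin socket)
ceph daemon osd.4 config get osd_max_backfills
ceph daemon osd.4 config get osd_recovery_max_active
# temporarily raise them cluster-wide
ceph tell osd.* injectargs '--osd_max_backfills 4 --osd_recovery_max_active 8'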
