
3-node cluster recovery in case of one node failure


Hi,

In case of a 3-node cluster where each node is both a mon and an OSD, if one node is down, is Ceph expected to recover on the two surviving nodes?

When I try this scenario, it looks like Ceph gets stuck in a degraded state and doesn't start recovery.

Is this expected?

If you have 3 replicas but only 2 nodes up, the cluster will remain in a degraded state. It is still functioning and serving I/O, but it will be stuck in recovery. Ceph does not put more than one replica of a piece of data on any one server, so that if that server dies you lose no more than one replica. So in this case it needs at least 3 nodes up to distribute the replicas and report itself clean.
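As a quick sanity check on this point, the replica count and minimum replica count per pool can be read back with the standard pool-get commands (a minimal sketch, assuming the pool is named rbd as in the commands further down this thread):

# number of replicas the pool wants, and the minimum it will serve I/O with
ceph osd pool get rbd size
ceph osd pool get rbd min_size
# the CRUSH rule shows the failure domain (typically "host") used to spread replicas
ceph osd crush rule dump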

Sorry, I should have said that I have 2 replicas; that's why I expected it to start recovery.

With 2 replicas it should fully recover with 2 nodes up.

Do you have any OSDs down on the 2 up nodes?

Do you have enough space on your existing disks to store the extra replica?

Can you list ceph status and ceph health?
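For the first two questions above, a quick sketch using standard Ceph commands:

# any OSDs down, and where they sit in the CRUSH tree
ceph osd tree
# overall and per-pool usage
ceph df
# per-OSD utilisation, handy for spotting nearly full disks
ceph osd df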

Here is what I do:

root@hqlonceph1:~# systemctl stop ceph-mon.target
root@hqlonceph1:~# systemctl stop ceph-osd.target

And then it seems to be stuck in:

root@hqlonceph1:~# ceph status
  cluster:
    id:     3ae11f18-2081-480a-942b-1ce6befc8ab7
    health: HEALTH_WARN
            5 osds down
            1 host (5 osds) down
            Degraded data redundancy: 189300/556730 objects degraded (34.002%), 171 pgs unclean, 174 pgs degraded
            1/3 mons down, quorum hqlonceph2,hqlonceph3

  services:
    mon: 3 daemons, quorum hqlonceph2,hqlonceph3, out of quorum: hqlonceph1
    mgr: hqlonceph3(active), standbys: hqlonceph1, hqlonceph2
    osd: 15 osds: 10 up, 15 in

  data:
    pools:   1 pools, 256 pgs
    objects: 271k objects, 1086 GB
    usage:   2176 GB used, 14574 GB / 16750 GB avail
    pgs:     189300/556730 objects degraded (34.002%)
             174 active+undersized+degraded
             82 active+clean

  io:
    client: 1535 B/s rd, 1535 B/s wr, 1 op/s rd, 2 op/s wr

Forgot to add that if I bring the mon back up, but not the OSDs, it does start recovery.

root@hqlonceph1:~# ceph health
HEALTH_WARN 5 osds down; 1 host (5 osds) down; Degraded data redundancy: 189300/556730 objects degraded (34.002%), 174 pgs unclean, 174 pgs degraded, 174 pgs undersized; 1/3 mons down, quorum hqlonceph2,hqlonceph3

root@hqlonceph1:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       16.63734 root default
-3        5.53246     host hqlonceph1
 0   hdd  1.11049         osd.0         down  1.00000 1.00000
 1   hdd  1.11049         osd.1         down  1.00000 1.00000
 2   hdd  1.11049         osd.2         down  1.00000 1.00000
 3   hdd  1.11049         osd.3         down  1.00000 1.00000
12   hdd  1.09050         osd.12        down  1.00000 1.00000
-5        5.55244     host hqlonceph2
 4   hdd  1.11049         osd.4           up  1.00000 1.00000
 5   hdd  1.11049         osd.5           up  1.00000 1.00000
 6   hdd  1.11049         osd.6           up  1.00000 1.00000
 7   hdd  1.11049         osd.7           up  1.00000 1.00000
13   hdd  1.11049         osd.13          up  1.00000 1.00000
-7        5.55244     host hqlonceph3
 8   hdd  1.11049         osd.8           up  1.00000 1.00000
 9   hdd  1.11049         osd.9           up  1.00000 1.00000
10   hdd  1.11049         osd.10          up  1.00000 1.00000
11   hdd  1.11049         osd.11          up  1.00000 1.00000
14   hdd  1.11049         osd.14          up  1.00000 1.00000
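One detail visible in the status above: the stopped OSDs are down but still in (osd: 15 osds: 10 up, 15 in), and Ceph only remaps and re-replicates their PGs once those OSDs are also marked out, which normally happens after a timeout. A hedged sketch of checking the in/out state and, if you don't want to wait, marking the down OSDs out by hand (OSD IDs taken from the tree above):

# show up/down and in/out state per OSD
ceph osd dump | grep '^osd'
# optionally mark the down OSDs out to start re-replication immediately
ceph osd out 0 1 2 3 12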

If you stop and then restart the service on one of the up OSDs, does it trigger recovery?

Can you get the output of:

ceph osd pool get rbd size --cluster CLUSTER_NAME
ceph --show-config --cluster CLUSTER_NAME | grep mon_osd_min_in_ratio
ceph --show-config --cluster CLUSTER_NAME | grep osd_backfill_full_ratio
ceph health detail --cluster CLUSTER_NAME
ceph pg dump_stuck unclean --cluster CLUSTER_NAME
ceph pg STUCK_PG_NUM query --cluster CLUSTER_NAME
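A minimal sketch of the stop/restart test suggested above, assuming the standard per-OSD systemd units (ceph-osd@<id>) that go with the ceph-osd.target used earlier in the thread; osd.4 is just an example ID:

# on the node hosting osd.4
systemctl restart ceph-osd@4
# then watch the cluster log to see whether recovery kicks in
ceph -w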

It looks like I was too impatient 🙂

It did start recovery about 10 minutes after I took the services down...

Is that delay a configurable option in Ceph, do you know?

Excellent 🙂

I believe you can change it via

osd_recovery_delay_start

but I would not recommend changing it.
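For reference, a hedged sketch of how to inspect that option (and change it at runtime, which as noted is usually not advisable). The roughly 10-minute wait reported above also matches the default mon_osd_down_out_interval of 600 seconds, after which down OSDs are marked out and re-replication starts, so that setting may be the one actually at play here:

# current values (--show-config matches the style used earlier in the thread)
ceph --show-config | grep osd_recovery_delay_start
ceph --show-config | grep mon_osd_down_out_interval
# runtime change on all OSDs, e.g. to remove the recovery start delay (illustrative only)
ceph tell osd.* injectargs '--osd_recovery_delay_start 0'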

How is the recovery speed?

I have run into a situation in a small cluster where one of the OSDs went down (not out) and some PGs became degraded.

The degraded object count kept increasing as clients were still writing data to the cluster.

Even after the down OSD came back up, the degraded object count still kept increasing (after most of the degraded objects had been recovered because the OSD was up again).

It seems the recovery speed can't keep up with the rate at which new objects are generated.

I had to stop all the writes so the cluster could recover.

If a PG is in a degraded state, will all new data written into that PG also be degraded?
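For what it's worth, the balance between client I/O and recovery traffic is tunable; a hedged sketch of raising the recovery limits at runtime so recovery can catch up (the values are illustrative, not recommendations, and higher limits will eat into client performance):

# check current limits on one OSD (run on the node hosting it, via the admin socket)
ceph daemon osd.4 config get osd_max_backfills
ceph daemon osd.4 config get osd_recovery_max_active
# temporarily raise them cluster-wide
ceph tell osd.* injectargs '--osd_max_backfills 4 --osd_recovery_max_active 8'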
