3-node cluster recovery in case of one node failure
jeenode
27 Posts
March 25, 2018, 11:55 pm
Hi,
In case of a 3-node cluster where each node is both a mon and an OSD, if one node goes down, is Ceph expected to recover on the two surviving nodes?
When I try this scenario, Ceph looks stuck in a degraded state and doesn't start recovery.
Is this expected?
admin
2,930 Posts
March 26, 2018, 8:59 am
If you have 3 replicas but only 2 nodes up, the cluster will remain in a degraded state. It is still functioning and serving I/O, but it will be stuck in recovery. Ceph does not put more than 1 data replica on any 1 server, so that if that server dies you lose no more than 1 replica. So in this case it needs at least 3 nodes up to distribute the replicas and report itself clean.
Last edited on March 26, 2018, 9:03 am by admin · #2
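For reference, a quick way to confirm both the replica count and the host-level failure domain described above (a minimal sketch, assuming the default pool name rbd and the default cluster name):
# size = number of replicas, min_size = replicas needed to keep serving I/O
ceph osd pool get rbd size
ceph osd pool get rbd min_size
# the replicated CRUSH rule should contain a chooseleaf step of type "host"
ceph osd crush rule dump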
jeenode
27 Posts
March 26, 2018, 1:34 pm
Sorry, I should have said that I have 2 replicas; that's why I expected it to start recovery.
admin
2,930 Posts
March 26, 2018, 2:15 pm
With 2 replicas it should fully recover with 2 nodes up.
Do you have any OSDs down on the 2 up nodes?
Do you have enough space on your existing disks to store the extra replica?
Can you list ceph status and ceph health?
Last edited on March 26, 2018, 2:18 pm by admin · #4
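A quick way to run those checks from the command line (a minimal sketch, assuming the default cluster name so --cluster can be omitted):
# up/in counts for all OSDs
ceph osd stat
# per-OSD utilization, to confirm there is room for the extra replica
ceph osd df
# overall raw and per-pool usage
ceph df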
jeenode
27 Posts
March 26, 2018, 3:00 pm
Here is what I do:
root@hqlonceph1:~# systemctl stop ceph-mon.target
root@hqlonceph1:~# systemctl stop ceph-osd.target
And then it seems to be stuck in:
root@hqlonceph1:~# ceph status
cluster:
id: 3ae11f18-2081-480a-942b-1ce6befc8ab7
health: HEALTH_WARN
5 osds down
1 host (5 osds) down
Degraded data redundancy: 189300/556730 objects degraded (34.002%), 171 pgs unclean, 174 pgs degraded
1/3 mons down, quorum hqlonceph2,hqlonceph3
services:
mon: 3 daemons, quorum hqlonceph2,hqlonceph3, out of quorum: hqlonceph1
mgr: hqlonceph3(active), standbys: hqlonceph1, hqlonceph2
osd: 15 osds: 10 up, 15 in
data:
pools: 1 pools, 256 pgs
objects: 271k objects, 1086 GB
usage: 2176 GB used, 14574 GB / 16750 GB avail
pgs: 189300/556730 objects degraded (34.002%)
174 active+undersized+degraded
82 active+clean
io:
client: 1535 B/s rd, 1535 B/s wr, 1 op/s rd, 2 op/s wr
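While waiting for the cluster to react, progress can be followed live from any mon node (a small sketch; the 5-second polling interval is arbitrary):
# stream cluster log and health changes as they happen
ceph -w
# or poll the status summary every few seconds
watch -n 5 ceph status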
jeenode
27 Posts
March 26, 2018, 3:01 pm
Forgot to add that if I bring the mon back up, but not the OSDs, it does start recovery:
root@hqlonceph1:~# ceph health
HEALTH_WARN 5 osds down; 1 host (5 osds) down; Degraded data redundancy: 189300/556730 objects degraded (34.002%), 174 pgs unclean, 174 pgs degraded, 174 pgs undersized; 1/3 mons down, quorum hqlonceph2,hqlonceph3
root@hqlonceph1:~# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 16.63734 root default
-3 5.53246 host hqlonceph1
0 hdd 1.11049 osd.0 down 1.00000 1.00000
1 hdd 1.11049 osd.1 down 1.00000 1.00000
2 hdd 1.11049 osd.2 down 1.00000 1.00000
3 hdd 1.11049 osd.3 down 1.00000 1.00000
12 hdd 1.09050 osd.12 down 1.00000 1.00000
-5 5.55244 host hqlonceph2
4 hdd 1.11049 osd.4 up 1.00000 1.00000
5 hdd 1.11049 osd.5 up 1.00000 1.00000
6 hdd 1.11049 osd.6 up 1.00000 1.00000
7 hdd 1.11049 osd.7 up 1.00000 1.00000
13 hdd 1.11049 osd.13 up 1.00000 1.00000
-7 5.55244 host hqlonceph3
8 hdd 1.11049 osd.8 up 1.00000 1.00000
9 hdd 1.11049 osd.9 up 1.00000 1.00000
10 hdd 1.11049 osd.10 up 1.00000 1.00000
11 hdd 1.11049 osd.11 up 1.00000 1.00000
14 hdd 1.11049 osd.14 up 1.00000 1.00000
admin
2,930 Posts
March 26, 2018, 3:29 pm
If you stop the service on one of the up OSDs and then restart it, does that trigger recovery?
Can you get the output of:
ceph osd pool get rbd size --cluster CLUSTER_NAME
ceph --show-config --cluster CLUSTER_NAME | grep mon_osd_min_in_ratio
ceph --show-config --cluster CLUSTER_NAME | grep osd_backfill_full_ratio
ceph health detail --cluster CLUSTER_NAME
ceph pg dump_stuck unclean --cluster CLUSTER_NAME
ceph pg STUCK_PG_NUM query --cluster CLUSTER_NAME
jeenode
27 Posts
March 26, 2018, 3:30 pm
It looks like I was too impatient 🙂
It did start recovery about 10 minutes after I took the services down...
Is that delay a configurable option in Ceph, do you know?
admin
2,930 Posts
March 26, 2018, 3:45 pm
Excellent 🙂
I believe you can change it via
osd_recovery_delay_start
but I would not recommend changing it.
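For reference, the timers involved here can be inspected without changing anything (a sketch, assuming the default cluster name). Note that mon_osd_down_out_interval defaults to 600 seconds, which would line up with a roughly 10-minute wait before down OSDs are marked out and recovery to the surviving nodes begins:
# delay before an OSD kicks off recovery after peering
ceph --show-config | grep osd_recovery_delay_start
# how long a down OSD stays "in" before being marked out (default 600 s)
ceph --show-config | grep mon_osd_down_out_interval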
shadowlin
67 Posts
March 27, 2018, 4:00 am
How is the recovery speed?
I ran into a situation in a small cluster where one of the OSDs went down (not out) and some PGs became degraded.
The degraded object count kept increasing as clients were still writing data to the cluster.
Even after the down OSD came back up, the degraded object count still kept increasing (after most of the degraded objects were recovered because the OSD came up again).
It seems the recovery speed can't keep up with the speed at which new objects are generated.
I had to stop all the writing so the cluster could recover.
If a PG is in a degraded state, will all new data written into that PG also be degraded?
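If recovery cannot keep pace with incoming client writes, the recovery throttles can be inspected and, cautiously, raised at runtime (a sketch only, not a recommendation; option names and defaults vary between Ceph releases, and raising them takes I/O away from clients):
# current throttle values
ceph --show-config | grep osd_max_backfills
ceph --show-config | grep osd_recovery_max_active
# raise them temporarily on all OSDs (reverts when the OSDs restart)
ceph tell osd.* injectargs '--osd-max-backfills 2 --osd-recovery-max-active 4'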