Pool inactive after OSD failure
trexman
60 Posts
October 16, 2019, 2:58 pm
Hello,
we are in the testing phase of our new PetaSAN.
During this phase I found a strange problem:
If I pull 2 drives from 2 different nodes, the pool goes inactive, the iSCSI disk disappears and the ESXi host behind it loses its datastore (of course).
After the 2 OSDs have been in the OUT state (10 minutes) and Ceph has recovered the pool to healthy, the pool becomes active again and the iSCSI disk reappears.
I have now done a lot of tests:
- 1 drive
- 2 drives from the same node
- 2 drives from different nodes, waiting 10 or 12 minutes in between
- powering down one node
Only when pulling 2 or more drives from different nodes do we get these 10 minutes where the storage is "dead".
I hope this is not normal? (2 or more OSD failures within 10 minutes on different nodes is not usual, but it is not impossible.)
What can trigger this problem?
Thanks
PS: I forgot to include a short overview of my cluster:
3x HP ProLiant DL380p Gen8 (2x Intel E5-2670, 64GB RAM)
HP Smart Array P420i (RAID 0)
15x 2TB SSDs total (5 SSDs per node)
Pool:
- replicated
- Rule "by-host-ssd"
- Size 3
- Min size 2
- 512 PGs
Last edited on October 16, 2019, 3:14 pm by trexman · #1
admin
2,930 Posts
October 16, 2019, 3:40 pm
Yes, this is the default behavior. Note that it happens when you have 2 simultaneous failures on different nodes for a replica 3 pool.
If you want to keep the pool active (not recommended), you can lower the min size in your pool settings from 2 to 1, meaning the pool will accept io with just 1 replica active and no redundancy (not recommended). The problem is that if you then lose a third disk, all your new data will be lost, but it is your decision if you want to do this. Typically the min size is best left at replica size - 1.
You can also lower the 10 min delay before recovery starts kicking in, but the data still needs to be replicated to min size (2) before the pool becomes active.
You could also think of creating more replicas, or use an EC pool with more redundancy at the expense of performance.
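As a rough sketch of the commands involved (the pool name "MyPool" is only an example, and the exact config mechanism depends on your Ceph release, so verify against your own cluster before changing anything):

# show the current replica settings for a pool
ceph osd pool get MyPool size
ceph osd pool get MyPool min_size

# allow io with only 1 replica left -- no redundancy, not recommended
ceph osd pool set MyPool min_size 1

# lower the default 10 min (600 s) wait before a down OSD is marked out
# and recovery starts, e.g. to 5 minutes
ceph config set mon mon_osd_down_out_interval 300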
Last edited on October 16, 2019, 4:01 pm by admin · #2
trexman
60 Posts
October 17, 2019, 11:59 am
Hi,
thanks for the clarification.
I'm with you that reducing the min size to 1 is not a good idea.
But to get my mind straight about the best way forward for our setup:
The problem is that we use 3 replicas with the crush rule "step chooseleaf firstn 0 type host" (which is useful, of course).
- If we create more replicas (Size=4 and Min size=3), then we also need 4 storage nodes, right? Otherwise nothing changes if 2 OSDs on 2 different nodes fail?
- The performance loss of an EC pool is in my opinion too high (right now the performance of our productive PetaSAN is OK compared to our old RAID 10 NAS).
So to me, reducing the recovery delay from 10 min to 5 sounds reasonable. (It becomes more and more unlikely that 2 OSDs fail within 5 min.)
Or is there a reason not to change this?
Another point I just discovered (I don't know if this matters):
You still get this problem (pool inactive) if you pull two OSDs while in maintenance mode (noout).
So even in a maintenance case you should not remove 2 OSDs from different nodes? Or should we use noout and nodown?
Thank you.
Last edited on October 17, 2019, 12:01 pm by trexman · #3
admin
2,930 Posts
October 17, 2019, 1:17 pm
If you have 4 nodes, you can create a replica x 4 pool and set your min size = 2, so if 2 nodes fail you still have 2 replicas functioning and your client io will stay active.
As you pointed out, the "failure domain" in your crush rule is at the host level. In larger setups you can set up a failure domain at rack (or room, etc.) level, in which case you can tolerate 2 complete rack (or room) failures while your io stays active.
I recommend not changing the timeout; you do not want the cluster to be so sensitive that it overreacts and kicks off recovery too quickly. If you do need to survive 2 host failures and still stay active in a safe way, size=4 and min_size=2 with 4 nodes is the way to go. You should not change the maintenance settings, as this will stop recovery, which is not what you want.
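A minimal sketch of what that would look like once a fourth node is in place (the pool, node and bucket names below are examples only):

# with 4 nodes: survive 2 simultaneous host failures and stay active
ceph osd pool set MyPool size 4
ceph osd pool set MyPool min_size 2

# in larger setups, a rack-level failure domain rule could look like this
ceph osd crush add-bucket rack1 rack
ceph osd crush move node1 rack=rack1
ceph osd crush rule create-replicated by-rack-ssd default rack ssd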
trexman
60 Posts
October 18, 2019, 3:26 pm
OK, I think I understand it.
So the only argument against decreasing the recovery delay is that the cluster would start recovery sooner whenever an OSD fails, right?
But isn't it important that the cluster gets healthy again as quickly as possible?
Also, what is the point of waiting 10 min? There is no way a failed OSD recovers by itself... in fact, in our environment, if an SSD fails the RAID controller marks it as faulty and won't reactivate it.
One scenario might be an OSD failing every 20 minutes... hmm.
OK, performance does drop every time recovery or backfill is running.
Sorry, just thinking out loud.
Or did I miss an important point? For our environment (with only 3 nodes), recovery is more important than a performance loss for a few minutes.
admin
2,930 Posts
October 18, 2019, 10:06 pm
You raise a valid point; yes, recovery will be delayed by 10 min, and everyone wants 0 delay.
In practice, you do not want to be over-sensitive. For example, the iSCSI failover timeouts recommended by industry vendors are anywhere from 20 sec to 1 min; if you set it to 1-5 sec, you will get false positives and your system may not be stable. Ceph is a large system and the decision to remap data onto new OSDs is based on consensus between different nodes. In more complex failures in large clusters, you do not want the system to be so sensitive that it becomes unstable and flaps back and forth.
Again, the correct approach is to add a replica, so the 10 min recovery delay will not be an issue.
moh
16 Posts
November 3, 2020, 4:47 pm
We have the same issue, but in our case the pool stays inactive.
We have 3 nodes with 48 OSDs, and 2 OSDs are down. The 2 down OSDs belong to Node3. We used the following pool configuration:
ceph osd pool set MyPool min_size 1
ceph osd pool set MyPool size 1
Should disk loss on the same node be such a problem?
My question is: is there a way to reactivate the inactive pool?
admin
2,930 Posts
November 3, 2020, 5:35 pm
ceph osd pool set MyPool size 1 will store each data object only once with no redundancy, so any data on the 2 down OSDs has no other copies.
moh
16 Posts
November 4, 2020, 3:54 pm
OK, so what should I do to reactivate my pool and bring the iSCSI disks back, since they are gone?
admin
2,930 Posts
November 4, 2020, 7:11 pm
To bring the pool active with its data, you will need to bring up the 2 OSDs, since you need the data on them. Otherwise, if you have a backup of all your data, you could delete the pool, create a new one, create new disks and copy the data over.
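If it is unclear which PGs are inactive and which OSDs they are waiting for, the standard Ceph diagnostics should show it (the PG id below is only a placeholder):

# list unhealthy PGs and the OSDs they need
ceph health detail
ceph pg dump_stuck inactive

# query one inactive PG in detail (replace 1.2f with a real PG id)
ceph pg 1.2f query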