
One node fails, the cluster fails.

I have noticed that when a node fails, whether from a reboot for testing or maintenance or simply a crash, the entire cluster fails.

The virtual machines stay frozen until all the nodes are back up.

And if one node's downtime is too long, the cluster has difficulty recovering when that node comes back.

 

Is this behavior expected, or is there something that can be done?

My setup: 3 nodes.


No, this is certainly not the correct behavior. If a node fails, under ESXi with the default timeouts you should see approximately a 25-second pause before I/O resumes.

If you can, run the following command:

ceph status --cluster CLUSTER_NAME

Run it on any node other than the one you shut down, and post the output:

  • Before you crash the node
  • After you crash the node
  • 10 min after you crash the node
  • After you restart the node
  • 10 min after you restart the node

You run the command by SSH-ing to the node, using username root and your cluster password.
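
For example, assuming a management address of NODE_IP for one of the surviving nodes (NODE_IP is a placeholder here, just like CLUSTER_NAME above):

ssh root@NODE_IP "ceph status --cluster CLUSTER_NAME"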

Also, for the hardware: are all 3 nodes similar hardware? Can you tell me the RAM, plus how many NICs and their speed? Also, after you crash the node and things are not working, can you please SSH to the 2 working nodes and run

atop

and observe:

  • How much free RAM
  • % Utilization for CPU
  • % Utilization for all disks ( system + OSDs )
  • % Utilization for NICs
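
If it is difficult to watch atop live during the freeze, one option (a sketch; the stock atop package supports raw logging) is to record samples during the test and replay them afterwards:

atop -w /tmp/atop_crash.raw 10 (writes a sample every 10 seconds to a raw log file)

atop -r /tmp/atop_crash.raw (replays the recording later; press 't' / 'T' to step forward/back through the samples)

The RAM, CPU, disk and NIC figures above can then be read from the samples covering the freeze window.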

Ceph does need resources, especially when you fail nodes and it tries to self-heal and serve I/O at the same time. For example, if you do not have free RAM, things will seem to freeze.

Before following these directions, I changed the iSCSI disk settings to 3 paths, set the round-robin IOPS to 1, and increased the memory on 2 of the nodes to 4G.
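
For reference, the round-robin IOPS=1 part of those changes can be made per device with esxcli (a sketch; naa.xxxxxxxx is a placeholder for the iSCSI device identifier, which esxcli storage nmp device list will show):

esxcli storage nmp device set --device naa.xxxxxxxx --psp VMW_PSP_RR

esxcli storage nmp psp roundrobin deviceconfig set --device naa.xxxxxxxx --type iops --iops 1

The first line sets the path selection policy to round robin; the second makes ESXi switch paths after every single I/O.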

After these changes, the virtual machines no longer freeze immediately after a node crash. After 5 minutes there is a freeze of about 2 minutes.

But I also do not have file system corruption.

 

Some cluster information:

3 nodes

node 1 - 64G RAM - 232GB disk - Xeon Quad - 2 Gigabit NICs

node 2 - 4G RAM - 120GB disk - Xeon Quad - 2 Gigabit NICs

node 3 - 4G RAM - 120GB disk - Xeon Quad - 2 Gigabit NICs

 

>>Before you crash the node

cluster 9f99e76f-1f50-4aa3-b876-dbf194a3cadf

health HEALTH_WARN

too many PGs per OSD (666 > max 500)

monmap e3: 3 mons at {san01=10.0.10.1:6789/0,san02=10.0.10.2:6789/0,san03=10.0.10.3:6789/0}

election epoch 172, quorum 0,1,2 san01,san02,san03

osdmap e337: 3 osds: 3 up, 3 in

flags sortbitwise,require_jewel_osds

pgmap v182817: 1000 pgs, 1 pools, 40803 MB data, 10298 objects

81378 MB used, 378 GB / 457 GB avail

1000 active+clean

client io 911 kB/s rd, 61743 B/s wr, 59 op/s rd, 22 op/s wr

 

 

>>After you crash the node

cluster 9f99e76f-1f50-4aa3-b876-dbf194a3cadf

health HEALTH_WARN

584 pgs degraded

374 pgs stuck unclean

584 pgs undersized

recovery 6035/20596 objects degraded (29.302%)

1/3 in osds are down

1 mons down, quorum 0,2 san01,san03

monmap e3: 3 mons at {san01=10.0.10.1:6789/0,san02=10.0.10.2:6789/0,san03=10.0.10.3:6789/0}

election epoch 174, quorum 0,2 san01,san03

osdmap e339: 3 osds: 2 up, 3 in; 584 remapped pgs

flags sortbitwise,require_jewel_osds

pgmap v182850: 1000 pgs, 1 pools, 40804 MB data, 10298 objects

81374 MB used, 378 GB / 457 GB avail

6035/20596 objects degraded (29.302%)

584 active+undersized+degraded

416 active+clean

client io 199 kB/s rd, 13909 B/s wr, 21 op/s rd, 6 op/s wr

 

>>4 min: frozen for 2 minutes, but even so no crash or corruption

 

cluster 9f99e76f-1f50-4aa3-b876-dbf194a3cadf

health HEALTH_ERR

435 pgs are stuck inactive for more than 300 seconds

519 pgs degraded

65 pgs peering

435 pgs stuck inactive

112 pgs stuck unclean

203 pgs undersized

5 requests are blocked > 32 sec

recovery 1236/20600 objects degraded (6.000%)

recovery 1236/20600 objects misplaced (6.000%)

too many PGs per OSD (1000 > max 500)

1 mons down, quorum 0,2 san01,san03

monmap e3: 3 mons at {san01=10.0.10.1:6789/0,san02=10.0.10.2:6789/0,san03=10.0.10.3:6789/0}

election epoch 174, quorum 0,2 san01,san03

osdmap e342: 3 osds: 2 up, 2 in; 203 remapped pgs

flags sortbitwise,require_jewel_osds

pgmap v182968: 1000 pgs, 1 pools, 40811 MB data, 10300 objects

57541 MB used, 286 GB / 342 GB avail

1236/20600 objects degraded (6.000%)

1236/20600 objects misplaced (6.000%)

416 active+clean

316 activating+degraded

112 active+undersized+degraded+remapped

91 activating+undersized+degraded+remapped

65 peering

client io 2677 B/s rd, 1216 B/s wr, 2 op/s rd, 2 op/s wr

 

>>10 min after you crash the node

cluster 9f99e76f-1f50-4aa3-b876-dbf194a3cadf

health HEALTH_WARN

201 pgs backfill_wait

495 pgs degraded

2 pgs recovering

292 pgs recovery_wait

495 pgs stuck unclean

201 pgs undersized

recovery 7937/20602 objects degraded (38.525%)

recovery 2277/20602 objects misplaced (11.052%)

too many PGs per OSD (1000 > max 500)

1 mons down, quorum 0,2 san01,san03

monmap e3: 3 mons at {san01=10.0.10.1:6789/0,san02=10.0.10.2:6789/0,san03=10.0.10.3:6789/0}

election epoch 174, quorum 0,2 san01,san03

osdmap e346: 3 osds: 2 up, 2 in; 201 remapped pgs

flags sortbitwise,require_jewel_osds

pgmap v183128: 1000 pgs, 1 pools, 40814 MB data, 10301 objects

61346 MB used, 282 GB / 342 GB avail

7937/20602 objects degraded (38.525%)

2277/20602 objects misplaced (11.052%)

505 active+clean

292 active+recovery_wait+degraded

201 active+undersized+degraded+remapped+wait_backfill

2 active+recovering+degraded

recovery io 21153 kB/s, 5 objects/s

client io 79515 B/s rd, 125 kB/s wr, 4 op/s rd, 41 op/s wr

 

>> 30 minutes later the cluster completely stopped.

 

cluster 9f99e76f-1f50-4aa3-b876-dbf194a3cadf

health HEALTH_WARN

203 pgs backfill_wait

358 pgs degraded

2 pgs recovering

153 pgs recovery_wait

358 pgs stuck unclean

203 pgs undersized

13 requests are blocked > 32 sec

recovery 5303/20630 objects degraded (25.705%)

recovery 2313/20630 objects misplaced (11.212%)

too many PGs per OSD (1000 > max 500)

1 mons down, quorum 0,2 san01,san03

monmap e3: 3 mons at {san01=10.0.10.1:6789/0,san02=10.0.10.2:6789/0,san03=10.0.10.3:6789/0}

election epoch 182, quorum 0,2 san01,san03

osdmap e373: 3 osds: 2 up, 2 in; 203 remapped pgs

flags sortbitwise,require_jewel_osds

pgmap v185083: 1000 pgs, 1 pools, 40870 MB data, 10315 objects

66545 MB used, 277 GB / 342 GB avail

5303/20630 objects degraded (25.705%)

2313/20630 objects misplaced (11.212%)

642 active+clean

203 active+undersized+degraded+remapped+wait_backfill

153 active+recovery_wait+degraded

2 active+recovering+degraded

client io 4 B/s rd, 0 op/s rd, 0 op/s wr

It is probably a resource issue. I understand you did increase the RAM to 4G and things were a bit better... is this correct? Also, the fact that things started to freeze after 5 min suggests it is related to the Ceph recovery putting stress on the system. After 5 min Ceph will declare the OSD on the shut-down node as down and will start the recovery process: since some data objects now have only 1 replica, Ceph will recreate the second replica by copying data between your 2 up nodes. When 1 node fails in a 3-node cluster, a third of all existing cluster storage needs to be replicated, and since your nodes have 1 OSD each, all of this happens on a single OSD disk.
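
If you want to watch this transition as it happens, one way (a sketch using standard Ceph commands) is to leave a watch running on one of the surviving nodes while the failed node is down:

ceph -w --cluster CLUSTER_NAME (streams cluster events; you will see the OSD flagged down and, a few minutes later, the recovery starting)

ceph osd tree --cluster CLUSTER_NAME (shows which OSDs are up or down at any given moment)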

This is what I suggest:

If you can, increase your RAM from 4G to 8G and see if this fixes it; we have seen many cases ourselves of stuck recoveries due to limited RAM. Also, if this is a testing environment and you can re-run clean, I would recommend you add another OSD and have 2 OSDs with 8G RAM.

The above is the easiest way to check this. Otherwise, I suggest you measure the resources I mentioned in my earlier post using atop during the period of the ESXi freeze on both storage nodes, then we can take it further.

>>It is probably a resource issue. I understand you did increase the RAM to 4G and things were a bit better... is this correct?

Yes. Correct.

 

>>If you can, increase your RAM from 4G to 8G and see if this fixes it.

OK, I will do it now.

 

>>Also, if this is a testing environment and you can re-run clean, I would recommend you add another OSD and have 2 OSDs with 8G RAM.

I will do this in a few days; I bought more disks and I'm hoping to get to it.

 

>>The above is the easiest way to check this. Otherwise, I suggest you measure the resources I mentioned in my earlier post using atop during the period of the ESXi freeze on both storage nodes, then we can take it further.

 

I'm still unsure whether the problem is Ceph or iSCSI. To avoid adding more uncertainty, I will redo the tests with 8G RAM before deciding anything on this point.

 

After upgrading from 4G to 8G of RAM, the cluster no longer fails when a node crashes.

I think it was really a memory problem.

Two problems I still cannot solve:

1 - Virtual machines freeze for two minutes when Ceph recovery starts, 5 minutes after a node crash.

2 - Sometimes an ESXi server loses its iSCSI storage, and it does not come back until a cluster node is restarted.


Very good, this was what I expected. Although things are much better, we still have the issue that the Ceph recovery (using the Ceph default configuration) is putting too much load on your existing hardware. Note that your hardware setup is still much lower than what we recommend in our guides. We have solved the RAM issue, but there are other resources that could be the "new" bottleneck: CPU, disk, or network. In your case I suspect it is the disk, since you only have 1 per node. It would be good if you can run the atop command during the freeze as mentioned earlier; I suspect the single disk will be near 100% busy.
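
As a cross-check on the disk suspicion, if the sysstat package happens to be installed on the storage nodes (an assumption; atop alone is enough otherwise), the per-disk utilization can be watched directly during the freeze:

iostat -x 5 (a %util column near 100% on the OSD disk confirms it is the bottleneck)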

There are 2 things that could be done:

Add more disks; this is not for capacity but for performance. Try to have your disks be the same capacity (I noticed you have different sizes) so the load will be distributed evenly; this is especially true in your case, where you have a very small number of disks in the cluster.

Lower the load/priority of the Ceph recovery process from its default values. Add the following configuration to /etc/ceph/CLUSTER_NAME.conf under the [global] section on all nodes, then reboot:

osd_max_backfills = 1

osd_recovery_max_active = 1

osd_recovery_threads = 1

osd_recovery_op_priority = 1

osd_client_op_priority = 63

osd_max_scrubs = 1

osd_scrub_during_recovery = false

osd_scrub_priority = 1

Either do the first, or both. If you still have issues after adding a couple of disks, then please run the atop command and post a screenshot.
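
As a side note, if rebooting all nodes is inconvenient, most of these recovery settings can also be pushed into the running OSDs at runtime (a sketch; values injected this way do not persist across restarts, and some of them, such as the thread count, may still need an OSD restart to fully take effect):

ceph --cluster CLUSTER_NAME tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'

Keep the same settings in /etc/ceph/CLUSTER_NAME.conf as above so they survive the next reboot.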