One node fails, cluster fails.
maxthetor
24 Posts
June 10, 2017, 6:16 pm
When a node fails, the cluster fails. I have noticed that when a node fails, whether from a reboot for testing or maintenance or simply from a crash, the entire cluster fails.
The virtual machines stay frozen until all the nodes are up again.
And if the downtime of one of the nodes is too long, the cluster has difficulty recovering when the node comes back.
Is this behavior expected, or is there something that can be done?
My setup has 3 nodes.
admin
2,930 Posts
June 10, 2017, 7:04 pm
No, this is certainly not the correct behavior. If a node fails, under ESXi with the default timeouts you should see approximately a 25-second pause before I/O resumes.
Please run the following command:
ceph status --cluster CLUSTER_NAME
Run it on any node other than the one you shut down, and post the output:
- Before you crash the node
- After you crash the node
- 10 min after you crash the node
- After you restart the node
- 10 min after you restart the node
You can run the command by SSHing to the node with username root and your cluster password.
Also, regarding the hardware: are all 3 nodes similar? Can you tell me the RAM, plus how many NICs and their speed? In addition, after you crash the node and things are not working, please SSH to the 2 working nodes and run
atop
and observe:
- How much free RAM
- % Utilization for CPU
- % Utilization for all disks ( system + OSDs )
- % Utilization for NICs
Ceph does need resources, especially when you fail nodes and it tries to self-heal and serve I/O at the same time. For example, if you do not have free RAM, things will seem to freeze.
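For reference, here is a minimal sketch of how these timed snapshots could be captured from a surviving node (the node IP, CLUSTER_NAME and output paths are placeholders to adjust):
# Sketch: run everything from a node that stays up, e.g. after: ssh root@<surviving-node-ip>
mkdir -p /root/ceph-status-logs
# 1) before crashing the node under test
ceph status --cluster CLUSTER_NAME > /root/ceph-status-logs/1-before-crash.txt
# 2) right after crashing the node, then 3) 10 minutes later
ceph status --cluster CLUSTER_NAME > /root/ceph-status-logs/2-after-crash.txt
sleep 600
ceph status --cluster CLUSTER_NAME > /root/ceph-status-logs/3-10min-after-crash.txt
# 4) right after restarting the node, then 5) 10 minutes later
ceph status --cluster CLUSTER_NAME > /root/ceph-status-logs/4-after-restart.txt
sleep 600
ceph status --cluster CLUSTER_NAME > /root/ceph-status-logs/5-10min-after-restart.txt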
Last edited on June 10, 2017, 7:06 pm · #2
maxthetor
24 Posts
June 13, 2017, 2:35 am
Before following your directions, I changed the iSCSI disk settings to 3 paths, set round robin IOPS to 1, and increased the memory on 2 of the nodes to 4G.
After these changes, the virtual machines no longer freeze immediately after a node crash. About 5 minutes later, there is a 2-minute freeze.
But I also see no file system corruption.
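For reference, the path policy change on the ESXi side can be applied from the ESXi shell roughly as sketched below (the naa.* device ID is a placeholder; the same settings can also be made through the vSphere client):
# Sketch: set the iSCSI device to round robin and switch paths after every 1 I/O.
# Find the device ID first with: esxcli storage nmp device list
esxcli storage nmp device set --device=naa.XXXXXXXXXXXXXXXX --psp=VMW_PSP_RR
esxcli storage nmp psp roundrobin deviceconfig set --device=naa.XXXXXXXXXXXXXXXX --type=iops --iops=1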
Some cluster information:
3 nodes
node 1 - 64G RAM - 232GB disk - quad-core Xeon - 2 Gigabit NICs
node 2 - 4G RAM - 120GB disk - quad-core Xeon - 2 Gigabit NICs
node 3 - 4G RAM - 120GB disk - quad-core Xeon - 2 Gigabit NICs
>>Before you crash the node
cluster 9f99e76f-1f50-4aa3-b876-dbf194a3cadf
health HEALTH_WARN
too many PGs per OSD (666 > max 500)
monmap e3: 3 mons at {san01=10.0.10.1:6789/0,san02=10.0.10.2:6789/0,san03=10.0.10.3:6789/0}
election epoch 172, quorum 0,1,2 san01,san02,san03
osdmap e337: 3 osds: 3 up, 3 in
flags sortbitwise,require_jewel_osds
pgmap v182817: 1000 pgs, 1 pools, 40803 MB data, 10298 objects
81378 MB used, 378 GB / 457 GB avail
1000 active+clean
client io 911 kB/s rd, 61743 B/s wr, 59 op/s rd, 22 op/s wr
>>After you crash the node
cluster 9f99e76f-1f50-4aa3-b876-dbf194a3cadf
health HEALTH_WARN
584 pgs degraded
374 pgs stuck unclean
584 pgs undersized
recovery 6035/20596 objects degraded (29.302%)
1/3 in osds are down
1 mons down, quorum 0,2 san01,san03
monmap e3: 3 mons at {san01=10.0.10.1:6789/0,san02=10.0.10.2:6789/0,san03=10.0.10.3:6789/0}
election epoch 174, quorum 0,2 san01,san03
osdmap e339: 3 osds: 2 up, 3 in; 584 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v182850: 1000 pgs, 1 pools, 40804 MB data, 10298 objects
81374 MB used, 378 GB / 457 GB avail
6035/20596 objects degraded (29.302%)
584 active+undersized+degraded
416 active+clean
client io 199 kB/s rd, 13909 B/s wr, 21 op/s rd, 6 op/s wr
>>4 min after the crash, frozen for 2 minutes, even so no crash or corruption
cluster 9f99e76f-1f50-4aa3-b876-dbf194a3cadf
health HEALTH_ERR
435 pgs are stuck inactive for more than 300 seconds
519 pgs degraded
65 pgs peering
435 pgs stuck inactive
112 pgs stuck unclean
203 pgs undersized
5 requests are blocked > 32 sec
recovery 1236/20600 objects degraded (6.000%)
recovery 1236/20600 objects misplaced (6.000%)
too many PGs per OSD (1000 > max 500)
1 mons down, quorum 0,2 san01,san03
monmap e3: 3 mons at {san01=10.0.10.1:6789/0,san02=10.0.10.2:6789/0,san03=10.0.10.3:6789/0}
election epoch 174, quorum 0,2 san01,san03
osdmap e342: 3 osds: 2 up, 2 in; 203 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v182968: 1000 pgs, 1 pools, 40811 MB data, 10300 objects
57541 MB used, 286 GB / 342 GB avail
1236/20600 objects degraded (6.000%)
1236/20600 objects misplaced (6.000%)
416 active+clean
316 activating+degraded
112 active+undersized+degraded+remapped
91 activating+undersized+degraded+remapped
65 peering
client io 2677 B/s rd, 1216 B/s wr, 2 op/s rd, 2 op/s wr
>>10 min after you crash the node
cluster 9f99e76f-1f50-4aa3-b876-dbf194a3cadf
health HEALTH_WARN
201 pgs backfill_wait
495 pgs degraded
2 pgs recovering
292 pgs recovery_wait
495 pgs stuck unclean
201 pgs undersized
recovery 7937/20602 objects degraded (38.525%)
recovery 2277/20602 objects misplaced (11.052%)
too many PGs per OSD (1000 > max 500)
1 mons down, quorum 0,2 san01,san03
monmap e3: 3 mons at {san01=10.0.10.1:6789/0,san02=10.0.10.2:6789/0,san03=10.0.10.3:6789/0}
election epoch 174, quorum 0,2 san01,san03
osdmap e346: 3 osds: 2 up, 2 in; 201 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v183128: 1000 pgs, 1 pools, 40814 MB data, 10301 objects
61346 MB used, 282 GB / 342 GB avail
7937/20602 objects degraded (38.525%)
2277/20602 objects misplaced (11.052%)
505 active+clean
292 active+recovery_wait+degraded
201 active+undersized+degraded+remapped+wait_backfill
2 active+recovering+degraded
recovery io 21153 kB/s, 5 objects/s
client io 79515 B/s rd, 125 kB/s wr, 4 op/s rd, 41 op/s wr
maxthetor
24 Posts
June 13, 2017, 3:28 am
>> 30 minutes later the cluster completely stopped.
cluster 9f99e76f-1f50-4aa3-b876-dbf194a3cadf
health HEALTH_WARN
203 pgs backfill_wait
358 pgs degraded
2 pgs recovering
153 pgs recovery_wait
358 pgs stuck unclean
203 pgs undersized
13 requests are blocked > 32 sec
recovery 5303/20630 objects degraded (25.705%)
recovery 2313/20630 objects misplaced (11.212%)
too many PGs per OSD (1000 > max 500)
1 mons down, quorum 0,2 san01,san03
monmap e3: 3 mons at {san01=10.0.10.1:6789/0,san02=10.0.10.2:6789/0,san03=10.0.10.3:6789/0}
election epoch 182, quorum 0,2 san01,san03
osdmap e373: 3 osds: 2 up, 2 in; 203 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v185083: 1000 pgs, 1 pools, 40870 MB data, 10315 objects
66545 MB used, 277 GB / 342 GB avail
5303/20630 objects degraded (25.705%)
2313/20630 objects misplaced (11.212%)
642 active+clean
203 active+undersized+degraded+remapped+wait_backfill
153 active+recovery_wait+degraded
2 active+recovering+degraded
client io 4 B/s rd, 0 op/s rd, 0 op/s wr
admin
2,930 Posts
June 13, 2017, 1:22 pm
It is probably a resource issue; I understand you did increase RAM to 4G and things were a bit better... is this correct? Also, the fact that things started to freeze after 5 min suggests it is related to Ceph recovery putting stress on the system. After 5 min, Ceph will mark the stopped OSD as out and will start the recovery process: since some data objects now have only 1 replica, Ceph will recreate the second replica by copying data to/from your 2 up nodes. When 1 node fails in a 3-node cluster, a third of all existing cluster storage needs to be replicated, and if your nodes have 1 OSD each, all of this happens on a single OSD disk.
This is what I suggest:
If you can, increase your RAM from 4G to 8G and see if this fixes it. We have seen many cases ourselves of stuck recoveries due to limited RAM. Also, if this is a testing environment and you can re-run clean, I would recommend you add another OSD and have 2 OSDs with 8G RAM.
The above is the easiest way to check this. Otherwise, I suggest you measure the resources I mentioned in my earlier post using atop during the period of the ESXi freeze on both storage nodes; then we can take it further.
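If it helps, a simple way to record the atop data during the freeze window for later review is sketched here (the interval, sample count and file path are arbitrary choices):
# Sketch: record a sample every 5 seconds for ~10 minutes on each storage node.
atop -w /root/atop_freeze.raw 5 120
# Replay the recording afterwards to inspect free RAM and CPU/disk/NIC utilization
# (use 't' to step forward and 'T' to step backward through the samples):
atop -r /root/atop_freeze.raw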
maxthetor
24 Posts
June 13, 2017, 11:06 pm
>>It is probably a resource issue; I understand you did increase RAM to 4G and things were a bit better... is this correct?
Yes. Correct.
>>If you can, increase your RAM from 4G to 8G and see if this fixes it.
OK, I will do it now.
>>Also, if this is a testing environment and you can re-run clean, I would recommend you add another OSD and have 2 OSDs with 8G RAM.
I will do this in a few days; I bought more disks and hope to get to it then.
>>The above is the easiest way to check this. Otherwise, I suggest you measure the resources I mentioned in my earlier post using atop during the period of the ESXi freeze on both storage nodes; then we can take it further.
I'm still unsure whether the problem is Ceph or iSCSI. To avoid further doubt, I will redo the tests with 8G RAM before deciding on this point.
maxthetor
24 Posts
June 15, 2017, 9:56 pm
After upgrading from 4G to 8G of RAM, the cluster does not crash when a node crashes.
I think it really was a memory problem.
Two problems I still cannot solve:
1 - Virtual machines freeze for two minutes when Ceph recovery starts, 5 minutes after a node crash.
2 - Sometimes an ESXi server loses its iSCSI storage and it does not return until a cluster node is restarted.
http://imgur.com/a/xlLFZ
admin
2,930 Posts
June 16, 2017, 12:59 am
Very good, this is what I expected. Although things are much better, we still have the issue that Ceph recovery (using the Ceph default configuration) is putting too much load on your existing hardware. Note that your hardware setup is still much lower than what we recommend in our guides. We have solved the RAM issue, but there are other resources that could be the "new" bottleneck: CPU/disk/network. In your case I suspect it is the disk, since you only have 1 per node. It would be good if you can run the atop command during the freeze as mentioned earlier; I suspect the single disk will be near 100% busy.
There are 2 things that could be done:
Add more disks; this is not about adding capacity but performance. Try to have your disks be the same capacity (I noticed you have different sizes) so the load will be distributed evenly; this is especially important in your case, where the cluster has a very small number of disks.
Lower the load/priority of the Ceph recovery process from the default values. Add the following configuration to /etc/ceph/CLUSTER_NAME.conf under the [global] section on all nodes, then reboot:
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_threads = 1
osd_recovery_op_priority = 1
osd_client_op_priority = 63
osd_max_scrubs = 1
osd_scrub_during_recovery = false
osd_scrub_priority = 1
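If rebooting all nodes right away is inconvenient, the same values can most likely also be injected into the running OSDs as sketched below (not verified on this setup; keep the entries in the conf file as well so they persist, and note that some options may only take full effect after an OSD restart):
# Sketch: push the recovery throttling values into the running OSDs without a reboot.
ceph --cluster CLUSTER_NAME tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'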
Either do the first, or both. If you still have issues after adding a couple of disks, please run the atop command and post a screenshot.
Last edited on June 16, 2017, 1:00 am · #8