
Health warnings after changing number of replicas

Hi all, I set up a 3-node PetaSAN 2.0.0 cluster, added an iSCSI disk, connected a client to it and copied some data. Then I changed "number of replicas" from 2 to 3 in the cluster settings, and a transition period started during which the number of clean PGs was approaching the total number of PGs. Now, after some days, the situation is stuck with this warning:

Degraded data redundancy: 33/14121 objects degraded (0.234%), 6 pgs unclean, 6 pgs degraded, 6 pgs undersized

Is there a way to fix this issue? Was changing the number of replicas a bad practice? This is a test cluster and I don't care about the data inside; I'm just playing with PetaSAN/Ceph to learn how to manage it and to see what to do and what is better not to do. 🙂

Thanks, Ste.

 

It should reach active/clean by itself, so something is wrong.

Find the 6 PGs that are unclean:

ceph health detail --cluster CLUSTER_NAME | grep unclean
ceph pg dump --cluster CLUSTER_NAME | grep unclean
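
Note that ceph pg dump prints PG states such as active+undersized+degraded rather than the literal word "unclean", so you may need to grep for degraded or undersized instead. As an alternative sketch (same CLUSTER_NAME placeholder as above), Ceph can also list stuck PGs directly:

# list PGs that are stuck in an unclean / undersized / degraded state
ceph pg dump_stuck unclean --cluster CLUSTER_NAME
ceph pg dump_stuck undersized --cluster CLUSTER_NAME
ceph pg dump_stuck degraded --cluster CLUSTER_NAME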

Choose one of them and run:

ceph pg PG_NUM query --cluster CLUSTER_NAME

It will give details on why it is stuck.

Do you have any OSDs down? Is there a risk of disks being full or near full (85%)? Note that when you increased your replica count from 2 to 3, average disk usage will grow by about 50% (one extra copy on top of the existing two).
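
For reference, per-OSD usage and overall capacity can be checked with the standard Ceph commands below (same CLUSTER_NAME placeholder; these only read cluster state):

# overall raw and per-pool usage
ceph df --cluster CLUSTER_NAME

# per-OSD utilization and weight; watch for anything approaching the 85% nearfull ratio
ceph osd df --cluster CLUSTER_NAME

# quick check that all OSDs are up and in
ceph osd stat --cluster CLUSTER_NAME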

Hi, all OSDs are up (11 up / 0 down) and disk usage is very low: 66.84 GB / 2.38 TB (2.74%). The output of the commands follows:

root@petatest01:~# ceph pg dump --cluster petatest | grep degraded
dumped all
1.124         6                  0        6         0       0 25165824  111      111 active+undersized+degraded 2018-04-09 17:29:28.496103  118'111   194:291    [7,3]          7    [7,3]              7        0'0 2018-04-05 11:56:17.194215             0'0 2018-04-05 11:56:17.194215

root@petatest01:~# ceph pg 1.124 query --cluster petatest
{
"state": "active+undersized+degraded",
"snap_trimq": "[]",
"epoch": 194,
"up": [
7,
3
],
"acting": [
7,
3
],
"actingbackfill": [
"3",
"7"
],
"info": {
"pgid": "1.124",
"last_update": "118'111",
"last_complete": "118'111",
"log_tail": "0'0",
"last_user_version": 111,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [
{
"start": "1",
"length": "3"
}
],
"history": {
"epoch_created": 50,
"epoch_pool_created": 50,
"last_epoch_started": 194,
"last_interval_started": 193,
"last_epoch_clean": 107,
"last_interval_clean": 106,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 193,
"same_interval_since": 193,
"same_primary_since": 189,
"last_scrub": "0'0",
"last_scrub_stamp": "2018-04-05 11:56:17.194215",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2018-04-05 11:56:17.194215",
"last_clean_scrub_stamp": "2018-04-05 11:56:17.194215"
},
"stats": {
"version": "118'111",
"reported_seq": "292",
"reported_epoch": "194",
"state": "active+undersized+degraded",
"last_fresh": "2018-04-09 17:29:29.009300",
"last_change": "2018-04-09 17:29:28.496103",
"last_active": "2018-04-09 17:29:29.009300",
"last_peered": "2018-04-09 17:29:29.009300",
"last_clean": "2018-04-06 18:02:04.441860",
"last_became_active": "2018-04-09 17:29:28.496103",
"last_became_peered": "2018-04-09 17:29:28.496103",
"last_unstale": "2018-04-09 17:29:29.009300",
"last_undegraded": "2018-04-09 17:29:27.638571",
"last_fullsized": "2018-04-09 17:29:27.638571",
"mapping_epoch": 193,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 50,
"last_epoch_clean": 107,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "2018-04-05 11:56:17.194215",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2018-04-05 11:56:17.194215",
"last_clean_scrub_stamp": "2018-04-05 11:56:17.194215",
"log_size": 111,
"ondisk_log_size": 111,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"stat_sum": {
"num_bytes": 25165824,
"num_objects": 6,
"num_object_clones": 0,
"num_object_copies": 18,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 6,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 6,
"num_whiteouts": 0,
"num_read": 2,
"num_read_kb": 16,
"num_write": 222,
"num_write_kb": 24575,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0
},
"up": [
7,
3
],
"acting": [
7,
3
],
"blocked_by": [],
"up_primary": 7,
"acting_primary": 7
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 194,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
},
"peer_info": [
{
"peer": "3",
"pgid": "1.124",
"last_update": "118'111",
"last_complete": "118'111",
"log_tail": "0'0",
"last_user_version": 111,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 50,
"epoch_pool_created": 50,
"last_epoch_started": 194,
"last_interval_started": 193,
"last_epoch_clean": 107,
"last_interval_clean": 106,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 193,
"same_interval_since": 193,
"same_primary_since": 189,
"last_scrub": "0'0",
"last_scrub_stamp": "2018-04-05 11:56:17.194215",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2018-04-05 11:56:17.194215",
"last_clean_scrub_stamp": "2018-04-05 11:56:17.194215"
},
"stats": {
"version": "118'111",
"reported_seq": "158",
"reported_epoch": "167",
"state": "undersized+degraded+peered",
"last_fresh": "2018-04-06 20:31:45.747376",
"last_change": "2018-04-06 20:31:45.746970",
"last_active": "2018-04-06 20:31:44.567939",
"last_peered": "2018-04-06 20:31:45.747376",
"last_clean": "2018-04-06 17:32:09.807794",
"last_became_active": "2018-04-05 16:08:27.282054",
"last_became_peered": "2018-04-06 20:31:45.746970",
"last_unstale": "2018-04-06 20:31:45.747376",
"last_undegraded": "2018-04-06 20:31:45.738409",
"last_fullsized": "2018-04-06 20:31:45.738409",
"mapping_epoch": 193,
"log_start": "0'0",
"ondisk_log_start": "0'0",
"created": 50,
"last_epoch_clean": 107,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "0'0",
"last_scrub_stamp": "2018-04-05 11:56:17.194215",
"last_deep_scrub": "0'0",
"last_deep_scrub_stamp": "2018-04-05 11:56:17.194215",
"last_clean_scrub_stamp": "2018-04-05 11:56:17.194215",
"log_size": 111,
"ondisk_log_size": 111,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"stat_sum": {
"num_bytes": 25165824,
"num_objects": 6,
"num_object_clones": 0,
"num_object_copies": 18,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 12,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 6,
"num_whiteouts": 0,
"num_read": 2,
"num_read_kb": 16,
"num_write": 222,
"num_write_kb": 24575,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0
},
"up": [
7,
3
],
"acting": [
7,
3
],
"blocked_by": [],
"up_primary": 7,
"acting_primary": 7
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 194,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}
],
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2018-04-09 17:29:27.638611",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"pull_from_peer": [],
"pushing": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.seed": 0,
"scrubber.waiting_on": 0,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2018-04-09 17:29:26.024602"
}
],
"agent_state": {}
}

I'm not sure if this is the point, but in "up" and "acting" there are only two numbers, while there should be 3 copies, correct? And why?

In case it is useful: the iSCSI disk is 100 GB, two hosts have 4 OSDs with a 280 GB disk each, while the 3rd node has 3 OSDs with a 68 GB disk each.
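
For completeness, the pool's configured replica count can be double-checked directly; this is a sketch using the cluster name from this thread and POOL_NAME as a placeholder for the iSCSI pool:

# show the configured number of replicas and the minimum required for I/O
ceph osd pool get POOL_NAME size --cluster petatest
ceph osd pool get POOL_NAME min_size --cluster petatest

If size is 3, the two OSDs listed in "up"/"acting" confirm that CRUSH is only finding two valid placements for this PG.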

There are no hard errors in the pg info, but from

 two hosts have 4 OSDs with a 280 GB disk each, while the 3rd node has 3 OSDs with a 68 GB disk each.

I suspect this is too much weight imbalance given you only have 3 nodes and need 3 copies. It is possible the Ceph CRUSH placement algorithm is failing to find placements for all 3 replicas of these 6 PGs based on the current weights.
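
One way to see this is to compare the per-host CRUSH weights; as a sketch, using the cluster name from this thread:

# show the CRUSH hierarchy with per-host and per-OSD weights;
# with size 3 and only three hosts, each host must hold one copy,
# so a host with a much smaller total weight can cause some placements to fail
ceph osd tree --cluster petatest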

Can you physically move 1 OSD disk from either of the first 2 nodes and place it in the third?

Unfortunately they are completely different: the 280 GB disks are 3.5" SCSI, while the 68 GB ones are 2.5" SAS. 🙂

Anyway, I don't care about the data for now; I just want to understand what can happen once in production and how I can avoid or fix it. Thank you for the hints. 😉