replaced cache and journal
rophee
8 Posts
January 6, 2024, 12:49 am
I have a 4-node cluster with 3x replication, all HDD. I replaced the cache and journal disks with SSDs and removed all 8 OSDs on that node. I'm now trying to add the OSDs back: one OSD was added successfully, but the second is still stuck on 'Adding'. Is there anything I need to change to make this go faster?
admin
2,930 Posts
January 6, 2024, 1:29 am
"Replaced cache and journal with SSD": did you already replace them, or do you want to replace them? They were not SSDs before?
Removed all 8 OSDs: from how many nodes, 1 or all? When you did this, what was the cluster status? Were the PGs clean, recovering, or down?
What is your backfill/recovery speed set to? Do you see any high % utilisation on disk or CPU in the node stats charts?
rophee
8 Posts
January 6, 2024, 6:22 am
Before replacing the cache and journal, the cluster status was OK. I have 4 nodes and only replaced them on the first node. Backfill speed was set to 'very slow'; I changed it to 'medium' after hitting the issue. I also turned off all OSD settings except 'scrub' and 'deep scrub', thinking the cluster would correct itself. CPU is at 63%, with no high disk utilisation.
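For reference, a minimal sketch of checking and raising the recovery/backfill throttles from the CLI, assuming the UI's backfill speed presets map to the standard Ceph options (the option names are real Ceph settings, but the values below are illustrative, not the exact presets):
~# ceph config get osd osd_max_backfills            # concurrent backfill operations per OSD
~# ceph config get osd osd_recovery_max_active      # concurrent recovery ops per OSD
~# ceph config set osd osd_max_backfills 2          # example value: allow more parallel backfill
~# ceph config set osd osd_recovery_sleep_hdd 0.05  # example value: shorter sleep = faster recovery, more client impact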
admin
2,930 Posts
January 6, 2024, 9:41 am
What is the current status?
Is the issue only that your new OSD is not adding, or do you have other cluster issues after deleting the OSDs earlier?
Note: you can set scrub to off, but the other OSD settings should not be off.
Last edited on January 6, 2024, 9:48 am by admin · #4
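For context, a minimal sketch of the cluster flags that sit behind those UI switches, assuming they map to the standard Ceph OSD flags: the scrub flags only pause scrubbing, while the recovery-related flags actively block PGs from healing and should stay unset:
~# ceph osd set noscrub          # pausing scrub during maintenance is usually fine
~# ceph osd set nodeep-scrub
~# ceph osd unset nobackfill     # these three must remain unset or degraded PGs cannot recover
~# ceph osd unset norecover
~# ceph osd unset norebalance
~# ceph osd dump | grep flags    # shows which flags are currently set cluster-wide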
rophee
8 Posts
January 6, 2024, 1:55 pm
cluster:
id: b544c7c8-9f3b-4572-b999-d863ed0f99dc
health: HEALTH_WARN
nodown,noout,nobackfill,norebalance,norecover flag(s) set
Reduced data availability: 104 pgs inactive
Degraded data redundancy: 581576/3736689 objects degraded (15.564%), 286 pgs degraded, 302 pgs undersized
18 slow ops, oldest one blocked for 67422 sec, daemons [osd.19,osd.23,osd.27] have slow ops.
The cluster had no other issues before; this only started while adding the OSDs back to the node.
I've set all OSD settings to on except 'scrub' and 'deep scrub', as you noted.
I didn't check the Physical Disk List page before replying. The 2nd OSD I'm trying to add changed its 'Usage' from 'Mounted' to blank, and the plus sign reappeared in the 'Action' column. I clicked the '+' to add it again; now the OSD status is 'Adding' and 'Usage' is 'OSD9'.
#systemctl --type=service >> no ceph-osd@9.service
#/usr/bin/systemctl is-active ceph-osd@9 >> inactive
Last edited on January 6, 2024, 2:14 pm by rophee · #5
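A minimal sketch of how one could dig into why the osd.9 service never started, assuming a systemd-managed Ceph install (unit and device names may differ on a given deployment):
~# systemctl status ceph-osd@9        # does the unit exist, and why did it fail or never start?
~# journalctl -u ceph-osd@9 -n 50     # recent log lines from the OSD daemon, if it ever ran
~# ceph-volume lvm list               # was the disk actually prepared/activated as an OSD?
~# ceph osd tree | grep osd.9         # is osd.9 registered in the CRUSH map and marked up or down?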
admin
2,930 Posts
January 6, 2024, 4:44 pmQuote from admin on January 6, 2024, 4:44 pmThe primary issue is solving the inactive pgs. These were probably caused by osd deletions and other things like setting the few noxxx flags on osds.
Slow adding osd is not the issue to worry about now, it is not the cause of the problem.
You can refer to the ceph docs on solving inactive pgs.
Last edited on January 6, 2024, 4:46 pm by admin · #6
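A minimal sketch of the usual first steps for inactive PGs, assuming the standard Ceph CLI; the flags shown in the status above would also need to be cleared so recovery can actually run (the PG id below is just an example):
~# ceph osd unset nodown; ceph osd unset noout
~# ceph osd unset nobackfill; ceph osd unset norebalance; ceph osd unset norecover
~# ceph health detail               # lists the inactive/degraded PG ids
~# ceph pg dump_stuck inactive      # stuck PGs and the OSDs they map to
~# ceph pg 2.1a query               # example PG id: explains why that PG is blocked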
rophee
8 Posts
January 8, 2024, 6:08 pm
I've rebooted all the nodes, and that seemed to clear whatever process was stuck during balancing/recovery.
~# ceph -s
cluster:
id: b544c7c8-9f3b-4572-b999-d863ed0f99dc
health: HEALTH_OK
services:
mon: 3 daemons, quorum ps-node-03,ps-node-01,ps-node-02 (age 100m)
mgr: ps-node-03(active, since 2h), standbys: ps-node-02, ps-node-01
mds: 1/1 daemons up, 2 standby
osd: 32 osds: 32 up (since 88m), 32 in (since 89m); 340 remapped pgs
data:
volumes: 1/1 healthy
pools: 4 pools, 2080 pgs
objects: 1.25M objects, 4.7 TiB
usage: 16 TiB used, 12 TiB / 28 TiB avail
pgs: 1076133/3736902 objects misplaced (28.797%)
1740 active+clean
331 active+remapped+backfill_wait
9 active+remapped+backfilling
io:
client: 9.0 MiB/s rd, 9.1 MiB/s wr, 40 op/s rd, 47 op/s wr
recovery: 109 MiB/s, 27 objects/s
##
I'd like to replace the cache and journal on the 2nd node as well, since they're not SSDs. Should I wait until the cluster is fully balanced?
admin
2,930 Posts
January 8, 2024, 10:45 pm
Yes, wait until everything is balanced before replacing OSDs on other nodes.
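A minimal sketch of how to tell the rebalance is finished, assuming the standard Ceph CLI; the goal is every PG reporting active+clean with no misplaced objects before touching the next node:
~# watch ceph -s                      # misplaced % and backfilling PG counts should fall to zero
~# ceph pg stat                       # one-line summary, e.g. '2080 pgs: 2080 active+clean'
~# ceph osd safe-to-destroy osd.10    # example id: optional check before removing an old OSD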