replaced cache and journal
rophee
8 Posts
January 6, 2024, 12:49 am
I have a 4-node cluster with 3x replication, all HDD. I replaced the cache and journal disks with SSDs and removed all 8 OSDs on that node. I'm now trying to add the OSDs back: one OSD was added successfully, but the second is still stuck on 'Adding'. Is there anything I need to change to make this go faster?
admin
2,930 Posts
January 6, 2024, 1:29 am
"Replaced cache and journal with SSD": did you already replace them, or do you want to replace them? They were not SSDs before?
Removed all 8 OSDs: from how many nodes, 1 or all? When you did this, what was the cluster status? Were the PGs clean, recovering, or down?
What is your backfill/recovery speed set to? Do you see any high % utilisation on disk or CPU in the node stats charts?
rophee
8 Posts
January 6, 2024, 6:22 am
Before replacing the cache and journal, the cluster status was OK. I have 4 nodes and only replaced them on the first node. Backfill speed was set to 'very slow'; I changed it to 'medium' after hitting the issue. I also turned off all OSD settings except 'scrub' and 'deep scrub', thinking the cluster would correct itself. CPU is at 63%, with no high disk utilisation.
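For reference, a minimal sketch of checking and raising the recovery/backfill throttles from the CLI, assuming the UI's backfill speed presets map to the standard Ceph options (the option names are real Ceph settings, but the values below are illustrative, not the exact presets):
~# ceph config get osd osd_max_backfills            # concurrent backfill operations per OSD
~# ceph config get osd osd_recovery_max_active      # concurrent recovery ops per OSD
~# ceph config set osd osd_max_backfills 2          # example value: allow more parallel backfill
~# ceph config set osd osd_recovery_sleep_hdd 0.05  # example value: shorter sleep = faster recovery, more client impact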
admin
2,930 Posts
January 6, 2024, 9:41 am
What is the current status?
Is the issue only that your new OSD is not adding, or do you have other cluster issues after deleting the OSDs earlier?
Note: you can set scrub to off, but the other OSD settings should not be off.
Last edited on January 6, 2024, 9:48 am by admin · #4
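For context, a minimal sketch of the cluster flags that sit behind those UI switches, assuming they map to the standard Ceph OSD flags: the scrub flags only pause scrubbing, while the recovery-related flags actively block PGs from healing and should stay unset:
~# ceph osd set noscrub          # pausing scrub during maintenance is usually fine
~# ceph osd set nodeep-scrub
~# ceph osd unset nobackfill     # these three must remain unset or degraded PGs cannot recover
~# ceph osd unset norecover
~# ceph osd unset norebalance
~# ceph osd dump | grep flags    # shows which flags are currently set cluster-wide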
rophee
8 Posts
January 6, 2024, 1:55 pm
cluster:
id: b544c7c8-9f3b-4572-b999-d863ed0f99dc
health: HEALTH_WARN
nodown,noout,nobackfill,norebalance,norecover flag(s) set
Reduced data availability: 104 pgs inactive
Degraded data redundancy: 581576/3736689 objects degraded (15.564%), 286 pgs degraded, 302 pgs undersized
18 slow ops, oldest one blocked for 67422 sec, daemons [osd.19,osd.23,osd.27] have slow ops.
The cluster had no other issues before; this only started while adding the OSDs back to the node.
I've set all OSD settings to on except 'scrub' and 'deep scrub', as you noted.
I didn't check the Physical Disk List page before replying. The 2nd OSD I'm trying to add changed its 'Usage' from 'Mounted' to blank, and the plus sign reappeared in the 'Action' column. I clicked the '+' to add it again; now the OSD status is 'Adding' and 'Usage' is 'OSD9'.
#systemctl --type=service >> no ceph-osd@9.service
#/usr/bin/systemctl is-active ceph-osd@9 >> inactive
Last edited on January 6, 2024, 2:14 pm by rophee · #5
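A minimal sketch of how one could dig into why the osd.9 service never started, assuming a systemd-managed Ceph install (unit and device names may differ on a given deployment):
~# systemctl status ceph-osd@9        # does the unit exist, and why did it fail or never start?
~# journalctl -u ceph-osd@9 -n 50     # recent log lines from the OSD daemon, if it ever ran
~# ceph-volume lvm list               # was the disk actually prepared/activated as an OSD?
~# ceph osd tree | grep osd.9         # is osd.9 registered in the CRUSH map and marked up or down?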
admin
2,930 Posts
January 6, 2024, 4:44 pmQuote from admin on January 6, 2024, 4:44 pmThe primary issue is solving the inactive pgs. These were probably caused by osd deletions and other things like setting the few noxxx flags on osds.
Slow adding osd is not the issue to worry about now, it is not the cause of the problem.
You can refer to the ceph docs on solving inactive pgs.
Last edited on January 6, 2024, 4:46 pm by admin · #6
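A minimal sketch of the usual first steps for inactive PGs, assuming the standard Ceph CLI; the flags shown in the status above would also need to be cleared so recovery can actually run (the PG id below is just an example):
~# ceph osd unset nodown; ceph osd unset noout
~# ceph osd unset nobackfill; ceph osd unset norebalance; ceph osd unset norecover
~# ceph health detail               # lists the inactive/degraded PG ids
~# ceph pg dump_stuck inactive      # stuck PGs and the OSDs they map to
~# ceph pg 2.1a query               # example PG id: explains why that PG is blocked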
rophee
8 Posts
January 8, 2024, 6:08 pm
I've rebooted all the nodes, and that seemed to clear whatever process was stuck during balancing/recovery.
~# ceph -s
cluster:
id: b544c7c8-9f3b-4572-b999-d863ed0f99dc
health: HEALTH_OK
services:
mon: 3 daemons, quorum ps-node-03,ps-node-01,ps-node-02 (age 100m)
mgr: ps-node-03(active, since 2h), standbys: ps-node-02, ps-node-01
mds: 1/1 daemons up, 2 standby
osd: 32 osds: 32 up (since 88m), 32 in (since 89m); 340 remapped pgs
data:
volumes: 1/1 healthy
pools: 4 pools, 2080 pgs
objects: 1.25M objects, 4.7 TiB
usage: 16 TiB used, 12 TiB / 28 TiB avail
pgs: 1076133/3736902 objects misplaced (28.797%)
1740 active+clean
331 active+remapped+backfill_wait
9 active+remapped+backfilling
io:
client: 9.0 MiB/s rd, 9.1 MiB/s wr, 40 op/s rd, 47 op/s wr
recovery: 109 MiB/s, 27 objects/s
##
I'd like to replace the cache and journal on the 2nd node as well, since they're not SSDs. Should I wait until the cluster is fully balanced?
admin
2,930 Posts
January 8, 2024, 10:45 pm
Yes, wait until everything is balanced before replacing OSDs on other nodes.
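A minimal sketch of how to tell the rebalance is finished, assuming the standard Ceph CLI; the goal is every PG reporting active+clean with no misplaced objects before touching the next node:
~# watch ceph -s                      # misplaced % and backfilling PG counts should fall to zero
~# ceph pg stat                       # one-line summary, e.g. '2080 pgs: 2080 active+clean'
~# ceph osd safe-to-destroy osd.10    # example id: optional check before removing an old OSD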