Problems after adding OSD nodes

Hi,

We had 5 nodes (3 mgr, 3 metadata, 5 OSD) with 6x8TB disks/OSDs each + 2 NVMe cache devices + journal on SSD. We decided to expand with 5 more nodes of the same configuration and to add bigger disks to all nodes in the free slots, so each node got an additional 4x16TB disks.

I started adding nodes and decided to just join them to the cluster first and add/create the OSDs later, once everything was up and healthy, so during the install I didn't mark any disks as OSD/cache/journal. When all 5 new nodes were installed and joined, I went to the dashboard and saw them listed there, but they are all marked as down and no actions are available for them (like the disk list etc.).

At the same time I got a warning about some PGs not being deep scrubbed, health went to WARN, and I saw a lot of PG rebalancing going on. There was a warning about setting norebalance when I installed the new nodes, but I (foolishly) thought that since I wasn't marking any disks as OSDs, nothing would happen. After this rebalancing had been going on for a couple of days I started digging and found out that we now have 60 OSDs instead of 30, and in the OSD tree I can see all the new nodes full of OSDs.

For starters, I still can't manage the new nodes in the GUI (I can see them, but they are marked as Down). I also didn't want the new disks added automatically, as we wanted to make some changes and the big disks would help us move data off the current ones so we could recreate the old pools anew. It seems the new nodes got automatically configured the same way as the old ones, the OSDs were activated, and rebalancing became very active. I have turned on norebalance now, but I think a lot of data has already been transferred to the new OSDs.
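
For reference, this is roughly what I ran to check and to pause things (the flags match what ceph -s shows below):

ceph osd tree          # the new hosts already show up fully populated with OSDs
ceph osd set noout
ceph osd set norebalance
ceph osd set norecover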

The cluster looks like this at the moment:

# ceph -s
  cluster:
    id:     2e7a0a56-89a1-481d-b78b-7ed5a44f1881
    health: HEALTH_WARN
            noout,norebalance,norecover flag(s) set
            703 pgs not deep-scrubbed in time
            971 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum CEPH03,CEPH01,CEPH02 (age 10w)
    mgr: CEPH01(active, since 10w), standbys: CEPH02, CEPH03
    mds: 2/2 daemons up, 1 standby
    osd: 60 osds: 60 up (since 5d), 60 in (since 5d); 480 remapped pgs
         flags noout,norebalance,norecover

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 2113 pgs
    objects: 33.95M objects, 41 TiB
    usage:   127 TiB used, 313 TiB / 440 TiB avail
    pgs:     2193991/101839914 objects misplaced (2.154%)
             1633 active+clean
             479  active+remapped+backfill_wait
             1    active+remapped+backfilling

  io:
    client:   0 B/s rd, 6.6 KiB/s wr, 0 op/s rd, 1 op/s wr

  progress:
    Global Recovery Event (10d)
      [=====================.......] (remaining: 3d)

How can I get the cluster to use just the first 5 nodes (30 OSDs) and leave the new stuff clean for the time being?

I suppose the 5 new nodes/OSDs should show as active/Up in the node list once the cluster gets back to OK health, or is there already a problem?

Luckily people's work is largely uninterrupted; some just noticed the free space/capacity was increasing.

Thanks, any help appreciated..

Cheers, Jure

You can try to set the CRUSH weight of the new OSDs to 0, via the UI or via the CLI. Then, when things are stable, you can increase the weight gradually.
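
On the CLI, a minimal sketch would be something like the following (OSD IDs 30-59 are an assumption based on the cluster having had 30 OSDs before; check ceph osd tree for the actual IDs of the new ones):

# set the CRUSH weight of each new OSD to 0 so PGs map back to the old nodes only
for id in $(seq 30 59); do
    ceph osd crush reweight osd.$id 0
done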

It is very strange that the OSDs were added automatically. Were these old OSDs that had been used before, or were they new drives?

Brand new boxes & drives. I did expect I'd have a say in configuring them and was surprised to see them all active. The extra bigger drives do not seem to have been activated.

If I set the CRUSH weight to 0, do I have to re-enable rebalancing so it will move the PGs off them?

Yes, you should unset the norebalance and noout flags so things go back to normal.
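
On the CLI that would be (norecover is also set according to your ceph -s output above):

ceph osd unset norebalance
ceph osd unset norecover
ceph osd unset noout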

You can set the backfill speed to very slow, then slowly increase it. Watch the charts for % disk utilization as well as CPU and network, and make sure they are not stressed before increasing the speed.
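
If you prefer the CLI, a rough sketch of throttling it (option names assume a Nautilus-or-later release; on older releases the same values can be applied with ceph tell osd.* injectargs):

# keep backfill/recovery at a crawl first, then raise the values step by step
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep_hdd 0.5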

We will try to reproduce your issue, but I doubt we will. If you have a virtual test environment, can you try to reproduce it?

Unfortunately this is a physical system. I am happy to assist you with the investigation if that is feasible (as long as we don't crush the cluster ;-)..

What was expected, and should be the end result, is the old functioning cluster with active OSDs and pools (5 boxes) plus 5 similar vanilla boxes that shouldn't have any data on them. I can send you some logs from the install if that is any help.

Thanks..

Jure

Btw, I found this in the Ceph docs:

ADJUSTING OSD WEIGHT

Note

Under normal conditions, OSDs automatically add themselves to the CRUSH map with the correct weight when they are created. The command in this section is rarely needed.

But I'd expect it to be applicable *if* OSDs are created first..

We did tests and could not reproduce this.

Would it be possible for you to share the following, taken from any of the nodes with the issue (1 node is enough)? Please share them via a shared storage link.

Contents of the log files:
/opt/petasan/log/PetaSAN.log
/opt/petasan/log/ceph-volume.log

Output of the following commands:

ceph-volume lvm list
consul members
consul kv get -recurse PetaSAN/Nodes

Hi, the logs are here - https://fl.forensis.si/logs.tgz

It seems consul is not running on the new nodes; how can I start it?

I've managed to remap all the data off the new nodes and finish all the scrubbing that had got stuck in the meantime. Thanks.
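
For anyone else hitting the same warnings, the PGs that fell behind can be listed and nudged manually with something like this (the pg ID is just an example):

ceph health detail | grep 'not deep-scrubbed'
ceph pg deep-scrub 1.2f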

Thanks for the logs. Unfortunately the PetaSAN.log is truncated, probably due to log file rotation. If you can find the first log file covering when the node was added, it would help a lot; try the other nodes and see if the initial log file still exists.

You can start consul manually using the script

/opt/petasan/scripts/consul_client_start_up.py

It should have started automatically, so there may be another problem preventing it from starting. /var/log/syslog may have info in this case.
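
After running it, you can verify from the node (consul members is the same command as in the list above):

consul members
grep -i consul /var/log/syslog | tail -n 50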

Hi,

I've included the rotated log files as well. Consul started up without any issues, so the commands went through fine. The link is the same.
