
bluestore_block_db_size ignored


I have a cluster with 3 nodes. Each node was configured with 2 hard disks (8 TB) and 2 SSDs (800 GB). When building the cluster I first changed bluestore_block_db_size to 644245094400 (600 GB).

Then I created the OSDs (each HDD with one journal SSD). As we are storing S3 data from Veeam Backup & Replication, we have a lot of small objects (we use 4 MB backup blocks). Everything worked as expected.

Then I added 2 additional HDDs and 2 additional SSDs to each node (8 TB HDDs / 800 GB SSDs) and created the OSDs the same way.

Now I have BLUEFS_SPILLOVER warnings on all new OSDs, telling me the DB device uses 32 GB of 60 GB. Why is bluestore_block_db_size ignored when adding OSDs?
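
The exact warning text per OSD is visible in the cluster health output, for example:

ceph health detail    # lists each OSD reporting BLUEFS_SPILLOVER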

From the Ceph Configuration menu, what is the value of bluestore_block_db_size?

bluestore_block_db_size = 64424509440

This was changed before the first OSD was created, as I had this spillover problem in my test environment with the 60 GB DB.

bluestore_block_db_size was configured with 644245094400 when installing the cluster, and I have never changed this parameter since. Does a shutdown of the cluster reset this value, or something else?

Is there a way to check which size was used when creating the original OSDs?

Yes, it is a bug: the parameter is reset on reboot. It is best to make sure the parameter is set to your desired value before adding new OSDs, and to set it again if needed.
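
To verify before creating new OSDs, something like the following should show both the value in the config file and the value a running OSD currently sees (osd.0 and the config path are examples, assuming the default cluster name; the daemon command must run on the node hosting that OSD):

grep bluestore_block_db_size /etc/ceph/ceph.conf      # value written in the config file
ceph daemon osd.0 config get bluestore_block_db_size  # value the running daemon currently sees

Note this only reflects the current setting; the size actually used at creation time is what perf dump (below) reports.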

For the OSDs having the issue, you can either

  1. Use the script /opt/petasan/scripts/util/migrate_db.sh to move the journal partition to a larger one. I suggest you look at what the script does; it has comments on creating a larger partition. Test it on a test cluster first (a rough manual equivalent is sketched after this list).
  2. Or (easier): delete the OSD with the issue and re-add it. Do this one at a time, waiting for the cluster to rebalance before moving to the next. You can delete an OSD by first stopping it via systemctl stop, then deleting the stopped OSD from the UI.
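
I have not checked everything migrate_db.sh handles (for example updating device links or LVM tags), but on recent Ceph releases the rough manual equivalent of option 1 is something like the following, with the OSD stopped and the device paths as placeholders. Treat it as a sketch and test on a non-production cluster first:

systemctl stop ceph-osd@X
# move the BlueFS DB from the current block.db device to a new, larger partition
ceph-bluestore-tool bluefs-bdev-migrate \
    --path /var/lib/ceph/osd/ceph-X \
    --devs-source /var/lib/ceph/osd/ceph-X/block.db \
    --dev-target /dev/sdY2
systemctl start ceph-osd@X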

You can view the size the OSD reports using

ceph daemon osd.XX perf dump

look for:

db_total_bytes # size of db partition
db_used_bytes # used db size
slow_used_bytes # any spill over size
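
If jq is installed (it is used later in this thread), one way to pull just those three counters:

ceph daemon osd.XX perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'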

I am also surprised, if you are indeed saving 4 MB objects, are you sure? Large objects do not require a DB partition greater than 30 GB.

 

I have created new partitions of 740 GB on all 6 SSDs and changed the parameter bluestore_block_db_size to 644245094400.

Then I used the script migrate_db.sh to move the journals to the new partitions. This worked fine, and the new partitions are shown in the GUI as Journal: sdi2 / Journal: sdh2.

But all 6 OSDs show BlueFS spillover, so part of the metadata is still on the HDD. Is there a way to move this metadata to the SSD journal?

Can you double-check that the partitions you created manually with parted are the correct size?
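
For example (sdi / sdi2 are taken from your earlier post, adjust to your actual devices):

parted /dev/sdi unit GiB print      # partition table with sizes in GiB
blockdev --getsize64 /dev/sdi2      # exact size of the DB partition in bytes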

On a specific OSD (for example osd.X, replace X), what is the output of:
ceph daemon osd.X perf dump | jq '{bluefs:.bluefs}' | head -n 8

Try to force a RocksDB compaction manually:

systemctl stop ceph-osd@X
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-X compact
systemctl start ceph-osd@X
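
Depending on your Ceph release, an online compaction may also be possible without stopping the OSD; this command may not exist on older versions, so check first:

ceph tell osd.X help | grep -i compact   # confirm the command is available on your version
ceph tell osd.X compact                  # trigger an online RocksDB compaction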

What are the new values of:
ceph daemon osd.X perf dump | jq '{bluefs:.bluefs}' | head -n 8

{
  "bluefs": {
    "gift_bytes": 0,
    "reclaim_bytes": 0,
    "db_total_bytes": 644245086208,
    "db_used_bytes": 100856225792,
    "wal_total_bytes": 0,
    "wal_used_bytes": 0,

 

I think db_total_bytes shows the 740 GB partition I used. I will try the compaction when the storage is not doing archiving; at the moment Veeam is archiving the backups.

Sorry, it seems you are running a different version than mine (3.2). Can you perform the commands without the

| head -n 8

at the end? Please run them before and after compaction.
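
For example, capturing the full bluefs counters each time (spillover shows up as a non-zero slow_used_bytes):

ceph daemon osd.X perf dump | jq '.bluefs'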
