bluestore_block_db_size ignored
Ingo Hellmer
18 Posts
July 5, 2023, 7:42 am
I have a cluster with 3 nodes. Each node was configured with 2 hard disks (8TB) and 2 SSDs (800GB). When building the cluster I first changed bluestore_block_db_size to 64424509440 (600 GB).
Then I created the OSDs (each HDD with one journal SSD). As we are storing S3 data from Veeam Backup & Replication we have a lot of small objects (we use 4MB backup blocks). Everything works as expected.
Then I added 2 additional HDDs and 2 additional SSDs to each node (8TB HDDs / 800GB SSDs) and created the OSDs the same way.
Now I have BLUEFS_SPILLOVER on all new OSDs telling me the db device uses 32G of 60GB. Why is bluestore_block_db_size ignored when adding OSDs?
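For reference (simple byte arithmetic, not part of the original post), the two sizes that appear in this thread work out as follows; the first matches the "32G of 60GB" reported by the spillover warning:
64424509440 / 1024^3  = 60 GiB
644245094400 / 1024^3 = 600 GiB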
admin
2,918 Posts
July 5, 2023, 8:52 am
From the Ceph Configuration menu, what is the value of bluestore_block_db_size?
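As a side note (not from the original post), the value can also be checked from the CLI. This is a hedged sketch: the grep assumes the setting lives in /etc/ceph/ceph.conf, and osd.0 is just a placeholder. Keep in mind that bluestore_block_db_size is only applied when an OSD is created, so a running daemon may report a value different from the one in effect at creation time.
grep bluestore_block_db_size /etc/ceph/ceph.conf
ceph daemon osd.0 config get bluestore_block_db_size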
Ingo Hellmer
18 Posts
July 5, 2023, 8:58 am
bluestore_block_db_size = 64424509440
This was changed before the first OSD was created, as I had this spillover problem in my test environment with the 60GB db.
Ingo Hellmer
18 Posts
July 5, 2023, 10:58 am
The bluestore_block_db_size was configured with 644245094400 when installing the cluster, but I have never changed this parameter since. Does a shutdown of the cluster reset this value, or something else?
Ingo Hellmer
18 Posts
July 5, 2023, 11:00 am
Is there a way to check which size was used when creating the original OSDs?
admin
2,918 Posts
July 5, 2023, 12:27 pm
Yes, it's a bug: the parameter is reset on reboot. Make sure the parameter is set to your desired value before adding new OSDs, and set it again if needed.
For the OSDs having the issue, you can either:
- use the script /opt/petasan/scripts/util/migrate_db.sh to move the journal partition to a larger one. I suggest you look at what the script does; it has comments on creating a larger partition. Test it on a test cluster first.
- or (easier): delete the OSD with the issue and re-add it. Do this one OSD at a time, waiting for the cluster to rebalance before moving to the next. You can delete an OSD by first stopping it via systemctl, then deleting the stopped OSD using the UI, as sketched below.
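A minimal sketch of that stop-and-wait procedure (not from the original post; OSD id 7 is just an example, and the deletion itself still happens in the UI):
systemctl stop ceph-osd@7    # stop the daemon; it will show as "down" in the OSD tree
ceph osd tree                # confirm the OSD is down before deleting it from the UI
ceph -s                      # after deleting it, wait until recovery finishes and HEALTH_OK returns before moving to the next OSD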
You can view the sizes the OSD reports using:
ceph daemon osd.XX perf dump
Look for:
db_total_bytes   # size of db partition
db_used_bytes    # used db size
slow_used_bytes  # any spillover size
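A possible one-liner to pull out just those three counters (not from the original post; osd.XX is a placeholder, and jq, which is also used later in this thread, must be installed):
ceph daemon osd.XX perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'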
I am also surprised that you are indeed saving 4MB objects, are you sure? Large objects do not require a db partition greater than 30 GB.
Ingo Hellmer
18 Posts
July 7, 2023, 6:10 am
I have created new partitions with 740 GB on all 6 SSDs and changed the parameter bluestore_block_db_size to 644245094400.
Then I used the script migrate_db.sh to move the journals to the new partitions. This worked fine and the new partitions are shown in the GUI as Journal:sdi2 / Journal:sdh2.
But all 6 OSDs show BlueFS spillover, so part of the metadata is still on the HDDs. Is there a way to move this metadata to the SSD journal?
admin
2,918 Posts
July 7, 2023, 7:36 am
Can you double check that the partitions you created manually with parted are the correct size?
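For example (a hedged sketch, not from the original post; sdi and sdi2 are placeholders for the actual SSD and db partition):
parted /dev/sdi unit GiB print    # list the partitions on the SSD and their sizes
lsblk -b -o NAME,SIZE /dev/sdi2   # size of the db partition in bytes, to compare with db_total_bytes below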
On a specific OSD (for example osd.X, replace X), what is the value of:
ceph daemon osd.X perf dump | jq '{bluefs:.bluefs}' | head -n 8
Then try to force a RocksDB compaction manually:
systemctl stop ceph-osd@X
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-X compact
systemctl start ceph-osd@X
What are the new values of:
ceph daemon osd.X perf dump | jq '{bluefs:.bluefs}' | head -n 8
Ingo Hellmer
18 Posts
July 7, 2023, 2:28 pm
{
    "bluefs": {
        "gift_bytes": 0,
        "reclaim_bytes": 0,
        "db_total_bytes": 644245086208,
        "db_used_bytes": 100856225792,
        "wal_total_bytes": 0,
        "wal_used_bytes": 0,
I think the db_total_bytes shows the 740GB partition I used. I will try the compaction when the storage is not busy archiving; at the moment Veeam is archiving the backups.
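For reference (byte arithmetic only, not part of the original post), the reported db_total_bytes may be worth comparing against the partition size:
644245086208 / 1024^3 ≈ 600 GiB   (essentially the configured 644245094400)
740 GB ≈ 740000000000 / 1024^3 ≈ 689 GiB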
admin
2,918 Posts
July 7, 2023, 2:49 pm
Sorry, it seems you are running a different version than mine (3.2). Can you run the commands without the
| head -n 8
at the end? Please run them before and after compaction.