
Actions required for adding a journal

On an existing PS 2.0.0 cluster, what actions are required to add a journal? I have 4 nodes with 6 OSDs each. I just added an SSD to each node (depending on performance results, I'll add another to meet the 4:1 recommendation) and added the journal for each in the WebUI, but cluster benchmarks and writes over iSCSI remained the same.

Output of ceph-disk list shows the journal, but does not show it as used by any OSD:

root@bd-ceph-sd1:~# ceph-disk list
/dev/sda :
/dev/sda1 ceph data, active, cluster BD-Ceph-Cluster1, osd.18, block /dev/sda2
/dev/sda2 ceph block, for /dev/sda1
/dev/sdb :
/dev/sdb1 ceph data, active, cluster BD-Ceph-Cluster1, osd.19, block /dev/sdb2
/dev/sdb2 ceph block, for /dev/sdb1
/dev/sdc :
/dev/sdc1 ceph data, active, cluster BD-Ceph-Cluster1, osd.20, block /dev/sdc2
/dev/sdc2 ceph block, for /dev/sdc1
/dev/sdd :
/dev/sdd1 ceph data, active, cluster BD-Ceph-Cluster1, osd.21, block /dev/sdd2
/dev/sdd2 ceph block, for /dev/sdd1
/dev/sde :
/dev/sde1 ceph data, active, cluster BD-Ceph-Cluster1, osd.22, block /dev/sde2
/dev/sde2 ceph block, for /dev/sde1
/dev/sdf :
/dev/sdf1 ceph data, active, cluster BD-Ceph-Cluster1, osd.23, block /dev/sdf2
/dev/sdf2 ceph block, for /dev/sdf1
/dev/sdg :
/dev/sdg1 other, ext4, mounted on /boot
/dev/sdg2 other, ext4, mounted on /
/dev/sdg3 other, ext4, mounted on /var/lib/ceph
/dev/sdg4 other, ext4, mounted on /opt/petasan/config
/dev/sdh :
/dev/sdh1 ceph journal
root@bd-ceph-sd1:~#

In the WebUI Node Disk list, all OSDs are "Up". The journal has no status icon, and none of the OSDs show anything in the Journal column.

It's been marinating for about an hour with no change. Is there something else required?

Thanks!

The journals can be used when you add/create new OSDs, but will not be used for existing OSDs. See:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022495.html
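
For reference, creating a new BlueStore OSD with an external journal corresponds, under the hood, to a Luminous ceph-disk call along these lines (PetaSAN issues this for you from the UI; the devices shown are just examples):

# /dev/sda holds the data, /dev/sdh (the SSD) holds the external db/journal
ceph-disk prepare --bluestore --block.db /dev/sdh /dev/sda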

If you have existing OSDs, the only way is to remove them and re-add them one by one, waiting in between. To do this in PetaSAN:

First, stop one OSD by running the following on the node that contains it:

systemctl stop ceph-osd@OSD_ID
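
For example, to stop osd.18 from the listing above:

systemctl stop ceph-osd@18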

Once stopped, the cluster will report a warning state that some PGs are not clean (i.e. they are missing a replica). After a few minutes it will start the recovery process to recreate the missing replicas on other OSDs. You need to wait until all PGs are active+clean before proceeding. You can observe the progress via the PG Status cluster chart; it will give you an idea of when the recovery will complete.
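
If you prefer the command line over the chart, the same information is available from the standard Ceph CLI on any node:

ceph -s        # overall cluster health and recovery progress
ceph pg stat   # one-line PG summary; wait until all PGs are active+clean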

Once done, it is safe to delete the OSD from the PetaSAN UI. Once deleted, you can re-add it as a new OSD; if you have 1 or more journals, you will be able to choose to add the OSD using an external journal. When this is done, the cluster will start rebalancing PGs. Wait until this is complete via the PG Status chart; when everything is active+clean, repeat the process with another OSD.
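
A rough shell sketch of the wait step, assuming the standard Ceph CLI (this is not a built-in PetaSAN tool; HEALTH_OK implies all PGs are active+clean):

# poll cluster health once a minute; proceed only on HEALTH_OK
until ceph health | grep -q HEALTH_OK; do
    sleep 60
done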

If I add 6 OSDs to a journal that way now, can I later move two OSDs to another journal, or will I have to go through this whole process again?

With 24 OSDs and no production data, I'm leaning towards blowing away the cluster and re-installing from scratch.

It may be better to build with 4 OSDs per node, then later add the other 2 with the new SSD.

Alternatively, it is possible to start with 6 OSDs on 1 SSD and move the journal later; the following should help:

https://www.spinics.net/lists/ceph-users/msg43012.html

I would recommend you test this process first on dummy data before filling the cluster with real data.
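
If you need throwaway test data, rados bench from the standard Ceph tools is one way to generate it (the pool name here is just an example):

# write dummy objects into a pool for 60 seconds and keep them around
rados bench -p rbd 60 write --no-cleanup
# remove the benchmark objects when you are done testing
rados -p rbd cleanup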

Ceph has an open work item for this:

https://trello.com/c/9cxTgG50/324-bluestore-add-remove-resize-wal-db

OK, if I'm going to start over, when choosing the cluster size, should I choose how many disks I plan to have, or how many I'm starting with? I think I read elsewhere to choose for future growth?

It sounds like I'd be starting with only 16 OSDs and 4 journals, then growing to 48 very shortly, with 12 journals. I can fit 144 total, and I intend to use all of them eventually.

Thanks!

The range of disk count is used to choose the number of placement groups (PGs). An OSD disk works best when it has a certain number of PGs to handle; too many or too few are not good. It is OK to start with x disks and then grow to 4 or 5x, but going above 10x is too much for Ceph to handle without some drawbacks. 16 -> 144 is too wide a range; it could be handled (with drawbacks) as follows:

  • Choose an initial value of 15-50 disks when you create your cluster. When you reach around 100 disks, the number of PGs needs to be re-adjusted manually via CLI commands (see the sketch after this list); however, this will produce a huge data re-balance.
  • Choose an initial value of 50-200 disks. The drawback is that in the initial stages, when you have 16 disks, you will get a warning from Ceph that the number of PGs is too high, and in any case of recovery (a node is down or a disk is down) it puts a lot of stress on the existing OSDs, since each OSD will have to handle a lot of PG recovery itself; if your hardware is not fast, this will make the system very slow.
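
As an illustration of the numbers involved, a common rule of thumb is roughly 100 PGs per OSD divided by the replica count, rounded to a power of two (exact targets depend on your pools; the pool name below is just an example):

# 16 OSDs,  3 replicas:  16 * 100 / 3  =  ~533  -> ~512 PGs
# 144 OSDs, 3 replicas: 144 * 100 / 3  = ~4800  -> ~4096 PGs
# raising pg_num later on Luminous (this triggers a large rebalance):
ceph osd pool set rbd pg_num 4096
ceph osd pool set rbd pgp_num 4096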

Future versions of Ceph will try to handle PG numbers in a more flexible way.