Replacing Journals Causes Inactive PGs

Hello,

I have a 4-node PetaSAN 2.7.3 cluster running on Dell R730XDs with 64 GB RAM and 2x E5-2680v3 CPUs. Each node has 12 HDDs, with 2 NVMes and one SSD as journals on each server. All servers have the same number and size of disks, the same NVMes, and the same SSDs. The same 4 HDDs are assigned to the same journal device on each server node. I have Dell's onboard Intel X520 10 Gb NICs.

Due to long run times, I decided to replace the SSD and one NVMe on each of the servers (the other NVMe was only recently added). To do this, I shut down a node, removed the NVMe and SSD, replaced them with 2 new NVMes, rebooted, and found that 8 of the 12 OSDs on the node were down, as expected.

I have only the RBD pool configured, with 1024 PGs and a single 100 TB iSCSI disk using the default 3/2 replication (3 copies, minimum 2).

After adding the new NVMes to PetaSAN as journal disks, the iSCSI disk was unaffected up to this point. PG Status showed 1024 active PGs, and the disk was still attached to the Hyper-V Failover Cluster.

However, when I added my first OSD back and assigned it to the first new NVMe for journaling, my pool went inactive, my iSCSI disk disappeared, and PG Status no longer showed all 1024 PGs active.

"ceph health detail" showed some PGs stuck inactive and many PGs stuck degraded. The degraded ones were being unstuck by the backfill process, which took a long time, but so be it. The pool and iSCSI disk were down because of the stuck inactive PGs.

Since my Hyper-V Cluster was completely down, I elected to stop/start one OSD process associated with each PG that was stuck inactive and this resulted in the restoration of the pool and the iSCSI disk.
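
For the record, the commands involved were roughly along these lines (the OSD id shown is only a placeholder, not one of my actual OSDs):

  # list the PGs that are stuck inactive, along with their up/acting OSD sets
  ceph pg dump_stuck inactive

  # restart the primary OSD of a stuck PG (osd.7 is a placeholder id)
  systemctl restart ceph-osd@7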

The rebuild process took 8+ hours. I saw I/O rates of 800 MB/s on the upgraded PetaSAN node, and in the end I could find no reference to this problem except perhaps a Ceph bug report from 2018 where inactive PGs were not re-examined until a bad OSD was STARTED and STOPPED (the opposite of my circumstance - also, that bug was reported against Ceph 14.2.0 and is supposed to be resolved, given that we're using Ceph 14.2.11).

I'm asking for help only because I have 4 servers to upgrade similarly.  I've completed the 2nd of 4 and it did exactly the same thing - Active PGs dropped and the pool and iSCSI disk went down - and so did my Hyper-V cluster.  Since I expected it this time, I had my VMs all off and the Hyper-V cluster paused.

BUT - an additional "not-supposed-to-happen" thing happened.

While the PetaSAN cluster was recovering, and before I cycled the OSD processes for the stuck inactive PGs, the iSCSI disk became accessible to the Hyper-V cluster over iSCSI - even though the PetaSAN UI showed the pool as inactive and no iSCSI disk appeared on the iSCSI page.

Can you give me any idea what's happening?

Can you tell me how to replace journal disks without losing the iSCSI disk(s) and the pool (that is, without causing stuck inactive PGs)? I have RBD set to 3 copies, requiring 2 to be up, and I have 4 nodes. How could I possibly lose active PGs?

What am I misunderstanding here?

Any help you can give will be appreciated.

Jim Graczyk

You could use ceph-bluestore-tool to migrate the journal to the new device without recreating the OSD.
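
For example, something along these lines (the OSD id and device paths are placeholders; adjust them to your layout, and verify the block.db symlink afterwards):

  # stop the OSD whose journal/DB is being moved
  systemctl stop ceph-osd@12

  # migrate the BlueFS DB from the old journal partition to the new device
  ceph-bluestore-tool bluefs-bdev-migrate \
    --path /var/lib/ceph/osd/ceph-12 \
    --devs-source /var/lib/ceph/osd/ceph-12/block.db \
    --dev-target /dev/nvme0n1p1

  systemctl start ceph-osd@12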

As for what happened, it is hard to say exactly without investigating deeper, but one likely scenario is that the recovery load/speed is too fast for the hardware you have. When you recreate 1 node out of 4, you are re-copying 25% of your stored data to the new node, and this could be too much load for the existing hardware. You can check whether your disks were stressed from the disk % util chart. In such cases it is better to reduce the recovery speed from the Maintenance menu, then increase it in steps if you find your client workload is not affected and disk % util is not too high. Another method is to reduce the OSD crush weight from the Maintenance menu and gradually increase it.
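
For reference, the Maintenance speed presets roughly correspond to Ceph recovery throttles you can also set by hand; the values below are just a conservative illustration:

  # slow down backfill/recovery on all OSDs at runtime
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-sleep-hdd 0.5'

  # or bring a re-added OSD in at reduced crush weight and raise it gradually
  # (osd.12 and 0.5 are placeholder values)
  ceph osd crush reweight osd.12 0.5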

BUT - an additional "not-supposed-to-happen" thing happened.

While the PetaSAN cluster was recovering, and before I cycled the OSD processes for the stuck inactive PGs, the iSCSI disk became accessible to the Hyper-V cluster over iSCSI - even though the PetaSAN UI showed the pool as inactive and no iSCSI disk appeared on the iSCSI page.

Can you give me any idea what's happening?

Again, I cannot say for sure without looking in more detail, but as a guess: PetaSAN does indicate a pool as inactive if it has some inactive PGs, whereas Ceph has no concept of an inactive pool, only of inactive PGs. If iSCSI i/o maps to active PGs, the OSDs will respond; if i/o maps to inactive PGs, they will not. The iSCSI Disk List page, which lists the available disks, reads from specific PGs, so it could either work or not depending on which PGs are inactive.
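
If you want to see this mapping yourself, you can ask Ceph which PG (and which OSDs) a given object name lands on, and list the PGs that are not clean (the object name below is just an example):

  # compute the PG and acting OSDs for an arbitrary object name in the rbd pool
  ceph osd map rbd some-object-name

  # list PGs in the pool that are not active+clean
  ceph pg ls-by-pool rbd | grep -v 'active+clean'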

Thanks for your reply. It's good to hear from an informed admin, and thanks for the ceph-bluestore-tool tip. I'll see what I can find going that route. This might help my migration, but, as we all know, all disks fail, and this approach won't apply when a journal disk fails during normal operations.

As for the lost PGs, I don't know what happened either, but it seems very repeatable (I'm 2 for 2). The SAN's storage is only about 20% utilized, so while 25% of all used space was re-copied, that's only about 5 TB of I/O. During both recopies, the disks were not stressed. On the second node's journal replacement, I only added back a single OSD and the pool and volume disappeared. For both node journal replacements, Backfill was initially set to Medium and I was seeing all of 25 MB/s of total disk I/O on each node at the disk level. The SAN was not in Maintenance mode for either node shutdown, since one never knows how long it'll take to swap hardware and get a node back online. After it was clear I'd lost the pool and the iSCSI disk each time, I upped Backfill to Fast, and the hardware delivered 200-500 MB/s total I/O (reads and writes) on the 3 nodes that were not altered (more reads than writes). Then, as I added back OSDs, the storage subsystem on the node with the "new" OSDs delivered about 800 MB/s (more writes than reads) with no problems and with a drastic uptick in clean PGs per unit time. Even during the rebuild of each node, when I/O on that node was highest, no OSD or journal showed more than 75% disk utilization, and when OSDs were seen at 75% utilization, it wasn't constant or prolonged.

I do appreciate knowing that Ceph only knows about inactive PGs, not pools or iSCSI disks. Unfortunately, this implies to me that PetaSAN has some serious design flaws. It seems that if PetaSAN can detect inactive PGs associated with a pool and/or iSCSI disk, and is capable of listing the pool as inactive and removing any associated iSCSI disk from the admin UI, it should also stop servicing requests to that iSCSI disk until the pool is up and all inactive PGs are active. If PetaSAN allows I/O (writes, in particular) to PGs that are available, then once the system using the iSCSI disk hits a PG that's unavailable, volume/disk inconsistencies will be "manufactured" and could continue, ad nauseam.

Also, for the 2nd node's journal replacement, the iSCSI disk initially went unavailable, but before all stuck inactive PGs were unstuck, the iSCSI disk became available again. Perhaps, again, my ignorance is showing, but the last thing I want is to take my chances that a database or email server won't hit a stuck inactive PG before the lengthy rebuild completes. It seems more reasonable to simply stop servicing all iSCSI disks using a pool that's inactive. That might cause problems for the ultimate user of the iSCSI disk, but it would be far more like an unexpected shutdown/server crash, and less a scenario that prolongs the opportunity for corruption of an iSCSI disk.

In the end, I was aware that PetaSAN provides the code for the iSCSI cluster capabilities, but I was unaware that PetaSAN didn't shut down iSCSI, and keep it down, as soon as the pool supporting the disk became inactive and until it became active again. We will have to automate our hosts' shutdown, or even power-off, when we detect that Ceph has any inactive PGs. The problem will be knowing which iSCSI disks are in which pool.
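
The detection half seems easy enough; something like this crude sketch is what I have in mind (the action to take when it fires is still to be worked out):

  #!/bin/bash
  # crude check: "ceph health detail" mentions inactive PGs (PG_AVAILABILITY warning)
  if ceph health detail | grep -q 'inactive'; then
      echo "inactive PGs detected - pause the Hyper-V cluster / shut down hosts here"
  fi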

Is there any way to show the PetaSAN Pool that supports any given iSCSI disk from the command line of a PetaSAN node?

If any of my conclusions or suppositions are incorrect, please tell me.

I'm trying to divine simple lifecycle procedures for a PetaSAN cluster.  If there is any documentation that you know of that addresses this, please let me know.

Thanks,

Jim Graczyk

The issue is not related to iSCSI but to the Ceph layer. Again, without looking deeper, my guess is the recovery speed setting may be too high for the load this generates, which is also hardware dependent. When you try the third node, I suggest you set the recovery speed to "slow" and see if you still get stuck PGs, then gradually increase it if all is OK. You can also set custom speed values instead of the preset speeds.

I do appreciate knowing that Ceph only knows about inactive PGs, not pools or iSCSI disks. Unfortunately, this implies to me that PetaSAN has some serious design flaws. It seems that if PetaSAN can detect inactive PGs associated with a pool and/or iSCSI disk, and is capable of listing the pool as inactive and removing any associated iSCSI disk from the admin UI, it should also stop servicing requests to that iSCSI disk until the pool is up and all inactive PGs are active. If PetaSAN allows I/O (writes, in particular) to PGs that are available, then once the system using the iSCSI disk hits a PG that's unavailable, volume/disk inconsistencies will be "manufactured" and could continue, ad nauseam.

I do not agree there. First, your problems were due to stuck PGs at the OSD layer, not because of this. But to answer the point theoretically: all storage systems do this - PetaSAN, Ceph, even a raw hard drive will serve i/o from sectors it can read and could get stuck or time out on other sectors with problems. It does not check the drive's health on each i/o and stop i/o globally; besides the performance hit, that would not make life better for the clients. You could argue the reverse is better, but in either case client i/o will get errors which their storage stack will handle.

Thinking more about the second issue of why Hyper-V i/o resumed while the PetaSAN UI did not show the disks: it was probably not from the above but simply from the timeouts we use in the UI. If the UI cannot read the data within 10 seconds, it will not show the iSCSI disks in that pool and the pool will be deemed inactive; this is so the user is not left waiting forever. Hyper-V and the applications it runs probably have higher i/o timeouts.

Again the main issue is to avoid getting stuck PGs at the OSD level.

When I get up the nerve to upgrade my 3rd SAN node's NVMe & SSD, I will set my backfill parameter to VERY SLOW before beginning the upgrade.

If the stuck inactive PGs recur, can we agree it's not a hardware performance issue?  If it doesn't recur, I'll concede that it is...

If you're right, I just wonder how I ever managed to get 400+ MB/s writes, as I saw on the last 2 upgrades after the iSCSI disk and pools were lost and I upped Backfill from MEDIUM to FAST...

I'll update this thread as soon as I have more information.

Thanks,

Jim  Graczyk

Just an update - I performed the journal replacements on the 3rd of 4 nodes. In the process I also replaced the Dell M330 HBA in this server (Dell R730XD) with a PERC H730P. I was able to retain the OSDs not affected by the journal replacements (one journal was left unchanged) by putting the PERC in RAID mode and setting the HDDs as Non-RAID disks.

As I said above, I set my SAN cluster's Backfill value to its lowest setting (Very Slow), and when I added my first OSD back to the cluster, I didn't lose any active PGs. So it WAS a matter of how fast the system was rebuilding that caused the loss of PGs in the previous 2 attempts. Once I added back all OSDs affected by the journal replacements, I increased Backfill to Medium, as it had been. The rebuild is SLOW and will take more than 2 days, but the disk utilization on the changed server is ~100-120 MB/s writes (20-25% disk utilization) and only 30-40 MB/s reads from the other 3 nodes (about 40% disk utilization). The difference in % utilization could easily be due to the changed server having a caching disk controller, which the other 3 do not (yet).

So my takeaway here is that a low INITIAL Backfill speed is a good thing. It stops a lot of recovery work from occurring during the outage, work that would only be undone as soon as the node comes back online. The real question is the Backfill setting once all nodes are back. Medium Backfill leaves plenty of I/O for the SAN users, but it does leave my data out of redundancy for more than a day. Since Fast Backfill resulted in much shorter rebuild times (roughly 25% of what I'm seeing now), once I have PERC controllers in all nodes, this may be an option the SAN cluster can manage after all OSDs are alive and well.

Thanks,

Jim

Thanks for the update.

Curious - why didn't you use Ceph and the BlueStore tooling to take down your journals and then swap them live? This would be the same as a journal drive failure, and it would quickly resolve once the new drive is pinned as a journal drive.