Stopping One Node Causes All iSCSI Targets to Go Down
Jim.Graczyk
7 Posts
March 23, 2021, 4:32 am
I have a 4-node PetaSAN cluster with 66 OSDs, all with SSD journals for each disk and no cache SSDs. Disks vary in size and count per node, but each node has roughly the same total storage. For example, one node originally had 16 2TB drives, while two others had 8 4TB drives and the fourth had 16 3TB drives. Each node has 64GB RAM and 12-16 Xeon cores with clock speeds from 2.2 GHz to 2.6 GHz. I have two pools, one using the default rbd replicated method and the other an ec-31 (3 data, 1 parity, size=4 and min_size=3). The replicated pool was assigned 2048 PGs and the ec-31 pool 1024 PGs; doing the math, this works out to roughly 155 PGs per OSD. I have three 100TB iSCSI targets using the replicated pool (3 copies of the data, 2 copies required for iSCSI to stay up) and one 100TB iSCSI target using the ec-31 pool (3 data, 1 parity, requiring 3 nodes to be up). Each iSCSI target is assigned 4 paths/IPs. All network traffic traverses a single switch with unique VLANs for frontside, backend, iSCSI-1 and iSCSI-2.
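For reference, a quick back-of-the-envelope check of that PG-per-OSD figure, using only the pool sizes and OSD count described above:

```python
# Rough check of the ~155 PGs-per-OSD figure quoted above.
# Each PG is stored on "size" OSDs, so total PG instances = sum(pg_num * size).
replicated_instances = 2048 * 3   # replicated pool: 2048 PGs, size 3
ec_instances = 1024 * 4           # ec-31 pool: 1024 PGs, k + m = 4 chunks
osd_count = 66

pgs_per_osd = (replicated_instances + ec_instances) / osd_count
print(f"~{pgs_per_osd:.0f} PG instances per OSD")  # prints ~155
```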
The SAN appears to operate well enough with all nodes running. When I recently added three 8TB drives and one SSD to each node, I saw total disk I/O of over 1.5 GB/s at times (reading and writing combined). I believe this traffic was caused by the balancer, since I watched the usage of each old disk in each node go down as the usage of the new disks came up from zero. The iSCSI disks' performance isn't routinely great, but it is acceptable, with 100-200 MB/s of read/write I/O seen at the Hyper-V server that uses the iSCSI storage for CVFS volumes.
The problem is that every time I take down a single node (e.g., to add disks), the SAN appears to run fine for several minutes, but before long all iSCSI targets are gone (nothing displays on the iSCSI page), all pools go inactive, the number of clean PGs is cut in half, many hundreds of PGs show Undersized, and roughly the same number show Degraded. Before the node is shut down, the SAN is at peace, so to speak, with Active and Clean PGs both at 3072 and nothing else going on but 10 MB/s of VM disk usage and some scrubbing or deep scrubbing. Prior to shutting down a SAN node, I've tried several combinations of Maintenance settings, and even paused the SAN for one node-shutdown attempt, to no avail.
In a bug report post, the admin stated that any PGs going inactive will cause iSCSI to go down. On my SAN, shutting down any single node appears to cause some small number of PGs to go inactive.
- Is it true that any inactive PGs kill the iSCSI Targets?
- Given that I have 4 nodes and my required pool sizes are only 2 of 3 for replicated and 3 of 4 for EC, how can losing one node cause PGs to go inactive?
- How does one look for this in logs? What logs, where?
- I used the existing Crush Rule "Replaced" for most data stored on the SAN and created a new rule for the ec-31 method, using ec-by-host-hdd to create it. Do any of these rules require more than 3 of 4 nodes?
- Any idea how to prevent PGs going inactive?
Any suggestions or assistance would be appreciated.
Jim Graczyk
admin
2,930 Posts
March 23, 2021, 7:51 am
This is not normal. Can you double-check in the Maintenance menu that the Backfill speed is not accidentally set too high, or else try setting it to "Slow"? Also, if you have made any other manual Ceph configuration changes, please recheck them. I also presume you are using a 10 Gbps network.
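For anyone who wants to see what the current throttles actually are, the backfill speed preset presumably maps onto the standard Ceph recovery/backfill options. A minimal sketch, assuming a Ceph release with the central config database (Nautilus or later) and the ceph CLI on the path:

```python
import json
import subprocess

def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its stdout as text."""
    return subprocess.run(["ceph", *args], capture_output=True,
                          text=True, check=True).stdout.strip()

# Recovery/backfill throttles that "speed" presets typically adjust
# (assumption: these standard Ceph options are what the preset maps to).
for opt in ("osd_max_backfills",
            "osd_recovery_max_active",
            "osd_recovery_sleep_hdd",
            "osd_recovery_sleep_ssd"):
    print(opt, "=", ceph("config", "get", "osd", opt))

# Snapshot of PG states while a node is down (active, peering, undersized, ...).
status = json.loads(ceph("status", "--format", "json"))
for entry in status["pgmap"].get("pgs_by_state", []):
    print(entry["count"], entry["state_name"])
```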
Last edited on March 23, 2021, 7:58 am by admin · #2
Jim.Graczyk
7 Posts
March 23, 2021, 3:28 pm
I did have Backfill set to Fast. I changed it from the default of Medium because it was taking days to recover from taking a single node down. I can see that this could load up the nodes during recovery, but all SAN node shutdowns were done when the SAN was quiescent, and many times when Backfill was off in the Maintenance menu. So if a node is shut down while Backfill, Recovery, Rebalance, Mark Down and Mark Out are all off in the Maintenance menu, is it possible that the Backfill rate setting could still matter?
I've also shut down nodes with everything but Fencing on in the Maintenance menu and seen the same result.
I cannot risk shutting the SAN down at this time because a very long backup is running, so I can't test a single-node shutdown now. Should I attempt a single-node shutdown now that the Backfill setting is set to Slow? Is there anything else I should do that could help the iSCSI disks stay active while a single node is stopped?
Also - Some of my questions remain unanswered.
Does a single inactive PG take all pools and all iSCSI disks offline?
Why would the loss of a node cause active PGs to drop, even momentarily?
When I created the ec-31 EC profile and pool, I used the ec-by-host-hdd template. Does this mean what I thought it meant - that EC placement is split by node, such that each node stores only one of the 4 pieces in the 3-data, 1-parity scheme?
Thanks for any help you can give.
Jim
admin
2,930 Posts
March 23, 2021, 8:02 pm
The ec-by-host template is correct and will place each EC chunk on a different node. However, 3+1 uses only 1 parity chunk, so the loss of just 1 node leaves the pool without any redundancy. When this happens (no redundancy), it is not recommended for the pool to continue doing I/O, so the pool will be inactive until it re-creates the lost chunks. If you want to allow the pool to function without any redundancy, lower the min_size from 4 to 3, but this could cause data loss or inconsistency if a further error occurs. Ceph recommends that min_size be k+1. For a 4-node system, an EC 2+2 profile is a better choice.
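If you do decide to trade redundancy for availability on that pool, checking and lowering min_size can be done with the stock Ceph CLI. A hedged sketch (the pool name ec-31 is a placeholder for whatever your EC pool is actually called):

```python
import subprocess

POOL = "ec-31"  # assumption: substitute the actual name of your EC pool

def ceph(*args: str) -> str:
    return subprocess.run(["ceph", *args], capture_output=True,
                          text=True, check=True).stdout.strip()

# Current redundancy settings for the pool.
print(ceph("osd", "pool", "get", POOL, "size"))      # e.g. "size: 4"
print(ceph("osd", "pool", "get", POOL, "min_size"))  # e.g. "min_size: 4"

# Allow I/O with only the k = 3 data chunks available (no redundancy left
# while a node is down). Ceph's recommendation remains min_size = k + 1.
print(ceph("osd", "pool", "set", POOL, "min_size", "3"))
```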
Depending on the hardware, the Fast recovery speed could be too stressful. Look at the %util for your disks - were they near 100% when recovery started? If so, you should lower the setting. If you can run tests, set it to Slow, do a node failover, and see if you still have issues.
iSCSI stores its data in Ceph: if a pool has 1 PG inactive, that PG will not respond to I/O, and the client will time out and fail the I/O. If you have 1024 PGs and only 1 is inactive, any I/O to that PG will fail, so you can consider the entire pool as down. Inactive PGs are one of the nastiest Ceph errors. Recovery/backfill traffic should not cause your cluster to go down, but if it is set too high and your disks saturate, your OSDs will not communicate with one another correctly, heartbeats will fail, OSDs can flap up/down, and it may take a while for a PG to establish the consistency of its data.
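To see which PGs go inactive when a node drops (and answer the earlier question of where to look), something along these lines works with the standard Ceph commands; a minimal sketch:

```python
import subprocess

def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its raw text output."""
    return subprocess.run(["ceph", *args], capture_output=True,
                          text=True, check=True).stdout

# Cluster health detail lists any "pgs inactive" warnings with the PG IDs involved.
print(ceph("health", "detail"))

# PGs stuck in a non-active state (peering, incomplete, unknown, ...)
# together with the OSDs they map to.
print(ceph("pg", "dump_stuck", "inactive"))
```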
You should try to benchmark your cluster's performance and how it handles the different client, recovery and scrub loads, ideally before production.
Last edited on March 23, 2021, 8:03 pm by admin · #4
Jim.Graczyk
7 Posts
March 23, 2021, 9:22 pm
Thanks for the information.
The EC-31 profile is only for data that doesn't change much over time. I understand the risk of running without parity. The data set is large (will grow to 20TB), it's only read from and not written to and is backed up, so it can be restored. It's just time consuming getting the data back into an accessible form (out of backups and into a running VM). The backup solution also offers the ability to mount the VHDX files directly from the backup repository to allow for read access.
Are you telling me that the ec-31 pool I made, which was created with K=3, M=1, a size of 4 and a min_size of 3, will go offline when a single server is shut down?
I'm assuming it will not, until you tell me otherwise.
I'm not talking about recommendations. I will reconsider EC 2+2; I think I may have the necessary space now, but didn't when I created the SAN, pools and iSCSI disks and moved 11 TB into it. I am asking about function: I expect the ec-31 iSCSI disk to remain started when one PetaSAN node is down - again, until you tell me otherwise.
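For what it's worth, the space difference between the two profiles is just the (k+m)/k overhead ratio; a quick illustrative sketch:

```python
# Raw capacity consumed per usable TB for the two EC profiles under discussion.
def raw_per_usable(k: int, m: int) -> float:
    return (k + m) / k

for k, m in ((3, 1), (2, 2)):
    print(f"EC {k}+{m}: {raw_per_usable(k, m):.2f}x raw per usable TB, "
          f"survives {m} host failure(s)")
# EC 3+1 -> 1.33x and 1 failure; EC 2+2 -> 2.00x and 2 failures.
```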
Also, when one PetaSAN node is shut down, the clients are not the ones timing out the iSCSI disks: the iSCSI disks disappear and both pools go inactive, and only then do the clients time out the iSCSI connection. Does a single PG going inactive CAUSE all pools to go inactive?
It seems reasonable that the pool going inactive would cause any iSCSI disk using that pool to go offline.
Do the maintenance switches that turn off Backfill, Recovery, and Rebalance actually prevent backfill, recovery, and rebalance from occurring? I'm assuming they do. So, since the SAN is in maintenance mode, Backfill, Recovery, and Rebalance can't be responsible for the high load that supposedly precipitates the loss of the iSCSI disks.
You said that when a single PG goes down, I can consider the entire pool as being down. I'm sorry for being so specific, but I'm asking whether PetaSAN/Ceph actually marks the pool down or not. Again, assuming the iSCSI disk's pool has a min_size of 3, this means I can lose 1 of the 4 nodes and the iSCSI disk will remain up, albeit with no protection.
When the SAN is in a quiescent state, I'm seeing less than 30 MB/s of traffic at most, and usually 1-5 MB/s, on the backend network and in total and per-disk I/O, with mostly low disk latencies and an occasional spike on a per-disk basis. In general, the SAN nodes run at 25% CPU load or less and 50% memory use or less. When the SAN is recovering from a single-node shutdown that ends up dropping all iSCSI disks and sending all pools inactive, I see 200-500 MB/s total disk I/O per PetaSAN node during recovery, with no increase in memory use and a slight uptick in CPU usage. When I added disks, rebalancing pushed the SAN to 1.5 GB/s moving data from the existing 8 or 16 drives to the three 8TB drives that were added, still with no increase in RAM use and the same slight uptick in CPU.
Based on these observations, I believe the PetaSAN cluster is not busy at quiescence. After a single node is shut down - and all pools and iSCSI disks are lost - the recovery, rebalance and backfill actions produce all the stress I see on the system, but that happens after the node goes down. Since the load is not pre-existing when the single-node shutdown is performed, I don't understand how it can be the reason the iSCSI disks go offline and the pools go inactive.
I benchmarked the SAN before using it, but I didn't pull disks and reinsert them while doing so to examine rebalancing, backfill and recovery loads.
In all my time with PetaSAN, I've seen nothing to make me believe my systems are overloaded; all performance data I can see indicates that the cluster is very underloaded. Since I've shut down nodes under a wide variety of conditions, with a wide variety of Maintenance switches, and under low and medium I/O loads, and the SAN has behaved exactly the same every time - a few minutes after a single node is shut down, the iSCSI disks disappear and the pools go inactive - I am disinclined to believe it's related to load. If it were a load issue, I'd have had at least one shutdown without losing the disks.
As soon as I can, I'll shut down all VMs, put the SAN into maintenance mode AND pause the SAN before shutting down a single node. If the SAN doesn't behave as it has every time so far - the pools don't go offline and the iSCSI disks don't disappear - I'll believe it's load. Otherwise, I guess I'll be looking elsewhere.
Thanks for all your help.
Jim Graczyk
admin
2,930 Posts
March 23, 2021, 10:11 pm
Have you looked at %disk util as per the previous post? If it is high, you need to lower the backfill speed. As recommended, set it to Slow and run a test.
If you set min_size to 3 on an EC 3+1 pool, the pool will remain active even if 1 node fails. As stated, the default/recommendation is to have min_size = k+1, so it should be set to 4 for the reasons in the previous post.
As stated, if you have a pool with 1024 PGs and 1 PG goes inactive, your iSCSI clients can talk to data in the other 1023 with no problem, but once they try to access data on the inactive PG, the I/O will time out and return an error. If your client is doing 5k IOPS, the time to hit the unlucky PG is on average 0.2 seconds, so for all practical purposes the entire pool can be considered down.
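That 0.2 second figure is simply the expected time for client I/O to first touch the bad PG, assuming requests are spread evenly across the pool's PGs; a quick sketch of the arithmetic:

```python
# Expected time for client I/O to land on the one inactive PG, assuming
# requests are spread roughly uniformly across the pool's PGs.
total_pgs = 1024
inactive_pgs = 1
client_iops = 5000

p_hit = inactive_pgs / total_pgs                 # ~0.001 chance per request
expected_seconds = 1 / (client_iops * p_hit)
print(f"~{expected_seconds:.2f} s until an I/O hits the inactive PG")  # ~0.20 s
```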
Jim.Graczyk
7 Posts
March 24, 2021, 5:02 am
Thanks, that's much clearer. I will test with Backfill set to Slow as soon as I can. I reviewed the last several weeks for inactive PGs, and they were indeed happening right after I shut down a single node. I'll set up monitoring to try to catch which happens first, inactive PGs or the iSCSI disks stopping. I'll post here as soon as I have any results.
Jim Graczyk