Maintenance Mode - Need to take down 1 of 6 nodes for fan replacement
RobertH
27 Posts
January 23, 2023, 1:54 pm
Ran into an issue this weekend. We have a PetaSAN cluster with 6 nodes (v3.0.1), and in the infinite wisdom of the powers that be, the cables ordered were not long enough to let the servers slide out of the rack and be worked on while running for internal hot-swap parts like a fan.
Had to shut down one of the 6 nodes (not a manager node) to replace a fan in it. Went into the maintenance menu in the PetaSAN console and toggled all the items into maintenance mode, then went into the iSCSI path assignment and moved all the iSCSI paths off the node. Being overly cautious, I also logged into the Hyper-V cluster that uses PetaSAN and shut down all but 2 critical VMs running on the pool; the two running VMs were still fine after the path changes and everything looked fine.
Used the console on the node with the problematic fan to shut down the system, and it did a graceful shutdown. I had an SSH session open to one of the manager nodes running "ceph -w", and it almost immediately started complaining about delayed writes. While I was waiting for the node to finish shutting down, I looked at Hyper-V and the storage had gone offline, reporting that it was not accessible.
Needless to say, I swapped that fan in as fast as humanly possible and powered the node back up. Once it came online and the OSDs started checking back in, the delay messages in the Ceph console went away and the storage in Hyper-V came back online. Fortunately I had powered off most of the VMs before doing this, otherwise I would likely have had a good deal of data corruption; the 2 critical VMs that I had left running both had kernel panics/blue screens from loss of disk I/O and had to be restarted after the storage came back online.
I'm guessing that not all of the toggles need to be switched into maintenance mode for something like this, where a node is being taken offline and is expected to be back online relatively quickly. I looked at the admin guide and it doesn't explain what any of the toggles really do, and the popup help on each of them doesn't do a good job of describing that either.
Is there a more detailed document on what the switches do, or just a quick-and-dirty list of which ones should be toggled for a maintenance event such as this?
admin
2,930 Posts
January 23, 2023, 2:54 pm
It is safest to just leave the cluster in normal mode without any maintenance switches. It will waste some data movement and bandwidth and will lower performance a bit, but in many cases it is fine. Just make sure your backfill speed is not set too high in the maintenance menu.
Else you could use the noout switch, so the OSDs will be marked down but their data will not be re-replicated elsewhere by CRUSH.
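For reference, a minimal sketch of the noout approach using the standard upstream Ceph CLI from a shell on one of the manager nodes (this assumes a Ceph release recent enough for "ceph config set"; the maintenance menu toggle is the equivalent in the PetaSAN UI, so treat this as an illustration rather than the supported procedure):

# before shutting the node down: do not mark its OSDs out,
# so CRUSH will not start backfilling their data to other OSDs
ceph osd set noout

# optional: keep recovery/backfill traffic low for when the node rejoins
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1

# ... power the node down, replace the fan, power it back up ...

# watch the OSDs rejoin and the PGs return to active+clean
ceph -s

# once the cluster is healthy again, clear the flag
ceph osd unset noout

Note that noout only stops rebalancing; unlike the cluster-wide pause flag in Ceph, it does not block client I/O.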