PoC VM Shuts Down
seanp
9 Posts
August 5, 2021, 10:00 pm
Hello. I don't know that this is a bug, so I didn't want to put it in that forum. I have a proof-of-concept PetaSAN install going. I have 2 PS nodes on 1 ESX host and 1 more PS node on a second host. Each has 4x 64GB VMDKs: 1 for the OS and 3 for OSDs.
For the most part, the server I have running on the iSCSI export I made on that cluster is just fine (I haven't tested performance, but that is irrelevant in this particular scenario). However, for whatever reason, every day or two one of the PS nodes is powered down without my interaction.
I look at the dashboard and it says 3 OSDs are missing, and under Manage Nodes a node is shown as down. It tends to be either the second node (one of the two on 1 ESX host) or the third (on an ESX host by itself). I power it back on and after it syncs up (I assume?) all is well again.
Is there a good way to troubleshoot this issue, or is it intentional? I have 3x physical nodes I want to install PS on, but I do not want to have to log into IPMI to power one back on should it shut down. Could this have something to do with VMware specifically? Like an OOM (out of memory) killer or something? None of the other VMs (all Windows-based) ever have any issues with shutting down. I don't have any *nix-based VMs in this setup other than PS at the moment.
Thanks!
admin
2,930 Posts
August 5, 2021, 11:16 pm
Most likely it is the fencing action. You can turn off fencing from the Maintenance menu, but this is not recommended.
Fencing kills a node if it does not respond to cluster heartbeats in time (approx. 15 sec) so that its resources (IP / storage access) can be failed over to other nodes in a secure way.
Rather than switching fencing off, it is better to identify the root cause of why the nodes do not respond to heartbeats in time. It could be network connection issues, overload, or underpowered hardware; see our recommended hardware guide.
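To make the heartbeat timeout idea concrete, here is a minimal sketch of the decision described above. It is not PetaSAN's actual code: the `NodeState` class, the `nodes_to_fence` function, and the timestamps are hypothetical, and the 15-second constant is just the approximate figure mentioned in this post.

```python
from dataclasses import dataclass

# Approximate timeout taken from the post above; assumed here for illustration.
HEARTBEAT_TIMEOUT_SEC = 15.0


@dataclass
class NodeState:
    """Hypothetical record of when a node last answered a heartbeat."""
    name: str
    last_heartbeat: float  # timestamp in seconds


def nodes_to_fence(nodes, now, timeout=HEARTBEAT_TIMEOUT_SEC):
    """Return the names of nodes whose last heartbeat is older than `timeout`."""
    return [n.name for n in nodes if now - n.last_heartbeat > timeout]


if __name__ == "__main__":
    now = 1000.0
    cluster = [
        NodeState("node1", last_heartbeat=998.0),  # 2 s ago, healthy
        NodeState("node2", last_heartbeat=980.0),  # 20 s ago, would be fenced
        NodeState("node3", last_heartbeat=990.0),  # 10 s ago, healthy
    ]
    print(nodes_to_fence(cluster, now))  # ['node2']
```

A node that is merely slow (a saturated single 1Gb link, or a VM starved of CPU) looks the same to this check as a node that has crashed, which is why an underpowered setup can trigger fencing on an otherwise working node.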
seanp
9 Posts
August 5, 2021, 11:18 pm
Oh, that is fantastic. This is certainly not up to spec, as it's just one NIC and 1Gb at that, plus it is certainly underpowered. That is most likely what's going on. I'll turn that function off and see if it stays up for a few days. Thanks!