Help start troubleshooting with iSCSI / ESXi
minimos
4 Posts
January 11, 2024, 1:29 pm
Hello,
we have a 3-node PetaSAN 3.1.0 system. It provides storage to a 3-node ESXi system, with the storage mounted via iSCSI.
One recurring issue we have detected is that storage access from one specific ESXi node gets stuck, apparently exactly once a month (i.e. every 30 days).
The only way to quickly resolve the issue is to restart the ESXi server.
I am not asking for detailed troubleshooting, but I would like some suggestions on which direction to investigate in order to find the cause of this issue.
At least from the metrics shown in the PetaSAN dashboard, I can't spot any clear problem. There is an unusual peak in the commit time for some of the OSDs around the time when storage access is very slow, but I think it is more an effect of the outstanding access requests than the cause of them.
The periodicity of the issue also looks suspicious: does PetaSAN have some kind of monthly scrub/cleaning job that might interfere with disk access?
Thanks
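One way to pin down the periodicity is to check whether the incidents are a fixed 30 days apart or fall on the same calendar day each month, since the two patterns point to different kinds of triggers (an elapsed-time timer versus a calendar schedule). The following is a minimal sketch of that check; the function name `classify_intervals` and the incident dates are hypothetical and are not taken from this system.

```python
from datetime import date

def classify_intervals(incident_dates):
    """Return (gaps_in_days, same_day_of_month) for a list of incident dates."""
    dates = sorted(incident_dates)
    # Gap in days between each pair of consecutive incidents
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    # True if every incident fell on the same day of the month
    same_day_of_month = len({d.day for d in dates}) == 1
    return gaps, same_day_of_month

# Hypothetical incident dates, for illustration only
incidents = [date(2023, 10, 13), date(2023, 11, 12),
             date(2023, 12, 12), date(2024, 1, 11)]
gaps, same_day = classify_intervals(incidents)
print(gaps, same_day)
```

Constant gaps with a drifting day of the month would suggest an elapsed-time trigger, while a fixed day of the month with gaps that vary between 28 and 31 would suggest a calendar-based one.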
Last edited on January 11, 2024, 1:29 pm by minimos · #1
admin
2,930 Posts
January 11, 2024, 7:36 pm
We have a large number of ESXi installations that have been running for years with no issues. If you need to restart the ESXi side and not the PetaSAN side, I would think the PetaSAN side is working as expected. We do not have a monthly scheduled process.
In the dashboard Node Statistics, do you see high disk % busy at the time of the problem? If so, it could be high load relative to the available hardware. What type of disk setup do you have: SSD (model, enterprise or consumer type), pure HDD, or HDD with journal? How many disks in total? In Cluster Statistics, what are the cluster Throughput and IOPS load during that time? Did you set your scrub speed and/or backfill speed too high on the UI Maintenance page? Generally, ESXi is less forgiving of I/O delay (latency) than other clients like Windows or Linux and may stop the datastore if the delay is too high. If you do not have sufficient hardware, you may run into issues.
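If the disk % busy values can be copied out of the charts as (timestamp, percent) samples, a small script can flag sustained high-load windows to compare against the time of the stall. This is only a rough sketch: the function name `busy_windows`, the 80% threshold, the minimum of three consecutive samples, and the sample data are all assumptions for illustration, not PetaSAN defaults or an export format it provides.

```python
def busy_windows(samples, threshold=80.0, min_len=3):
    """Return (start, end) timestamps of runs where busy % stays at or above
    threshold for at least min_len consecutive samples."""
    windows, run = [], []
    for ts, pct in samples:
        if pct >= threshold:
            run.append(ts)
        else:
            # Run ended; keep it only if it lasted long enough
            if len(run) >= min_len:
                windows.append((run[0], run[-1]))
            run = []
    # Handle a run that continues to the last sample
    if len(run) >= min_len:
        windows.append((run[0], run[-1]))
    return windows

# Hypothetical samples, for illustration only
samples = [("13:00", 35.0), ("13:05", 92.0), ("13:10", 97.0),
           ("13:15", 95.0), ("13:20", 40.0)]
print(busy_windows(samples))
```

A window that starts before the ESXi node gets stuck would point toward load as a cause, while one that starts only afterwards would fit the view that it is an effect of queued requests.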
Last edited on January 11, 2024, 7:40 pm by admin · #2