aacraid - Adaptec panics
niqts
12 Posts
March 22, 2022, 5:13 pm
Hi all,
We are running a PetaSAN cluster (newest version) consisting of 5 nodes (3 managers), all with exactly the same hardware components.
We use Adaptec 5405z RAID controllers. The PetaSAN OS is installed on two SSDs in RAID1; all other disks are JBODs serving OSDs.
We have almost no load on our cluster, as it is still in the evaluation phase.
Our nodes (managers and non-managers) crash from time to time, seemingly at random, showing "aacraid panics".
The last messages we see before the servers become unresponsive:
sd 0:1:27:0: rejecting I/O to offline device
sd 0:1:27:0: rejecting I/O to offline device
sd 0:1:27:0: rejecting I/O to offline device
sd 0:1:27:0: rejecting I/O to offline device
aacraid: Host adapter reset request. SCSI hang?
aacraid: Host adapter reset request. SCSI hang?
aacraid: Host adapter reset request. SCSI hang?
aacraid: Host adapter reset request. SCSI hang?
aacraid: Adapter health - 217
AAC0: adapter kernel failed to start, init status = 3.
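To see how often each of these symptoms occurs across a longer log, the messages above can be counted with a small script. This is only a sketch: the patterns are copied from the lines quoted in this post, and the labels (`offline_io`, `adapter_reset`, and so on) and the function name are made up for illustration.

```python
import re
from collections import Counter

# Patterns taken from the messages quoted above; the labels are hypothetical.
# "I/[O0]" tolerates the letter O being transcribed as a zero.
PATTERNS = {
    "offline_io": re.compile(r"rejecting I/[O0] to offline device"),
    "adapter_reset": re.compile(r"aacraid: Host adapter reset request"),
    "adapter_health": re.compile(r"aacraid: Adapter health"),
    "init_failed": re.compile(r"adapter kernel failed to start"),
}


def summarize_aacraid_events(log_text: str) -> Counter:
    """Count how many log lines match each known symptom."""
    counts = Counter()
    for line in log_text.splitlines():
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

The function takes plain text, so it can be fed a saved kernel log or console capture read from a file. A growing `adapter_reset` count before `init_failed` appears would match the sequence described here.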
The firmware and driver of our Adaptec controller should already include fixes for the known issues that can lead to such panics (firmware: 5.2.0 BUILD 18950, driver: "Adaptec aacraid driver 1.2.1[50877]-custom"). Although it is not fully compatible with our model, we tried to update the driver (https://storage.microsemi.com/en-us/speed/raid/aac/linux/aacraid-linux-src-1.2.1-60001_tgz.php), but as PetaSAN is based on its own "tainted" kernel, we cannot install the required dependencies (linux-headers etc.).
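Whether headers for the running kernel are present can be checked before attempting a driver build. The sketch below relies on the common Linux convention that out-of-tree module builds look in `/lib/modules/<release>/build`; the function names are hypothetical, and the build procedure of the Adaptec source package itself is not verified here.

```python
import platform
from pathlib import Path


def kernel_build_dir(release: str = "", modules_root: str = "/lib/modules") -> Path:
    """Return the directory where module builds conventionally find kernel headers."""
    release = release or platform.release()
    return Path(modules_root) / release / "build"


def headers_available(release: str = "", modules_root: str = "/lib/modules") -> bool:
    """Treat a Makefile in the build directory as a sign that headers are installed."""
    return (kernel_build_dir(release, modules_root) / "Makefile").is_file()
```

Called without arguments, `headers_available()` checks the kernel that is currently running, which is the one a rebuilt aacraid module would have to match.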
We also checked that the device timeouts are set to 45 seconds:
cat /sys/block/DEVICE/device/timeout -> 45
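The same check can be run over every disk at once instead of one DEVICE at a time. A minimal sketch, assuming the sysfs layout shown in the command above; the function names and the `minimum` default of 45 are illustrative choices, not PetaSAN tooling.

```python
from pathlib import Path


def read_scsi_timeouts(sys_block: str = "/sys/block") -> dict:
    """Map each block device name to its command timeout in seconds."""
    timeouts = {}
    for entry in sorted(Path(sys_block).iterdir()):
        timeout_file = entry / "device" / "timeout"
        if not timeout_file.is_file():
            # Devices such as loop or ram disks have no SCSI timeout attribute.
            continue
        try:
            timeouts[entry.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # Skip devices whose timeout cannot be read or parsed.
            continue
    return timeouts


def devices_below(timeouts: dict, minimum: int = 45) -> list:
    """Return the names of devices whose timeout is lower than the minimum."""
    return sorted(name for name, value in timeouts.items() if value < minimum)
```

Running `devices_below(read_scsi_timeouts())` on each node lists any disk that still has a shorter timeout than the rest.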
Do you have any hints as to what could be causing our issues?
At the beginning of our evaluation we also shut down the cluster every evening. At some point, some OSD physical disks of our nodes started being "ejected" when we started the cluster nodes again the next morning (status "down"; we had to manually remove and re-add them).
Maybe the regular graceful shutdown of the cluster broke it and it's time for a reinstallation from scratch?
Our research often suggests this could be related to smartctl commands (which, according to the logs, PetaSAN also seems to use) and to Ubuntu versions newer than 16.04 with their newer kernels.
Thanks in advance and kind regards
Last edited on March 22, 2022, 5:19 pm by niqts · #1
niqts
12 Posts
June 14, 2022, 2:19 pm
Hi all,
We completely reinstalled our PetaSAN cluster with version 3.1.0.
Unfortunately, we still have exactly the same issues.
We did a lot of research, and it seems there were similar issues with similar kernel versions in Ubuntu. Someone stated that it started with the kernel switch from 4.12 to 4.13 -> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1777586
We ran a test setup and installed Ubuntu 20.04 on one of the PetaSAN nodes. We ran some random workloads and have not had any issues so far, though I admit this setup is hard to compare.
We are also not sure whether the fix mentioned in the Launchpad discussion made it upstream.
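To sort nodes by whether their kernel falls on the reported side of that switch, the release string can be compared against 4.13. This is a rough sketch: the threshold comes only from the report mentioned above, the function names are hypothetical, and it says nothing about whether a given later kernel already contains a fix.

```python
import platform
import re


def parse_kernel_version(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '5.4.0-100-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def in_reported_range(release: str, first_affected: tuple = (4, 13)) -> bool:
    """True if the kernel is at or after the version where the issue was reported to start."""
    return parse_kernel_version(release) >= first_affected
```

On a node, `in_reported_range(platform.release())` evaluates the running kernel; release strings collected from other machines can be passed in directly.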
Is there any chance that one of you could look into this issue with us?
Thanks in advance and kind regards.
admin
2,930 Posts
June 15, 2022, 3:04 pm
You can download the kernel headers from
https://drive.google.com/file/d/1ZxvKLh5779LIcUSOHgIC8vqYcDRPHlNo/view?usp=sharing
You can also buy support from us if you need it 🙂
niqts
12 Posts
June 22, 2022, 3:07 pm
Thanks a lot for the kernel headers!