aacraid - Adaptec panics

Hi all,

We are running a PetaSAN cluster (newest version) consisting of 5 nodes (3 of them managers), all with exactly the same hardware components.

We use Adaptec 5405Z RAID controllers. The PetaSAN OS is installed on two SSDs in RAID 1; all other disks are JBODs serving OSDs.

There is almost no load on our cluster, as we are still in the evaluation phase.

Our nodes (managers and non-managers) crash from time to time at seemingly random intervals, showing "aacraid panics".

The last thing we see before the servers become unresponsive:

sd 0:1:27:0: rejecting I/O to offline device
sd 0:1:27:0: rejecting I/O to offline device
sd 0:1:27:0: rejecting I/O to offline device
sd 0:1:27:0: rejecting I/O to offline device
aacraid: Host adapter reset request. SCSI hang?
aacraid: Host adapter reset request. SCSI hang?
aacraid: Host adapter reset request. SCSI hang?
aacraid: Host adapter reset request. SCSI hang?
aacraid: Adapter health - 217
AAC0: adapter kernel failed to start, init status = 3.
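To see which disk the "rejecting I/O" messages refer to, the SCSI address in front of each message (host:channel:target:LUN) can be extracted and counted. The following is only a minimal sketch: the function name `offline_devices` is made up for illustration, and the pattern assumes the message format shown above.

```python
import re
from collections import Counter

# Matches kernel lines such as "sd 0:1:27:0: rejecting I/O to offline device".
# The four numbers are the SCSI address: host, channel, target (id), LUN.
# "[O0]" also accepts a zero, in case the line was retyped from a console.
OFFLINE_RE = re.compile(
    r"sd (\d+):(\d+):(\d+):(\d+): rejecting I/[O0] to offline device"
)


def offline_devices(log_text):
    """Count 'rejecting I/O' messages per SCSI address (host, channel, target, lun)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = OFFLINE_RE.search(line)
        if match:
            counts[tuple(int(n) for n in match.groups())] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "sd 0:1:27:0: rejecting I/O to offline device\n"
        "sd 0:1:27:0: rejecting I/O to offline device\n"
        "aacraid: Host adapter reset request. SCSI hang?\n"
    )
    for address, count in offline_devices(sample).items():
        print("H:C:T:L %d:%d:%d:%d -> %d message(s)" % (*address, count))
```

The resulting address can then be compared with the entries under /sys/class/scsi_device/ or with the output of lsscsi to find the affected physical disk.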

The firmware and driver of our Adaptec controller should already contain the fixes for known issues that can lead to such panics (firmware: 5.2.0 BUILD 18950; driver: "Adaptec aacraid driver 1.2.1[50877]-custom"). Although it is not fully compatible with our model, we tried to update the driver (https://storage.microsemi.com/en-us/speed/raid/aac/linux/aacraid-linux-src-1.2.1-60001_tgz.php), but as PetaSAN is based on its own "tainted" kernel, we cannot install the required dependencies (linux-headers etc.).
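Before attempting a driver build, it can help to confirm which aacraid version is actually loaded and whether a kernel build tree is present for the running kernel. This is a rough sketch under two assumptions: that the module exposes its version at /sys/module/aacraid/version, and that headers are reachable via the usual /lib/modules/RELEASE/build path. The function names are hypothetical.

```python
import os
import platform


def aacraid_version(sys_root="/sys"):
    """Return the version string of the loaded aacraid module, or None if unavailable."""
    path = os.path.join(sys_root, "module", "aacraid", "version")
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def headers_present(release=None, modules_root="/lib/modules"):
    """Return True if a kernel build tree exists for the given kernel release."""
    release = release or platform.release()
    # "build" is normally a symlink to the installed headers; isdir() follows it
    # and returns False for a missing or dangling link.
    return os.path.isdir(os.path.join(modules_root, release, "build"))


if __name__ == "__main__":
    print("running kernel :", platform.release())
    print("aacraid version:", aacraid_version())
    print("headers present:", headers_present())
```

If `headers_present()` returns False, an out-of-tree module build for that kernel will most likely fail until matching headers are installed.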

We also checked that the timeouts are set to 45 seconds:

cat /sys/block/DEVICE/device/timeout -> 45
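To run the same check on every disk of a node at once, the sysfs files can be read in a loop. This is a minimal sketch; the function name `scsi_timeouts` and the expected value of 45 are taken from this post and are not a PetaSAN recommendation.

```python
import glob
import os


def scsi_timeouts(sys_block="/sys/block"):
    """Map each block device name to its SCSI command timeout in seconds."""
    timeouts = {}
    pattern = os.path.join(sys_block, "*", "device", "timeout")
    for path in sorted(glob.glob(pattern)):
        # path looks like <sys_block>/<device>/device/timeout
        device = path.split(os.sep)[-3]
        with open(path) as handle:
            timeouts[device] = int(handle.read().strip())
    return timeouts


if __name__ == "__main__":
    expected = 45  # value reported in this thread
    for device, seconds in scsi_timeouts().items():
        note = "" if seconds == expected else "  <-- differs from %d" % expected
        print("%s: %d%s" % (device, seconds, note))
```

Devices without a device/timeout file (for example loop devices) are simply skipped.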

Do you have any hint what could cause our issues?

At the beginning of our evaluation we also shut down the cluster every evening. At some point, some OSD physical disks on our nodes started being "ejected" when we started the cluster nodes again the next morning (status "down"; we had to manually remove and re-add them).

Maybe the regular graceful shutdowns of the cluster broke something, and it's time for a reinstallation from scratch?

Our research often suggests this could be related to smartctl commands (which, judging by the logs, PetaSAN also seems to use) and to Ubuntu versions after 16.04 with their newer kernels.

Thanks in advance and kind regards

Hi all,

We completely reinstalled our PetaSAN cluster with version 3.1.0.

Unfortunately, we still have exactly the same issues.

We did a lot of research, and there seem to have been similar issues with similar kernel versions in Ubuntu. Someone stated that it started with the kernel switch from 4.12 to 4.13: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1777586
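To check quickly whether a node runs a kernel at or after that reported switch, the release string can be compared against 4.13. This is only a sketch: the 4.13 threshold comes from the Launchpad comment above and is not confirmed, and the helper names are made up.

```python
import platform
import re


def kernel_version(release):
    """Extract (major, minor) from a kernel release string such as '4.13.0-16-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def in_suspect_range(release, first_suspect=(4, 13)):
    """Return True if the kernel is at or after the first reportedly affected version."""
    return kernel_version(release) >= first_suspect


if __name__ == "__main__":
    release = platform.release()
    print(release, "-> suspect range:", in_suspect_range(release))
```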

We ran a test setup and installed Ubuntu 20.04 on one of the PetaSAN nodes. We ran some random workloads and have not had any issues so far, though I admit this setup is hard to compare.

We are also not sure whether the fix mentioned in the Launchpad discussion made it upstream.

Is there any chance that one of you could look into this issue with us?

Thanks in advance and kind regards.

You can download the kernel headers from:

https://drive.google.com/file/d/1ZxvKLh5779LIcUSOHgIC8vqYcDRPHlNo/view?usp=sharing

You can also buy support from us if you need it 🙂

Thanks a lot for the kernel headers!