
PetaSAN Troubleshooting Crash

Hello,

We're running PetaSAN v3.3.0, and every day our SAN seems to crash.  The only way to resolve the crash is to reboot one of the nodes.  Our systems are very beefy: we have 3 nodes with about 6 OSDs each.  We don't see anything in any of the logs we've looked at to indicate anything in particular is happening, although there are a lot of scrub messages even though our scrub level is set to medium.  We have 348GB of RAM and multiple Intel(R) Xeon(R) Gold 6242R CPUs @ 3.10GHz.  We're looking for direction on what to check and any optimizations we could make to figure out what is going on.  What are the most common things to look out for?  Here is the output of ceph status:

cluster:
    id:     bb186e65-c7c9-4f1a-be6e-1f7f9a73d3c2
    health: HEALTH_OK

services:
    mon: 3 daemons, quorum es2-psan03,es2-psan01,es2-psan02 (age 8h)
    mgr: es2-psan01(active, since 2d), standbys: es2-psan02, es2-psan03
    mds: 1/1 daemons up, 2 standby
    osd: 18 osds: 18 up (since 8h), 18 in (since 4M)

data:
    volumes: 1/1 healthy
    pools:   10 pools, 769 pgs
    objects: 1.70M objects, 306 GiB
    usage:   2.0 TiB used, 12 TiB / 14 TiB avail
    pgs:     769 active+clean

In addition to the issue above, we are also seeing a very specific problem related to CTDB.  At 6:25 AM (Central Time) on random mornings, CTDB goes into recovery mode.  There is usually an error related to the CTDB lock file, followed by a banning of nodes.  The issue recovers on its own about 5 minutes later.  Here are some logs:

2024/09/11 06:25:02.617646 ctdbd[8499]: ctdb_mutex_fcntl_helper: lock lost - lock file "/opt/petasan/config/shared/ctdb/lockfile" check failed (ret=2)
2024/09/11 06:25:02.623482 ctdb-recoverd[8635]: Recovery lock helper terminated, triggering an election
2024/09/11 06:25:02.623576 ctdbd[8499]: Recovery mode set to ACTIVE
2024/09/11 06:25:05.625607 ctdb-recoverd[8635]: Election period ended, master=2
2024/09/11 06:25:05.626050 ctdb-recoverd[8635]: Node:2 was in recovery mode. Start recovery process
2024/09/11 06:25:05.626075 ctdb-recoverd[8635]: Node:0 was in recovery mode. Start recovery process
2024/09/11 06:25:05.626092 ctdb-recoverd[8635]: Node:1 was in recovery mode. Start recovery process
2024/09/11 06:25:05.626098 ctdb-recoverd[8635]: ../../ctdb/server/ctdb_recoverd.c:1110 Starting do_recovery
2024/09/11 06:25:05.626102 ctdb-recoverd[8635]: Attempting to take recovery lock (/opt/petasan/config/shared/ctdb/lockfile)
2024/09/11 06:25:05.631764 ctdbd[8499]: /usr/lib/x86_64-linux-gnu/ctdb/ctdb_mutex_fcntl_helper: Unable to open /opt/petasan/config/shared/ctdb/lockfile - (No such file or directory)
2024/09/11 06:25:05.631773 ctdb-recoverd[8635]: Unable to take recover lock - unknown error
2024/09/11 06:25:05.631788 ctdb-recoverd[8635]: Banning this node
2024/09/11 06:25:05.631793 ctdb-recoverd[8635]: Banning node 2 for 300 seconds
2024/09/11 06:25:05.631813 ctdbd[8499]: Banning this node for 300 seconds
2024/09/11 06:25:05.631823 ctdbd[8499]: Making node INACTIVE
2024/09/11 06:25:05.631834 ctdbd[8499]: Dropping all public IP addresses
2024/09/11 06:25:05.631870 ctdbd[8499]: Freeze all
2024/09/11 06:25:05.631876 ctdbd[8499]: Freeze db: g_lock.tdb
2024/09/11 06:25:05.632006 ctdbd[8499]: Freeze db: netlogon_creds_cli.tdb
2024/09/11 06:25:05.632106 ctdbd[8499]: Freeze db: smbXsrv_version_global.tdb
2024/09/11 06:25:05.632201 ctdbd[8499]: Freeze db: smbXsrv_client_global.tdb

Hmm.. 6:25 am is when the daily cron jobs run in
/etc/cron.daily/

One of them is a symlink to
/opt/petasan/scripts/cron-1d.py
Edit that file and comment out its last line:

# potential memory leak in gluster client
# unmounting the share will kill gluster client
# will be mounted by PetaSAN within 30 sec

#call_cmd('umount /opt/petasan/config/shared -l')

It is no longer needed. If that has no effect, look at the other daily scripts on the system and what they do.
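
If you want to confirm whether the shared mount actually drops around 06:25, you could also run a quick watcher like the rough sketch below. This is not a PetaSAN script: the lock-file path is taken from your logs, and the output file location is just an example.

#!/usr/bin/env python3
# Rough watcher sketch: record whether the CTDB recovery lock path is reachable,
# so you can see if the shared mount disappears around 06:25.
import os, time, datetime

LOCKFILE = "/opt/petasan/config/shared/ctdb/lockfile"   # path from your logs
SHARE    = "/opt/petasan/config/shared"
LOGFILE  = "/root/ctdb_lock_watch.log"                   # example output path

while True:
    now = datetime.datetime.now().isoformat(timespec="seconds")
    mounted = os.path.ismount(SHARE)
    exists = os.path.exists(LOCKFILE)
    with open(LOGFILE, "a") as f:
        f.write(f"{now} mounted={mounted} lockfile_exists={exists}\n")
    time.sleep(10)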

As I understand it, CTDB experiences issues but Ceph health is always OK? Are your issues only with CTDB?

Thank you, we will try that.  We also have a secondary crash happening (separate from the 6:25 AM issue).  It usually happens around 11:22 AM or so, and it requires us to reboot one of the node boxes to get things back up.  I will have IT take screenshots of the GUI and send them over to see if we can figure it out.  We will also capture ceph status and ceph health after the crash.  Is there anything else we could capture?  Thanks

What do you mean by crash? Is the system hanging? Is it responsive? Can you SSH to it? Can you ping it? Can you run Ceph commands on it and does it respond?

Do you see any high memory, disk, or CPU utilization in the dashboard charts?

Do you see any errors in /var/log/syslog or dmesg?

Do you use iSCSI/SMB/NFS/S3/pure Ceph RBD/CephFS? Is the crash related to a specific workload?

Are you using HDD / SSD / hybrid?

Could it be a hardware or network issue?
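
To answer "anything else we could capture": right after it happens, grab the items above in one go so nothing is lost to a reboot. Below is a rough capture sketch, not a PetaSAN tool; it assumes Python 3 on the node, and the command list and output path are just examples you can adjust.

#!/usr/bin/env python3
# Quick capture sketch: dump basic diagnostics to one timestamped file
# right after the 11:22 AM event.
import subprocess, datetime

CMDS = [
    "ceph status",
    "ceph health detail",
    "free -m",
    "uptime",
    "dmesg | tail -n 200",
    "tail -n 200 /var/log/syslog",
]

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
out = f"/root/crash-capture-{stamp}.txt"   # example output path

with open(out, "w") as f:
    for cmd in CMDS:
        f.write(f"===== {cmd} =====\n")
        try:
            # capture stdout+stderr; don't hang forever if a command stalls
            r = subprocess.run(cmd, shell=True, capture_output=True,
                               text=True, timeout=30)
            f.write(r.stdout + r.stderr + "\n")
        except subprocess.TimeoutExpired:
            f.write("(timed out)\n")
print("wrote", out)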