
Logrotate settings? (High disk usage on / partition)


So, this failed again overnight - one of the nodes again ran out of disk space, and the CTDB database got corrupted.

It seems that when doing operations like this it's not unusual for the /var/lib/ctdb/volatile/locking.tdb file to grow to many gigabytes on some nodes; in my case it had grown to 4.1G.

The issue seems to be that, within the limited confines of the standard 15G root partition, this growth combined with the several gigabytes of logs and other files that often accumulate is enough to take the partition to 100% usage.

Nothing seemed out of the ordinary with the log files, but I did need to clean up the /var/log/collectl folder, as it held a few gigabytes' worth of .raw.gz files going back quite a while. These appear to be in a binary format rather than plain text logs - is there something that should be cleaning these up, or is there a reason they would be so large under normal operation? (The interim cleanup I'm using is below the listing.)

 

root@gl-san-02b:/var/log/collectl# ls -toah
total 2.6G
-rw-r--r-- 1 root 0 Mar 15 05:53 gl-san-02b-20230315-055320.raw.gz
-rw-r--r-- 1 root 2.1K Mar 15 05:53 gl-san-02b-collectl-202303.log
drwxr-xr-x 2 root 4.0K Mar 15 05:53 .
-rw-r--r-- 1 root 0 Mar 15 05:53 gl-san-02b-20230315-055300.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:53 gl-san-02b-20230315-055250.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:52 gl-san-02b-20230315-055220.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:52 gl-san-02b-20230315-055200.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:52 gl-san-02b-20230315-055150.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:51 gl-san-02b-20230315-055120.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:51 gl-san-02b-20230315-055100.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:51 gl-san-02b-20230315-055050.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:50 gl-san-02b-20230315-055020.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:50 gl-san-02b-20230315-055000.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:50 gl-san-02b-20230315-054950.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:49 gl-san-02b-20230315-054920.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:49 gl-san-02b-20230315-054900.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:49 gl-san-02b-20230315-054850.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:48 gl-san-02b-20230315-054820.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:48 gl-san-02b-20230315-054800.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:48 gl-san-02b-20230315-054750.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:47 gl-san-02b-20230315-054720.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:47 gl-san-02b-20230315-054700.raw.gz
-rw-r--r-- 1 root 87M Mar 15 05:47 gl-san-02b-20230315-000000.raw.gz
-rw-r--r-- 1 root 361M Mar 14 23:59 gl-san-02b-20230314-000000.raw.gz
drwxrwxr-x 19 root 4.0K Mar 14 13:17 ..
-rw-r--r-- 1 root 362M Mar 13 23:59 gl-san-02b-20230313-000000.raw.gz
-rw-r--r-- 1 root 363M Mar 12 23:59 gl-san-02b-20230312-000000.raw.gz
-rw-r--r-- 1 root 361M Mar 11 23:59 gl-san-02b-20230311-000000.raw.gz
-rw-r--r-- 1 root 362M Mar 10 23:59 gl-san-02b-20230310-000000.raw.gz
-rw-r--r-- 1 root 358M Mar 9 23:59 gl-san-02b-20230309-000000.raw.gz
-rw-r--r-- 1 root 354M Mar 9 00:00 gl-san-02b-20230308-000000.raw.gz
-rw-r--r-- 1 root 896 Feb 28 00:00 gl-san-02b-collectl-202302.log
-rw-r--r-- 1 root 992 Jan 30 23:59 gl-san-02b-collectl-202301.log
-rw-r--r-- 1 root 992 Dec 30 23:59 gl-san-02b-collectl-202212.log
-rw-r--r-- 1 root 960 Nov 29 23:59 gl-san-02b-collectl-202211.log
-rw-r--r-- 1 root 1.1K Oct 30 23:00 gl-san-02b-collectl-202210.log
-rw-r--r-- 1 root 960 Sep 29 23:59 gl-san-02b-collectl-202209.log
-rw-r--r-- 1 root 992 Aug 31 2022 gl-san-02b-collectl-202208.log
-rw-r--r-- 1 root 1.4K Jul 30 2022 gl-san-02b-collectl-202207.log
-rw-r--r-- 1 root 960 Jun 29 2022 gl-san-02b-collectl-202206.log
-rw-r--r-- 1 root 992 May 30 2022 gl-san-02b-collectl-202205.log
-rw-r--r-- 1 root 960 Apr 30 2022 gl-san-02b-collectl-202204.log
-rw-r--r-- 1 root 2.1K Mar 31 2022 gl-san-02b-collectl-202203.log
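
In the meantime I'm keeping them in check with something along these lines (just a sketch - the 7-day retention and the cron schedule are my own arbitrary choices, not anything collectl or PetaSAN ships):

# delete collectl raw samples older than 7 days (assumed retention window)
find /var/log/collectl -name '*.raw.gz' -mtime +7 -delete

# or the same thing daily, e.g. via a hypothetical /etc/cron.d/collectl-cleanup entry:
# 0 4 * * * root find /var/log/collectl -name '*.raw.gz' -mtime +7 -delete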

Note that the links and suggestions in my earlier post were for the "High hopcount" messages in the logs, which indicate database performance contention.

You can move /var/lib/ctdb/volatile to a larger partition, such as partition 4 or 5 where you have remaining disk space, by creating a symlink. However, I tend to think this may not be ideal: once contention starts and the log starts to fill, in many cases it is like an avalanche, and giving it more space may not be the answer.

If you are deleting millions of files, you could delete them from the underlying CephFS mount in /mnt/cephfs after making sure none of your clients have open files. Another idea is to reduce the number of active CIFS servers by temporarily moving the CIFS IPs to 1 or 2 nodes using the PetaSAN UI, stopping the service on the drained nodes via systemctl stop petasan-cifs, waiting for the status to be OK, and then performing the delete from Windows (it could help to keep the ctdb master node, as shown in ctdb status, among the remaining nodes, else it may take more time for the status to become OK). From the description of the high hopcount issue, one factor would be the number of active nodes.
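
Roughly, the drain sequence would look like this (a sketch only, I have not run these exact steps):

# on each node being drained, after its CIFS IPs have been moved away in the PetaSAN UI
systemctl stop petasan-cifs

# then, from one of the remaining nodes, wait until all remaining nodes show OK
ctdb status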

Do you think changing the source code to suppress or rate limit the hopcount message would solve the issue?

 

In this case I think the "high hopcount" messages were a symptom of the issue rather than the cause. I now believe that they started to appear *after* CTDB ran out of disk space and the database got corrupted.

From what I can see, the problem last night was not caused by the log.ctdb file filling up with these messages, as I now have some fairly aggressive log rotation on that file.
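
For reference, that rotation is along these lines (a sketch of what I use, not an official PetaSAN config - the drop-in name and thresholds are my own choices, and log.ctdb may live in a different directory on other versions):

cat > /etc/logrotate.d/ctdb-custom <<'EOF'
# rotate the CTDB log once it passes 100M (arbitrary threshold), keep 3 compressed copies
/var/log/log.ctdb {
    size 100M
    rotate 3
    compress
    missingok
    copytruncate
}
EOF

Bear in mind logrotate only checks the size when it actually runs (daily by default), so this alone won't catch a very fast burst of messages.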

I think the issue is the limited space on the root partition, where oftentimes there isn't a spare 5G for CTDB to operate normally. I will try moving /var/lib/ctdb/ to one of the other partitions as you suggest and continue to monitor.

Thanks for your help on this!

It could very well be. Can you please let me know if this is the case, and what size the db grows to while deleting millions of files?

To add to the above:

you could create symlink to:

/opt/petasan/config/var/lib/

we already did something similar for glusterd

Please let us know the db size while deleting millions of files.

Quote from admin on March 15, 2023, 12:33 pm

It could very well be. Can you please let me know if this is the case, and what size the db grows to while deleting millions of files?

Kicked off a deletion of just under 1 million files, and it seems to sit around 4G-5G for the /var/lib/ctdb/volatile/locking.tdb.2 file, plus another few hundred MB for the other files in that folder.

I've been keeping a close eye on it, and it seems to work fine as long as it doesn't run out of space (I had to trim down some of the collectl .raw.gz files and other logs to ensure there's plenty of space - should there be something else keeping these in check?)
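
For what it's worth, I'm watching the growth during the deletion with nothing more than standard tools, something like:

# refresh every 60s: size of the CTDB volatile db directory plus free space on /
watch -n 60 'du -sh /var/lib/ctdb/volatile; df -h /'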

So, for the symlink, something like this? (with petasan-cifs and ctdb services stopped)

cp /var/lib/ctdb /opt/petasan/config/var/lib/

ln -s /opt/petasan/config/var/lib/ctdb /var/lib/ctdb

Thanks for the info

It should be more like this (I have not tried it):

systemctl stop petasan-cifs

cp -r /var/lib/ctdb /opt/petasan/config/var/lib/
rm -r /var/lib/ctdb
ln -s /opt/petasan/config/var/lib/ctdb /var/lib/ctdb

systemctl start petasan-cifs
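
Afterwards, a quick sanity check would be something like this (again, not tested as an exact sequence):

ls -ld /var/lib/ctdb        # should now be a symlink to /opt/petasan/config/var/lib/ctdb
df -h /opt/petasan/config   # confirm which partition the db now sits on
ctdb status                 # all nodes should report OK once services are back up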

That seemed to work, thanks!

I've some more deletions to carry out (going through the process of cleaning out old customer documents in line with data retention policies), and will let you know how things behave.

I did notice that stopping petasan-cifs didn't seem to stop the ctdb service as well, so I manually stopped that too.

Very good, things are working.

Can you double check that stopping petasan-cifs does not stop ctdb? It should, though there may be some delay.
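
Something like this should show it (the exact delay to allow is a guess on my part):

systemctl stop petasan-cifs
sleep 30                     # give the service time to bring ctdb down
systemctl is-active ctdb     # should eventually report inactive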
