Logrotate settings? (High disk usage on / partition)
wluke
66 Posts
March 15, 2023, 11:12 am
So, this failed again overnight - one of the nodes again ran out of disk space, and the CTDB database got corrupted.
It seems that when doing operations such as this, it's not unusual for /var/lib/ctdb/volatile/locking.tdb to grow to many gigabytes on some nodes; in my case the file had grown to 4.1G.
The issue seems to be that, with the constraints of the standard 15G root partition, plus often several gigabytes of logs and other files, this is enough to take the partition to 100% usage.
Nothing seemed out of the ordinary with the log files, but I did need to clean up the /var/log/collectl folder, as it held a few gigabytes of .raw.gz files going back a while. These appear to be in a non-text format rather than plain log files - is there something that should be cleaning these up, or is there a reason they would be so large under normal operation?
root@gl-san-02b:/var/log/collectl# ls -toah
total 2.6G
-rw-r--r-- 1 root 0 Mar 15 05:53 gl-san-02b-20230315-055320.raw.gz
-rw-r--r-- 1 root 2.1K Mar 15 05:53 gl-san-02b-collectl-202303.log
drwxr-xr-x 2 root 4.0K Mar 15 05:53 .
-rw-r--r-- 1 root 0 Mar 15 05:53 gl-san-02b-20230315-055300.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:53 gl-san-02b-20230315-055250.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:52 gl-san-02b-20230315-055220.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:52 gl-san-02b-20230315-055200.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:52 gl-san-02b-20230315-055150.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:51 gl-san-02b-20230315-055120.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:51 gl-san-02b-20230315-055100.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:51 gl-san-02b-20230315-055050.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:50 gl-san-02b-20230315-055020.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:50 gl-san-02b-20230315-055000.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:50 gl-san-02b-20230315-054950.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:49 gl-san-02b-20230315-054920.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:49 gl-san-02b-20230315-054900.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:49 gl-san-02b-20230315-054850.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:48 gl-san-02b-20230315-054820.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:48 gl-san-02b-20230315-054800.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:48 gl-san-02b-20230315-054750.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:47 gl-san-02b-20230315-054720.raw.gz
-rw-r--r-- 1 root 0 Mar 15 05:47 gl-san-02b-20230315-054700.raw.gz
-rw-r--r-- 1 root 87M Mar 15 05:47 gl-san-02b-20230315-000000.raw.gz
-rw-r--r-- 1 root 361M Mar 14 23:59 gl-san-02b-20230314-000000.raw.gz
drwxrwxr-x 19 root 4.0K Mar 14 13:17 ..
-rw-r--r-- 1 root 362M Mar 13 23:59 gl-san-02b-20230313-000000.raw.gz
-rw-r--r-- 1 root 363M Mar 12 23:59 gl-san-02b-20230312-000000.raw.gz
-rw-r--r-- 1 root 361M Mar 11 23:59 gl-san-02b-20230311-000000.raw.gz
-rw-r--r-- 1 root 362M Mar 10 23:59 gl-san-02b-20230310-000000.raw.gz
-rw-r--r-- 1 root 358M Mar 9 23:59 gl-san-02b-20230309-000000.raw.gz
-rw-r--r-- 1 root 354M Mar 9 00:00 gl-san-02b-20230308-000000.raw.gz
-rw-r--r-- 1 root 896 Feb 28 00:00 gl-san-02b-collectl-202302.log
-rw-r--r-- 1 root 992 Jan 30 23:59 gl-san-02b-collectl-202301.log
-rw-r--r-- 1 root 992 Dec 30 23:59 gl-san-02b-collectl-202212.log
-rw-r--r-- 1 root 960 Nov 29 23:59 gl-san-02b-collectl-202211.log
-rw-r--r-- 1 root 1.1K Oct 30 23:00 gl-san-02b-collectl-202210.log
-rw-r--r-- 1 root 960 Sep 29 23:59 gl-san-02b-collectl-202209.log
-rw-r--r-- 1 root 992 Aug 31 2022 gl-san-02b-collectl-202208.log
-rw-r--r-- 1 root 1.4K Jul 30 2022 gl-san-02b-collectl-202207.log
-rw-r--r-- 1 root 960 Jun 29 2022 gl-san-02b-collectl-202206.log
-rw-r--r-- 1 root 992 May 30 2022 gl-san-02b-collectl-202205.log
-rw-r--r-- 1 root 960 Apr 30 2022 gl-san-02b-collectl-202204.log
-rw-r--r-- 1 root 2.1K Mar 31 2022 gl-san-02b-collectl-202203.log
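For now I've been keeping these in check by hand; something along these lines run daily would do the same automatically - just a sketch, assuming a week of collectl history is enough to keep around:
find /var/log/collectl -name '*.raw.gz' -mtime +7 -delete
With the daily archives at roughly 360M each, that would cap the folder at around 2.5G.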
admin
2,930 Posts
March 15, 2023, 11:56 am
Note the links and suggestions in my earlier post were for the "High hopcount" messages in the logs, which indicate database performance contention.
You can move /var/lib/ctdb/volatile to a larger partition, such as partition 4 or 5 where you have remaining disk space, by creating a symlink. However, I tend to think this may not be ideal: once contention starts and the log starts to fill, in many cases it snowballs, and giving it more space may not be the answer.
If you are deleting millions of files, you could instead delete them from the underlying ceph mount in /mnt/cepfs, after making sure none of your clients have open files. Another idea is to reduce the number of active CIFS servers by temporarily moving the CIFS IPs to 1 or 2 nodes using the PetaSAN UI, stopping the service on the other nodes via systemctl stop petasan-cifs, waiting for the status to be OK, and then performing the delete from Windows (it could help to leave the ctdb master node, as seen in ctdb status, in place, otherwise it may take more time for the status to become OK). From the description of the high hopcount issue, one factor is the number of active nodes.
Do you think changing the source code to suppress or rate-limit the hopcount message would solve the issue?
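To gauge which partition has headroom and how big the volatile databases are getting during such an operation, something like the following is usually enough (a rough sketch; adjust paths to your layout):
df -h /
lsblk
du -sh /var/lib/ctdb/volatile/*
ctdb status
The last command also shows which node is currently the ctdb recovery master.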
Last edited on March 15, 2023, 11:58 am by admin · #12
wluke
66 Posts
March 15, 2023, 12:23 pm
In this case I think the "high hopcount" messages were a symptom of the issue rather than the cause. I now believe they started to appear *after* CTDB ran out of disk space and the database got corrupted.
From what I can see, the problem last night was not caused by the log.ctdb file filling up with these messages (I now have some fairly aggressive log rotation on that file).
I think the issue is the limited space of the root partition, where oftentimes there is not a spare 5G for CTDB to operate normally. I will try moving /var/lib/ctdb/ to one of the other partitions as you suggest and continue to monitor.
Thanks for your help on this!
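For reference, the aggressive rotation I mentioned is along these lines - a sketch only, assuming ctdb is logging to /var/log/log.ctdb on these nodes:
/var/log/log.ctdb {
    size 100M
    rotate 4
    compress
    missingok
    notifempty
    copytruncate
}
Since logrotate only acts when it runs, the size cap is only as tight as the schedule, so it's worth invoking it from an hourly cron entry rather than relying on the default daily run. copytruncate avoids having to signal ctdb on rotation, at the cost of possibly losing a few lines at the moment of truncation.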
admin
2,930 Posts
March 15, 2023, 12:33 pm
It could very well be. Can you please let me know if this is the case, and what size the db grows to while deleting millions of files?
admin
2,930 Posts
March 15, 2023, 12:46 pm
To add to the above: you could create a symlink to:
/opt/petasan/config/var/lib/
We already did something similar for glusterd.
Please let us know the db size while deleting millions of files.
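If it is set up the same way as the glusterd case, ls -ld /var/lib/glusterd on a node should show it pointing under /opt/petasan/config/var/lib/ - worth checking as a reference before doing the same for ctdb.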
Last edited on March 15, 2023, 12:47 pm by admin · #15
wluke
66 Posts
March 15, 2023, 1:16 pm
Quote from admin on March 15, 2023, 12:33 pm
It could very well be. Can you please let me know if this is the case, and what size does the db grow to while deleting millions of files ?
I kicked off a deletion of just under 1 million files, and the /var/lib/ctdb/volatile/locking.tdb.2 file seems to sit around 4G-5G, with another few hundred MB across the other files in that folder.
I've been keeping a close eye on it, and it seems to work fine as long as it doesn't run out of space (I had to trim down some of the collectl .raw.gz files and other logs to ensure there's plenty of space - should something else be keeping these in check?)
So, for the symlink, something like this? (with petasan-cifs and ctdb services stopped)
cp /var/lib/ctdb /opt/petasan/config/var/lib/
ln -s /opt/petasan/config/var/lib/ctdb /var/lib/ctdb
Last edited on March 15, 2023, 1:19 pm by wluke · #16
admin
2,930 Posts
March 15, 2023, 1:23 pm
Thanks for the info.
It should be more like this (I have not tried it):
systemctl stop petasan-cifs
cp -r /var/lib/ctdb /opt/petasan/config/var/lib/
rm -r /var/lib/ctdb
ln -s /opt/petasan/config/var/lib/ctdb /var/lib/ctdb
systemctl start petasan-cifs
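A quick check afterwards would not hurt (again untested, just a sketch):
ls -ld /var/lib/ctdb
df -h /opt/petasan/config
ctdb status
The first should now show a symlink into /opt/petasan/config/var/lib/, and ctdb status should return to OK on all nodes once the service is back.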
Last edited on March 15, 2023, 1:24 pm by admin · #17
wluke
66 Posts
March 15, 2023, 1:46 pm
That seemed to work, thanks!
I've some more deletions to carry out (going through the process of cleaning out old customer documents in line with data retention policies), and will let you know how things behave.
wluke
66 Posts
March 15, 2023, 1:56 pm
I did notice that stopping petasan-cifs didn't seem to also stop the ctdb service, so I manually stopped that too.
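For anyone following along, the sequence on each node ends up being roughly:
systemctl stop petasan-cifs
systemctl is-active ctdb
systemctl stop ctdb
with systemctl is-active there just to confirm whether ctdb is still running before stopping it explicitly.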
admin
2,930 Posts
March 15, 2023, 2:41 pm
Very good, things are working.
Can you double-check that stopping petasan-cifs does not stop ctdb? It should, though there may be some delay.