Cluster node shutdown because of high memory usage of glusterfs


This is a newly deployed pair of clusters, installed fresh with v2.6.2 and put into production in mid-November.

I checked and there is a symlink from /etc/cron.daily/cron-1d to /opt/petasan/scripts/cron-1d.py,

and the /etc/crontab file contains:


# /etc/crontab: system-wide crontab
# Unlike any other crontab you don't have to run the `crontab'
# command to install the new version when you edit this file
# and files in /etc/cron.d. These files also have username fields,
# that none of the other crontabs do.

SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

# m h dom mon dow user command
17 * * * * root cd / && run-parts --report /etc/cron.hourly
25 6 * * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
47 6 * * 7 root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
52 6 1 * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly )
#
*/9 * * * * root logrotate /etc/logrotate.d/nagent >/dev/null

Anacron doesn't appear to be installed on any of the nodes, but looking at the code it should fall back to using run-parts. I checked the journal logs, but the script doesn't seem to output anything, so there is nothing in the journal to indicate whether it was actually being run.

When I run run-parts --test /etc/cron.daily, it lists cron-1d as one of the scripts it would attempt to run.
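
For what it's worth, assuming cron logs to the journal on these nodes, something like this should show whether the daily run-parts was invoked at all, independent of whether the script itself prints anything:

journalctl -t CRON --since yesterday | grep cron.daily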

I would recommend you first make sure the memory is gluster related and is reclaimed with:

umount /opt/petasan/config/shared
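
A rough way to confirm it is the gluster client holding the memory is to look at the resident size of the glusterfs FUSE process before and after the umount (assuming the shared mount uses the usual glusterfs client process):

ps -C glusterfs -o pid,rss,etime,cmd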

If so, then cron is not running. I just checked here and cron does run. Can you add a log command to /opt/petasan/scripts/cron-1d.py:

call_cmd('umount /opt/petasan/config/shared')
call_cmd('echo $(date) > /opt/petasan/log/cron1d.log ')

and see if indeed it is running. The default daily cron run is at 6:25 as defined in /etc/crontab, but you can change it.
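
For example, to have the daily scripts run at 1:25 am instead (the time here is just an illustration), the cron.daily line in /etc/crontab would become:

25 1 * * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )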

One last thing: make sure the cron service is ok:

systemctl status cron
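
and, if you like, that it is enabled to start at boot:

systemctl is-enabled cron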

I checked the service status for cron and it is running on all the nodes and set to start automatically.

I added the logging to the cron-1d.py file and will check in the morning to see if it writes anything out

I ran the "umount /opt/petasan/config/shared" command and it came back with the error/warning "umount: /opt/petasan/config/shared: target is busy." After running it I waited a good 10 minutes and it didn't clear the memory usage. I then ran it with the --force flag; the command returned the same warning, but the memory went down almost immediately (within seconds).
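
Next time I catch it in that state I may check what is actually holding the mount busy; assuming fuser is available on these nodes, something like this should list the processes keeping the mount point in use:

fuser -vm /opt/petasan/config/shared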

Wondering if it gets into some sort of situation where the umount won't work without the force, and that's why the memory just keeps building up, because it's not actually unmounting it?

Can you try the -l (lazy) flag when unmounting, in /opt/petasan/scripts/cron-1d.py:

call_cmd('umount /opt/petasan/config/shared -l')
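
With -l the filesystem is detached from the hierarchy immediately and the kernel cleans up the remaining references once it is no longer busy, so the command should not fail with "target is busy" the way a plain umount does.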

Checked the logs and the cron is running and writing to the log file as expected

Adding the lazy option appears to have worked

Looking at the memory usage graphs on the nodes running the Windows cluster, the memory goes up by 2-3% over the course of a day on one of the three nodes, and then at roughly 6:30ish every morning the memory usage drops by 2-3%. On the Linux cluster the memory goes up by about 1% on one of the nodes and goes down at 6:30ish. The memory change only occurs on one of the 3 nodes in each of the clusters.

Before adding that option the graphs did not show that daily drop, just a gradual build-up.

Going to keep an eye on it over the next week or so, but it looks like that tweak may have taken care of the issue.

Is there a "known" reason why there is a memory leak with that service in the first place?

The leak seems to have been more pronounced on the cluster serving the Windows guests than the Linux ones. Our Windows hosts have overall more IO throughput than the Linux ones by almost a factor of 2, so I'm guessing whatever is causing the leak has something to do with the load placed on the PetaSAN nodes.
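
To get a better picture I am thinking of logging the gluster client's resident memory over time and lining it up with the IO load; a simple hourly entry in /etc/crontab along these lines should do it (the log path is just an example):

0 * * * * root echo "$(date) $(ps -C glusterfs -o rss= | awk '{s+=$1} END {print s}')" >> /var/log/glusterfs-rss.log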

Good it is working now; I assume the charts data is showing 🙂

It has been a while since we looked at this; there were some bug reports on gluster client memory leaks, and one fix was to umount as we did.

Yes the graphs are working. They stop working for a few minutes after the command runs when I run it manually, but come back on their own. I'm guessing it probably does the same when the cron runs, but I'm typically not up staring at it at 6:30am.
