Cluster node shutdown because of high memory usage of glusterfs
RobertH
27 Posts
December 14, 2020, 1:36 pm
This is a newly deployed pair of clusters, installed fresh with v2.6.2 and put into production in mid-November.
I checked, and /etc/cron.daily/cron-1d is a symlink to /opt/petasan/scripts/cron-1d.py,
and the /etc/crontab file contains:
# /etc/crontab: system-wide crontab
# Unlike any other crontab you don't have to run the `crontab'
# command to install the new version when you edit this file
# and files in /etc/cron.d. These files also have username fields,
# that none of the other crontabs do.
SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
# m h dom mon dow user command
17 * * * * root cd / && run-parts --report /etc/cron.hourly
25 6 * * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )
47 6 * * 7 root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.weekly )
52 6 1 * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.monthly )
#
*/9 * * * * root logrotate /etc/logrotate.d/nagent >/dev/null
Anacron doesn't appear to be installed on any of the nodes, but looking at the crontab entries it should fall back to run-parts. I checked the journal logs, but the script doesn't seem to output anything, so there's nothing in the journal to indicate whether the daily jobs were actually being run.
When I do run-parts --test /etc/cron.daily, it lists that script as one it would attempt to run.
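For reference, a small standalone sketch (not part of PetaSAN; the paths are taken from the listing above) that mirrors the checks the anacron/run-parts fallback in that crontab depends on:
#!/usr/bin/env python3
# Illustrative check only: mirrors the "test -x /usr/sbin/anacron || run-parts" fallback above.
import os

daily_job = '/etc/cron.daily/cron-1d'
target = '/opt/petasan/scripts/cron-1d.py'

# run-parts is only used when anacron is not installed/executable
print('anacron present:', os.access('/usr/sbin/anacron', os.X_OK))

# run-parts will only pick up the job if it resolves to the script and is executable
print('resolves to script:', os.path.realpath(daily_job) == os.path.realpath(target))
print('executable:', os.access(daily_job, os.X_OK))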
admin
2,930 Posts
December 14, 2020, 2:48 pm
I would recommend you first make sure the memory is gluster related and is reclaimed with:
umount /opt/petasan/config/shared
If so, then cron is not running. I just checked here and cron does run. Can you add a log command to /opt/petasan/scripts/cron-1d.py:
call_cmd('umount /opt/petasan/config/shared')
call_cmd('echo $(date) > /opt/petasan/log/cron1d.log ')
and see if it is indeed running. The default daily cron runs at 6:25, as defined in /etc/crontab, but you can change it.
One last thing: make sure the cron service is OK:
systemctl status cron
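For illustration, a minimal sketch of how those two lines could sit in a daily script; the call_cmd helper below is a stand-in built on subprocess, since the real helper used by /opt/petasan/scripts/cron-1d.py is not shown in this thread:
#!/usr/bin/env python3
# Sketch only; the actual /opt/petasan/scripts/cron-1d.py is not reproduced here.
import subprocess

def call_cmd(cmd):
    # stand-in for the script's own helper: run a shell command, return its exit code
    return subprocess.call(cmd, shell=True)

# unmount the gluster-backed shared config mount so the client releases its memory
call_cmd('umount /opt/petasan/config/shared')
# write a timestamp so /opt/petasan/log/cron1d.log shows the daily cron actually ran
call_cmd('echo $(date) > /opt/petasan/log/cron1d.log')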
RobertH
27 Posts
December 14, 2020, 3:12 pm
I checked the service status for cron, and it is running on all the nodes and set to auto-start.
I added the logging to the cron-1d.py file and will check in the morning to see if it writes anything out.
I ran the "umount /opt/petasan/config/shared" command and it came back with the error/warning "umount: /opt/petasan/config/shared: target is busy.". After running it I waited a good 10 minutes and it didn't clear the memory usage. I then ran it again with the --force option; the command returned the same warning, but the memory went down almost immediately (within seconds).
Wondering if it gets into some sort of situation where the umount won't work without the force, and that's why the memory just keeps building up, because it's not actually unmounting it?
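One way to tell whether that unmount actually took effect (a sketch using standard /proc/mounts, nothing PetaSAN-specific) is to check whether the path is still listed as a mount point after the command runs:
#!/usr/bin/env python3
# Report whether /opt/petasan/config/shared is currently a mounted filesystem.
mount_point = '/opt/petasan/config/shared'

with open('/proc/mounts') as f:
    # second field of each /proc/mounts line is the mount point
    mounted = any(line.split()[1] == mount_point for line in f)

print(mount_point, 'mounted:', mounted)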
admin
2,930 Posts
December 14, 2020, 5:53 pm
Can you try the -l (lazy) flag when unmounting, in /opt/petasan/scripts/cron-1d.py:
call_cmd('umount /opt/petasan/config/shared -l')
RobertH
27 Posts
December 16, 2020, 5:04 pm
Checked the logs, and the cron is running and writing to the log file as expected.
Adding the lazy option appears to have worked.
Looking at the memory usage graphs on the nodes running the Windows cluster, the memory goes up by 2-3% over the course of a day on one of the three nodes, and then at roughly 6:30-ish every morning the usage drops by 2-3%. On the Linux cluster the memory goes up by about 1% on one of the nodes and goes down at 6:30-ish. The memory change only occurs on one of the three nodes in each cluster.
Before adding that option, the graphs did not show that daily drop, just a gradual build-up.
Going to keep an eye on it over the next week or so, but it looks like that tweak may have dealt with the issue.
Is there a "known" reason why there is a memory leak with that service in the first place?
The leak seems to have been more pronounced on the cluster serving the Windows guests than the Linux ones. Our Windows hosts have almost twice the overall IO throughput of the Linux ones, so I'm guessing whatever is causing the leak has something to do with the load placed on the PetaSAN nodes.
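If it helps to confirm the growth really is in the gluster client itself, here is a small sketch (plain /proc reading, not PetaSAN code) that prints the resident memory of any glusterfs processes on a node:
#!/usr/bin/env python3
# Print the resident set size (VmRSS) of glusterfs client processes via /proc.
import os

for pid in filter(str.isdigit, os.listdir('/proc')):
    try:
        with open(f'/proc/{pid}/comm') as f:
            name = f.read().strip()
        if not name.startswith('glusterfs'):
            continue
        with open(f'/proc/{pid}/status') as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    print(pid, name, line.split(':', 1)[1].strip())
    except OSError:
        continue  # process exited while being read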
Last edited on December 16, 2020, 5:14 pm by RobertH · #15
admin
2,930 Posts
December 16, 2020, 7:57 pm
Good, it is working now. I assume the chart data is showing 🙂
It has been a while since we looked at this; there were some bug reports on gluster client memory leaks, and one fix was to umount as we did.
RobertH
27 Posts
December 16, 2020, 8:14 pm
Yes, the graphs are working. They stop for a few minutes after the command runs (when I ran it manually) but come back on their own. I'm guessing it probably does the same when the cron runs, but I'm typically not up staring at it at 6:30 am.