
Cluster node shutdown because of high memory usage of glusterfs


Yesterday we had a cluster node shut down unexpectedly.

After bringing it back online, I looked at the PetaSAN log and found the following on the node that shut down:

24/11/2020 15:10:31 ERROR Error executing cmd : rbd showmapped --format json --cluster ceph
24/11/2020 15:10:31 ERROR Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/api.py", line 889, in get_mapped_images ret, out, err = exec_command_ex(cmd)
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/common/cmd.py", line 64, in exec_command_exp = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
File "/usr/lib/python2.7/subprocess.py", line 394, in __init__ errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 938, in _execute_childself.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

This was followed by about an hour and a half's worth of the following (note: we are only using iSCSI on the nodes, no NFS):

24/11/2020 15:10:41 INFO Stopping NFS Service
24/11/2020 15:10:41 INFO NFSServer : clean all local resources

Then finally:

24/11/2020 16:27:07 INFO Cleaned disk path 00001/2.
24/11/2020 16:27:07 INFO Cleaned disk path 00001/1.
24/11/2020 16:27:07 INFO Cleaned disk path 00002/1.
24/11/2020 16:27:07 INFO Cleaned disk path 00002/2.
24/11/2020 16:27:07 INFO PetaSAN cleaned local paths not locked by this node in consul.

before it shut down.
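For reference, the "Cannot allocate memory" in the traceback above is os.fork() failing, i.e. the kernel could not allocate memory for a new process at that point. If it happens again, a quick sanity check of the memory and commit accounting would be something like this (generic commands, nothing PetaSAN-specific):

# How much memory is genuinely available, and how committed memory
# compares to the commit limit (relevant if overcommit is restricted)
grep -E 'MemTotal|MemAvailable|Committed_AS|CommitLimit' /proc/meminfo
cat /proc/sys/vm/overcommit_memory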

Looking at the graphs for the node that had the issue, memory usage had been at or near 100% for the past few days.

I then looked at our second PetaSAN cluster and found similarly high memory usage on one of its nodes. A quick search on the forums turned up this thread: https://www.petasan.org/forums/?view=thread&id=199

I ran the command recommended by Admin (umount /opt/petasan/config/shared) on the node with the high memory usage in the second cluster. The command reported that it couldn't unmount because of locks on the file system, and shortly afterwards the graphs in the management GUI stopped working, but the memory usage shown in top on that node dropped to about 60%, which is where the other nodes were sitting.

It looks like glusterfs has some sort of memory leak that should be addressed or worked around. We're going to keep an eye on it on our end, and if the memory usage starts creeping up again we will probably put in a cron job to recycle the mount, as Admin suggested (rough sketch below).
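Something like the following is what I have in mind for that cron job. This is only a sketch with a hypothetical file name, and it assumes PetaSAN remounts the share on its own after the unmount, which matches what we saw when running the umount by hand:

#!/bin/bash
# Hypothetical /etc/cron.daily/recycle-gluster-shared -- not a PetaSAN file.
# Unmount the shared gluster config mount so the glusterfs client releases
# the memory it has built up; skip if it is not currently mounted.
if mountpoint -q /opt/petasan/config/shared; then
    umount /opt/petasan/config/shared || logger -t recycle-gluster "umount failed"
fi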

Cluster details:

PetaSAN 2.6.2

3x nodes in each cluster, 2x clusters

each node:

  • 64 GB RAM
  • 2x dual-port 10 Gb NICs
  • 5x 2.18 TB SAS OSDs
  • 2x 448 GB NVMe journals
  • 1x 256 GB SATA SSD for the OS

Can you identify which process is consuming this memory?

Impossible to tell on the first cluster, since the node had shut down before we were alerted to any sort of issue.

On the second cluster it was the glusterfs process, like for the person in the thread I linked to; it was consuming most of the memory on the node. After issuing the command (umount /opt/petasan/config/shared), the memory usage on that node dropped to the same level as the other two nodes. The graph in the GUI went blank and said something along the lines of "no data points", but when I checked later in the evening the graphs had come back.

Both clusters have been up and running in production for about 30 days now, and this was the first issue we have had with them.

Is the memory currently increasing?

It does not appear to be on any of the nodes at the moment; all the nodes are hovering at about 50% memory usage right now.

Here is a graph of the usage over the last month on the node that rebooted:

Here is a graph of the usage on the node in the second cluster that I ran the umount command on:

What I mean is the gluster process memory; total memory can fill up because it is used as I/O buffers, which is OK.

I would have to check the process over the span of a few hours or days and see. We currently don't have anything on the system to monitor individual process utilization, as we didn't want to put any additional software on the nodes in case it caused conflicts.

You can use:

atop -m
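If you would rather not leave atop running, the stock procps tools also give a quick one-off view of per-process memory (nothing extra to install):

# Top 10 processes by resident memory (RSS, in KiB)
ps aux --sort=-rss | head -n 11

# Or watch just the gluster client over time
watch -n 60 'ps -o pid,rss,vsz,cmd -C glusterfs'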

 

We installed a monitoring agent from our RMM system on the nodes after the last issue, and got alerted again that a system was running low on RAM.

There is definitely some sort of leak in the glusterfs process handling the /opt/petasan/config/shared volume. I ran the atop -m command before running the umount command (see below).

This time it was on a different node in the same cluster as the first issue. The memory usage ate up all the RAM on the system, ending up with less than 1 MB available. I ran the umount /opt/petasan/config/shared command on it and all the memory was freed up again; none of the other nodes in the cluster show the same memory leak. I checked our other cluster and it doesn't show any high memory usage at this time; it seems to have the same memory leak, but is leaking more slowly than the first cluster.

The cluster this has occurred on twice is handling disk I/O for a Windows Hyper-V cluster (8 hosts) running Windows guest machines: approx. 12 VMs, mostly file/AD servers, plus one Exchange server. The second cluster, where this is happening more slowly, is handling a Hyper-V cluster (8 hosts) running Linux guest machines: approx. 32 VMs ranging from simple DNS to web-hosting platforms.

I ran the atop command you recommended before running the umount, and got the following output:
ATOP - QT-PS-WIN-02 2020/12/12 21:26:07 - 48d10h9m7s elapsed
PRC | sys 6d01h | user 8d16h | | | #proc 385 | #trun 1 | #tslpi 1040 | #tslpu 2 | #zombie 0 | clones 114e6 | | | #exit 5 |
CPU | sys 16% | user 22% | irq 3% | | | idle 1061% | wait 98% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
cpu | sys 2% | user 2% | irq 3% | | | idle 88% | cpu000 w 6% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
cpu | sys 1% | user 3% | irq 0% | | | idle 82% | cpu005 w 14% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
cpu | sys 2% | user 2% | irq 0% | | | idle 87% | cpu011 w 10% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
cpu | sys 1% | user 2% | irq 0% | | | idle 89% | cpu002 w 8% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
cpu | sys 1% | user 2% | irq 0% | | | idle 90% | cpu010 w 7% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
cpu | sys 1% | user 2% | irq 0% | | | idle 88% | cpu003 w 8% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
cpu | sys 1% | user 2% | irq 0% | | | idle 87% | cpu004 w 10% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
cpu | sys 1% | user 2% | irq 0% | | | idle 88% | cpu001 w 9% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
cpu | sys 1% | user 2% | irq 0% | | | idle 91% | cpu007 w 6% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
cpu | sys 1% | user 2% | irq 0% | | | idle 90% | cpu006 w 7% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
cpu | sys 1% | user 2% | irq 0% | | | idle 91% | cpu008 w 7% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
cpu | sys 1% | user 2% | irq 0% | | | idle 91% | cpu009 w 7% | | steal 0% | guest 0% | | curf 2.29GHz | curscal 91% |
CPL | avg1 2.11 | | avg5 2.57 | avg15 2.91 | | | csw 554671e5 | | intr 15390e6 | | | numcpu 12 | |
MEM | tot 62.9G | free 811.8M | cache 368.8M | dirty 1.5M | buff 10.3G | slab 810.8M | slrec 446.7M | shmem 10.7M | shrss 0.9M | shswp 0.0M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
SWP | tot 0.0M | free 0.0M | | | | | | | | | vmcom 33.5G | | vmlim 31.4G |
PAG | scan 44970e3 | steal 4060e4 | | stall 0 | | | | | | | swin 0 | | swout 0 |
LVM | dm-0 | busy 32% | | read 32332e4 | write 4053e5 | KiB/r 84 | KiB/w 65 | | MBr/s 6.4 | MBw/s 6.2 | avq 2.70 | | avio 1.85 ms |
LVM | dm-1 | busy 30% | | read 29669e4 | write 3493e5 | KiB/r 83 | KiB/w 59 | | MBr/s 5.8 | MBw/s 4.9 | avq 1.99 | | avio 1.94 ms |
LVM | dm-4 | busy 18% | | read 16724e4 | write 2850e5 | KiB/r 107 | KiB/w 60 | | MBr/s 4.2 | MBw/s 4.0 | avq 0.18 | | avio 1.68 ms |
LVM | dm-2 | busy 12% | | read 12752e4 | write 1770e5 | KiB/r 100 | KiB/w 56 | | MBr/s 3.0 | MBw/s 2.3 | avq 3.63 | | avio 1.61 ms |
LVM | dm-3 | busy 7% | | read 74899e3 | write 1371e5 | KiB/r 108 | KiB/w 59 | | MBr/s 1.9 | MBw/s 1.9 | avq 4.06 | | avio 1.45 ms |
DSK | sda | busy 32% | | read 30546e4 | write 3897e5 | KiB/r 89 | KiB/w 67 | | MBr/s 6.4 | MBw/s 6.2 | avq 2.32 | | avio 1.94 ms |
DSK | sdb | busy 30% | | read 28108e4 | write 3385e5 | KiB/r 88 | KiB/w 61 | | MBr/s 5.8 | MBw/s 4.9 | avq 1.70 | | avio 2.02 ms |
DSK | sdf | busy 18% | | read 14884e4 | write 2751e5 | KiB/r 120 | KiB/w 62 | | MBr/s 4.2 | MBw/s 4.0 | avq 5.41 | | avio 1.79 ms |
DSK | sdd | busy 12% | | read 11314e4 | write 1723e5 | KiB/r 113 | KiB/w 57 | | MBr/s 3.0 | MBw/s 2.3 | avq 3.38 | | avio 1.72 ms |
DSK | sde | busy 7% | | read 66515e3 | write 1334e5 | KiB/r 122 | KiB/w 60 | | MBr/s 1.9 | MBw/s 1.9 | avq 3.81 | | avio 1.54 ms |
DSK | sdc | busy 3% | | read 198097 | write 1176e5 | KiB/r 9 | KiB/w 20 | | MBr/s 0.0 | MBw/s 0.6 | avq 22.28 | | avio 0.93 ms |
NET | transport | tcpi 21707e6 | tcpo 38054e6 | udpi 23640e3 | udpo 15637e3 | tcpao 7844e3 | tcppo 8203e3 | | tcprs 701895 | tcpie 5 | tcpor 604477 | udpnp 443 | udpie 0 |
NET | network | ipi 217275e5 | | ipo 201999e5 | ipfrw 0 | deliv 2173e7 | | | | | icmpi 38074 | | icmpo 8078 |
NET | eth0 1% | pcki 12190e6 | pcko 13175e6 | sp 10 Gbps | si 122 Mbps | so 129 Mbps | | coll 0 | mlti 139523 | erri 1032 | erro 0 | drpi 0 | drpo 0 |
NET | bond0 1% | pcki 22712e6 | pcko 25356e6 | sp 20 Gbps | si 225 Mbps | so 255 Mbps | | coll 0 | mlti 279049 | erri 2108 | erro 0 | drpi 0 | drpo 8 |
NET | bond0.8 1% | pcki 14645e6 | pcko 12801e6 | sp 20 Gbps | si 224 Mbps | so 253 Mbps | | coll 0 | mlti 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
NET | eth2 1% | pcki 10523e6 | pcko 12181e6 | sp 10 Gbps | si 102 Mbps | so 125 Mbps | | coll 0 | mlti 139526 | erri 1076 | erro 0 | drpi 0 | drpo 0 |
NET | eth1 0% | pcki 29240e5 | pcko 50355e5 | sp 10 Gbps | si 28 Mbps | so 60 Mbps | | coll 0 | mlti 56 | erri 16 | erro 0 | drpi 0 | drpo 0 |
NET | eth3 0% | pcki 29489e5 | pcko 50365e5 | sp 10 Gbps | si 28 Mbps | so 60 Mbps | | coll 0 | mlti 55 | erri 0 | erro 0 | drpi 0 | drpo 0 |
NET | eth1.80 0% | pcki 22667e5 | pcko 24363e5 | sp 10 Gbps | si 27 Mbps | so 60 Mbps | | coll 0 | mlti 24 | erri 0 | erro 0 | drpi 0 | drpo 0 |
NET | eth3.80 0% | pcki 22913e5 | pcko 24358e5 | sp 10 Gbps | si 27 Mbps | so 60 Mbps | | coll 0 | mlti 23 | erri 0 | erro 0 | drpi 0 | drpo 0 |
NET | bond0.2 0% | pcki 11005e3 | pcko 3196666 | sp 20 Gbps | si 1 Kbps | so 1 Kbps | | coll 0 | mlti 94 | erri 0 | erro 0 | drpi 0 | drpo 0 |
NET | lo ---- | pcki 25269e5 | pcko 25269e5 | sp 0 Mbps | si 70 Mbps | so 70 Mbps | | coll 0 | mlti 0 | erri 0 | erro 0 | drpi 0 | drpo 0 |
*** system and process activity since boot ***
PID TID MINFLT MAJFLT VSTEXT VSLIBS VDATA VSTACK VSIZE RSIZE PSIZE VGROW RGROW SWAPSZ RUID EUID MEM CMD 1/23
720603 - 9713e3 19 92K 9108K 25.6G 132K 26.1G 25.5G 0K 26.1G 25.5G 0K root root 41% glusterfs
9282 - 1898e6 336 23088K 20872K 5.6G 916K 5.8G 4.9G 0K 5.8G 4.9G 0K ceph ceph 8% ceph-osd
10254 - 1972e6 51 23088K 20872K 5.6G 924K 5.8G 4.8G 0K 5.8G 4.8G 0K ceph ceph 8% ceph-osd
346881 - 1225e6 60 23088K 20872K 5.4G 916K 5.5G 4.6G 0K 5.5G 4.6G 0K ceph ceph 7% ceph-osd
327012 - 9551e5 83 23088K 20872K 5.1G 920K 5.2G 4.3G 0K 5.2G 4.3G 0K ceph ceph 7% ceph-osd
336315 - 4832e5 35 23088K 20872K 4.7G 916K 4.8G 3.9G 0K 4.8G 3.9G 0K ceph ceph 6% ceph-osd
3651 - 2870e4 381 8640K 27432K 1.1G 920K 1.3G 771.7M 0K 1.3G 771.7M 0K ceph ceph 1% ceph-mon
3650 - 4825e4 307 3872K 39288K 1.0G 920K 1.2G 417.0M 0K 1.2G 417.0M 0K ceph ceph 1% ceph-mgr
6014 - 3165e4 5 160K 6204K 224.0M 132K 280.6M 227.5M 0K 280.6M 227.5M 0K root root 0% ctdb-eventd
1682623 - 2315e5 5 3064K 35344K 641.1M 920K 1.8G 124.0M 0K 1.8G 124.0M 0K root root 0% start_notifica
1682557 - 8998e4 2 200K 29324K 697.4M 924K 2.8G 92620K 0K 2.8G 92620K 0K root root 0% collectd
411 - 2223e4 15326 120K 9380K 28264K 132K 452.3M 91804K 0K 452.3M 91804K 0K root root 0% systemd-journa
3301 - 1650e4 3 3064K 35992K 276.3M 920K 1.2G 90120K 0K 1.2G 90120K 0K root root 0% admin.py
6012 - 2992e5 590 728K 6264K 16136K 132K 143.1M 72060K 0K 143.1M 72060K 0K root root 0% ctdbd
3297 - 5763e5 16 3064K 35340K 211.7M 924K 738.7M 67280K 0K 738.7M 67280K 0K root root 0% iscsi_service.
5404 - 8153e4 0 3064K 35324K 163.4M 920K 306.1M 56172K 0K 306.1M 56172K 0K root root 0% cifs_service.p
4036 - 14142 0 3064K 35340K 162.1M 924K 306.8M 53452K 0K 306.8M 53452K 0K root root 0% console.py

1) Thanks for the feedback. Yes, we have had some reports of this gluster issue in the past; as you state, it does not occur in a consistent way. However, many releases ago we added a cron command to run the umount command daily, so since you had to run it manually to reduce memory, I suspect the cron job is not running correctly. Was this cluster originally installed many releases ago?

The umount is included in /opt/petasan/scripts/cron-1d.py, and there should be a symlink to this script:

ls -l /etc/cron.daily/cron-1d

Can you check whether this is set up as above?
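If the symlink is not there, recreating it by hand should look roughly like this (a sketch based on the paths above; adjust if your install differs):

# Recreate the missing daily cron entry; cron.daily entries must be executable
ln -s /opt/petasan/scripts/cron-1d.py /etc/cron.daily/cron-1d
chmod +x /opt/petasan/scripts/cron-1d.py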

2) On another note: we do frequently see memory near 100% as reported by sysstat/iostat (as well as atop), which we use in our charts, but part of this memory is just kernel buffers used to improve I/O; it is not application-committed memory and is reclaimable. In version 2.7 we deduct this buffer memory from the used memory reported by sysstat.
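To see the difference on a node, you can compare the raw "used" figure with one that excludes reclaimable memory. This is only an illustration of the idea, not necessarily the exact formula the 2.7 charts use:

# Raw view, including the buff/cache breakdown
free -m

# Rough "really used" percentage based on MemAvailable, which excludes
# reclaimable buffers and page cache
awk '/MemTotal/ {t=$2} /MemAvailable/ {a=$2} END {printf "used excl. reclaimable: %.1f%%\n", (t-a)*100/t}' /proc/meminfo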

 
