
Memory leak

Hello Hatem,

It seems that there is a memory leak. We have four HP G9 servers, each with 64 GB RAM, 6 SSD OSDs and two 2×10 Gbit NICs. The free memory is decreasing every day. When we started with PetaSAN we had 32 GB, and after a few days some OSDs were crashing; we could see that the cause was too little memory. We increased the memory to 64 GB, but since we keep losing megabytes of free RAM every day, we are worried that the OSD crashes will start again.

Regards, Carsten

See which processes are taking memory:

atop -m

32 GB is on the low side. If you find your OSDs are taking too much memory, lower it via

bluestore_cache_size_ssd = 1073741824

in your conf file (the default is 3 GB) and restart the OSDs.
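As a rough sketch of what that could look like, assuming a standard ceph.conf layout and systemd-managed OSDs (the exact config path, cluster name and OSD ids depend on your deployment):

# add under the [osd] section of your cluster's ceph.conf
[osd]
bluestore_cache_size_ssd = 1073741824   # 1 GiB, down from the 3 GiB default

# then restart each OSD on the node, e.g. for osd.0
systemctl restart ceph-osd@0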

 

https://we.tl/t-sUezOg4E4X

There is a screenshot and a free.log from the past days. It seems that every OSD is taking about 5.5 GB. At the moment there is still enough free memory.
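(For reference, a log like that free.log can be collected with a simple loop such as the one below; the file path and one-hour interval are only placeholders, not necessarily what was used here:)

# append a timestamped memory snapshot every hour
while true; do { date; free -m; echo; } >> /root/free.log; sleep 3600; done &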

If it stops increasing, you are OK.

There have been recent changes to OSD caching to improve performance by increasing the cache size, especially for SSDs: in 2.2 it is 3 GB, in 2.3 it is 4 GB for the cache size, aside from the rest of the OSD memory. This should be reflected in the docs, which still state 2 GB.

You can adjust it via the conf value posted earlier.
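If you want to see what a running OSD is actually configured with, you can query its admin socket; this assumes the default socket setup, with osd.0 only as an example id:

ceph daemon osd.0 config get bluestore_cache_size_ssd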

Hello,

the system currently has 64 GB RAM, but the free memory keeps shrinking.

top - 14:37:52 up 8 days, 15:36,  1 user,  load average: 0.27, 0.35, 0.41
Tasks: 407 total,   1 running, 406 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.4 us,  0.9 sy,  0.0 ni, 97.4 id,  0.2 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 65826440 total, 10335688 free, 52682724 used,  2808028 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 12097032 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
3361 ceph      20   0 5742732 4.490g  29416 S   0.0  7.2 811:15.15 ceph-osd
2245 ceph      20   0 5787616 4.483g  29588 S   0.0  7.1 879:18.55 ceph-osd
2161 ceph      20   0 5755288 4.421g  29332 S   0.0  7.0 743:29.45 ceph-osd
2504 ceph      20   0 5862844 4.389g  29288 S   6.2  7.0 917:50.79 ceph-osd
2340 ceph      20   0 5804500 4.341g  29396 S   6.2  6.9 785:43.14 ceph-osd
2856 ceph      20   0 5796268 4.244g  29452 S   0.0  6.8 717:14.39 ceph-osd
1957 root      20   0 4232972 3.483g   6344 S   0.0  5.5 340:16.43 glusterfs

top - 06:37:57 up 9 days,  7:36,  1 user,  load average: 0.53, 0.56, 0.58
Tasks: 417 total,   2 running, 415 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.5 us,  0.8 sy,  0.0 ni, 97.4 id,  0.2 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem : 65826440 total,  9377700 free, 53625548 used,  2823192 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 11153976 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2340 ceph      20   0 5804500 4.536g  29396 S   6.9  7.2 820:04.66 ceph-osd
3361 ceph      20   0 5841036 4.489g  29416 S   6.3  7.2 843:55.98 ceph-osd
2161 ceph      20   0 5820824 4.475g  29332 S   5.9  7.1 776:23.41 ceph-osd
2504 ceph      20   0 5895612 4.458g  29288 S   8.2  7.1 954:37.42 ceph-osd
2245 ceph      20   0 5885920 4.375g  29588 S   6.6  7.0 916:32.48 ceph-osd
2856 ceph      20   0 5796268 4.283g  29452 S   6.2  6.8 750:19.23 ceph-osd
1957 root      20   0 4495116 3.758g   6344 S   2.7  6.0 366:16.19 glusterfs

top - 11:22:58 up 9 days, 12:21,  1 user,  load average: 2.67, 2.69, 2.58
Tasks: 416 total,   3 running, 413 sleeping,   0 stopped,   0 zombie
%Cpu(s):  4.5 us,  2.5 sy,  0.0 ni, 91.9 id,  0.5 wa,  0.0 hi,  0.6 si,  0.0 st
KiB Mem : 65826440 total,  9206748 free, 53747900 used,  2871792 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 11026764 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2504 ceph      20   0 5895612 4.600g  29288 S  28.6  7.3   1013:52 ceph-osd
2245 ceph      20   0 5918688 4.546g  29588 S  25.9  7.2 971:24.57 ceph-osd
2340 ceph      20   0 5804500 4.422g  29396 S  24.9  7.0 871:51.41 ceph-osd
2856 ceph      20   0 5796268 4.384g  29452 S  20.2  7.0 794:34.80 ceph-osd
2161 ceph      20   0 5820824 4.346g  29332 S  22.7  6.9 825:44.07 ceph-osd
3361 ceph      20   0 5939340 4.295g  29416 S  24.3  6.8 895:14.77 ceph-osd
1957 root      20   0 4626188 3.839g   6344 S   2.7  6.1 374:18.16 glusterfs

top - 15:07:59 up 9 days, 16:06,  1 user,  load average: 1.32, 1.46, 1.58
Tasks: 412 total,   2 running, 410 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.6 us,  2.0 sy,  0.0 ni, 93.6 id,  0.4 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem : 65826440 total,  8921292 free, 53994352 used,  2910796 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 10775720 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
2161 ceph      20   0 5820824 4.508g  29332 S  17.6  7.2 871:03.41 ceph-osd
2504 ceph      20   0 5961148 4.470g  29288 S  21.5  7.1   1068:31 ceph-osd
2340 ceph      20   0 5870036 4.451g  29396 S  19.8  7.1 920:06.44 ceph-osd
2856 ceph      20   0 5960108 4.450g  29452 S  16.0  7.1 835:20.51 ceph-osd
2245 ceph      20   0 5951456 4.386g  29588 S  19.8  7.0   1022:26 ceph-osd
3361 ceph      20   0 5939340 4.367g  29416 S  18.2  7.0 942:55.13 ceph-osd
1957 root      20   0 4626188 3.907g   6344 S   3.0  6.2 381:07.24 glusterfs

 

The glusterfs daemon is also using more and more memory.

When is the maximum usage per OSD reached?

When is the maximum for glusterfs reached?

We are still using the default cache size of 3 GB per OSD.

If we increase the memory to 80 GB RAM, will that definitely be enough for our configuration?

Thx for your help, Carsten
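(As an aside, a quick way to compare the current footprint against any planned RAM size is to sum the resident memory of all OSD daemons on a node; this is just a convenience one-liner, not a PetaSAN tool:)

# sum resident memory of all ceph-osd processes on this node, in GiB
ps -o rss= -C ceph-osd | awk '{s+=$1} END {printf "%.1f GiB\n", s/1024/1024}'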

For gluster do a

umount /opt/petasan/config/shared

This will clear gluster's memory; it is safe to unmount since we automatically remount it. The gluster share is used for stats.
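To confirm things look sane afterwards, something along these lines should work (the mount point is the one from the command above; the ps call is just a generic check):

# after the umount, confirm the share was remounted automatically
mount | grep /opt/petasan/config/shared
# and check that the glusterfs resident memory has dropped
ps -o pid,rss,cmd -C glusterfs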

If the ceph daemons do not stabilize, do a

ceph daemon osd.X dump_mempools --cluster XX

to see where the memory is going in the OSD. Typically the OSD will take more than the cache assigned to it, which you can lower as per the prior post (3 GB for SSD). If you use compression or EC pools, or if the OSD is doing backfills, it could overshoot this by a large amount.
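If you want to check all OSDs on a node in one go, a loop over the admin sockets could look like this (assuming the default socket directory /var/run/ceph; a convenience sketch, not part of PetaSAN):

# dump the mempool totals for every local OSD admin socket
for sock in /var/run/ceph/*osd*.asok; do
    echo "== $sock =="
    ceph --admin-daemon "$sock" dump_mempools | grep -A 3 '"total"'
done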

Thx a lot.

It seems to be a problem with glusterfs on only one node. On all other nodes the glusterfs daemon uses ~500 MB; on the failing host glusterfs has used at least 5 GB. I've unmounted the volume on the problematic host. I will monitor it and send you an update in the next days.

The OSD daemons are using between 4.5 GB and 4.9 GB RAM. It looks like there is no further increase.

Question: Does it make sense to increase the physical RAM so that, in case of a failure, there are enough resources to avoid getting into trouble because of low memory? And what would you recommend, maybe 128 GB?

Thx a lot, Carsten