
mds failing to respond to cache pressure

Hi everyone,
This is my first post on the forum, so first of all, I want to thank you for all the work you do. I am able to fully utilize the performance of iSCSI. The difference is striking compared to a pure Ceph deployment.

I am reaching out for your help regarding recurring MDS alerts of the type "failing to respond to cache pressure", always associated with the same ID. These alerts are related to Veeam and, more specifically, a backup set from an email server using an NFS export as the storage destination. I have other NFS exports on the cluster, but only this one generates alerts several minutes after the completion of the backup set in question.

I have tried reducing the mds_recall_max_caps parameter to 20000, as well as increasing mds_cache_memory_limit to 17 GB, but this has not helped. I am forced to run the following command to clear the alert:
ceph tell mds.nodexx client evict id=9489331.
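For reference, here is a minimal sketch of how those two settings can be applied at the Ceph level with ceph config set; mds.nodexx is the same placeholder as above, and the byte value for the cache limit is only approximate:

ceph config set mds mds_recall_max_caps 20000
ceph config set mds mds_cache_memory_limit 17179869184   # roughly 17 GB, in bytes
# Check what the active MDS actually picked up:
ceph config show mds.nodexx | grep -E 'mds_recall_max_caps|mds_cache_memory_limit'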

I would like to clarify that this is a new deployment and that the current version is PetaSAN 3.3.0. I have 4 nodes with identical hardware, and the network is 10 Gb SFP. In any case, I remain at your disposal to provide further information if needed.

I have the same problem (10-node cluster, mostly used for S3 object storage, but I also use NFS for 4 clients to take backups). Two nodes are NFS servers, and I can't resolve this problem with any configuration, so every day I need to migrate the NFS servers from one node to another to clear the error.

 

Hi f.cuseo,

Thanks for your feedback. I think both of our methods are only workarounds that remove the alert and in no way resolve the root cause. It would be appreciated if someone from PetaSAN could tell us the procedure for permanently remediating this.

I want to thank you for all the work you do. I am able to fully utilize the performance of iSCSI. The difference is striking compared to a pure Ceph deployment.

Nice to hear, thank you 🙂

Can you give more detail on the NFS clients: which ones work and which give you the warning?

Hi everyone

I would also like to say thanks for the hard work done so far by the PetaSAN team.

I came across the same issue mentioned in this topic and spent countless hours troubleshooting, trying to get my head around whether it is client, config, or bug related, but no luck. Just to clarify, I'm on the latest available version, 3.3.0, and I'm using it purely for NFS and k8s. I have a total of 6 NFS clients connected, 4 on one IP set and 2 on another IP set. Strangely enough, I always get the alerts for the exact same NFS IP and its 4 clients; the other 2 are unaffected by the cache pressure warning. There is plenty of space left and no I/O issue. Whatever custom Ceph config parameters I add, for example increasing mds_cache_memory_limit and raising the limits, I can't resolve this with any configuration. I'm basically stuck in this loop of workarounds: either moving the affected NFS IP to a different node or running the ceph tell mds.nodexx client evict id=xxxxxxx command on a daily basis to clear the warning message.
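A less disruptive option that might be worth testing before a full evict, assuming the cache drop admin command is available on this Ceph release, is asking the MDS itself to trim its cache and recall client caps:

ceph tell mds.nodexx cache drop 60   # 60 s timeout; mds.nodexx is a placeholder for the active MDS
ceph health detail                   # then check whether the warning has cleared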

I would appreciate it if someone from the PetaSAN tech team could share some info or guidance in case they have come across a similar issue.

Quote from admin on January 2, 2025, 7:45 pm

I want to thank you for all the work you do. I am able to fully utilize the performance of iSCSI. The difference is striking compared to a pure Ceph deployment.

Nice to hear, thank you 🙂

Can you give more detail on the NFS clients: which ones work and which give you the warning?

Hi,

I apologize for my late reply. Since I wasn't sure exactly what information to provide to clarify our situation, I took some screenshots of the NFS service configuration and I hope they will help. If not, please feel free to tell me exactly what information you need.

In summary, I have four NFS exports, and all of them are used by Veeam as backup repositories. The only difference is that the export vbr-003 is used by Veeam to back up files from a mail server, while the other exports are used for VM backups at the hypervisor level, if I can express it that way.

The issue I have is with the vbr-003 export. The backup itself is not a problem and completes successfully. However, a few minutes later, a warning appears on the PetaSAN dashboard and in the ceph -s output, saying "mds failing to respond to cache pressure". In my case, it's always the same client ID, "9489331". When I run the command ceph tell mds.orn-node-stor1 client evict id=9489331, the vbr-003 repo on the Veeam side becomes temporarily inaccessible, which I imagine is normal since I interrupted the session. Please find the link to the screenshots below.

https://we.tl/t-oO11mF6uMz
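Side note on the temporary inaccessibility: as far as I understand, evicting a CephFS client also adds its address to the OSD blocklist until it reconnects, so that part is expected. The blocklist can be inspected, and an entry removed early if really needed (the address below is a placeholder taken from the ls output; the docs advise letting the client remount cleanly rather than just un-blocklisting it):

ceph osd blocklist ls
ceph osd blocklist rm <client_addr:port/nonce>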

 

root@node-stor1:~# ceph -s
  cluster:
    id:     695f1f6e-97fb-488c-b895-eb2665af99c9
    health: HEALTH_WARN
            1 clients failing to respond to cache pressure

  services:
    mon: 3 daemons, quorum node-stor3,node-stor1,node-stor2 (age 10d)
    mgr: node-stor1(active, since 4w), standbys: node-stor2, node-stor3
    mds: 1/1 daemons up, 2 standby
    osd: 24 osds: 24 up (since 19h), 24 in (since 4w)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   10 pools, 193 pgs
    objects: 3.20M objects, 12 TiB
    usage:   38 TiB used, 313 TiB / 351 TiB avail
    pgs:     192 active+clean
             1 active+clean+scrubbing

  io:
    client: 11 MiB/s rd, 1.6 MiB/s wr, 144 op/s rd, 67 op/s wr

root@node-stor1:~# ceph health detail
HEALTH_WARN 1 clients failing to respond to cache pressure
[WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
mds.orn-node-stor1(mds.0): Client NFS-XXX-XXX-XXX-103 failing to respond to cache pressure client_id: 9489331

 

root@node-stor1:~# ceph fs status
cephfs - 62 clients
======
RANK  STATE   MDS         ACTIVITY    DNS    INOS   DIRS  CAPS
 0    active  node-stor1  Reqs: 0 /s  34.0k  33.5k  2112  33.7k
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   667M  85.0T
  cephfs_data      data    10.6T  85.0T
STANDBY MDS
 node-stor2
 node-stor3
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
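If it helps, I can also pull the cap count for that client from the MDS session list, along these lines (assuming jq is installed; otherwise the raw JSON can be read directly):

ceph tell mds.orn-node-stor1 session ls > sessions.json
# Pick out the session for client 9489331 and its cap count:
jq '.[] | select(.id == 9489331) | {id, num_caps, client_metadata}' sessions.json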


Hi,

I know you have a lot on your plate, but is there any chance of getting help?

If we subscribe to the support service, will this type of intervention be covered, or does the service only concern PetaSAN components and not Ceph?

Best regards,