
Need help finding source of latency

Hello,

I am looking for help trying to track down the cause of some disk latency.

Our current setup:
PetaSAN 2.8
10 nodes, 24 SSDs per node
40 cores per node
128GB of RAM per node
Bonded 10Gbit using LACP

Ceph and iSCSI traffic using bonded interfaces

iSCSI disks are presented to VMware ESXi 6.7 U3 hosts

We get random bursts of a few seconds of latency. Any help would be appreciated.

1 What model of SSDs do you use?

2 What controller do you use? Do you have any RAID setup?

3 Does this happen for all OSDs or just some of them?

4 Does this happen during specific workloads like backups, or is it constant or periodic?

5 Are you using workloads other than iSCSI?

6 Are you using any PetaSAN write cache?

7 Can you disable the volatile write cache on the drives using hdparm or nvme commands?

8 Can you temporarily switch off scrub and deep-scrub from the maintenance tab and see if it affects this?

9 Run

iostat -dxt 2  > iostat.log

and see if you get high w_await; if so, also look at %util, wareq-sz and wkB/s (a small filtering sketch follows after this list)

10 Do you see any errors or latency warnings in the OSD logs in /var/log/ceph?

11 Other useful info can be obtained from:
ceph daemon osd.X perf dump (look at the latency values)
ceph daemon osd.X dump_historic_ops (within 10 min of a high value in the chart)
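
For point 9, a quick way to pull the samples with elevated w_await out of iostat.log is something like the awk one-liner below; note the column index depends on your sysstat version (w_await is column 11 in recent releases), so check it against the header line first:

# Print iostat samples where w_await exceeds 20 ms (adjust the threshold to taste).
# w_await is field 11 in recent sysstat output; verify against your iostat header.
awk '$1 ~ /^(sd|nvme)/ && $11+0 > 20 {print}' iostat.log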
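
For point 11, here is a rough sketch that pulls the main latency counters from every OSD on a node in one pass. It assumes the default admin socket location under /var/run/ceph and that jq is installed; the exact counter names can vary a little between Ceph releases:

#!/bin/bash
# Dump the main latency counters for every local OSD via its admin socket.
# Assumes default sockets in /var/run/ceph and that jq is installed.
for SOCK in /var/run/ceph/ceph-osd.*.asok
do
    [ -S "$SOCK" ] || continue
    OSD=$(basename "$SOCK" .asok)
    echo "== $OSD =="
    # Keys that a given release does not have simply print as null.
    ceph daemon "$SOCK" perf dump | \
        jq '{op_r_latency: .osd.op_r_latency, op_w_latency: .osd.op_w_latency, bluestore_commit_lat: .bluestore.commit_lat}'
done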

  1. Crucial MX500 1TB SSD
  2. AOC-S3008-L8i HBA (no RAID)
  3. It seems random
  4. Well, we do hourly backups on most of our VMs (50 in total sitting on the storage cluster)
  5. No, only iSCSI
  6. No write cache
  7. Perhaps?
  8. Yes, I can try that
  9. Highest w_await was 12.77 (most were between 0 and 5), highest %util was 51.8 (most were between 1 and 7), highest wareq-sz was 77 (most were between 0 and 7, however I did see a few 10, 13, 25), highest wkB/s was 3845 (most were between
  10. Yes, some OSDs but not all (/var/lib/ceph/osd/ceph-20): log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.742069367s

11. Here is the perf dump from the same OSD (osd.20):

"recoverystate_perf": {
"initial_latency": {
"avgcount": 187,
"sum": 0.038535012,
"avgtime": 0.000206069
},
"started_latency": {
"avgcount": 509,
"sum": 10300276.394202555,
"avgtime": 20236.299399219
},
"reset_latency": {
"avgcount": 696,
"sum": 0.091873184,
"avgtime": 0.000132001
},
"start_latency": {
"avgcount": 696,
"sum": 0.038111891,
"avgtime": 0.000054758
},
"primary_latency": {
"avgcount": 43,
"sum": 606856.540345384,
"avgtime": 14112.942798729
},
"peering_latency": {
"avgcount": 73,
"sum": 70.994576213,
"avgtime": 0.972528441
},
"backfilling_latency": {
"avgcount": 14,
"sum": 3777.554051844,
"avgtime": 269.825289417
},
"waitremotebackfillreserved_latency": {
"avgcount": 14,
"sum": 667.901702636,
"avgtime": 47.707264474
},
"waitlocalbackfillreserved_latency": {
"avgcount": 16,
"sum": 303.140102551,
"avgtime": 18.946256409
},
"notbackfilling_latency": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
},
"repnotrecovering_latency": {
"avgcount": 515,
"sum": 8863758.466705283,
"avgtime": 17211.181488748
},
"repwaitrecoveryreserved_latency": {
"avgcount": 1,
"sum": 0.000065646,
"avgtime": 0.000065646
},
"repwaitbackfillreserved_latency": {
"avgcount": 206,
"sum": 541972.319619940,
"avgtime": 2630.933590388
},
"reprecovering_latency": {
"avgcount": 145,
"sum": 48254.181497900,
"avgtime": 332.787458606
},
"activating_latency": {
"avgcount": 60,
"sum": 8.413385900,
"avgtime": 0.140223098
},
"waitlocalrecoveryreserved_latency": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
},
"waitremoterecoveryreserved_latency": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
},
"recovering_latency": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
},
"recovered_latency": {
"avgcount": 52,
"sum": 0.002363470,
"avgtime": 0.000045451
},
"clean_latency": {
"avgcount": 22,
"sum": 602058.899521991,
"avgtime": 27366.313614635
},
"active_latency": {
"avgcount": 30,
"sum": 606812.772259577,
"avgtime": 20227.092408652
},
"replicaactive_latency": {
"avgcount": 266,
"sum": 9453984.952967389,
"avgtime": 35541.296815666
},
"stray_latency": {
"avgcount": 623,
"sum": 239636.757881106,
"avgtime": 384.649691622
},
"getinfo_latency": {
"avgcount": 73,
"sum": 0.889038771,
"avgtime": 0.012178613
},
"getlog_latency": {
"avgcount": 73,
"sum": 0.167979837,
"avgtime": 0.002301093
},
"waitactingchange_latency": {
"avgcount": 5,
"sum": 4.828695894,
"avgtime": 0.965739178
},
"incomplete_latency": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
},
"down_latency": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
},
"getmissing_latency": {
"avgcount": 68,
"sum": 0.002043543,
"avgtime": 0.000030052
},
"waitupthru_latency": {
"avgcount": 60,
"sum": 69.932839343,
"avgtime": 1.165547322
},
"notrecovering_latency": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000

 

11b. My graph gets cut off, so it's hard to pinpoint which OSD is showing up.
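
If the chart is not usable, one way to pinpoint the affected OSDs is to count the slow-operation warnings directly in the logs on each node (this assumes the default /var/log/ceph/ceph-osd.*.log layout):

# Count BlueStore slow-operation warnings per OSD log, highest counts first.
grep -c "slow operation observed" /var/log/ceph/ceph-osd.*.log | sort -t: -k2 -rn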

1)
The MX500 is not highly rated for Ceph; it is a consumer model SSD.

7) You could use:

#! /bin/bash

# Disable the volatile write cache on all SATA/SAS (sd*) drives
for DEVICE_PATH in $(find /sys/block/* | grep -E '\/(sd)' )
do
DEVICE=${DEVICE_PATH##*/}
/sbin/hdparm -W 0 /dev/$DEVICE > /dev/null 2>&1
done

# Disable the volatile write cache (feature 0x6) on all NVMe drives
for DEVICE_PATH in $(find /sys/block/* | grep -E '\/(nvme)' )
do
DEVICE=${DEVICE_PATH##*/}
nvme set-feature /dev/$DEVICE -f 0x6 -v 0 > /dev/null 2>&1
done
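
To confirm the volatile cache really ended up disabled, something along these lines should report the current setting per drive (same device naming assumptions as above):

#!/bin/bash
# Report the current volatile write cache state for SATA/SAS and NVMe drives.
for DEVICE_PATH in /sys/block/sd*
do
    [ -e "$DEVICE_PATH" ] || continue
    DEVICE=${DEVICE_PATH##*/}
    echo "== /dev/$DEVICE =="
    /sbin/hdparm -W /dev/$DEVICE          # should show "write-caching = 0 (off)"
done
for DEVICE_PATH in /sys/block/nvme*n*
do
    [ -e "$DEVICE_PATH" ] || continue
    DEVICE=${DEVICE_PATH##*/}
    echo "== /dev/$DEVICE =="
    nvme get-feature /dev/$DEVICE -f 0x6  # feature 0x6 = Volatile Write Cache, 0 = disabled
done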

10)
log_latency_fn slow operation observed for _txc_committed_kv, latency = 5.742069367s

This is not good, can you try the following and see if it helps:

ceph config set osd.* bluefs_buffered_io true
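
To check what value is actually in effect before and after, you could read it back, e.g. with osd.20 as the example (the daemon command must run on the node hosting that OSD):

# Value stored in the cluster configuration database:
ceph config get osd bluefs_buffered_io
# Value a running OSD actually has in effect:
ceph daemon osd.20 config get bluefs_buffered_io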

Also, do you see any kernel dmesg messages relating to the drives at the time of the latency spikes?
In general for VMware backups, it is recommended to:
use thick provisioned / eager zeroed VMDK disks
set the MaxIoSize param to 512k as per our guide
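
Assuming the guide refers to the standard Disk.DiskMaxIOSize advanced setting, it can also be set and verified from the ESXi shell (the value is in KB, so 512 means 512 KB); please double-check against the guide:

# Set Disk.DiskMaxIOSize to 512 KB on the ESXi host, then read it back:
esxcli system settings advanced set -o /Disk/DiskMaxIOSize --int-value 512
esxcli system settings advanced list -o /Disk/DiskMaxIOSize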

The MX500 is not highly rated for Ceph; it is a consumer model SSD.

I'm getting that impression now, yes. Would a Samsung Evo 870 Pro work better?

 

#! /bin/bash

for DEVICE_PATH in $(find /sys/block/* | grep -E '\/(sd)' )
do
DEVICE=${DEVICE_PATH##*/}
/sbin/hdparm -W 0 /dev/$DEVICE > /dev/null 2>&1
done

for DEVICE_PATH in $(find /sys/block/* | grep -E '\/(nvme)' )
do
DEVICE=${DEVICE_PATH##*/}
nvme set-feature /dev/$DEVICE -f 0x6 -v 0 > /dev/null 2>&1
done

I ran this on all the nodes; I'll report back on whether disabling the drive cache helped or fixed the issue.

 

ceph config set osd.* bluefs_buffered_io true

bluefs_buffered_io is already set to true in my Ceph config by default, it seems.

I did follow the guide and configured the iSCSI software interfaces per spec.

 

Thank you for all the help so far. I really appreciate it.

No, the Evo 870 Pro will not be better; it will probably be worse than the MX500. There are many good drives recommended for Ceph, but they are not consumer-grade drives. You can search for recommended drives yourself, or I can recommend some if you wish.

I recommend you try the different points in the previous posts and see if any of them has an effect.

Can you also check your memory % utilization and make sure you have no memory issues?
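
For the memory check, something like the following shows the overall usage and the per-OSD memory target. Note that osd_memory_target typically defaults to 4 GiB, so 24 OSDs per node would by themselves account for roughly 96 GiB of the 128 GB, meaning a fairly high percentage is not necessarily a problem on its own:

# Overall memory usage on the node:
free -h
# Per-OSD memory target actually in effect (run on the node hosting osd.20):
ceph daemon osd.20 config get osd_memory_target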

No consumer drives, got it. So basically enterprise-grade or data-center-grade drives only.

Right, so about the previous tasks you asked me to check:

  • RAM utilization is 79% for all nodes.
  • The script to turn off the write cache on the drives did help quite a bit; the latency has been reduced significantly.
  • Setting bluefs_buffered_io to true had no effect; I think it was already enabled by default.

 

 

Good, things are better. I would also try disabling scrub as pointed out above, just to rule out issues relating to scrub.
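
If the maintenance tab is not handy, I believe the equivalent test from the CLI is to set the cluster scrub flags and then unset them afterwards:

# Temporarily stop all scrubbing cluster-wide:
ceph osd set noscrub
ceph osd set nodeep-scrub
# ...watch the latency charts for a while, then re-enable:
ceph osd unset noscrub
ceph osd unset nodeep-scrub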