
Datastore often gets disconnected from VMware


Hello,

Our datastores worked properly for a long time.

Recently, one of our Ceph pools has been getting disconnected randomly.

dmesg:

[Mon Jun 27 04:29:08 2022] Unable to locate Target Portal Group on iqn.2016-05.com.petahdd:00005
[Mon Jun 27 04:29:08 2022] iSCSI Login negotiation failed.
[Mon Jun 27 04:29:34 2022] Unable to locate Target Portal Group on iqn.2016-05.com.petahdd:00006
[Mon Jun 27 04:29:34 2022] iSCSI Login negotiation failed.
[Mon Jun 27 04:29:35 2022] Alternate GPT is invalid, using primary GPT.
[Mon Jun 27 04:29:35 2022] rbd0: p1 p2
[Mon Jun 27 04:29:35 2022] rbd: rbd0: capacity 109951162777600 features 0x1
[Mon Jun 27 04:30:02 2022] Alternate GPT is invalid, using primary GPT.
[Mon Jun 27 04:30:02 2022] rbd1: p1 p2
[Mon Jun 27 04:30:02 2022] rbd: rbd1: capacity 54975581388800 features 0x1
[Mon Jun 27 04:30:14 2022] Unable to locate Target Portal Group on iqn.2016-05.com.petahdd:00005
[Mon Jun 27 04:30:14 2022] iSCSI Login negotiation failed.
[Mon Jun 27 04:31:03 2022] Alternate GPT is invalid, using primary GPT.
[Mon Jun 27 04:31:03 2022] rbd0: p1 p2
[Mon Jun 27 04:31:03 2022] rbd: rbd0: capacity 109951162777600 features 0x1
[Mon Jun 27 10:54:32 2022] Unable to locate Target Portal Group on iqn.2016-05.com.petahdd:00005
[Mon Jun 27 10:54:32 2022] iSCSI Login negotiation failed.
[Mon Jun 27 10:54:42 2022] Unable to locate Target Portal Group on iqn.2016-05.com.petahdd:00006
[Mon Jun 27 10:54:42 2022] iSCSI Login negotiation failed.
[Mon Jun 27 12:45:55 2022] libceph: mon1 (1)10.74.1.212:6789 session established
[Mon Jun 27 12:45:55 2022] libceph: mon1 (1)10.74.1.212:6789 socket closed (con state OPEN)
[Mon Jun 27 12:45:55 2022] libceph: mon1 (1)10.74.1.212:6789 session lost, hunting for new mon
[Mon Jun 27 12:45:55 2022] libceph: mon0 (1)10.74.1.211:6789 session established
[Mon Jun 27 12:45:55 2022] libceph: client62284382 fsid 5e2d37a0-a9ca-4ebf-814d-892d1da42032
[Mon Jun 27 12:45:55 2022] Alternate GPT is invalid, using primary GPT.
[Mon Jun 27 12:45:55 2022] rbd0: p1
[Mon Jun 27 12:45:55 2022] rbd: rbd0: capacity 32985348833280 features 0x1
[Mon Jun 27 13:08:18 2022] ABORT_TASK: Found referenced iSCSI task_tag: 9604439
[Mon Jun 27 13:08:18 2022] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 9604439
[Mon Jun 27 13:41:38 2022] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1998-01.com.vmware:host62-602ede00,i,0x00023d000003,iqn.2016-05.com.petahdd:00001,t,0x03
[Mon Jun 27 13:41:44 2022] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1998-01.com.vmware:host62-602ede00,i,0x00023d000004,iqn.2016-05.com.petahdd:00001,t,0x04

In the OSD logs, we found some latency problems:

/var/log/ceph/ceph-osd.50.log.2.gz:2022-02-04T01:05:17.462+0100 7fd82c3f4700 0 bluestore(/var/lib/ceph/osd/ceph-50) log_latency slow operation observed for kv_final, latency = 7.279872027s
/var/log/ceph/ceph-osd.50.log.2.gz:2022-02-04T01:05:17.462+0100 7fd82c3f4700 0 bluestore(/var/lib/ceph/osd/ceph-50) log_latency_fn slow operation observed for _txc_committed_kv, latency = 7.262747658s, txc = 0x563061cb0900
/var/log/ceph/ceph-osd.51.log.2.gz:2022-02-04T01:05:57.083+0100 7ff9b249d700 0 bluestore(/var/lib/ceph/osd/ceph-51) log_latency slow operation observed for kv_final, latency = 6.875388465s
/var/log/ceph/ceph-osd.51.log.2.gz:2022-02-04T01:05:57.083+0100 7ff9b249d700 0 bluestore(/var/lib/ceph/osd/ceph-51) log_latency_fn slow operation observed for _txc_committed_kv, latency = 6.857830080s, txc = 0x55750704f200
/var/log/ceph/ceph-osd.53.log.1.gz:2022-06-14T23:01:01.034+0200 7f5cda7cf700 0 bluestore(/var/lib/ceph/osd/ceph-53) log_latency_fn slow operation observed for _do_read, latency = 5.751492967s, num_ios = -3
/var/log/ceph/ceph-osd.53.log.1.gz:2022-06-14T23:01:01.034+0200 7f5cda7cf700 0 bluestore(/var/lib/ceph/osd/ceph-53) log_latency slow operation observed for read, latency = 5.751601209s
/var/log/ceph/ceph-osd.53.log.1.gz:2022-06-14T23:01:01.090+0200 7f5cd97cd700 0 bluestore(/var/lib/ceph/osd/ceph-53) log_latency_fn slow operation observed for _do_read, latency = 5.855374173s, num_ios = -4
/var/log/ceph/ceph-osd.53.log.1.gz:2022-06-14T23:01:01.090+0200 7f5cd97cd700 0 bluestore(/var/lib/ceph/osd/ceph-53) log_latency slow operation observed for read, latency = 5.855564817s

The VMware events show connectivity problems:

Access to volume XX was lost due to connectivity problems. Attempt for recovery ongoing. The result will be informed shortly.

We can rule out the switching devices, because they are working fine with other pools and datastores.

VMware only has problems with that pool, so we don't think it is a hypervisor issue.

Where can we look for more information to solve this? Any ideas?

Thanks,

 

can you describe the hardware setup: cpu, ram, num of OSDs, type of OSDs

can you describe how often/when you see this issue, all the time ?

my first suspect is load, can you look at charts for % disk util, % ram, % cpu

is cluster health ok? are the PG status and OSD status charts ok?
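
For example, something like the following on any node (a sketch, assuming the standard ceph CLI and the sysstat package are installed):

# overall cluster, PG and OSD status
ceph -s
ceph health detail
ceph osd df tree

# per-disk utilization, refreshed every 2 seconds
iostat -x 2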

on the node showing the dmesg error, list the LIO config using

targetcli-fb ls

does it have  iqn.2016-05.com.petahdd:00006 configured ?
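
For example (a sketch, using the same command as above; if the binary on your node is named targetcli rather than targetcli-fb, adjust accordingly):

# list all iSCSI target IQNs currently configured in LIO on this node
targetcli-fb ls | grep petahdd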

We have 5x physical servers with 12x mixed 10TB/16TB HDD each.

It has been failing for about 2 months. At first it happened once a week, but over the last 2 weeks at least one VMware host has failed every day.

Ceph health isn't OK due to "6 pgs not deep-scrubbed in time".
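
For reference, the affected PGs can be listed and deep-scrubbed manually. A sketch (the PG id below is just an example; use the ids reported by health detail):

# list the PGs flagged as not deep-scrubbed in time
ceph health detail | grep 'not deep-scrubbed'

# manually trigger a deep scrub of one of them, e.g. PG 2.1f
ceph pg deep-scrub 2.1f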

[CPU and RAM usage charts attached]

o- iscsi ............................................................................................. [Targets: 2]
| o- iqn.2016-05.com.petahdd:00001 ...................................................................... [TPGs: 6]
| | o- tpg1 ............................................................................................ [disabled]
| | | o- acls ........................................................................................... [ACLs: 0]
| | | o- luns ........................................................................................... [LUNs: 1]
| | | | o- lun0 ................................... [rbd/image-00001 (/dev/rbd/rbd/image-00001) (default_tg_pt_gp)]
| | | o- portals ..................................................................................... [Portals: 1]
| | | o- 192.168.74.211:3260 ............................................................................... [OK]
| | o- tpg2 ............................................................................................ [disabled]
| | | o- acls ........................................................................................... [ACLs: 0]
| | | o- luns ........................................................................................... [LUNs: 1]
| | | | o- lun0 ................................... [rbd/image-00001 (/dev/rbd/rbd/image-00001) (default_tg_pt_gp)]
| | | o- portals ..................................................................................... [Portals: 1]
| | | o- 192.168.75.211:3260 ............................................................................... [OK]

That occurs when we back up our VMs, but not always. We also have some VMs stored in the HDD pool that become read-only for no apparent reason.

Thanks a lot.

Digging into logs:

2022-06-29T05:16:08.360+0200 7ff7389cb700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7ff7389cb700' had timed out after 0
2022-06-29T05:30:15.793+0200 7f3ecc1a0700 0 client.0 ms_handle_reset on v2:10.74.1.213:6898/1182288
2022-06-29T05:32:18.365+0200 7ff7389cb700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7ff7389cb700' had timed out after 0
2022-06-29T05:32:27.365+0200 7ff7379c9700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7ff7379c9700' had timed out after 0
2022-06-29T05:45:15.797+0200 7f3ecc1a0700 0 client.0 ms_handle_reset on v2:10.74.1.213:6898/1182288
2022-06-29T06:00:15.802+0200 7f3ecc1a0700 0 client.0 ms_handle_reset on v2:10.74.1.213:6898/1182288
2022-06-29T06:12:10.358+0200 7ff7381ca700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7ff7381ca700' had timed out after 0
2022-06-29T06:15:15.806+0200 7f3ecc1a0700 0 client.0 ms_handle_reset on v2:10.74.1.213:6898/1182288
2022-06-29T06:27:03.359+0200 7ff7379c9700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7ff7379c9700' had timed out after 0
2022-06-29T06:27:51.359+0200 7ff7389cb700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7ff7389cb700' had timed out after 0
2022-06-29T06:27:55.359+0200 7ff7389cb700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7ff7389cb700' had timed out after 0
2022-06-29T06:30:15.811+0200 7f3ecc1a0700 0 client.0 ms_handle_reset on v2:10.74.1.213:6898/1182288
2022-06-29T06:36:36.359+0200 7ff7381ca700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7ff7381ca700' had timed out after 0
2022-06-29T06:45:15.815+0200 7f3ecc1a0700 0 client.0 ms_handle_reset on v2:10.74.1.213:6898/1182288
2022-06-29T06:47:18.359+0200 7ff7371c8700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7ff7371c8700' had timed out after 0
2022-06-29T06:48:50.359+0200 7ff7381ca700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7ff7381ca700' had timed out after 0
2022-06-29T07:00:15.816+0200 7f3ecc1a0700 0 client.0 ms_handle_reset on v2:10.74.1.213:6898/1182288
[...]
2022-06-29T08:41:40.124+0200 7f3ecc1a0700 0 client.0 ms_handle_reset on v2:10.74.1.213:6898/1182288
2022-06-29T08:41:50.044+0200 7ff7391cc700 1 mon.CEPH-11@1(peon).paxos(paxos active c 66041364..66041998) lease_timeout -- calling new election
2022-06-29T08:41:51.900+0200 7ff7359c5700 1 mon.CEPH-11@1(electing) e5 handle_auth_request failed to assign global_id
[...]
2022-06-29T08:41:54.924+0200 7ff7359c5700 1 mon.CEPH-11@1(electing) e5 handle_auth_request failed to assign global_id
2022-06-29T08:41:55.048+0200 7ff7391cc700 1 mon.CEPH-11@1(electing) e5 collect_metadata : no unique device id for : fallback method has no model nor serial'
2022-06-29T08:41:55.048+0200 7ff7391cc700 1 paxos.1).electionLogic(583) init, last seen epoch 583, mid-election, bumping

I will continue looking for clues.
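
For reference, one way to correlate these monitor messages with OSD-side slow ops around the same timestamps (a sketch, assuming the default log locations under /var/log/ceph; the time window is just an example):

# search all OSD logs, including rotated/compressed ones, for slow ops around 05:30 on June 29
zgrep 'log_latency' /var/log/ceph/ceph-osd.*.log* | grep '2022-06-29T05:3'

# kernel-side iSCSI/rbd/libceph messages with human-readable timestamps
dmesg -T | grep -Ei 'iscsi|rbd|libceph'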

 

what about disk % util during vm backups ?

what is your current scrub speed ?
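
For example, the scrub throttling currently in effect can be checked with something like (a sketch, assuming a recent Ceph release with the centralized config database):

# scrub settings as stored in the monitor config database
ceph config get osd osd_max_scrubs
ceph config get osd osd_scrub_sleep
ceph config get osd osd_scrub_load_threshold

# or ask a running OSD what it is actually using (run on the node hosting osd.50)
ceph daemon osd.50 config show | grep scrub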

the dmesg error about iqn.2016-05.com.petahdd:00006 not being found is a little strange; some client thinks the current node has this disk. Maybe you changed the disks' IPs, or detached/attached disks without re-scanning some clients. It could also happen if you had a previous cluster installation with the same IP settings and re-installed, but still have an old ESXi trying to connect to that disk. Do you currently have a disk 00006?
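
For example, from the ESXi shell you can see which sessions and target portals a host still holds, and force a rescan after any change on the target side (a sketch; vmhba64 is an assumption for the software iSCSI adapter name, substitute your own):

# iSCSI sessions and discovered target portals on the host
esxcli iscsi session list --adapter=vmhba64
esxcli iscsi adapter target portal list --adapter=vmhba64

# rescan the adapter so stale targets/portals are dropped or re-discovered
esxcli storage core adapter rescan --adapter=vmhba64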

All nodes have similar usage.

Our scrub speed is set to "very slow".

This week it has become extremely unstable and all VMware hosts have problems keeping the datastore online.

In the VMware vmkernel.log:

2022-06-30T07:12:26.236Z cpu60:2098685)NMP: nmp_ThrottleLogForDevice:3818: last error status from device naa.60014055e2d37a000010000000000000 repeated 234 times
2022-06-30T07:12:26.236Z cpu60:2098685)NMP: nmp_ThrottleLogForDevice:3863: Cmd 0x2a (0x45bae661e5c0, 2097225) to dev "naa.60014055e2d37a000010000000000000" on path "vmhba64:C0:T2:L0" Failed:
2022-06-30T07:12:26.236Z cpu60:2098685)NMP: nmp_ThrottleLogForDevice:3872: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL. cmdId.initiator=0x43070bd0ef00 CmdSN 0x2f840
2022-06-30T07:12:26.236Z cpu60:2098685)ScsiDeviceIO: 3448: Cmd(0x45bae661e5c0) 0x2a, cmdId.initiator=0x43070bd0ef00 CmdSN 0x2f840 from world 2097225 to dev "naa.60014055e2d37a000010000000000000" failed H:0x8 D:0x0 P:0x0 Aborted at driver layer
2022-06-30T07:12:26.240Z cpu33:2097285)ScsiDeviceIO: 3448: Cmd(0x45bae6765500) 0x28, cmdId.initiator=0x43070bd0ef00 CmdSN 0x2f842 from world 2100083 to dev "naa.60014055e2d37a000010000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-30T07:12:26.240Z cpu32:2097268)ScsiDeviceIO: 3448: Cmd(0x45bae67301c0) 0x28, cmdId.initiator=0x43070bd0ef00 CmdSN 0x2f841 from world 2100083 to dev "naa.60014055e2d37a000010000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-30T07:12:26.339Z cpu60:2098012)NMP: nmp_ThrottleLogForDevice:3863: Cmd 0x28 (0x45bae661e5c0, 0) to dev "naa.60014055e2d37a000010000000000000" on path "vmhba64:C0:T2:L0" Failed:
2022-06-30T07:12:26.339Z cpu60:2098012)NMP: nmp_ThrottleLogForDevice:3872: H:0x2 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL. cmdId.initiator=0x43070bd0ef00 CmdSN 0x2f843
2022-06-30T07:12:26.339Z cpu60:2098012)ScsiDeviceIO: 3463: Cmd(0x45bae661e5c0) 0x28, CmdSN 0x2f843 from world 0 to dev "naa.60014055e2d37a000010000000000000" failed H:0x2 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2022-06-30T07:12:26.476Z cpu60:2098012)NMP: nmp_ThrottleLogForDevice:3801: last error status from device naa.60014055e2d37a000010000000000000 repeated 10 times
2022-06-30T07:12:26.752Z cpu60:2098012)NMP: nmp_ThrottleLogForDevice:3801: last error status from device naa.60014055e2d37a000010000000000000 repeated 20 times
2022-06-30T07:12:26.970Z cpu60:2098012)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60014055e2d37a000010000000000000" state in doubt; requested fast path state update...
2022-06-30T07:12:27.241Z cpu61:2100840)HBX: 3041: 'HDD-HA': HB at offset 3211264 - Waiting for timed out HB:
2022-06-30T07:12:27.241Z cpu61:2100840) [HB state abcdef02 offset 3211264 gen 9 stampUS 63038296081 uuid 62bc56bf-8e6e362b-c780-1402ec93caa4 jrnl <FB 15> drv 24.82 lockImpl 4 ip 192.168.77.61]
2022-06-30T07:12:27.251Z cpu60:2098012)NMP: nmp_ThrottleLogForDevice:3801: last error status from device naa.60014055e2d37a000010000000000000 repeated 40 times
2022-06-30T07:12:27.977Z cpu60:2098012)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60014055e2d37a000010000000000000" state in doubt; requested fast path state update...
2022-06-30T07:12:28.281Z cpu60:2098012)NMP: nmp_ThrottleLogForDevice:3801: last error status from device naa.60014055e2d37a000010000000000000 repeated 80 times
2022-06-30T07:12:28.757Z cpu7:2098478)iscsi_vmk: iscsivmk_ConnNetRegister:2213: socket 0x43100f549460 network resource pool netsched.pools.persist.iscsi associated
2022-06-30T07:12:28.757Z cpu7:2098478)iscsi_vmk: iscsivmk_ConnNetRegister:2241: socket 0x43100f549460 network tracker id 770665162 tracker.iSCSI.192.168.74.219 associated
2022-06-30T07:12:28.983Z cpu60:2098012)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60014055e2d37a000010000000000000" state in doubt; requested fast path state update...
2022-06-30T07:12:29.012Z cpu7:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:880: vmhba64:CH:0 T:2 CN:0: iSCSI connection is being marked "ONLINE"
2022-06-30T07:12:29.012Z cpu7:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:881: Sess [ISID: 00023d000001 TARGET: iqn.2016-05.com.petahdd:00010 TPGT: 1 TSIH: 0]
2022-06-30T07:12:29.012Z cpu7:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:882: Conn [CID: 0 L: 192.168.74.61:30125 R: 192.168.74.219:3260]
2022-06-30T07:12:29.025Z cpu60:2098685)HBX: 5730: Reclaiming HB at 3211264 on vol 'HDD-HA' replayHostHB: 0 replayHostHBgen: 0 replayHostUUID: (00000000-00000000-0000-000000000000).
2022-06-30T07:12:29.026Z cpu60:2098685)HBX: 313: 'HDD-HA': HB at offset 3211264 - Reclaimed heartbeat [Timeout]:
2022-06-30T07:12:29.026Z cpu60:2098685) [HB state abcdef02 offset 3211264 gen 9 stampUS 63080045904 uuid 62bc56bf-8e6e362b-c780-1402ec93caa4 jrnl <FB 15> drv 24.82 lockImpl 4 ip 192.168.77.61]
2022-06-30T07:12:44.660Z cpu32:2127385)HBX: 3041: 'HDD-HA': HB at offset 3211264 - Waiting for timed out HB:
2022-06-30T07:12:44.660Z cpu32:2127385) [HB state abcdef02 offset 3211264 gen 9 stampUS 63080994382 uuid 62bc56bf-8e6e362b-c780-1402ec93caa4 jrnl <FB 15> drv 24.82 lockImpl 4 ip 192.168.77.61]
2022-06-30T07:12:47.975Z cpu1:2133379)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue:643: vmhba64:CH:0 T:2 L:0 : Task mgmt "Abort Task" with itt=0x366da (refITT=0x366d6) timed out.
2022-06-30T07:12:54.662Z cpu33:2127385)HBX: 3041: 'HDD-HA': HB at offset 3211264 - Waiting for timed out HB:
2022-06-30T07:12:54.662Z cpu33:2127385) [HB state abcdef02 offset 3211264 gen 9 stampUS 63080994382 uuid 62bc56bf-8e6e362b-c780-1402ec93caa4 jrnl <FB 15> drv 24.82 lockImpl 4 ip 192.168.77.61]
2022-06-30T07:12:57.975Z cpu1:2133379)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue:643: vmhba64:CH:0 T:2 L:0 : Task mgmt "Abort Task" with itt=0x366db (refITT=0x366d6) timed out.

To test other scenarios, we created another iSCSI disk, and it also fails once it is initialized by VMware.

On the other hand, other iSCSI disks initialized by Windows (Veeam) don't fail, and we have no problems with those.
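
In case it helps, this is how the failing LUN can be compared against one of the working (Windows-initialized) ones on an ESXi host. A sketch; the naa ID is the failing device from the vmkernel log above:

# NMP claim rule, path selection policy and device state for the failing LUN
esxcli storage nmp device list --device=naa.60014055e2d37a000010000000000000

# all paths the host currently sees for it
esxcli storage core path list --device=naa.60014055e2d37a000010000000000000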

Any idea?

Thanks again,

Can you please reply to the earlier disk 6 question

Is this issue happening all the time, or, as you stated earlier, during VM backups? Is it load related?

Do the HDDs have SSD journals?

Note that Windows has more timeout tolerance than VMware, so under load it is possible the latter will show more errors. Also, next time you perform backups, look at the disk charts at 1 or 3 hour resolution, since the week chart averages the data points out; better still, run atop, which takes a sample every 1 or 2 seconds.
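
For example (a sketch, assuming atop is installed on the OSD nodes; the 2-second interval and the output file name are just suggestions):

# sample CPU, memory and per-disk utilization every 2 seconds during the backup window
atop 2

# or record 1 hour of 2-second samples to a raw file and review it later with atop -r
atop -w /tmp/atop_backup.raw 2 1800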

Yes, we have a disk 00006:

iqn.2016-05.com.petahdd:00001
iqn.2016-05.com.petahdd:00005
iqn.2016-05.com.petahdd:00006
iqn.2016-05.com.petahdd:00008
iqn.2016-05.com.petahdd:00010

At first we suspected the Veeam backups.

Now we can confirm that it is not the problem, because we turned backups off to test and it still fails.

It seems that load is not the problem, because this datastore is only used for backups. Veeam is disabled, so there is no load at all, and it keeps failing. We suspect there is an option, protocol or checkbox in VMware that isn't configured properly and isn't working with that iSCSI disk (but works with the others).

The HDD pool doesn't have SSD journals.

 

We suspect there is an option, protocol or checkbox in VMware that isn't configured properly and isn't working with that iSCSI disk (but works with the others).

So this happens with 1 disk / datastore only, and the others are fine?

As per the earlier post, the "disk 6 not found" error is strange; it means an ESXi client is using an incorrect IP to reach the disk. Have you detached/re-attached disks, or maybe had an earlier installation with the same IPs that an ESXi is still trying to connect to? Are the ESXi hosts on the same subnet as the iSCSI paths, with no routing? HA failover relies on sending ARP messages to clients to refresh MAC addresses during IP failover, so it is also possible the messages did not make it and the ESXi is sending requests to the wrong host. Can you do a test on a test disk, move its paths to other nodes via the UI Path Assignment, and check that the ESXi hosts are able to reach it?
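
For example, from the ESXi shell (a sketch; vmk1 is a placeholder for the vmkernel port bound to the iSCSI network, and the portal IP is the one from the vmkernel log above):

# check that the iSCSI portal answers from the iSCSI vmkernel interface
vmkping -I vmk1 192.168.74.219

# inspect the host's ARP/neighbor cache for the iSCSI subnet after a path failover
esxcli network ip neighbor list | grep 192.168.74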

The disk load in the charts yesterday around noon was high. Note you are showing charts with a week timeframe, so the data is averaged every 10 minutes or more; it is possible the values have higher peaks for a few seconds, enough to stall IO.

It is always advisable to use journals with HDDs; the OSD requires a lot of metadata IOPS, and putting them on pure HDD is too heavy. We have a script to add journals to existing OSDs if you decide to.
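
For reference, whether an OSD already has a dedicated DB/WAL device can be checked from its metadata (a sketch; the exact field names vary a little between Ceph releases):

# osd.50 is one of the OSDs from the logs earlier in the thread
ceph osd metadata 50 | grep -E 'dedicated|rotational|devices'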

It occurs with disks from the same HDD pool that are initialized by VMware. Another disk on the same pool initialized by Windows works fine.

Tonight we changed the IQN names to test whether the issue was there. No change.

We also detached and re-attached the disk, and nothing changed; it keeps failing.

VMware hosts are in the same subnet and routing is not needed.

Paths migrate correctly and the VMware hosts keep detecting them as online.
