Failing VMware datastore through iSCSI
wailer
75 Posts
June 28, 2022, 12:04 pmQuote from wailer on June 28, 2022, 12:04 pmHello,
After several years of working slow but stable, our HDD petasan cluster has started to fail with almost any IO operation.
VMkernel.log from vmware shows messages like this:
2022-06-28T12:01:49.498Z cpu5:2098478)iscsi_vmk: iscsivmk_ConnNetRegister:2213: socket 0x4310491d0c20 network resource pool netsched.pools.persist.iscsi associated
2022-06-28T12:01:49.498Z cpu5:2098478)iscsi_vmk: iscsivmk_ConnNetRegister:2241: socket 0x4310491d0c20 network tracker id 284125898 tracker.iSCSI.192.168.74.211 associated
2022-06-28T12:01:49.751Z cpu0:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:880: vmhba64:CH:0 T:1 CN:0: iSCSI connection is being marked "ONLINE"
2022-06-28T12:01:49.751Z cpu0:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:881: Sess [ISID: 00023d000001 TARGET: iqn.2016-05.com.petahdd:00001 TPGT: 1 TSIH: 0]
2022-06-28T12:01:49.751Z cpu0:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:882: Conn [CID: 0 L: 192.168.74.62:24454 R: 192.168.74.211:3260]
2022-06-28T12:01:56.413Z cpu1:2097237)ScsiDeviceIO: 3501: Cmd(0x459ad7be7540) 0x2a, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d0 from world 2097225 to dev "naa.60014055e2d37a000001000000000000" failed H:0x5 D:0x0 P:0x0 Aborted at device layer. Cmd
2022-06-28T12:01:56.413Z cpu1:2097237)count Active:4 Queued:23
2022-06-28T12:01:58.798Z cpu32:2097271)ScsiDeviceIO: 3448: Cmd(0x45bae71e7900) 0x89, cmdId.initiator=0x43070bca4c40 CmdSN 0x824aa from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0 Aborted at device layer
2022-06-28T12:01:58.798Z cpu33:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2d16740 cmdId: 0x43070bca4c40, 0x824d2)
2022-06-28T12:01:58.798Z cpu35:2097268)ScsiDeviceIO: 3448: Cmd(0x45baf2d16740) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d2 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:58.798Z cpu36:2097267)ScsiDeviceIO: 3501: Cmd(0x459ad7bd8cc0) 0x2a, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d1 from world 2097225 to dev "naa.60014055e2d37a000001000000000000" failed H:0x5 D:0x0 P:0x0 Aborted at device layer. Cmd
2022-06-28T12:01:58.798Z cpu36:2097267)count Active:4 Queued:22
2022-06-28T12:01:59.000Z cpu45:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baeab0afc0 cmdId: 0x43070bca4c40, 0x824d3)
2022-06-28T12:01:59.000Z cpu39:2097280)ScsiDeviceIO: 3448: Cmd(0x45baeab0afc0) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d3 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:59.201Z cpu33:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45bacc715880 cmdId: 0x43070bca4c40, 0x824d4)
2022-06-28T12:01:59.201Z cpu35:2097268)ScsiDeviceIO: 3448: Cmd(0x45bacc715880) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d4 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:59.401Z cpu32:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2db6c40 cmdId: 0x43070bca4c40, 0x824d5)
2022-06-28T12:01:59.401Z cpu36:2097267)ScsiDeviceIO: 3448: Cmd(0x45baf2db6c40) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d5 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:59.602Z cpu32:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45bacc639380 cmdId: 0x43070bca4c40, 0x824d6)
2022-06-28T12:01:59.802Z cpu32:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2d0f300 cmdId: 0x43070bca4c40, 0x824d7)
2022-06-28T12:01:59.950Z cpu2:2097255)ScsiDeviceIO: 3501: Cmd(0x45baea014e00) 0x89, cmdId.initiator=0x43070bca4c40 CmdSN 0x824c4 from world 2135425 to dev "naa.60014055e2d37a000001000000000000" failed H:0x5 D:0x0 P:0x0 Aborted at device layer. Cmd
2022-06-28T12:01:59.950Z cpu2:2097255)count Active:4 Queued:21
2022-06-28T12:02:00.003Z cpu36:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2d0f300 cmdId: 0x43070bca4c40, 0x824d8)
2022-06-28T12:02:07.477Z cpu32:2130070)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue:643: vmhba64:CH:1 T:1 L:0 : Task mgmt "Abort Task" with itt=0xa4de0 (refITT=0xa4ddf) timed out.
We are on petasan 3.1.0 , it's a 5 node cluster x 12- 10TB HD's
Any hint?
Hello,
After several years of working slow but stable, our HDD petasan cluster has started to fail with almost any IO operation.
VMkernel.log from vmware shows messages like this:
2022-06-28T12:01:49.498Z cpu5:2098478)iscsi_vmk: iscsivmk_ConnNetRegister:2213: socket 0x4310491d0c20 network resource pool netsched.pools.persist.iscsi associated
2022-06-28T12:01:49.498Z cpu5:2098478)iscsi_vmk: iscsivmk_ConnNetRegister:2241: socket 0x4310491d0c20 network tracker id 284125898 tracker.iSCSI.192.168.74.211 associated
2022-06-28T12:01:49.751Z cpu0:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:880: vmhba64:CH:0 T:1 CN:0: iSCSI connection is being marked "ONLINE"
2022-06-28T12:01:49.751Z cpu0:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:881: Sess [ISID: 00023d000001 TARGET: iqn.2016-05.com.petahdd:00001 TPGT: 1 TSIH: 0]
2022-06-28T12:01:49.751Z cpu0:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:882: Conn [CID: 0 L: 192.168.74.62:24454 R: 192.168.74.211:3260]
2022-06-28T12:01:56.413Z cpu1:2097237)ScsiDeviceIO: 3501: Cmd(0x459ad7be7540) 0x2a, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d0 from world 2097225 to dev "naa.60014055e2d37a000001000000000000" failed H:0x5 D:0x0 P:0x0 Aborted at device layer. Cmd
2022-06-28T12:01:56.413Z cpu1:2097237)count Active:4 Queued:23
2022-06-28T12:01:58.798Z cpu32:2097271)ScsiDeviceIO: 3448: Cmd(0x45bae71e7900) 0x89, cmdId.initiator=0x43070bca4c40 CmdSN 0x824aa from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0 Aborted at device layer
2022-06-28T12:01:58.798Z cpu33:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2d16740 cmdId: 0x43070bca4c40, 0x824d2)
2022-06-28T12:01:58.798Z cpu35:2097268)ScsiDeviceIO: 3448: Cmd(0x45baf2d16740) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d2 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:58.798Z cpu36:2097267)ScsiDeviceIO: 3501: Cmd(0x459ad7bd8cc0) 0x2a, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d1 from world 2097225 to dev "naa.60014055e2d37a000001000000000000" failed H:0x5 D:0x0 P:0x0 Aborted at device layer. Cmd
2022-06-28T12:01:58.798Z cpu36:2097267)count Active:4 Queued:22
2022-06-28T12:01:59.000Z cpu45:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baeab0afc0 cmdId: 0x43070bca4c40, 0x824d3)
2022-06-28T12:01:59.000Z cpu39:2097280)ScsiDeviceIO: 3448: Cmd(0x45baeab0afc0) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d3 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:59.201Z cpu33:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45bacc715880 cmdId: 0x43070bca4c40, 0x824d4)
2022-06-28T12:01:59.201Z cpu35:2097268)ScsiDeviceIO: 3448: Cmd(0x45bacc715880) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d4 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:59.401Z cpu32:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2db6c40 cmdId: 0x43070bca4c40, 0x824d5)
2022-06-28T12:01:59.401Z cpu36:2097267)ScsiDeviceIO: 3448: Cmd(0x45baf2db6c40) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d5 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:59.602Z cpu32:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45bacc639380 cmdId: 0x43070bca4c40, 0x824d6)
2022-06-28T12:01:59.802Z cpu32:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2d0f300 cmdId: 0x43070bca4c40, 0x824d7)
2022-06-28T12:01:59.950Z cpu2:2097255)ScsiDeviceIO: 3501: Cmd(0x45baea014e00) 0x89, cmdId.initiator=0x43070bca4c40 CmdSN 0x824c4 from world 2135425 to dev "naa.60014055e2d37a000001000000000000" failed H:0x5 D:0x0 P:0x0 Aborted at device layer. Cmd
2022-06-28T12:01:59.950Z cpu2:2097255)count Active:4 Queued:21
2022-06-28T12:02:00.003Z cpu36:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2d0f300 cmdId: 0x43070bca4c40, 0x824d8)
2022-06-28T12:02:07.477Z cpu32:2130070)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue:643: vmhba64:CH:1 T:1 L:0 : Task mgmt "Abort Task" with itt=0xa4de0 (refITT=0xa4ddf) timed out.
We are on petasan 3.1.0 , it's a 5 node cluster x 12- 10TB HD's
Any hint?
Last edited on June 28, 2022, 12:05 pm by wailer · #1
admin
2,930 Posts
June 28, 2022, 1:55 pmQuote from admin on June 28, 2022, 1:55 pmCan you show screen shots of disk % utilisation charts for you nodes as well as the pg status chart for time of problem.
Can you show screen shots of disk % utilisation charts for you nodes as well as the pg status chart for time of problem.
wailer
75 Posts
June 28, 2022, 2:25 pmQuote from wailer on June 28, 2022, 2:25 pmsure, problem is happening all the time. I attach last hour utilization.
[url=https://postimg.cc/FYYpQ7Tx][img]https://i.postimg.cc/FYYpQ7Tx/image.png[/img][/url]
[url=https://postimg.cc/NywHPMSF][img]https://i.postimg.cc/NywHPMSF/image.png[/img][/url]
[url=https://postimg.cc/XGq726jX][img]https://i.postimg.cc/XGq726jX/image.png[/img][/url]
[url=https://postimg.cc/Yvt7nB8k][img]https://i.postimg.cc/Yvt7nB8k/image.png[/img][/url]
[url=https://postimg.cc/qzTHqHPn][img]https://i.postimg.cc/qzTHqHPn/image.png[/img][/url]
PG status
[url=https://postimg.cc/BLzDpg09][img]https://i.postimg.cc/BLzDpg09/image.png[/img][/url]
Throughput and IOPS
[url=https://postimg.cc/vxMyJRMW][img]https://i.postimg.cc/vxMyJRMW/image.png[/img][/url]
[url=https://postimg.cc/QV4s6ywW][img]https://i.postimg.cc/QV4s6ywW/image.png[/img][/url]
Also we have noticed VMware is detecting the datastore as Flash disk .. not sure if this really matters as it has been working well until now..
thank you,
sure, problem is happening all the time. I attach last hour utilization.
[url=https://postimg.cc/FYYpQ7Tx][img]https://i.postimg.cc/FYYpQ7Tx/image.png[/img][/url]
[url=https://postimg.cc/NywHPMSF][img]https://i.postimg.cc/NywHPMSF/image.png[/img][/url]
[url=https://postimg.cc/XGq726jX][img]https://i.postimg.cc/XGq726jX/image.png[/img][/url]
[url=https://postimg.cc/Yvt7nB8k][img]https://i.postimg.cc/Yvt7nB8k/image.png[/img][/url]
[url=https://postimg.cc/qzTHqHPn][img]https://i.postimg.cc/qzTHqHPn/image.png[/img][/url]
PG status
[url=https://postimg.cc/BLzDpg09][img]https://i.postimg.cc/BLzDpg09/image.png[/img][/url]
Throughput and IOPS
[url=https://postimg.cc/vxMyJRMW][img]https://i.postimg.cc/vxMyJRMW/image.png[/img][/url]
[url=https://postimg.cc/QV4s6ywW][img]https://i.postimg.cc/QV4s6ywW/image.png[/img][/url]
Also we have noticed VMware is detecting the datastore as Flash disk .. not sure if this really matters as it has been working well until now..
thank you,
admin
2,930 Posts
June 28, 2022, 8:18 pmQuote from admin on June 28, 2022, 8:18 pmcannot see something apparent. just on 4th node there is a disk sdf seems to go to 100% util at the end. but nothing obvious from charts, the cluster health seems fine. You could double check the OSD and mon charts if there were any failures, you could also search the OSD log files in /var/log/ceph for errors in the logs. Also chek your cpu and rams charts. It could also be connectivity issues.
cannot see something apparent. just on 4th node there is a disk sdf seems to go to 100% util at the end. but nothing obvious from charts, the cluster health seems fine. You could double check the OSD and mon charts if there were any failures, you could also search the OSD log files in /var/log/ceph for errors in the logs. Also chek your cpu and rams charts. It could also be connectivity issues.
Last edited on June 28, 2022, 8:19 pm by admin · #4
wailer
75 Posts
June 28, 2022, 10:25 pmQuote from wailer on June 28, 2022, 10:25 pmThank you , yes everything looks fine and we are pretty sure that networking is fine too. we'll try to dig a bit more into osd logs then...
Thank you , yes everything looks fine and we are pretty sure that networking is fine too. we'll try to dig a bit more into osd logs then...
Failing VMware datastore through iSCSI
wailer
75 Posts
Quote from wailer on June 28, 2022, 12:04 pmHello,
After several years of working slow but stable, our HDD petasan cluster has started to fail with almost any IO operation.
VMkernel.log from vmware shows messages like this:
2022-06-28T12:01:49.498Z cpu5:2098478)iscsi_vmk: iscsivmk_ConnNetRegister:2213: socket 0x4310491d0c20 network resource pool netsched.pools.persist.iscsi associated
2022-06-28T12:01:49.498Z cpu5:2098478)iscsi_vmk: iscsivmk_ConnNetRegister:2241: socket 0x4310491d0c20 network tracker id 284125898 tracker.iSCSI.192.168.74.211 associated
2022-06-28T12:01:49.751Z cpu0:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:880: vmhba64:CH:0 T:1 CN:0: iSCSI connection is being marked "ONLINE"
2022-06-28T12:01:49.751Z cpu0:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:881: Sess [ISID: 00023d000001 TARGET: iqn.2016-05.com.petahdd:00001 TPGT: 1 TSIH: 0]
2022-06-28T12:01:49.751Z cpu0:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:882: Conn [CID: 0 L: 192.168.74.62:24454 R: 192.168.74.211:3260]
2022-06-28T12:01:56.413Z cpu1:2097237)ScsiDeviceIO: 3501: Cmd(0x459ad7be7540) 0x2a, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d0 from world 2097225 to dev "naa.60014055e2d37a000001000000000000" failed H:0x5 D:0x0 P:0x0 Aborted at device layer. Cmd
2022-06-28T12:01:56.413Z cpu1:2097237)count Active:4 Queued:23
2022-06-28T12:01:58.798Z cpu32:2097271)ScsiDeviceIO: 3448: Cmd(0x45bae71e7900) 0x89, cmdId.initiator=0x43070bca4c40 CmdSN 0x824aa from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0 Aborted at device layer
2022-06-28T12:01:58.798Z cpu33:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2d16740 cmdId: 0x43070bca4c40, 0x824d2)
2022-06-28T12:01:58.798Z cpu35:2097268)ScsiDeviceIO: 3448: Cmd(0x45baf2d16740) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d2 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:58.798Z cpu36:2097267)ScsiDeviceIO: 3501: Cmd(0x459ad7bd8cc0) 0x2a, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d1 from world 2097225 to dev "naa.60014055e2d37a000001000000000000" failed H:0x5 D:0x0 P:0x0 Aborted at device layer. Cmd
2022-06-28T12:01:58.798Z cpu36:2097267)count Active:4 Queued:22
2022-06-28T12:01:59.000Z cpu45:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baeab0afc0 cmdId: 0x43070bca4c40, 0x824d3)
2022-06-28T12:01:59.000Z cpu39:2097280)ScsiDeviceIO: 3448: Cmd(0x45baeab0afc0) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d3 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:59.201Z cpu33:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45bacc715880 cmdId: 0x43070bca4c40, 0x824d4)
2022-06-28T12:01:59.201Z cpu35:2097268)ScsiDeviceIO: 3448: Cmd(0x45bacc715880) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d4 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:59.401Z cpu32:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2db6c40 cmdId: 0x43070bca4c40, 0x824d5)
2022-06-28T12:01:59.401Z cpu36:2097267)ScsiDeviceIO: 3448: Cmd(0x45baf2db6c40) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d5 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:59.602Z cpu32:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45bacc639380 cmdId: 0x43070bca4c40, 0x824d6)
2022-06-28T12:01:59.802Z cpu32:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2d0f300 cmdId: 0x43070bca4c40, 0x824d7)
2022-06-28T12:01:59.950Z cpu2:2097255)ScsiDeviceIO: 3501: Cmd(0x45baea014e00) 0x89, cmdId.initiator=0x43070bca4c40 CmdSN 0x824c4 from world 2135425 to dev "naa.60014055e2d37a000001000000000000" failed H:0x5 D:0x0 P:0x0 Aborted at device layer. Cmd
2022-06-28T12:01:59.950Z cpu2:2097255)count Active:4 Queued:21
2022-06-28T12:02:00.003Z cpu36:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2d0f300 cmdId: 0x43070bca4c40, 0x824d8)
2022-06-28T12:02:07.477Z cpu32:2130070)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue:643: vmhba64:CH:1 T:1 L:0 : Task mgmt "Abort Task" with itt=0xa4de0 (refITT=0xa4ddf) timed out.We are on petasan 3.1.0 , it's a 5 node cluster x 12- 10TB HD's
Any hint?
Hello,
After several years of working slow but stable, our HDD petasan cluster has started to fail with almost any IO operation.
VMkernel.log from vmware shows messages like this:
2022-06-28T12:01:49.498Z cpu5:2098478)iscsi_vmk: iscsivmk_ConnNetRegister:2213: socket 0x4310491d0c20 network resource pool netsched.pools.persist.iscsi associated
2022-06-28T12:01:49.498Z cpu5:2098478)iscsi_vmk: iscsivmk_ConnNetRegister:2241: socket 0x4310491d0c20 network tracker id 284125898 tracker.iSCSI.192.168.74.211 associated
2022-06-28T12:01:49.751Z cpu0:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:880: vmhba64:CH:0 T:1 CN:0: iSCSI connection is being marked "ONLINE"
2022-06-28T12:01:49.751Z cpu0:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:881: Sess [ISID: 00023d000001 TARGET: iqn.2016-05.com.petahdd:00001 TPGT: 1 TSIH: 0]
2022-06-28T12:01:49.751Z cpu0:2098478)WARNING: iscsi_vmk: iscsivmk_StartConnection:882: Conn [CID: 0 L: 192.168.74.62:24454 R: 192.168.74.211:3260]
2022-06-28T12:01:56.413Z cpu1:2097237)ScsiDeviceIO: 3501: Cmd(0x459ad7be7540) 0x2a, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d0 from world 2097225 to dev "naa.60014055e2d37a000001000000000000" failed H:0x5 D:0x0 P:0x0 Aborted at device layer. Cmd
2022-06-28T12:01:56.413Z cpu1:2097237)count Active:4 Queued:23
2022-06-28T12:01:58.798Z cpu32:2097271)ScsiDeviceIO: 3448: Cmd(0x45bae71e7900) 0x89, cmdId.initiator=0x43070bca4c40 CmdSN 0x824aa from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0 Aborted at device layer
2022-06-28T12:01:58.798Z cpu33:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2d16740 cmdId: 0x43070bca4c40, 0x824d2)
2022-06-28T12:01:58.798Z cpu35:2097268)ScsiDeviceIO: 3448: Cmd(0x45baf2d16740) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d2 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:58.798Z cpu36:2097267)ScsiDeviceIO: 3501: Cmd(0x459ad7bd8cc0) 0x2a, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d1 from world 2097225 to dev "naa.60014055e2d37a000001000000000000" failed H:0x5 D:0x0 P:0x0 Aborted at device layer. Cmd
2022-06-28T12:01:58.798Z cpu36:2097267)count Active:4 Queued:22
2022-06-28T12:01:59.000Z cpu45:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baeab0afc0 cmdId: 0x43070bca4c40, 0x824d3)
2022-06-28T12:01:59.000Z cpu39:2097280)ScsiDeviceIO: 3448: Cmd(0x45baeab0afc0) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d3 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:59.201Z cpu33:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45bacc715880 cmdId: 0x43070bca4c40, 0x824d4)
2022-06-28T12:01:59.201Z cpu35:2097268)ScsiDeviceIO: 3448: Cmd(0x45bacc715880) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d4 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:59.401Z cpu32:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2db6c40 cmdId: 0x43070bca4c40, 0x824d5)
2022-06-28T12:01:59.401Z cpu36:2097267)ScsiDeviceIO: 3448: Cmd(0x45baf2db6c40) 0x28, cmdId.initiator=0x43070bca4c40 CmdSN 0x824d5 from world 2135398 to dev "naa.60014055e2d37a000001000000000000" failed H:0x8 D:0x0 P:0x0
2022-06-28T12:01:59.602Z cpu32:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45bacc639380 cmdId: 0x43070bca4c40, 0x824d6)
2022-06-28T12:01:59.802Z cpu32:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2d0f300 cmdId: 0x43070bca4c40, 0x824d7)
2022-06-28T12:01:59.950Z cpu2:2097255)ScsiDeviceIO: 3501: Cmd(0x45baea014e00) 0x89, cmdId.initiator=0x43070bca4c40 CmdSN 0x824c4 from world 2135425 to dev "naa.60014055e2d37a000001000000000000" failed H:0x5 D:0x0 P:0x0 Aborted at device layer. Cmd
2022-06-28T12:01:59.950Z cpu2:2097255)count Active:4 Queued:21
2022-06-28T12:02:00.003Z cpu36:2135398)ScsiDeviceIO: 4513: Restricting cmd 0x28 (512 bytes) from WID 2135398 to quiesced dev naa.60014055e2d37a000001000000000000:1 (vmkCmd=0x45baf2d0f300 cmdId: 0x43070bca4c40, 0x824d8)
2022-06-28T12:02:07.477Z cpu32:2130070)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue:643: vmhba64:CH:1 T:1 L:0 : Task mgmt "Abort Task" with itt=0xa4de0 (refITT=0xa4ddf) timed out.
We are on petasan 3.1.0 , it's a 5 node cluster x 12- 10TB HD's
Any hint?
admin
2,930 Posts
Quote from admin on June 28, 2022, 1:55 pmCan you show screen shots of disk % utilisation charts for you nodes as well as the pg status chart for time of problem.
Can you show screen shots of disk % utilisation charts for you nodes as well as the pg status chart for time of problem.
wailer
75 Posts
Quote from wailer on June 28, 2022, 2:25 pmsure, problem is happening all the time. I attach last hour utilization.
[url=https://postimg.cc/FYYpQ7Tx][img]https://i.postimg.cc/FYYpQ7Tx/image.png[/img][/url]
[url=https://postimg.cc/NywHPMSF][img]https://i.postimg.cc/NywHPMSF/image.png[/img][/url]
[url=https://postimg.cc/XGq726jX][img]https://i.postimg.cc/XGq726jX/image.png[/img][/url]
[url=https://postimg.cc/Yvt7nB8k][img]https://i.postimg.cc/Yvt7nB8k/image.png[/img][/url]
[url=https://postimg.cc/qzTHqHPn][img]https://i.postimg.cc/qzTHqHPn/image.png[/img][/url]
PG status
[url=https://postimg.cc/BLzDpg09][img]https://i.postimg.cc/BLzDpg09/image.png[/img][/url]
Throughput and IOPS
[url=https://postimg.cc/vxMyJRMW][img]https://i.postimg.cc/vxMyJRMW/image.png[/img][/url]
[url=https://postimg.cc/QV4s6ywW][img]https://i.postimg.cc/QV4s6ywW/image.png[/img][/url]
Also we have noticed VMware is detecting the datastore as Flash disk .. not sure if this really matters as it has been working well until now..
thank you,
sure, problem is happening all the time. I attach last hour utilization.
[url=https://postimg.cc/FYYpQ7Tx][img]https://i.postimg.cc/FYYpQ7Tx/image.png[/img][/url]
[url=https://postimg.cc/NywHPMSF][img]https://i.postimg.cc/NywHPMSF/image.png[/img][/url]
[url=https://postimg.cc/XGq726jX][img]https://i.postimg.cc/XGq726jX/image.png[/img][/url]
[url=https://postimg.cc/Yvt7nB8k][img]https://i.postimg.cc/Yvt7nB8k/image.png[/img][/url]
[url=https://postimg.cc/qzTHqHPn][img]https://i.postimg.cc/qzTHqHPn/image.png[/img][/url]
PG status
[url=https://postimg.cc/BLzDpg09][img]https://i.postimg.cc/BLzDpg09/image.png[/img][/url]
Throughput and IOPS
[url=https://postimg.cc/vxMyJRMW][img]https://i.postimg.cc/vxMyJRMW/image.png[/img][/url]
[url=https://postimg.cc/QV4s6ywW][img]https://i.postimg.cc/QV4s6ywW/image.png[/img][/url]
Also we have noticed VMware is detecting the datastore as Flash disk .. not sure if this really matters as it has been working well until now..
thank you,
admin
2,930 Posts
Quote from admin on June 28, 2022, 8:18 pmcannot see something apparent. just on 4th node there is a disk sdf seems to go to 100% util at the end. but nothing obvious from charts, the cluster health seems fine. You could double check the OSD and mon charts if there were any failures, you could also search the OSD log files in /var/log/ceph for errors in the logs. Also chek your cpu and rams charts. It could also be connectivity issues.
cannot see something apparent. just on 4th node there is a disk sdf seems to go to 100% util at the end. but nothing obvious from charts, the cluster health seems fine. You could double check the OSD and mon charts if there were any failures, you could also search the OSD log files in /var/log/ceph for errors in the logs. Also chek your cpu and rams charts. It could also be connectivity issues.
wailer
75 Posts
Quote from wailer on June 28, 2022, 10:25 pmThank you , yes everything looks fine and we are pretty sure that networking is fine too. we'll try to dig a bit more into osd logs then...
Thank you , yes everything looks fine and we are pretty sure that networking is fine too. we'll try to dig a bit more into osd logs then...