
ESX iscsi dead or error

Sometimes after a crash, some ESXi servers cannot mount the volume via iSCSI.
After some research I came to the conclusion that the problem is probably not VMware but rather LIO/iSCSI.
Running a tcpdump, I see the ESXi server's packets arriving at the cluster nodes, but no replies come back.

How could I better debug this problem?

Have you seen anything like this?
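For reference, a capture like the one below can be collected on a PetaSAN node with something along these lines (the interface name eth1 is an assumption; 3260 is the iSCSI port and 10.0.2.156 the ESXi initiator seen in the trace):

tcpdump -nn -i eth1 'tcp port 3260 and host 10.0.2.156'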

--------------------------------------tcpdump-log-

10:34:13.830053 IP 10.0.2.156.44233 > 10.0.2.104.3260: Flags [P.], seq 1:265, ack 1, win 514, options [nop,nop,TS val 2858695 ecr 22278352], length 264

10:34:14.100163 IP 10.0.2.156.44233 > 10.0.2.104.3260: Flags [P.], seq 1:265, ack 1, win 514, options [nop,nop,TS val 2858722 ecr 22278352], length 264

10:34:14.430107 IP 10.0.2.156.44233 > 10.0.2.104.3260: Flags [P.], seq 1:265, ack 1, win 514, options [nop,nop,TS val 2858755 ecr 22278352], length 264

10:34:14.592341 IP 10.0.2.104.3260 > 10.0.2.156.44233: Flags [S.], seq 1717036133, ack 3332370974, win 28960, options [mss 1460,sackOK,TS val 22278602 ecr 2858755,nop,wscale 7], length 0

10:34:14.592585 IP 10.0.2.156.44233 > 10.0.2.104.3260: Flags [.], ack 1, win 514, options [nop,nop,TS val 2858771 ecr 22278602], length 0

10:34:14.880180 IP 10.0.2.156.44233 > 10.0.2.104.3260: Flags [P.], seq 1:265, ack 1, win 514, options [nop,nop,TS val 2858800 ecr 22278602], length 264

10:34:15.570107 IP 10.0.2.156.44233 > 10.0.2.104.3260: Flags [P.], seq 1:265, ack 1, win 514, options [nop,nop,TS val 2858869 ecr 22278602], length 264

10:34:16.592328 IP 10.0.2.104.3260 > 10.0.2.156.44233: Flags [S.], seq 1717036133, ack 3332370974, win 28960, options [mss 1460,sackOK,TS val 22279102 ecr 2858869,nop,wscale 7], length 0

10:34:16.592489 IP 10.0.2.156.44233 > 10.0.2.104.3260: Flags [.], ack 1, win 514, options [nop,nop,TS val 2858971 ecr 22279102], length 0

10:34:16.740053 IP 10.0.2.156.44233 > 10.0.2.104.3260: Flags [P.], seq 1:265, ack 1, win 514, options [nop,nop,TS val 2858986 ecr 22279102], length 264

10:34:18.595643 IP 10.0.2.156.44233 > 10.0.2.104.3260: Flags [R.], seq 265, ack 1, win 514, options [nop,nop,TS val 2859171 ecr 2227

--------------------------------------------------------------------------------------------------------------------------------------------------------------

 

 


Hi,

We do not see these issues in our tests. I understand you crash a PetaSAN node on purpose and ESXi then loses the datastore; is this the case? Does it happen every time or only sometimes? Please try to identify more precisely the conditions under which this happens.

General things:

  • Make sure you use MPIO with ESXi.
  • When this happens, can you identify from the PetaSAN UI which nodes the disk is being served from? Can you ping from the ESXi to these nodes on the path IPs, and can the nodes ping back the ESXi?
  • Can you run "targetcli ls" on the nodes that should serve the disk; do you see the iSCSI target with the dynamic IP enabled? Also run "ip addr | grep PATH_IP" to make sure the IP is allocated, and run it on the other nodes to make sure the dynamic IP is not allocated elsewhere (these node-side checks are consolidated in the sketch after this list).
  • Does the same LUN/datastore work from the other ESXi hosts, with only one ESXi unable to connect, or is it inaccessible to all of them?
  • Does "dmesg" on the iSCSI nodes show any errors?
  • If you are running low on resources or have a small number of spinning disks, a node failure will kick off the Ceph recovery process, which can stress the system if it does not have enough resources. Can you run atop when the failure happens and check how busy (%) the CPU/disk/RAM/network are? If you see them near 100% you should increase those resources; we have a hardware recommendation guide.
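As a convenience, the node-side checks above boil down to something like the following sketch (PATH_IP, ESXI_ISCSI_IP and the interface names are placeholders to adapt to your setup):

# on each node that should be serving the disk
targetcli ls                    # the iSCSI target should show the path (dynamic) IP enabled
ip addr | grep PATH_IP          # the path IP should be assigned here and on no other node
ping -c 3 ESXI_ISCSI_IP         # path connectivity from node to ESXi (and back from ESXi)
dmesg | tail -n 50              # recent LIO/iSCSI kernel messages
atop                            # watch % busy for cpu / disk / memory / network during the failure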

Doing some tests, after restarting any node the ESXi managed to re-establish the iSCSI volume.

Now at 8:00 PM it happened again. At the moment I have a node that is down due to a disk failure, and I cannot restart any node.

ESXi iSCSI has stopped again and my cluster is down.

 

Here is the information requested:

>>Make sure you use MPIO with ESXi
OK.

>>When this happens, can you identify on which nodes is the disk being served from the PetaSAN ui. Can you ping from the ESXi to these nodes on the path ips, can the nodes ping back the ESXi ?
ESXi and the nodes ping each other OK.

>>Can you run "targetcli ls" on the nodes that should serve the disk, do you see the iSCSI target with the dynamic ip enabled ? Also run "ip addr | grep PATH_IP" to make sure the ip is allocated and also run it on other nodes to make sure the dynamic ip is not allocated elsewhere.
targetcli ls output: http://imgur.com/a/fuoNq
IPs are not allocated in duplicate.

>>Can the same lun/datastore function from other ESXi  and only one ESXi cannot connect or is it not accessible to all ?
Only one ESXi node loses access to the iSCSI LUN.

>>Does "dmesg"  on the iSCSI nodes show any error ?

All nodes show this in dmesg:

[108672.866070] iSCSI/iqn.1998-01.com.vmware:esx1: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[108801.022610] iSCSI/iqn.1998-01.com.vmware:esx07: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[110472.979731] iSCSI/iqn.1998-01.com.vmware:esx1: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[110601.078816] iSCSI/iqn.1998-01.com.vmware:esx07: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[111808.546093] iSCSI/iqn.1998-01.com.vmware:esx03: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[112401.164459] iSCSI/iqn.1998-01.com.vmware:esx07: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[113609.772228] iSCSI/iqn.1998-01.com.vmware:esx03: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[114201.221900] iSCSI/iqn.1998-01.com.vmware:esx07: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[115873.165509] iSCSI/iqn.1998-01.com.vmware:esx1: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[116001.277484] iSCSI/iqn.1998-01.com.vmware:esx07: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[117673.220153] iSCSI/iqn.1998-01.com.vmware:esx1: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[117801.334055] iSCSI/iqn.1998-01.com.vmware:esx07: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[119601.393620] iSCSI/iqn.1998-01.com.vmware:esx07: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[121401.449178] iSCSI/iqn.1998-01.com.vmware:esx07: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[123073.409314] iSCSI/iqn.1998-01.com.vmware:esx1: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

[123201.505292] iSCSI/iqn.1998-01.com.vmware:esx07: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.

 

>>If you are running low on resources or have a small number of spinning disks, the act of failing will kick in the Ceph recovery process, this can stress the system if it does not have enough resources. Can you run atop when the failure happens and see how much % busy cpu/disk/ram/net, if you see them near 100% you should try to increase this, we have a hardware recommendation.

In this case I do not believe it is a lack of resources; all the nodes have 8 GB of RAM, and the ESXi servers have 64 GB.

As I mentioned earlier, I see the iSCSI packets arriving at the nodes, but there is no response.

Does restarting a node also restart the iSCSI/LIO service?

 

 

Hi there. As I understand it, you did test some node restarts and things were working until 8 PM. I suspect it is a resource issue: you may not have enough disks/RAM/network bandwidth/CPU cores to handle your load. If I recall, in one of your previous posts you started with very low resources and had many issues; you increased them and things are better, but it still may not be enough. The recommended approach is to stay as close as possible to our hardware recommendation guide, so please have a look at it. If this is not possible and you want to run with the lowest possible resources, which is not recommended, I can help if you supply the requested resource loads via the atop command; this will help us increase just enough to reach the bare minimum required.

Apart from your own iSCSI load, when you shut down a node, Ceph will try to recreate the replicas of the data that was on it using the remaining up disks. If you have only a couple of running disks and they are not fast enough, this can make them very busy and sometimes unresponsive. I posted before about some Ceph parameters for lowering the priority of Ceph recovery from its default values.

In the current case I suspect it may be due to extra load from a Ceph process called scrubbing. It is a backend process that checks data consistency (CRC/metadata) among the different replicas; it does put load on the system and happens on top of the recovery and client load. I suspect this because in PetaSAN we limit the scrub hours to run from 8 PM to 8 AM, whereas the default in Ceph is that it can run at any time, which can cause issues when daytime traffic is higher.

Please add the following to /etc/ceph/CLUSTER_NAME.conf on all nodes and then reboot; these should further limit the scrub load compared to the default values:

osd_max_scrubs = 1

osd_scrub_during_recovery = false

osd_scrub_priority = 1

You can also change the allowed scrub hours:

osd_scrub_begin_hour = 20

osd_scrub_end_hour = 8
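If you would like to try the scrub settings without a reboot first, they can usually also be injected at runtime (a sketch; the conf file change above is still needed to make them permanent, and CLUSTER_NAME is the same placeholder used elsewhere in this thread):

ceph tell osd.* injectargs '--osd_max_scrubs 1 --osd_scrub_priority 1' --cluster CLUSTER_NAME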

 

You can also turn scrubbing off completely (not recommended at all, just for testing) by running this from the shell on any node:

ceph osd set noscrub  --cluster CLUSTER_NAME

ceph osd set nodeep-scrub --cluster CLUSTER_NAME

To re-enable it:

ceph osd unset noscrub --cluster CLUSTER_NAME

ceph osd unset nodeep-scrub --cluster CLUSTER_NAME
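To confirm whether the flags are set, the OSD map should list them (a quick check using the same CLUSTER_NAME placeholder):

ceph osd dump --cluster CLUSTER_NAME | grep flags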

 

Also, as I pointed out for your earlier issues, place the following in the conf file to help reduce recovery traffic:

osd_max_backfills = 1

osd_recovery_max_active = 1

osd_recovery_priority = 1

osd_recovery_op_priority = 1

osd_recovery_threads = 1

osd_client_op_priority = 63

osd_recovery_max_start = 1

So I really do recommend that you get as close to the resource recommendations as possible; this saves a lot of debugging. If, however, you need to run with the lowest possible configuration, which I do not recommend, I can help if you send me the resource values measured with atop (or collectl/sysstat) at the highest load: when you crash a node while your clients are still writing.
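If it helps with gathering those numbers, atop can record samples to a file during the test and replay them afterwards (the interval and file path here are just examples):

atop -w /root/atop_crashtest.raw 10     # record a sample every 10 seconds while the test runs
atop -r /root/atop_crashtest.raw        # replay the recording later to read off % busy cpu/disk/memory/net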

Lastly, we have occasionally seen issues with IP packets not being replied to, but those were network setup errors such as overlapping subnet masks, and I do not believe this is your case.

Good luck..

>>Hi there. As i understand you did test some node restarts and things were working until 8 pm. I suspect it is a resource issue, you may not have enough disks/ram/net bandwidth/cpu cores to handle your load. If i recall in one of your previous posts you started with very low resources and had many issues, you did increase them and things are better but still may not be enough. The recommended approach is to try to be as close as our hardware recommendation guide..so please have a look at it, if this is not possible and you want to use the lowest possible resources, which is not recommended, i can help you if you can supply the requested resources loads via atop command, this will help us increase just enough to the bare minimum required.

Yes, that is true; I had many problems in the beginning because I started with low resources. Today I am using the following hardware:

4 x PetaSAN nodes, 8 GB RAM and 2 cores @ 2.5 GHz each
2 x PetaSAN nodes with 3 x 2 TB 7200 RPM HDDs each: 6 OSDs of 2 TB, 12 TB total
2 x PetaSAN nodes with no disks, only monitor and iSCSI (waiting for more disks)

>>Apart from your own iSCSI load, when you shutdown a node, Ceph will try to recreate other replicas of the data that was down on the remaining up disks. If you have only a couple of running disks and they are not fast enough, it can cause them to become very busy and sometimes unresponsive. I had posted you before on some Ceph params on how to lower the priority of Ceph recovery from it default values.

All right. I had already followed your previous tips, and that improved things a lot with regard to problems when a node crashes.

Looking at it more closely, I realized that when a node stops, the path IP addresses are redistributed, as expected. At that point, however, I see that ESXi loses some paths.

 

I created a 1 TB disk with 8 paths. Here are a few ESXi hosts after a crash:

[Screenshots posted on imgur]

On some ESXi hosts the number of iSCSI paths decreases; on others the paths are marked dead.

 

Even in the cases where the path is down, I can see that its IP is assigned to a PetaSAN node.

[Screenshot posted on imgur]

Is this behavior correct, or am I doing something wrong?

I believe I have identified part of the problem: the reason some iSCSI paths stay dead for a while.

In some cases (not all), the old MAC address for the IP of the node that stopped, which has since been re-created on another host, remains in the ESXi ARP table.

ESXi in turn has a high ARP cache timeout, about 20 minutes, which is roughly how long the iSCSI path stays dead.

 

This process can be accelerated by sending a ping from the PetaSAN node to the ESXi; otherwise it does not recover on its own.

Alternatively, esxcli network ip neighbor remove -v 4 -a x.x.x.x can be run on the ESXi to clear the ARP entry of the PetaSAN IP that moved to another node.
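A sketch of both workarounds (x.x.x.x is the moved path IP; the interface name eth1 is an assumption, and arping -U sends an unsolicited/gratuitous ARP from the node that took over the IP):

# on the PetaSAN node that now owns the path IP
ping -c 3 ESXI_ISCSI_IP
arping -c 3 -U -I eth1 x.x.x.x
# on the ESXi host
esxcli network ip neighbor list
esxcli network ip neighbor remove -v 4 -a x.x.x.x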

 

Why is this important? It means that, depending on the timing between one node's crash and another's, you may lose access to the iSCSI LUN entirely.

 

Another point I have not yet figured out is why the number of iSCSI paths decreases.

You are correct regarding the ESXi ARP cache. It does seem that paths that are idle or not involved in I/O take much longer to become active again after a failover; as you tested, this can last up to 20 minutes, which is the ARP refresh interval. We will soon have an update to fix this.