ESX Server iSCSI problems?
therm
121 Posts
August 5, 2017, 12:31 pm
root@ceph-node-mru-1:~# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
7
root@ceph-node-mru-2:/tmp# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
13
root@ceph-node-mru-3:~# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
0
Yes! Thank you!
Now the ESX server that crashed has 11 paths and the surviving ESX server has 20 paths, but all of them are available.
How can we bring back the missing ones?
admin
2,930 Posts
August 5, 2017, 1:17 pm
If this is related to the ARP issue (fixed in 1.3.1), can you print the ARP table on each ESX (sorry, I do not have the command offhand but can get it) and see whether the failed paths show the correct MAC addresses? If not, the ARP fix outlined above should resolve it. If they do show the correct addresses, can you ping those IPs from the ESX, and vice versa from the PetaSAN nodes to that ESX? Upgrading to 1.3.1 would be better if you can. You can also try rescanning the paths from the ESX.
I would also recommend getting a feel for the PetaSAN performance load during your night operations (% busy CPU/disks, etc.), but one quick thing to try is disabling scrubbing, as it does increase latency and we schedule it to run at night:
ceph osd set nodeep-scrub
ceph osd set noscrub
There are various other performance factors we can try after getting some load metrics. v1.4 will make this process easier.
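For reference, a rough sketch of the ESX-side checks mentioned above (exact esxcli syntax can vary by ESXi version, and the adapter name below is only an example):
# List the ARP/neighbor table on the ESX host to verify path MAC addresses
esxcli network ip neighbor list
# Rescan an iSCSI adapter to rediscover paths
esxcli storage core adapter rescan --adapter vmhba33
And to re-enable scrubbing later, once you have the load metrics:
ceph osd unset nodeep-scrub
ceph osd unset noscrub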
Last edited on August 5, 2017, 1:20 pm · #22
therm
121 Posts
August 6, 2017, 5:30 am
This morning all servers survived. All paths are available on all servers. It really seems load-related. (I did disable scrub.)
We will update to 1.3.1 on Monday. Any idea when 1.4 will be released? In addition, as I mentioned in another thread, we will expand the cluster.
Thank you very much for your support!
Dennis
therm
121 Posts
August 6, 2017, 11:47 am
In addition, I've found one LUN on one ESX that had giant latency in esxtop (u). Changing the path selection from round robin to last used fixed it for the moment.
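If anyone needs to make the same change from the command line rather than the vSphere client, something like this should work (the naa ID below is a placeholder for the affected LUN):
# Show devices and their current path selection policy
esxcli storage nmp device list
# Switch one device from round robin to most recently used ("last used")
esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_MRU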
admin
2,930 Posts
August 7, 2017, 8:18 am
Yes, upgrading to 1.3.1 is recommended, as it has been tested extensively with ESX failover.
Version 1.4 is due Aug 17.
The high latency on one path on one ESX for one LUN is intriguing; if the storage node were loaded, it would slow down the other paths as well. But I am happy things are working better now.
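For reference, per-device latency can be watched live on the host; in esxtop, pressing u switches to the disk device (LUN) view, where DAVG/cmd shows device-side latency in milliseconds:
# Interactive; press u for the disk device view
esxtop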
Last edited on August 7, 2017, 9:18 am · #25
therm
121 Posts
August 7, 2017, 2:33 pm
After replacing an SFP transceiver and updating one of three nodes, PetaSAN has the following IP distribution:
root@ceph-node-mru-1:~# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
18
root@ceph-node-mru-2:~# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
2
root@ceph-node-mru-3:~# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
0
ceph-node-mru-3 is PetaSAN 1.3.1 while the others are 1.3.0. Sending the arping command as you suggested before does not seem to help. I did a rescan on one ESX and it is seeing 18 paths. To me this means that if the paths do not come back, an upgrade of ceph-node-mru-1 is impossible. How can I manually switch the iSCSI paths?
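For completeness, the arping we issued from the PetaSAN node looked roughly like this (the interface name is an example; the address is one of the path IPs):
# Send gratuitous ARP so switches and the ESX hosts refresh their tables
arping -U -c 3 -I eth4 192.168.3.21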
therm
121 Posts
August 7, 2017, 2:39 pm
Here is the dmesg output of the ESX not getting the paths back:
2017-08-07T14:34:28.273Z cpu24:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:4 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2017-08-07T14:34:28.273Z cpu24:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000005 TARGET: iqn.2016-05.com.petasan:00001 TPGT: 3 TSIH: 0]
2017-08-07T14:34:28.273Z cpu24:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 192.168.3.12:13333 R: 192.168.3.21:3260]
2017-08-07T14:34:29.030Z cpu24:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:4 T:4 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2017-08-07T14:34:29.030Z cpu24:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000005 TARGET: iqn.2016-05.com.petasan:00005 TPGT: 3 TSIH: 0]
2017-08-07T14:34:29.030Z cpu24:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 192.168.3.12:12914 R: 192.168.3.29:3260]
2017-08-07T14:34:30.793Z cpu24:33617)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4307f708b0e0 network resource pool netsched.pools.persist.iscsi associated
2017-08-07T14:34:30.793Z cpu24:33617)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4307f708b0e0 network tracker id 16768 tracker.iSCSI.192.168.3.21 associated
2017-08-07T14:34:31.550Z cpu24:33617)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4307f72fdfc0 network resource pool netsched.pools.persist.iscsi associated
2017-08-07T14:34:31.550Z cpu24:33617)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4307f72fdfc0 network tracker id 16768 tracker.iSCSI.192.168.3.29 associated
2017-08-07T14:34:36.579Z cpu12:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:4 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2017-08-07T14:34:36.579Z cpu12:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000005 TARGET: iqn.2016-05.com.petasan:00001 TPGT: 3 TSIH: 0]
2017-08-07T14:34:36.579Z cpu12:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 192.168.3.12:38055 R: 192.168.3.21:3260]
2017-08-07T14:34:37.335Z cpu12:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:4 T:4 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2017-08-07T14:34:37.335Z cpu12:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000005 TARGET: iqn.2016-05.com.petasan:00005 TPGT: 3 TSIH: 0]
2017-08-07T14:34:37.335Z cpu12:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 192.168.3.12:19402 R: 192.168.3.29:3260]
2017-08-07T14:34:39.093Z cpu12:33617)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4307f708b0e0 network resource pool netsched.pools.persist.iscsi associated
2017-08-07T14:34:39.093Z cpu12:33617)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4307f708b0e0 network tracker id 16768 tracker.iSCSI.192.168.3.21 associated
2017-08-07T14:34:39.848Z cpu12:33617)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4307f72fdfc0 network resource pool netsched.pools.persist.iscsi associated
2017-08-07T14:34:39.848Z cpu12:33617)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4307f72fdfc0 network tracker id 16768 tracker.iSCSI.192.168.3.29 associated
2017-08-07T14:34:44.869Z cpu12:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:4 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2017-08-07T14:34:44.869Z cpu12:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000005 TARGET: iqn.2016-05.com.petasan:00001 TPGT: 3 TSIH: 0]
2017-08-07T14:34:44.869Z cpu12:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 192.168.3.12:45832 R: 192.168.3.21:3260]
2017-08-07T14:34:45.623Z cpu12:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:4 T:4 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2017-08-07T14:34:45.623Z cpu12:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000005 TARGET: iqn.2016-05.com.petasan:00005 TPGT: 3 TSIH: 0]
2017-08-07T14:34:45.623Z cpu12:33617)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 192.168.3.12:40743 R: 192.168.3.29:3260]
2017-08-07T14:34:47.387Z cpu12:33617)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4307f708b0e0 network resource pool netsched.pools.persist.iscsi associated
2017-08-07T14:34:47.387Z cpu12:33617)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4307f708b0e0 network tracker id 16768 tracker.iSCSI.192.168.3.21 associated
2017-08-07T14:34:48.143Z cpu12:33617)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4307f72fdfc0 network resource pool netsched.pools.persist.iscsi associated
2017-08-07T14:34:48.143Z cpu12:33617)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4307f72fdfc0 network tracker id 16768 tracker.iSCSI.192.168.3.29 associated
[root@bl460-12:~] ping 192.168.3.21
PING 192.168.3.21 (192.168.3.21): 56 data bytes
64 bytes from 192.168.3.21: icmp_seq=0 ttl=64 time=0.125 ms
64 bytes from 192.168.3.21: icmp_seq=1 ttl=64 time=0.110 ms
--- 192.168.3.21 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.110/0.117/0.125 ms
[root@bl460-12:~] ping 192.168.3.29
PING 192.168.3.29 (192.168.3.29): 56 data bytes
64 bytes from 192.168.3.29: icmp_seq=0 ttl=64 time=0.080 ms
64 bytes from 192.168.3.29: icmp_seq=1 ttl=64 time=0.069 ms
64 bytes from 192.168.3.29: icmp_seq=2 ttl=64 time=0.099 ms
admin
2,930 Posts
August 7, 2017, 3:39 pm
Currently in PetaSAN, when a node fails or is shut down, its paths are distributed to the other nodes. If it comes back, it will not take back the original paths it was serving; it will wait for paths from newly created disks, or take over paths from other nodes that fail, or paths whose disks were stopped and then restarted.
In the future we will support moving paths off a running node and balancing them to other nodes, either manually via the admin or dynamically based on load. Currently the only way to invoke path changes is either shutting down/failing the node or stopping and starting the disks.
The new 1.3.1 ARP code is used when a node takes over paths from another node that was shut down, or when a disk was stopped and restarted.
In your case I would recommend upgrading the second node, which has 2 paths, and observing that those 2 paths are distributed to nodes 1 and 3. Then you can do the same with the first node.
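To watch the handover while you upgrade, the same per-node path count used earlier in this thread works fine, e.g.:
root@ceph-node-mru-2:~# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l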
Last edited on August 7, 2017, 3:43 pm · #28
admin
2,930 Posts
August 8, 2017, 11:29 am
Did you go ahead with upgrading the other 2 machines? Please let me know if you had issues.
If you need to move active paths on running machines, we can provide a command line tool until we have this in the UI.
Last edited on August 8, 2017, 11:30 am · #29
therm
121 Posts
August 8, 2017, 12:30 pm
Maybe because of the path distribution (nearly all iSCSI paths on one host), node 1 shut the other nodes down this morning. All servers are now on 1.3.1, but we have to prevent constant crashes.
A command line tool to move paths would be awesome for that!