
Bug in move_paths


Sometimes all IPs of the corresponding NIC suddenly go down when the move_paths script is used:

root@ceph-node-mru-2:~# ip a |grep 192.168 |grep -v eth6 |grep -v eth7
inet 192.168.3.29/24 scope global eth4
inet 192.168.3.26/24 scope global secondary eth4
inet 192.168.3.22/24 scope global secondary eth4
inet 192.168.3.24/24 scope global secondary eth4
inet 192.168.4.24/24 scope global eth5
inet 192.168.4.27/24 scope global secondary eth5
inet 192.168.4.28/24 scope global secondary eth5
inet 192.168.4.21/24 scope global secondary eth5
inet 192.168.4.25/24 scope global secondary eth5
inet 192.168.4.20/24 scope global secondary eth5
root@ceph-node-mru-2:~# ./tools/move_path.py -id 00003 -ip 192.168.4.24
Done
root@ceph-node-mru-2:~# ip a |grep 192.168 |grep -v eth6 |grep -v eth7
inet 192.168.3.29/24 scope global eth4
inet 192.168.3.26/24 scope global secondary eth4
inet 192.168.3.22/24 scope global secondary eth4
inet 192.168.3.24/24 scope global secondary eth4

Might that be because it is the main IP for that NIC?

At the moment one LUN is inaccessible. In ESX one path seems to be down, but I cannot move the IP (because it is the main IP), and a reboot is not possible because recovery is in progress.

Do you have any idea why ESX is damn slow when only one out of four paths is down? And why does ESX not reconnect to the path? Do you need any further information?

Regards,

Dennis

By "main IP", do you mean that in ESX this is/was the active I/O path, whereas the rest of the paths are active failover?

In ESX, is the one path that is down the "main" IP?

Why is recovery happening? Were any nodes/OSDs down?

 

By main IP I mean, for example:

inet 192.168.4.24/24 scope global eth5
inet 192.168.4.27/24 scope global secondary eth5
inet 192.168.4.28/24 scope global secondary eth5
inet 192.168.4.21/24 scope global secondary eth5
inet 192.168.4.25/24 scope global secondary eth5
inet 192.168.4.20/24 scope global secondary eth5

Because 192.168.4.24 is not a secondary IP. If I try to move such an IP, all the secondary IPs on that NIC disappear but do not come back up on other nodes.

This morning there was another freeze of one ESX host. It seems to me that this happens when the PetaSAN servers are overloaded. In this case the ESX host is marked as not responding and has timeout messages on one PetaSAN LUN, but does not reconnect. The IPs are reachable. I tried to move paths, but got results like those in my first post above. I restarted this server and added a disk, so a recovery/backfilling process is running. The path that does not reconnect is not on this server; it is on server1.

No OSDs were down. Just messages like the following in dmesg:

[Fri Sep 15 00:01:00 2017] Process accounting resumed
[Fri Sep 15 11:49:35 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76321186
[Fri Sep 15 11:49:35 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76321186
[Fri Sep 15 14:02:52 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76353302
[Fri Sep 15 14:02:52 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 76353302
[Fri Sep 15 14:13:01 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76355154
[Fri Sep 15 14:13:01 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76355154
[Fri Sep 15 14:17:05 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76355839
[Fri Sep 15 14:17:05 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76355839
[Fri Sep 15 14:28:15 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76373676
[Fri Sep 15 14:28:15 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76373676
[Fri Sep 15 14:33:19 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76389604
[Fri Sep 15 14:33:19 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76389604
[Fri Sep 15 14:43:28 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76409064
[Fri Sep 15 14:43:28 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76409064
[Fri Sep 15 14:47:31 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76409989
[Fri Sep 15 14:47:31 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76409989
[Fri Sep 15 14:48:32 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76410200
[Fri Sep 15 14:48:32 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76410200
[Fri Sep 15 14:52:36 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76410941
[Fri Sep 15 14:52:36 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76410941
[Fri Sep 15 15:25:12 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76465507
[Fri Sep 15 15:25:12 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76465507
[Fri Sep 15 15:34:20 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76467224
[Fri Sep 15 15:34:20 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76467224
[Fri Sep 15 15:46:30 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76470739
[Fri Sep 15 15:46:30 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76470739
[Fri Sep 15 15:56:39 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76472483
[Fri Sep 15 15:56:39 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76472483
[Fri Sep 15 16:03:45 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76474447
[Fri Sep 15 16:03:45 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 76474447
[Fri Sep 15 16:11:52 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76477603
[Fri Sep 15 16:11:52 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76477603
[Fri Sep 15 16:12:52 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76478040
[Fri Sep 15 16:12:52 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76478040
[Fri Sep 15 16:13:53 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76478425
[Fri Sep 15 16:13:53 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 76478425
[Fri Sep 15 16:44:20 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76554367
[Fri Sep 15 16:44:20 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76554367
[Fri Sep 15 16:53:28 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76555949
[Fri Sep 15 16:53:28 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76555949
[Fri Sep 15 17:56:21 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76652952
[Fri Sep 15 17:56:21 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 76652952
[Fri Sep 15 18:10:37 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76700300
[Fri Sep 15 18:10:37 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76700300
[Fri Sep 15 18:13:21 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76707468
[Fri Sep 15 18:13:23 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76707468
[Fri Sep 15 18:16:21 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76717224
[Fri Sep 15 18:16:22 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76717224
[Fri Sep 15 18:24:07 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76752051
[Fri Sep 15 18:24:08 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76752051
[Fri Sep 15 18:24:23 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76752710
[Fri Sep 15 18:24:23 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76752710
[Fri Sep 15 18:26:25 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76759904
[Fri Sep 15 18:26:25 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76759904
[Fri Sep 15 18:28:25 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76766387
[Fri Sep 15 18:28:26 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76766387
[Fri Sep 15 18:29:25 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76768649
[Fri Sep 15 18:29:25 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76768649
[Fri Sep 15 18:38:25 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76791937
[Fri Sep 15 18:38:26 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76791937
[Fri Sep 15 18:48:34 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76816265
[Fri Sep 15 18:48:34 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76816265
[Fri Sep 15 19:05:48 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 76837793
[Fri Sep 15 20:58:47 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76941442
[Fri Sep 15 20:58:47 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 76941442
[Fri Sep 15 23:11:19 2017] ABORT_TASK: Found referenced iSCSI task_tag: 76981682
[Fri Sep 15 23:11:19 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76981682
[Fri Sep 15 23:56:15 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77005026
[Fri Sep 15 23:56:15 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 77005026
[Sat Sep 16 00:01:03 2017] Process accounting resumed
[Sat Sep 16 02:11:35 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77058120
[Sat Sep 16 02:11:35 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 77058120
[Sat Sep 16 02:33:35 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77062716
[Sat Sep 16 02:33:35 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 77062716
[Sat Sep 16 04:01:39 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77197934
[Sat Sep 16 04:01:39 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 77197934
[Sat Sep 16 04:46:31 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77204741
[Sat Sep 16 04:46:31 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 77204741
[Sat Sep 16 05:06:08 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77207592
[Sat Sep 16 05:06:08 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 77207592
[Sat Sep 16 05:06:08 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77207595
[Sat Sep 16 05:06:08 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 77207595
[Sat Sep 16 05:06:35 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 77210428
[Sat Sep 16 05:06:35 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77210430
[Sat Sep 16 05:06:35 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 77210430
[Sat Sep 16 05:06:35 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77210431
[Sat Sep 16 05:06:35 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 77210431
[Sat Sep 16 06:36:30 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77412284
[Sat Sep 16 06:36:30 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 77412284
[Sat Sep 16 10:03:33 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77703827
[Sat Sep 16 10:03:33 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 77703827
[Sat Sep 16 10:36:39 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77736601
[Sat Sep 16 10:36:39 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 77736601
[Sat Sep 16 11:06:33 2017] ABORT_TASK: Found referenced iSCSI task_tag: 77741337
[Sat Sep 16 11:06:33 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 77741337
[Sat Sep 16 20:15:57 2017] COMPARE_AND_WRITE: miscompare at offset 0
[Sat Sep 16 21:53:10 2017] ABORT_TASK: Found referenced iSCSI task_tag: 78202365
[Sat Sep 16 21:53:10 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78202365
[Sat Sep 16 21:56:25 2017] ABORT_TASK: Found referenced iSCSI task_tag: 78205358
[Sat Sep 16 21:56:25 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 78205358
[Sun Sep 17 00:01:05 2017] traps: atop[50325] trap divide error ip:4073c2 sp:7ffeb48906a0 error:0 in atop[400000+26000]traps:
[Sun Sep 17 01:16:38 2017] ABORT_TASK: Found referenced iSCSI task_tag: 78301465
[Sun Sep 17 01:16:38 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78301465
[Sun Sep 17 01:16:38 2017] ABORT_TASK: Found referenced iSCSI task_tag: 78301468
[Sun Sep 17 01:16:38 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78301468
[Sun Sep 17 03:40:50 2017] ABORT_TASK: Found referenced iSCSI task_tag: 78353280
[Sun Sep 17 03:40:50 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 78353280
[Sun Sep 17 03:47:08 2017] ABORT_TASK: Found referenced iSCSI task_tag: 78353864
[Sun Sep 17 03:47:08 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 78353864
[Sun Sep 17 04:45:32 2017] ABORT_TASK: Found referenced iSCSI task_tag: 78375794
[Sun Sep 17 04:45:32 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78375794
[Sun Sep 17 05:23:35 2017] ABORT_TASK: Found referenced iSCSI task_tag: 78437742
[Sun Sep 17 05:23:35 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78437742
[Sun Sep 17 07:25:26 2017] ABORT_TASK: Found referenced iSCSI task_tag: 78602946
[Sun Sep 17 07:25:26 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 78602946
[Sun Sep 17 08:04:27 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78646736
[Sun Sep 17 08:04:27 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 78646737
[Sun Sep 17 09:14:15 2017] ABORT_TASK: Found referenced iSCSI task_tag: 78707964
[Sun Sep 17 09:14:15 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 78707964
[Sun Sep 17 09:16:26 2017] COMPARE_AND_WRITE: miscompare at offset 0
[Sun Sep 17 11:20:00 2017] ABORT_TASK: Found referenced iSCSI task_tag: 78826906
[Sun Sep 17 11:20:00 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 78826906
[Sun Sep 17 13:21:48 2017] ABORT_TASK: Found referenced iSCSI task_tag: 79061065
[Sun Sep 17 13:21:48 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 79061065
[Sun Sep 17 13:46:36 2017] ABORT_TASK: Found referenced iSCSI task_tag: 79065611
[Sun Sep 17 13:46:36 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 79065611
[Mon Sep 18 00:01:07 2017] Process accounting resumed
[Mon Sep 18 04:30:42 2017] COMPARE_AND_WRITE: miscompare at offset 0
[Mon Sep 18 04:46:33 2017] COMPARE_AND_WRITE: miscompare at offset 0
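
A log like the one above is easier to compare across days when the ABORT_TASK responses are counted. ABORT_TASK entries on an LIO target generally mean the initiator gave up waiting on a command, so a rising count tends to match the timeouts seen on the ESX side. The following is only a rough sketch: the function name `abort_summary` is made up for illustration, and it assumes the human-readable timestamp format shown above (as produced by `dmesg -T`).

```shell
#!/bin/sh
# Hypothetical helper: count ABORT_TASK responses per day and per result.
# Reads `dmesg -T` style lines on stdin, for example:
#   [Fri Sep 15 11:49:35 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 76321186
abort_summary() {
    awk '/ABORT_TASK: Sending/ {
        day = $2 " " $3            # month and day, e.g. "Sep 15"
        count[day " " $8]++        # $8 is the TMR_* result code
    }
    END { for (k in count) print k, count[k] }' | sort
}

# Usage on the iSCSI node:
#   dmesg -T | abort_summary
```

Each output line has the form "month day result count". The sort is purely lexical, so it is only meant for a quick look within one month.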

OK, I got you: the main IP is the first IP assigned on the NIC! We are looking into this.

 

We are able to reproduce it; I will send you a fix shortly.

 

To prevent deletion of the primary IP from also removing the other IPs:

echo 1 > /proc/sys/net/ipv4/conf/all/promote_secondaries

To make this persistent, add to /etc/sysctl.conf:
net.ipv4.conf.all.promote_secondaries=1
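
The persistence step above could be scripted so that it is safe to run more than once. This is a minimal sketch, not part of PetaSAN: the helper name `ensure_sysctl_line` is made up for illustration, and it assumes the setting is kept in /etc/sysctl.conf rather than in a drop-in file under /etc/sysctl.d.

```shell
#!/bin/sh
# Hypothetical helper: append "key=value" to a sysctl config file unless
# a line for that key already exists. It does not change an existing line,
# so a conflicting value (e.g. "=0") still has to be fixed by hand.
ensure_sysctl_line() {
    file=$1 key=$2 value=$3
    # Escape the dots in the key so they match literally in the regex.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$file" 2>/dev/null; then
        return 0
    fi
    printf '%s=%s\n' "$key" "$value" >> "$file"
}

# Usage as root:
#   ensure_sysctl_line /etc/sysctl.conf net.ipv4.conf.all.promote_secondaries 1
#   sysctl -p                                        # reload /etc/sysctl.conf
#   sysctl net.ipv4.conf.all.promote_secondaries     # show the running value
```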

 

Now the PetaSAN node running the iSCSI/LIO service has the IP configured in LIO but not on its NICs.

To find the active paths in LIO:

targetcli ls | grep 192.168

For paths listed as "enabled", make sure the IPs are configured on the NIC; if not, add the IP to the NIC:

ip address add 192.168.4.28/24 dev ethX

Make sure to choose the correct ethX NIC based on subnet 1 or 2.
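
The comparison described above (portal IPs known to LIO versus IPs actually present on the NICs) can be sketched as a few small filters. This is an illustration under assumptions: the function names are made up, the portal line format is taken from the `targetcli ls` output shown in this thread, and the sketch does not check which paths are "enabled", so its output is only a list of candidates to review by hand.

```shell
#!/bin/sh
# Print the IPv4 address of every portal line ("o- a.b.c.d:port ...")
# found in `targetcli ls` output on stdin.
portal_ips() {
    sed -n 's/.*o- \([0-9][0-9.]*\):[0-9][0-9]* .*/\1/p' | sort -u
}

# Print every IPv4 address found in `ip -4 addr` output on stdin.
nic_ips() {
    awk '{ for (i = 1; i < NF; i++)
               if ($i == "inet") { split($(i + 1), a, "/"); print a[1] } }' | sort -u
}

# Print the portal IPs (file $1) that do not appear among the NIC IPs (file $2).
missing_on_nic() {
    grep -vxFf "$2" "$1"
}

# Usage on the iSCSI node:
#   targetcli ls | portal_ips > /tmp/portals
#   ip -4 addr   | nic_ips    > /tmp/nics
#   missing_on_nic /tmp/portals /tmp/nics
```

Any address printed by `missing_on_nic` is known to LIO but not configured locally. Only those belonging to an enabled path need to be added with `ip address add`.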

 

After setting this (echo ...) and using move_path, the IP is on both servers!

root@ceph-node-mru-1:~# ip a |grep 3.21
inet 192.168.3.21/24 scope global eth4
root@ceph-node-mru-1:~# ip addr del 192.168.3.21/24 dev eth4
root@ceph-node-mru-1:~# ip a |grep 3.21
root@ceph-node-mru-1:~# targetcli ls | grep 192.168.3.21
| | | o- 192.168.3.21:3260 ................................................................................. [OK, iser disabled]

 

root@ceph-node-mru-2:~# ip a |grep 3.21
inet 192.168.3.21/24 scope global secondary eth4
root@ceph-node-mru-2:~# targetcli ls | grep 192.168.3.21
| | | o- 192.168.3.21:3260 ................................................................................. [OK, iser disabled]

 

This is OK; you need to check for "enabled" as per my previous post.

FYI, paths are listed in LIO but disabled for several reasons, such as supporting iSCSI discovery (so when you discover using one path, it knows about the other paths) and making it faster to enable a path when switching.

Only one server should have this path "enabled" in LIO; the rest should be disabled.

OK, that refers to LIO, yes? But the IPs on the interfaces should not be on both servers, should they?

 
