ESX-Server ISCSI problems?
admin
2,930 Posts
August 8, 2017, 1:29 pm
The nodes' shutdown is most probably due to fencing: a node kills another node if the latter loses connection to the cluster while it still has iSCSI disk resources that have not yet been redistributed. The best thing is to wait a few minutes before restarting a node that was down or was being upgraded, to make sure all its paths have been redistributed.
This should fix the issue. If not, let me know and I can help you disable the fencing action.
The script to move paths off active nodes is being tested now; I will post it here once done.
Last edited on August 8, 2017, 1:30 pm · #31
admin
2,930 Posts
August 8, 2017, 3:38 pm
We made the following move_path.py script to move an active path off a node:
https://drive.google.com/file/d/0B7VNYCjYBY2yOXRXUEpGajVlalU/view?usp=sharing
It is best to place it in /opt/petasan/scripts and make it executable:
chmod +x move_path.py
Run syntax:
./move_path.py -id DISK_ID -ip IP_ADDRESS
Example:
./move_path.py -id 00001 -ip 10.0.2.100
It needs to run from the node currently serving the path. You need to specify the full id string of the disk, e.g. 00001 and not 1.
It will trigger a path move; in some cases the path may end up on the same node, in which case retry. In the future we intend to support this in the UI, where you can specify a target node and allow dynamic moving based on load, so for now this is very crude.
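Until the UI supports choosing a target node, the retry can be wrapped in a small script. This is only a sketch: it assumes (not verified) that move_path.py exits nonzero when the move fails outright; for the "path landed on the same node" case you would still check the path assignment afterwards and rerun by hand. The MOVE_CMD variable and retry_move function are illustrative names, not part of PetaSAN.

```shell
#!/bin/sh
# Sketch: retry move_path.py a few times. Assumes the script exits nonzero
# on failure (an assumption); verifying which node actually serves the path
# afterwards is still up to you.

# MOVE_CMD can be overridden (e.g. for testing); defaults to the real script.
: "${MOVE_CMD:=/opt/petasan/scripts/move_path.py}"

retry_move() {
    disk_id=$1; ip=$2; max=${3:-3}
    i=1
    while [ "$i" -le "$max" ]; do
        if $MOVE_CMD -id "$disk_id" -ip "$ip"; then
            echo "move attempt $i: command succeeded"
            return 0
        fi
        echo "move attempt $i failed, retrying..." >&2
        i=$((i + 1))
    done
    echo "giving up after $max attempts" >&2
    return 1
}
```

Usage, from the node currently serving the path: retry_move 00001 10.0.2.100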
Last edited on August 8, 2017, 3:57 pm · #32
therm
121 Posts
August 9, 2017, 6:45 am
Works like a charm!
root@ceph-node-mru-1:~# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
7
root@ceph-node-mru-2:~# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
7
root@ceph-node-mru-3:~# ip a |grep 192.168 |grep -v eth6 |grep -v eth7|wc -l
6
Last night everything was fine (paths were on node1 and node2). Hopefully this will stabilize the cluster; I will report here how things go.
Thanks again!
Last edited on August 9, 2017, 6:47 am · #33
therm
121 Posts
August 11, 2017, 5:09 am
The system seems to be stable now. Besides distributing paths across all nodes, I now run scrubs during the daytime, but throttled:
# reduce background load due to scrub
osd_max_scrubs = 1
osd_scrub_during_recovery = false
osd_scrub_priority = 1
osd_scrub_sleep = 2
osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 5
osd_deep_scrub_stride = 1048576
osd_scrub_load_threshold = 5
osd_scrub_begin_hour = 6
osd_scrub_end_hour = 22
This leads to about 30 IOPS of background reads, which is not disruptive.
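For anyone who wants to try these values without restarting the OSDs, something like the following should apply them at runtime via Ceph's standard injectargs mechanism. This is a sketch: whether each option takes effect via injection depends on your Ceph version, and injected values are lost on OSD restart, so keep them in ceph.conf as well.

```shell
# Apply the throttled-scrub settings above to all running OSDs at runtime.
# Injected values do not persist across OSD restarts; keep ceph.conf in sync.
ceph tell osd.* injectargs '--osd_max_scrubs 1 --osd_scrub_sleep 2'
ceph tell osd.* injectargs '--osd_scrub_chunk_min 1 --osd_scrub_chunk_max 5'
ceph tell osd.* injectargs '--osd_scrub_begin_hour 6 --osd_scrub_end_hour 22'
```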
Another problem was that after a reboot, interfaces eth4 and eth2 swapped names. While waiting for the next release, I fixed it in the meantime with:
vi /etc/udev/rules.d/70-persistent-net.rules
...
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="ac:16:2d:ac:2b:88", NAME="eth4"
...
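A rule like the one above is needed per interface. As a convenience, the rules for all ethN interfaces can be generated from their current MAC addresses in sysfs; the mk_rule helper below is a hypothetical name, and the output should be reviewed before writing it to /etc/udev/rules.d/70-persistent-net.rules.

```shell
#!/bin/sh
# Sketch: generate persistent-net udev rules for all ethN interfaces from
# their MAC addresses in sysfs, pinning each name across reboots.

mk_rule() {
    # Emit one udev rule binding the interface with MAC $1 to name $2.
    printf 'SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="%s", NAME="%s"\n' "$1" "$2"
}

for dev in /sys/class/net/eth*; do
    [ -e "$dev" ] || continue          # no ethN interfaces present
    name=$(basename "$dev")
    mac=$(cat "$dev/address")
    mk_rule "$mac" "$name"
done
```

Redirect the output to the rules file only after checking it matches the naming you want.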
Switching paths is now totally easy and painless!
Last edited on August 11, 2017, 5:17 am · #34
admin
2,930 Posts
August 11, 2017, 9:57 am
Happy things are working well now 🙂 and thanks very much for sharing your scrub params.
Regarding the NIC name change: this is strange. The only thing I can think of is that in v1.3.1 we included newer firmware for some kernel drivers. As you noted, in v1.4 we will include a menu to name/rename NICs. It will handle this case, but it was really designed to support scenarios such as users changing hardware, and also running PetaSAN hyper-converged under ESX, where adding a new NIC can make VMware change the order of existing ones.