
iSCSI drives stop when a PetaSAN node is taken offline and then brought back online

Hi,

We are having the following problem with our cluster. The iSCSI drives keep working without problems when a node is taken offline (we shut it down or manually restart it), but when the node is powered on and comes back online, all the iSCSI disks stop. We have tried disabling the Fencing option in the Maintenance section, but the problem persists.
Below is the output of the PetaSAN logs from the three nodes of the cluster.

NODE1

28/07/2020 08:58:44 INFO PetaSAN cleaned iqns.

28/07/2020 08:58:44 INFO Image image-00004 unmapped successfully.

28/07/2020 08:58:44 INFO LIO deleted Target iqn.2016-05.com.petasan:00004

28/07/2020 08:58:44 INFO LIO deleted backstore image image-00004

28/07/2020 08:58:43 INFO Image image-00005 unmapped successfully.

28/07/2020 08:58:43 INFO LIO deleted Target iqn.2016-05.com.petasan:00005

28/07/2020 08:58:43 INFO LIO deleted backstore image image-00005

28/07/2020 08:58:43 INFO PetaSAN cleaned local paths not locked by this node in consul.

28/07/2020 08:58:43 INFO Cleaned disk path 00004/1.

28/07/2020 08:58:43 INFO Cleaned disk path 00005/2.

28/07/2020 08:58:43 INFO Cleaned disk path 00005/3.

28/07/2020 08:49:19 INFO Path 00005/2 acquired successfully

28/07/2020 08:49:14 INFO The path 00005/2 was locked by ceph-node2.

28/07/2020 08:49:14 INFO Found pool:rbd for disk:00005 via consul

28/07/2020 06:25:13 INFO GlusterFS mount attempt

 

NODE2

28/07/2020 08:59:26 INFO CIFS check_health ctdb not active, restarting.

28/07/2020 08:59:26 INFO CIFSService key change action

28/07/2020 08:58:51 WARNING CIFS init degraded Gluster FS : ceph-node2 down

28/07/2020 08:58:51 WARNING CIFS init degraded Gluster FS : ceph-node3 down

28/07/2020 08:58:51 WARNING CIFS init degraded Gluster FS : ceph-node1 down

28/07/2020 08:58:45 INFO LeaderElectionBase successfully dropped old sessions

28/07/2020 08:58:43 ERROR Error running echo command :echo "PetaSAN.NodeStats.ceph-node2.cpu_all.percent_util 6.39 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.memory.percent_util 2.18 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.percent_util.bond0 0.01 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.throughput.bond0_received 101611.52 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.throughput.bond0_transmitted 186777.6 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.percent_util.bond1 0.0 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.throughput.bond1_received 40.96 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.throughput.bond1_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.percent_util.eth0 0.0 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.throughput.eth0_received 20.48 `date +%s`" | nc -q0 192.168.0.212 2003

Traceback (most recent call last):
  File "/opt/petasan/scripts/node_stats.py", line 159, in <module>
    get_stats()
  File "/opt/petasan/scripts/node_stats.py", line 64, in get_stats
    graphite_sender.send(leader_ip)
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/common/graphite_sender.py", line 59, in send
    raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo "PetaSAN.NodeStats.ceph-node2.cpu_all.percent_util 6.39 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.memory.percent_util 2.18 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.percent_util.bond0 0.01 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.throughput.bond0_received 101611.52 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.throughput.bond0_transmitted 186777.6 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.percent_util.bond1 0.0 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.throughput.bond1_received 40.96 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.throughput.bond1_transmitted 0.0 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.percent_util.eth0 0.0 `date +%s`" "
PetaSAN.NodeStats.ceph-node2.ifaces.throughput.eth0_received 20.48 `date +%s`" | nc -q0 192.168.0.212 2003

28/07/2020 08:58:43 ERROR Node Stats exception.

28/07/2020 08:58:40 INFO Service is starting.

28/07/2020 08:58:40 INFO Cluster is just starting, system will delete all active disk resources

28/07/2020 08:58:39 INFO sync_replication_node completed

28/07/2020 08:58:39 INFO syncing replication users ok

28/07/2020 08:58:39 INFO syncing cron ok

28/07/2020 08:58:37 INFO CIFSService init action

28/07/2020 08:58:37 INFO sync_replication_node starting

28/07/2020 08:58:37 INFO Starting Config Upload service

28/07/2020 08:58:37 INFO Starting CIFS Service

28/07/2020 08:58:37 INFO Starting petasan tuning service

28/07/2020 08:58:36 INFO Starting sync replication node service

28/07/2020 08:58:36 INFO Starting OSDs

28/07/2020 08:58:36 INFO stderr /dev/sdd: open failed: No medium found

28/07/2020 08:58:36 INFO stderr /dev/sdd: open failed: No medium found

28/07/2020 08:58:36 INFO stdout ceph-33eb9917-3356-4a8d-961b-560d08cb8c82";"1";"1";"0";"wz--n-";"1788.00g";"0g";"0

28/07/2020 08:58:36 INFO stdout ceph-25661c57-e624-40b6-ba30-d23a5e6fc4d2";"1";"1";"0";"wz--n-";"1788.00g";"0g";"0

28/07/2020 08:58:36 INFO Running command: /sbin/vgs --noheadings --readonly --units=g --separator=";" -o vg_name,pv_count,lv_count,snap_count,vg_attr,vg_size,vg_free,vg_free_count

28/07/2020 08:58:36 INFO Starting activating PetaSAN lvs

28/07/2020 08:58:36 INFO stderr /dev/sdd: open failed: No medium found

28/07/2020 08:58:36 INFO stderr /dev/sdd: open failed: No medium found

28/07/2020 08:58:36 INFO stdout ceph-33eb9917-3356-4a8d-961b-560d08cb8c82";"1";"1";"0";"wz--n-";"1788.00g";"0g";"0

28/07/2020 08:58:36 INFO stdout ceph-25661c57-e624-40b6-ba30-d23a5e6fc4d2";"1";"1";"0";"wz--n-";"1788.00g";"0g";"0

28/07/2020 08:58:35 INFO LeaderElectionBase dropping old sessions

28/07/2020 08:58:35 INFO Running command: /sbin/vgs --noheadings --readonly --units=g --separator=";" -o vg_name,pv_count,lv_count,snap_count,vg_attr,vg_size,vg_free,vg_free_count

28/07/2020 08:58:35 INFO Starting activating PetaSAN lvs

28/07/2020 08:58:35 INFO Starting Node Stats Service

28/07/2020 08:58:35 INFO Starting Cluster Management application

28/07/2020 08:58:35 INFO Starting iSCSI Service

28/07/2020 08:58:35 INFO Starting cluster file sync service

28/07/2020 08:58:33 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind 10.11.12.2 -retry-join 10.11.12.1 -retry-join 10.11.12.3

28/07/2020 08:58:22 INFO GlusterFS mount attempt

28/07/2020 08:58:16 INFO Start settings IPs

28/07/2020 06:25:18 INFO GlusterFS mount attempt
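
A side note on the Node Stats error in the NODE2 log above: judging from the traceback, the stats service builds a single multi-line echo command in Graphite's plaintext format (metric value timestamp) and pipes it to the stats server on TCP port 2003 with nc, raising the exception when that shell command fails. Below is a minimal sketch of that kind of sender, assuming a hypothetical send_stats helper; this is not PetaSAN's actual graphite_sender code.

import subprocess
import time

def send_stats(graphite_ip, metrics, port=2003):
    # metrics is a dict like {"PetaSAN.NodeStats.ceph-node2.cpu_all.percent_util": 6.39}
    ts = int(time.time())
    lines = "\n".join("{} {} {}".format(name, value, ts) for name, value in metrics.items())
    # roughly equivalent to: echo "<lines>" | nc -q0 <graphite_ip> <port>
    cmd = 'echo "{}" | nc -q0 {} {}'.format(lines, graphite_ip, port)
    if subprocess.call(cmd, shell=True) != 0:
        # same failure mode as the "Error running echo command" entry in the log
        raise Exception("Error running echo command :" + cmd)

# example usage:
# send_stats("192.168.0.212", {"PetaSAN.NodeStats.ceph-node2.cpu_all.percent_util": 6.39})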

 

NODE3

28/07/2020 08:58:44 INFO PetaSAN cleaned iqns.

28/07/2020 08:58:44 INFO Image image-00004 unmapped successfully.

28/07/2020 08:58:44 INFO LIO deleted Target iqn.2016-05.com.petasan:00004

28/07/2020 08:58:44 INFO LIO deleted backstore image image-00004

28/07/2020 08:58:43 INFO Image image-00005 unmapped successfully.

28/07/2020 08:58:43 INFO LIO deleted Target iqn.2016-05.com.petasan:00005

28/07/2020 08:58:43 INFO LIO deleted backstore image image-00005

28/07/2020 08:58:43 INFO PetaSAN cleaned local paths not locked by this node in consul.

28/07/2020 08:58:43 INFO Cleaned disk path 00004/1.

28/07/2020 08:58:43 INFO Cleaned disk path 00005/2.

28/07/2020 08:58:43 INFO Cleaned disk path 00005/3.

28/07/2020 08:49:19 INFO Path 00005/2 acquired successfully

28/07/2020 08:49:14 INFO The path 00005/2 was locked by ceph-node2.

28/07/2020 08:49:14 INFO Found pool:rbd for disk:00005 via consul

28/07/2020 06:25:13 INFO GlusterFS mount attempt

Thanks in advance for your time!

Can you upgrade to 2.6? We have fixed some bugs relating to this.

Thanks for your answer.
I just upgraded to 2.6 and it works!! Great job!!
I only noticed that once the node that was offline comes back online, it comes up with no assigned paths; however, if I force the auto path assignment with the Path Assignment tool, the paths are assigned fine.

Thanks again!

Yes, PetaSAN does not have a concept of node ownership of resources. If a node fails, its resources are assigned to other nodes; when it comes back, it does not take them back unless, as you stated, you use the path assignment tool. In the future we have plans to assign resources dynamically between nodes based on the load stats we gather.
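
To make that behaviour concrete, here is a rough sketch of the idea, purely illustrative and not PetaSAN's actual code: path assignments are tracked per disk path, a failed node's paths move to the surviving nodes, and a returning node gets nothing back until reassignment is forced from the Path Assignment tool.

def reassign_on_failure(assignments, failed_node, surviving_nodes):
    # Move every path owned by the failed node to the least-loaded surviving node.
    for path, owner in assignments.items():
        if owner == failed_node:
            target = min(surviving_nodes,
                         key=lambda n: sum(1 for o in assignments.values() if o == n))
            assignments[path] = target
    return assignments

def on_node_return(assignments, returning_node):
    # No automatic fail-back: the returning node keeps zero paths until an admin
    # forces reassignment from the Path Assignment tool in the UI.
    return assignments

# Example: three paths, ceph-node2 goes down and later comes back.
paths = {"00004/1": "ceph-node1", "00005/2": "ceph-node2", "00005/3": "ceph-node3"}
paths = reassign_on_failure(paths, "ceph-node2", ["ceph-node1", "ceph-node3"])
paths = on_node_return(paths, "ceph-node2")
print(paths)  # 00005/2 now lives on a surviving node and stays there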

Thanks for the clarification.

This topic can be considered closed.

Thank you.