
iSCSI drives stopping for no reason randomly!

I don't know why, but my iSCSI drives stop randomly. I have to go back in and start them manually.

I don't know much about programming, but maybe this helps:
16/01/2020 10:27:02 INFO Path 00014/1 acquired successfully
16/01/2020 10:27:02 INFO LIO add_target() disk wwn is 83f8f4a000014
16/01/2020 10:26:59 INFO Image image-00014 mapped successfully.
16/01/2020 10:26:58 INFO Found pool:NASSERVER for disk:00014 via consul
ImageNotFound: [errno 2] error opening image image-00012 at snapshot None
File "rbd.pyx", line 2755, in rbd.Image.__init__
size = int(math.ceil(float(rbd.Image(ioctx, image).size()) / float(1024 ** 3)))
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/api.py", line 709, in __get_image_size
Traceback (most recent call last):
16/01/2020 10:26:57 ERROR error opening image image-00012 at snapshot None
16/01/2020 10:26:57 ERROR Get ceph image size image-00012 error
16/01/2020 10:26:57 ERROR Cannot get image meta object from rbd header.
16/01/2020 10:26:55 INFO Could not lock path 00014/3 with session d150bd10-a481-e2e9-709d-7788d0f3ddc6.
16/01/2020 10:26:55 INFO Found pool:NASSERVER for disk:00014 via consul
ImageNotFound: [errno 2] error opening image image-00012 at snapshot None
File "rbd.pyx", line 2755, in rbd.Image.__init__
size = int(math.ceil(float(rbd.Image(ioctx, image).size()) / float(1024 ** 3)))
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/api.py", line 709, in __get_image_size
Traceback (most recent call last):
16/01/2020 10:26:54 ERROR error opening image image-00012 at snapshot None
16/01/2020 10:26:54 ERROR Get ceph image size image-00012 error
16/01/2020 10:26:54 ERROR Cannot get image meta object from rbd header.
16/01/2020 10:26:52 INFO Could not lock path 00014/2 with session d150bd10-a481-e2e9-709d-7788d0f3ddc6.
16/01/2020 10:26:51 INFO Found pool:NASSERVER for disk:00014 via consul
ImageNotFound: [errno 2] error opening image image-00012 at snapshot None
File "rbd.pyx", line 2755, in rbd.Image.__init__
size = int(math.ceil(float(rbd.Image(ioctx, image).size()) / float(1024 ** 3)))
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/api.py", line 709, in __get_image_size
Traceback (most recent call last):
16/01/2020 10:26:50 ERROR error opening image image-00012 at snapshot None
16/01/2020 10:26:50 ERROR Get ceph image size image-00012 error
16/01/2020 10:26:50 ERROR Cannot get image meta object from rbd header.
16/01/2020 10:26:48 INFO Path 00013/1 acquired successfully
16/01/2020 10:26:48 INFO LIO add_target() disk wwn is 00013
16/01/2020 10:26:45 INFO Image image-00013 mapped successfully.
16/01/2020 10:26:45 INFO Found pool:NASSERVER for disk:00013 via consul
ImageNotFound: [errno 2] error opening image image-00012 at snapshot None
File "rbd.pyx", line 2755, in rbd.Image.__init__
size = int(math.ceil(float(rbd.Image(ioctx, image).size()) / float(1024 ** 3)))
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/api.py", line 709, in __get_image_size
Traceback (most recent call last):
16/01/2020 10:26:45 ERROR error opening image image-00012 at snapshot None
16/01/2020 10:26:45 ERROR Get ceph image size image-00012 error
16/01/2020 10:26:45 ERROR Cannot get image meta object from rbd header.
16/01/2020 10:26:43 INFO Could not lock path 00013/2 with session d150bd10-a481-e2e9-709d-7788d0f3ddc6.
16/01/2020 10:26:43 INFO Found pool:NASSERVER for disk:00013 via consul
ImageNotFound: [errno 2] error opening image image-00012 at snapshot None
File "rbd.pyx", line 2755, in rbd.Image.__init__
size = int(math.ceil(float(rbd.Image(ioctx, image).size()) / float(1024 ** 3)))
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/api.py", line 709, in __get_image_size
Traceback (most recent call last):
16/01/2020 10:26:42 ERROR error opening image image-00012 at snapshot None
16/01/2020 10:26:42 ERROR Get ceph image size image-00012 error
16/01/2020 10:26:42 ERROR Cannot get image meta object from rbd header.
16/01/2020 06:25:19 INFO GlusterFS mount attempt
16/01/2020 01:52:53 INFO LeaderElectionBase successfully dropped old sessions
16/01/2020 01:52:47 INFO sync_replication_node completed
16/01/2020 01:52:47 INFO syncing replication users ok
16/01/2020 01:52:46 INFO syncing cron ok
16/01/2020 01:52:46 INFO Cluster is just starting, system will delete all active disk resources
16/01/2020 01:52:46 INFO sync_replication_node starting
16/01/2020 01:52:45 INFO Starting petasan tuning service
16/01/2020 01:52:45 INFO Starting sync replication node service
16/01/2020 01:52:45 INFO Starting OSDs
16/01/2020 01:52:44 INFO stdout ceph-9e7c4324-810d-48e7-8873-ea056332d5a4";"1";"1";"0";"wz--n-";"11176.00g";"0g";"0
16/01/2020 01:52:44 INFO Running command: /sbin/vgs --noheadings --readonly --units=g --separator=";" -o vg_name,pv_count,lv_count,snap_count,vg_attr,vg_size,vg_free,vg_free_count
16/01/2020 01:52:44 INFO Starting activating PetaSAN lvs
16/01/2020 01:52:44 INFO stdout ceph-9e7c4324-810d-48e7-8873-ea056332d5a4";"1";"1";"0";"wz--n-";"11176.00g";"0g";"0
16/01/2020 01:52:44 INFO Running command: /sbin/vgs --noheadings --readonly --units=g --separator=";" -o vg_name,pv_count,lv_count,snap_count,vg_attr,vg_size,vg_free,vg_free_count
16/01/2020 01:52:44 INFO Starting activating PetaSAN lvs
16/01/2020 01:52:44 INFO Service is starting.
16/01/2020 01:52:43 INFO LeaderElectionBase dropping old sessions
16/01/2020 01:52:43 INFO Starting Node Stats Service
16/01/2020 01:52:43 INFO Starting Cluster Management application
16/01/2020 01:52:43 INFO Starting iSCSI Service
16/01/2020 01:52:43 INFO Starting cluster file sync service
16/01/2020 01:52:39 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind 10.0.1.106 -retry-join 10.0.1.107 -retry-join 10.0.1.105
16/01/2020 01:52:31 INFO GlusterFS mount attempt
16/01/2020 01:52:26 INFO Start settings IPs
15/01/2020 06:25:31 INFO GlusterFS mount attempt
14/01/2020 09:13:11 INFO Path 00013/2 acquired successfully
14/01/2020 09:13:11 INFO LIO add_target() disk wwn is 00013
14/01/2020 09:13:08 INFO Image image-00013 mapped successfully.
14/01/2020 09:13:07 INFO Found pool:NASSERVER for disk:00013 via consul
ImageNotFound: [errno 2] error opening image image-00012 at snapshot None
File "rbd.pyx", line 2755, in rbd.Image.__init__
size = int(math.ceil(float(rbd.Image(ioctx, image).size()) / float(1024 ** 3)))
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/api.py", line 709, in __get_image_size
Traceback (most recent call last):
14/01/2020 09:13:06 ERROR error opening image image-00012 at snapshot None
14/01/2020 09:13:06 ERROR Get ceph image size image-00012 error
14/01/2020 09:13:06 ERROR Cannot get image meta object from rbd header.

Is image-00012 valid? Can you run:

rbd info image-00012

Image 00012 is a drive I've tried to delete over and over, and it never deletes. I don't mind that, as long as it isn't causing a problem; however, the other drives that are online stop randomly. I think this log helps; I found it on my node 2:

 

19/01/2020 15:16:09 INFO /opt/petasan/scripts/util/benchmark.py storage

19/01/2020 15:16:09 INFO script

19/01/2020 15:15:48 INFO Path 00014/1 acquired successfully

19/01/2020 15:15:48 INFO LIO add_target() disk wwn is 83f8f4a000014

19/01/2020 15:15:45 INFO Image image-00014 mapped successfully.

19/01/2020 15:15:44 INFO Found pool:NASSERVER for disk:00014 via consul

ImageNotFound: [errno 2] error opening image image-00012 at snapshot None

File "rbd.pyx", line 2755, in rbd.Image.__init__

size = int(math.ceil(float(rbd.Image(ioctx, image).size()) / float(1024 ** 3)))

File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/api.py", line 709, in __get_image_size

Traceback (most recent call last):

19/01/2020 15:15:43 ERROR error opening image image-00012 at snapshot None

19/01/2020 15:15:43 ERROR Get ceph image size image-00012 error

19/01/2020 15:15:43 ERROR Cannot get image meta object from rbd header.

19/01/2020 15:15:41 INFO Could not lock path 00014/3 with session d150bd10-a481-e2e9-709d-7788d0f3ddc6.

19/01/2020 15:15:41 INFO Found pool:NASSERVER for disk:00014 via consul

ImageNotFound: [errno 2] error opening image image-00012 at snapshot None

File "rbd.pyx", line 2755, in rbd.Image.__init__

size = int(math.ceil(float(rbd.Image(ioctx, image).size()) / float(1024 ** 3)))

File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/api.py", line 709, in __get_image_size

Traceback (most recent call last):

19/01/2020 15:15:40 ERROR error opening image image-00012 at snapshot None

19/01/2020 15:15:40 ERROR Get ceph image size image-00012 error

19/01/2020 15:15:40 ERROR Cannot get image meta object from rbd header.

19/01/2020 15:15:38 INFO Could not lock path 00014/2 with session d150bd10-a481-e2e9-709d-7788d0f3ddc6.

19/01/2020 15:15:38 INFO Found pool:NASSERVER for disk:00014 via consul

ImageNotFound: [errno 2] error opening image image-00012 at snapshot None

File "rbd.pyx", line 2755, in rbd.Image.__init__

size = int(math.ceil(float(rbd.Image(ioctx, image).size()) / float(1024 ** 3)))

File "/usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/api.py", line 709, in __get_image_size

Traceback (most recent call last):

19/01/2020 15:15:37 ERROR error opening image image-00012 at snapshot None

19/01/2020 15:15:37 ERROR Get ceph image size image-00012 error

19/01/2020 15:15:36 ERROR Cannot get image meta object from rbd header.

19/01/2020 15:14:47 INFO PetaSAN removed key of stopped disk 00014 from consul.

19/01/2020 15:14:47 INFO Found pool:NASSERVER for disk:00014 via consul

19/01/2020 15:14:45 INFO Image image-00014 unmapped successfully.

19/01/2020 15:14:45 INFO LIO deleted Target iqn.2016-05.com.petasan:00014

19/01/2020 15:14:45 INFO LIO deleted backstore image image-00014

19/01/2020 15:14:45 ERROR Could not find ips for image-00014

19/01/2020 15:14:45 INFO Stopping disk 00014

19/01/2020 15:14:45 INFO PetaSAN cleaned local paths not locked by this node in consul.

19/01/2020 15:14:45 INFO Cleaned disk path 00014/2.

19/01/2020 06:25:22 INFO GlusterFS mount attempt

What version are you using?
How large was image-00012? It could take several hours to delete a large image.
The image-00012 error logs would happen if the image was deleted from Ceph via the CLI while its iSCSI layer was running. To clean up these errors:
consul kv delete -recurse PetaSAN/Disks/00012
What is the output of:
rbd info -p NASSERVER image-00014
/opt/petasan/scripts/util/disk_meta.py read --image image-00014 --pool NASSERVER
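For reference, the `__get_image_size` frame in the pasted tracebacks is only a bytes-to-GiB conversion; the actual failure is the `rbd.Image()` open itself raising `ImageNotFound`, not the arithmetic. A minimal sketch of what that line computes:

```python
import math

def bytes_to_gib(size_bytes):
    """Round an image size in bytes up to whole GiB, mirroring the
    int(math.ceil(float(size) / float(1024 ** 3))) expression in the traceback."""
    return int(math.ceil(float(size_bytes) / float(1024 ** 3)))

# A 4 TiB image reports 4096 GiB; one byte over a boundary rounds up.
print(bytes_to_gib(4 * 1024 ** 4))   # 4096
print(bytes_to_gib(1024 ** 3 + 1))   # 2
```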

It just stopped again. Here are the logs it gave me:

20/01/2020 18:09:30 INFO LeaderElectionBase successfully dropped old sessions

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.throughput.eth1_received 61.44 `date +%s`" | nc -q0 10.0.0.107 2003

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.percent_util.eth1 0.0 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.throughput.eth0_transmitted 16230.4 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.throughput.eth0_received 166205.44 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.percent_util.eth0 0.13 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.throughput.Management_transmitted 16230.4 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.throughput.Management_received 166266.88 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.percent_util.Management 0.13 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.memory.percent_util 4.27 `date +%s`" "

Exception: Error running echo command :echo "PetaSAN.NodeStats.MM-0242-NAS-02.cpu_all.percent_util 17.35 `date +%s`" "

raise Exception("Error running echo command :" + cmd)

File "/usr/lib/python2.7/dist-packages/PetaSAN/core/common/graphite_sender.py", line 59, in send

graphite_sender.send(leader_ip)

File "/opt/petasan/scripts/node_stats.py", line 64, in get_stats

get_stats()

File "/opt/petasan/scripts/node_stats.py", line 159, in <module>

Traceback (most recent call last):

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.throughput.eth1_received 61.44 `date +%s`" | nc -q0 10.0.0.107 2003

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.percent_util.eth1 0.0 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.throughput.eth0_transmitted 16230.4 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.throughput.eth0_received 166205.44 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.percent_util.eth0 0.13 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.throughput.Management_transmitted 16230.4 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.throughput.Management_received 166266.88 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.ifaces.percent_util.Management 0.13 `date +%s`" "

PetaSAN.NodeStats.MM-0242-NAS-02.memory.percent_util 4.27 `date +%s`" "

20/01/2020 18:09:23 ERROR Error running echo command :echo "PetaSAN.NodeStats.MM-0242-NAS-02.cpu_all.percent_util 17.35 `date +%s`" "

20/01/2020 18:09:23 ERROR Node Stats exception.

20/01/2020 18:09:23 INFO sync_replication_node completed

20/01/2020 18:09:23 INFO syncing replication users ok

20/01/2020 18:09:22 INFO syncing cron ok

20/01/2020 18:09:22 INFO Cluster is just starting, system will delete all active disk resources

20/01/2020 18:09:22 INFO sync_replication_node starting

20/01/2020 18:09:21 INFO Starting petasan tuning service

20/01/2020 18:09:21 INFO Starting sync replication node service

20/01/2020 18:09:21 INFO Starting OSDs

20/01/2020 18:09:20 INFO stdout ceph-9e7c4324-810d-48e7-8873-ea056332d5a4";"1";"1";"0";"wz--n-";"11176.00g";"0g";"0

20/01/2020 18:09:20 INFO Running command: /sbin/vgs --noheadings --readonly --units=g --separator=";" -o vg_name,pv_count,lv_count,snap_count,vg_attr,vg_size,vg_free,vg_free_count

20/01/2020 18:09:20 INFO Service is starting.

20/01/2020 18:09:20 INFO Starting activating PetaSAN lvs

20/01/2020 18:09:20 INFO stdout ceph-9e7c4324-810d-48e7-8873-ea056332d5a4";"1";"1";"0";"wz--n-";"11176.00g";"0g";"0

20/01/2020 18:09:20 INFO Running command: /sbin/vgs --noheadings --readonly --units=g --separator=";" -o vg_name,pv_count,lv_count,snap_count,vg_attr,vg_size,vg_free,vg_free_count

20/01/2020 18:09:20 INFO Starting activating PetaSAN lvs

20/01/2020 18:09:20 INFO LeaderElectionBase dropping old sessions

20/01/2020 18:09:19 INFO Starting Node Stats Service

20/01/2020 18:09:19 INFO Starting Cluster Management application

20/01/2020 18:09:19 INFO Starting iSCSI Service

20/01/2020 18:09:19 INFO Starting cluster file sync service

20/01/2020 18:09:15 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind 10.0.1.106 -retry-join 10.0.1.107 -retry-join 10.0.1.105

20/01/2020 18:09:06 INFO GlusterFS mount attempt

20/01/2020 18:09:01 INFO Start settings IPs

20/01/2020 06:25:29 INFO GlusterFS mount attempt
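The Node Stats traceback above comes from PetaSAN building an `echo "<path> <value> <timestamp>" | nc -q0 10.0.0.107 2003` pipeline: port 2003 is the Graphite plaintext protocol, one `metric.path value unix_timestamp` line per datapoint. A minimal sketch of building such a line (the actual socket send is left commented out, since it needs a live carbon host; the host and port here are just the values from the log):

```python
import time

def graphite_line(path, value, timestamp=None):
    """Format one Graphite plaintext-protocol datapoint:
    '<metric.path> <value> <unix_timestamp>\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

line = graphite_line(
    "PetaSAN.NodeStats.MM-0242-NAS-02.memory.percent_util", 4.27, 1579543763)
print(line.strip())

# Equivalent of the `| nc -q0 10.0.0.107 2003` in the log (requires a live host):
# import socket
# with socket.create_connection(("10.0.0.107", 2003), timeout=5) as s:
#     s.sendall(line.encode())
```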

Can you please reply to my previous post?

I am not sure if the logs you show are from when you just rebooted a node, which would be OK, or from during normal operation. If the latter, it could be a network or hardware/power issue. What hardware do you have: OSD type/count, RAM, network speed?