
OSD and iSCSI working unstably!

Hello!

I have two OSD nodes and one MON instance installed; about once every 24 hours the iSCSI connection crashes. I have checked the logs at /opt/petasan/log/PetaSAN.log, and as far as I can tell the issue is that OSD2 crashes. However, I do not understand why iSCSI also crashes, since I have two paths per LUN: one path on OSD1 and the second on OSD2.

Please can you help me debug the issue?

OSD1

28/03/2019 01:07:28 INFO get object from MangePathAssignment.
28/03/2019 01:07:29 INFO call get assignments stats function.
28/03/2019 01:07:38 INFO get object from MangePathAssignment.
28/03/2019 01:07:44 INFO call get assignments stats function.
28/03/2019 01:07:48 INFO get object from MangePathAssignment.
28/03/2019 01:07:54 INFO call get assignments stats function.
28/03/2019 01:07:58 INFO get object from MangePathAssignment.
28/03/2019 01:08:04 INFO call get assignments stats function.
28/03/2019 01:08:25 WARNING , retrying in 1 seconds...
28/03/2019 01:08:25 WARNING , retrying in 1 seconds...
28/03/2019 01:08:25 WARNING , retrying in 1 seconds...
28/03/2019 01:08:25 WARNING , retrying in 1 seconds...
28/03/2019 01:08:25 WARNING , retrying in 1 seconds...
28/03/2019 01:08:26 WARNING , retrying in 1 seconds...
28/03/2019 01:08:26 INFO get object from MangePathAssignment.
28/03/2019 01:08:27 INFO get object from MangePathAssignment.
28/03/2019 01:08:28 INFO get object from MangePathAssignment.
28/03/2019 01:08:31 INFO call get assignments stats function.
28/03/2019 01:08:32 INFO call get assignments stats function.
28/03/2019 01:08:34 INFO call get assignments stats function.
28/03/2019 01:08:38 INFO get object from MangePathAssignment.
28/03/2019 01:08:39 INFO Found pool:rbd for disk:00002 via consul
28/03/2019 01:08:44 INFO call get assignments stats function.
28/03/2019 01:08:48 INFO get object from MangePathAssignment.
28/03/2019 01:08:54 INFO call get assignments stats function.
28/03/2019 01:08:58 INFO get object from MangePathAssignment.
28/03/2019 01:09:04 INFO call get assignments stats function.
28/03/2019 01:09:08 INFO get object from MangePathAssignment.

OSD2:

28/03/2019 01:09:55 ERROR 500 node is not the leader
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 642, in handle_cluster_startup
result = consul_api.set_leader_startup_time(current_node_name, str(i))
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/api.py", line 372, in set_leader_startup_time
return consul_obj.kv.put(ConfigAPI().get_consul_leaders_path()+node_name,minutes)
File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 459, in put
'/v1/kv/%s' % key, params=params, data=value)
File "/usr/local/lib/python2.7/dist-packages/retry/compat.py", line 16, in wrapper
return caller(f, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 74, in retry_decorator
logger)
File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 33, in __retry_internal
return f()
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/ps_consul.py", line 83, in put
return callback(res)
File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 111, in cb
raise ConsulException("%d %s" % (response.code, response.body))
ConsulException: 500 node is not the leader
28/03/2019 06:11:32 INFO Start settings IPs
28/03/2019 06:11:33 INFO GlusterFS mount attempt
28/03/2019 06:11:52 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind 10.129.2.194 -retry-join 10.129.2.193 -retry-join 10.129.2.195
28/03/2019 06:11:56 INFO Starting cluster file sync service
28/03/2019 06:11:56 INFO Starting iSCSI Service
28/03/2019 06:11:56 INFO Starting Cluster Management application
28/03/2019 06:11:56 INFO Starting Node Stats Service
28/03/2019 06:11:56 INFO Starting OSDs
28/03/2019 06:11:56 INFO Service is starting.
28/03/2019 06:11:56 INFO Cleaning unused configurations.
28/03/2019 06:11:56 INFO Cleaning all mapped disks
28/03/2019 06:11:56 INFO Cleaning unused rbd images.
28/03/2019 06:11:56 INFO Cleaning unused ips.

MON1:

28/03/2019 00:43:08 INFO {u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
28/03/2019 00:53:09 INFO {u'sda': u'Passed'}
28/03/2019 00:53:10 INFO {u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
28/03/2019 00:53:11 INFO {u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
28/03/2019 01:03:12 INFO {u'sda': u'Passed'}
28/03/2019 01:03:12 INFO {u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
28/03/2019 01:03:13 INFO {u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
28/03/2019 01:08:29 WARNING , retrying in 1 seconds...
28/03/2019 01:08:29 WARNING , retrying in 1 seconds...
28/03/2019 01:08:29 WARNING , retrying in 1 seconds...
28/03/2019 01:13:31 INFO {u'sda': u'Passed'}
28/03/2019 01:13:32 INFO {u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
28/03/2019 01:23:33 INFO {u'sda': u'Passed'}
28/03/2019 01:23:33 INFO {u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
28/03/2019 01:33:35 INFO {u'sda': u'Passed'}
28/03/2019 01:33:35 INFO {u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
28/03/2019 01:43:36 INFO {u'sda': u'Passed'}
28/03/2019 01:43:37 INFO {u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
28/03/2019 01:53:38 INFO {u'sda': u'Passed'}

When you say crash, do you mean you see your iSCSI disk lose connection at the client, or do you mean the PetaSAN nodes stop/restart?

How many OSDs do you have in each of your 2 nodes?

For the "28/03/2019 01:09:55 ERROR 500 node is not the leader" on OSD2: was this node just starting, and if so, was it rebooted?

If you connect to the UI application on node OSD2, does it respond fine? Can you navigate to the Node List page and see all nodes? Do you see any "ConsulException: 500 node is not the leader" entries in the log file on this node other than the ones you showed?
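
For example, a quick check from the shell (the log path and error string are the same ones quoted above):

# Show the most recent occurrences and a total count of this error on the node
grep -n "node is not the leader" /opt/petasan/log/PetaSAN.log | tail -n 20
grep -c "node is not the leader" /opt/petasan/log/PetaSAN.log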

Hi!

Just 30 minutes ago I got the error again.

Yes, VMware loses connection via iSCSI; however, it is still possible to ping the iSCSI IP addresses.

I have one OSD on each OSD host and one host with only Monitor functionality;

For the "28/03/2019 01:09:55 ERROR 500 node is not the leader" on OSD2: was this node just starting, and if so, was it rebooted?

PetaSAN is still running (it seems there are no problems at all), and if I execute "ceph status -c /etc/ceph/xxx.conf" it reports that the status is OK:

  cluster:
    id:     e443b5a2-152d-4f10-b8d0-cc5666852905
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum svl-cephosd01,svl-cephosd02,svl-cephmon01
    mgr: svl-cephosd02(active), standbys: svl-cephosd01
    osd: 2 osds: 2 up, 2 in

  data:
    pools:   1 pools, 256 pgs
    objects: 1246k objects, 4983 GB
    usage:   9966 GB used, 96529 GB / 103 TB avail
    pgs:     255 active+clean
             1   active+clean+scrubbing+deep

  io:
    client: 46469 B/s rd, 20463 B/s wr, 1 op/s rd, 8 op/s wr

 

So as far as I can tell, the Ceph part is working OK, but the iSCSI agent is not.

Yes, the GUI and SSH all work without any issue.

For now I have set up a ping from the VMware host to the PetaSAN iSCSI IP addresses, to see whether it also goes down during a crash.
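
A simple way to log only the failures with timestamps, as a sketch for a Linux host on the same network (the two path IPs are taken from the log excerpts below; from ESXi itself, vmkping would be the rough equivalent):

# Path IPs taken from the log excerpts; adjust to your own iSCSI path IPs
for ip in 10.129.2.129 10.129.2.130; do
    ( while true; do
          ping -c 1 -W 2 "$ip" > /dev/null 2>&1 || \
              echo "$(date '+%d/%m/%Y %H:%M:%S') ping to $ip FAILED" >> /tmp/iscsi_ping.log
          sleep 5
      done ) &
done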

It seems this error is what is causing iSCSI to be unavailable:

28/03/2019 09:10:16 INFO call get assignments stats function.
28/03/2019 09:10:26 INFO get object from MangePathAssignment.
28/03/2019 09:10:26 INFO call get assignments stats function.
28/03/2019 09:10:36 INFO get object from MangePathAssignment.
28/03/2019 09:10:36 INFO call get assignments stats function.
28/03/2019 11:21:24 INFO Reassignment paths script invoked to run clean action.
28/03/2019 11:21:24 INFO Updating path 10.129.2.129 status to 0
28/03/2019 11:21:24 INFO Path 10.129.2.129 status updated to 0
28/03/2019 11:21:24 INFO Found pool:rbd for disk:00001 via consul
28/03/2019 11:21:24 INFO Move action,cleaned disk 00001 path 0.
28/03/2019 11:21:24 INFO Move action,clean newtwork config for disk 00001 path 0.
28/03/2019 11:21:24 INFO {}
28/03/2019 11:21:24 INFO Move action,release disk 00001 path 1.
28/03/2019 11:21:31 INFO Reassignment paths script invoked to run clean action.
28/03/2019 11:21:31 INFO Updating path 10.129.2.130 status to 0
28/03/2019 11:21:31 INFO Path 10.129.2.130 status updated to 0
28/03/2019 11:21:31 INFO Found pool:rbd for disk:00002 via consul
28/03/2019 11:21:31 INFO Move action,cleaned disk 00002 path 0.
28/03/2019 11:21:31 INFO Move action,clean newtwork config for disk 00002 path 0.
28/03/2019 11:21:31 INFO {}
28/03/2019 11:21:31 INFO Move action,release disk 00002 path 1.
28/03/2019 11:22:48 ERROR Node Stats exception.
28/03/2019 11:22:48 ERROR Error running echo command :echo 'PetaSAN.NodeStats.svl-cephosd01.cpu_all.percent_util 10.84' `date +%s` | nc -q0 10.129.2.74 2003
Traceback (most recent call last):
File "/opt/petasan/scripts/node_stats.py", line 137, in <module>
get_stats()
File "/opt/petasan/scripts/node_stats.py", line 54, in get_stats
_get_cpu(sar_result)
File "/opt/petasan/scripts/node_stats.py", line 74, in _get_cpu
_send_graphite(path_key, val)
File "/opt/petasan/scripts/node_stats.py", line 50, in _send_graphite
raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo 'PetaSAN.NodeStats.svl-cephosd01.cpu_all.percent_util 10.84' `date +%s` | nc -q0 10.129.2.74 2003
28/03/2019 11:23:51 ERROR Node Stats exception.
28/03/2019 11:23:51 ERROR Error running echo command :echo 'PetaSAN.NodeStats.svl-cephosd01.cpu_all.percent_util 15.59' `date +%s` | nc -q0 10.129.2.74 2003
Traceback (most recent call last):
File "/opt/petasan/scripts/node_stats.py", line 137, in <module>
get_stats()
File "/opt/petasan/scripts/node_stats.py", line 54, in get_stats
_get_cpu(sar_result)
File "/opt/petasan/scripts/node_stats.py", line 74, in _get_cpu
_send_graphite(path_key, val)
File "/opt/petasan/scripts/node_stats.py", line 50, in _send_graphite
raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo 'PetaSAN.NodeStats.svl-cephosd01.cpu_all.percent_util 15.59' `date +%s` | nc -q0 10.129.2.74 2003
28/03/2019 11:24:18 INFO Found pool:rbd for disk:00002 via consul
28/03/2019 11:24:18 INFO Could not lock path 00002/1 with session 9d43604d-bf35-bcce-76a1-710f56beaa0a.
28/03/2019 11:24:25 INFO Found pool:rbd for disk:00002 via consul
28/03/2019 11:24:25 INFO The path 00002/1 was locked by svl-cephosd02.
28/03/2019 11:24:26 INFO This node will stop node svl-cephosd02/10.129.2.226.
28/03/2019 11:24:26 INFO Path 00002/1 acquired successfully
28/03/2019 11:24:48 INFO Found pool:rbd for disk:00001 via consul
28/03/2019 11:24:48 INFO Could not lock path 00001/1 with session 9d43604d-bf35-bcce-76a1-710f56beaa0a.
28/03/2019 11:24:55 ERROR Node Stats exception.
28/03/2019 11:24:55 ERROR Error running echo command :echo 'PetaSAN.NodeStats.svl-cephosd01.cpu_all.percent_util 0.92' `date +%s` | nc -q0 10.129.2.70 2003
Traceback (most recent call last):
File "/opt/petasan/scripts/node_stats.py", line 137, in <module>
get_stats()
File "/opt/petasan/scripts/node_stats.py", line 54, in get_stats
_get_cpu(sar_result)
File "/opt/petasan/scripts/node_stats.py", line 74, in _get_cpu
_send_graphite(path_key, val)
File "/opt/petasan/scripts/node_stats.py", line 50, in _send_graphite
raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo 'PetaSAN.NodeStats.svl-cephosd01.cpu_all.percent_util 0.92' `date +%s` | nc -q0 10.129.2.70 2003
28/03/2019 11:24:55 INFO Found pool:rbd for disk:00001 via consul
28/03/2019 11:24:56 INFO The path 00001/1 was locked by svl-cephosd02.
28/03/2019 11:25:01 INFO Path 00001/1 acquired successfully
28/03/2019 11:28:03 INFO Start settings IPs
28/03/2019 11:28:03 INFO GlusterFS mount attempt
28/03/2019 11:28:03 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind 10.129.2.193 -retry-join 10.129.2.194 -retry-join 10.129.2.195
28/03/2019 11:28:07 INFO Starting cluster file sync service
28/03/2019 11:28:07 INFO Starting iSCSI Service
28/03/2019 11:28:07 INFO Starting Cluster Management application
28/03/2019 11:28:07 INFO Starting Node Stats Service
28/03/2019 11:28:07 INFO Starting OSDs
28/03/2019 11:28:07 INFO Service is starting.
28/03/2019 11:28:07 INFO Cleaning unused configurations.
28/03/2019 11:28:07 INFO Cleaning all mapped disks
28/03/2019 11:28:07 INFO Cleaning unused rbd images.
28/03/2019 11:28:13 INFO Cleaning unused ips.

Running a cluster with only 2 OSDs can lead to issues. The status and health of OSDs is based on them checking one another, but you need a couple of OSDs to agree on the status of another OSD. Having only 2 breaks this process: Ceph will not be able to detect the status of the OSDs, as it does not want a situation where 2 OSDs each report the other as failed.
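
For reference, standard Ceph commands will show what the cluster currently believes about OSD state; they can be run with the same -c conf file used for ceph status above:

# What Ceph currently reports about OSD up/down state and cluster health
ceph osd tree -c /etc/ceph/xxx.conf
ceph osd stat -c /etc/ceph/xxx.conf
ceph health detail -c /etc/ceph/xxx.conf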

"I am not getting why iSCSI also crashes, as I have two paths per LUN: one path on OSD1 and the second on OSD2"

The paths are connections from the client to the iSCSI service, not to an OSD; they provide HA at the iSCSI layer. The iSCSI service in turn talks to the entire Ceph cluster, not to a specific OSD, as Ceph handles HA of the OSDs itself. The iSCSI layer will freeze if the Ceph cluster/OSD freezes. This can occur if one of your 2 OSDs goes down: because you only have 2, Ceph will not flag the OSD as down, your iSCSI will freeze waiting for the OSD to respond, and your ESXi will time out.

To fix: add 2 more OSD disks so you have a total of 4 (this is still well below our recommendations, so the more the better). Also check whether you have enough hardware resources for your load: benchmark your current cluster IOPS/throughput and look at your disk and CPU % busy.
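
As a rough illustration only, using standard tools (rados bench and sysstat); the pool name rbd comes from the ceph status output above, and the bench writes temporary objects into that pool which the cleanup step removes:

# 10-second write benchmark against the rbd pool
rados bench -p rbd 10 write --no-cleanup -c /etc/ceph/xxx.conf
rados -p rbd cleanup -c /etc/ceph/xxx.conf

# Disk %util and CPU busy, sampled every 5 seconds
iostat -x 5
sar -u 5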

Thank you for the recommendations. I will add more OSDs and try again, and will come back with results. Thank you!

Is there any log to read about iSCSI status, and is there an iSCSI service that can be started/stopped/reloaded?

O.

Hello!

I have added 4 more OSDs. One OSD server seems to work fine (it has worked for more than a month with no problem); however, the second OSD server works OK for about a week and then unexpectedly shuts down. And that is only part of the case: when OSD2 shuts down, it also somehow shuts down the third host (which is MON only, without any OSDs).

Is there a way to find out why OSD2 is shutting down, and why the monitor is also shutting down with OSD2?

It seems things are better, but maybe the hardware is still underpowered. Can you check your specs against our recommended hardware guide? Also, if you look at the charts for % util of disks, CPU, RAM and network, are any getting over 90% busy? We do fencing to shut down nodes running the iSCSI service if they fail to report a heartbeat back to the cluster. You can switch this off temporarily via the maintenance page and see whether you still get shutdowns. You can also look at /opt/petasan/log/PetaSAN.log; it should have a record on the initiating node of the decision to shut down the other node. Typically fencing occurs when there is a hardware issue, a network issue, or, in many cases, when your nodes or network are simply too busy, which prevents the heartbeat being sent on time.
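
The fencing decision shows up in the earlier log excerpts as "This node will stop node ...", so something like this on each node should locate the record and the Consul retry storm around it:

# On each node, look for the fencing decision and the retries leading up to it
grep -n "will stop node" /opt/petasan/log/PetaSAN.log
grep -n "retrying in" /opt/petasan/log/PetaSAN.log | tail -n 40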

//////////////////////////////
///////////////MON server:
//////////////////////////////

27/04/2019 06:45:04 INFO {u'sdd': u'Passed', u'sde': u'Passed', u'sdf': u'Passed', u'sdg': u'Passed', u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
27/04/2019 06:45:05 INFO {u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
27/04/2019 06:55:06 INFO {u'sda': u'Passed', u'sdb': u'Passed'}
27/04/2019 06:55:07 INFO {u'sdd': u'Passed', u'sde': u'Passed', u'sdf': u'Passed', u'sdg': u'Passed', u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
27/04/2019 06:55:08 INFO {u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
27/04/2019 07:05:09 INFO {u'sda': u'Passed', u'sdb': u'Passed'}
27/04/2019 07:05:10 INFO {u'sdd': u'Passed', u'sde': u'Passed', u'sdf': u'Passed', u'sdg': u'Passed', u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
27/04/2019 07:05:10 INFO {u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
////////// I suppose the logs before this point are OK /////////////
27/04/2019 07:24:03 WARNING , retrying in 1 seconds...
27/04/2019 07:24:03 WARNING , retrying in 1 seconds...
27/04/2019 07:32:09 WARNING , retrying in 1 seconds...
27/04/2019 07:32:09 WARNING , retrying in 2 seconds...
27/04/2019 07:32:09 WARNING , retrying in 2 seconds...
27/04/2019 07:32:09 WARNING , retrying in 1 seconds...
27/04/2019 07:32:13 INFO ClusterLeader stop action
27/04/2019 07:32:14 INFO {u'sda': u'Passed', u'sdb': u'Passed'}
27/04/2019 07:32:15 INFO {u'sdd': u'Passed', u'sde': u'Passed', u'sdf': u'Passed', u'sdg': u'Passed', u'sda': u'Passed', u'sdb': u'Passed', u'sdc': u'Passed'}
27/04/2019 07:32:26 INFO Cleaned disk path 00001/2.
27/04/2019 07:32:26 INFO Cleaned disk path 00001/1.
27/04/2019 07:32:26 INFO Cleaned disk path 00002/1.
27/04/2019 07:32:26 INFO Cleaned disk path 00002/2.
27/04/2019 07:32:26 INFO PetaSAN cleaned local paths not locked by this node in consul.
27/04/2019 07:32:36 INFO LIO deleted backstore image image-00001
27/04/2019 07:32:36 INFO LIO deleted Target iqn.2016-05.com.petasan:00001
27/04/2019 07:32:36 INFO Image image-00001 unmapped successfully.
27/04/2019 07:32:36 INFO LIO deleted backstore image image-00002
27/04/2019 07:32:36 INFO LIO deleted Target iqn.2016-05.com.petasan:00002
27/04/2019 07:32:36 INFO Image image-00002 unmapped successfully.
27/04/2019 07:32:36 INFO PetaSAN cleaned iqns.
27/04/2019 07:32:51 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:32:59 ERROR Node Stats exception.
27/04/2019 07:32:59 ERROR Error running echo command :echo 'PetaSAN.NodeStats.svl-cephmon01.cpu_all.percent_util 1.01' `date +%s` | nc -q0 10.129.2.74 2003
Traceback (most recent call last):
File "/opt/petasan/scripts/node_stats.py", line 137, in <module>
get_stats()
File "/opt/petasan/scripts/node_stats.py", line 54, in get_stats
_get_cpu(sar_result)
File "/opt/petasan/scripts/node_stats.py", line 74, in _get_cpu
_send_graphite(path_key, val)
File "/opt/petasan/scripts/node_stats.py", line 50, in _send_graphite
raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo 'PetaSAN.NodeStats.svl-cephmon01.cpu_all.percent_util 1.01' `date +%s` | nc -q0 10.129.2.74 2003
27/04/2019 07:33:01 WARNING , retrying in 1 seconds...
27/04/2019 07:33:02 WARNING , retrying in 2 seconds...
27/04/2019 07:33:04 WARNING , retrying in 4 seconds...
27/04/2019 07:33:08 WARNING , retrying in 8 seconds...
(last log before shutdown)

//////////////////////////////
///////////////OSD1 server logs on 27.04.2019:
//////////////////////////////

27/04/2019 07:15:29 WARNING , retrying in 1 seconds...
27/04/2019 07:15:29 WARNING , retrying in 1 seconds...
27/04/2019 07:15:29 WARNING , retrying in 1 seconds...
27/04/2019 07:15:32 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:32 INFO Could not lock path 00001/1 with session 4428b7cb-4c9e-42cd-25c2-af74f2d487e3.
27/04/2019 07:15:33 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:33 INFO Could not lock path 00001/1 with session 4428b7cb-4c9e-42cd-25c2-af74f2d487e3.
27/04/2019 07:15:33 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:33 INFO Could not lock path 00001/1 with session 4428b7cb-4c9e-42cd-25c2-af74f2d487e3.
27/04/2019 07:15:34 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:34 INFO Could not lock path 00001/1 with session 4428b7cb-4c9e-42cd-25c2-af74f2d487e3.
27/04/2019 07:15:34 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:34 INFO Could not lock path 00002/2 with session 4428b7cb-4c9e-42cd-25c2-af74f2d487e3.
27/04/2019 07:15:35 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:35 INFO Could not lock path 00001/2 with session 4428b7cb-4c9e-42cd-25c2-af74f2d487e3.
27/04/2019 07:15:36 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:36 INFO Could not lock path 00001/1 with session 4428b7cb-4c9e-42cd-25c2-af74f2d487e3.
27/04/2019 07:15:36 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:36 INFO Could not lock path 00002/1 with session 4428b7cb-4c9e-42cd-25c2-af74f2d487e3.
27/04/2019 07:15:37 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:37 INFO The path 00002/1 was locked by svl-cephmon01.
27/04/2019 07:15:37 INFO This node will stop node svl-cephmon01/10.129.2.226.
27/04/2019 07:15:51 ERROR Node Stats exception.
27/04/2019 07:15:51 ERROR Error running echo command :echo 'PetaSAN.NodeStats.svl-cephosd01.cpu_all.percent_util 24.82' `date +%s` | nc -q0 10.129.2.74 2003
Traceback (most recent call last):
File "/opt/petasan/scripts/node_stats.py", line 137, in <module>
get_stats()
File "/opt/petasan/scripts/node_stats.py", line 54, in get_stats
_get_cpu(sar_result)
File "/opt/petasan/scripts/node_stats.py", line 74, in _get_cpu
_send_graphite(path_key, val)
File "/opt/petasan/scripts/node_stats.py", line 50, in _send_graphite
raise Exception("Error running echo command :" + cmd)
Exception: Error running echo command :echo 'PetaSAN.NodeStats.svl-cephosd01.cpu_all.percent_util 24.82' `date +%s` | nc -q0 10.129.2.74 2003
27/04/2019 07:15:54 WARNING , retrying in 1 seconds...
27/04/2019 07:15:58 WARNING , retrying in 1 seconds...
27/04/2019 07:16:02 WARNING , retrying in 1 seconds...
27/04/2019 07:16:03 WARNING , retrying in 2 seconds...
27/04/2019 07:16:06 WARNING , retrying in 2 seconds...
27/04/2019 07:16:10 WARNING , retrying in 1 seconds...
27/04/2019 07:16:10 WARNING , retrying in 2 seconds...
27/04/2019 07:16:12 WARNING , retrying in 4 seconds...
27/04/2019 07:16:16 WARNING , retrying in 4 seconds...
27/04/2019 07:16:18 WARNING , retrying in 2 seconds...
27/04/2019 07:16:19 WARNING , retrying in 4 seconds...
27/04/2019 07:16:23 WARNING , retrying in 8 seconds...
27/04/2019 07:16:25 WARNING , retrying in 1 seconds...
27/04/2019 07:16:27 WARNING , retrying in 8 seconds...
27/04/2019 07:16:27 WARNING , retrying in 4 seconds...
27/04/2019 07:16:30 WARNING , retrying in 8 seconds...
27/04/2019 07:16:32 WARNING , retrying in 1 seconds...
27/04/2019 07:16:33 WARNING , retrying in 2 seconds...
27/04/2019 07:16:38 WARNING , retrying in 16 seconds...
27/04/2019 07:16:48 WARNING , retrying in 2 seconds...
27/04/2019 07:16:49 WARNING , retrying in 4 seconds...
27/04/2019 07:16:53 WARNING , retrying in 8 seconds...
27/04/2019 07:16:54 WARNING , retrying in 16 seconds...
27/04/2019 07:16:54 WARNING , retrying in 1 seconds...
27/04/2019 07:16:57 WARNING , retrying in 4 seconds...
27/04/2019 07:17:00 WARNING , retrying in 8 seconds...
27/04/2019 07:17:03 WARNING , retrying in 2 seconds...
27/04/2019 07:17:05 ERROR Could not find key resource.
27/04/2019 07:17:05 ERROR
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/api.py", line 448, in get_key
index, data = cons.kv.get(key)
File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 391, in get
callback, '/v1/kv/%s' % key, params=params)
File "/usr/local/lib/python2.7/dist-packages/retry/compat.py", line 16, in wrapper
return caller(f, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 74, in retry_decorator
logger)
File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 33, in __retry_internal
return f()
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/ps_consul.py", line 71, in get
raise RetryConsulException()
RetryConsulException
27/04/2019 07:17:08 WARNING , retrying in 8 seconds...
27/04/2019 07:17:09 WARNING , retrying in 16 seconds...
27/04/2019 07:17:09 ERROR
27/04/2019 07:17:10 WARNING , retrying in 1 seconds...
27/04/2019 07:17:12 WARNING , retrying in 4 seconds...
27/04/2019 07:17:15 WARNING , retrying in 16 seconds...
27/04/2019 07:17:18 WARNING , retrying in 2 seconds...
27/04/2019 07:17:23 WARNING , retrying in 8 seconds...
27/04/2019 07:17:23 WARNING , retrying in 16 seconds...
27/04/2019 07:17:25 WARNING , retrying in 1 seconds...
27/04/2019 07:17:26 WARNING , retrying in 1 seconds...
27/04/2019 07:17:27 WARNING , retrying in 4 seconds...
27/04/2019 07:17:33 WARNING , retrying in 2 seconds...
27/04/2019 07:17:34 WARNING , retrying in 2 seconds...
27/04/2019 07:17:38 WARNING , retrying in 16 seconds...
27/04/2019 07:17:38 WARNING , retrying in 8 seconds...
27/04/2019 07:17:39 ERROR
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 642, in handle_cluster_startup
result = consul_api.set_leader_startup_time(current_node_name, str(i))
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/api.py", line 372, in set_leader_startup_time
return consul_obj.kv.put(ConfigAPI().get_consul_leaders_path()+node_name,minutes)
File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 459, in put
'/v1/kv/%s' % key, params=params, data=value)
File "/usr/local/lib/python2.7/dist-packages/retry/compat.py", line 16, in wrapper
return caller(f, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 74, in retry_decorator
logger)
File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 33, in __retry_internal
return f()
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/ps_consul.py", line 82, in put
raise RetryConsulException()
RetryConsulException
27/04/2019 07:17:40 WARNING , retrying in 1 seconds...
27/04/2019 07:17:42 WARNING , retrying in 4 seconds...
27/04/2019 07:17:43 WARNING , retrying in 4 seconds...
27/04/2019 07:17:48 WARNING , retrying in 2 seconds...
27/04/2019 07:17:53 WARNING , retrying in 8 seconds...
27/04/2019 07:17:53 WARNING , retrying in 16 seconds...
27/04/2019 07:17:54 WARNING , retrying in 8 seconds...
27/04/2019 07:17:55 WARNING , retrying in 1 seconds...
27/04/2019 07:17:57 WARNING , retrying in 4 seconds...
27/04/2019 07:18:03 WARNING , retrying in 2 seconds...
27/04/2019 07:18:08 WARNING , retrying in 8 seconds...
27/04/2019 07:18:08 WARNING , retrying in 16 seconds...
27/04/2019 07:18:10 WARNING , retrying in 16 seconds...
27/04/2019 07:18:10 WARNING , retrying in 1 seconds...
(OSD1 didn't shut down)

//////////////////////////////
///////////////OSD2 server logs on 27.04.2019:
//////////////////////////////

27/04/2019 07:15:23 WARNING , retrying in 1 seconds...
27/04/2019 07:15:23 WARNING , retrying in 1 seconds...
27/04/2019 07:15:23 WARNING , retrying in 1 seconds...
27/04/2019 07:15:23 WARNING , retrying in 1 seconds...
27/04/2019 07:15:26 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:26 INFO Could not lock path 00002/1 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:27 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:27 INFO Could not lock path 00002/1 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:27 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:27 INFO Could not lock path 00001/2 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:28 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:28 INFO Could not lock path 00002/2 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:28 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:28 INFO Could not lock path 00001/1 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:29 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:29 INFO Could not lock path 00002/2 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:29 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:29 INFO Could not lock path 00002/1 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:30 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:30 INFO Could not lock path 00001/2 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:30 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:31 INFO Could not lock path 00002/1 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:31 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:31 INFO Could not lock path 00001/2 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:32 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:32 INFO Could not lock path 00001/2 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:32 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:32 INFO Could not lock path 00001/2 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:33 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:33 INFO Could not lock path 00002/2 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:33 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:33 INFO Could not lock path 00002/2 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:34 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:34 INFO Could not lock path 00002/1 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:34 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:34 INFO Could not lock path 00001/2 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:35 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:35 INFO Could not lock path 00001/2 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:35 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:35 INFO Could not lock path 00001/2 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:36 INFO Found pool:rbd for disk:00001 via consul
27/04/2019 07:15:36 INFO Could not lock path 00001/1 with session fd44ddae-6044-8bd5-20b4-ab1029483493.
27/04/2019 07:15:36 INFO Found pool:rbd for disk:00002 via consul
27/04/2019 07:15:36 WARNING , retrying in 1 seconds...
(last log before shutdown)

 

I have checked the monitoring; CPU utilization is below 20% and RAM usage is ~30%. I have turned fencing off; let's see how it goes. I will keep posting.

raise RetryConsulException()
RetryConsulException

27/04/2019 07:33:01 WARNING , retrying in 1 seconds...
27/04/2019 07:33:02 WARNING , retrying in 2 seconds...
27/04/2019 07:33:04 WARNING , retrying in 4 seconds...
27/04/2019 07:33:08 WARNING , retrying in 8 seconds...
(last log before shutdown)

This means the node was not able to connect to Consul, which, among other things, is responsible for maintaining the node as a member of the cluster. We fail the node after approximately 15-20 seconds. Moreover:

27/04/2019 07:32:59 ERROR Error running echo command :echo 'PetaSAN.NodeStats.svl-cephmon01.cpu_all.percent_util 1.01' `date +%s` | nc -q0 10.129.2.74 2003

This is another connection failure, this time to the stats server when sending stats metrics. It is independent of Consul, but it indicates a connection problem from this node to 10.129.2.74 on the backend 1 network, which is also used by Consul.

Try to monitor these messages and see why/when they occur. How fast is your backend 1 network?
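
For example, a small probe like this on the affected node would show whether the stats endpoint quoted above is reachable when the errors occur (nc flags can differ between netcat variants; consul members is the standard Consul membership check):

# Probe the stats endpoint from the log lines above once every 10 seconds
while true; do
    if nc -z -w 2 10.129.2.74 2003; then
        echo "$(date '+%d/%m/%Y %H:%M:%S') OK   10.129.2.74:2003"
    else
        echo "$(date '+%d/%m/%Y %H:%M:%S') FAIL 10.129.2.74:2003"
    fi
    sleep 10
done >> /tmp/backend_probe.log

# Separately, check Consul cluster membership as seen from this node
consul members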

For now it is a 1 Gbit Ethernet link; I have ordered 10 Gbit SFP+.

OK, I will keep monitoring!

Thanks for the help!