8 OSDs went down

I noticed that 8 OSDs went down (out of 10 on this node). We have 4 nodes, each with 10 OSDs, so there are usually 40 OSDs up, but for some reason when I logged in on Monday 8 were down. Here is the info from the PetaSAN log for this node:

21/07/2017 10:41:15 ERROR   
 21/07/2017 10:41:19 WARNING  , retrying in 1 seconds...
 21/07/2017 10:41:19 WARNING  , retrying in 1 seconds...
 21/07/2017 10:41:19 WARNING  , retrying in 1 seconds...
 21/07/2017 10:41:27 WARNING  , retrying in 2 seconds...
 21/07/2017 10:41:27 WARNING  , retrying in 2 seconds...
 21/07/2017 10:41:27 WARNING  , retrying in 2 seconds...
 21/07/2017 10:41:32 WARNING  , retrying in 1 seconds...
 21/07/2017 10:41:35 INFO     GlusterFS mount attempt
 21/07/2017 10:41:36 WARNING  , retrying in 4 seconds...
 21/07/2017 10:41:36 WARNING  , retrying in 4 seconds...
 21/07/2017 10:41:36 WARNING  , retrying in 4 seconds...
 21/07/2017 10:41:40 WARNING  , retrying in 2 seconds...
 21/07/2017 10:41:47 WARNING  , retrying in 8 seconds...
 21/07/2017 10:41:47 WARNING  , retrying in 8 seconds...
 21/07/2017 10:41:47 WARNING  , retrying in 8 seconds...
 21/07/2017 10:41:50 WARNING  , retrying in 4 seconds...
 21/07/2017 10:42:01 WARNING  , retrying in 8 seconds...
 21/07/2017 10:42:02 WARNING  , retrying in 16 seconds...
 21/07/2017 10:42:02 WARNING  , retrying in 16 seconds...
 21/07/2017 10:42:03 WARNING  , retrying in 16 seconds...
 21/07/2017 10:42:08 INFO     GlusterFS mount attempt
 21/07/2017 10:42:16 WARNING  , retrying in 16 seconds...
 21/07/2017 10:42:25 ERROR   
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 561, in handle_cluster_startup
    result = consul_api.set_leader_startup_time(current_node_name, str(i))
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/api.py", line 305, in set_leader_startup_time
    return consul_obj.kv.put(ConfigAPI().get_consul_leaders_path()+node_name,minutes)
  File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 459, in put
    '/v1/kv/%s' % key, params=params, data=value)
  File "/usr/local/lib/python2.7/dist-packages/retry/compat.py", line 16, in wrapper
    return caller(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 74, in retry_decorator
    logger)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/ps_consul.py", line 82, in put
    raise RetryConsulException()
RetryConsulException
 21/07/2017 10:42:26 ERROR    Error during __proces.
 21/07/2017 10:42:26 ERROR   
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 84, in start
    self.__session = ConsulAPI().get_new_session_ID(self.__session_name,self.__node_info.name)
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/api.py", line 38, in get_new_session_ID
    self.drop_all_node_sessions(session_name,node_name)
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/api.py", line 282, in drop_all_node_sessions
    sessions = self.get_sessions_dict(session_name,node_name)
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/api.py", line 250, in get_sessions_dict
    for sess in consul_obj.session.list()[1]:
  File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 1440, in list
    '/v1/session/list', params=params)
  File "/usr/local/lib/python2.7/dist-packages/retry/compat.py", line 16, in wrapper
    return caller(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 74, in retry_decorator
    logger)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/ps_consul.py", line 71, in get
    raise RetryConsulException()
RetryConsulException
 21/07/2017 10:42:26 ERROR   
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/file_sync_manager.py", line 75, in sync
    index, data = base.watch(self.root_path, current_index)
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/base.py", line 49, in watch
    return cons.kv.get(key, index=current_index, recurse=True)
  File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 391, in get
    callback, '/v1/kv/%s' % key, params=params)
  File "/usr/local/lib/python2.7/dist-packages/retry/compat.py", line 16, in wrapper
    return caller(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 74, in retry_decorator
    logger)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/ps_consul.py", line 71, in get
    raise RetryConsulException()
RetryConsulException
 21/07/2017 10:42:34 WARNING  , retrying in 1 seconds...
 21/07/2017 10:42:35 WARNING  , retrying in 1 seconds...
 21/07/2017 10:42:37 WARNING  , retrying in 1 seconds...
 21/07/2017 10:42:39 ERROR   
 21/07/2017 10:42:41 INFO     GlusterFS mount attempt
 21/07/2017 10:42:43 WARNING  , retrying in 2 seconds...
 21/07/2017 10:42:43 WARNING  , retrying in 2 seconds...
 21/07/2017 10:42:45 WARNING  , retrying in 2 seconds...
 21/07/2017 10:42:52 WARNING  , retrying in 4 seconds...
 21/07/2017 10:42:53 WARNING  , retrying in 4 seconds...
 21/07/2017 10:42:54 WARNING  , retrying in 4 seconds...
 21/07/2017 10:42:57 WARNING  , retrying in 1 seconds...
 21/07/2017 10:43:03 WARNING  , retrying in 8 seconds...
 21/07/2017 10:43:04 WARNING  , retrying in 8 seconds...
 21/07/2017 10:43:05 WARNING  , retrying in 2 seconds...
 21/07/2017 10:43:05 WARNING  , retrying in 8 seconds...
 21/07/2017 10:43:14 WARNING  , retrying in 4 seconds...
 21/07/2017 10:43:14 INFO     GlusterFS mount attempt
 21/07/2017 10:43:18 WARNING  , retrying in 16 seconds...
 21/07/2017 10:43:19 WARNING  , retrying in 16 seconds...
 21/07/2017 10:43:20 WARNING  , retrying in 16 seconds...
 21/07/2017 10:43:25 WARNING  , retrying in 8 seconds...
 21/07/2017 10:43:40 WARNING  , retrying in 16 seconds...
 21/07/2017 10:43:41 ERROR   
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 561, in handle_cluster_startup
    result = consul_api.set_leader_startup_time(current_node_name, str(i))
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/api.py", line 305, in set_leader_startup_time
    return consul_obj.kv.put(ConfigAPI().get_consul_leaders_path()+node_name,minutes)
  File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 459, in put
    '/v1/kv/%s' % key, params=params, data=value)
  File "/usr/local/lib/python2.7/dist-packages/retry/compat.py", line 16, in wrapper
    return caller(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 74, in retry_decorator
    logger)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/ps_consul.py", line 82, in put
    raise RetryConsulException()
RetryConsulException
 21/07/2017 10:43:42 ERROR   
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/file_sync_manager.py", line 75, in sync
    index, data = base.watch(self.root_path, current_index)
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/base.py", line 49, in watch
    return cons.kv.get(key, index=current_index, recurse=True)
  File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 391, in get
    callback, '/v1/kv/%s' % key, params=params)
  File "/usr/local/lib/python2.7/dist-packages/retry/compat.py", line 16, in wrapper
    return caller(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 74, in retry_decorator
    logger)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/ps_consul.py", line 71, in get
    raise RetryConsulException()
RetryConsulException
 21/07/2017 10:43:44 ERROR    Error during __proces.
 21/07/2017 10:43:44 ERROR   
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/iscsi_service.py", line 84, in start
    self.__session = ConsulAPI().get_new_session_ID(self.__session_name,self.__node_info.name)
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/api.py", line 38, in get_new_session_ID
    self.drop_all_node_sessions(session_name,node_name)
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/api.py", line 282, in drop_all_node_sessions
    sessions = self.get_sessions_dict(session_name,node_name)
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/api.py", line 250, in get_sessions_dict
    for sess in consul_obj.session.list()[1]:
  File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 1440, in list
    '/v1/session/list', params=params)
  File "/usr/local/lib/python2.7/dist-packages/retry/compat.py", line 16, in wrapper
    return caller(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 74, in retry_decorator
    logger)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/ps_consul.py", line 71, in get
    raise RetryConsulException()
RetryConsulException

It looks like a network connectivity problem. Services such as Ceph, Consul and Gluster all seem to be running in the cluster, yet on the problem node they are all failing. Most of the log entries come from the Consul client code failing to communicate with the rest of the cluster: it keeps retrying, then aborts.
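To make the "keeps retrying, then aborts" pattern concrete, here is a minimal sketch of retry with exponential backoff. It is not the PetaSAN or `retry` library code. The function name `call_with_backoff`, its parameters, and the limit of 6 attempts are assumptions chosen only so that the waits match the 1, 2, 4, 8, 16 second sequence in the log above. `RetryConsulException` here is a local stand-in for the exception named in the traceback.

```python
import time


class RetryConsulException(Exception):
    """Local stand-in for the exception named in the traceback."""


def call_with_backoff(func, tries=6, delay=1, backoff=2, sleep=time.sleep):
    """Call func(); on RetryConsulException wait, then retry with a doubled wait.

    tries, delay and backoff are assumed values, not PetaSAN's real settings.
    Once all attempts have failed, the last exception propagates to the caller.
    """
    wait = delay
    for attempt in range(1, tries + 1):
        try:
            return func()
        except RetryConsulException:
            if attempt == tries:
                raise  # out of attempts: the caller sees the exception
            print("retrying in {} seconds...".format(wait))
            sleep(wait)
            wait *= backoff


def unreachable_consul():
    # Simulates a node that cannot reach Consul at all.
    raise RetryConsulException()


if __name__ == "__main__":
    waits = []
    try:
        # Record the waits instead of really sleeping.
        call_with_backoff(unreachable_consul, sleep=waits.append)
    except RetryConsulException:
        print("gave up after waits of", waits)
```

If the call never succeeds, the waits double until the attempts run out and the exception surfaces, which is consistent with the WARNING lines followed by an ERROR and traceback in the log.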

From that node, via ssh or the console menu, try pinging the other nodes on each of the different subnets, specifically on backend 1, which the above services use. If the pings are OK, I would try a reboot and see if that fixes things. Also, if you suspect network issues, you can use bonded interfaces. Please let me know how things work out.
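As a complement to ping, a small script can check whether TCP connections to the other nodes actually open. This is only a sketch under assumptions: the helper names `can_connect` and `check_peers` are made up for illustration, and the ports are the usual defaults (8300 for Consul server RPC, 8301 for Consul Serf LAN, 6789 for the Ceph monitor), which may differ on your cluster and only listen on nodes that run those services. Pass the backend 1 addresses of the other nodes on the command line.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.error:  # covers timeouts, refused connections, no route
        return False
    sock.close()
    return True


def check_peers(peers, ports, timeout=3.0):
    """Return {(host, port): reachable} for every peer/port pair."""
    results = {}
    for host in peers:
        for port in ports:
            results[(host, port)] = can_connect(host, port, timeout)
    return results


if __name__ == "__main__":
    # Assumed default ports: Consul server RPC, Consul Serf LAN, Ceph monitor.
    ports = [8300, 8301, 6789]
    # Peer addresses come from the command line; none are hard-coded.
    peers = sys.argv[1:]
    for (host, port), ok in sorted(check_peers(peers, ports).items()):
        print("{}:{} {}".format(host, port, "ok" if ok else "UNREACHABLE"))
```

If ping works but these connections fail, the problem is more likely the service or a firewall than the link itself. If both fail, that points back at the network on backend 1.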