Vlan configuration

OK, a very bumpy upgrade. I now have 3 monitor nodes up, running, and in quorum. One strange issue: one of them didn't have the Python Twisted package installed. Not sure if that's a known issue.
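
In case it helps anyone else hitting the same missing package, a quick check and reinstall along these lines should do it (this assumes a stock PetaSAN/Ubuntu node and the standard Ubuntu python-twisted package; your install may have pulled Twisted in differently):

python2.7 -c 'import twisted; print(twisted.version)'    # errors out if Twisted is missing
apt-get install python-twisted                           # reinstall from the Ubuntu repos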

Anyway, now that the cluster is healthy, I'm trying to update the software on the OSD nodes. I tried the first one: the node comes back up, but none of its OSDs come online. The petasan.log file has the following at the end:

ConsulException: No known Consul servers
10/03/2018 19:28:08 ERROR    No known Consul servers
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/file_sync_manager.py", line 75, in sync
    index, data = base.watch(self.root_path, current_index)
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/base.py", line 49, in watch
    return cons.kv.get(key, index=current_index, recurse=True)
  File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 391, in get
    callback, '/v1/kv/%s' % key, params=params)
  File "/usr/local/lib/python2.7/dist-packages/retry/compat.py", line 16, in wrapper
    return caller(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 74, in retry_decorator
    logger)
  File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/ps_consul.py", line 72, in get
    return callback(res)
  File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 377, in callback
    raise ConsulException(response.body)
ConsulException: No known Consul servers

Any ideas what might have gone wrong? The cluster is stuck repairing and won't go any further.

For the first 3 nodes, can you run:

consul members
consul info | grep leader
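
If it is easier, something like the following runs both checks on the three management nodes in one pass (NODE1..NODE3 are placeholders for your first three node hostnames, and it assumes root ssh access between nodes):

for h in NODE1 NODE2 NODE3; do
    echo "=== $h ==="
    ssh root@"$h" 'consul members; consul info | grep leader'
done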

Before you upgraded the first OSD node, was the cluster healthy, as in status "OK" / "active/clean" ?

Are all OSDs in the cluster up, apart from those on the OSD node you upgraded ?

From the OSD node you upgraded can you run

consul members

From the OSD node can you ping other nodes on backend 1 and backend 2 ?

How many OSDs do you have ? How many PGs ? What capacity of disks in TB ? Do you have enough RAM ?

What errors do you get if you start an OSD manually:

/usr/lib/ceph/ceph-osd-prestart.sh --cluster CLUSTER_NAME --id OSD_ID
/usr/bin/ceph-osd -f --cluster CLUSTER_NAME --id OSD_ID --setuser ceph --setgroup ceph
example:
/usr/lib/ceph/ceph-osd-prestart.sh --cluster demo --id 2
/usr/bin/ceph-osd -f --cluster demo --id 2 --setuser ceph --setgroup ceph

 

The Twisted package not being installed is also very strange; it could be a failure during the install, but let us take it one step at a time. Our first priority is to get the OSD node connected to Consul and all its OSDs up.

Hi,

Okay, I think the Consul error was a red herring. Checking timestamps, it didn't correspond to the time of the issues. The Consul debugging you gave shows it's all up.

My cluster rebuild is stuck at the following point:

root@ceph0:~# ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 60 seconds; 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1 requests are blocked > 4096 sec; 1 osds have very slow requests; noscrub,nodeep-scrub flag(s) set
pg 1.e0e is stuck inactive for 131678.605358, current state incomplete, last acting [2,35,23]
pg 1.e0e is stuck unclean since forever, current state incomplete, last acting [2,35,23]
pg 1.e0e is incomplete, acting [2,35,23] (reducing pool rbd min_size from 2 may help; search ceph.com/docs for 'incomplete')
1 ops are blocked > 67108.9 sec on osd.2
1 osds have very slow requests

Any suggestions on how to recover from this?

Hi,

Can you go over the other points in my previous post, aside from Consul, since Consul is all up ?

We should focus on getting all OSDs up; once they talk to one another, things should clear.
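
For example, to see at a glance which OSDs are still down (CLUSTER_NAME is a placeholder, as in the earlier commands):

ceph osd stat --cluster CLUSTER_NAME                  # summary counts, e.g. "72 osds: 70 up, 70 in"
ceph osd tree --cluster CLUSTER_NAME | grep down      # list the OSDs currently marked down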

Quote from admin on March 11, 2018, 8:39 am

 

Before you upgraded the first OSD node, was the cluster healthy, as in status "OK" / "active/clean" ?

Yes

Are all OSDs in the cluster up, apart from those on the OSD node you upgraded ?

Yes

From the OSD node you upgraded can you run

consul members

 

root@ceph3:~# consul members
Node            Address           Status  Type    Build  Protocol  DC
ceph-lm3dc2-00  10.43.2.200:8301  alive   client  0.7.3  2         petasan
ceph-lm3dc2-01  10.43.2.201:8301  alive   client  0.7.3  2         petasan
ceph-lm3dc2-02  10.43.2.202:8301  alive   client  0.7.3  2         petasan
ceph0           10.43.2.100:8301  alive   server  0.7.3  2         petasan
ceph1           10.43.2.101:8301  alive   server  0.7.3  2         petasan
ceph10          10.43.2.110:8301  alive   client  0.7.3  2         petasan
ceph2           10.43.2.102:8301  alive   server  0.7.3  2         petasan
ceph3           10.43.2.103:8301  alive   client  0.7.3  2         petasan
ceph4           10.43.2.104:8301  alive   client  0.7.3  2         petasan
ceph5           10.43.2.105:8301  alive   client  0.7.3  2         petasan
ceph6           10.43.2.106:8301  alive   client  0.7.3  2         petasan
ceph7           10.43.2.107:8301  alive   client  0.7.3  2         petasan
ceph8           10.43.2.108:8301  alive   client  0.7.3  2         petasan
ceph9           10.43.2.109:8301  alive   client  0.7.3  2         petasan

 

 

From the OSD node can you ping other nodes on backend 1 and backend 2 ?

Yes

How many OSDs do you have ? How many PGs ? What capacity of disks in TB ? Do you have enough RAM ?

72 OSDs spread among 14 servers. 4096 PGs. ~20 TB total. Each node has between 16 and 32 GB of RAM. No nodes are going into swap.
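
(For reference, assuming the rbd pool keeps 3 replicas, as the acting sets like [2,35,23] suggest, that works out to roughly 170 PG copies per OSD:)

echo $(( 4096 * 3 / 72 ))    # ~170 PG copies per OSD, assuming a replication size of 3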

 

What errors do you get if you start an OSD manually:

/usr/lib/ceph/ceph-osd-prestart.sh --cluster CLUSTER_NAME --id OSD_ID
/usr/bin/ceph-osd -f --cluster CLUSTER_NAME --id OSD_ID --setuser ceph --setgroup ceph
example:
/usr/lib/ceph/ceph-osd-prestart.sh --cluster demo --id 2
/usr/bin/ceph-osd -f --cluster demo --id 2 --setuser ceph --setgroup ceph

All OSDs are either deleted or up. OSDs that would not start have been deleted (I know - I don't need a lecture. It was a rough weekend). I am unable to create a new OSD, but I suspect this is due to the hybrid nature of the cluster at the moment?

The Twisted package not being installed is also very strange; it could be a failure during the install, but let us take it one step at a time. Our first priority is to get the OSD node connected to Consul and all its OSDs up.

 

Hi again,

OSDs 2, 35, 23: are any of them down, or are all up ? Is any of them on the problem OSD node ?

The deleted OSDs: were they all from the same problem OSD node, or were there deleted OSDs on multiple nodes ?

On the problem OSD node, after the upgrade, did some OSDs come up while others did not start ?

Can you perform

ceph osd pool set rbd min_size 1  --cluster XXXX

On the node running OSD 2

systemctl restart ceph-osd@2
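
After the restart, a quick way to confirm the change took effect and to watch whether the stuck PG clears (XXXX stays the cluster-name placeholder used above):

ceph osd pool get rbd min_size --cluster XXXX    # should now report min_size: 1
ceph health detail --cluster XXXX | grep 1.e0e   # check whether pg 1.e0e is still incomplete
ceph -w --cluster XXXX                           # watch cluster events live (Ctrl-C to stop)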

 

Quote from admin on March 12, 2018, 5:31 pm

Hi again,

OSDs 2, 35, 23: are any of them down, or are all up ? Is any of them on the problem OSD node ?

All are up.

The deleted OSDs: were they all from the same problem OSD node, or were there deleted OSDs on multiple nodes ?

Multiple nodes.  I deleted all of the ones on the node that was upgraded to 2.0, as well as one or two elsewhere during my troubleshooting.

On the problem OSD node, after the upgrade, did some OSDs come up while others did not start ?

The problem node came up with no OSDs starting.

Can you perform

ceph osd pool set rbd min_size 1  --cluster XXXX

I now get the same error, but the suggestion about reducing min_size is gone:

root@ceph0:~# ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 60 seconds; 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1 requests are blocked > 4096 sec; 1 osds have very slow requests; noscrub,nodeep-scrub flag(s) set
pg 1.e0e is stuck inactive for 137793.803178, current state incomplete, last acting [2,35,23]
pg 1.e0e is stuck unclean since forever, current state incomplete, last acting [2,35,23]
pg 1.e0e is incomplete, acting [2,35,23]
1 ops are blocked > 67108.9 sec on osd.2
1 osds have very slow requests

The OSDs deleted on nodes other than the problem node: were those OSDs down ? On how many nodes ?

Can you please get the output of

ceph pg 1.e0e query --cluster CLUSTER_NAME
ceph pg dump --cluster CLUSTER_NAME | grep 1.e0e
On the node with OSD 2, get the OSD log:
chown -R ceph:ceph /var/log/ceph
systemctl restart ceph-osd@2
wait 10 minutes, then:
cat /var/log/ceph/CLUSTER_NAME-osd.2.log
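
If it is easier to run in one go, the same steps as a small script on the node with OSD 2 (CLUSTER_NAME is still a placeholder; the /tmp file names are just a suggestion for what to paste back here):

ceph pg 1.e0e query --cluster CLUSTER_NAME > /tmp/pg-1.e0e-query.txt
ceph pg dump --cluster CLUSTER_NAME | grep 1.e0e > /tmp/pg-1.e0e-dump.txt
chown -R ceph:ceph /var/log/ceph
systemctl restart ceph-osd@2
sleep 600    # give the OSD ~10 minutes to log its startup and peering attempts
cp /var/log/ceph/CLUSTER_NAME-osd.2.log /tmp/osd.2.log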

Quote from admin on March 12, 2018, 6:28 pm

The OSDs deleted on nodes other than the problem node: were those OSDs down ? On how many nodes ?

Can you please get the output of

ceph pg 1.e0e query --cluster CLUSTER_NAME

https://pastebin.com/kdjCFtRH

ceph pg dump --cluster CLUSTER_NAME | grep 1.e0e

root@ceph0:~# ceph pg dump --cluster ceph | grep 1.e0e
dumped all
1.e0e         0                  0        0         0       0         0    0        0   incomplete 2018-03-12 10:34:50.678837          0'0  22338:11907  [2,35,23]          2  [2,35,23]              2    5858'4107 2018-03-09 06:14:42.877071       5858'4107 2018-03-03 20:07:24.602472

 

On the node with OSD 2, get the OSD log:

https://pastebin.com/YuFnRRxt

chown -R ceph:ceph /var/log/ceph
systemctl restart ceph-osd@2
wait 10 minutes, then:
cat /var/log/ceph/CLUSTER_NAME-osd.2.log

https://pastebin.com/wB5YZ5zZ

 

What is the status of the following OSDs, and are any of them up ?

52, 56, 65
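
Something like this should show their status at a glance (CLUSTER_NAME is a placeholder; no output would mean those IDs no longer exist in the map):

ceph osd tree --cluster CLUSTER_NAME | egrep 'osd\.(52|56|65) '    # up/down column for those three OSDs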
