Vlan configuration
erazmus
40 Posts
March 11, 2018, 4:35 am
Ok, a very bumpy upgrade. I now have 3 monitor nodes up and running and in quorum. A very strange issue where one of them didn't have Twisted installed; not sure if that's a known issue.
Anyway, now that the cluster is healthy, I'm trying to update the software on the OSD nodes. I tried the first one: the node comes back up, but none of its OSDs come online. The petasan.log file has the following at the end:
ConsulException: No known Consul servers
10/03/2018 19:28:08 ERROR No known Consul servers
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/file_sync_manager.py", line 75, in sync
index, data = base.watch(self.root_path, current_index)
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/base.py", line 49, in watch
return cons.kv.get(key, index=current_index, recurse=True)
File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 391, in get
callback, '/v1/kv/%s' % key, params=params)
File "/usr/local/lib/python2.7/dist-packages/retry/compat.py", line 16, in wrapper
return caller(f, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 74, in retry_decorator
logger)
File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 33, in __retry_internal
return f()
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/ps_consul.py", line 72, in get
return callback(res)
File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 377, in callback
raise ConsulException(response.body)
ConsulException: No known Consul servers
Any ideas what might have gone wrong? The cluster is stuck repairing and won't go any further:
[Ceph Health screenshot]
admin
2,930 Posts
March 11, 2018, 8:39 am
For the first 3 nodes, can you run:
consul members
consul info | grep leader
Before you upgraded the first OSD, was the cluster healthy, as in status "OK" / "active+clean"?
Are all OSDs in the cluster up, apart from the OSD node you upgraded?
From the OSD node you upgraded, can you run:
consul members
From the OSD node, can you ping other nodes on backend 1 and backend 2?
How many OSDs do you have? How many PGs? What capacity of disks in TB? Do you have enough RAM?
What errors do you get if you start an OSD manually:
/usr/lib/ceph/ceph-osd-prestart.sh --cluster CLUSTER_NAME --id OSD_ID
/usr/bin/ceph-osd -f --cluster CLUSTER_NAME --id OSD_ID --setuser ceph --setgroup ceph
example:
/usr/lib/ceph/ceph-osd-prestart.sh --cluster demo --id 2
/usr/bin/ceph-osd -f --cluster demo --id 2 --setuser ceph --setgroup ceph
The Twisted package not being installed is also very strange; it could be a failure during install, but let us take it one step at a time. Our first priority is to get the OSD node connected to Consul and all its OSDs up.
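For reference, a minimal sketch of how these Consul checks could be scripted from one machine, assuming passwordless SSH and the management node names that appear later in this thread (ceph0, ceph1, ceph2 are assumptions):

#!/bin/bash
# Check Consul membership and leader election on each management node.
# Hypothetical node names; substitute your own management hosts.
for node in ceph0 ceph1 ceph2; do
    echo "=== ${node} ==="
    ssh "${node}" "consul members"            # every node should report as 'alive'
    ssh "${node}" "consul info | grep leader" # 'leader = true' on exactly one server
done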
Last edited on March 11, 2018, 9:10 am by admin · #22
erazmus
40 Posts
March 12, 2018, 4:06 pm
Hi,
Okay, I think the Consul error was a red herring. Checking timestamps, it didn't correspond to the time of the issues. The Consul checks you suggested show it's all up.
My cluster rebuild is stopped at the following point.
root@ceph0:~# ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 60 seconds; 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1 requests are blocked > 4096 sec; 1 osds have very slow requests; noscrub,nodeep-scrub flag(s) set
pg 1.e0e is stuck inactive for 131678.605358, current state incomplete, last acting [2,35,23]
pg 1.e0e is stuck unclean since forever, current state incomplete, last acting [2,35,23]
pg 1.e0e is incomplete, acting [2,35,23] (reducing pool rbd min_size from 2 may help; search ceph.com/docs for 'incomplete')
1 ops are blocked > 67108.9 sec on osd.2
1 osds have very slow requests
Any suggestions on how to recover from this?
admin
2,930 Posts
March 12, 2018, 4:22 pm
Hi,
Can you go over the other points in my previous post, aside from Consul, since Consul is all up.
We should focus on getting all OSDs up; once they talk to one another, things should clear.
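As a quick sketch of how to see whether any OSDs are still down from one of the management nodes (CLUSTER_NAME is a placeholder, as in the commands above):

# Summary count of OSDs that are up/in versus the total.
ceph osd stat --cluster CLUSTER_NAME
# List any OSDs the cluster still sees as down, together with their host.
ceph osd tree --cluster CLUSTER_NAME | grep -w down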
Last edited on March 12, 2018, 4:22 pm by admin · #24
erazmus
40 Posts
March 12, 2018, 4:46 pm
Quote from admin on March 11, 2018, 8:39 am
Before you upgraded the first OSD, was the cluster healthy, as in status "OK" / "active+clean"?
Yes
Are all OSDs in the cluster up, apart from the OSD node you upgraded?
Yes
From the OSD node you upgraded, can you run:
consul members
root@ceph3:~# consul members
Node Address Status Type Build Protocol DC
ceph-lm3dc2-00 10.43.2.200:8301 alive client 0.7.3 2 petasan
ceph-lm3dc2-01 10.43.2.201:8301 alive client 0.7.3 2 petasan
ceph-lm3dc2-02 10.43.2.202:8301 alive client 0.7.3 2 petasan
ceph0 10.43.2.100:8301 alive server 0.7.3 2 petasan
ceph1 10.43.2.101:8301 alive server 0.7.3 2 petasan
ceph10 10.43.2.110:8301 alive client 0.7.3 2 petasan
ceph2 10.43.2.102:8301 alive server 0.7.3 2 petasan
ceph3 10.43.2.103:8301 alive client 0.7.3 2 petasan
ceph4 10.43.2.104:8301 alive client 0.7.3 2 petasan
ceph5 10.43.2.105:8301 alive client 0.7.3 2 petasan
ceph6 10.43.2.106:8301 alive client 0.7.3 2 petasan
ceph7 10.43.2.107:8301 alive client 0.7.3 2 petasan
ceph8 10.43.2.108:8301 alive client 0.7.3 2 petasan
ceph9 10.43.2.109:8301 alive client 0.7.3 2 petasan
From the OSD node, can you ping other nodes on backend 1 and backend 2?
Yes
How many OSDs do you have? How many PGs? What capacity of disks in TB? Do you have enough RAM?
72 OSDs spread among 14 servers. 4096 PGs. ~20 TB total. Each node has between 16 and 32 GB of RAM. No nodes are going into swap.
What errors do you get if you start an OSD manually:
/usr/lib/ceph/ceph-osd-prestart.sh --cluster CLUSTER_NAME --id OSD_ID
/usr/bin/ceph-osd -f --cluster CLUSTER_NAME --id OSD_ID --setuser ceph --setgroup ceph
example:
/usr/lib/ceph/ceph-osd-prestart.sh --cluster demo --id 2
/usr/bin/ceph-osd -f --cluster demo --id 2 --setuser ceph --setgroup ceph
All OSDs are either deleted or up. OSDs that would not start have been deleted (I know - I don't need a lecture. It was a rough weekend). I am unable to create a new OSD, but I suspect this is due to the hybrid nature of the cluster at the moment?
The Twisted package not being installed is also very strange; it could be a failure during install, but let us take it one step at a time. Our first priority is to get the OSD node connected to Consul and all its OSDs up.
admin
2,930 Posts
March 12, 2018, 5:31 pm
Hi again,
OSDs 2, 35, 23: are any down, or are all up? Is any on the problem OSD node?
The deleted OSDs: were they all from the same problem OSD node, or were there deleted OSDs on multiple nodes?
On the problem OSD node, after the upgrade, did some OSDs come up while others did not start?
Can you perform
ceph osd pool set rbd min_size 1 --cluster XXXX
Then, on the node running osd.2:
systemctl restart ceph-osd@2
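As a hedged sketch only (standard Ceph CLI, nothing PetaSAN-specific): one way to confirm which host carries osd.2 before lowering min_size and restarting the daemon:

# Find the host that owns osd.2 (prints its crush location, including the host name).
ceph osd find 2 --cluster CLUSTER_NAME
# Temporarily allow the rbd pool to serve I/O with a single replica.
ceph osd pool set rbd min_size 1 --cluster CLUSTER_NAME
# Then, on the host reported above, restart the OSD daemon.
systemctl restart ceph-osd@2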
Last edited on March 12, 2018, 5:36 pm by admin · #26
erazmus
40 Posts
March 12, 2018, 5:38 pm
Quote from admin on March 12, 2018, 5:31 pm
Hi again,
OSDs 2, 35, 23: are any down, or are all up? Is any on the problem OSD node?
All are up.
The deleted OSDs: were they all from the same problem OSD node, or were there deleted OSDs on multiple nodes?
Multiple nodes. I deleted all of the ones on the node that was upgraded to 2.0, as well as one or two elsewhere during my troubleshooting.
On the problem OSD node, after the upgrade, did some OSDs come up while others did not start?
The problem node came up with no OSDs starting.
Can you perform
ceph osd pool set rbd min_size 1 --cluster XXXX
I now get the same error, but the suggestion about reducing min_size is gone:
root@ceph0:~# ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 60 seconds; 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1 requests are blocked > 4096 sec; 1 osds have very slow requests; noscrub,nodeep-scrub flag(s) set
pg 1.e0e is stuck inactive for 137793.803178, current state incomplete, last acting [2,35,23]
pg 1.e0e is stuck unclean since forever, current state incomplete, last acting [2,35,23]
pg 1.e0e is incomplete, acting [2,35,23]
1 ops are blocked > 67108.9 sec on osd.2
1 osds have very slow requests
admin
2,930 Posts
March 12, 2018, 6:28 pm
The OSDs you deleted on nodes other than the problem node: were they down? How many nodes were involved?
Can you please get the output of
ceph pg 1.e0e query --cluster CLUSTER_NAME
ceph pg dump --cluster CLUSTER_NAME | grep 1.e0e
On the node with osd.2, get the OSD log:
chown -R ceph:ceph /var/log/ceph
systemctl restart ceph-osd@2
wait for 10 min:
cat /var/log/ceph/CLUSTER_NAME-osd.2.log
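For what it's worth, a hedged sketch for pulling the peering details out of that pg query output without reading the whole JSON; the field names below are standard Ceph peering fields, and any OSD ids they list are the ones worth checking next:

# Save the query once, then grep the interesting peering fields out of it.
ceph pg 1.e0e query --cluster CLUSTER_NAME > /tmp/pg-1.e0e.json
grep -A 5 '"down_osds_we_would_probe"' /tmp/pg-1.e0e.json
grep -A 5 '"probing_osds"' /tmp/pg-1.e0e.json
grep '"blocked_by"' /tmp/pg-1.e0e.json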
Last edited on March 12, 2018, 6:28 pm by admin · #28
erazmus
40 Posts
March 12, 2018, 6:53 pm
Quote from admin on March 12, 2018, 6:28 pm
The OSDs you deleted on nodes other than the problem node: were they down? How many nodes were involved?
Can you please get the output of
ceph pg 1.e0e query --cluster CLUSTER_NAME
https://pastebin.com/kdjCFtRH
ceph pg dump --cluster CLUSTER_NAME | grep 1.e0e
root@ceph0:~# ceph pg dump --cluster ceph | grep 1.e0e
dumped all
1.e0e 0 0 0 0 0 0 0 0 incomplete 2018-03-12 10:34:50.678837 0'0 22338:11907 [2,35,23] 2 [2,35,23] 2 5858'4107 2018-03-09 06:14:42.877071 5858'4107 2018-03-03 20:07:24.602472
On the node with osd.2, get the OSD log:
https://pastebin.com/YuFnRRxt
chown -R ceph:ceph /var/log/ceph
systemctl restart ceph-osd@2
wait for 10 min:
cat /var/log/ceph/CLUSTER_NAME-osd.2.log
https://pastebin.com/wB5YZ5zZ
admin
2,930 Posts
March 12, 2018, 7:09 pm
What is the status of the following OSDs? Are any up?
52, 56, 65
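A quick sketch, assuming the standard Ceph CLI, for checking those three OSDs from any management node:

# Do osd.52, osd.56 and osd.65 still exist, and are they up or down?
ceph osd dump --cluster CLUSTER_NAME | egrep '^osd\.(52|56|65) '
# Same information in tree form, including which host each one lives on.
ceph osd tree --cluster CLUSTER_NAME | egrep 'osd\.(52|56|65) '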
Last edited on March 12, 2018, 7:10 pm by admin · #30