Vlan configuration
erazmus
40 Posts
March 11, 2018, 4:35 am
Ok, a very bumpy upgrade. I now have 3 monitor nodes up and running and in quorum. A very strange issue where one of them didn't have Twisted installed; not sure if that's a known issue.
Anyway, now that the cluster is healthy, I'm trying to update the software on the OSD nodes. I tried the first one: the node comes back up, but none of its OSDs come online. The petasan.log file has the following at the end:
ConsulException: No known Consul servers
10/03/2018 19:28:08 ERROR No known Consul servers
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/PetaSAN/backend/file_sync_manager.py", line 75, in sync
index, data = base.watch(self.root_path, current_index)
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/base.py", line 49, in watch
return cons.kv.get(key, index=current_index, recurse=True)
File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 391, in get
callback, '/v1/kv/%s' % key, params=params)
File "/usr/local/lib/python2.7/dist-packages/retry/compat.py", line 16, in wrapper
return caller(f, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 74, in retry_decorator
logger)
File "/usr/local/lib/python2.7/dist-packages/retry/api.py", line 33, in __retry_internal
return f()
File "/usr/lib/python2.7/dist-packages/PetaSAN/core/consul/ps_consul.py", line 72, in get
return callback(res)
File "/usr/local/lib/python2.7/dist-packages/consul/base.py", line 377, in callback
raise ConsulException(response.body)
ConsulException: No known Consul servers
Any ideas what might have gone wrong? The cluster is stuck repairing and won't go any further:
[Ceph Health screenshot]
admin
2,930 Posts
March 11, 2018, 8:39 am
For the first 3 nodes, can you run:
consul members
consul info | grep leader
Before you upgraded the first OSD, was the cluster healthy, as in status "OK" / "active+clean"?
Are all OSDs in the cluster up, apart from the OSD node you upgraded?
From the OSD node you upgraded, can you run:
consul members
From the OSD node, can you ping other nodes on backend 1 and backend 2?
How many OSDs do you have? How many PGs? What capacity of disks in TB? Do you have enough RAM?
What errors do you get if you start an OSD manually:
/usr/lib/ceph/ceph-osd-prestart.sh --cluster CLUSTER_NAME --id OSD_ID
/usr/bin/ceph-osd -f --cluster CLUSTER_NAME --id OSD_ID --setuser ceph --setgroup ceph
example:
/usr/lib/ceph/ceph-osd-prestart.sh --cluster demo --id 2
/usr/bin/ceph-osd -f --cluster demo --id 2 --setuser ceph --setgroup ceph
The Twisted package not being installed is also very strange; it could be a failure during install, but let us take it one step at a time. Our first priority is to get the OSD node connected to Consul and all its OSDs up.
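For reference, a minimal sketch of how these Consul checks could be scripted from one machine, assuming passwordless SSH and the management node names that appear later in this thread (ceph0, ceph1, ceph2 are assumptions):

#!/bin/bash
# Check Consul membership and leader election on each management node.
# Hypothetical node names; substitute your own management hosts.
for node in ceph0 ceph1 ceph2; do
    echo "=== ${node} ==="
    ssh "${node}" "consul members"            # every node should report as 'alive'
    ssh "${node}" "consul info | grep leader" # 'leader = true' on exactly one server
done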
Last edited on March 11, 2018, 9:10 am by admin · #22
erazmus
40 Posts
March 12, 2018, 4:06 pm
Hi,
Okay, I think the Consul error was a red herring. Checking timestamps, it didn't correspond to the time of the issues. The Consul checks you suggested show it's all up.
My cluster rebuild is stopped at the following point.
root@ceph0:~# ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 60 seconds; 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1 requests are blocked > 4096 sec; 1 osds have very slow requests; noscrub,nodeep-scrub flag(s) set
pg 1.e0e is stuck inactive for 131678.605358, current state incomplete, last acting [2,35,23]
pg 1.e0e is stuck unclean since forever, current state incomplete, last acting [2,35,23]
pg 1.e0e is incomplete, acting [2,35,23] (reducing pool rbd min_size from 2 may help; search ceph.com/docs for 'incomplete')
1 ops are blocked > 67108.9 sec on osd.2
1 osds have very slow requests
Any suggestions on how to recover from this?
admin
2,930 Posts
March 12, 2018, 4:22 pm
Hi,
Can you go over the other points in my previous post, aside from Consul, since Consul is all up.
We should focus on getting all OSDs up; once they talk to one another, things should clear.
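As a quick sketch of how to see whether any OSDs are still down from one of the management nodes (CLUSTER_NAME is a placeholder, as in the commands above):

# Summary count of OSDs that are up/in versus the total.
ceph osd stat --cluster CLUSTER_NAME
# List any OSDs the cluster still sees as down, together with their host.
ceph osd tree --cluster CLUSTER_NAME | grep -w down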
Last edited on March 12, 2018, 4:22 pm by admin · #24
erazmus
40 Posts
March 12, 2018, 4:46 pm
Quote from admin on March 11, 2018, 8:39 am
Before you upgraded the first OSD, was the cluster healthy, as in status "OK" / "active+clean"?
Yes
Are all OSDs in the cluster up, apart from the OSD node you upgraded?
Yes
From the OSD node you upgraded, can you run:
consul members
root@ceph3:~# consul members
Node Address Status Type Build Protocol DC
ceph-lm3dc2-00 10.43.2.200:8301 alive client 0.7.3 2 petasan
ceph-lm3dc2-01 10.43.2.201:8301 alive client 0.7.3 2 petasan
ceph-lm3dc2-02 10.43.2.202:8301 alive client 0.7.3 2 petasan
ceph0 10.43.2.100:8301 alive server 0.7.3 2 petasan
ceph1 10.43.2.101:8301 alive server 0.7.3 2 petasan
ceph10 10.43.2.110:8301 alive client 0.7.3 2 petasan
ceph2 10.43.2.102:8301 alive server 0.7.3 2 petasan
ceph3 10.43.2.103:8301 alive client 0.7.3 2 petasan
ceph4 10.43.2.104:8301 alive client 0.7.3 2 petasan
ceph5 10.43.2.105:8301 alive client 0.7.3 2 petasan
ceph6 10.43.2.106:8301 alive client 0.7.3 2 petasan
ceph7 10.43.2.107:8301 alive client 0.7.3 2 petasan
ceph8 10.43.2.108:8301 alive client 0.7.3 2 petasan
ceph9 10.43.2.109:8301 alive client 0.7.3 2 petasan
From the OSD node, can you ping other nodes on backend 1 and backend 2?
Yes
How many OSDs do you have? How many PGs? What capacity of disks in TB? Do you have enough RAM?
72 OSDs spread among 14 servers. 4096 PGs. ~20 TB total. Each node has between 16 and 32 GB of RAM. No nodes are going into swap.
What errors do you get if you start an OSD manually:
/usr/lib/ceph/ceph-osd-prestart.sh --cluster CLUSTER_NAME --id OSD_ID
/usr/bin/ceph-osd -f --cluster CLUSTER_NAME --id OSD_ID --setuser ceph --setgroup ceph
example:
/usr/lib/ceph/ceph-osd-prestart.sh --cluster demo --id 2
/usr/bin/ceph-osd -f --cluster demo --id 2 --setuser ceph --setgroup ceph
All OSDs are either deleted or up. OSDs that would not start have been deleted (I know - I don't need a lecture. It was a rough weekend). I am unable to create a new OSD, but I suspect this is due to the hybrid nature of the cluster at the moment?
The Twisted package not being installed is also very strange; it could be a failure during install, but let us take it one step at a time. Our first priority is to get the OSD node connected to Consul and all its OSDs up.
admin
2,930 Posts
March 12, 2018, 5:31 pm
Hi again,
OSDs 2, 35, 23: are any down, or are all up? Is any on the problem OSD node?
The deleted OSDs: were they all from the same problem OSD node, or were there deleted OSDs on multiple nodes?
On the problem OSD node, after the upgrade, did some OSDs come up while others did not start?
Can you perform
ceph osd pool set rbd min_size 1 --cluster XXXX
Then, on the node running osd.2:
systemctl restart ceph-osd@2
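As a hedged sketch only (standard Ceph CLI, nothing PetaSAN-specific): one way to confirm which host carries osd.2 before lowering min_size and restarting the daemon:

# Find the host that owns osd.2 (prints its crush location, including the host name).
ceph osd find 2 --cluster CLUSTER_NAME
# Temporarily allow the rbd pool to serve I/O with a single replica.
ceph osd pool set rbd min_size 1 --cluster CLUSTER_NAME
# Then, on the host reported above, restart the OSD daemon.
systemctl restart ceph-osd@2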
Last edited on March 12, 2018, 5:36 pm by admin · #26
erazmus
40 Posts
March 12, 2018, 5:38 pm
Quote from admin on March 12, 2018, 5:31 pm
Hi again,
OSDs 2, 35, 23: are any down, or are all up? Is any on the problem OSD node?
All are up.
The deleted OSDs: were they all from the same problem OSD node, or were there deleted OSDs on multiple nodes?
Multiple nodes. I deleted all of the ones on the node that was upgraded to 2.0, as well as one or two elsewhere during my troubleshooting.
On the problem OSD node, after the upgrade, did some OSDs come up while others did not start?
The problem node came up with no OSDs starting.
Can you perform
ceph osd pool set rbd min_size 1 --cluster XXXX
I now get the same error, but the suggestion about reducing min_size is gone:
root@ceph0:~# ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 60 seconds; 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean; 1 requests are blocked > 4096 sec; 1 osds have very slow requests; noscrub,nodeep-scrub flag(s) set
pg 1.e0e is stuck inactive for 137793.803178, current state incomplete, last acting [2,35,23]
pg 1.e0e is stuck unclean since forever, current state incomplete, last acting [2,35,23]
pg 1.e0e is incomplete, acting [2,35,23]
1 ops are blocked > 67108.9 sec on osd.2
1 osds have very slow requests
admin
2,930 Posts
March 12, 2018, 6:28 pm
The OSDs you deleted on nodes other than the problem node: were they down? How many nodes were involved?
Can you please get the output of
ceph pg 1.e0e query --cluster CLUSTER_NAME
ceph pg dump --cluster CLUSTER_NAME | grep 1.e0e
On the node with osd.2, get the OSD log:
chown -R ceph:ceph /var/log/ceph
systemctl restart ceph-osd@2
wait for 10 min:
cat /var/log/ceph/CLUSTER_NAME-osd.2.log
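For what it's worth, a hedged sketch for pulling the peering details out of that pg query output without reading the whole JSON; the field names below are standard Ceph peering fields, and any OSD ids they list are the ones worth checking next:

# Save the query once, then grep the interesting peering fields out of it.
ceph pg 1.e0e query --cluster CLUSTER_NAME > /tmp/pg-1.e0e.json
grep -A 5 '"down_osds_we_would_probe"' /tmp/pg-1.e0e.json
grep -A 5 '"probing_osds"' /tmp/pg-1.e0e.json
grep '"blocked_by"' /tmp/pg-1.e0e.json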
Last edited on March 12, 2018, 6:28 pm by admin · #28
erazmus
40 Posts
March 12, 2018, 6:53 pm
Quote from admin on March 12, 2018, 6:28 pm
The OSDs you deleted on nodes other than the problem node: were they down? How many nodes were involved?
Can you please get the output of
ceph pg 1.e0e query --cluster CLUSTER_NAME
https://pastebin.com/kdjCFtRH
ceph pg dump --cluster CLUSTER_NAME | grep 1.e0e
root@ceph0:~# ceph pg dump --cluster ceph | grep 1.e0e
dumped all
1.e0e 0 0 0 0 0 0 0 0 incomplete 2018-03-12 10:34:50.678837 0'0 22338:11907 [2,35,23] 2 [2,35,23] 2 5858'4107 2018-03-09 06:14:42.877071 5858'4107 2018-03-03 20:07:24.602472
On the node with osd.2, get the OSD log:
https://pastebin.com/YuFnRRxt
chown -R ceph:ceph /var/log/ceph
systemctl restart ceph-osd@2
wait for 10 min:
cat /var/log/ceph/CLUSTER_NAME-osd.2.log
https://pastebin.com/wB5YZ5zZ
admin
2,930 Posts
March 12, 2018, 7:09 pm
What is the status of the following OSDs? Are any up?
52, 56, 65
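A quick sketch, assuming the standard Ceph CLI, for checking those three OSDs from any management node:

# Do osd.52, osd.56 and osd.65 still exist, and are they up or down?
ceph osd dump --cluster CLUSTER_NAME | egrep '^osd\.(52|56|65) '
# Same information in tree form, including which host each one lives on.
ceph osd tree --cluster CLUSTER_NAME | egrep 'osd\.(52|56|65) '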
Last edited on March 12, 2018, 7:10 pm by admin · #30