Pools and iSCSI disks sometimes fluctuate between inactive and active
hjallisnorra
19 Posts
January 7, 2020, 11:49 am
Hi, we are testing a setup for production,
with 3 mgm nodes (virtual),
5 nodes with iSCSI and some storage,
6 nodes with only storage (50 * 1T SATA, 2 * 1T SSD).
We have created pools and crush maps,
and all seems to be working fine.
We created some disks, mounted them in oVirt, and are testing some VMs on the storage; no issues with anything there.
But in the PetaSAN pool config the pools are fluctuating between active and inactive,
and sometimes the iSCSI disk list only shows "No data available in table"; after refreshing a few times the disks are visible again.
This is rather frustrating since the cluster is in an OK state with no visible issues anywhere.
ceph status:
root@av-petasan-mgm-ash1-001:~# ceph status
cluster:
id: 48cc93e7-b2ee-4fe1-8b01-b6aacb9dda66
health: HEALTH_OK
services:
mon: 3 daemons, quorum av-petasan-mgm-ash1-003,av-petasan-mgm-ash1-001,av-petasan-mgm-ash1-002 (age 8h)
mgr: av-petasan-mgm-ash1-003(active, since 4d), standbys: av-petasan-mgm-ash1-002, av-petasan-mgm-ash1-001
osd: 367 osds: 367 up (since 8h), 367 in (since 8h)
data:
pools: 4 pools, 16384 pgs
objects: 48.27k objects, 188 GiB
usage: 11 TiB used, 499 TiB / 510 TiB avail
pgs: 16384 active+clean
io:
client: 29 KiB/s rd, 11 KiB/s wr, 0 op/s rd, 0 op/s wr
root@av-petasan-mgm-ash1-001:~#
And lastly, is it possible to add more mgm servers?
admin
2,930 Posts
January 7, 2020, 12:19 pm
This is most likely either underpowered hardware or a network hardware problem.
You can add more Ceph monitor and Consul servers via the CLI if you want. PetaSAN sets up 3 such nodes by default, but you can increase the count manually.
hjallisnorra
19 Posts
January 7, 2020, 1:48 pm
Hi again,
do I need especially powerful mgm nodes depending on how many OSDs or PGs we are using?
hardware:
2 x G9 HP DL380, 64G mem, 4 x 25G NICs, 4 x 1G NICs, 9 * 8T SATA disks, 1 * 1T SSD (iSCSI/storage nodes)
3 x G9 HP DL380, 128G mem, 4 x 25G NICs, 4 x 1G NICs, 22 * 2T SAS disks, 2 * 1T SSD (iSCSI/storage nodes)
6 x G8 HP DL360, 64G mem, 2 x 25G NICs, 4 x 1G NICs, 50 * 1T SATA, 2 * 1T SSD (storage nodes; added 2 dummy NICs when deploying; changed bluestore_block_db_size = 32212238227 so the journal disk partition is 30G per disk, and changed osd_memory_target = 1073741824)
3 virtual mgm servers, 16G mem, 8 CPU cores.
cluster_info.txt:
{
"backend_1_base_ip": "10.118.64.0",
"backend_1_eth_name": "bond0",
"backend_1_mask": "255.255.255.0",
"backend_1_vlan_id": "2064",
"backend_2_base_ip": "10.118.65.0",
"backend_2_eth_name": "bond0",
"backend_2_mask": "255.255.255.0",
"backend_2_vlan_id": "2065",
"bonds": [
{
"interfaces": "eth2,eth3",
"is_jumbo_frames": true,
"mode": "802.3ad",
"name": "bond0",
"primary_interface": ""
},
{
"interfaces": "eth4,eth5",
"is_jumbo_frames": true,
"mode": "802.3ad",
"name": "bond1",
"primary_interface": ""
}
],
"eth_count": 8,
"iscsi_1_eth_name": "bond1",
"iscsi_2_eth_name": "bond1",
"jumbo_frames": [
"eth4",
"eth2",
"eth5",
"eth3"
],
"management_eth_name": "eth1",
"management_nodes": [
{
"backend_1_ip": "10.118.64.101",
"backend_2_ip": "10.118.65.101",
"is_backup": false,
"is_iscsi": false,
"is_management": true,
"is_storage": false,
"management_ip": "10.117.64.101",
"name": "av-petasan-mgm-ash1-001"
},
{
"backend_1_ip": "10.118.64.102",
"backend_2_ip": "10.118.65.102",
"is_backup": false,
"is_iscsi": false,
"is_management": true,
"is_storage": false,
"management_ip": "10.117.64.102",
"name": "av-petasan-mgm-ash1-002"
},
{
"backend_1_ip": "10.118.64.103",
"backend_2_ip": "10.118.65.103",
"is_backup": false,
"is_iscsi": false,
"is_management": true,
"is_storage": false,
"management_ip": "10.117.64.103",
"name": "av-petasan-mgm-ash1-003"
}
],
"name": "ash1-petasan",
"storage_engine": "bluestore"
}
We did some ping tests for all the IPs in the cluster and detected no packet loss.
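For completeness, since the backend bonds have jumbo frames enabled, we could also extend the ping test to non-fragmenting jumbo-sized payloads (a rough sketch; only the mgm backend IPs from cluster_info.txt are filled in, the list would need extending to all nodes). Default small pings can succeed even when the 9000-byte path is broken on a switch port:

# Hedged diagnostic sketch: ping each backend IP with a jumbo-sized,
# non-fragmenting payload to verify the MTU 9000 path end to end.
import subprocess

# mgm backend_1 IPs from cluster_info.txt; extend with the iSCSI/storage node IPs
BACKEND_IPS = ["10.118.64.%d" % i for i in range(101, 104)]

for ip in BACKEND_IPS:
    # -M do = don't fragment, -s 8972 = 9000-byte MTU minus 20-byte IP and 8-byte ICMP headers
    rc = subprocess.call(["ping", "-c", "3", "-M", "do", "-s", "8972", ip])
    print("%s: %s" % (ip, "ok" if rc == 0 else "FAILED (check MTU on bond/switch)"))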
admin
2,930 Posts
January 7, 2020, 3:14 pm
It seems the management nodes are slow to respond; hard to say why. 16G RAM should be enough, but it could be an issue with the VM setup or maybe the network connection between them.
When you see pools going active/inactive in the UI, can you SSH to the VM you are connecting to and run:
ceph osd dump
ceph pg ls-by-pool POOL_NAME
example:
ceph pg ls-by-pool rbd
Run them a couple of times in a row and see if they are responsive or if they take a long time to complete.
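If it helps, something like this will time each call (a rough sketch, assuming the ceph CLI is configured on the node you SSH to; replace "rbd" with your own pool names):

# Minimal timing sketch: run the two commands a few times in a row and
# print the wall-clock time of each call.
import subprocess
import time

COMMANDS = [
    ["ceph", "osd", "dump"],
    ["ceph", "pg", "ls-by-pool", "rbd"],  # replace "rbd" with your pool name
]

for cmd in COMMANDS:
    for i in range(5):
        start = time.time()
        subprocess.check_output(cmd)
        print("%-30s run %d: %.2f s" % (" ".join(cmd), i + 1, time.time() - start))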
hjallisnorra
19 Posts
January 7, 2020, 4:57 pm
Yes, they take a long time to respond:
root@av-petasan-mgm-ash1-001:~# time ceph pg ls-by-pool SATA1T3 | wc
8195 155688 1761710
real 0m16.016s
user 0m0.559s
sys 0m0.052s
root@av-petasan-mgm-ash1-001:~# time ceph pg ls-by-pool SATA1T3 | wc
8195 155688 1761710
real 0m2.016s
user 0m0.559s
sys 0m0.052s
root@av-petasan-mgm-ash1-001:~# time ceph pg ls-by-pool SATA1T3 | wc
8195 155688 1761710
real 0m3.540s
user 0m0.524s
sys 0m0.113s
root@av-petasan-mgm-ash1-001:~# time ceph pg ls-by-pool SATA1T3 | wc
8195 155688 1761710
real 0m3.206s
user 0m0.609s
sys 0m0.052s
root@av-petasan-mgm-ash1-001:~# time ceph pg ls-by-pool SATA1T3 | wc
8195 155688 1761710
real 0m5.542s
user 0m0.496s
sys 0m0.096s
root@av-petasan-mgm-ash1-001:~# time ceph pg ls-by-pool SATA1T3 | wc
8195 155688 1761710
real 0m3.810s
user 0m0.507s
sys 0m0.052s
The first run took a long time, then the subsequent runs were quicker.
hjallisnorra
19 Posts
January 7, 2020, 5:25 pm
But running "ceph osd dump" is always within 1 sec.
root@av-petasan-mgm-ash1-001:~# time ceph osd dump | wc
385 6367 95761
real 0m0.700s
user 0m0.603s
sys 0m0.048s
But running "ceph osd dump" is always within 1 sec.
root@av-petasan-mgm-ash1-001:~# time ceph osd dump | wc
385 6367 95761
real 0m0.700s
user 0m0.603s
sys 0m0.048s
admin
2,930 Posts
January 7, 2020, 5:57 pm
This is taking a long time. Apart from checking hardware speed and network connections, you can increase the timeout as follows:
line 72 in /usr/lib/python2.7/dist-packages/PetaSAN/core/ceph/pool_checker.py
class PoolChecker():
    def __init__(self, timeout=5.0):
        self.timeout = timeout
Change timeout=5.0 to, for example, timeout=30.0.
Even if this fixes the issue, I cannot say all is OK, as it could be masking some other issue.
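For reference, after the change the constructor would read like this (only the edited lines shown; the rest of the class stays as it is):

class PoolChecker():
    def __init__(self, timeout=30.0):  # was timeout=5.0
        self.timeout = timeout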
hjallisnorra
19 Posts
January 8, 2020, 11:26 am
Thank you, this is working.
hjallisnorra
19 Posts
January 16, 2020, 11:46 am
Hi again,
we have the same problem with the iSCSI path assignment list: it keeps fluctuating between being shown and not shown.
We are not seeing any actual problems; it is just frustrating when the list is populated, then empty, then populated, then empty again.
I was wondering if there is another timeout we can adjust to get rid of the fluctuation?
Thanks in advance
hjalli.
admin
2,930 Posts
January 16, 2020, 12:15 pm
It is the same timeout used for both.
Basically, Ceph will block if a pool is not responding, for example because the pool could be in the process of recovery. From a UI we need to specify a timeout so we do not hang forever, but it should also not be too long, so that if a pool is down we do not delay the UI too much. We chose 5 sec and have not seen issues with this value.
If you wish you can increase the timeout beyond your current 30 sec, which is already quite high. As indicated earlier, increasing the timeout could mask the root cause of why your setup is taking so long to respond; as indicated, it could be underpowered hardware or network issues.
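The idea is roughly the following (a generic Python 3 sketch, not the actual PetaSAN code): run the potentially blocking query with a client-side timeout and treat expiry as the pool not responding, which the UI then shows as inactive:

import subprocess

def pool_is_responsive(pool, timeout_sec=5.0):
    # Run the blocking ceph query with a bound; if it does not answer in time,
    # report the pool as unresponsive instead of hanging the UI forever.
    try:
        subprocess.run(["ceph", "pg", "ls-by-pool", pool],
                       stdout=subprocess.DEVNULL, check=True, timeout=timeout_sec)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False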