
Moving management interface to a bond after cluster is built.


Hey there,

I've been trying some new things and have run into some odd behavior, and I'm curious whether anyone else has had this issue. I have a fully built cluster that did not use bonding: it had a single interface for the back-end network and a single interface for management. I decided I wanted to create a bond that uses both. To avoid any weird syntax errors, I created a VM, installed PetaSAN on it with the exact configuration I wanted for my already-built cluster, and used the cluster_info.json file it produced.

So I made the required edits to my cluster_info.json, and the file now looks like this:

{
    "backend_1_base_ip": "10.10.1.1",
    "backend_1_eth_name": "bond0",
    "backend_1_mask": "255.255.255.0",
    "backend_1_vlan_id": "",
    "backend_2_base_ip": "",
    "backend_2_eth_name": "",
    "backend_2_mask": "",
    "backend_2_vlan_id": "",
    "bonds": [
        {
            "interfaces": "eth0,eth1",
            "is_jumbo_frames": false,
            "mode": "active-backup",
            "name": "bond0",
            "primary_interface": "eth0"
        }
    ],
    "default_pool": "both",
    "default_pool_pgs": "256",
    "default_pool_replicas": "2",
    "eth_count": 4,
    "jf_mtu_size": "",
    "jumbo_frames": [],
    "management_eth_name": "bond0",
    "management_nodes": [
        {
            "backend_1_ip": "10.10.1.1",
            "backend_2_ip": "",
            "is_backup": false,
            "is_cifs": true,
            "is_iscsi": true,
            "is_management": true,
            "is_nfs": true,
            "is_storage": true,
            "management_ip": "192.168.202.101",
            "name": "petasan01"
        },
        {
            "backend_1_ip": "10.10.1.2",
            "backend_2_ip": "",
            "is_backup": false,
            "is_cifs": false,
            "is_iscsi": true,
            "is_management": true,
            "is_nfs": false,
            "is_storage": true,
            "management_ip": "192.168.202.102",
            "name": "petasan02"
        },
        {
            "backend_1_ip": "10.10.1.3",
            "backend_2_ip": "",
            "is_backup": false,
            "is_cifs": true,
            "is_iscsi": true,
            "is_management": true,
            "is_nfs": true,
            "is_storage": true,
            "management_ip": "192.168.202.103",
            "name": "petasan03"
        }
    ],
    "name": "petademo",
    "storage_engine": "bluestore"
}
Next, I rebooted all of the nodes. The cluster came back up healthy and the bond was created; however, the cluster now has some very odd quirks. The first is that if I go into the web UI, open the Nodes List, and click the Settings button, it just sits there until it finally times out with "Bad gateway".
Node 1 is also acting strangely: I can't access the web UI via node 1's management IP, and the Nodes List in the web UI shows node 1 as down. Ceph, however, shows node 1 as up and working properly: all OSDs are up, and I can run Ceph commands from node 1.
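For reference, the bond itself can be sanity-checked at the OS level with a couple of standard Linux commands (illustrative only, nothing PetaSAN-specific):

cat /proc/net/bonding/bond0    # bonding driver status: mode, active slave, link state of eth0/eth1
ip addr show bond0             # confirm the management and backend IPs both landed on the bond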
Any input would be great!
Thanks

Also, here is my ORIGINAL cluster_info.json, which all nodes used before I made the changes shown above:

 

{
    "backend_1_base_ip": "10.10.1.1",
    "backend_1_eth_name": "eth1",
    "backend_1_mask": "255.255.255.0",
    "backend_1_vlan_id": "",
    "backend_2_base_ip": "",
    "backend_2_eth_name": "",
    "backend_2_mask": "",
    "backend_2_vlan_id": "",
    "bonds": [],
    "default_pool": "both",
    "default_pool_pgs": "256",
    "default_pool_replicas": "2",
    "eth_count": 4,
    "jf_mtu_size": "",
    "jumbo_frames": [],
    "management_eth_name": "eth0",
    "management_nodes": [
        {
            "backend_1_ip": "10.10.1.1",
            "backend_2_ip": "",
            "is_backup": false,
            "is_cifs": true,
            "is_iscsi": true,
            "is_management": true,
            "is_nfs": true,
            "is_storage": true,
            "management_ip": "192.168.202.101",
            "name": "petasan01"
        },
        {
            "backend_1_ip": "10.10.1.2",
            "backend_2_ip": "",
            "is_backup": false,
            "is_cifs": false,
            "is_iscsi": true,
            "is_management": true,
            "is_nfs": false,
            "is_storage": true,
            "management_ip": "192.168.202.102",
            "name": "petasan02"
        },
        {
            "backend_1_ip": "10.10.1.3",
            "backend_2_ip": "",
            "is_backup": false,
            "is_cifs": true,
            "is_iscsi": true,
            "is_management": true,
            "is_nfs": true,
            "is_storage": true,
            "management_ip": "192.168.202.103",
            "name": "petasan03"
        }
    ],
    "name": "petademo",
    "storage_engine": "bluestore"
}

If you SSH to the nodes via the backend network, can the nodes ping each other over the management IPs?
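For example, using the addresses from the posted cluster_info.json, something like:

ssh root@10.10.1.2               # onto node 2 over the backend network (user shown only as an example)
ping -c 3 192.168.202.101        # from node 2, ping node 1's management IP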

Yes I can.

I can also SSH into node 1 via its management IP.
PetaSAN.log has some connection-refused errors, such as these:

18/01/2021 11:55:48 INFO Checking backend latencies :
18/01/2021 11:55:48 INFO Network latency for backend 10.10.1.1 =
18/01/2021 11:55:48 INFO Network latency for backend 10.10.1.2 =
18/01/2021 11:55:48 INFO Network latency for backend 10.10.1.3 = 24.4 us
18/01/2021 11:55:50 WARNING Retrying (Retry(total=5, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2a200b06d8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Disks/
18/01/2021 11:55:52 WARNING Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2a200b0a58>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Disks/
18/01/2021 11:55:56 WARNING Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2a200b0c18>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Disks/
18/01/2021 11:56:04 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f2a200b0dd8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Disks/
18/01/2021 11:56:08 WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f76bf4074a8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Services/ClusterLeader?index=3914040&wait=20s
18/01/2021 11:56:10 WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff0d5ebda58>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Config/Files?index=150&recurse=1

 

This is only happening on node 1. All other nodes are quiet.

So I have tracked all of the odd behavior down to node 1.
If I shut node 1 down, I can go into the Settings page of the remaining nodes from within the Nodes List and see all the information as normal.

 

The problem seems to come down to the refused connections that happen while node 1 is up.
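Port 8500 in those refused connections is Consul's local HTTP API, so one quick check on node 1 (illustrative) is whether anything is listening there at all:

ss -tlnp | grep 8500    # nothing listening here would explain the 'Connection refused' errors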

Sorry, just going to dump some more errors I am seeing in PetaSAN.log.

 

It appears that, for some reason, node 1 is having issues running, or getting information from, some of the scripts that PetaSAN runs when a node starts up. It is very strange, though, since the other two nodes are fine.

Here is the dump:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/PetaSAN/backend/file_sync_manager.py", line 81, in sync
    index, data = base.watch(self.root_path, current_index)
  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/base.py", line 77, in watch
    index, data = cons.kv.get(key, index=current_index, recurse=True)
  File "/usr/lib/python3/dist-packages/consul/base.py", line 554, in get
    params=params)
  File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 85, in get
    raise e
  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 70, in get
    res = self.response(self.session.get(uri, verify=self.verify))
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 520, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 630, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 508, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=8500): Max retries exceeded with url: /v1/kv/PetaSAN/Config/Files?index=150&recurse=1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f364b4a7ac8>: Failed to establish a new connection: [Errno 111] Connection refused',))
18/01/2021 13:26:17 WARNING Retrying (Retry(total=5, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f364b59c080>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Config/Files?index=150&recurse=1
18/01/2021 13:26:19 WARNING Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f364b59c0b8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Config/Files?index=150&recurse=1
18/01/2021 13:26:23 WARNING Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f364b59cf98>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Config/Files?index=150&recurse=1
18/01/2021 13:26:31 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f364b567eb8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Config/Files?index=150&recurse=1
18/01/2021 13:26:33 WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdb0b435c18>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/session/list
18/01/2021 13:26:39 WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f33fb9569b0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Services/ClusterLeader?index=3914040&wait=20s

 

Can you run

consul members

on all 3 nodes?

Do you see any ping latency on the backend network from/to node 1 with the other nodes?
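For example, with the backend addresses from the posted cluster_info.json, from node 1 that would be something like:

consul members          # all three nodes should show a status of "alive"
ping -c 5 10.10.1.2     # backend latency to node 2
ping -c 5 10.10.1.3     # backend latency to node 3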

Hey!

I know I am double-posting like crazy here, but I have been able to track down the issue and fix it.

 

It appears that when the management and back-end networks were switched from single interfaces to a bond, node 1 was unable to rejoin the Consul cluster. The other nodes rejoined successfully, which is what threw me off.

I was able to fix it by manually re-joining the Consul cluster with the following command:

consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind node1backendIP -retry-join node2backendIP -retry-join node3backendIP
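With the backend IPs from the cluster_info.json above, that works out to something like the following on node 1 (illustrative; each node binds to its own backend address and retry-joins the other two):

consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind 10.10.1.1 -retry-join 10.10.1.2 -retry-join 10.10.1.3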

In case anyone else ever has this issue!

 

Quote from admin on January 18, 2021, 5:43 pm

Can you run

consul members

on all 3 nodes?

Do you see any ping latency on the backend network from/to node 1 with the other nodes?

Haha, that would have helped me! Fortunately, I ended up figuring that out, and it was indeed the issue. Thanks for the help!

It looks as though the reason you have to run the manual bind command is that Consul tries to use a loopback address for its own address; now that there is more than one IP on the network interface, Consul ends up trying to use the management IP rather than the backend one.

Running the bind command manually is great for getting things up and running again, but it's not a long-term solution, as it does not persist across reboots.
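One possible way to make the bind address stick (untested here, and PetaSAN may regenerate its Consul configuration on boot, so treat this purely as a sketch) would be to pin bind_addr in the server config directory the manual command already points at, e.g. a small file such as /opt/petasan/config/etc/consul.d/server/bind.json (the file name is just an example) containing:

{
    "bind_addr": "10.10.1.1"
}

Consul picks up the .json files in its -config-dir, so each node would carry its own backend IP here.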
