

Deploy node 3 long time


Hi,

I'm having an issue with node 3: the deployment takes a very long time. Below is the log from PetaSAN.log. Each server has 22 x 1TB disks, but I'm only using 2 disks on each server (1-3). Is the log below normal?

The deployment wizard is very fast for nodes 1 and 2, but node 3 takes a very, very long time and never finishes.

 

root@ceph-03:~# cat /opt/petasan/log/PetaSAN.log 

 17/12/2020 16:24:00 INFO     Start settings IPs

 17/12/2020 17:08:25 ERROR    Config file error. The PetaSAN os maybe just installed.

Traceback (most recent call last):

  File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 69, in get_node_status

    node_name = config.get_node_info().name

  File "/usr/lib/python3/dist-packages/PetaSAN/core/cluster/configuration.py", line 99, in get_node_info

    with open(config.get_node_info_file_path(), 'r') as f:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/petasan/config/node_info.json'

 17/12/2020 17:08:51 INFO     Starting node join

 17/12/2020 17:08:51 INFO     Successfully copied public keys.

 17/12/2020 17:08:51 INFO     Successfully copied private keys.

 17/12/2020 17:08:51 INFO     password set successfully.

 17/12/2020 17:08:52 INFO     Start copying  cluster info file.

 17/12/2020 17:08:52 INFO     Successfully copied cluster info file.

 17/12/2020 17:08:52 INFO     Start copying  services interfaces file.

 17/12/2020 17:08:52 INFO     Successfully copied services interfaces file.

 17/12/2020 17:08:52 INFO     Joined cluster CEPH-CLUSTER

 17/12/2020 17:09:19 ERROR    400 Bad Request: The browser (or proxy) sent a request that this server could not understand.

 17/12/2020 17:09:19 ERROR    400 Bad Request: The browser (or proxy) sent a request that this server could not understand.

 17/12/2020 17:09:19 ERROR    400 Bad Request: The browser (or proxy) sent a request that this server could not understand.

 17/12/2020 17:09:19 ERROR    400 Bad Request: The browser (or proxy) sent a request that this server could not understand.

 17/12/2020 17:09:19 INFO     Set node role completed successfully.

 17/12/2020 17:09:20 INFO     Set node info completed successfully.

 17/12/2020 17:07:48 INFO     Stopping petasan services on all nodes.

 17/12/2020 17:07:48 INFO     Stopping all petasan services.

 17/12/2020 17:07:48 INFO     files_sync.py process is 4157

 17/12/2020 17:07:48 INFO     files_sync.py process stopped

 17/12/2020 17:07:48 INFO     iscsi_service.py process is 4159

 17/12/2020 17:07:48 INFO     iscsi_service.py process stopped

 17/12/2020 17:07:48 INFO     admin.py process is 4161

 17/12/2020 17:07:48 INFO     admin.py process stopped

 17/12/2020 17:07:49 INFO     Starting local clean_ceph.

 17/12/2020 17:07:49 INFO     Starting clean_ceph

 17/12/2020 17:07:49 INFO     Stopping ceph services

 17/12/2020 17:07:49 INFO     Start cleaning config files

 17/12/2020 17:07:49 INFO     Starting ceph services

 17/12/2020 17:07:50 INFO     Starting local clean_consul.

 17/12/2020 17:07:50 INFO     Trying to clean Consul on local node

 17/12/2020 17:07:50 INFO     delete /opt/petasan/config/etc/consul.d

 17/12/2020 17:07:50 INFO     delete /opt/petasan/config/var/consul

 17/12/2020 17:07:50 INFO     Trying to clean Consul on 192.168.74.21

 17/12/2020 17:07:51 INFO     Trying to clean Consul on 192.168.74.22

 17/12/2020 17:07:52 INFO     cluster_name: CEPH-CLUSTER

 17/12/2020 17:07:52 INFO     local_node_info.name: ceph-03

 17/12/2020 17:07:52 INFO     Start consul leaders remotely.

 17/12/2020 17:07:55 INFO     str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.16.91.23  -retry-join 172.16.91.21 -retry-join 172.16.91.22

 17/12/2020 17:07:56 INFO     Cluster Node {} joined the cluster and is aliveceph-03

 17/12/2020 17:07:56 INFO     Cluster Node {} joined the cluster and is aliveceph-01

 17/12/2020 17:07:56 INFO     Cluster Node {} joined the cluster and is aliveceph-02

 17/12/2020 17:07:56 INFO     Consul leaders are ready

 17/12/2020 17:08:06 INFO     NFSServer : Changing NFS Settings in Consul

 17/12/2020 17:08:06 INFO     NFSServer : NFS Settings has been changed in Consul

 17/12/2020 17:08:45 INFO     Checking backend latencies :

 17/12/2020 17:08:45 INFO     Network latency for backend 172.16.91.21 = 

 17/12/2020 17:08:45 INFO     Network latency for backend 172.16.91.22 = 

 17/12/2020 17:09:37 INFO     Checking backend latencies :

 17/12/2020 17:09:37 INFO     Network latency for backend 172.16.91.21 = 

 17/12/2020 17:09:37 INFO     Network latency for backend 172.16.91.22 = 

 17/12/2020 17:10:31 INFO     Checking backend latencies :

 17/12/2020 17:10:31 INFO     Network latency for backend 172.16.91.21 = 

 17/12/2020 17:10:31 INFO     Network latency for backend 172.16.91.22 = 

 17/12/2020 17:10:31 ERROR    set_key, could not set key: PetaSAN/CIFS

 17/12/2020 17:10:31 ERROR    

Traceback (most recent call last):

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 631, in set_key

    consul_obj.put(key, val)

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/base.py", line 162, in put

    token=token, dc=dc)

  File "/usr/lib/python3/dist-packages/consul/base.py", line 621, in put

    CB.json(), '/v1/kv/%s' % key, params=params, data=value)

  File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f

    return Retrying(*dargs, **dkw).call(f, *args, **kw)

  File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call

    raise attempt.get()

  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get

    six.reraise(self.value[0], self.value[1], self.value[2])

  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise

    raise value

  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call

    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 107, in put

    raise e

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 96, in put

    raise RetryConsulException()

PetaSAN.core.consul.ps_consul.RetryConsulException

 17/12/2020 17:10:31 ERROR    GeneralConsulError

Traceback (most recent call last):

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 631, in set_key

    consul_obj.put(key, val)

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/base.py", line 162, in put

    token=token, dc=dc)

  File "/usr/lib/python3/dist-packages/consul/base.py", line 621, in put

    CB.json(), '/v1/kv/%s' % key, params=params, data=value)

  File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f

    return Retrying(*dargs, **dkw).call(f, *args, **kw)

  File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call

    raise attempt.get()

  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get

    six.reraise(self.value[0], self.value[1], self.value[2])

  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise

    raise value

  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call

    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 107, in put

    raise e

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 96, in put

    raise RetryConsulException()

PetaSAN.core.consul.ps_consul.RetryConsulException

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 1302, in move_services_interfaces_to_consul

    manage_cifs.save_cifs_base_settings(cifs_settings)

  File "/usr/lib/python3/dist-packages/PetaSAN/backend/manage_cifs.py", line 282, in save_cifs_base_settings

    cifs_server.set_consul_base_settings(base_settings)

  File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 840, in set_consul_base_settings

    self.set_consul_cifs_settings(cifs_settings)

  File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 821, in set_consul_cifs_settings

    ConsulAPI().set_key(self.CONSUL_KEY, settings.write_json())

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 637, in set_key

    raise ConsulException(ConsulException.GENERAL_EXCEPTION, 'GeneralConsulError')

PetaSAN.core.common.CustomException.ConsulException: GeneralConsulError

 17/12/2020 17:10:32 INFO     First monitor started successfully

 17/12/2020 17:10:32 INFO     create_mgr() fresh install

 17/12/2020 17:10:32 INFO     create_mgr() started

 17/12/2020 17:10:32 INFO     create_mgr() cmd : mkdir -p /var/lib/ceph/mgr/ceph-ceph-03

 17/12/2020 17:10:32 INFO     create_mgr() cmd : ceph --cluster ceph auth get-or-create mgr.ceph-03  mon  'allow profile mgr' osd 'allow *' mds 'allow *'   -o /var/lib/ceph/mgr/ceph-ceph-03/keyring

 17/12/2020 17:10:46 INFO     create_mgr() ended successfully

 17/12/2020 17:10:46 INFO     create_mds() started

 17/12/2020 17:10:46 INFO     create_mds() cmd :  mkdir -p /var/lib/ceph/mds/ceph-ceph-03

 17/12/2020 17:11:16 INFO     create_mds() ended successfully

 17/12/2020 17:11:16 INFO     Starting to deploy remote monitors

root@ceph-03:~# 

I would recommend you check your network settings: make sure the IPs are correct, the subnets do not overlap, and the switch ports are correct, then re-install once more. If the problem persists, let us know. If I understand correctly you did install the cluster before; if so, try to see what changed.

If you retry and still have an issue, post the log of the 3rd node and let us know whether you get a final error displayed in the UI (and what that error says), or whether it just gets stuck.

Yes, node 3 will take more time as it builds the cluster, so it could be 5-15 minutes.
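Since the backend latency values in the posted log come back empty, it is also worth manually checking basic backend reachability from node 3 before re-installing. A minimal sketch, assuming the backend IPs shown in the log (adjust to your own addressing; the jumbo-frame payload size is only an example):

# run on ceph-03, sourcing from its backend IP
ping -c 5 -I 172.16.91.23 172.16.91.21
ping -c 5 -I 172.16.91.23 172.16.91.22

# if jumbo frames are configured, also verify the MTU end to end
ping -c 5 -M do -s 8972 -I 172.16.91.23 172.16.91.21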

 

I had a similar issue when I was first trying to set up. Had the disks been used previously? I had to make sure they were completely erased/wiped before the cluster would build.

 

I had to make sure they were completely erased/wiped before the cluster would build.

This used to be an issue a while back. We now use wipefs, ceph-volume lvm zap and dd to prepare disks, so it is a bit better... still, some disks (mostly from FreeBSD/FreeNAS/ZFS) are not cleaned well by the above tools and may require the user to wipe the entire disk.
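If a disk still refuses to come in clean, wiping it by hand before re-installing is a reasonable workaround. A rough sketch using the same tools named above (replace /dev/sdX with the actual device, and triple-check you have the right disk first):

# remove filesystem / RAID / ZFS signatures
wipefs -a /dev/sdX

# zap any leftover Ceph LVM metadata
ceph-volume lvm zap --destroy /dev/sdX

# overwrite the start of the disk where most labels live
# (ZFS also keeps labels at the end of the disk, which may need the same treatment)
dd if=/dev/zero of=/dev/sdX bs=1M count=200 oflag=direct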

Hi,

Yes, earlier I tested in my lab using VMware (only 1 interface on each VM) and it was successful. Now I'm testing on bare-metal servers with multiple interfaces and it still fails. I have already wiped all the disks and nothing changed. On the network side I already allowed the ports required from management to reach the backend (TCP 3300 and TCP 6789), and I have tried the installation 3 times on the 3 servers. There is a new error now, shown below, from syslog:

 

Dec 20 17:48:32 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.22:8300 172.16.91.22:8300}: read tcp 172.16.91.23:59613->172.16.91.22:8300: i/o timeout
Dec 20 17:48:32 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.21:8300 172.16.91.21:8300}: read tcp 172.16.91.23:59649->172.16.91.21:8300: i/o timeout
Dec 20 17:48:52 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.22:8300 172.16.91.22:8300}: read tcp 172.16.91.23:45127->172.16.91.22:8300: i/o timeout
Dec 20 17:48:52 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.21:8300 172.16.91.21:8300}: read tcp 172.16.91.23:60339->172.16.91.21:8300: i/o timeout
Dec 20 17:49:13 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.22:8300 172.16.91.22:8300}: read tcp 172.16.91.23:48795->172.16.91.22:8300: i/o timeout
Dec 20 17:49:13 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.21:8300 172.16.91.21:8300}: read tcp 172.16.91.23:52391->172.16.91.21:8300: i/o timeout

 

From PetaSAN.log:

root@ceph-03:~# cat /opt/petasan/log/PetaSAN.log
18/12/2020 22:48:16 INFO Start settings IPs
20/12/2020 16:30:29 ERROR Config file error. The PetaSAN os maybe just installed.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 69, in get_node_status
node_name = config.get_node_info().name
File "/usr/lib/python3/dist-packages/PetaSAN/core/cluster/configuration.py", line 99, in get_node_info
with open(config.get_node_info_file_path(), 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/petasan/config/node_info.json'
20/12/2020 16:30:43 INFO Starting node join
20/12/2020 16:31:13 ERROR Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f7f7c71bb38>
command: /usr/bin/scp
args: ['/usr/bin/scp', '-o', 'StrictHostKeyChecking=no', 'root@192.168.74.21:/root/.ssh/id_rsa.pub', '/root/.ssh/id_rsa.pub']
buffer (last 100 chars): b''
before (last 100 chars): b''
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 8205
child_fd: 9
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
0: re.compile("b'Are you sure you want to continue connecting'")
1: re.compile("b'password:'")
2: EOF
20/12/2020 16:31:13 ERROR Error while copying keys or setting password.
20/12/2020 16:35:30 INFO Starting node join
20/12/2020 16:35:30 INFO Successfully copied public keys.
20/12/2020 16:35:30 INFO Successfully copied private keys.
20/12/2020 16:35:30 INFO password set successfully.
20/12/2020 16:35:31 INFO Start copying cluster info file.
20/12/2020 16:35:31 INFO Successfully copied cluster info file.
20/12/2020 16:35:31 INFO Start copying services interfaces file.
20/12/2020 16:35:31 INFO Successfully copied services interfaces file.
20/12/2020 16:35:31 INFO Joined cluster CEPH_CLUSTER
20/12/2020 16:37:15 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
20/12/2020 16:37:15 INFO Set node role completed successfully.
20/12/2020 16:37:15 INFO Set node info completed successfully.
20/12/2020 16:38:10 INFO Stopping petasan services on all nodes.
20/12/2020 16:38:10 INFO Stopping all petasan services.
20/12/2020 16:38:10 INFO files_sync.py process is 9897
20/12/2020 16:38:10 INFO files_sync.py process stopped
20/12/2020 16:38:10 INFO iscsi_service.py process is 9899
20/12/2020 16:38:10 INFO iscsi_service.py process stopped
20/12/2020 16:38:10 INFO admin.py process is 9901
20/12/2020 16:38:10 INFO admin.py process stopped
20/12/2020 16:38:11 INFO Starting local clean_ceph.
20/12/2020 16:38:11 INFO Starting clean_ceph
20/12/2020 16:38:11 INFO Stopping ceph services
20/12/2020 16:38:11 INFO Start cleaning config files
20/12/2020 16:38:11 INFO Starting ceph services
20/12/2020 16:38:12 INFO Starting local clean_consul.
20/12/2020 16:38:12 INFO Trying to clean Consul on local node
20/12/2020 16:38:12 INFO delete /opt/petasan/config/etc/consul.d
20/12/2020 16:38:12 INFO delete /opt/petasan/config/var/consul
20/12/2020 16:38:12 INFO Trying to clean Consul on 192.168.74.21
20/12/2020 16:38:13 INFO Trying to clean Consul on 192.168.74.22
20/12/2020 16:38:13 INFO cluster_name: CEPH_CLUSTER
20/12/2020 16:38:13 INFO local_node_info.name: ceph-03
20/12/2020 16:38:14 INFO Start consul leaders remotely.
20/12/2020 16:38:17 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.16.91.23 -retry-join 172.16.91.21 -retry-join 172.16.91.22
20/12/2020 16:38:18 INFO Cluster Node {} joined the cluster and is aliveceph-03
20/12/2020 16:38:18 INFO Cluster Node {} joined the cluster and is aliveceph-01
20/12/2020 16:38:18 INFO Cluster Node {} joined the cluster and is aliveceph-02
20/12/2020 16:38:18 INFO Consul leaders are ready
20/12/2020 16:38:28 INFO NFSServer : Changing NFS Settings in Consul
20/12/2020 16:38:28 INFO NFSServer : NFS Settings has been changed in Consul
root@ceph-03:~#

 

root@ceph-03:~# telnet -b 172.16.91.23 172.16.91.21 8300
Trying 172.16.91.21...
Connected to 172.16.91.21.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@ceph-03:~# telnet -b 172.16.91.23 172.16.91.22 8300
Trying 172.16.91.22...
Connected to 172.16.91.22.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@ceph-03:~#

 

From syslog on the ceph-01 node:

Dec 20 17:19:56 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:60875: read: connection reset by peer
Dec 20 17:25:01 ceph-01 CRON[11592]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:33:08 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:33239: read: connection reset by peer
Dec 20 17:35:01 ceph-01 CRON[11670]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:45:01 ceph-01 CRON[11752]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:47:45 ceph-01 lfd[10693]: *SSH login* from 10.3.1.78 into the root account using password authentication - ignored
Dec 20 17:55:01 ceph-01 CRON[11853]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:57:51 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:42523: read: connection reset by peer
Dec 20 18:00:14 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:55521: read: connection reset by peer

 

I don't know why node 3 is still processing and never finishes, while nodes 1 and 2 are done.

 

It looks like a network connectivity issue. Double-check your setup and IPs, and make sure the subnets do not overlap.

On the network side I already allowed the ports required from management to reach the backend (TCP 3300 and TCP 6789)

I'm not sure what you mean, but you should not block any ports; the system uses more ports than those.
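Rather than whitelisting individual ports, it is simpler to confirm nothing is filtering backend traffic at all. A quick sketch (it assumes iptables/nftables and tcpdump are available; the interface name is a placeholder):

# look for any packet-filtering rules on each node
iptables -L -n -v
nft list ruleset

# watch the Consul server port on the backend interface while node 3 deploys
tcpdump -ni <backend-interface> tcp port 8300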

Quote from admin on December 20, 2020, 3:35 pm

It looks like a network connectivity issue. Double-check your setup and IPs, and make sure the subnets do not overlap.

On the network side I already allowed the ports required from management to reach the backend (TCP 3300 and TCP 6789)

I'm not sure what you mean, but you should not block any ports; the system uses more ports than those.

I'm sure there is no overlap. Should management and backend be on the same subnet?

 

No, management and backend should be on different subnets.

All sorted now, I can see the PetaSAN dashboard for the first time 😀

The installation was done without LACP, and I have some questions:

  • Can I directly edit the cluster_info.json file to switch everything to LACP?
  • On which node should I make the change? Or do I need to change cluster_info.json on each server one by one to convert to LACP, and then reboot the servers one by one?
  • I tried to add a new dummy interface and it worked, but how do I make it come up automatically when the server reboots? I tried files under /etc/systemd/network/ but without success.
  • Regarding SNMP and graphs: I see you are using Grafana, Zabbix and collectd. I have centralized monitoring with LibreNMS published into Grafana. Can I pull all the SNMP information, including Ceph cluster info, with my LibreNMS and publish it into my external Grafana, or can I only copy the Grafana dashboard files (JSON)?

 

 

1- Yes, you can change the cluster_info.json config file; you can modify the bond types there if you wish.

2- Since you change it manually, you need to copy it manually to all nodes.
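A minimal sketch of that workflow, assuming cluster_info.json sits next to node_info.json under /opt/petasan/config/ (verify the path on your install; the hostnames are the ones from this thread):

# edit on one management node
vi /opt/petasan/config/cluster_info.json

# push the same file to the other nodes
scp /opt/petasan/config/cluster_info.json root@ceph-02:/opt/petasan/config/
scp /opt/petasan/config/cluster_info.json root@ceph-03:/opt/petasan/config/

# then reboot the nodes one at a time, waiting for the cluster to come back healthy in between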

3- Not sure why you need dummy interfaces, but you can customise any network setup at boot by editing the following custom scripts:

/opt/petasan/scripts/custom/pre_start_network.sh
/opt/petasan/scripts/custom/post_start_network.sh
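For the dummy interface specifically, one option is to recreate it from the post-start script above using plain iproute2 commands; the interface name and address below are only placeholders:

# appended to /opt/petasan/scripts/custom/post_start_network.sh
ip link add dummy0 type dummy 2>/dev/null || true
ip addr add 10.10.10.1/24 dev dummy0
ip link set dummy0 up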

4- The chart data is stored in a Graphite database. It runs on port 8080 on one node out of the first 3 management nodes (active/backup). You can detect the current active Graphite server via:

/opt/petasan/scripts/util/get_cluster_leader.py

Note that for security, only localhost is allowed to access Graphite, so you would need to write a script that runs locally and exports the data via SNMP.
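A rough sketch of pulling the data out locally, assuming the standard Graphite render API on port 8080; the metric path is purely illustrative and has to be looked up in your own Graphite tree:

# find which of the first 3 nodes currently hosts graphite
/opt/petasan/scripts/util/get_cluster_leader.py

# on that node, query the render API locally and dump JSON for your exporter
curl "http://localhost:8080/render?target=PetaSAN.<cluster>.<metric>&from=-1h&format=json"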
