Deploy node 3 long time
pedro6161
36 Posts
December 17, 2020, 1:54 pm
Hi,
I'm having an issue with node 3: the deployment takes a very long time. Below is the log from PetaSAN.log. My servers have 22 x 1TB disks, but I only use 2 disks on each server (1-3). Is the log below normal?
For nodes 1 and 2 the deployment wizard was very fast, but node 3 takes a very long time and never finishes.
root@ceph-03:~# cat /opt/petasan/log/PetaSAN.log
17/12/2020 16:24:00 INFO Start settings IPs
17/12/2020 17:08:25 ERROR Config file error. The PetaSAN os maybe just installed.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 69, in get_node_status
node_name = config.get_node_info().name
File "/usr/lib/python3/dist-packages/PetaSAN/core/cluster/configuration.py", line 99, in get_node_info
with open(config.get_node_info_file_path(), 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/petasan/config/node_info.json'
17/12/2020 17:08:51 INFO Starting node join
17/12/2020 17:08:51 INFO Successfully copied public keys.
17/12/2020 17:08:51 INFO Successfully copied private keys.
17/12/2020 17:08:51 INFO password set successfully.
17/12/2020 17:08:52 INFO Start copying cluster info file.
17/12/2020 17:08:52 INFO Successfully copied cluster info file.
17/12/2020 17:08:52 INFO Start copying services interfaces file.
17/12/2020 17:08:52 INFO Successfully copied services interfaces file.
17/12/2020 17:08:52 INFO Joined cluster CEPH-CLUSTER
17/12/2020 17:09:19 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
17/12/2020 17:09:19 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
17/12/2020 17:09:19 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
17/12/2020 17:09:19 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
17/12/2020 17:09:19 INFO Set node role completed successfully.
17/12/2020 17:09:20 INFO Set node info completed successfully.
17/12/2020 17:07:48 INFO Stopping petasan services on all nodes.
17/12/2020 17:07:48 INFO Stopping all petasan services.
17/12/2020 17:07:48 INFO files_sync.py process is 4157
17/12/2020 17:07:48 INFO files_sync.py process stopped
17/12/2020 17:07:48 INFO iscsi_service.py process is 4159
17/12/2020 17:07:48 INFO iscsi_service.py process stopped
17/12/2020 17:07:48 INFO admin.py process is 4161
17/12/2020 17:07:48 INFO admin.py process stopped
17/12/2020 17:07:49 INFO Starting local clean_ceph.
17/12/2020 17:07:49 INFO Starting clean_ceph
17/12/2020 17:07:49 INFO Stopping ceph services
17/12/2020 17:07:49 INFO Start cleaning config files
17/12/2020 17:07:49 INFO Starting ceph services
17/12/2020 17:07:50 INFO Starting local clean_consul.
17/12/2020 17:07:50 INFO Trying to clean Consul on local node
17/12/2020 17:07:50 INFO delete /opt/petasan/config/etc/consul.d
17/12/2020 17:07:50 INFO delete /opt/petasan/config/var/consul
17/12/2020 17:07:50 INFO Trying to clean Consul on 192.168.74.21
17/12/2020 17:07:51 INFO Trying to clean Consul on 192.168.74.22
17/12/2020 17:07:52 INFO cluster_name: CEPH-CLUSTER
17/12/2020 17:07:52 INFO local_node_info.name: ceph-03
17/12/2020 17:07:52 INFO Start consul leaders remotely.
17/12/2020 17:07:55 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.16.91.23 -retry-join 172.16.91.21 -retry-join 172.16.91.22
17/12/2020 17:07:56 INFO Cluster Node {} joined the cluster and is aliveceph-03
17/12/2020 17:07:56 INFO Cluster Node {} joined the cluster and is aliveceph-01
17/12/2020 17:07:56 INFO Cluster Node {} joined the cluster and is aliveceph-02
17/12/2020 17:07:56 INFO Consul leaders are ready
17/12/2020 17:08:06 INFO NFSServer : Changing NFS Settings in Consul
17/12/2020 17:08:06 INFO NFSServer : NFS Settings has been changed in Consul
17/12/2020 17:08:45 INFO Checking backend latencies :
17/12/2020 17:08:45 INFO Network latency for backend 172.16.91.21 =
17/12/2020 17:08:45 INFO Network latency for backend 172.16.91.22 =
17/12/2020 17:09:37 INFO Checking backend latencies :
17/12/2020 17:09:37 INFO Network latency for backend 172.16.91.21 =
17/12/2020 17:09:37 INFO Network latency for backend 172.16.91.22 =
17/12/2020 17:10:31 INFO Checking backend latencies :
17/12/2020 17:10:31 INFO Network latency for backend 172.16.91.21 =
17/12/2020 17:10:31 INFO Network latency for backend 172.16.91.22 =
17/12/2020 17:10:31 ERROR set_key, could not set key: PetaSAN/CIFS
17/12/2020 17:10:31 ERROR
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 631, in set_key
consul_obj.put(key, val)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/base.py", line 162, in put
token=token, dc=dc)
File "/usr/lib/python3/dist-packages/consul/base.py", line 621, in put
CB.json(), '/v1/kv/%s' % key, params=params, data=value)
File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call
raise attempt.get()
File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 107, in put
raise e
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 96, in put
raise RetryConsulException()
PetaSAN.core.consul.ps_consul.RetryConsulException
17/12/2020 17:10:31 ERROR GeneralConsulError
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 631, in set_key
consul_obj.put(key, val)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/base.py", line 162, in put
token=token, dc=dc)
File "/usr/lib/python3/dist-packages/consul/base.py", line 621, in put
CB.json(), '/v1/kv/%s' % key, params=params, data=value)
File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call
raise attempt.get()
File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 107, in put
raise e
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 96, in put
raise RetryConsulException()
PetaSAN.core.consul.ps_consul.RetryConsulException
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 1302, in move_services_interfaces_to_consul
manage_cifs.save_cifs_base_settings(cifs_settings)
File "/usr/lib/python3/dist-packages/PetaSAN/backend/manage_cifs.py", line 282, in save_cifs_base_settings
cifs_server.set_consul_base_settings(base_settings)
File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 840, in set_consul_base_settings
self.set_consul_cifs_settings(cifs_settings)
File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 821, in set_consul_cifs_settings
ConsulAPI().set_key(self.CONSUL_KEY, settings.write_json())
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 637, in set_key
raise ConsulException(ConsulException.GENERAL_EXCEPTION, 'GeneralConsulError')
PetaSAN.core.common.CustomException.ConsulException: GeneralConsulError
17/12/2020 17:10:32 INFO First monitor started successfully
17/12/2020 17:10:32 INFO create_mgr() fresh install
17/12/2020 17:10:32 INFO create_mgr() started
17/12/2020 17:10:32 INFO create_mgr() cmd : mkdir -p /var/lib/ceph/mgr/ceph-ceph-03
17/12/2020 17:10:32 INFO create_mgr() cmd : ceph --cluster ceph auth get-or-create mgr.ceph-03 mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-ceph-03/keyring
17/12/2020 17:10:46 INFO create_mgr() ended successfully
17/12/2020 17:10:46 INFO create_mds() started
17/12/2020 17:10:46 INFO create_mds() cmd : mkdir -p /var/lib/ceph/mds/ceph-ceph-03
17/12/2020 17:11:16 INFO create_mds() ended successfully
17/12/2020 17:11:16 INFO Starting to deploy remote monitors
root@ceph-03:~#
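For reference, the traceback above shows a Consul key/value write (PetaSAN/CIFS) failing during deployment. A minimal sketch of how one could check the Consul cluster from node 3 before retrying, assuming the consul binary is on PATH, the local agent is running, and it listens on the default HTTP port 8500 (the test key name is made up):

# list the agents that have joined the Consul cluster
consul members
# check raft status; KV writes have to go through an elected leader
curl -s http://localhost:8500/v1/status/leader
curl -s http://localhost:8500/v1/status/peers
# try a test KV write and read, similar to what the deploy step does for PetaSAN/CIFS
curl -s -X PUT -d 'ok' http://localhost:8500/v1/kv/test-key
curl -s http://localhost:8500/v1/kv/test-key

If the PUT fails or hangs, Consul has no reachable leader, which usually points at backend network problems between the nodes.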
Last edited on December 17, 2020, 1:57 pm by pedro6161 · #1
admin
2,930 Posts
December 17, 2020, 2:26 pm
I would recommend you check your network settings: make sure IPs are correct, subnets do not overlap, and switch ports are correct, then re-install once more. If the problem persists, let us know. If I understand correctly you did install the cluster before; if so, try to see what changed.
If you retry and still have an issue, post the log of the 3rd node and let us know if you get a final error displayed in the UI (and what that error says) or whether it gets stuck.
Yes, node 3 will take more time as it builds the cluster, so it could be 5-15 min.
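As a quick sanity check of the addressing (the log's empty "Network latency for backend ... =" lines hint at a backend reachability problem), one could run something like the following on each node; a sketch only, with the backend IPs taken from the log above:

# show configured addresses and routes; management and backend subnets should not overlap
ip addr show
ip route show
# check reachability and latency to the other nodes' backend IPs
ping -c 3 172.16.91.21
ping -c 3 172.16.91.22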
neiltorda
98 Posts
December 17, 2020, 4:24 pm
I had a similar issue when I was first trying to set up. Had the disks been used previously? I had to make sure they were completely erased/wiped before the cluster would build.
admin
2,930 Posts
December 17, 2020, 9:26 pm
Quote from neiltorda:
I had to make sure they were completely erased/wiped before the cluster would build.
This used to be an issue a while back. Now we use wipefs, ceph lvm zap and dd to prepare disks, so it is a bit better... still, some disks (mostly from FreeBSD/FreeNAS/ZFS) are not cleaned well by the above tools and may require the user to wipe the entire disk.
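If a disk still carries old metadata (ZFS is the usual offender), a minimal sketch of wiping it by hand before redeploying, assuming /dev/sdX is the disk to be wiped (this destroys all data on it):

# remove filesystem and RAID signatures
wipefs --all /dev/sdX
# zap LVM/Ceph metadata (this is the "ceph lvm zap" mentioned above)
ceph-volume lvm zap --destroy /dev/sdX
# overwrite the start of the disk; note ZFS also keeps labels at the end of the
# device, so a full-disk dd may be needed in stubborn cases
dd if=/dev/zero of=/dev/sdX bs=1M count=1024 oflag=direct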
Last edited on December 17, 2020, 9:28 pm by admin · #4
pedro6161
36 Posts
December 20, 2020, 2:00 pm
Hi,
Yes, earlier I tested in my lab using VMware (only 1 interface on each VM) and it was successful. Now I'm testing on bare metal servers with multiple interfaces and it still fails. I already wiped all the disks and nothing changed. On the network side I already allowed the required ports from management to reach the backend (TCP 3300 and TCP 6789), and I have already tried the installation 3 times on the 3 servers. There is a new error now, shown below from syslog:
Dec 20 17:48:32 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.22:8300 172.16.91.22:8300}: read tcp 172.16.91.23:59613->172.16.91.22:8300: i/o timeout
Dec 20 17:48:32 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.21:8300 172.16.91.21:8300}: read tcp 172.16.91.23:59649->172.16.91.21:8300: i/o timeout
Dec 20 17:48:52 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.22:8300 172.16.91.22:8300}: read tcp 172.16.91.23:45127->172.16.91.22:8300: i/o timeout
Dec 20 17:48:52 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.21:8300 172.16.91.21:8300}: read tcp 172.16.91.23:60339->172.16.91.21:8300: i/o timeout
Dec 20 17:49:13 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.22:8300 172.16.91.22:8300}: read tcp 172.16.91.23:48795->172.16.91.22:8300: i/o timeout
Dec 20 17:49:13 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.21:8300 172.16.91.21:8300}: read tcp 172.16.91.23:52391->172.16.91.21:8300: i/o timeout
From PetaSAN.log:
root@ceph-03:~# cat /opt/petasan/log/PetaSAN.log
18/12/2020 22:48:16 INFO Start settings IPs
20/12/2020 16:30:29 ERROR Config file error. The PetaSAN os maybe just installed.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 69, in get_node_status
node_name = config.get_node_info().name
File "/usr/lib/python3/dist-packages/PetaSAN/core/cluster/configuration.py", line 99, in get_node_info
with open(config.get_node_info_file_path(), 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/petasan/config/node_info.json'
20/12/2020 16:30:43 INFO Starting node join
20/12/2020 16:31:13 ERROR Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f7f7c71bb38>
command: /usr/bin/scp
args: ['/usr/bin/scp', '-o', 'StrictHostKeyChecking=no', 'root@192.168.74.21:/root/.ssh/id_rsa.pub', '/root/.ssh/id_rsa.pub']
buffer (last 100 chars): b''
before (last 100 chars): b''
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 8205
child_fd: 9
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
0: re.compile("b'Are you sure you want to continue connecting'")
1: re.compile("b'password:'")
2: EOF
20/12/2020 16:31:13 ERROR Error while copying keys or setting password.
20/12/2020 16:35:30 INFO Starting node join
20/12/2020 16:35:30 INFO Successfully copied public keys.
20/12/2020 16:35:30 INFO Successfully copied private keys.
20/12/2020 16:35:30 INFO password set successfully.
20/12/2020 16:35:31 INFO Start copying cluster info file.
20/12/2020 16:35:31 INFO Successfully copied cluster info file.
20/12/2020 16:35:31 INFO Start copying services interfaces file.
20/12/2020 16:35:31 INFO Successfully copied services interfaces file.
20/12/2020 16:35:31 INFO Joined cluster CEPH_CLUSTER
20/12/2020 16:37:15 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
20/12/2020 16:37:15 INFO Set node role completed successfully.
20/12/2020 16:37:15 INFO Set node info completed successfully.
20/12/2020 16:38:10 INFO Stopping petasan services on all nodes.
20/12/2020 16:38:10 INFO Stopping all petasan services.
20/12/2020 16:38:10 INFO files_sync.py process is 9897
20/12/2020 16:38:10 INFO files_sync.py process stopped
20/12/2020 16:38:10 INFO iscsi_service.py process is 9899
20/12/2020 16:38:10 INFO iscsi_service.py process stopped
20/12/2020 16:38:10 INFO admin.py process is 9901
20/12/2020 16:38:10 INFO admin.py process stopped
20/12/2020 16:38:11 INFO Starting local clean_ceph.
20/12/2020 16:38:11 INFO Starting clean_ceph
20/12/2020 16:38:11 INFO Stopping ceph services
20/12/2020 16:38:11 INFO Start cleaning config files
20/12/2020 16:38:11 INFO Starting ceph services
20/12/2020 16:38:12 INFO Starting local clean_consul.
20/12/2020 16:38:12 INFO Trying to clean Consul on local node
20/12/2020 16:38:12 INFO delete /opt/petasan/config/etc/consul.d
20/12/2020 16:38:12 INFO delete /opt/petasan/config/var/consul
20/12/2020 16:38:12 INFO Trying to clean Consul on 192.168.74.21
20/12/2020 16:38:13 INFO Trying to clean Consul on 192.168.74.22
20/12/2020 16:38:13 INFO cluster_name: CEPH_CLUSTER
20/12/2020 16:38:13 INFO local_node_info.name: ceph-03
20/12/2020 16:38:14 INFO Start consul leaders remotely.
20/12/2020 16:38:17 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.16.91.23 -retry-join 172.16.91.21 -retry-join 172.16.91.22
20/12/2020 16:38:18 INFO Cluster Node {} joined the cluster and is aliveceph-03
20/12/2020 16:38:18 INFO Cluster Node {} joined the cluster and is aliveceph-01
20/12/2020 16:38:18 INFO Cluster Node {} joined the cluster and is aliveceph-02
20/12/2020 16:38:18 INFO Consul leaders are ready
20/12/2020 16:38:28 INFO NFSServer : Changing NFS Settings in Consul
20/12/2020 16:38:28 INFO NFSServer : NFS Settings has been changed in Consul
root@ceph-03:~#
root@ceph-03:~# telnet -b 172.16.91.23 172.16.91.21 8300
Trying 172.16.91.21...
Connected to 172.16.91.21.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@ceph-03:~# telnet -b 172.16.91.23 172.16.91.22 8300
Trying 172.16.91.22...
Connected to 172.16.91.22.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@ceph-03:~#
From syslog on the ceph-01 node:
Dec 20 17:19:56 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:60875: read: connection reset by peer
Dec 20 17:25:01 ceph-01 CRON[11592]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:33:08 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:33239: read: connection reset by peer
Dec 20 17:35:01 ceph-01 CRON[11670]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:45:01 ceph-01 CRON[11752]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:47:45 ceph-01 lfd[10693]: *SSH login* from 10.3.1.78 into the root account using password authentication - ignored
Dec 20 17:55:01 ceph-01 CRON[11853]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:57:51 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:42523: read: connection reset by peer
Dec 20 18:00:14 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:55521: read: connection reset by peer
I don't know why node 3 is still processing and never finishes, while nodes 1 and 2 are done.
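The raft i/o timeouts above, together with the lfd entries in the ceph-01 syslog (lfd is normally part of the csf host firewall), suggest something on the nodes may be dropping or resetting Consul traffic on the backend network. A sketch of checks one could run (8300 is Consul's server RPC port; the 2-second nc timeout is arbitrary):

# look for host firewall rules affecting the backend subnet
iptables -S
# test the Consul server RPC port in both directions, e.g. from ceph-03:
nc -vz -w 2 172.16.91.21 8300
nc -vz -w 2 172.16.91.22 8300
# and from ceph-01 back to ceph-03:
nc -vz -w 2 172.16.91.23 8300

A single successful telnet/nc connect does not rule out a firewall: connection tracking or rate limiting can still reset longer-lived raft connections, which would match the "connection reset by peer" messages on ceph-01.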
Last edited on December 20, 2020, 2:04 pm by pedro6161 · #5
admin
2,930 Posts
December 20, 2020, 3:35 pm
Looks like a network connectivity issue. Double check your setup: IPs, and that subnets do not overlap.
Quote from pedro6161:
On the network side I already allowed the required ports from management to reach the backend (TCP 3300 and TCP 6789)
Not sure what you mean, but you should not block any ports; the system uses more ports than those.
pedro6161
36 Posts
December 20, 2020, 4:24 pm
Quote from admin on December 20, 2020, 3:35 pm:
Looks like a network connectivity issue. Double check your setup: IPs, and that subnets do not overlap.
On the network side I already allowed the required ports from management to reach the backend (TCP 3300 and TCP 6789)
Not sure what you mean, but you should not block any ports; the system uses more ports than those.
I'm sure there is no overlap. Should management and backend be in the same subnet?
admin
2,930 Posts
December 20, 2020, 4:56 pm
No, management and backend should be different subnets.
pedro6161
36 Posts
December 21, 2020, 11:05 am
All sorted now, I can see the PetaSAN dashboard for the first time 😀
The installation was done without LACP, and I have some questions:
- Can I directly edit the cluster_info.json file to switch everything to LACP?
- On which node should I make the change? Or do I need to change cluster_info.json on each server one by one to convert to LACP, and reboot the servers one by one?
- I tried to add a new dummy interface and it worked, but how do I create it automatically when the server reboots? I tried files under /etc/systemd/network/ but had no success.
- Regarding SNMP and graphs, I see you use Grafana, Zabbix and collectd. I have centralized monitoring using LibreNMS published to Grafana. Can I grab all SNMP information, including Ceph cluster info, with my LibreNMS and publish it to my external Grafana, or can I only copy the Grafana dashboard files (JSON)?
Last edited on December 21, 2020, 11:17 am by pedro6161 · #9
admin
2,930 Posts
December 21, 2020, 4:42 pm
1- Yes, you can change the cluster_info.json config file; you can modify the bond types there if you wish.
2- Since you change it manually, you need to copy it manually to all nodes.
3- Not sure why you need dummy interfaces, but you can customise any network at boot by editing the following custom scripts (see the sketch below):
/opt/petasan/scripts/custom/pre_start_network.sh
/opt/petasan/scripts/custom/post_start_network.sh
4- The chart data is stored in a Graphite database. It runs on port 8080 on 1 node out of the first 3 management nodes (active/backup). You can detect the current active Graphite server via
/opt/petasan/scripts/util/get_cluster_leader.py
Note that for security, only localhost is allowed to access Graphite, so you would need to write a script to run locally and export the data via SNMP.
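As an illustration of point 3, a minimal sketch of what one might append to the post_start_network.sh custom script so a dummy interface is recreated at boot (the interface name and address below are placeholders, not anything PetaSAN requires):

# appended to /opt/petasan/scripts/custom/post_start_network.sh
# load the dummy module and create the interface
modprobe dummy
ip link add dummy0 type dummy
ip addr add 10.10.10.1/24 dev dummy0
ip link set dummy0 up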
Last edited on December 21, 2020, 4:43 pm by admin · #10
Deploy node 3 long time
pedro6161
36 Posts
Quote from pedro6161 on December 17, 2020, 1:54 pmHi,
i having issue on node 3 for deployment is takes time, below is log from PetaSAN.log, my server has 22 disk 1TB but i only use 2 Disk on each server(1-3), are log on below is normal ?
for node 1 and 2 very fast the deployment wizard but node 3 is very-very takes times and never finish
root@ceph-03:~# cat /opt/petasan/log/PetaSAN.log
17/12/2020 16:24:00 INFO Start settings IPs
17/12/2020 17:08:25 ERROR Config file error. The PetaSAN os maybe just installed.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 69, in get_node_status
node_name = config.get_node_info().name
File "/usr/lib/python3/dist-packages/PetaSAN/core/cluster/configuration.py", line 99, in get_node_info
with open(config.get_node_info_file_path(), 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/petasan/config/node_info.json'
17/12/2020 17:08:51 INFO Starting node join
17/12/2020 17:08:51 INFO Successfully copied public keys.
17/12/2020 17:08:51 INFO Successfully copied private keys.
17/12/2020 17:08:51 INFO password set successfully.
17/12/2020 17:08:52 INFO Start copying cluster info file.
17/12/2020 17:08:52 INFO Successfully copied cluster info file.
17/12/2020 17:08:52 INFO Start copying services interfaces file.
17/12/2020 17:08:52 INFO Successfully copied services interfaces file.
17/12/2020 17:08:52 INFO Joined cluster CEPH-CLUSTER
17/12/2020 17:09:19 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
17/12/2020 17:09:19 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
17/12/2020 17:09:19 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
17/12/2020 17:09:19 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
17/12/2020 17:09:19 INFO Set node role completed successfully.
17/12/2020 17:09:20 INFO Set node info completed successfully.
17/12/2020 17:07:48 INFO Stopping petasan services on all nodes.
17/12/2020 17:07:48 INFO Stopping all petasan services.
17/12/2020 17:07:48 INFO files_sync.py process is 4157
17/12/2020 17:07:48 INFO files_sync.py process stopped
17/12/2020 17:07:48 INFO iscsi_service.py process is 4159
17/12/2020 17:07:48 INFO iscsi_service.py process stopped
17/12/2020 17:07:48 INFO admin.py process is 4161
17/12/2020 17:07:48 INFO admin.py process stopped
17/12/2020 17:07:49 INFO Starting local clean_ceph.
17/12/2020 17:07:49 INFO Starting clean_ceph
17/12/2020 17:07:49 INFO Stopping ceph services
17/12/2020 17:07:49 INFO Start cleaning config files
17/12/2020 17:07:49 INFO Starting ceph services
17/12/2020 17:07:50 INFO Starting local clean_consul.
17/12/2020 17:07:50 INFO Trying to clean Consul on local node
17/12/2020 17:07:50 INFO delete /opt/petasan/config/etc/consul.d
17/12/2020 17:07:50 INFO delete /opt/petasan/config/var/consul
17/12/2020 17:07:50 INFO Trying to clean Consul on 192.168.74.21
17/12/2020 17:07:51 INFO Trying to clean Consul on 192.168.74.22
17/12/2020 17:07:52 INFO cluster_name: CEPH-CLUSTER
17/12/2020 17:07:52 INFO local_node_info.name: ceph-03
17/12/2020 17:07:52 INFO Start consul leaders remotely.
17/12/2020 17:07:55 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.16.91.23 -retry-join 172.16.91.21 -retry-join 172.16.91.22
17/12/2020 17:07:56 INFO Cluster Node {} joined the cluster and is aliveceph-03
17/12/2020 17:07:56 INFO Cluster Node {} joined the cluster and is aliveceph-01
17/12/2020 17:07:56 INFO Cluster Node {} joined the cluster and is aliveceph-02
17/12/2020 17:07:56 INFO Consul leaders are ready
17/12/2020 17:08:06 INFO NFSServer : Changing NFS Settings in Consul
17/12/2020 17:08:06 INFO NFSServer : NFS Settings has been changed in Consul
17/12/2020 17:08:45 INFO Checking backend latencies :
17/12/2020 17:08:45 INFO Network latency for backend 172.16.91.21 =
17/12/2020 17:08:45 INFO Network latency for backend 172.16.91.22 =
17/12/2020 17:09:37 INFO Checking backend latencies :
17/12/2020 17:09:37 INFO Network latency for backend 172.16.91.21 =
17/12/2020 17:09:37 INFO Network latency for backend 172.16.91.22 =
17/12/2020 17:10:31 INFO Checking backend latencies :
17/12/2020 17:10:31 INFO Network latency for backend 172.16.91.21 =
17/12/2020 17:10:31 INFO Network latency for backend 172.16.91.22 =
17/12/2020 17:10:31 ERROR set_key, could not set key: PetaSAN/CIFS
17/12/2020 17:10:31 ERROR
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 631, in set_key
consul_obj.put(key, val)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/base.py", line 162, in put
token=token, dc=dc)
File "/usr/lib/python3/dist-packages/consul/base.py", line 621, in put
CB.json(), '/v1/kv/%s' % key, params=params, data=value)
File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call
raise attempt.get()
File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 107, in put
raise e
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 96, in put
raise RetryConsulException()
PetaSAN.core.consul.ps_consul.RetryConsulException
17/12/2020 17:10:31 ERROR GeneralConsulError
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 631, in set_key
consul_obj.put(key, val)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/base.py", line 162, in put
token=token, dc=dc)
File "/usr/lib/python3/dist-packages/consul/base.py", line 621, in put
CB.json(), '/v1/kv/%s' % key, params=params, data=value)
File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call
raise attempt.get()
File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 107, in put
raise e
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 96, in put
raise RetryConsulException()
PetaSAN.core.consul.ps_consul.RetryConsulException
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 1302, in move_services_interfaces_to_consul
manage_cifs.save_cifs_base_settings(cifs_settings)
File "/usr/lib/python3/dist-packages/PetaSAN/backend/manage_cifs.py", line 282, in save_cifs_base_settings
cifs_server.set_consul_base_settings(base_settings)
File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 840, in set_consul_base_settings
self.set_consul_cifs_settings(cifs_settings)
File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 821, in set_consul_cifs_settings
ConsulAPI().set_key(self.CONSUL_KEY, settings.write_json())
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 637, in set_key
raise ConsulException(ConsulException.GENERAL_EXCEPTION, 'GeneralConsulError')
PetaSAN.core.common.CustomException.ConsulException: GeneralConsulError
17/12/2020 17:10:32 INFO First monitor started successfully
17/12/2020 17:10:32 INFO create_mgr() fresh install
17/12/2020 17:10:32 INFO create_mgr() started
17/12/2020 17:10:32 INFO create_mgr() cmd : mkdir -p /var/lib/ceph/mgr/ceph-ceph-03
17/12/2020 17:10:32 INFO create_mgr() cmd : ceph --cluster ceph auth get-or-create mgr.ceph-03 mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-ceph-03/keyring
17/12/2020 17:10:46 INFO create_mgr() ended successfully
17/12/2020 17:10:46 INFO create_mds() started
17/12/2020 17:10:46 INFO create_mds() cmd : mkdir -p /var/lib/ceph/mds/ceph-ceph-03
17/12/2020 17:11:16 INFO create_mds() ended successfully
17/12/2020 17:11:16 INFO Starting to deploy remote monitors
root@ceph-03:~#
Hi,
i having issue on node 3 for deployment is takes time, below is log from PetaSAN.log, my server has 22 disk 1TB but i only use 2 Disk on each server(1-3), are log on below is normal ?
for node 1 and 2 very fast the deployment wizard but node 3 is very-very takes times and never finish
root@ceph-03:~# cat /opt/petasan/log/PetaSAN.log
17/12/2020 16:24:00 INFO Start settings IPs
17/12/2020 17:08:25 ERROR Config file error. The PetaSAN os maybe just installed.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 69, in get_node_status
node_name = config.get_node_info().name
File "/usr/lib/python3/dist-packages/PetaSAN/core/cluster/configuration.py", line 99, in get_node_info
with open(config.get_node_info_file_path(), 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/petasan/config/node_info.json'
17/12/2020 17:08:51 INFO Starting node join
17/12/2020 17:08:51 INFO Successfully copied public keys.
17/12/2020 17:08:51 INFO Successfully copied private keys.
17/12/2020 17:08:51 INFO password set successfully.
17/12/2020 17:08:52 INFO Start copying cluster info file.
17/12/2020 17:08:52 INFO Successfully copied cluster info file.
17/12/2020 17:08:52 INFO Start copying services interfaces file.
17/12/2020 17:08:52 INFO Successfully copied services interfaces file.
17/12/2020 17:08:52 INFO Joined cluster CEPH-CLUSTER
17/12/2020 17:09:19 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
17/12/2020 17:09:19 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
17/12/2020 17:09:19 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
17/12/2020 17:09:19 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
17/12/2020 17:09:19 INFO Set node role completed successfully.
17/12/2020 17:09:20 INFO Set node info completed successfully.
17/12/2020 17:07:48 INFO Stopping petasan services on all nodes.
17/12/2020 17:07:48 INFO Stopping all petasan services.
17/12/2020 17:07:48 INFO files_sync.py process is 4157
17/12/2020 17:07:48 INFO files_sync.py process stopped
17/12/2020 17:07:48 INFO iscsi_service.py process is 4159
17/12/2020 17:07:48 INFO iscsi_service.py process stopped
17/12/2020 17:07:48 INFO admin.py process is 4161
17/12/2020 17:07:48 INFO admin.py process stopped
17/12/2020 17:07:49 INFO Starting local clean_ceph.
17/12/2020 17:07:49 INFO Starting clean_ceph
17/12/2020 17:07:49 INFO Stopping ceph services
17/12/2020 17:07:49 INFO Start cleaning config files
17/12/2020 17:07:49 INFO Starting ceph services
17/12/2020 17:07:50 INFO Starting local clean_consul.
17/12/2020 17:07:50 INFO Trying to clean Consul on local node
17/12/2020 17:07:50 INFO delete /opt/petasan/config/etc/consul.d
17/12/2020 17:07:50 INFO delete /opt/petasan/config/var/consul
17/12/2020 17:07:50 INFO Trying to clean Consul on 192.168.74.21
17/12/2020 17:07:51 INFO Trying to clean Consul on 192.168.74.22
17/12/2020 17:07:52 INFO cluster_name: CEPH-CLUSTER
17/12/2020 17:07:52 INFO local_node_info.name: ceph-03
17/12/2020 17:07:52 INFO Start consul leaders remotely.
17/12/2020 17:07:55 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.16.91.23 -retry-join 172.16.91.21 -retry-join 172.16.91.22
17/12/2020 17:07:56 INFO Cluster Node {} joined the cluster and is aliveceph-03
17/12/2020 17:07:56 INFO Cluster Node {} joined the cluster and is aliveceph-01
17/12/2020 17:07:56 INFO Cluster Node {} joined the cluster and is aliveceph-02
17/12/2020 17:07:56 INFO Consul leaders are ready
17/12/2020 17:08:06 INFO NFSServer : Changing NFS Settings in Consul
17/12/2020 17:08:06 INFO NFSServer : NFS Settings has been changed in Consul
17/12/2020 17:08:45 INFO Checking backend latencies :
17/12/2020 17:08:45 INFO Network latency for backend 172.16.91.21 =
17/12/2020 17:08:45 INFO Network latency for backend 172.16.91.22 =
17/12/2020 17:09:37 INFO Checking backend latencies :
17/12/2020 17:09:37 INFO Network latency for backend 172.16.91.21 =
17/12/2020 17:09:37 INFO Network latency for backend 172.16.91.22 =
17/12/2020 17:10:31 INFO Checking backend latencies :
17/12/2020 17:10:31 INFO Network latency for backend 172.16.91.21 =
17/12/2020 17:10:31 INFO Network latency for backend 172.16.91.22 =
17/12/2020 17:10:31 ERROR set_key, could not set key: PetaSAN/CIFS
17/12/2020 17:10:31 ERROR
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 631, in set_key
consul_obj.put(key, val)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/base.py", line 162, in put
token=token, dc=dc)
File "/usr/lib/python3/dist-packages/consul/base.py", line 621, in put
CB.json(), '/v1/kv/%s' % key, params=params, data=value)
File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call
raise attempt.get()
File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 107, in put
raise e
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 96, in put
raise RetryConsulException()
PetaSAN.core.consul.ps_consul.RetryConsulException
17/12/2020 17:10:31 ERROR GeneralConsulError
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 631, in set_key
consul_obj.put(key, val)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/base.py", line 162, in put
token=token, dc=dc)
File "/usr/lib/python3/dist-packages/consul/base.py", line 621, in put
CB.json(), '/v1/kv/%s' % key, params=params, data=value)
File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call
raise attempt.get()
File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 107, in put
raise e
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 96, in put
raise RetryConsulException()
PetaSAN.core.consul.ps_consul.RetryConsulException
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 1302, in move_services_interfaces_to_consul
manage_cifs.save_cifs_base_settings(cifs_settings)
File "/usr/lib/python3/dist-packages/PetaSAN/backend/manage_cifs.py", line 282, in save_cifs_base_settings
cifs_server.set_consul_base_settings(base_settings)
File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 840, in set_consul_base_settings
self.set_consul_cifs_settings(cifs_settings)
File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 821, in set_consul_cifs_settings
ConsulAPI().set_key(self.CONSUL_KEY, settings.write_json())
File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 637, in set_key
raise ConsulException(ConsulException.GENERAL_EXCEPTION, 'GeneralConsulError')
PetaSAN.core.common.CustomException.ConsulException: GeneralConsulError
17/12/2020 17:10:32 INFO First monitor started successfully
17/12/2020 17:10:32 INFO create_mgr() fresh install
17/12/2020 17:10:32 INFO create_mgr() started
17/12/2020 17:10:32 INFO create_mgr() cmd : mkdir -p /var/lib/ceph/mgr/ceph-ceph-03
17/12/2020 17:10:32 INFO create_mgr() cmd : ceph --cluster ceph auth get-or-create mgr.ceph-03 mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-ceph-03/keyring
17/12/2020 17:10:46 INFO create_mgr() ended successfully
17/12/2020 17:10:46 INFO create_mds() started
17/12/2020 17:10:46 INFO create_mds() cmd : mkdir -p /var/lib/ceph/mds/ceph-ceph-03
17/12/2020 17:11:16 INFO create_mds() ended successfully
17/12/2020 17:11:16 INFO Starting to deploy remote monitors
root@ceph-03:~#
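(For reference, the set_key / RetryConsulException errors in the log above mean node 3 could not write to the Consul key/value store while building the cluster. A minimal sketch of how one might check Consul from that node at this point, assuming the consul binary bundled with PetaSAN supports these standard subcommands; the test key name is just a placeholder:
consul members                      # list cluster members and their state as seen from this node
consul info                         # agent and raft status, including whether a leader is known
consul kv put PetaSAN/test-key 1    # try a test write against the KV store
consul kv get PetaSAN/test-key      # and read it back
If the write hangs or times out while members shows all three nodes alive, the problem is usually on the backend network between the Consul servers rather than on this node itself.)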
admin
2,930 Posts
Quote from admin on December 17, 2020, 2:26 pm
I would recommend you check your network settings: make sure the IPs are correct, the subnets do not overlap, and the switch ports are correct, then re-install once more. If the problem persists, let us know. If I understand correctly you did install the cluster before; if so, try to see what changed.
If you retry and still have an issue, post the log of the 3rd node and let us know if you get a final error displayed in the UI (and what that error says), or does it get stuck?
Yes, node 3 will take more time as it builds the cluster, so it could be 5-15 min.
neiltorda
98 Posts
Quote from neiltorda on December 17, 2020, 4:24 pm
I had a similar issue when I was first trying to set up. Had the disks been used previously? I had to make sure they were completely erased/wiped before the cluster would build.
admin
2,930 Posts
Quote from admin on December 17, 2020, 9:26 pm
I had to make sure they were completely erased/wiped before the cluster would build.
This used to be an issue a while back; now we do use wipefs, ceph-volume lvm zap and dd to prepare disks, so it is a bit better. Still, some disks (mostly from FreeBSD/FreeNAS/ZFS) are not cleaned well by the above tools and may require the user to wipe the entire disk.
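(A minimal sketch of the kind of manual wipe described above; /dev/sdX is a placeholder for a data disk you intend to destroy, not the OS disk:
wipefs --all /dev/sdX                                   # remove filesystem/RAID signatures
ceph-volume lvm zap --destroy /dev/sdX                  # clear LVM and Ceph metadata
dd if=/dev/zero of=/dev/sdX bs=1M count=100 oflag=direct  # overwrite the start of the disk
For disks previously used by ZFS it may also be necessary to zero the end of the disk, since ZFS keeps backup labels there.)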
pedro6161
36 Posts
Quote from pedro6161 on December 20, 2020, 2:00 pm
Hi,
Yes, earlier I tested in my lab using VMware (only 1 interface on each VM) and it was successful. Now I am testing on bare-metal servers with multiple interfaces and still get errors. I already wiped all disks and nothing changed. On the network side I already allowed the required ports from management to reach the backend (TCP 3300 and TCP 6789), and I have already tried the installation 3 times on the 3 servers. Now I get a new error, shown below from syslog:
Dec 20 17:48:32 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.22:8300 172.16.91.22:8300}: read tcp 172.16.91.23:59613->172.16.91.22:8300: i/o timeout
Dec 20 17:48:32 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.21:8300 172.16.91.21:8300}: read tcp 172.16.91.23:59649->172.16.91.21:8300: i/o timeout
Dec 20 17:48:52 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.22:8300 172.16.91.22:8300}: read tcp 172.16.91.23:45127->172.16.91.22:8300: i/o timeout
Dec 20 17:48:52 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.21:8300 172.16.91.21:8300}: read tcp 172.16.91.23:60339->172.16.91.21:8300: i/o timeout
Dec 20 17:49:13 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.22:8300 172.16.91.22:8300}: read tcp 172.16.91.23:48795->172.16.91.22:8300: i/o timeout
Dec 20 17:49:13 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.21:8300 172.16.91.21:8300}: read tcp 172.16.91.23:52391->172.16.91.21:8300: i/o timeout
From PetaSAN.log:
root@ceph-03:~# cat /opt/petasan/log/PetaSAN.log
18/12/2020 22:48:16 INFO Start settings IPs
20/12/2020 16:30:29 ERROR Config file error. The PetaSAN os maybe just installed.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 69, in get_node_status
node_name = config.get_node_info().name
File "/usr/lib/python3/dist-packages/PetaSAN/core/cluster/configuration.py", line 99, in get_node_info
with open(config.get_node_info_file_path(), 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/petasan/config/node_info.json'
20/12/2020 16:30:43 INFO Starting node join
20/12/2020 16:31:13 ERROR Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f7f7c71bb38>
command: /usr/bin/scp
args: ['/usr/bin/scp', '-o', 'StrictHostKeyChecking=no', 'root@192.168.74.21:/root/.ssh/id_rsa.pub', '/root/.ssh/id_rsa.pub']
buffer (last 100 chars): b''
before (last 100 chars): b''
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 8205
child_fd: 9
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
0: re.compile("b'Are you sure you want to continue connecting'")
1: re.compile("b'password:'")
2: EOF
20/12/2020 16:31:13 ERROR Error while copying keys or setting password.
20/12/2020 16:35:30 INFO Starting node join
20/12/2020 16:35:30 INFO Successfully copied public keys.
20/12/2020 16:35:30 INFO Successfully copied private keys.
20/12/2020 16:35:30 INFO password set successfully.
20/12/2020 16:35:31 INFO Start copying cluster info file.
20/12/2020 16:35:31 INFO Successfully copied cluster info file.
20/12/2020 16:35:31 INFO Start copying services interfaces file.
20/12/2020 16:35:31 INFO Successfully copied services interfaces file.
20/12/2020 16:35:31 INFO Joined cluster CEPH_CLUSTER
20/12/2020 16:37:15 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
20/12/2020 16:37:15 INFO Set node role completed successfully.
20/12/2020 16:37:15 INFO Set node info completed successfully.
20/12/2020 16:38:10 INFO Stopping petasan services on all nodes.
20/12/2020 16:38:10 INFO Stopping all petasan services.
20/12/2020 16:38:10 INFO files_sync.py process is 9897
20/12/2020 16:38:10 INFO files_sync.py process stopped
20/12/2020 16:38:10 INFO iscsi_service.py process is 9899
20/12/2020 16:38:10 INFO iscsi_service.py process stopped
20/12/2020 16:38:10 INFO admin.py process is 9901
20/12/2020 16:38:10 INFO admin.py process stopped
20/12/2020 16:38:11 INFO Starting local clean_ceph.
20/12/2020 16:38:11 INFO Starting clean_ceph
20/12/2020 16:38:11 INFO Stopping ceph services
20/12/2020 16:38:11 INFO Start cleaning config files
20/12/2020 16:38:11 INFO Starting ceph services
20/12/2020 16:38:12 INFO Starting local clean_consul.
20/12/2020 16:38:12 INFO Trying to clean Consul on local node
20/12/2020 16:38:12 INFO delete /opt/petasan/config/etc/consul.d
20/12/2020 16:38:12 INFO delete /opt/petasan/config/var/consul
20/12/2020 16:38:12 INFO Trying to clean Consul on 192.168.74.21
20/12/2020 16:38:13 INFO Trying to clean Consul on 192.168.74.22
20/12/2020 16:38:13 INFO cluster_name: CEPH_CLUSTER
20/12/2020 16:38:13 INFO local_node_info.name: ceph-03
20/12/2020 16:38:14 INFO Start consul leaders remotely.
20/12/2020 16:38:17 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.16.91.23 -retry-join 172.16.91.21 -retry-join 172.16.91.22
20/12/2020 16:38:18 INFO Cluster Node {} joined the cluster and is aliveceph-03
20/12/2020 16:38:18 INFO Cluster Node {} joined the cluster and is aliveceph-01
20/12/2020 16:38:18 INFO Cluster Node {} joined the cluster and is aliveceph-02
20/12/2020 16:38:18 INFO Consul leaders are ready
20/12/2020 16:38:28 INFO NFSServer : Changing NFS Settings in Consul
20/12/2020 16:38:28 INFO NFSServer : NFS Settings has been changed in Consul
root@ceph-03:~#
root@ceph-03:~# telnet -b 172.16.91.23 172.16.91.21 8300
Trying 172.16.91.21...
Connected to 172.16.91.21.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@ceph-03:~# telnet -b 172.16.91.23 172.16.91.22 8300
Trying 172.16.91.22...
Connected to 172.16.91.22.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@ceph-03:~#
From syslog on the ceph-01 node:
Dec 20 17:19:56 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:60875: read: connection reset by peer
Dec 20 17:25:01 ceph-01 CRON[11592]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:33:08 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:33239: read: connection reset by peer
Dec 20 17:35:01 ceph-01 CRON[11670]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:45:01 ceph-01 CRON[11752]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:47:45 ceph-01 lfd[10693]: *SSH login* from 10.3.1.78 into the root account using password authentication - ignored
Dec 20 17:55:01 ceph-01 CRON[11853]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:57:51 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:42523: read: connection reset by peer
Dec 20 18:00:14 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:55521: read: connection reset by peer
I don't know why node 3 is still processing and never finishes, while nodes 1 and 2 are done.
admin
2,930 Posts
Quote from admin on December 20, 2020, 3:35 pm
Looks like a network connectivity issue. Double-check your setup: IPs are correct and subnets do not overlap.
On the network side I already allowed the required ports from management to reach the backend (TCP 3300 and TCP 6789)
Not sure what you mean, but you should not block any ports; the system uses more ports than those.
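(The raft i/o timeouts and "connection reset by peer" messages on the backend suggest something between the nodes is dropping or truncating traffic even though a plain TCP connect works. A hedged sketch of checks one could run between the backend IPs; the port list is based on standard Consul defaults, not PetaSAN-specific documentation:
nc -zv 172.16.91.21 8300-8302        # Consul server RPC (8300) and gossip (8301/8302)
nc -zv 172.16.91.22 8300-8302
ping -M do -s 1472 -c 3 172.16.91.21 # 1472 = 1500 MTU minus headers; use 8972 if jumbo frames are configured
ping -M do -s 1472 -c 3 172.16.91.22
If the large pings fail with fragmentation disabled while small pings succeed, an MTU mismatch on the backend NICs or switch ports would explain exactly this pattern.)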
pedro6161
36 Posts
Quote from pedro6161 on December 20, 2020, 4:24 pm
Quote from admin on December 20, 2020, 3:35 pm
Looks like a network connectivity issue. Double-check your setup: IPs are correct and subnets do not overlap.
On the network side I already allowed the required ports from management to reach the backend (TCP 3300 and TCP 6789)
Not sure what you mean, but you should not block any ports; the system uses more ports than those.
I'm sure there is no overlapping. Should management and backend be on the same subnet?
admin
2,930 Posts
Quote from admin on December 20, 2020, 4:56 pm
No, management and backend should be on different subnets.
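(For reference, a minimal sketch of a non-overlapping layout using the addresses already visible in this thread; any additional service subnets, e.g. for iSCSI, would likewise need their own ranges:
management: 192.168.74.0/24   (192.168.74.21, 192.168.74.22, 192.168.74.23)
backend:    172.16.91.0/24    (172.16.91.21,  172.16.91.22,  172.16.91.23))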
pedro6161
36 Posts
Quote from pedro6161 on December 21, 2020, 11:05 am
All sorted now, I can see the PetaSAN dashboard for the first time 😀
The installation was done without LACP, so I have some questions:
- Can I edit the cluster_info.json file directly to switch everything to LACP?
- On which node do I make the change? Or do I need to change cluster_info.json on each server one by one to convert to LACP, and reboot the servers one by one?
- I tried to add a new dummy interface and it worked, but how do I create it automatically when the server reboots? I tried files in /etc/systemd/network/ but without success.
- Regarding SNMP and graphs: I see you are using Grafana, Zabbix and collectd. I have centralized monitoring with LibreNMS published into Grafana. Can I grab all SNMP information, including Ceph cluster info, with my LibreNMS and publish it into my external Grafana, or can I only copy the Grafana dashboard files (JSON)?
admin
2,930 Posts
Quote from admin on December 21, 2020, 4:42 pm
1- Yes, you can change the cluster_info.json config file; you can modify the bond types there if you wish.
2- Since you are changing it manually, you need to copy it manually to all nodes.
3- Not sure why you need dummy interfaces, but you can customise any network at boot by editing the following custom scripts:
/opt/petasan/scripts/custom/pre_start_network.sh
/opt/petasan/scripts/custom/post_start_network.sh
4- The chart data is stored in a Graphite database; it runs on port 8080 on one node out of the first 3 management nodes (active/backup). You can detect the current active Graphite server via
/opt/petasan/scripts/util/get_cluster_leader.py
Note that for security, only localhost is allowed to access Graphite, so you would need to write a script that runs locally and exports the data via SNMP.
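(Two hedged sketches following the points above. First, recreating a dummy interface at boot from the custom network script; the interface name dummy0 and the address are placeholders:
# appended to /opt/petasan/scripts/custom/post_start_network.sh
modprobe dummy                          # load the dummy interface driver
ip link add dummy0 type dummy
ip addr add 10.0.0.1/24 dev dummy0      # placeholder address
ip link set dummy0 up
Second, pulling chart data locally on the active node so it can be re-exported, assuming the standard graphite-web render API is what listens on port 8080; the metric path is hypothetical and would need to be looked up first:
python3 /opt/petasan/scripts/util/get_cluster_leader.py            # find the active node
curl "http://localhost:8080/metrics/find?query=*"                  # list top-level metric names (run on that node)
curl "http://localhost:8080/render?target=SOME.METRIC.PATH&from=-10min&format=json")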