

Deploy node 3 long time


Hi,

I'm having an issue with node 3: the deployment takes a very long time. Below is the log from PetaSAN.log. Each server has 22 x 1TB disks, but I'm only using 2 disks on each server (1-3). Is the log below normal?

The deployment wizard is very fast for nodes 1 and 2, but node 3 takes a very, very long time and never finishes.

 

root@ceph-03:~# cat /opt/petasan/log/PetaSAN.log 

 17/12/2020 16:24:00 INFO     Start settings IPs

 17/12/2020 17:08:25 ERROR    Config file error. The PetaSAN os maybe just installed.

Traceback (most recent call last):

  File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 69, in get_node_status

    node_name = config.get_node_info().name

  File "/usr/lib/python3/dist-packages/PetaSAN/core/cluster/configuration.py", line 99, in get_node_info

    with open(config.get_node_info_file_path(), 'r') as f:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/petasan/config/node_info.json'

 17/12/2020 17:08:51 INFO     Starting node join

 17/12/2020 17:08:51 INFO     Successfully copied public keys.

 17/12/2020 17:08:51 INFO     Successfully copied private keys.

 17/12/2020 17:08:51 INFO     password set successfully.

 17/12/2020 17:08:52 INFO     Start copying  cluster info file.

 17/12/2020 17:08:52 INFO     Successfully copied cluster info file.

 17/12/2020 17:08:52 INFO     Start copying  services interfaces file.

 17/12/2020 17:08:52 INFO     Successfully copied services interfaces file.

 17/12/2020 17:08:52 INFO     Joined cluster CEPH-CLUSTER

 17/12/2020 17:09:19 ERROR    400 Bad Request: The browser (or proxy) sent a request that this server could not understand.

 17/12/2020 17:09:19 ERROR    400 Bad Request: The browser (or proxy) sent a request that this server could not understand.

 17/12/2020 17:09:19 ERROR    400 Bad Request: The browser (or proxy) sent a request that this server could not understand.

 17/12/2020 17:09:19 ERROR    400 Bad Request: The browser (or proxy) sent a request that this server could not understand.

 17/12/2020 17:09:19 INFO     Set node role completed successfully.

 17/12/2020 17:09:20 INFO     Set node info completed successfully.

 17/12/2020 17:07:48 INFO     Stopping petasan services on all nodes.

 17/12/2020 17:07:48 INFO     Stopping all petasan services.

 17/12/2020 17:07:48 INFO     files_sync.py process is 4157

 17/12/2020 17:07:48 INFO     files_sync.py process stopped

 17/12/2020 17:07:48 INFO     iscsi_service.py process is 4159

 17/12/2020 17:07:48 INFO     iscsi_service.py process stopped

 17/12/2020 17:07:48 INFO     admin.py process is 4161

 17/12/2020 17:07:48 INFO     admin.py process stopped

 17/12/2020 17:07:49 INFO     Starting local clean_ceph.

 17/12/2020 17:07:49 INFO     Starting clean_ceph

 17/12/2020 17:07:49 INFO     Stopping ceph services

 17/12/2020 17:07:49 INFO     Start cleaning config files

 17/12/2020 17:07:49 INFO     Starting ceph services

 17/12/2020 17:07:50 INFO     Starting local clean_consul.

 17/12/2020 17:07:50 INFO     Trying to clean Consul on local node

 17/12/2020 17:07:50 INFO     delete /opt/petasan/config/etc/consul.d

 17/12/2020 17:07:50 INFO     delete /opt/petasan/config/var/consul

 17/12/2020 17:07:50 INFO     Trying to clean Consul on 192.168.74.21

 17/12/2020 17:07:51 INFO     Trying to clean Consul on 192.168.74.22

 17/12/2020 17:07:52 INFO     cluster_name: CEPH-CLUSTER

 17/12/2020 17:07:52 INFO     local_node_info.name: ceph-03

 17/12/2020 17:07:52 INFO     Start consul leaders remotely.

 17/12/2020 17:07:55 INFO     str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.16.91.23  -retry-join 172.16.91.21 -retry-join 172.16.91.22

 17/12/2020 17:07:56 INFO     Cluster Node {} joined the cluster and is aliveceph-03

 17/12/2020 17:07:56 INFO     Cluster Node {} joined the cluster and is aliveceph-01

 17/12/2020 17:07:56 INFO     Cluster Node {} joined the cluster and is aliveceph-02

 17/12/2020 17:07:56 INFO     Consul leaders are ready

 17/12/2020 17:08:06 INFO     NFSServer : Changing NFS Settings in Consul

 17/12/2020 17:08:06 INFO     NFSServer : NFS Settings has been changed in Consul

 17/12/2020 17:08:45 INFO     Checking backend latencies :

 17/12/2020 17:08:45 INFO     Network latency for backend 172.16.91.21 = 

 17/12/2020 17:08:45 INFO     Network latency for backend 172.16.91.22 = 

 17/12/2020 17:09:37 INFO     Checking backend latencies :

 17/12/2020 17:09:37 INFO     Network latency for backend 172.16.91.21 = 

 17/12/2020 17:09:37 INFO     Network latency for backend 172.16.91.22 = 

 17/12/2020 17:10:31 INFO     Checking backend latencies :

 17/12/2020 17:10:31 INFO     Network latency for backend 172.16.91.21 = 

 17/12/2020 17:10:31 INFO     Network latency for backend 172.16.91.22 = 

 17/12/2020 17:10:31 ERROR    set_key, could not set key: PetaSAN/CIFS

 17/12/2020 17:10:31 ERROR    

Traceback (most recent call last):

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 631, in set_key

    consul_obj.put(key, val)

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/base.py", line 162, in put

    token=token, dc=dc)

  File "/usr/lib/python3/dist-packages/consul/base.py", line 621, in put

    CB.json(), '/v1/kv/%s' % key, params=params, data=value)

  File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f

    return Retrying(*dargs, **dkw).call(f, *args, **kw)

  File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call

    raise attempt.get()

  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get

    six.reraise(self.value[0], self.value[1], self.value[2])

  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise

    raise value

  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call

    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 107, in put

    raise e

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 96, in put

    raise RetryConsulException()

PetaSAN.core.consul.ps_consul.RetryConsulException

 17/12/2020 17:10:31 ERROR    GeneralConsulError

Traceback (most recent call last):

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 631, in set_key

    consul_obj.put(key, val)

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/base.py", line 162, in put

    token=token, dc=dc)

  File "/usr/lib/python3/dist-packages/consul/base.py", line 621, in put

    CB.json(), '/v1/kv/%s' % key, params=params, data=value)

  File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f

    return Retrying(*dargs, **dkw).call(f, *args, **kw)

  File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call

    raise attempt.get()

  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get

    six.reraise(self.value[0], self.value[1], self.value[2])

  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise

    raise value

  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call

    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 107, in put

    raise e

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/ps_consul.py", line 96, in put

    raise RetryConsulException()

PetaSAN.core.consul.ps_consul.RetryConsulException

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 1302, in move_services_interfaces_to_consul

    manage_cifs.save_cifs_base_settings(cifs_settings)

  File "/usr/lib/python3/dist-packages/PetaSAN/backend/manage_cifs.py", line 282, in save_cifs_base_settings

    cifs_server.set_consul_base_settings(base_settings)

  File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 840, in set_consul_base_settings

    self.set_consul_cifs_settings(cifs_settings)

  File "/usr/lib/python3/dist-packages/PetaSAN/core/cifs/cifs_server.py", line 821, in set_consul_cifs_settings

    ConsulAPI().set_key(self.CONSUL_KEY, settings.write_json())

  File "/usr/lib/python3/dist-packages/PetaSAN/core/consul/api.py", line 637, in set_key

    raise ConsulException(ConsulException.GENERAL_EXCEPTION, 'GeneralConsulError')

PetaSAN.core.common.CustomException.ConsulException: GeneralConsulError

 17/12/2020 17:10:32 INFO     First monitor started successfully

 17/12/2020 17:10:32 INFO     create_mgr() fresh install

 17/12/2020 17:10:32 INFO     create_mgr() started

 17/12/2020 17:10:32 INFO     create_mgr() cmd : mkdir -p /var/lib/ceph/mgr/ceph-ceph-03

 17/12/2020 17:10:32 INFO     create_mgr() cmd : ceph --cluster ceph auth get-or-create mgr.ceph-03  mon  'allow profile mgr' osd 'allow *' mds 'allow *'   -o /var/lib/ceph/mgr/ceph-ceph-03/keyring

 17/12/2020 17:10:46 INFO     create_mgr() ended successfully

 17/12/2020 17:10:46 INFO     create_mds() started

 17/12/2020 17:10:46 INFO     create_mds() cmd :  mkdir -p /var/lib/ceph/mds/ceph-ceph-03

 17/12/2020 17:11:16 INFO     create_mds() ended successfully

 17/12/2020 17:11:16 INFO     Starting to deploy remote monitors

root@ceph-03:~# 

I would recommend you check your network settings: make sure the IPs are correct, the subnets do not overlap, and the switch ports are correct, then re-install once more. If the problem persists, let us know. If I understand correctly you did install the cluster before; if so, try to see what changed.

If you retry and still have an issue, post the log of the 3rd node and let us know whether you get a final error displayed in the UI (and what that error says), or whether it just gets stuck.

Yes, node 3 will take more time as it builds the cluster, so it could be 5-15 minutes.
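Since the backend latency values in the posted log come back empty, it is also worth manually checking basic backend reachability from node 3 before re-installing. A minimal sketch, assuming the backend IPs shown in the log (adjust to your own addressing; the jumbo-frame payload size is only an example):

# run on ceph-03, sourcing from its backend IP
ping -c 5 -I 172.16.91.23 172.16.91.21
ping -c 5 -I 172.16.91.23 172.16.91.22

# if jumbo frames are configured, also verify the MTU end to end
ping -c 5 -M do -s 8972 -I 172.16.91.23 172.16.91.21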

 

I had a similar issue when I was first trying to set up. Had the disks been used previously? I had to make sure they were completely erased/wiped before the cluster would build.

 

I had to make sure they were completely erased/wiped before the cluster would build.

This used to be an issue a while back. We now use wipefs, ceph-volume lvm zap and dd to prepare disks, so it is a bit better... still, some disks (mostly from FreeBSD/FreeNAS/ZFS) are not cleaned well by the above tools and may require the user to wipe the entire disk.
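If a disk still refuses to come in clean, wiping it by hand before re-installing is a reasonable workaround. A rough sketch using the same tools named above (replace /dev/sdX with the actual device, and triple-check you have the right disk first):

# remove filesystem / RAID / ZFS signatures
wipefs -a /dev/sdX

# zap any leftover Ceph LVM metadata
ceph-volume lvm zap --destroy /dev/sdX

# overwrite the start of the disk where most labels live
# (ZFS also keeps labels at the end of the disk, which may need the same treatment)
dd if=/dev/zero of=/dev/sdX bs=1M count=200 oflag=direct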

Hi,

Yes, earlier I tested in my lab using VMware (only 1 interface on each VM) and it was successful. Now I'm testing on bare-metal servers with multiple interfaces and it still fails. I have already wiped all the disks and nothing changed. On the network side I already allowed the ports required from management to reach the backend (TCP 3300 and TCP 6789), and I have tried the installation 3 times on the 3 servers. There is a new error now, shown below, from syslog:

 

Dec 20 17:48:32 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.22:8300 172.16.91.22:8300}: read tcp 172.16.91.23:59613->172.16.91.22:8300: i/o timeout
Dec 20 17:48:32 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.21:8300 172.16.91.21:8300}: read tcp 172.16.91.23:59649->172.16.91.21:8300: i/o timeout
Dec 20 17:48:52 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.22:8300 172.16.91.22:8300}: read tcp 172.16.91.23:45127->172.16.91.22:8300: i/o timeout
Dec 20 17:48:52 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.21:8300 172.16.91.21:8300}: read tcp 172.16.91.23:60339->172.16.91.21:8300: i/o timeout
Dec 20 17:49:13 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.22:8300 172.16.91.22:8300}: read tcp 172.16.91.23:48795->172.16.91.22:8300: i/o timeout
Dec 20 17:49:13 ceph-03 consul[10056]: raft: Failed to AppendEntries to {Voter 172.16.91.21:8300 172.16.91.21:8300}: read tcp 172.16.91.23:52391->172.16.91.21:8300: i/o timeout

 

From PetaSAN.log:

root@ceph-03:~# cat /opt/petasan/log/PetaSAN.log
18/12/2020 22:48:16 INFO Start settings IPs
20/12/2020 16:30:29 ERROR Config file error. The PetaSAN os maybe just installed.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/PetaSAN/backend/cluster/deploy.py", line 69, in get_node_status
node_name = config.get_node_info().name
File "/usr/lib/python3/dist-packages/PetaSAN/core/cluster/configuration.py", line 99, in get_node_info
with open(config.get_node_info_file_path(), 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/petasan/config/node_info.json'
20/12/2020 16:30:43 INFO Starting node join
20/12/2020 16:31:13 ERROR Timeout exceeded.
<pexpect.pty_spawn.spawn object at 0x7f7f7c71bb38>
command: /usr/bin/scp
args: ['/usr/bin/scp', '-o', 'StrictHostKeyChecking=no', 'root@192.168.74.21:/root/.ssh/id_rsa.pub', '/root/.ssh/id_rsa.pub']
buffer (last 100 chars): b''
before (last 100 chars): b''
after: <class 'pexpect.exceptions.TIMEOUT'>
match: None
match_index: None
exitstatus: None
flag_eof: False
pid: 8205
child_fd: 9
closed: False
timeout: 30
delimiter: <class 'pexpect.exceptions.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
searcher: searcher_re:
0: re.compile("b'Are you sure you want to continue connecting'")
1: re.compile("b'password:'")
2: EOF
20/12/2020 16:31:13 ERROR Error while copying keys or setting password.
20/12/2020 16:35:30 INFO Starting node join
20/12/2020 16:35:30 INFO Successfully copied public keys.
20/12/2020 16:35:30 INFO Successfully copied private keys.
20/12/2020 16:35:30 INFO password set successfully.
20/12/2020 16:35:31 INFO Start copying cluster info file.
20/12/2020 16:35:31 INFO Successfully copied cluster info file.
20/12/2020 16:35:31 INFO Start copying services interfaces file.
20/12/2020 16:35:31 INFO Successfully copied services interfaces file.
20/12/2020 16:35:31 INFO Joined cluster CEPH_CLUSTER
20/12/2020 16:37:15 ERROR 400 Bad Request: The browser (or proxy) sent a request that this server could not understand.
20/12/2020 16:37:15 INFO Set node role completed successfully.
20/12/2020 16:37:15 INFO Set node info completed successfully.
20/12/2020 16:38:10 INFO Stopping petasan services on all nodes.
20/12/2020 16:38:10 INFO Stopping all petasan services.
20/12/2020 16:38:10 INFO files_sync.py process is 9897
20/12/2020 16:38:10 INFO files_sync.py process stopped
20/12/2020 16:38:10 INFO iscsi_service.py process is 9899
20/12/2020 16:38:10 INFO iscsi_service.py process stopped
20/12/2020 16:38:10 INFO admin.py process is 9901
20/12/2020 16:38:10 INFO admin.py process stopped
20/12/2020 16:38:11 INFO Starting local clean_ceph.
20/12/2020 16:38:11 INFO Starting clean_ceph
20/12/2020 16:38:11 INFO Stopping ceph services
20/12/2020 16:38:11 INFO Start cleaning config files
20/12/2020 16:38:11 INFO Starting ceph services
20/12/2020 16:38:12 INFO Starting local clean_consul.
20/12/2020 16:38:12 INFO Trying to clean Consul on local node
20/12/2020 16:38:12 INFO delete /opt/petasan/config/etc/consul.d
20/12/2020 16:38:12 INFO delete /opt/petasan/config/var/consul
20/12/2020 16:38:12 INFO Trying to clean Consul on 192.168.74.21
20/12/2020 16:38:13 INFO Trying to clean Consul on 192.168.74.22
20/12/2020 16:38:13 INFO cluster_name: CEPH_CLUSTER
20/12/2020 16:38:13 INFO local_node_info.name: ceph-03
20/12/2020 16:38:14 INFO Start consul leaders remotely.
20/12/2020 16:38:17 INFO str_start_command: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> consul agent -raft-protocol 2 -config-dir /opt/petasan/config/etc/consul.d/server -bind 172.16.91.23 -retry-join 172.16.91.21 -retry-join 172.16.91.22
20/12/2020 16:38:18 INFO Cluster Node {} joined the cluster and is aliveceph-03
20/12/2020 16:38:18 INFO Cluster Node {} joined the cluster and is aliveceph-01
20/12/2020 16:38:18 INFO Cluster Node {} joined the cluster and is aliveceph-02
20/12/2020 16:38:18 INFO Consul leaders are ready
20/12/2020 16:38:28 INFO NFSServer : Changing NFS Settings in Consul
20/12/2020 16:38:28 INFO NFSServer : NFS Settings has been changed in Consul
root@ceph-03:~#

 

root@ceph-03:~# telnet -b 172.16.91.23 172.16.91.21 8300
Trying 172.16.91.21...
Connected to 172.16.91.21.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@ceph-03:~# telnet -b 172.16.91.23 172.16.91.22 8300
Trying 172.16.91.22...
Connected to 172.16.91.22.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@ceph-03:~#

 

From syslog on the ceph-01 node:

Dec 20 17:19:56 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:60875: read: connection reset by peer
Dec 20 17:25:01 ceph-01 CRON[11592]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:33:08 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:33239: read: connection reset by peer
Dec 20 17:35:01 ceph-01 CRON[11670]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:45:01 ceph-01 CRON[11752]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:47:45 ceph-01 lfd[10693]: *SSH login* from 10.3.1.78 into the root account using password authentication - ignored
Dec 20 17:55:01 ceph-01 CRON[11853]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 20 17:57:51 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:42523: read: connection reset by peer
Dec 20 18:00:14 ceph-01 consul[11041]: raft-net: Failed to decode incoming command: read tcp 172.16.91.21:8300->172.16.91.23:55521: read: connection reset by peer

 

I don't know why node 3 is still processing and never finishes, while nodes 1 and 2 are done.

 

It looks like a network connectivity issue. Double-check your setup and IPs, and make sure the subnets do not overlap.

On the network side I already allowed the ports required from management to reach the backend (TCP 3300 and TCP 6789)

I'm not sure what you mean, but you should not block any ports; the system uses more ports than those.
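Rather than whitelisting individual ports, it is simpler to confirm nothing is filtering backend traffic at all. A quick sketch (it assumes iptables/nftables and tcpdump are available; the interface name is a placeholder):

# look for any packet-filtering rules on each node
iptables -L -n -v
nft list ruleset

# watch the Consul server port on the backend interface while node 3 deploys
tcpdump -ni <backend-interface> tcp port 8300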

Quote from admin on December 20, 2020, 3:35 pm

It looks like a network connectivity issue. Double-check your setup and IPs, and make sure the subnets do not overlap.

On the network side I already allowed the ports required from management to reach the backend (TCP 3300 and TCP 6789)

I'm not sure what you mean, but you should not block any ports; the system uses more ports than those.

I'm sure there is no overlap. Should management and backend be on the same subnet?

 

No, management and backend should be on different subnets.

All sorted now, I can see the PetaSAN dashboard for the first time 😀

The installation was done without LACP, and I have some questions:

  • Can I directly edit the cluster_info.json file to switch everything to LACP?
  • On which node should I make the change? Or do I need to change cluster_info.json on each server one by one to convert to LACP, and then reboot the servers one by one?
  • I tried to add a new dummy interface and it worked, but how do I make it come up automatically when the server reboots? I tried files under /etc/systemd/network/ but without success.
  • Regarding SNMP and graphs: I see you are using Grafana, Zabbix and collectd. I have centralized monitoring with LibreNMS published into Grafana. Can I pull all the SNMP information, including Ceph cluster info, with my LibreNMS and publish it into my external Grafana, or can I only copy the Grafana dashboard files (JSON)?

 

 

1- Yes, you can change the cluster_info.json config file; you can modify the bond types there if you wish.

2- Since you change it manually, you need to copy it manually to all nodes.
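A minimal sketch of that workflow, assuming cluster_info.json sits next to node_info.json under /opt/petasan/config/ (verify the path on your install; the hostnames are the ones from this thread):

# edit on one management node
vi /opt/petasan/config/cluster_info.json

# push the same file to the other nodes
scp /opt/petasan/config/cluster_info.json root@ceph-02:/opt/petasan/config/
scp /opt/petasan/config/cluster_info.json root@ceph-03:/opt/petasan/config/

# then reboot the nodes one at a time, waiting for the cluster to come back healthy in between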

3- Not sure why you need dummy interfaces, but you can customise any network setup at boot by editing the following custom scripts:

/opt/petasan/scripts/custom/pre_start_network.sh
/opt/petasan/scripts/custom/post_start_network.sh
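For the dummy interface specifically, one option is to recreate it from the post-start script above using plain iproute2 commands; the interface name and address below are only placeholders:

# appended to /opt/petasan/scripts/custom/post_start_network.sh
ip link add dummy0 type dummy 2>/dev/null || true
ip addr add 10.10.10.1/24 dev dummy0
ip link set dummy0 up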

4- The chart data is stored in a Graphite database. It runs on port 8080 on one node out of the first 3 management nodes (active/backup). You can detect the current active Graphite server via:

/opt/petasan/scripts/util/get_cluster_leader.py

Note that for security, only localhost is allowed to access Graphite, so you would need to write a script that runs locally and exports the data via SNMP.
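A rough sketch of pulling the data out locally, assuming the standard Graphite render API on port 8080; the metric path is purely illustrative and has to be looked up in your own Graphite tree:

# find which of the first 3 nodes currently hosts graphite
/opt/petasan/scripts/util/get_cluster_leader.py

# on that node, query the render API locally and dump JSON for your exporter
curl "http://localhost:8080/render?target=PetaSAN.<cluster>.<metric>&from=-1h&format=json"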
