
Difficulty recovering one of three servers after power outage


I have three Dell PowerEdge 1950s in a cluster. After a power outage, one of the three servers shows as down on the node status screen, even though the affected server is powered up. I have restarted it from the console and also done a hard reset. I can SSH to it via PuTTY, and if I point a browser at the server's IP address on port 5001, it replies after a few seconds with: "Cluster has been already created."

It then states: "Use any of the following Cluster Management URLs to manage your cluster", followed by a list of my servers at port 5000.

 

If I point a browser at the affected server on port 5000, I get the following response:

 

requests.exceptions.ConnectionError

 

ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=8500): Max retries exceeded with url: /v1/kv/PetaSAN/Sessions/eyJfcGVybWFuZW50Ijp0cnVlfQ.Dqb3SQ.nfpap5B5v6HLQ5rCcX4lVbiBh8E/_exp (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f22727ac6d0>: Failed to establish a new connection: [Errno 111] Connection refused',))

 

Since the connection refused error indicates a stopped service, is there a shell command I can use to bring the listener back up?
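From the traceback, the refused connection is to 127.0.0.1:8500. I assume a quick check like the following would confirm that nothing is listening on that port (a sketch on my part, not output from my session):

# check whether anything is listening locally on port 8500
ss -tlnp | grep ':8500'
# try the endpoint the web UI is failing to reach; "connection refused" means nothing is bound there
curl -v http://127.0.0.1:8500/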

Thank you

What is the status of the cluster: OK/degraded? Any OSDs down? Are the iSCSI disks running?

What is the output of

consul members

when run from the failed node and from the good nodes?

Sorry for the late response. I finally got the failed server powered back up. Of the three servers, the output from the two working servers is as follows:

 

root@Peta-SAN-01:~# consul members
Node         Address             Status  Type    Build  Protocol  DC
Peta-SAN-01  10.250.252.11:8301  alive   server  0.7.3  2         petasan
Peta-San-03  10.250.252.13:8301  alive   server  0.7.3  2         petasan
root@Peta-SAN-01:~#

and

root@Peta-San-03:~# consul members
Node         Address             Status  Type    Build  Protocol  DC
Peta-SAN-01  10.250.252.11:8301  alive   server  0.7.3  2         petasan
Peta-San-03  10.250.252.13:8301  alive   server  0.7.3  2         petasan
root@Peta-San-03:~#

 

The output from the failed server is:

root@Peta-San-02:~# consul members
Error connecting to Consul agent: dial tcp 127.0.0.1:8400: getsockopt: connection refused
root@Peta-San-02:~#

 

It seems a service is down and needs to be restarted.
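For what it's worth, I assume a simple process check on the failed node would confirm whether the agent is running at all (a sketch, not output from my session):

# list any running consul process; the [c] keeps grep itself out of the results
ps aux | grep [c]onsul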

Please advise.

Thank you

 

What is the status of the cluster: OK/degraded? Any OSDs down? Are the iSCSI disks running?

Yes, the Consul service is not starting. On the problem node, do you see errors if you start it manually:

consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind BACKEND1_IP_THIS_NODE -retry-join BACKEND1_IP_OTHER_NODE1 -retry-join BACKEND1_IP_OTHER_NODE2
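(Note: the BACKEND1_IP_* values above are placeholders to be replaced with the real backend addresses. Going by the consul members output earlier, on node 2 the command would presumably look something like the following; 10.250.252.12 is only a guess for node 2's backend address, based on the numbering of the other two nodes:)

# placeholders substituted with backend IPs; 10.250.252.12 for node 2 is an assumption
consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind 10.250.252.12 -retry-join 10.250.252.11 -retry-join 10.250.252.13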

On the dashboard there are zeroes across the board, since I have not completed the configuration needed to make the cluster accessible to my VMware hosts. On the affected host, the command fails with the following response:

 

root@Peta-San-02:/# consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind BACKEND1_IP_THIS_NODE -retry-join BACKEND1_IP_OTHER_NODE1 -retry-join BACKEND1_IP_OTHER_NODE2
==> Error reading '/opt/petasan/config/etc/consul.d/server': open /opt/petasan/config/etc/consul.d/server: no such file or directory
root@Peta-San-02:/#

 

And indeed, the target directory contains only the following:

 

root@Peta-San-02:/# ls /opt/petasan/config/etc/consul.d/
client
root@Peta-San-02:/#

 

Connecting to the GUI gets me to the login screen, but with the response:

 

Alert! Consul connection error.

 

I inspected the indicated directory on all three servers and found that its contents differ:

root@Peta-SAN-01:~#  ls /opt/petasan/config/etc/consul.d/
server
root@Peta-SAN-01:~#

 

root@Peta-San-02:/# ls /opt/petasan/config/etc/consul.d/
client
root@Peta-San-02:/#

 

and

 

root@Peta-San-03:~# ls /opt/petasan/config/etc/consul.d/
server
root@Peta-San-03:~#

 

I duplicated the server directory from a working server to the affected server, and executed the specified recovery command. The output is as follows:

 

root@Peta-San-02:/# consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind BACKEND1_IP_THIS_NODE -retry-join BACKEND1_IP_OTHER_NODE1 -retry-join BACKEND1_IP_OTHER_NODE2
==> WARNING: LAN keyring exists but -encrypt given, using keyring
==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting Consul agent...
==> Error starting agent: Failed to start Consul server: Failed to start lan serf: Failed to create memberlist: Failed to parse advertise address!
root@Peta-San-02:/#
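The "Failed to parse advertise address" message looks consistent with the -bind argument still being the literal BACKEND1_IP_THIS_NODE placeholder rather than a real IP. Presumably listing the node's addresses would show the backend IP to pass to -bind (a sketch, not from my session):

# show this node's IPv4 addresses; the one on the 10.250.252.x backend network is what -bind expects
ip -4 addr show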


I rebooted the server remotely, and it now shows as up in the node list.

I can also access the server's management interface on port 5000.

I am now going to see if I can complete the configuration for access from my local VMware servers.

Node 2 having a /opt/petasan/config/etc/consul.d/client directory rather than /opt/petasan/config/etc/consul.d/server is very strange. The first 3 management nodes take on (among other things) the role of Consul servers, while storage nodes (node 4 and above) are Consul clients. Has this node 2 been deployed in a prior cluster as node 4 or greater? Or is it perhaps booting from an old install disk?

The cluster should be functioning with 2 nodes, even without node 2; is this the case? Do you have any OSDs on these nodes? Are they up? What is the cluster status on the dashboard: OK/Warning/Error?
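(To compare the Consul roles across all three nodes in one pass, I assume a loop like the following would do it, run from any node; 10.250.252.12 for node 2 is a guess, and it assumes root SSH between the nodes:)

# management nodes should show a "server" directory, storage nodes a "client" one
for ip in 10.250.252.11 10.250.252.12 10.250.252.13; do
    echo "== $ip =="
    ssh root@$ip 'ls /opt/petasan/config/etc/consul.d/'
done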

I found that /opt/petasan/config/etc/consul.d/ on each of the 3 servers contains either client or server, but not both.

Should the directory contain both client and server?

The dashboard "Ceph Cluster OSD status" panel shows total/up/down all as "0".

On the other hand, I have not completed the configuration tasks, as the servers will not activate the pools I created. If I refresh the browser, the status goes to "checking" and after a few seconds returns to "Inactive".

I ran a status command from a shell session and got the following:

root@Peta-San-03:~# ceph osd status --cluster=Cameron-SAN-01
+----+------+------+-------+--------+---------+--------+---------+-------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+------+------+-------+--------+---------+--------+---------+-------+
+----+------+------+-------+--------+---------+--------+---------+-------+
root@Peta-San-03:~#
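The empty table suggests no OSDs have been created or registered yet, which would also explain why the pools stay inactive: placement groups cannot go active without OSDs to map to. I assume the overall picture could be confirmed with something like:

# overall health, monitor quorum and OSD counts for this named cluster
ceph status --cluster=Cameron-SAN-01
# the CRUSH tree; an empty tree means no OSDs have been added yet
ceph osd tree --cluster=Cameron-SAN-01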
