
Difficulty recovering one of three servers after power outage


I have three Dell PowerEdge 1950s in a cluster. After a power outage, one of the three servers shows as down on the node status screen, even though the affected server is powered up. I have restarted it from the console and also done a hard reset. I can SSH to it via PuTTY, and if I point a browser at the server's IP address on port 5001, it replies after a few seconds with: "Cluster has been already created."

It then states: "Use any of the following Cluster Management URLs to manage your cluster", followed by a list of my servers at port 5000.

 

If I point a browser at the affected server on port 5000, I get the following response:

 

requests.exceptions.ConnectionError

 

ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=8500): Max retries exceeded with url: /v1/kv/PetaSAN/Sessions/eyJfcGVybWFuZW50Ijp0cnVlfQ.Dqb3SQ.nfpap5B5v6HLQ5rCcX4lVbiBh8E/_exp (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f22727ac6d0>: Failed to establish a new connection: [Errno 111] Connection refused',))

 

Since the connection refused error indicates a stopped service, is there a shell command I can use to bring the listener back up?
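From the traceback, the refused connection is to 127.0.0.1:8500. I assume a quick check like the following would confirm that nothing is listening on that port (a sketch on my part, not output from my session):

# check whether anything is listening locally on port 8500
ss -tlnp | grep ':8500'
# try the endpoint the web UI is failing to reach; "connection refused" means nothing is bound there
curl -v http://127.0.0.1:8500/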

Thank you

What is the status of the cluster: OK/degraded? Any OSDs down? Are the iSCSI disks running?

What is the output of

consul members

when run from the failed node and from the good nodes?

Sorry for the late response. I finally got the failed server powered back up. Of the three servers, the output from the two working servers is as follows:

 

root@Peta-SAN-01:~# consul members
Node         Address             Status  Type    Build  Protocol  DC
Peta-SAN-01  10.250.252.11:8301  alive   server  0.7.3  2         petasan
Peta-San-03  10.250.252.13:8301  alive   server  0.7.3  2         petasan
root@Peta-SAN-01:~#

and

root@Peta-San-03:~# consul members
Node         Address             Status  Type    Build  Protocol  DC
Peta-SAN-01  10.250.252.11:8301  alive   server  0.7.3  2         petasan
Peta-San-03  10.250.252.13:8301  alive   server  0.7.3  2         petasan
root@Peta-San-03:~#

 

The output from the failed server is:

root@Peta-San-02:~# consul members
Error connecting to Consul agent: dial tcp 127.0.0.1:8400: getsockopt: connection refused
root@Peta-San-02:~#

 

It seems a service is down and needs to be restarted.
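For what it's worth, I assume a simple process check on the failed node would confirm whether the agent is running at all (a sketch, not output from my session):

# list any running consul process; the [c] keeps grep itself out of the results
ps aux | grep [c]onsul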

Please advise.

Thank you

 

What is the status of the cluster: OK/degraded? Any OSDs down? Are the iSCSI disks running?

Yes, the Consul service is not starting. On the problem node, do you see errors if you start it manually:

consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind BACKEND1_IP_THIS_NODE -retry-join BACKEND1_IP_OTHER_NODE1 -retry-join BACKEND1_IP_OTHER_NODE2
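(Note: the BACKEND1_IP_* values above are placeholders to be replaced with the real backend addresses. Going by the consul members output earlier, on node 2 the command would presumably look something like the following; 10.250.252.12 is only a guess for node 2's backend address, based on the numbering of the other two nodes:)

# placeholders substituted with backend IPs; 10.250.252.12 for node 2 is an assumption
consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind 10.250.252.12 -retry-join 10.250.252.11 -retry-join 10.250.252.13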

On the dashboard there are zeroes across the board, since I have not completed the configuration needed to make the cluster accessible to my VMware hosts. On the affected host, the command fails with the following response:

 

root@Peta-San-02:/# consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind BACKEND1_IP_THIS_NODE -retry-join BACKEND1_IP_OTHER_NODE1 -retry-join BACKEND1_IP_OTHER_NODE2
==> Error reading '/opt/petasan/config/etc/consul.d/server': open /opt/petasan/config/etc/consul.d/server: no such file or directory
root@Peta-San-02:/#

 

And indeed, the target directory contains only the following:

 

root@Peta-San-02:/# ls /opt/petasan/config/etc/consul.d/
client
root@Peta-San-02:/#

 

Connecting to the GUI gets me to the login screen, but with the response:

 

Alert! Consul connection error.

 

I inspected the indicated directory on all three servers and found that its contents differ:

root@Peta-SAN-01:~#  ls /opt/petasan/config/etc/consul.d/
server
root@Peta-SAN-01:~#

 

root@Peta-San-02:/# ls /opt/petasan/config/etc/consul.d/
client
root@Peta-San-02:/#

 

and

 

root@Peta-San-03:~# ls /opt/petasan/config/etc/consul.d/
server
root@Peta-San-03:~#

 

I duplicated the server directory from a working server to the affected server, and executed the specified recovery command. The output is as follows:

 

root@Peta-San-02:/# consul agent -config-dir /opt/petasan/config/etc/consul.d/server -bind BACKEND1_IP_THIS_NODE -retry-join BACKEND1_IP_OTHER_NODE1 -retry-join BACKEND1_IP_OTHER_NODE2
==> WARNING: LAN keyring exists but -encrypt given, using keyring
==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting Consul agent...
==> Error starting agent: Failed to start Consul server: Failed to start lan serf: Failed to create memberlist: Failed to parse advertise address!
root@Peta-San-02:/#
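The "Failed to parse advertise address" message looks consistent with the -bind argument still being the literal BACKEND1_IP_THIS_NODE placeholder rather than a real IP. Presumably listing the node's addresses would show the backend IP to pass to -bind (a sketch, not from my session):

# show this node's IPv4 addresses; the one on the 10.250.252.x backend network is what -bind expects
ip -4 addr show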


I rebooted the server remotely, and it now shows as up in the node list.

I can also access the server's management interface on port 5000.

I am now going to see if I can complete the configuration for access from my local VMware servers.

Node 2 having a /opt/petasan/config/etc/consul.d/client directory rather than /opt/petasan/config/etc/consul.d/server is very strange. The first 3 management nodes take on (among other things) the role of Consul servers, while storage nodes (node 4 and above) are Consul clients. Has this node 2 been deployed in a prior cluster as node 4 or greater? Or is it perhaps booting from an old install disk?

The cluster should be functioning with 2 nodes, even without node 2; is this the case? Do you have any OSDs on these nodes? Are they up? What is the cluster status on the dashboard: OK/Warning/Error?
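(To compare the Consul roles across all three nodes in one pass, I assume a loop like the following would do it, run from any node; 10.250.252.12 for node 2 is a guess, and it assumes root SSH between the nodes:)

# management nodes should show a "server" directory, storage nodes a "client" one
for ip in 10.250.252.11 10.250.252.12 10.250.252.13; do
    echo "== $ip =="
    ssh root@$ip 'ls /opt/petasan/config/etc/consul.d/'
done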

I found that /opt/petasan/config/etc/consul.d/ on each of the 3 servers contains either client or server, but not both.

Should the directory contain both client and server?

The dashboard "Ceph Cluster OSD status" panel shows total/up/down all as "0".

On the other hand, I have not completed the configuration tasks, as the servers will not activate the pools I created. If I refresh the browser, the status goes to "checking" and after a few seconds returns to "Inactive".

I ran a status command from a shell session and got the following:

root@Peta-San-03:~# ceph osd status --cluster=Cameron-SAN-01
+----+------+------+-------+--------+---------+--------+---------+-------+
| id | host | used | avail | wr ops | wr data | rd ops | rd data | state |
+----+------+------+-------+--------+---------+--------+---------+-------+
+----+------+------+-------+--------+---------+--------+---------+-------+
root@Peta-San-03:~#
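The empty table suggests no OSDs have been created or registered yet, which would also explain why the pools stay inactive: placement groups cannot go active without OSDs to map to. I assume the overall picture could be confirmed with something like:

# overall health, monitor quorum and OSD counts for this named cluster
ceph status --cluster=Cameron-SAN-01
# the CRUSH tree; an empty tree means no OSDs have been added yet
ceph osd tree --cluster=Cameron-SAN-01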
