
node stays disconnected following power outage


I would also like to get to the bottom of the consul issue, so that I can better understand how this whole puzzle is put together. When I previously experienced this seemingly identical issue, I simply wiped all disks and reinstalled the whole cluster from the ground up. There was never a time when I reinstalled only a single node, and the cluster doesn't even build until the third node is joined.

I will probably just go ahead and reinstall/replace at this point, if there isn't anything else I could be looking at.

Thanks for your insight.

Still holding on in hopes of solving the consul issue, but ... when I do decide to do the reinstall/replace, is there anything that should be done on the current node, or cluster-wise, before wiping this node for redeployment? Can I re-use the same hostname when I do this?

When you replace a management node, you need to give it the same hostname and management IP address; it will then be recreated with its old configuration (backend IPs/monitor/consul) by itself.
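For anyone following along, a minimal sketch of what could be checked on a surviving management node before redeploying, so the replacement reuses the exact same hostname and management IP. The [cluster_name] placeholder follows the convention used later in this thread, and the note about /etc/hosts is an assumption based on the hosts file output shown further down:

# confirm the monitor names and addresses the cluster expects (the replacement must match)
ceph mon dump --cluster [cluster_name]

# confirm the node name consul knows the member by
consul members

# the hosts file PetaSAN syncs appears to hold the hostname-to-management-IP mapping
cat /etc/hosts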

 

Well, now that you describe a "replacement", my last questions seem a bit lacking in thought ... 🙂

Any further ideas on the consul issue? This really bugs me ...

No idea; the first 3 nodes (management nodes) should always be consul servers.
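In case it helps others reading this, a quick (hedged) way to check which role a node is currently playing, using the consul paths already shown in this thread:

# the Type column should read "server" on the first 3 (management) nodes
consul members

# a management node should have a server config dir here, not just a client one
ls /opt/petasan/config/etc/consul.d

# "server = true" is expected on a consul server
consul info | grep "server ="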

Hi,

I've built a 3-node cluster from the PetaSAN 2.3 ISO; all nodes had "apt dist-upgrade" run right after installation, and only 2 of the nodes have attached storage.

The last node, which also provides half of the ceph OSDs, is again out of quorum. The first time this happened was last week, so I rebuilt it from scratch last Friday.

But this morning it is again out of quorum.

004:~# find /opt/petasan/config/etc/consul.d -ls
59899928      4 drwxr-xr-x   3 root     root         4096 May 13 08:40 /opt/petasan/config/etc/consul.d
59899929      4 drwxr-xr-x   2 root     root         4096 May 13 08:40 /opt/petasan/config/etc/consul.d/client
59899930      4 -rw-r--r--   1 root     root          201 May 13 08:40 /opt/petasan/config/etc/consul.d/client/config.json

...

root@c2dsp1lcpt004:~# ceph --cluster bxrsvsan01 -s
  cluster:
    id:     9c2d4085-bd52-428f-8d0b-7a0883df6580
    health: HEALTH_WARN
            Degraded data redundancy: 7/14 objects degraded (50.000%), 7 pgs degraded, 1024 pgs undersized
            1/3 mons down, quorum c2dsp1lcpt001,c2dsp1lcpt003

  services:
    mon: 3 daemons, quorum c2dsp1lcpt001,c2dsp1lcpt003, out of quorum: c2dsp1lcpt004
    mgr: c2dsp1lcpt001(active), standbys: c2dsp1lcpt003
    osd: 24 osds: 12 up, 12 in

...

Each node has 4 physical interfaces configured as bond0 and bond1 using LACP.

Weird, but only the last node is in the hosts file.

root@c2dsp1lcpt004:~# cat /etc/hosts
172.19.112.12   c2dsp1lcpt004

 

Plenty of this in the logs.

root@c2dsp1lcpt004:~# tail -300 /var/log/syslog.1
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: Gonna write file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: wrote file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('index', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('data', [{u'LockIndex': 0, u'ModifyIndex': 102782, u'Value': '...yZHNwMWxjcHQwMDQK', u'Flags': 0, u'Key': u'PetaSAN/Config/Files
/etc/hosts', u'CreateIndex': 121}])
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: -------------------------------------
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('updated current_index: ', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: file_path
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: /etc/hosts
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: Gonna write file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: wrote file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('index', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('data', [{u'LockIndex': 0, u'ModifyIndex': 102782, u'Value': '...yZHNwMWxjcHQwMDQK', u'Flags': 0, u'Key': u'PetaSAN/Config/Files
/etc/hosts', u'CreateIndex': 121}])
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: -------------------------------------
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('updated current_index: ', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: file_path
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: /etc/hosts
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: Gonna write file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: wrote file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('index', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('data', [{u'LockIndex': 0, u'ModifyIndex': 102782, u'Value': '...yZHNwMWxjcHQwMDQK', u'Flags': 0, u'Key': u'PetaSAN/Config/Files
/etc/hosts', u'CreateIndex': 121}])
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: -------------------------------------
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('updated current_index: ', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: file_path
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: /etc/hosts
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: Gonna write file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: wrote file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('index', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('data', [{u'LockIndex': 0, u'ModifyIndex': 102782, u'Value': '...yZHNwMWxjcHQwMDQK', u'Flags': 0, u'Key': u'PetaSAN/Config/Files
/etc/hosts', u'CreateIndex': 121}])

And finally.

root@c2dsp1lcpt004:~# grep consul /var/log/syslog.1
May 12 07:33:07 c2dsp1lcpt004 consul[3277]: consul.fsm: snapshot created in 42.237µs
May 12 07:33:07 c2dsp1lcpt004 consul[3277]: raft: Starting snapshot up to 131083
May 12 07:33:07 c2dsp1lcpt004 consul[3277]: snapshot: Creating new snapshot at /opt/petasan/config/var/consul/raft/snapshots/780-131083-1557660787566.tmp
May 12 07:33:07 c2dsp1lcpt004 consul[3277]: snapshot: reaping snapshot /opt/petasan/config/var/consul/raft/snapshots/780-114697-1557566132638
May 12 07:33:07 c2dsp1lcpt004 consul[3277]: raft: Compacting logs from 112652 to 120843
May 12 07:33:07 c2dsp1lcpt004 consul[3277]: raft: Snapshot to 131083 complete
May 12 20:35:07 c2dsp1lcpt004 consul[3277]: consul.fsm: snapshot created in 21.263µs
May 12 20:35:07 c2dsp1lcpt004 consul[3277]: raft: Starting snapshot up to 139277
May 12 20:35:07 c2dsp1lcpt004 consul[3277]: snapshot: Creating new snapshot at /opt/petasan/config/var/consul/raft/snapshots/780-139277-1557707707920.tmp
May 12 20:35:07 c2dsp1lcpt004 consul[3277]: snapshot: reaping snapshot /opt/petasan/config/var/consul/raft/snapshots/780-122891-1557613332618
May 12 20:35:07 c2dsp1lcpt004 consul[3277]: raft: Compacting logs from 120844 to 129037
May 12 20:35:07 c2dsp1lcpt004 consul[3277]: raft: Snapshot to 139277 complete
root@c2dsp1lcpt004:~#
root@c2dsp1lcpt004:~# grep consul /var/log/syslog
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: agent: requesting shutdown
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: consul: shutting down server
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: serf: Shutdown without a Leave
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: dns: error starting tcp server: accept tcp 127.0.0.1:8600: use of closed network connection
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: serf: Shutdown without a Leave
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: agent: shutdown complete
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: http: Request GET /v1/kv/PetaSAN/Disks/?index=147017&wait=10s, error: rpc error: EOF from=127.0.0.1:45176
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: agent: Generated unique node ID "24b0e2d3-8470-4d33-0a1f-17811d30b14a" for this agent (persisted)
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: serf: EventMemberJoin: c2dsp1lcpt004 10.10.4.13
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: agent: Joining cluster...
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: manager: No servers available
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: agent: (LAN) joining: [10.10.4.11 10.10.4.12 10.10.4.13]
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: agent: failed to sync remote state: No known Consul servers
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: serf: EventMemberJoin: c2dsp1lcpt001 10.10.4.11
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: serf: EventMemberJoin: c2dsp1lcpt003 10.10.4.12
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: memberlist: Refuting a suspect message (from: c2dsp1lcpt004)
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: consul: adding server c2dsp1lcpt001 (Addr: tcp/10.10.4.11:8300) (DC: petasan)
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: consul: adding server c2dsp1lcpt003 (Addr: tcp/10.10.4.12:8300) (DC: petasan)
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: agent: (LAN) joined: 3 Err: <nil>
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: agent: Join completed. Synced with 3 initial agents
May 13 08:40:02 c2dsp1lcpt004 consul[309991]: agent: Deregistered service 'consul'
May 13 08:40:02 c2dsp1lcpt004 consul[309991]: agent: Synced node info

 

Any hint would be much appreciated.

Thank you,

Have a good day.

Charlie

The dist-upgrade should not be done; it could replace storage-related components, which could affect the kernel and iSCSI layer.

Still, you should not be getting ceph and consul errors; the most probable cause is network errors. Some things to look into:

do you see errors in

cat /opt/petasan/log/PetaSAN.log

check consul health via

consul members

what errors do you see if you try to start a failed osd manually via

/usr/bin/ceph-osd -f --cluster [cluster_name]  --id [osd_id] --setuser ceph --setgroup ceph

The /etc/hosts is also strange; it could be due to network errors when the file was being copied. Generally the core components such as iSCSI, ceph, and consul use IPs, so they would not be affected, but the root cause could still be the same.
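For reference, a rough sketch of those checks combined, using the cluster name bxrsvsan01 taken from the ceph -s output above; the OSD id 12 is only a hypothetical example, substitute one of your failed OSDs:

# recent errors in the PetaSAN log
grep -i error /opt/petasan/log/PetaSAN.log | tail -50

# all 3 management nodes should show as alive consul servers
consul members

# find which OSD ids are down
ceph osd tree --cluster bxrsvsan01 | grep down

# run one failed OSD in the foreground to see why it exits (id 12 is an example)
/usr/bin/ceph-osd -f --cluster bxrsvsan01 --id 12 --setuser ceph --setgroup ceph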

This is a continuation of the original thread, unrelated to the "dist-upgrade" posts. I rebuilt this cluster and it worked for a few months. This morning the health status showed a warning, so I logged in and found the following on node3:

  • same consul situation as previously described (/opt/petasan/config/etc/consul.d/server: no such file or directory)
  • all OSDs unavailable, showing as failed in the journalctl logs; ceph-disk list shows this for each:
    • /dev/sdb :
      /dev/sdb1 ceph data, unprepared
      /dev/sdb2 ceph block
  • mon service of course down
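A hedged sketch of commands that should reproduce the observations above on the affected node (unit names assume the standard ceph systemd layout, with the monitor named after the hostname as in the ceph -s output earlier in the thread):

# only a "client" dir here means the consul server config is missing
ls /opt/petasan/config/etc/consul.d

# shows the "unprepared" state reported above
ceph-disk list

# local monitor and OSD services
systemctl status ceph-mon@$(hostname)
systemctl list-units 'ceph-osd@*'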

I did find something very interesting / disconcerting in the logs this time:

29/04/2019 17:01:26 INFO Start add osd job for disk sdn.
29/04/2019 17:01:28 INFO Start cleaning disk sdn
29/04/2019 17:01:29 INFO Executing wipefs --all /dev/sdn
29/04/2019 17:01:29 INFO Executing dd if=/dev/zero of=/dev/sdn bs=1M count=500 oflag=direct,dsync >/dev/null 2>&1
29/04/2019 17:01:43 INFO Executing parted -s /dev/sdn mklabel gpt
29/04/2019 17:01:44 INFO Executing partprobe /dev/sdn
29/04/2019 17:01:47 INFO Starting ceph-disk zap /dev/sdn
29/04/2019 17:01:49 INFO ceph-disk zap done
29/04/2019 17:01:49 INFO Auto select journal for disk sdn.
29/04/2019 17:01:50 INFO User selected Auto journal, selected device is nvme1n1 disk for disk sdn.
29/04/2019 17:01:50 INFO Start prepare osd sdn
29/04/2019 17:01:50 INFO Starting ceph-disk prepare --zap-disk --bluestore --block-dev /dev/sdn --block.db /dev/nvme1n1 --cluster pdflvsan01
09/07/2019 08:06:55 INFO Joining node to running cluster.
09/07/2019 08:07:21 INFO Stopping petasan services on local node.
09/07/2019 08:07:21 INFO Stopping all petasan services.
09/07/2019 08:07:21 INFO files_sync.py process is 3305
09/07/2019 08:07:21 INFO Trying to stop files_sync.py process
09/07/2019 08:07:24 INFO files_sync.py process stopped
09/07/2019 08:07:24 INFO files_sync.py process is 3041736
09/07/2019 08:07:24 INFO files_sync.py process stopped
09/07/2019 08:07:24 INFO iscsi_service.py process is 3334
09/07/2019 08:07:24 INFO Trying to stop iscsi_service.py process
09/07/2019 08:07:27 INFO iscsi_service.py process stopped
09/07/2019 08:07:27 INFO iscsi_service.py process is 3041744
09/07/2019 08:07:27 INFO iscsi_service.py process stopped
09/07/2019 08:07:27 INFO admin.py process is 3430
09/07/2019 08:07:27 INFO Trying to stop admin.py process
09/07/2019 08:07:27 INFO Service is starting.
09/07/2019 08:07:27 INFO Cleaning unused configurations.
09/07/2019 08:07:27 INFO Cleaning all mapped disks
09/07/2019 08:07:27 INFO Cleaning unused rbd images.
09/07/2019 08:07:27 INFO Cleaning unused ips.
09/07/2019 08:07:30 INFO admin.py process stopped
09/07/2019 08:07:30 INFO admin.py process is 3449
09/07/2019 08:07:30 INFO admin.py process stopped
09/07/2019 08:07:30 INFO admin.py process is 3041807
09/07/2019 08:07:30 INFO admin.py process stopped
09/07/2019 08:07:30 INFO Starting local clean_ceph.
09/07/2019 08:07:30 INFO Stopping ceph services
09/07/2019 08:07:35 INFO Start cleaning config files
09/07/2019 08:07:36 INFO End cleaning config files
09/07/2019 08:07:36 INFO Starting ceph services
09/07/2019 08:07:36 INFO Starting local clean_consul.
09/07/2019 08:07:36 INFO Trying to clean Consul on local node
09/07/2019 08:07:36 INFO consul process pid is 2894
09/07/2019 08:07:36 INFO Trying to stop consul process
09/07/2019 08:07:36 WARNING Retrying (Retry(total=5, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', BadStatusLine("''",))': /v1/kv/PetaSAN/Config/Files?index=264&recurse=1
09/07/2019 08:07:36 WARNING Retrying (Retry(total=5, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', BadStatusLine("''",))': /v1/kv/PetaSAN/Disks/?index=2720037&wait=10s
09/07/2019 08:07:38 WARNING Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f64bf613510>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Disks/?index=2720037&wait=10s

Logs from 29/04/2019 showed the tail end of setup from the last wipe/reload of the node. Suddenly this morning (09/07/2019) the node is being joined to a cluster? What could have caused that? The node has been part of this cluster since the last log entry a few months ago ... but the timing appears to match the start of my problems ...

This cluster is running 2.2.0, with the recommended vmware-performance-related changes.

Suggestions for finding the cause (and a way to prevent this) happily accepted.

To have a node re-join by itself is indeed strange. I checked with the developers, and the function to join a cluster is only called within the deployment wizard, so it can only happen if the node is redeployed. Moreover, if you try to redeploy an already deployed node (via the web UI, pages 1 to 8), it will tell you the node is already deployed. The only case we see that can redeploy an already deployed node is if the internal URL for page 8 is called directly. Note this URL is not visible in the browser address bar or history, but it could be found by looking at the source code. This may not make too much sense, but it is the only thing we see.
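One (hedged) way to narrow down when that code path fired is to search the PetaSAN log on the affected node for the join message quoted above, then compare the timestamps with any logins or activity around that time:

# when the deployment join was triggered
grep -n "Joining node to running cluster" /opt/petasan/log/PetaSAN.log

# recent logins/reboots on the node, for comparison
last -F | head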

That is very interesting information ...

Hypothetical question: if I deployed this node successfully, did not touch the browser window through which the deployment was triggered, and left that page open for a couple of months ... and then came back, when the browser refreshes its pages, could that potentially trigger the internal URL (again) and cause what I am seeing?
