node stays disconnected following power outage
deweyhylton
14 Posts
April 25, 2019, 10:34 pm
I would also like to get to the bottom of the consul issue, so that I can better understand how this whole puzzle is put together. When I previously experienced what seemed to be this exact same issue, I simply wiped all disks and reinstalled the whole cluster from the ground up. There was never a time when I reinstalled only a single node, and the cluster doesn't even build until the third node is joined.
I will probably just go ahead and reinstall/replace at this point, if there isn't anything else I could be looking at.
Thanks for your insight.
deweyhylton
14 Posts
April 26, 2019, 3:39 pm
Still holding out hope of solving the consul issue, but ... when I do decide to reinstall/replace, is there anything that should be done on the current node, or cluster-wide, before wiping this node for redeployment? Can I re-use the same hostname when I do this?
admin
2,930 Posts
April 26, 2019, 4:38 pm
When you replace a management node, you need to give it the same hostname and management IP address; it will then recreate its old configuration (backend IPs/monitor/consul) by itself.
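For example, before wiping the node you could record the two values the replacement must re-use (a minimal sketch; eth0 is just a placeholder for whichever interface carries your management subnet):
# on the node being replaced, note the values the new install must re-use
hostname                # must be identical on the replacement node
ip -4 addr show eth0    # note the management IP; assign the same one during redeployment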
Last edited on April 26, 2019, 4:39 pm by admin · #13
deweyhylton
14 Posts
April 26, 2019, 5:22 pm
Well, now that you describe a "replacement", my last questions seem a bit lacking in thought ... 🙂
Any further ideas on the consul issue? This really bugs me ...
admin
2,930 Posts
April 26, 2019, 5:28 pm
No idea; the first 3 nodes (management nodes) should always be consul servers.
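If in doubt, you can check which role each node is actually running; a rough sketch, using the stock Consul CLI and the PetaSAN config path that appears elsewhere in this thread:
# run on each of the first 3 (management) nodes
consul members                        # the Type column should read "server" on all three management nodes
ls /opt/petasan/config/etc/consul.d   # a consul server node should have a server/ config dir, not only client/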
charlie_mtl
9 Posts
May 13, 2019, 2:10 pm
Hi,
I've built a 3-node cluster from the PetaSAN 2.3 ISO. All nodes were "apt dist-upgrade"d right after installation, and only 2 of the nodes have attached storage.
The last node, which also provides half of the Ceph OSDs, is again out of quorum. The first time this happened was last week, so I rebuilt it from scratch last Friday.
But this morning it is out of quorum again.
004:~# find /opt/petasan/config/etc/consul.d -ls
59899928 4 drwxr-xr-x 3 root root 4096 May 13 08:40 /opt/petasan/config/etc/consul.d
59899929 4 drwxr-xr-x 2 root root 4096 May 13 08:40 /opt/petasan/config/etc/consul.d/client
59899930 4 -rw-r--r-- 1 root root 201 May 13 08:40 /opt/petasan/config/etc/consul.d/client/config.json
...
root@c2dsp1lcpt004:~# ceph --cluster bxrsvsan01 -s
cluster:
id: 9c2d4085-bd52-428f-8d0b-7a0883df6580
health: HEALTH_WARN
Degraded data redundancy: 7/14 objects degraded (50.000%), 7 pgs degraded, 1024 pgs undersized
1/3 mons down, quorum c2dsp1lcpt001,c2dsp1lcpt003
services:
mon: 3 daemons, quorum c2dsp1lcpt001,c2dsp1lcpt003, out of quorum: c2dsp1lcpt004
mgr: c2dsp1lcpt001(active), standbys: c2dsp1lcpt003
osd: 24 osds: 12 up, 12 in
...
Each node has 4 physical interfaces configured as bond0 and bond1 using LACP.
Weirdly, only the last node is in the hosts file.
root@c2dsp1lcpt004:~# cat /etc/hosts
172.19.112.12 c2dsp1lcpt004
Plenty of this in the logs.
root@c2dsp1lcpt004:~# tail -300 /var/log/syslog.1
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: Gonna write file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: wrote file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('index', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('data', [{u'LockIndex': 0, u'ModifyIndex': 102782, u'Value': '...yZHNwMWxjcHQwMDQK', u'Flags': 0, u'Key': u'PetaSAN/Config/Files
/etc/hosts', u'CreateIndex': 121}])
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: -------------------------------------
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('updated current_index: ', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: file_path
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: /etc/hosts
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: Gonna write file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: wrote file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('index', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('data', [{u'LockIndex': 0, u'ModifyIndex': 102782, u'Value': '...yZHNwMWxjcHQwMDQK', u'Flags': 0, u'Key': u'PetaSAN/Config/Files
/etc/hosts', u'CreateIndex': 121}])
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: -------------------------------------
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('updated current_index: ', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: file_path
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: /etc/hosts
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: Gonna write file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: wrote file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('index', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('data', [{u'LockIndex': 0, u'ModifyIndex': 102782, u'Value': '...yZHNwMWxjcHQwMDQK', u'Flags': 0, u'Key': u'PetaSAN/Config/Files
/etc/hosts', u'CreateIndex': 121}])
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: -------------------------------------
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('updated current_index: ', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: file_path
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: /etc/hosts
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: Gonna write file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: wrote file
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('index', '102782')
May 13 03:40:06 c2dsp1lcpt004 files_sync.py[3384]: ('data', [{u'LockIndex': 0, u'ModifyIndex': 102782, u'Value': '...yZHNwMWxjcHQwMDQK', u'Flags': 0, u'Key': u'PetaSAN/Config/Files
/etc/hosts', u'CreateIndex': 121}])
And finally.
root@c2dsp1lcpt004:~# grep consul /var/log/syslog.1
May 12 07:33:07 c2dsp1lcpt004 consul[3277]: consul.fsm: snapshot created in 42.237µs
May 12 07:33:07 c2dsp1lcpt004 consul[3277]: raft: Starting snapshot up to 131083
May 12 07:33:07 c2dsp1lcpt004 consul[3277]: snapshot: Creating new snapshot at /opt/petasan/config/var/consul/raft/snapshots/780-131083-1557660787566.tmp
May 12 07:33:07 c2dsp1lcpt004 consul[3277]: snapshot: reaping snapshot /opt/petasan/config/var/consul/raft/snapshots/780-114697-1557566132638
May 12 07:33:07 c2dsp1lcpt004 consul[3277]: raft: Compacting logs from 112652 to 120843
May 12 07:33:07 c2dsp1lcpt004 consul[3277]: raft: Snapshot to 131083 complete
May 12 20:35:07 c2dsp1lcpt004 consul[3277]: consul.fsm: snapshot created in 21.263µs
May 12 20:35:07 c2dsp1lcpt004 consul[3277]: raft: Starting snapshot up to 139277
May 12 20:35:07 c2dsp1lcpt004 consul[3277]: snapshot: Creating new snapshot at /opt/petasan/config/var/consul/raft/snapshots/780-139277-1557707707920.tmp
May 12 20:35:07 c2dsp1lcpt004 consul[3277]: snapshot: reaping snapshot /opt/petasan/config/var/consul/raft/snapshots/780-122891-1557613332618
May 12 20:35:07 c2dsp1lcpt004 consul[3277]: raft: Compacting logs from 120844 to 129037
May 12 20:35:07 c2dsp1lcpt004 consul[3277]: raft: Snapshot to 139277 complete
root@c2dsp1lcpt004:~#
root@c2dsp1lcpt004:~# grep consul /var/log/syslog
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: agent: requesting shutdown
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: consul: shutting down server
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: serf: Shutdown without a Leave
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: dns: error starting tcp server: accept tcp 127.0.0.1:8600: use of closed network connection
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: serf: Shutdown without a Leave
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: agent: shutdown complete
May 13 08:39:57 c2dsp1lcpt004 consul[3277]: http: Request GET /v1/kv/PetaSAN/Disks/?index=147017&wait=10s, error: rpc error: EOF from=127.0.0.1:45176
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: agent: Generated unique node ID "24b0e2d3-8470-4d33-0a1f-17811d30b14a" for this agent (persisted)
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: serf: EventMemberJoin: c2dsp1lcpt004 10.10.4.13
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: agent: Joining cluster...
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: manager: No servers available
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: agent: (LAN) joining: [10.10.4.11 10.10.4.12 10.10.4.13]
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: agent: failed to sync remote state: No known Consul servers
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: serf: EventMemberJoin: c2dsp1lcpt001 10.10.4.11
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: serf: EventMemberJoin: c2dsp1lcpt003 10.10.4.12
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: memberlist: Refuting a suspect message (from: c2dsp1lcpt004)
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: consul: adding server c2dsp1lcpt001 (Addr: tcp/10.10.4.11:8300) (DC: petasan)
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: consul: adding server c2dsp1lcpt003 (Addr: tcp/10.10.4.12:8300) (DC: petasan)
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: agent: (LAN) joined: 3 Err: <nil>
May 13 08:40:01 c2dsp1lcpt004 consul[309991]: agent: Join completed. Synced with 3 initial agents
May 13 08:40:02 c2dsp1lcpt004 consul[309991]: agent: Deregistered service 'consul'
May 13 08:40:02 c2dsp1lcpt004 consul[309991]: agent: Synced node info
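For reference, here is what I plan to check next on 004; just a sketch, where port 8500 is Consul's default HTTP API port and the KV key is the one files_sync.py logs above:
# which role does Consul think this node has, and which config is it loading?
consul members
ls -l /opt/petasan/config/etc/consul.d                     # only client/ is present here, no server/ dir
cat /opt/petasan/config/etc/consul.d/client/config.json
# what /etc/hosts content does the cluster KV actually hold?
curl -s 'http://127.0.0.1:8500/v1/kv/PetaSAN/Config/Files/etc/hosts?raw'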
Any hint would be much appreciated.
Thank you,
Have a good day.
Charlie
admin
2,930 Posts
May 13, 2019, 3:23 pm
The dist-upgrade should not be done; it could replace storage-related components, which could affect the kernel and iSCSI layer.
Still, you should not be getting Ceph and Consul errors. The most probable cause is network errors. Some things to look into:
do you see errors in
cat /opt/petasan/log/PetaSAN.log
check consul health via
consul members
what errors do you see if you try to start a failed osd manually via
/usr/bin/ceph-osd -f --cluster [cluster_name] --id [osd_id] --setuser ceph --setgroup ceph
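Putting those together, roughly (the cluster name bxrsvsan01 is taken from your output above; osd id 12 is only a placeholder, use one of the OSDs reported down):
# recent errors/warnings reported by the PetaSAN services
grep -iE 'error|warning' /opt/petasan/log/PetaSAN.log | tail -n 50
# consul cluster health: the 3 management nodes should all be Status "alive" and Type "server"
consul members
# start one failed OSD in the foreground to see why it does not come up
/usr/bin/ceph-osd -f --cluster bxrsvsan01 --id 12 --setuser ceph --setgroup ceph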
The /etc/hosts is also strange; it could be due to network errors while the file was being copied. Generally the core components such as iSCSI, Ceph, and Consul use IPs, so they would not be affected, but the root cause could still be the same.
Last edited on May 13, 2019, 3:24 pm by admin · #17
deweyhylton
14 Posts
July 9, 2019, 5:12 pm
This is a continuation of the original thread, unrelated to the "dist-upgrade" posts. I rebuilt this cluster and it worked for a few months. This morning the health status showed a warning, so I logged in and found the following on node3:
- same consul situation as previously described (/opt/petasan/config/etc/consul.d/server: no such file or directory)
- all osds unavailable, showing failed in journalctl logs, ceph-disk list showing this for each:
- /dev/sdb :
/dev/sdb1 ceph data, unprepared
/dev/sdb2 ceph block
- mon service of course down
I did find something very interesting / disconcerting in the logs this time:
29/04/2019 17:01:26 INFO Start add osd job for disk sdn.
29/04/2019 17:01:28 INFO Start cleaning disk sdn
29/04/2019 17:01:29 INFO Executing wipefs --all /dev/sdn
29/04/2019 17:01:29 INFO Executing dd if=/dev/zero of=/dev/sdn bs=1M count=500 oflag=direct,dsync >/dev/null 2>&1
29/04/2019 17:01:43 INFO Executing parted -s /dev/sdn mklabel gpt
29/04/2019 17:01:44 INFO Executing partprobe /dev/sdn
29/04/2019 17:01:47 INFO Starting ceph-disk zap /dev/sdn
29/04/2019 17:01:49 INFO ceph-disk zap done
29/04/2019 17:01:49 INFO Auto select journal for disk sdn.
29/04/2019 17:01:50 INFO User selected Auto journal, selected device is nvme1n1 disk for disk sdn.
29/04/2019 17:01:50 INFO Start prepare osd sdn
29/04/2019 17:01:50 INFO Starting ceph-disk prepare --zap-disk --bluestore --block-dev /dev/sdn --block.db /dev/nvme1n1 --cluster pdflvsan01
09/07/2019 08:06:55 INFO Joining node to running cluster.
09/07/2019 08:07:21 INFO Stopping petasan services on local node.
09/07/2019 08:07:21 INFO Stopping all petasan services.
09/07/2019 08:07:21 INFO files_sync.py process is 3305
09/07/2019 08:07:21 INFO Trying to stop files_sync.py process
09/07/2019 08:07:24 INFO files_sync.py process stopped
09/07/2019 08:07:24 INFO files_sync.py process is 3041736
09/07/2019 08:07:24 INFO files_sync.py process stopped
09/07/2019 08:07:24 INFO iscsi_service.py process is 3334
09/07/2019 08:07:24 INFO Trying to stop iscsi_service.py process
09/07/2019 08:07:27 INFO iscsi_service.py process stopped
09/07/2019 08:07:27 INFO iscsi_service.py process is 3041744
09/07/2019 08:07:27 INFO iscsi_service.py process stopped
09/07/2019 08:07:27 INFO admin.py process is 3430
09/07/2019 08:07:27 INFO Trying to stop admin.py process
09/07/2019 08:07:27 INFO Service is starting.
09/07/2019 08:07:27 INFO Cleaning unused configurations.
09/07/2019 08:07:27 INFO Cleaning all mapped disks
09/07/2019 08:07:27 INFO Cleaning unused rbd images.
09/07/2019 08:07:27 INFO Cleaning unused ips.
09/07/2019 08:07:30 INFO admin.py process stopped
09/07/2019 08:07:30 INFO admin.py process is 3449
09/07/2019 08:07:30 INFO admin.py process stopped
09/07/2019 08:07:30 INFO admin.py process is 3041807
09/07/2019 08:07:30 INFO admin.py process stopped
09/07/2019 08:07:30 INFO Starting local clean_ceph.
09/07/2019 08:07:30 INFO Stopping ceph services
09/07/2019 08:07:35 INFO Start cleaning config files
09/07/2019 08:07:36 INFO End cleaning config files
09/07/2019 08:07:36 INFO Starting ceph services
09/07/2019 08:07:36 INFO Starting local clean_consul.
09/07/2019 08:07:36 INFO Trying to clean Consul on local node
09/07/2019 08:07:36 INFO consul process pid is 2894
09/07/2019 08:07:36 INFO Trying to stop consul process
09/07/2019 08:07:36 WARNING Retrying (Retry(total=5, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', BadStatusLine("''",))': /v1/kv/PetaSAN/Config/Files?index=264&recurse=1
09/07/2019 08:07:36 WARNING Retrying (Retry(total=5, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', BadStatusLine("''",))': /v1/kv/PetaSAN/Disks/?index=2720037&wait=10s
09/07/2019 08:07:38 WARNING Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f64bf613510>: Failed to establish a new connection: [Errno 111] Connection refused',)': /v1/kv/PetaSAN/Disks/?index=2720037&wait=10s
Logs from 29/04/2019 showed the tail end of setup from the last wipe/reload of the node. Suddenly this morning (09/07/2019) the node is being joined to a cluster? What could have caused that? The node has been part of this cluster since the last log entry a few months ago ... but the timing appears to match the start of my problems ...
This cluster is running 2.2.0, with the recommended vmware-performance-related changes.
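In the meantime, these are the checks I'm using to pin down the timeline and current state; the paths and commands are the ones already mentioned in this thread:
# when exactly did the unexpected re-join run?
grep -n 'Joining node to running cluster' /opt/petasan/log/PetaSAN.log
# is consul on this management node now configured as a plain client again?
ls -l /opt/petasan/config/etc/consul.d
consul members
# state of the OSD partitions left "unprepared" by the cleanup
ceph-disk list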
Suggestions for finding the cause (and a way to prevent this) happily accepted.
admin
2,930 Posts
July 10, 2019, 1:45 pm
To have a node re-join by itself is indeed strange. I checked with the developers: the function to join a cluster is only called from within the deployment wizard, so it can only happen if the node is redeployed. Moreover, if you try to redeploy an already deployed node (via the web UI, starting at pages 1 through 8), it will tell you the node is already deployed. The only case we can see that would redeploy an already deployed node is if the internal URL for page 8 is called directly. Note this URL is not visible in the browser address bar or history, but it could be found by looking at the source code. This may not make much sense, but it is the only thing we see.
Last edited on July 10, 2019, 1:46 pm by admin · #19
deweyhylton
14 Posts
July 10, 2019, 2:22 pm
That is very interesting information ...
Hypothetical question: if I deployed this node successfully, did not touch the browser window through which the deployment was triggered, and then came back to that page a few months later, could the browser refreshing its pages potentially trigger the internal URL again and cause what I am seeing?