next steps on troubleshooting

hak
23 Posts
July 15, 2023, 12:39 pm
Pre-production cluster is set up (5x Dell R640s, 1G management, 2x 10G iSCSI (not bonded), and 2x 40G private network (bonded LACP), with 3x NVMe OSDs each).
- I did individual disk benchmarking on each node from the PetaSAN console before cluster setup; all disks checked out within a few % of each other.
- Created the cluster with a replica-3 scheme and RBD type, high-end hardware profile (192 GB RAM and 2x 12-core 3+ GHz CPUs per node), all without errors.
- Did the in-GUI cluster load testing with 3 and 4 'storage nodes' to compare to my 3-node test cluster (using a single HHHL NVMe); happy with the results.
- All seemed fine; no production data on this system yet.
This AM I got an email:
Dear PetaSAN user;
Cluster XXX-CLUS-01 has one or more osd failures, please check the following osd(s):
- osd.5/XXX-PSAN-01
- osd.4/XXX-PSAN-01
- osd.3/XXX-PSAN-01
Host hardware check via DRAC (IPMI): server is all green, no hardware failures.
40G switch fabric check: port-channels (5 of them, each with 2x 40G members/slaves) are all up.
Trying the GUI on XXX-CLUS-01 results in:
504 Gateway Time-out / nginx/1.18.0 (Ubuntu)
From another node, viewing XXX-CLUS-01 in the Nodes List says it's up/green, but clicking on -01's disks results in an empty page, and clicking on its logs shows the following:
"...15/07/2023 08:28:31 WARNING connect() retry(1) Cannot connect to ceph cluster.
rados.TimedOut: [errno 110] RADOS timed out (error connecting to the cluster)
File "rados.pyx", line 680, in rados.Rados.connect
cluster.connect()
File "/usr/lib/python3/dist-packages/PetaSAN/core/ceph/ceph_connector.py", line 67, in do_connect
Traceback (most recent call last):
15/07/2023 08:28:31 ERROR [errno 110] RADOS timed out (error connecting to the cluster)
15/07/2023 08:28:31 ERROR do_connect() Cannot connect to ceph cluster.
15/07/2023 08:27:42 WARNING connect() retry(1) Cannot connect to ceph cluster.
rados.TimedOut: [errno 110] RADOS timed out (error connecting to the cluster)
File "rados.pyx", line 680, in rados.Rados.connect
cluster.connect()
File "/usr/lib/python3/dist-packages/PetaSAN/core/ceph/ceph_connector.py", line 67, in do_connect
Traceback (most recent call last):
15/07/2023 08:27:42 ERROR [errno 110] RADOS timed out (error connecting to the cluster)
15/07/2023 08:27:42 ERROR do_connect() Cannot connect to ceph cluster.
15/07/2023 08:21:44 WARNING connect() retry(1) Cannot connect to ceph cluster.
rados.TimedOut: [errno 110] RADOS timed out (error connecting to the cluster)
File "rados.pyx", line 680, in rados.Rados.connect
cluster.connect()
File "/usr/lib/python3/dist-packages/PetaSAN/core/ceph/ceph_connector.py", line 67, in do_connect
Traceback (most recent call last):
15/07/2023 08:21:44 ERROR [errno 110] RADOS timed out (error connecting to the cluster)
15/07/2023 08:21:44 ERROR do_connect() Cannot connect to ceph cluster.
[errno 110] RADOS timed out (error connecting to the cluster)
2023-07-15T08:21:14.081-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config
2023-07-15T08:21:11.081-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config
2023-07-15T08:21:08.082-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config
2023-07-15T08:21:05.082-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config
2023-07-15T08:21:02.078-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config
2023-07-15T08:20:59.078-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config
2023-07-15T08:20:56.079-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config
2023-07-15T08:20:53.075-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config
2023-07-15T08:20:50.055-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config
15/07/2023 08:21:14 ERROR 2023-07-15T08:20:47.055-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config
15/07/2023 08:16:00 WARNING connect() retry(1) Cannot connect to ceph cluster...."
SSH to -01 still works.
What should I check to try and find cause and remediation?
Last edited on July 15, 2023, 12:41 pm by hak · #1
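For anyone triaging the same symptoms: a first pass with SSH still working might look like the sketch below. These are standard Ceph and systemd commands rather than anything PetaSAN-specific; the OSD IDs come from the alert email, and the ceph-osd@<id> unit naming is the stock convention, so adjust if your install differs.

# From a node whose GUI still works: overall cluster and OSD view
ceph -s
ceph health detail
ceph osd tree

# On the affected node (over SSH, since its GUI times out):
# are the three flagged OSD daemons running, dead, or restarting in a loop?
systemctl status ceph-osd@3 ceph-osd@4 ceph-osd@5
journalctl -u ceph-osd@3 --since "2 hours ago" --no-pager | tail -n 50

# The repeated RADOS connect timeouts in the node log suggest the node cannot
# reach the monitors at all, which points at the backend network rather than the disks.
ceph -s --connect-timeout 10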

admin
2,967 Posts
July 15, 2023, 7:32 pm
"What should I check to try and find cause and remediation?"
Well, that is the $64M question; there is not always a direct answer. That is why there are still IT jobs around 🙂. In your case, I would first check the backend network on node 1.
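A minimal sketch of that backend check, assuming the backend rides on the bond0_40G LACP bond with jumbo frames (MTU 9000) as described elsewhere in this thread; the interface names and x.x.x.220 (node 1's backend address as masked here) are this cluster's, so substitute your own.

# Bond and member state as Linux sees it, not just what the switch reports
cat /proc/net/bonding/bond0_40G
ethtool eth6
ethtool eth7

# From another node: confirm node 1's backend IP passes full-size jumbo frames
# (-M do sets don't-fragment; 8972 = 9000 MTU minus IP+ICMP headers)
ping -c 5 -M do -s 8972 x.x.x.220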

hak
23 Posts
July 16, 2023, 6:33 pm
OK, so the VLAG (bond0 for node 1) actually was in a degraded state (I was looking at my other fabric when checking yesterday); one of the two bond/slave links was down.
Looking at xxx-psan-01's console screen, it was filled with:
[xxxxxxx.xxxxxxxxx] bond0_40G: (slave eth6): speed changed to 0 on port 1
(repeated many, many times)
I came to the datacenter and swapped fiber cables: no change. I then swapped in another new QSFP in the Mellanox CX-5 card's port 1, and that did it: bond happy, both slaves up, no speed-change notices, and the green LED blink pattern on the Mellanox card is normal now.
But xxx-psan-01 still wasn't happy, so I rebooted it from the PetaSAN console:
1. No speed-change notices post reboot, so that's good.
2. bond0 still good on the switch (2 of 2 up).
3. But the dashboard still shows 12/15 OSDs.
4. Ceph health warning about xxx-psan-01 having "clock skew".
5. Maybe it was just boot-up? Because checking 'date' shows fine, and ntpq -p shows it's using NTP correctly.
6. After a few more minutes of uptime, another warning message: "..."
7. And still no GUI at https://<ip-of-psan-01> (same nginx 504).
8. Tried another graceful reboot; still getting the symptoms of items 3-6.
All networking looks good. What causes the 'slow ops' and the web GUI not working?
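For what it's worth, both the clock-skew warning and the 'slow ops' text come from the monitors, so the quickest way to see exactly what they are complaining about is the sketch below (plain Ceph/systemd commands; the mon ID is assumed to be the hostname, which is the usual arrangement, and OSD IDs 3-5 are the ones from the alert).

# Full warning text: which daemons report slow ops, and how large the skew is
ceph health detail

# Skew as measured between the monitors themselves
ceph time-sync-status

# Did the mon and OSD daemons on node 1 come back cleanly after the reboot?
systemctl status ceph-mon@xxx-PSAN-01 ceph-osd@3 ceph-osd@4 ceph-osd@5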

hak
23 Posts
July 16, 2023, 9:38 pm
Still no GUI on node-01, and the warning is just growing:
What causes that?

hak
23 Posts
July 17, 2023, 3:21 pm
Same: still no GUI on node-01, and the warning climbs:
I can ping both the management and backend/private IPs, and I've compared the "ip a" configs of the troublesome node 1 and a working node; I can see no difference:
eth6, not working PSAN-01:
<BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0_40G state UP group default qlen 1000
link/ether 08:c0:eb:f7:1e:6a brd ff:ff:ff:ff:ff:ff
altname enp216s0f0np0
eth6, working PSAN-02:
<BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0_40G state UP group default qlen 1000
link/ether 08:c0:eb:f7:1d:0a brd ff:ff:ff:ff:ff:ff
altname enp216s0f0np0
eth7, not working PSAN-01:
<BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0_40G state UP group default qlen 1000
link/ether 08:c0:eb:f7:1e:6a brd ff:ff:ff:ff:ff:ff
altname enp216s0f1np1
eth7, working PSAN-02:
<BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0_40G state UP group default qlen 1000
link/ether 08:c0:eb:f7:1d:0a brd ff:ff:ff:ff:ff:ff
altname enp216s0f1np1
bond0, not working PSAN-01:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 08:c0:eb:f7:1e:6a brd ff:ff:ff:ff:ff:ff
inet6 fe80::ac0:ebff:fef7:1e6a/64 scope link
valid_lft forever preferred_lft forever
bond0, working PSAN-02:
<BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 08:c0:eb:f7:1d:0a brd ff:ff:ff:ff:ff:ff
inet6 fe80::ac0:ebff:fef7:1d0a/64 scope link
valid_lft forever preferred_lft forever
bond0_40G.210@bond0_40G, not working PSAN-01:
<BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 08:c0:eb:f7:1e:6a brd ff:ff:ff:ff:ff:ff
inet x.x.x.220/24 scope global bond0_40G.210
valid_lft forever preferred_lft forever
inet6 fe80::ac0:ebff:fef7:1e6a/64 scope link
valid_lft forever preferred_lft forever
bond0_40G.210@bond0_40G, working PSAN-02:
<BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
link/ether 08:c0:eb:f7:1d:0a brd ff:ff:ff:ff:ff:ff
inet x.x.x.222/24 scope global bond0_40G.210
valid_lft forever preferred_lft forever
inet6 fe80::ac0:ebff:fef7:1d0a/64 scope link
valid_lft forever preferred_lft forever
What is keeping node-01 from communicating / bringing the OSDs back? In the Nodes List, the status for all 5 is up/green, yet I'm still at 12/15 OSDs and no HTTPS to node-01 (SSH still works).
Last edited on July 17, 2023, 3:48 pm by hak · #5
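Since ip a and ping look identical, the remaining questions are whether the three OSD daemons on node 1 are even trying to start and what they think is wrong. A sketch, using the standard Ceph unit names and log paths (adjust the OSD number):

# All OSD units on this node, including dead/inactive ones
systemctl list-units --all 'ceph-osd@*'

# The last lines of an OSD log usually name the exact problem
# (authentication failure, clock skew, unreachable monitor, ...)
tail -n 50 /var/log/ceph/ceph-osd.3.log

# Does node 1 itself see monitor quorum?
ceph quorum_status --connect-timeout 10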

admin
2,967 Posts
July 17, 2023, 4:01 pm
After the backend network issue, there could now be a clock sync issue; you should try to fix the clock sync/skew. Check the times of the nodes and the time zones, and check the offset to node 1 by running ntpq -p from the other nodes. Check whether you are using an external NTP server and whether the real-time clock on node 1 is good. Maybe the reboot to check the network was fine but the time is off.
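One way to do that comparison in a single pass is the loop below; it assumes passwordless SSH between the nodes (otherwise run the same three commands on each node by hand), and the hostnames are this cluster's.

# Wall clock, sync state, and NTP peers for all five nodes, side by side
for h in xxx-PSAN-01 xxx-PSAN-02 xxx-PSAN-03 xxx-PSAN-04 xxx-PSAN-05; do
    echo "== $h =="
    ssh "$h" 'date -u; timedatectl | grep -Ei "synchronized|ntp"; ntpq -p'
done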

hak
23 Posts
July 17, 2023, 4:33 pm
Right; ntpq -p on PSAN-01 shows OK:
root@xxx-PSAN-01:~# ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
*x.x.x.1 208.91.112.62 3 u 430 1024 1 0.083 -0.082 0.042
LOCAL(0) .LOCL. 7 l 442 64 100 0.000 0.000 0.000
root@xxx-PSAN-01:~#
Oddly, on the (functional) nodes 2-5 it shows the same non-answer:
root@xxx-PSAN-02:~# ntpq -p
ntpq: read: Connection refused
root@xxx-PSAN-02:~#
root@xxx-PSAN-03:~# ntpq -p
ntpq: read: Connection refused
root@xxx-PSAN-03:~#
root@xxx-PSAN-04:~# ntpq -p
ntpq: read: Connection refused
root@xxx-PSAN-04:~#
root@xxx-PSAN-05:~# ntpq -p
ntpq: read: Connection refused
root@xxx-PSAN-05:~#
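'ntpq: read: Connection refused' normally just means the local ntpd is not listening on 127.0.0.1, i.e. the NTP service is not running on those nodes at all; it is not a peer or config problem. A quick way to confirm and recover, assuming the classic ntp package (ntpd) that ships /etc/ntp.conf and ntpq:

# Is the daemon running? (the unit is named "ntp" on Ubuntu for the ntpd package)
systemctl is-active ntp
systemctl status ntp --no-pager

# If it is dead, start it and make sure it survives the next reboot
systemctl enable --now ntp

# A few minutes later the peer list should come back
ntpq -p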

hak
23 Posts
July 17, 2023, 6:08 pm
cat /etc/ntp.conf (contents as installed, below):
On PSAN-01 it shows the NTP server I'd like used (my subnet's HA gateway is the NTP server for local devices):
"server x.x.x.1 iburst"
On PSAN-02 it lists only the management IPs of nodes 1 and 3:
"server x.x.x.220 burst iburst
server x.x.x.224 burst iburst"
On PSAN-03 it lists the management IP of node 1 only:
"server x.x.x.220 burst iburst"
On PSAN-04 it lists the management IPs of nodes 1 through 3:
"server x.x.x.220 burst iburst
server x.x.x.224 burst iburst
server x.x.x.222 burst iburst"
On PSAN-05 it lists the management IPs of nodes 1 through 3:
"server x.x.x.220 burst iburst
server x.x.x.224 burst iburst
server x.x.x.222 burst iburst"
So it seems node 3 relied on node 1 only; node 1 got disconnected, so node 3 got messed up too. Node 2 relies on nodes 3 and 1, so it got messed up. And nodes 4 and 5 rely on nodes 1-3, which are all messed up, so 4 and 5 followed suit.
For now I edited all of the ntp.conf files to point to x.x.x.1 and am rebooting the nodes, keeping a minimum of 3 online. Will report back.
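For the record, a sketch of applying the same change without a full reboot (x.x.x.1 being the HA gateway used as the NTP source here); this is an assumed procedure, not a PetaSAN-documented step:

# /etc/ntp.conf on each node now contains a single upstream line:
#   server x.x.x.1 iburst

# Restart only the NTP daemon and watch the offset converge toward zero
systemctl restart ntp
watch -n 5 ntpq -p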

hak
23 Posts
July 17, 2023, 7:50 pm
Rebooting the nodes after setting a common ntp.conf value worked; I'm back at 15/15. Is editing /etc/ntp.conf OK, or should this be set upstream in a PetaSAN config file?

admin
2,967 Posts
July 17, 2023, 8:13 pm
The NTP service was not running on several nodes. I believe there was an issue with a base Ubuntu package update/upgrade that caused the NTP service to stop; maybe you hit this issue. I do not think it was related to the ntp.conf file; it is OK if you wish to have a common file. The way we do it is for only one node to sync with an NTP server and the other nodes to sync with the initial node. This is better if the NTP server is external/public, since the latency will differ if each node does external access, but for an internal NTP server it is OK to have a common ntp.conf file.
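To illustrate the layout described above with the addresses already shown in this thread (one node syncing externally, the rest chaining off the management nodes); this mirrors the as-installed files on nodes 4 and 5 rather than being an official template:

# Node 1: the only node talking to the upstream NTP source
# /etc/ntp.conf
#   server x.x.x.1 iburst

# Nodes 2-5: sync from the three management nodes instead of going external
# /etc/ntp.conf
#   server x.x.x.220 burst iburst
#   server x.x.x.224 burst iburst
#   server x.x.x.222 burst iburst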