
next steps on troubleshooting

pre-production cluster is set up (5x Dell R640s, 1G management, 2x 10G iSCSI (not bonded), and 2x 40G private network (bonded LACP), with 3x NVMe OSDs each)

  • I did individual disk benchmarking on each node from the PetaSAN console before cluster setup; all disks checked out within a few % of each other
  • created the cluster with a replica-3 scheme and RBD type, high-end hardware profile (192GB RAM and 2x 12-core 3+ GHz CPUs per node), all without errors
  • ran the in-GUI cluster load testing with 3 and 4 'storage nodes' to compare against my 3-node test cluster (single HHHL NVMe each); happy with the results
  • all seemed fine, no production data on this system yet
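(a quick baseline from any node's shell at this point would be something like "ceph -s", which with this layout should report all 15 OSDs up/in across the 5 hosts, and "ceph osd tree" to see the 3 NVMe OSDs under each node)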

this AM got an email:

 

Dear PetaSAN user;

 Cluster XXX-CLUS-01 has one or more osd failures, please check the following osd(s):

- osd.5/XXX-PSAN-01

- osd.4/XXX-PSAN-01

- osd.3/XXX-PSAN-01

Host Hardware check via DRAC (IPMI): server is all green, no hardware failures

40G switch fabric check: port-channels (5 of them, each with 2x 40G members/slaves) are all up.

Trying the gui on XXX-CLUS-01 results in:

504 Gateway Time-out / nginx/1.18.0 (Ubuntu)

From another node, the Nodes List shows XXX-CLUS-01 as up/green, but clicking on -01's disks gives an empty page, and clicking on its logs shows the following:

"...15/07/2023 08:28:31 WARNING connect() retry(1) Cannot connect to ceph cluster.

rados.TimedOut: [errno 110] RADOS timed out (error connecting to the cluster)

File "rados.pyx", line 680, in rados.Rados.connect

cluster.connect()

File "/usr/lib/python3/dist-packages/PetaSAN/core/ceph/ceph_connector.py", line 67, in do_connect

Traceback (most recent call last):

15/07/2023 08:28:31 ERROR [errno 110] RADOS timed out (error connecting to the cluster)

15/07/2023 08:28:31 ERROR do_connect() Cannot connect to ceph cluster.

15/07/2023 08:27:42 WARNING connect() retry(1) Cannot connect to ceph cluster.

rados.TimedOut: [errno 110] RADOS timed out (error connecting to the cluster)

File "rados.pyx", line 680, in rados.Rados.connect

cluster.connect()

File "/usr/lib/python3/dist-packages/PetaSAN/core/ceph/ceph_connector.py", line 67, in do_connect

Traceback (most recent call last):

15/07/2023 08:27:42 ERROR [errno 110] RADOS timed out (error connecting to the cluster)

15/07/2023 08:27:42 ERROR do_connect() Cannot connect to ceph cluster.

15/07/2023 08:21:44 WARNING connect() retry(1) Cannot connect to ceph cluster.

rados.TimedOut: [errno 110] RADOS timed out (error connecting to the cluster)

File "rados.pyx", line 680, in rados.Rados.connect

cluster.connect()

File "/usr/lib/python3/dist-packages/PetaSAN/core/ceph/ceph_connector.py", line 67, in do_connect

Traceback (most recent call last):

15/07/2023 08:21:44 ERROR [errno 110] RADOS timed out (error connecting to the cluster)

15/07/2023 08:21:44 ERROR do_connect() Cannot connect to ceph cluster.

[errno 110] RADOS timed out (error connecting to the cluster)

2023-07-15T08:21:14.081-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:21:11.081-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:21:08.082-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:21:05.082-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:21:02.078-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:20:59.078-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:20:56.079-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:20:53.075-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:20:50.055-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

15/07/2023 08:21:14 ERROR 2023-07-15T08:20:47.055-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

15/07/2023 08:16:00 WARNING connect() retry(1) Cannot connect to ceph cluster...."

 

SSH to -01 still works.
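since ssh still works, i can at least try the ceph CLI from -01's shell to see whether it can reach the monitors at all, e.g.:

ceph -s --connect-timeout 10                    # if this also times out, it matches the rados errors above
systemctl list-units 'ceph-mon@*' 'ceph-osd@*'  # are the ceph daemons on -01 even running?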

What should I check to try and find cause and remediation?

 

What should I check to try and find cause and remediation?

well, that is the $64M question, there is not always a direct answer.. that is why there are still IT jobs around 🙂  in your case, i would first check the backend network on node 1.
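for example, from a shell on node 1, something like this (adjust the bond/interface names to your setup):

cat /proc/net/bonding/bond0_40G        # per-slave MII status, speed and link failure counts
ip -s link show eth6                   # error/drop counters on each bond member
ip -s link show eth7
ping -c 3 <backend ip of another node>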

OK, so the VLAG (bond0 for node 1) actually was in a degraded state (I was looking at my other fabric when I checked yesterday); one of the 2 bond slave links was down.

looking at xxx-psan-01's console screen, it was filled with

[xxxxxxx.xxxxxxxxx] bond0_40G: (slave eth6): speed changed to 0 on port 1

(and repeat many, many times)

I came to the datacenter and swapped fiber cables: no change. I then swapped in a new QSFP in the Mellanox CX-5 card's port 1... that did it. Bond happy, both slaves up, no speed-change notices, and the green LED blink pattern on the Mellanox card is normal again.
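(for reference, a quick way to confirm both slaves are really back at 40G and the failure counters have stopped climbing:)

ethtool eth6 | grep -E 'Speed|Link detected'
ethtool eth7 | grep -E 'Speed|Link detected'
grep -E 'Slave Interface|MII Status|Speed|Link Failure' /proc/net/bonding/bond0_40G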

But xxx-psan-01 still wasn't happy, so I rebooted it from the PetaSAN console...

  1. no speed change notices post reboot, so that's good.
  2. bond0 still good on the switch (2 of 2 members up)
  3. but the dashboard still shows 12/15 OSDs
  4. Ceph health warning about xxx-psan-01 having "clock skew"
    1. maybe it was just the boot-up? because checking 'date' looks fine and ntpq -p shows it's using NTP correctly
  5. after a few more minutes of uptime, another warning message appeared (the "slow ops" warning asked about below)
  6. and still no gui at https://<ip-of-psan-01> (same nginx 504)
  7. tried another graceful reboot, still getting the symptoms of items 3-6

all networking looks good. what causes the 'slow ops' and the web gui not working?
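(in case it matters, here's roughly what i can still look at from one of the working nodes, where the ceph CLI still responds:)

ceph -s                 # overall health; currently shows 12/15 osds up
ceph health detail      # names the down OSDs and any slow-ops details
ceph osd tree           # should show osd.3/4/5 under xxx-psan-01 as down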

still no gui on node-01 and the warning count just keeps growing.

what causes that?

same: still no gui on node-01 and the warning count keeps climbing.
i can ping both the management and backend/private IPs, and i've compared the "ip a" output of the troublesome node 1 and a working node; i can see no difference:

not-working PSAN-01 vs working PSAN-02, per interface:

eth6
  PSAN-01: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0_40G state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1e:6a brd ff:ff:ff:ff:ff:ff
           altname enp216s0f0np0
  PSAN-02: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0_40G state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1d:0a brd ff:ff:ff:ff:ff:ff
           altname enp216s0f0np0

eth7
  PSAN-01: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0_40G state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1e:6a brd ff:ff:ff:ff:ff:ff
           altname enp216s0f1np1
  PSAN-02: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0_40G state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1d:0a brd ff:ff:ff:ff:ff:ff
           altname enp216s0f1np1

bond0_40G
  PSAN-01: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1e:6a brd ff:ff:ff:ff:ff:ff
           inet6 fe80::ac0:ebff:fef7:1e6a/64 scope link
             valid_lft forever preferred_lft forever
  PSAN-02: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1d:0a brd ff:ff:ff:ff:ff:ff
           inet6 fe80::ac0:ebff:fef7:1d0a/64 scope link
             valid_lft forever preferred_lft forever

bond0_40G.210@bond0_40G
  PSAN-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1e:6a brd ff:ff:ff:ff:ff:ff
           inet x.x.x.220/24 scope global bond0_40G.210
             valid_lft forever preferred_lft forever
           inet6 fe80::ac0:ebff:fef7:1e6a/64 scope link
             valid_lft forever preferred_lft forever
  PSAN-02: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1d:0a brd ff:ff:ff:ff:ff:ff
           inet x.x.x.222/24 scope global bond0_40G.210
             valid_lft forever preferred_lft forever
           inet6 fe80::ac0:ebff:fef7:1d0a/64 scope link
             valid_lft forever preferred_lft forever

what is keeping node-01 from communicating / bringing the OSDs back? in the nodes list, the Status for all 5 is UP/green, yet i'm still at 12/15 OSDs and there's still no https to node-01 (ssh still works).
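(for completeness, the extra things i can think to check from -01 itself, given the mtu 9000 backend and the alert naming osd 3/4/5:)

# jumbo-frame path test over the backend/210 vlan (9000 mtu minus 28 bytes of headers)
ping -M do -s 8972 -c 3 x.x.x.222      # from -01 (.220) to -02 (.222)

# are the OSD daemons on -01 actually running, and what are they logging?
systemctl status ceph-osd@3 ceph-osd@4 ceph-osd@5
journalctl -u ceph-osd@4 -n 100 --no-pager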


after the backend net issue, there could now be a clock sync issue; you should try to fix the clock sync/skew. check the times on the nodes and the time zones, and check the offset to node 1 by running ntpq -p from the other nodes. check whether you are using an external ntp server and whether the real-time clock on node 1 is good. maybe the network was fine after the reboot but the time is off..
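e.g. something along these lines on each node (exact commands depend a bit on your setup; run them on all 5 and compare):

date; timedatectl                 # same time zone and roughly the same wall-clock time everywhere?
ntpq -p                           # the 'offset' column is in milliseconds vs the ntp source
hwclock --show                    # hardware/real-time clock, worth checking on node 1 especially

# and from any node where the ceph CLI still responds, the monitors' own view of the skew:
ceph time-sync-status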

 

right, ntpq -p on PSAN-01 shows ok:

root@xxx-PSAN-01:~# ntpq -p
remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*x.x.x.1     208.91.112.62    3 u  430 1024    1    0.083   -0.082   0.042
LOCAL(0)        .LOCL.           7 l  442   64  100    0.000    0.000   0.000
root@xxx-PSAN-01:~#

oddly, on the (otherwise functional) nodes 2-5 it gives the same non-answer:

root@xxx-PSAN-02:~# ntpq -p
ntpq: read: Connection refused
root@xxx-PSAN-02:~#
root@xxx-PSAN-03:~# ntpq -p
ntpq: read: Connection refused
root@xxx-PSAN-03:~#
root@xxx-PSAN-04:~# ntpq -p
ntpq: read: Connection refused
root@xxx-PSAN-04:~#
root@xxx-PSAN-05:~# ntpq -p
ntpq: read: Connection refused
root@xxx-PSAN-05:~#
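(that "read: Connection refused" from ntpq usually means the local ntpd isn't running at all; something like this on 02-05 should confirm it, assuming the stock ubuntu ntp package and its 'ntp' service:)

systemctl status ntp
journalctl -u ntp -n 20 --no-pager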

cat /etc/ntp.conf (contents below are 'as installed'):

on PSAN-01 it shows the NTP server i'd like used (my subnet's HA gateway is the NTP source for local devices):

"server  x.x.x.1 iburst"

on PSAN-02 it lists only the management IPs of nodes 1 and 3:

"server  x.x.x.220 burst  iburst
server  x.x.x.224 burst  iburst"

on PSAN-03: it lists the management IP of node 1 only:

"server  x.x.x.220 burst  iburst"

on PSAN-04: it lists the management IPs of nodes 1 through 3:

"server  x.x.x.220 burst  iburst
server  x.x.x.224 burst  iburst
server  x.x.x.222 burst  iburst"

on PSAN-05: it lists the management IPs of nodes 1 through 3:

"server  x.x.x.220 burst  iburst
server  x.x.x.224 burst  iburst
server  x.x.x.222 burst  iburst"

so it seems node 3 relied on node 1 only; node 1 got disconnected, so node 3 got messed up too. node 2 relies on nodes 1 and 3, so it got messed up. and nodes 4 and 5 rely on nodes 1-3, which were all messed up, so 4 and 5 followed suit.

for now i edited all of the ntp.conf files to point to x.x.x.1... rebooting them while keeping a minimum of 3 nodes online. will report back.
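(for the record, the per-node change is just the one line below; in hindsight a restart of the ntp service would probably have been enough instead of a full reboot)

# /etc/ntp.conf on every node now contains a single server line:
server  x.x.x.1 iburst

# then per node:
systemctl restart ntp
ntpq -p      # x.x.x.1 should eventually get the '*' (selected source) marker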

 

rebooting the nodes after setting a common ntp.conf value worked. i'm back at 15/15. is editing /etc/ntp.conf ok, or should this be set upstream in a petasan config file?

The ntp service was not running on several nodes. i believe there was an issue with a base ubuntu package update/upgrade that caused the ntp service to stop; maybe you hit this issue. i do not think it was related to the ntp.conf file, and it is ok if you wish to have a common file. the way we do it is for only 1 node to sync with an ntp server and the other nodes to sync with that initial node... this is better if the ntp server is external/public, since the latency will differ if each node does the external access. but for an internal ntp server it is ok to have a common ntp.conf file.
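roughly, that tiered layout would look like this, using the IPs from this thread as an example (x.x.x.1 being your gateway ntp and x.x.x.220 node 1's management IP):

# node 1, /etc/ntp.conf - the only node that talks to the external/gateway ntp source
server  x.x.x.1 iburst

# nodes 2-5, /etc/ntp.conf - sync to node 1 instead
server  x.x.x.220 iburst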