[PetaSAN 1.4.0] NTP is not working

I installed PetaSAN 1.4.0 and the clocks of the different nodes were far apart. (At the moment I also have the problem that my cluster isn't initializing.)

I realized the following:

  • "ntpq -p" produced the error message "name or service not known". The issue was fixed by adding the following line to /etc/hosts (which was empty after setup):
    "127.0.0.1 hostname.domain.tld hostname localhost"

I additionally changed ntp.conf to use my organization's NTP servers and manually synced the date once with ntpdate.
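
For anyone hitting the same error: ntpq talks to "localhost" by default, so an empty /etc/hosts is a plausible reason for the failed lookup. The snippet below is only a rough sketch for checking whether a hosts-format file contains a given name; the function name hosts_has_name is made up for this example and is not part of PetaSAN.

```shell
#!/bin/sh
# Hypothetical helper (not part of PetaSAN): succeed if a hosts-format file
# has an entry for the given name.
#   $1 = name to look for
#   $2 = hosts file to inspect (defaults to /etc/hosts)
hosts_has_name() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v n="$name" '
        { sub(/#.*/, "") }    # drop comments before looking at the fields
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$file"
}
```

Example use on a node: hosts_has_name localhost || echo "localhost is missing from /etc/hosts"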

 

NTP should work in v1.4 after either a fresh install or an upgrade; we tested this several times. Node 1 acts as the main NTP server for the cluster, followed by node 2, then node 3. If you define an external NTP server via the Cluster Settings page, node 1 uses it to adjust its own time and relays that time to the other nodes.

Did you get any other errors apart from this?

I just wanted to point out a possible bug. I'll possibly check it again if the scenario is set up as you described. I opened another topic with more details about my setup and my problem to build a cluster.

I do see this in my logs; my cluster was built last week (~11/15). There is nothing on node 01, so I guess this is by design? My only concern is the "Permission denied".

 

 

Nov 21 14:36:43 ps-node-02 ntpd[1509]: frequency file /var/lib/ntp/ntp.drift.TEMP: Permission denied
Nov 21 14:36:43 ps-node-02 ntpd[1509]: 21 Nov 14:36:43 ntpd[1509]: frequency file /var/lib/ntp/ntp.drift.TEMP: Permission denied

Nov 21 13:35:40 ps-node-03 ntpd[1473]: frequency file /var/lib/ntp/ntp.drift.TEMP: Permission denied
Nov 21 13:35:40 ps-node-03 ntpd[1473]: 21 Nov 13:35:40 ntpd[1473]: frequency file /var/lib/ntp/ntp.drift.TEMP: Permission denied
Nov 21 14:35:41 ps-node-03 ntpd[1473]: frequency file /var/lib/ntp/ntp.drift.TEMP: Permission denied
Nov 21 14:35:41 ps-node-03 ntpd[1473]: 21 Nov 14:35:41 ntpd[1473]: frequency file /var/lib/ntp/ntp.drift.TEMP: Permission denied

 

The first node seems OK with ntpq -p:

root@ps-node-01:/var/log# ntpq -p
 remote          refid           st t when poll reach   delay   offset  jitter
==============================================================================
*10.10.1.1       10.10.0.96       3 u   79  128   377   0.391   -0.197   0.130
 LOCAL(0)        .LOCL.           7 l  43m   64     0   0.000    0.000   0.000

root@ps-node-02:/var/log# ntpq -p
 remote          refid           st t when poll reach   delay   offset  jitter
==============================================================================
*ps-node-01      10.10.1.1        5 u  359  512   377   0.178    3.341   0.017
 LOCAL(0)        .LOCL.           9 l  96m   64     0   0.000    0.000   0.000

root@ps-node-03:/var/log# ntpq -p
 remote          refid           st t when poll reach   delay   offset  jitter
==============================================================================
*ps-node-01      10.10.1.1        5 u  552 1024   377   0.147  -12.752   0.029
+ps-node-02      10.10.0.111      6 u  547 1024   277   0.141  -16.502   0.195
 LOCAL(0)        .LOCL.          11 l   7h   64     0   0.000    0.000   0.000
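
For reading these listings: the peer marked with "*" is the one ntpd currently syncs to, and the offset column is in milliseconds. A small sketch that pulls out just that peer and its offset is below; the function name selected_peer is invented for this example, and it assumes the usual 10-column "ntpq -p" layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `ntpq -p` output on stdin and print
# "<peer> <offset_ms>" for the peer ntpd has selected (tally code "*").
# Exits non-zero if no peer is selected.
selected_peer() {
    awk '
        NF == 10 && substr($1, 1, 1) == "*" {
            print substr($1, 2), $9    # peer name without the "*", then offset
            found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Example use on a node: ntpq -p | selected_peer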

 

Thanks for reporting this. I am trying to replicate it; so far I do not see these log entries, but it may take time for them to appear, so we will add this to our test cases. The good thing is that syncing is working, as you mentioned.

It would be helpful if you could supply the output of:

ps aux | grep ntp

systemctl status ntp

I am seeing this EXACT behavior on a fresh 2.0 install. We have two separate clusters in 2 physical locations; one is perfect, the other is showing this issue with the "frequency file /var/lib/ntp/ntp.drift.TEMP: Permission denied" message.

The servers are now all about 8 minutes behind real time and falling further behind.  I don't want to do anything rash in case it will kill the array, but this is DEFINITELY an issue.

Below is ps -ef | grep ntp for all 3 nodes. #2 looks weird, but "systemctl status ntp" is IDENTICAL on all 3.

Node 1:

root 1881286 1 0 10:55 ? 00:00:00 /usr/sbin/ntpd -n -g

Node 2:

957535 ? 00:00:00 ntpd

Node 3:

root 1697210 1 0 10:47 ? 00:00:00 /usr/sbin/ntpd -n -g

 

I do not know why this happens, but the following may help.

On all nodes:
Check the ownership of /var/lib/ntp. It should be owned by ntp; if not, set it:
chown -R ntp:ntp /var/lib/ntp
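
A quick way to look at the current ownership before changing anything, as a sketch only: the function names owner_of and ntp_dir_ok are invented here, and the -c option assumes GNU coreutils (or busybox) stat.

```shell
#!/bin/sh
# Hypothetical helpers for checking who owns the ntp state directory.

# Print "user:group" for the given path.
owner_of() {
    stat -c '%U:%G' "$1"
}

# Succeed if the path (default /var/lib/ntp) is owned by ntp:ntp.
ntp_dir_ok() {
    [ "$(owner_of "${1:-/var/lib/ntp}")" = "ntp:ntp" ]
}
```

Example use on a node: ntp_dir_ok || echo "/var/lib/ntp is owned by $(owner_of /var/lib/ntp)"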

Disable NTP autostart (the service is started from within the PetaSAN scripts):
update-rc.d ntp disable
systemctl disable ntp
systemctl disable systemd-timesyncd
systemctl restart ntp

After approx. 30 min, start monitoring the nodes via
ntpq -p
The time offset between nodes should start to decrease.

If you feel the offset is not decreasing, you can force the other nodes to sync to the first node. On every node except the first:
systemctl stop ntp
ntpdate ip_address_of_first_node
hwclock --systohc --utc
systemctl start ntp
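
The four commands above could be wrapped roughly as follows. This is a sketch only: the names run and force_sync and the DRY_RUN variable are invented for this example, and 192.0.2.10 in the usage line is a placeholder address, not a real node.

```shell
#!/bin/sh
# Hypothetical wrapper around the forced resync steps.
# Run it on every node EXCEPT the first one, as root.
# Set DRY_RUN=1 to print the commands instead of executing them.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Usage: force_sync <ip_address_of_first_node>
force_sync() {
    first_node=$1
    if [ -z "$first_node" ]; then
        echo "usage: force_sync <ip_address_of_first_node>" >&2
        return 2
    fi
    # ntpd must be stopped first so that ntpdate can use the NTP port.
    run systemctl stop ntp &&
        run ntpdate "$first_node" &&
        run hwclock --systohc --utc
    status=$?
    run systemctl start ntp    # bring ntpd back up even if a step failed
    return $status
}
```

Example dry run: DRY_RUN=1 force_sync 192.0.2.10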

 

Permissions are fine. All 3 servers are in sync with each other time-wise, but they are ALL a few minutes behind real time... The array is working, but it thinks it's about 10 minutes ago, which makes all the monitoring screwy.

OK, I understand better now. The internal syncing is working, but all nodes are off by several minutes. To fix this you need to define an external NTP time server in the Cluster Settings page. If you already have, then there is an issue connecting to it: make sure the external NTP server can be pinged; you may need to route your management network for external access. You can monitor the time offset against your external NTP server via

ntpq -p

If this does not show your external NTP server, or does not show the offset decreasing, try another external NTP server.
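
One thing worth watching in that output is the reach column: it is an octal record of the last 8 polls, so 377 means all of them got an answer and 0 means none did. As a rough sketch, the function below lists peers that have never answered, ignoring the LOCAL(0) fallback clock. The name unreachable_peers is invented for this example, and it only strips the "*", "+", "#" and "-" tally codes from peer names.

```shell
#!/bin/sh
# Hypothetical helper: read `ntpq -p` output on stdin and print every peer
# whose reach value is 0, skipping the LOCAL(0) fallback clock.
unreachable_peers() {
    awk '
        NF == 10 && $7 == "0" {
            name = $1
            sub(/^[*+#-]/, "", name)    # strip the tally code, if present
            if (name !~ /^LOCAL\(/) print name
        }
    '
}
```

Example use on node 1: ntpq -p | unreachable_peers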

Without an external NTP server, the nodes will stay in sync with each other but could deviate from real time by 1 or 2 sec per day if they have a low-grade hardware clock (20 ppm accuracy).

As for the second node showing different ps output, make sure the ntp service is started via systemctl and not via the SysV init system:
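
The arithmetic behind that figure: 20 ppm means 20 microseconds of error per second, and a day has 86400 seconds, so 20 × 86400 / 1,000,000 ≈ 1.7 seconds per day. A tiny sketch for trying other values (the function name drift_per_day is invented for this example):

```shell
#!/bin/sh
# Hypothetical helper: seconds of drift per day for a clock that is off
# by the given number of parts per million.
drift_per_day() {
    awk -v ppm="$1" 'BEGIN { printf "%.3f\n", ppm * 86400 / 1000000 }'
}
```

Example: drift_per_day 20 prints 1.728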

update-rc.d ntp disable
systemctl restart ntp

If the permission error exists on just the second node, this may be the cause. The installer should have run the disable command during installation, but maybe it failed or something overwrote it.

Thank you!  There were two problems and now they both seem fixed!  The outside ntp server was indeed unreachable, and the node showing the odd ps output was cured by following the two commands you gave.  After about 30 minutes all is in sync and keeping up.  Also, I saw 2.1 is out, can't wait to test it!  I appreciate the amazing work you've done with this!
