
next steps on troubleshooting

pre-production cluster is set up (5x Dell R640s, 1G management, 2x 10G iSCSI (not bonded), and 2x 40G private network (bonded LACP), with 3x NVMe OSDs each)

  • I did individual disk benchmarking on each node from the PetaSAN console before cluster setup; all disks checked out within a few % of each other
  • created the cluster with a replica-3 scheme and RBD type, high-end hardware profile (192GB RAM and 2x 12-core 3+ GHz CPUs per node), all without errors
  • ran the in-GUI cluster load testing with 3 and 4 'storage nodes' to compare against my 3-node test cluster (single HHHL NVMe each); happy with the results
  • all seemed fine, no production data on this system yet
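(a quick baseline from any node's shell at this point would be something like "ceph -s", which with this layout should report all 15 OSDs up/in across the 5 hosts, and "ceph osd tree" to see the 3 NVMe OSDs under each node)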

this AM got an email:

 

Dear PetaSAN user;

 Cluster XXX-CLUS-01 has one or more osd failures, please check the following osd(s):

- osd.5/XXX-PSAN-01

- osd.4/XXX-PSAN-01

- osd.3/XXX-PSAN-01

Host Hardware check via DRAC (IPMI): server is all green, no hardware failures

40G switch fabric check: port-channels (5 of them, each with 2x 40G members/slaves) are all up.

Trying the gui on XXX-CLUS-01 results in:

504 Gateway Time-out / nginx/1.18.0 (Ubuntu)

From another node, the Nodes List shows XXX-CLUS-01 as up/green, but clicking on -01's disks gives an empty page, and clicking on its logs shows the following:

"...15/07/2023 08:28:31 WARNING connect() retry(1) Cannot connect to ceph cluster.

rados.TimedOut: [errno 110] RADOS timed out (error connecting to the cluster)

File "rados.pyx", line 680, in rados.Rados.connect

cluster.connect()

File "/usr/lib/python3/dist-packages/PetaSAN/core/ceph/ceph_connector.py", line 67, in do_connect

Traceback (most recent call last):

15/07/2023 08:28:31 ERROR [errno 110] RADOS timed out (error connecting to the cluster)

15/07/2023 08:28:31 ERROR do_connect() Cannot connect to ceph cluster.

15/07/2023 08:27:42 WARNING connect() retry(1) Cannot connect to ceph cluster.

rados.TimedOut: [errno 110] RADOS timed out (error connecting to the cluster)

File "rados.pyx", line 680, in rados.Rados.connect

cluster.connect()

File "/usr/lib/python3/dist-packages/PetaSAN/core/ceph/ceph_connector.py", line 67, in do_connect

Traceback (most recent call last):

15/07/2023 08:27:42 ERROR [errno 110] RADOS timed out (error connecting to the cluster)

15/07/2023 08:27:42 ERROR do_connect() Cannot connect to ceph cluster.

15/07/2023 08:21:44 WARNING connect() retry(1) Cannot connect to ceph cluster.

rados.TimedOut: [errno 110] RADOS timed out (error connecting to the cluster)

File "rados.pyx", line 680, in rados.Rados.connect

cluster.connect()

File "/usr/lib/python3/dist-packages/PetaSAN/core/ceph/ceph_connector.py", line 67, in do_connect

Traceback (most recent call last):

15/07/2023 08:21:44 ERROR [errno 110] RADOS timed out (error connecting to the cluster)

15/07/2023 08:21:44 ERROR do_connect() Cannot connect to ceph cluster.

[errno 110] RADOS timed out (error connecting to the cluster)

2023-07-15T08:21:14.081-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:21:11.081-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:21:08.082-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:21:05.082-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:21:02.078-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:20:59.078-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:20:56.079-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:20:53.075-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

2023-07-15T08:20:50.055-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

15/07/2023 08:21:14 ERROR 2023-07-15T08:20:47.055-0400 7f4c26932700 -1 monclient: get_monmap_and_config failed to get config

15/07/2023 08:16:00 WARNING connect() retry(1) Cannot connect to ceph cluster...."

 

SSH to -01 still works.
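since ssh still works, i can at least try the ceph CLI from -01's shell to see whether it can reach the monitors at all, e.g.:

ceph -s --connect-timeout 10                    # if this also times out, it matches the rados errors above
systemctl list-units 'ceph-mon@*' 'ceph-osd@*'  # are the ceph daemons on -01 even running?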

What should I check to try and find cause and remediation?

 

What should I check to try and find cause and remediation?

well, that is the $64M question, there is not always a direct answer.. that is why there are still IT jobs around 🙂  in your case, i would first check the backend network on node 1.
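for example, from a shell on node 1, something like this (adjust the bond/interface names to your setup):

cat /proc/net/bonding/bond0_40G        # per-slave MII status, speed and link failure counts
ip -s link show eth6                   # error/drop counters on each bond member
ip -s link show eth7
ping -c 3 <backend ip of another node>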

OK, so the VLAG (bond0 for node 1) actually was in a degraded state (I was looking at my other fabric when I checked yesterday); one of the 2 bond slave links was down.

looking at xxx-psan-01's console screen, it was filled with

[xxxxxxx.xxxxxxxxx] bond0_40G: (slave eth6): speed changed to 0 on port 1

(and repeat many, many times)

I came to the datacenter and swapped fiber cables: no change. I then swapped in a new QSFP in the Mellanox CX-5 card's port 1... that did it. Bond happy, both slaves up, no speed-change notices, and the green LED blink pattern on the Mellanox card is normal again.
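(for reference, a quick way to confirm both slaves are really back at 40G and the failure counters have stopped climbing:)

ethtool eth6 | grep -E 'Speed|Link detected'
ethtool eth7 | grep -E 'Speed|Link detected'
grep -E 'Slave Interface|MII Status|Speed|Link Failure' /proc/net/bonding/bond0_40G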

But xxx-psan-01 still wasn't happy, so I rebooted it from the PetaSAN console...

  1. no speed change notices post reboot, so that's good.
  2. bond0 still good on the switch (2 of 2 members up)
  3. but the dashboard still shows 12/15 OSDs
  4. Ceph health warning about xxx-psan-01 having "clock skew"
    1. maybe it was just the boot-up? because checking 'date' looks fine and ntpq -p shows it's using NTP correctly
  5. after a few more minutes of uptime, another warning message appeared (the "slow ops" warning asked about below)
  6. and still no gui at https://<ip-of-psan-01> (same nginx 504)
  7. tried another graceful reboot, still getting the symptoms of items 3-6

all networking looks good. what causes the 'slow ops' and the web gui not working?
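(in case it matters, here's roughly what i can still look at from one of the working nodes, where the ceph CLI still responds:)

ceph -s                 # overall health; currently shows 12/15 osds up
ceph health detail      # names the down OSDs and any slow-ops details
ceph osd tree           # should show osd.3/4/5 under xxx-psan-01 as down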

still no gui on node-01 and the warning count just keeps growing.

what causes that?

same: still no gui on node-01 and the warning count keeps climbing.
i can ping both the management and backend/private IPs, and i've compared the "ip a" output of the troublesome node 1 and a working node; i can see no difference:

not-working PSAN-01 vs working PSAN-02, per interface:

eth6
  PSAN-01: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0_40G state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1e:6a brd ff:ff:ff:ff:ff:ff
           altname enp216s0f0np0
  PSAN-02: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0_40G state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1d:0a brd ff:ff:ff:ff:ff:ff
           altname enp216s0f0np0

eth7
  PSAN-01: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0_40G state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1e:6a brd ff:ff:ff:ff:ff:ff
           altname enp216s0f1np1
  PSAN-02: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0_40G state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1d:0a brd ff:ff:ff:ff:ff:ff
           altname enp216s0f1np1

bond0_40G
  PSAN-01: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1e:6a brd ff:ff:ff:ff:ff:ff
           inet6 fe80::ac0:ebff:fef7:1e6a/64 scope link
             valid_lft forever preferred_lft forever
  PSAN-02: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1d:0a brd ff:ff:ff:ff:ff:ff
           inet6 fe80::ac0:ebff:fef7:1d0a/64 scope link
             valid_lft forever preferred_lft forever

bond0_40G.210@bond0_40G
  PSAN-01: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1e:6a brd ff:ff:ff:ff:ff:ff
           inet x.x.x.220/24 scope global bond0_40G.210
             valid_lft forever preferred_lft forever
           inet6 fe80::ac0:ebff:fef7:1e6a/64 scope link
             valid_lft forever preferred_lft forever
  PSAN-02: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
           link/ether 08:c0:eb:f7:1d:0a brd ff:ff:ff:ff:ff:ff
           inet x.x.x.222/24 scope global bond0_40G.210
             valid_lft forever preferred_lft forever
           inet6 fe80::ac0:ebff:fef7:1d0a/64 scope link
             valid_lft forever preferred_lft forever

what is keeping node-01 from communicating / bringing the OSDs back? in the nodes list, the Status for all 5 is UP/green, yet i'm still at 12/15 OSDs and there's still no https to node-01 (ssh still works).
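(for completeness, the extra things i can think to check from -01 itself, given the mtu 9000 backend and the alert naming osd 3/4/5:)

# jumbo-frame path test over the backend/210 vlan (9000 mtu minus 28 bytes of headers)
ping -M do -s 8972 -c 3 x.x.x.222      # from -01 (.220) to -02 (.222)

# are the OSD daemons on -01 actually running, and what are they logging?
systemctl status ceph-osd@3 ceph-osd@4 ceph-osd@5
journalctl -u ceph-osd@4 -n 100 --no-pager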


after the backend net issue, there could now be a clock sync issue; you should try to fix the clock sync/skew. check the times on the nodes and the time zones, and check the offset to node 1 by running ntpq -p from the other nodes. check whether you are using an external ntp server and whether the real-time clock on node 1 is good. maybe the network was fine after the reboot but the time is off..
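e.g. something along these lines on each node (exact commands depend a bit on your setup; run them on all 5 and compare):

date; timedatectl                 # same time zone and roughly the same wall-clock time everywhere?
ntpq -p                           # the 'offset' column is in milliseconds vs the ntp source
hwclock --show                    # hardware/real-time clock, worth checking on node 1 especially

# and from any node where the ceph CLI still responds, the monitors' own view of the skew:
ceph time-sync-status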

 

right, ntpq -p on PSAN-01 shows ok:

root@xxx-PSAN-01:~# ntpq -p
remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*x.x.x.1     208.91.112.62    3 u  430 1024    1    0.083   -0.082   0.042
LOCAL(0)        .LOCL.           7 l  442   64  100    0.000    0.000   0.000
root@xxx-PSAN-01:~#

oddly, on the (otherwise functional) nodes 2-5 it gives the same non-answer:

root@xxx-PSAN-02:~# ntpq -p
ntpq: read: Connection refused
root@xxx-PSAN-02:~#
root@xxx-PSAN-03:~# ntpq -p
ntpq: read: Connection refused
root@xxx-PSAN-03:~#
root@xxx-PSAN-04:~# ntpq -p
ntpq: read: Connection refused
root@xxx-PSAN-04:~#
root@xxx-PSAN-05:~# ntpq -p
ntpq: read: Connection refused
root@xxx-PSAN-05:~#
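(that "read: Connection refused" from ntpq usually means the local ntpd isn't running at all; something like this on 02-05 should confirm it, assuming the stock ubuntu ntp package and its 'ntp' service:)

systemctl status ntp
journalctl -u ntp -n 20 --no-pager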

cat /etc/ntp.conf (contents below are 'as installed'):

on PSAN-01 it shows the NTP server i'd like used (my subnet's HA gateway is the NTP source for local devices):

"server  x.x.x.1 iburst"

on PSAN-02 it lists only the management IPs of nodes 1 and 3:

"server  x.x.x.220 burst  iburst
server  x.x.x.224 burst  iburst"

on PSAN-03: it lists the management IP of node 1 only:

"server  x.x.x.220 burst  iburst"

on PSAN-04: it lists the management IPs of nodes 1 through 3:

"server  x.x.x.220 burst  iburst
server  x.x.x.224 burst  iburst
server  x.x.x.222 burst  iburst"

on PSAN-05: it lists the management IPs of nodes 1 through 3:

"server  x.x.x.220 burst  iburst
server  x.x.x.224 burst  iburst
server  x.x.x.222 burst  iburst"

so it seems node 3 relied on node 1 only; node 1 got disconnected, so node 3 got messed up too. node 2 relies on nodes 1 and 3, so it got messed up. and nodes 4 and 5 rely on nodes 1-3, which were all messed up, so 4 and 5 followed suit.

for now i edited all of the ntp.conf files to point to x.x.x.1... rebooting them while keeping a minimum of 3 nodes online. will report back.
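(for the record, the per-node change is just the one line below; in hindsight a restart of the ntp service would probably have been enough instead of a full reboot)

# /etc/ntp.conf on every node now contains a single server line:
server  x.x.x.1 iburst

# then per node:
systemctl restart ntp
ntpq -p      # x.x.x.1 should eventually get the '*' (selected source) marker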

 

rebooting the nodes after setting a common ntp.conf value worked. i'm back at 15/15. is editing /etc/ntp.conf ok, or should this be set upstream in a petasan config file?

The ntp service was not running on several nodes. i believe there was an issue with a base ubuntu package update/upgrade that caused the ntp service to stop; maybe you hit this issue. i do not think it was related to the ntp.conf file, and it is ok if you wish to have a common file. the way we do it is for only 1 node to sync with an ntp server and the other nodes to sync with that initial node... this is better if the ntp server is external/public, since the latency will differ if each node does the external access. but for an internal ntp server it is ok to have a common ntp.conf file.
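roughly, that tiered layout would look like this, using the IPs from this thread as an example (x.x.x.1 being your gateway ntp and x.x.x.220 node 1's management IP):

# node 1, /etc/ntp.conf - the only node that talks to the external/gateway ntp source
server  x.x.x.1 iburst

# nodes 2-5, /etc/ntp.conf - sync to node 1 instead
server  x.x.x.220 iburst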