Dashboard graphs not available

Hi,

I do not know if this is related to the other problem I reported, but the dashboard graphs/charts are no longer visible and the following message occurs:

 

503 Service Unavailable

Service Unavailable
The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.

Apache/2.4.18 (Ubuntu) Server at localhost Port 8080

It could be related, but I understood the ceph-mon daemon's memory usage recovered after you restarted it.

Can you try to identify the load on the active graphing server: %CPU and %MEM of the top processes, as well as apache2.

Also, how many connections:

netstat -ant | grep ESTABLISHED | wc -l

You can identify the active graph server by inspecting the HTML chart element source in the browser; it will list its IP address. Alternatively, SSH to the first 3 nodes and check which one has apache2 running.
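
A minimal sketch of that check, looping over the first three nodes (the hostnames ceph-node-mru-1/2/3 are an assumption based on the node names quoted in this thread):

# Sketch: report apache2 CPU/memory usage and the number of established
# TCP connections on each of the first three nodes.
for node in ceph-node-mru-1 ceph-node-mru-2 ceph-node-mru-3; do
  echo "== $node =="
  # apache2 processes with their %CPU and %MEM (empty if apache2 is not running here)
  ssh root@"$node" 'ps -C apache2 -o pid,%cpu,%mem,cmd --no-headers'
  # total established TCP connections
  ssh root@"$node" 'netstat -ant | grep ESTABLISHED | wc -l'
done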

root@ceph-node-mru-2:~# netstat -ant | grep ESTABLISHED | wc -l
5815

 

top - 06:46:18 up 18 days, 12:10, 1 user, load average: 1.75, 2.21, 2.29
Tasks: 849 total, 1 running, 848 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.8 us, 0.7 sy, 0.0 ni, 97.5 id, 0.8 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 65934252 total, 10714428 free, 44337332 used, 10882492 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 9101588 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12941 ceph 20 0 4294712 2.034g 9888 S 6.0 3.2 2803:08 ceph-osd
16748 ceph 20 0 3528976 1.279g 9440 S 3.6 2.0 60:10.53 ceph-osd
3333 ceph 20 0 4198228 1.941g 9244 S 2.6 3.1 706:03.83 ceph-osd
2376 root 20 0 0 0 0 S 2.0 0.0 0:49.06 kworker/8:2
5840 ceph 20 0 3996572 1.871g 9552 S 2.0 3.0 680:35.21 ceph-osd
7191 ceph 20 0 4199188 2.020g 12012 S 2.0 3.2 757:52.43 ceph-osd
8760 ceph 20 0 4293536 2.122g 9648 S 2.0 3.4 750:24.67 ceph-osd
12745 ceph 20 0 4039972 1.929g 9072 S 2.0 3.1 637:42.63 ceph-osd
15377 ceph 20 0 4026956 2.006g 9452 S 2.0 3.2 632:56.06 ceph-osd
3692 ceph 20 0 4124876 1.915g 9232 S 1.7 3.0 636:54.00 ceph-osd
8156 ceph 20 0 4046356 1.901g 9544 S 1.7 3.0 580:34.86 ceph-osd
9472 ceph 20 0 4176372 1.948g 11304 S 1.7 3.1 681:52.11 ceph-osd
11780 ceph 20 0 3982252 1.836g 9424 S 1.7 2.9 573:30.06 ceph-osd
16747 ceph 20 0 3377252 1.211g 9352 S 1.7 1.9 51:09.24 ceph-osd
14707 ceph 20 0 3980540 1.949g 11956 S 1.3 3.1 651:00.16 ceph-osd
16743 ceph 20 0 3527088 1.277g 11508 S 1.3 2.0 59:44.12 ceph-osd
16746 ceph 20 0 3568080 1.315g 10296 S 1.3 2.1 58:19.63 ceph-osd
3439 ceph 20 0 4027400 1.849g 8920 S 1.0 2.9 592:37.45 ceph-osd
7351 ceph 20 0 3922444 1.771g 9048 S 1.0 2.8 574:51.03 ceph-osd
7534 ceph 20 0 4134280 1.905g 8964 S 1.0 3.0 556:02.80 ceph-osd
12490 ceph 20 0 4207792 2.055g 9264 S 1.0 3.3 645:17.74 ceph-osd
13647 ceph 20 0 3974644 1.870g 9272 S 1.0 3.0 631:04.30 ceph-osd
46254 ceph 20 0 3775788 1.753g 9232 S 1.0 2.8 322:38.78 ceph-osd
57866 root 20 0 0 0 0 S 1.0 0.0 1:22.70 iscsi_trx
12436 ceph 20 0 1909876 133228 16596 S 0.7 0.2 7:03.66 ceph-mon
53885 root 20 0 0 0 0 S 0.7 0.0 1:23.38 iscsi_trx
64804 root 20 0 0 0 0 S 0.7 0.0 0:17.19 kworker/4:1
8 root 20 0 0 0 0 S 0.3 0.0 99:23.92 rcu_sched
49 root 20 0 0 0 0 S 0.3 0.0 108:45.92 ksoftirqd/8
179 root 39 19 0 0 0 S 0.3 0.0 13:38.30 khugepaged
1669 root 20 0 79876 21184 3388 S 0.3 0.0 1:41.25 deploy.py
2171 root 20 0 81252 29536 6832 S 0.3 0.0 191:17.72 consul
2539 root 20 0 0 0 0 S 0.3 0.0 0:12.12 kworker/12:0
2628 root 20 0 0 0 0 S 0.3 0.0 0:02.25 kworker/25:0

I was not at work for a week, so the dashboard issue may have started earlier. I did try what you suggested in an older thread:

systemctl stop carbon-cache

systemctl stop apache2

systemctl stop collectd

systemctl stop grafana-server

 

systemctl start carbon-cache

systemctl start apache2

systemctl start collectd

systemctl start grafana-server

Now that message is gone, but the chart is black with a spinning circle in the upper right corner, and after a while this happens:

504 Gateway Timeout

Gateway Timeout
The gateway did not receive a timely response
from the upstream server or application.

Apache/2.4.18 (Ubuntu) Server at localhost Port 8080

Any idea?

From /var/log/apache2/graphite-web_error.log

[Tue Sep 12 08:31:55.539692 2017] [wsgi:error] [pid 3835:tid 139840687392512] (11)Resource temporarily unavailable: [client 127.0.0.1:52012] mod_wsgi (pid=3835): Unable to connect to WSGI daemon process '_graphite' on '/var/run/apache2/wsgi.3827.0.1.sock'., referer: http://193.174.240.195:3000/dashboard-solo/db/ceph-overview?from=now-180m&to=now&panelId=2
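
For reference, this error means mod_wsgi could not reach the graphite WSGI daemon over its Unix socket. A quick check whether that socket file exists (a sketch; the socket name in the path above changes with every Apache restart):

# Sketch: does the WSGI daemon socket referenced in the log exist?
ls -l /var/run/apache2/wsgi.*.sock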

Hi,
Can you please check that the shared graph directory is mounted, and check its size:
mount | grep shared
du -sch /opt/petasan/config/shared/graphite2/whisper/

Any errors in the graphite carbon logs?
cat /var/log/carbon/console.log

Then stop Apache, clear its runtime files (including the stale WSGI socket), and start it again:
systemctl stop apache2
rm /var/run/apache2/*
systemctl start apache2

Any errors you get in its status:
systemctl status apache2
Anything in:
/var/log/apache2/error.log

Can you also get the connections per service (see the sketch after this list):
netstat -ant | grep ESTABLISHED | grep :3000 | wc -l
netstat -ant | grep ESTABLISHED | grep :8080 | wc -l
netstat -ant | grep ESTABLISHED | grep :6789 | wc -l
netstat -ant | grep ESTABLISHED | grep :680 | wc -l
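
The same per-service counts in one loop (a minimal sketch; the ports are the ones listed above, presumably Grafana on 3000, graphite-web behind Apache on 8080, ceph-mon on 6789, and the Ceph OSD port range matched by :680):

# Sketch: count ESTABLISHED connections per port pattern in one pass.
for port in :3000 :8080 :6789 :680; do
  echo "$port $(netstat -ant | grep ESTABLISHED | grep -c "$port")"
done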

 

root@ceph-node-mru-2:~# mount | grep shared
192.168.1.194:gfs-vol on /opt/petasan/config/shared type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

The du command takes forever and a df command is hanging...
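
One way to probe the share without hanging the shell indefinitely is to bound the call with timeout (a sketch; the 10-second limit is an arbitrary choice, and the check can still block if the process ends up in uninterruptible I/O):

# Sketch: probe the shared GlusterFS mount with a time limit.
timeout 10 stat /opt/petasan/config/shared || echo "shared mount not responding"
# Is the GlusterFS client process for the share still running?
# (its command line normally includes the mount point)
ps aux | grep '[g]lusterfs' | grep shared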

 

root@ceph-node-mru-2:~# netstat -ant | grep ESTABLISHED | grep :3000 | wc -l
1
root@ceph-node-mru-2:~# netstat -ant | grep ESTABLISHED | grep :8080 | wc -l
0
root@ceph-node-mru-2:~# netstat -ant | grep ESTABLISHED | grep :6789 | wc -l
37
root@ceph-node-mru-2:~# netstat -ant | grep ESTABLISHED | grep :680 | wc -l
937

Remount or reboot? 😉

The timestamps of the files, viewed from another server, are quite old. But as I mentioned, it might be that the problem occurred around that time:

root@ceph-node-mru-3:~# ls -l /opt/petasan/config/shared/graphite/whisper/PetaSAN/storage-node/ceph-ceph/osd-65/gauge/
total 432
-rw-r--r-- 1 _graphite _graphite 73576 Sep 2 14:17 apply_latency_ms.wsp
-rw-r--r-- 1 _graphite _graphite 73576 Sep 2 14:17 commit_latency_ms.wsp
-rw-r--r-- 1 _graphite _graphite 73576 Sep 2 14:17 kb_total.wsp
-rw-r--r-- 1 _graphite _graphite 73576 Sep 2 14:17 kb_used.wsp
-rw-r--r-- 1 _graphite _graphite 73576 Sep 2 14:17 num_snap_trimming.wsp
-rw-r--r-- 1 _graphite _graphite 73576 Sep 2 14:17 snap_trim_queue_len.wsp

Can you try to access the shared mount:

/opt/petasan/config/shared

If not, I suggest un-mounting it:

umount /opt/petasan/config/shared

If successful (try the command a few times if it is busy), it will remount automatically in about a minute; see the sketch below.
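
A small sketch of that retry and the wait for the automatic remount (the retry count and delays are arbitrary choices):

# Sketch: retry the unmount a few times in case the mount is busy.
for i in 1 2 3 4 5; do
  umount /opt/petasan/config/shared && break
  sleep 10
done
# Give the automatic remount about a minute, then verify it is back.
sleep 60
mount | grep shared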

 

 

umount seems to hang. Accessing from another host is possible (as I quoted above).

umount -l? (This helped with NFS in the past.)

OK, here is what I did:

systemctl stop carbon-cache

systemctl stop apache2

systemctl stop collectd

systemctl stop grafana-server

umount -l /opt/petasan/config/shared

Then I waited about 10 minutes; nothing happened. Then:

systemctl restart petasan-mount-sharedfs.service

systemctl start carbon-cache

systemctl start apache2

systemctl start collectd

systemctl start grafana-server

That did the trick. There is now a gap in the performance data, but that's OK.
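
For reference, the whole recovery sequence above as one script (a minimal sketch using the service names from this thread; the 10-minute wait is replaced by a check that the share is mounted again):

# Sketch: stop everything writing to the shared graphite directory,
# detach the stuck GlusterFS mount, remount it, then start the services.
systemctl stop carbon-cache apache2 collectd grafana-server
umount -l /opt/petasan/config/shared
systemctl restart petasan-mount-sharedfs.service
# Verify the share is back before starting the services again.
mount | grep shared
systemctl start carbon-cache apache2 collectd grafana-server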

Thanks for the guidance!

 
