
Red exclamation mark in graphs

I added three new nodes to my cluster yesterday, and now the graphs are not working. I get a red exclamation mark in the upper left corner.  Clicking on it shows an inspector request/response/error window, with a request going to /api/datasources/proxy/1/render.  The response and error appear to be blank.

I am seeing some errors in /var/log/apache2/graphite-web_error.log which I think may be related:

[Mon Nov 27 17:23:17.158847 2017] [wsgi:error] [pid 3613:tid 139789630142336] mod_wsgi (pid=3613): Target WSGI script '/usr/share/graphite-web/graphite.wsgi' cannot be loaded as Python module.

[Mon Nov 27 17:23:17.158882 2017] [wsgi:error] [pid 3613:tid 139789630142336] mod_wsgi (pid=3613): Exception occurred processing WSGI script '/usr/share/graphite-web/graphite.wsgi'.

[Mon Nov 27 17:23:17.158913 2017] [wsgi:error] [pid 3613:tid 139789630142336] Traceback (most recent call last):

[Mon Nov 27 17:23:17.158940 2017] [wsgi:error] [pid 3613:tid 139789630142336]   File "/usr/share/graphite-web/graphite.wsgi", line 18, in <module>

[Mon Nov 27 17:23:17.159020 2017] [wsgi:error] [pid 3613:tid 139789630142336]     import graphite.metrics.search

[Mon Nov 27 17:23:17.159036 2017] [wsgi:error] [pid 3613:tid 139789630142336]   File "/usr/lib/python2.7/dist-packages/graphite/metrics/search.py", line 6, in <module>

[Mon Nov 27 17:23:17.159124 2017] [wsgi:error] [pid 3614:tid 139789630142336] mod_wsgi (pid=3614): Target WSGI script '/usr/share/graphite-web/graphite.wsgi' cannot be loaded as Python module.

[Mon Nov 27 17:23:17.159163 2017] [wsgi:error] [pid 3614:tid 139789630142336] mod_wsgi (pid=3614): Exception occurred processing WSGI script '/usr/share/graphite-web/graphite.wsgi'.

[Mon Nov 27 17:23:17.159195 2017] [wsgi:error] [pid 3614:tid 139789630142336] Traceback (most recent call last):

[Mon Nov 27 17:23:17.159221 2017] [wsgi:error] [pid 3614:tid 139789630142336]   File "/usr/share/graphite-web/graphite.wsgi", line 18, in <module>

[Mon Nov 27 17:23:17.159283 2017] [wsgi:error] [pid 3611:tid 139789630142336] mod_wsgi (pid=3611): Target WSGI script '/usr/share/graphite-web/graphite.wsgi' cannot be loaded as Python module.

[Mon Nov 27 17:23:17.159304 2017] [wsgi:error] [pid 3611:tid 139789630142336] mod_wsgi (pid=3611): Exception occurred processing WSGI script '/usr/share/graphite-web/graphite.wsgi'.

[Mon Nov 27 17:23:17.159306 2017] [wsgi:error] [pid 3614:tid 139789630142336]     import graphite.metrics.search

Any ideas on where to look? I've tried sequentially rebooting my 3 monitor nodes, but it didn't help.

Thanks.

Adding nodes should not do this.

First, I would try rebooting the management node that is currently acting as the stats server, in case it has a configuration issue. I know you already rebooted, but here I want to reboot only that one server, to make sure the stats service fails over to another node. To find out which of the 3 management servers is acting as the current stats server, run the following on each of them:

systemctl status carbon-cache

The node currently acting as the stats server will show active (running) in green; the other 2 management nodes should not have this service running. Reboot this server, double check that the service has moved to one of the other nodes, and then check whether the stats are shown again.
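
If you want to check all three management nodes from one place, a small sketch like the one below would do it. This is not a PetaSAN tool; it assumes passwordless SSH between the nodes, and the hostnames are placeholders you would replace with your own:

#!/usr/bin/python
# Sketch: report which management node has carbon-cache active.
# Assumes passwordless SSH; the hostnames below are placeholders.
import subprocess

NODES = ['ps-node-01', 'ps-node-02', 'ps-node-03']

for node in NODES:
    # 'systemctl is-active --quiet' exits 0 only when the unit is active
    rc = subprocess.call(['ssh', node, 'systemctl', 'is-active', '--quiet', 'carbon-cache'])
    print('%s: %s' % (node, 'active' if rc == 0 else 'inactive'))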

Second: if this does not work, check the charts one by one, both the cluster stats charts and the node stats charts (for any single node): are none of them displayed, or do some charts still show up?

Last: if it is not a configuration issue, it could be an issue with the stats data itself. Try backing up the data and creating a fresh data directory on the currently running stats server:

/opt/petasan/scripts/stats-stop.sh

mv /opt/petasan/config/shared/graphite /opt/petasan/config/shared/graphite-backup

/opt/petasan/scripts/stats-setup.sh

/opt/petasan/scripts/stats-start.sh
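
If you prefer to run these steps in one go, here is a small sketch (not a PetaSAN script) that simply wraps the commands above and adds a timestamp to the backup directory name:

#!/usr/bin/python
# Sketch: reset the stats data using the PetaSAN scripts listed above.
# Stops the stats service, moves the shared whisper data aside under a
# timestamped backup name, then re-creates and restarts the stats setup.
import subprocess
import time

backup = '/opt/petasan/config/shared/graphite-backup-%s' % time.strftime('%Y%m%d-%H%M%S')

subprocess.check_call(['/opt/petasan/scripts/stats-stop.sh'])
subprocess.check_call(['mv', '/opt/petasan/config/shared/graphite', backup])
subprocess.check_call(['/opt/petasan/scripts/stats-setup.sh'])
subprocess.check_call(['/opt/petasan/scripts/stats-start.sh'])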

Thanks for the help. I tried your suggestions, but I'm getting the same results.  I am seeing the following in graphite-web_error.log:

[Tue Nov 28 16:01:49.429100 2017] [wsgi:error] [pid 171101:tid 139770963928960]   File "/usr/lib/python2.7/dist-packages/django/db/backends/sqlite3/base.py", line 318, in execute

[Tue Nov 28 16:01:49.429106 2017] [wsgi:error] [pid 171099:tid 139770963928960]   File "/usr/lib/python2.7/dist-packages/django/db/backends/sqlite3/base.py", line 318, in execute

[Tue Nov 28 16:01:49.429638 2017] [wsgi:error] [pid 171101:tid 139770963928960]     return Database.Cursor.execute(self, query, params)

[Tue Nov 28 16:01:49.429640 2017] [wsgi:error] [pid 171099:tid 139770963928960]     return Database.Cursor.execute(self, query, params)

[Tue Nov 28 16:01:49.429641 2017] [wsgi:error] [pid 171100:tid 139770963928960]     return Database.Cursor.execute(self, query, params)

[Tue Nov 28 16:01:49.429650 2017] [wsgi:error] [pid 171097:tid 139770963928960]     return Database.Cursor.execute(self, query, params)

[Tue Nov 28 16:01:49.429677 2017] [wsgi:error] [pid 171101:tid 139770963928960] IntegrityError: UNIQUE constraint failed: auth_user.username

[Tue Nov 28 16:01:49.429683 2017] [wsgi:error] [pid 171099:tid 139770963928960] IntegrityError: UNIQUE constraint failed: auth_user.username

[Tue Nov 28 16:01:49.429688 2017] [wsgi:error] [pid 171100:tid 139770963928960] IntegrityError: UNIQUE constraint failed: auth_user.username

[Tue Nov 28 16:01:49.429697 2017] [wsgi:error] [pid 171097:tid 139770963928960] IntegrityError: UNIQUE constraint failed: auth_user.username

Perhaps some sort of corrupt SQLite database? Any other suggestions of something to try?

The strange thing is that the 3 nodes only share the stats metrics share, which is a Whisper time series database, in:

/opt/petasan/config/shared/graphite

but each node has its own copy of the config files, including the SQLite database, in:

/opt/petasan/config/stats/graphite/

so if there is configuration corruption on one node, it should work fine on the others. If the issue were in the shared metrics data, it should have been solved by re-creating a new shared Whisper folder. We tried both of these scenarios, so I am puzzled; it is not logical that all 3 nodes got corrupt configuration. Did you refresh your browser when you did the steps in my previous reply? Is it possible the nodes were updated with other software packages?
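
If you want to look inside the per-node database itself, a quick sketch like the one below would list the rows in auth_user that the UNIQUE constraint error is complaining about. The file name graphite.db is an assumption on my part, so adjust the path to whatever SQLite file actually sits in that directory:

#!/usr/bin/python
# Sketch: dump the auth_user table from graphite-web's Django database.
# The file name 'graphite.db' is assumed -- point DB_PATH at the real file.
import sqlite3

DB_PATH = '/opt/petasan/config/stats/graphite/graphite.db'

conn = sqlite3.connect(DB_PATH)
for row in conn.execute('SELECT id, username, last_login FROM auth_user'):
    print(row)
conn.close()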

Okay, found something interesting...

I mentioned that this problem all started when I added 3 new nodes to the cluster. When looking at /var/log/syslog, I noticed these errors:

Nov 29 09:10:36 ceph2 collectd[803233]: write_graphite plugin: getaddrinfo (localhost, 2003, tcp) failed: Name or service not known

Can't find localhost? So I looked at my /etc/hosts file. It consists of just the three new nodes and nothing else! Very strange. I notice that it's a symbolic link to a PetaSAN file of some sort, so I'm reluctant to edit it. Is this file being built from a list of nodes somewhere? Is it safe to manually edit it and put all of the nodes back in? Is there anything I can do to help track down the bug that caused adding the three new nodes to wipe out the original entries?

This is indeed a clue. It could be a failure during the /etc/hosts update/syncing while adding a node.
First make sure the /etc/hosts link exists on all nodes and points to the shared PetaSAN copy: /etc/hosts -> /opt/petasan/config/etc/hosts
Create the link if it is missing.
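
To verify this quickly on each node, a minimal sketch (not a PetaSAN tool) would be:

#!/usr/bin/python
# Sketch: verify /etc/hosts on this node is a symlink to the shared PetaSAN copy.
import os

HOSTS = '/etc/hosts'
TARGET = '/opt/petasan/config/etc/hosts'

if os.path.islink(HOSTS) and os.readlink(HOSTS) == TARGET:
    print('ok: %s -> %s' % (HOSTS, TARGET))
else:
    print('problem: expected %s to be a link to %s' % (HOSTS, TARGET))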

Then, on any one node, stop the file sync service (so that our /etc/hosts does not get overwritten):
systemctl stop petasan-file-sync

Edit /etc/hosts to include localhost as 127.0.0.1, followed by all hosts and their management IPs, for example:
127.0.0.1 localhost
10.0.1.11 ps-node-01
10.0.1.12 ps-node-02
10.0.1.13 ps-node-03
10.0.1.14 ps-node-04

Now we need to sync the file to all nodes. We do this programmatically, so in your case create a script file:

nano /opt/petasan/scripts/util/sync_file.py

#!/usr/bin/python
# Sync the given file to all cluster nodes via PetaSAN's file sync manager.
import sys
from PetaSAN.backend.file_sync_manager import FileSyncManager

if len(sys.argv) != 2:
    print('usage: sync_file full_file_path')
    sys.exit(1)

path = sys.argv[1]
if FileSyncManager().commit_file(path):
    print('Success')

sys.exit(0)

I am having problems getting the correct indentation to show here, so it is better to download this file from:

https://drive.google.com/open?id=1FzvpVOnN96B2VN52o9lTwllrJ1lkyJRD

chmod +x /opt/petasan/scripts/util/sync_file.py

Sync the hosts file to all nodes:
/opt/petasan/scripts/util/sync_file.py /etc/hosts

Restart the sync service on the current node:
systemctl start petasan-file-sync
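
After the service is back, a quick check on each node that the lookup collectd's write_graphite plugin was failing on (localhost, port 2003, tcp) now resolves could look like this sketch:

#!/usr/bin/python
# Sketch: repeat the getaddrinfo lookup that collectd's write_graphite
# plugin reported as failing, and show whether it now resolves.
import socket

try:
    info = socket.getaddrinfo('localhost', 2003, 0, socket.SOCK_STREAM)
    print('localhost:2003 resolves to %s' % str(info[0][4]))
except socket.gaierror as e:
    print('still failing: %s' % e)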

This worked - thanks. For anyone using this script in the future, note that cut-and-paste can mangle the indenting: the lines after the 'if' statements need to be indented.

Okay, after re-living the installation events in my head, I believe that the bug may have been triggered as follows:

The three new nodes we added are in a different building from the rest of our nodes. I had incorrectly mapped some VLANs between the buildings at the beginning, so the management network was working but the back-end networks weren't. I believe the first of the new nodes started the install process, determined that it couldn't see the rest of the cluster on the back-end, and gave me an error. I found the incorrectly mapped VLANs, fixed them, then continued the install. Is it possible that during the install, a sync of the hosts file failed and ended up putting a blank file in its place, which then had the three new nodes added sequentially and was then synced back to the original nodes?

Just trying to help!

Thanks so much for sharing this and your analysis of what happened. I did update the post with the downloadable script.