
[BUG] ERROR GlusterFS vol_create failed


Hello,

I am in the middle of trying to do a fresh install. Everything (log-wise) was going fine, but on my third node the log is reporting the following.

I do remember reading that stats seem to require 20GB per OSD. Is this correct, or is my memory fuzzy?

I have only created a 65GB rootfs (OS) and I suspect this may be the issue.

I have three servers, each with 2x NVMe cards, dual 10GbE, a 65GB OS disk and 5 NICs (as recommended), each in its own VLAN. Connectivity between the nodes on each VLAN/address has been checked and is fine.

I am guessing the stats must live on the OS disk rather than in Ceph. If so, should I increase all OS disks to 120GB? Would that cater for the stats?

Thanks for reading,

Kind Regards

27/07/2019 14:52:21 ERROR GlusterFS vol_create failed attempt 1
27/07/2019 14:52:42 ERROR GlusterFS vol_create failed attempt 2
27/07/2019 14:53:02 ERROR GlusterFS vol_create failed attempt 3
27/07/2019 14:53:22 ERROR GlusterFS vol_create failed attempt 4
27/07/2019 14:53:43 ERROR GlusterFS vol_create failed attempt 5
27/07/2019 14:54:03 ERROR GlusterFS vol_create failed attempt 6
27/07/2019 14:54:23 ERROR GlusterFS vol_create failed attempt 7
27/07/2019 14:54:44 ERROR GlusterFS vol_create failed attempt 8
27/07/2019 14:55:04 ERROR GlusterFS vol_create failed attempt 9
27/07/2019 14:55:24 ERROR GlusterFS vol_create failed attempt 10
27/07/2019 14:55:45 ERROR GlusterFS vol_create failed attempt 11
27/07/2019 14:56:05 ERROR GlusterFS vol_create failed attempt 12

GlusterFS is used to provide a highly available share for storing stats. 64 GB for the root OS should be OK (it is the minimum value).

The error occurs while attempting to create this share during the cluster build, which happens at the node 3 build step (since we need 3 nodes to create Ceph/Consul/Gluster, etc.). It will retry for a couple of minutes and then fail in case of error. The most likely cause is the network connection on the Backend 1 network; I suggest you double-check it and re-install. Maybe first create a simple network with no VLANs/bonds/jumbo frames to check that things work.
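For example, a quick sanity check of the Backend 1 path from each node could look something like the lines below (the address is a placeholder for another node's Backend 1 IP, 24007 is glusterd's management port, and nc is just one convenient way to test the TCP port if it happens to be installed):

ping -c 3 BACKEND1_IP_OF_OTHER_NODE        # basic reachability on the Backend 1 subnet
nc -zv BACKEND1_IP_OF_OTHER_NODE 24007     # checks that glusterd's management port answers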

Thanks very much for the speedy reply, I didn't expect that 🙂

I have just reinstalled with a 180GB SSD in each for the OS and the issue still persists (thought it was worth testing while I waited for a reply).

I have already checked the MTU, as the links are on 9k, and I have even tried without changing from 1500, but the problem persists. What's interesting is that the cluster comes up fine; it is just the stats that are broken.

I have also checked connectivity on each of the nodes to every address within cluster and all can see each other.

Just about to triple-check Backend 1, let's see what we can see.

Thanks again for the leads, will let you know how I get on.

Have rechecked the network, all ranges are working fine. I can also push 10k through before packets are too big, so it's not that.
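For anyone checking jumbo frames the same way, a test along these lines is useful (a 9000 MTU leaves 8972 bytes for the ICMP payload once the 28 bytes of IP and ICMP headers are counted; the address is a placeholder for a Backend 1 IP):

ping -M do -s 8972 -c 3 BACKEND1_IP     # -M do forbids fragmentation, so this only succeeds if the whole path really carries 9000-byte frames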

As only the stats are affected, can you tell me briefly how they are done and I will see if I can manually create what it wants.

The cluster itself is running perfectly, just no stats.

Very weird. This also confirms that networking is fine:

root@psan03:~# gluster peer probe psan01
peer probe: success. Host psan01 port 24007 already in peer list
root@psan03:~# gluster peer probe psan02
peer probe: success. Host psan02 port 24007 already in peer list
root@psan03:~# gluster peer probe psan03
peer probe: success. Probe on localhost not needed
root@psan03:~# gluster peer status
Number of Peers: 2

Hostname: 172.227.3.2
Uuid: 8234a72b-5a49-4698-9c32-64426d0f021a
State: Peer in Cluster (Connected)
Other names:
psan02

Hostname: 172.227.3.1
Uuid: 7cd1c17b-c734-43de-9ac0-de6b192e1e92
State: Peer in Cluster (Connected)
Other names:
psan01
root@psan03:~# gluster pool list
UUID Hostname State
8234a72b-5a49-4698-9c32-64426d0f021a 172.227.3.2 Connected
7cd1c17b-c734-43de-9ac0-de6b192e1e92 172.227.3.1 Connected
06b9fd35-d25c-40ac-a0cf-9a737b197533 localhost Connected

 

Any other ideas? I have now reinstalled about four or so times, and each time it does the same thing: the cluster works great but there are no stats, plus errors in the logs, which I do not like 😉

After a bit of googling on how GlusterFS works, I've found that the following resolves the issue. Now just waiting for the stats to generate to know for sure we are good.

I do not know why a fresh install comes up broken, but this does fix it and hopefully it will help others if they suffer the same issue.

 

root@psan03:~# gluster volume create gfs-vol replica 2 transport tcp psan01:/opt/petasan/config/shared psan02:/opt/petasan/config/shared force
volume create: gfs-vol: success: please start the volume to access data
root@psan03:~# gluster volume start gfs-vol
volume start: gfs-vol: success
root@psan03:~# gluster volume info gfs-vol

Volume Name: gfs-vol
Type: Replicate
Volume ID: 37905f38-240b-404e-af31-2932873763a6
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: psan01:/opt/petasan/config/shared
Brick2: psan02:/opt/petasan/config/shared
Options Reconfigured:
performance.readdir-ahead: on
root@psan03:~#
root@psan03:~#
root@psan03:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 7.9G 0 7.9G 0% /dev
tmpfs 1.6G 15M 1.6G 1% /run
/dev/sda3 15G 1.9G 13G 13% /
tmpfs 7.9G 0 7.9G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
/dev/sda4 30G 92M 30G 1% /var/lib/ceph
/dev/sda5 104G 98M 103G 1% /opt/petasan/config
/dev/sda2 127M 946K 126M 1% /boot/efi
/dev/nvme0n1p1 97M 5.5M 92M 6% /var/lib/ceph/osd/netspeedy-1
/dev/nvme1n1p1 97M 5.5M 92M 6% /var/lib/ceph/osd/netspeedy-0
172.227.3.1:gfs-vol 104G 62M 104G 1% /opt/petasan/config/shared

Check:

systemctl status glusterfs-server
gluster peer status

If the servers are up and peered, do a:

gluster vol info gfs-vol
gluster vol status gfs-vol

If the volume "gfs-vol" is not created, do a:

gluster vol create gfs-vol replica 3 IP1:/opt/petasan/config/gfs-brick IP2:/opt/petasan/config/gfs-brick IP3:/opt/petasan/config/gfs-brick

where IP1, IP2, IP3 are the Backend 1 IPs of your first 3 nodes. If the volume was already present, skip the above step.

Try to start the volume:

gluster vol start gfs-vol

If you are lucky, set some optional values:

gluster vol set gfs-vol network.ping-timeout 5
gluster vol set gfs-vol nfs.disable true

If the volume starts, the clients should pick it up after a minute or so.
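Once it is started, one way to confirm the clients have picked it up is to check the volume status and the mount point (the path below is the PetaSAN shared config path visible in the df output earlier in the thread):

gluster vol status gfs-vol                 # bricks and self-heal daemons should show as online
df -h /opt/petasan/config/shared           # should list BACKEND_IP:gfs-vol mounted here on each node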

Thanks very much 🙂

Will re-install, and if it does it again (I expect it will) I will use your solution. At least I was close 🙂

Thanks again.

Good it works 🙂

One point is you should create the bricks on gfs-brick, not the shared folder: the first is for the servers, the second is where the clients mount the share. Also, better to do 3x replication since all 3 management nodes in PetaSAN are alike, and better to use the Backend 1 IPs rather than node names.
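Putting those corrections together, the redo would look roughly like this, assuming the earlier replica-2 volume is stopped and deleted first (IP1-IP3 stand for the Backend 1 addresses of the first 3 nodes; stop and delete ask for confirmation, and deleting the volume only removes its definition, not the files already on disk):

gluster vol stop gfs-vol
gluster vol delete gfs-vol
gluster vol create gfs-vol replica 3 transport tcp IP1:/opt/petasan/config/gfs-brick IP2:/opt/petasan/config/gfs-brick IP3:/opt/petasan/config/gfs-brick
gluster vol start gfs-vol
gluster vol set gfs-vol network.ping-timeout 5
gluster vol set gfs-vol nfs.disable true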

Aye, many thanks. Just redoing it with your fixes now (i.e. correct mount + replica count). Still wasn't a bad guess 😉

Thanks again for putting me straight.
