
OSD went down on VMware demo environment


It works better today with three PetaSAN servers running on the ESXi host with only one OSD each, 4 GB memory, VMXNET3 and paravirtual SCSI. But I still get lots of strange errors on the console, OSDs disappear and come back, and I have no graphs on the dashboard. I started with separate VLANs for the iSCSI network and the backend, but now they share the same two VLANs/NICs: one for iSCSI-1 + backend-1 and one for iSCSI-2 + backend-2, plus of course one for management. Do you think this can cause problems? I don't really see why it should.

My goal with this test, if it works well, is to install this in our production environment as a secondary storage pool. Today we are running a 3-server VMware cluster with local storage on one of our sites, but we are moving the storage out to a new SAN solution. That frees up a lot of local disks on the ESXi servers, where I'm planning to run PetaSAN as the secondary storage pool.

By the way, while typing this post the graphs went live! \o/ Does it take some hours before the graphs start to display data?

When you had separate VLANs, was it working well or did you also have problems?

I will try your setup so I can have a better idea; this will be after the release of 1.3 next week, since we are tied up and have no free machines. If you have any update please let me know. Also, one command to try: log into one of the nodes via SSH and type:

consul monitor

This will show you whether you have heartbeat errors among the nodes.
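If the plain output stays quiet, you can also raise the log level and filter for membership problems, and list the cluster members to check their state. This is only a suggestion; the exact wording of the failure messages can vary between Consul versions:

consul monitor -log-level=debug | grep -iE 'failed|suspect|timeout'

consul members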

Also, in the node logs, besides the GlusterFS mount issues, do you see any connection / re-attempt messages?

The charts should come up, and remain up 🙂, as soon as you start the cluster, with samples every 1 minute. If they only came up after an hour, it is possible there is an hour shift between your browser clock and the PetaSAN cluster. If it is working intermittently and you are seeing the GlusterFS mount issues, then there is probably a connection issue.
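A quick way to rule out a clock shift is to compare the time on each node with the clock of the machine running the browser, for example (the node address is just a placeholder):

date -u

ssh root@<node-management-ip> date -u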

The reason to run front-end and back-end on the same VLAN was to reduce the number of NICs on the VMs and keep things simpler, but I had the same problems either way with my first install. With my second install the storage seems to be working perfectly. I got some errors at first, but now it seems stable. The only thing now is that the graphs went away after I rebooted all the VMs, one at a time, as a test.

Now I'm running 4 PetaSAN VMs on 2 ESXi hosts (free standalone), 2 on each. The PetaSAN VMs have 1 CPU, 4 GB memory and 1 OSD with 100 GB each. Both ESXi hosts have iSCSI adapters configured on both iSCSI VLANs and one 500 GB datastore mounted. I have successfully installed VM servers on both of the ESXi hosts using the same shared storage and it works great. I can also confirm that the ESXi hosts find all four paths to the shared storage. I have also used some Ceph CLI commands to get rid of one OSD that went lost/down when I moved (reinstalled) one of the PetaSAN nodes from one ESXi host to the other.
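For anyone else hitting the same thing, the usual sequence to remove a dead OSD looks roughly like this (osd.3 is only an example ID, substitute the one that is down; the systemctl stop is run on the node that hosted it, if it still exists):

ceph osd out osd.3

systemctl stop ceph-osd@3

ceph osd crush remove osd.3

ceph auth del osd.3

ceph osd rm osd.3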

At the moment I only see 'GlusterFS mount attempt' in the logs.

Here is the output of 'consul monitor' on all four of them:

root@sto1-petasan-01:~# consul monitor

2017/05/12 11:01:16 [INFO] consul.fsm: snapshot created in 65.07µs

2017/05/12 11:01:16 [INFO] raft: Starting snapshot up to 254120

2017/05/12 11:01:16 [INFO] snapshot: Creating new snapshot at /opt/petasan/config/var/consul/raft/snapshots/16-254120-1494579676049.tmp

2017/05/12 11:01:16 [INFO] snapshot: reaping snapshot /opt/petasan/config/var/consul/raft/snapshots/12-237726-1494567943420

2017/05/12 11:01:16 [INFO] raft: Compacting logs from 235680 to 243880

2017/05/12 11:01:17 [INFO] raft: Snapshot to 254120 complete

2017/05/12 11:05:30 [INFO] agent.rpc: Accepted client: 127.0.0.1:49444

 

root@sto1-petasan-02:~# consul monitor

2017/05/12 11:12:43 [INFO] agent.rpc: Accepted client: 127.0.0.1:53766

 

root@sto1-petasan-03:~# consul monitor

2017/05/12 11:01:35 [INFO] consul.fsm: snapshot created in 72.12µs

2017/05/12 11:01:35 [INFO] raft: Starting snapshot up to 254148

2017/05/12 11:01:35 [INFO] snapshot: Creating new snapshot at /opt/petasan/config/var/consul/raft/snapshots/16-254148-1494579695753.tmp

2017/05/12 11:01:35 [INFO] snapshot: reaping snapshot /opt/petasan/config/var/consul/raft/snapshots/12-237758-1494567965365

2017/05/12 11:01:35 [INFO] raft: Compacting logs from 235716 to 243908

2017/05/12 11:01:35 [INFO] raft: Snapshot to 254148 complete

2017/05/12 11:19:41 [INFO] agent.rpc: Accepted client: 127.0.0.1:46926

 

root@sto1-petasan-04:~# consul monitor

2017/05/12 11:22:46 [INFO] agent.rpc: Accepted client: 127.0.0.1:41498

 

I still have more questions, but I'll open separate threads for those.

This sounds very good 🙂

-Was the main difference between the first and second attempt that you distributed the nodes across 2 ESXi hosts rather than 1? Generally, try to assign as many resources to PetaSAN as you can, since even your second setup is still well below the recommendation guide. Ideally the minimum configuration would be 3 ESXi nodes, each running a single PetaSAN node with 4 cores and 32 GB RAM and serving 4 OSDs, and maybe using RDM mapping or PCI passthrough for the physical disks (there is a vmkfstools sketch at the end of this post). I hope we will test this ourselves soon after 1.3 is out.

-Can you please give more detail on the node you re-installed: was it one of the first 3 nodes or was it the fourth? If it was one of the first three (the management nodes), it should be deployed using the "Replace Management Node" choice in the deployment wizard, and its configuration (hostname, IPs) has to match the original. If it was the fourth, it should be removed from the cluster via the "Node List", and then you can install a new node and deploy it using the "Join" option.

-Re the GlusterFS issue: we use it as an internal file share to store the statistics data that shows up in the dashboard graphs. Can you please run the following CLI commands on each of the first 3 nodes:

gluster peer status

gluster vol info gfs-vol

systemctl status glusterfs-server
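If it is easier, you can collect the output from all three management nodes in one go from any node, for example (adjust the hostnames to match your management nodes):

for h in sto1-petasan-01 sto1-petasan-02 sto1-petasan-03; do echo "== $h =="; ssh root@$h "gluster peer status; gluster vol info gfs-vol; systemctl status glusterfs-server"; done

As for the RDM mapping mentioned above, a physical-compatibility pointer file for a local disk can be created from the ESXi shell along these lines (the device identifier and datastore path are only placeholders) and then attached to the PetaSAN VM as an existing disk:

vmkfstools -z /vmfs/devices/disks/naa.<id> /vmfs/volumes/<datastore>/<vm-folder>/osd-rdm.vmdk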

 
