
OSD went down on vmware demo environment


 

I am testing PetaSAN 1.21 on a VMware 5.5 environment. For this purpose I built 3 virtual machines, each with 8 GB of RAM, to serve as the 3 PetaSAN nodes with 1 OSD each.

Then I created and published a new 30 GB PetaSAN disk.

Everything works until I create a new virtual machine on the disk served by PetaSAN. Sometimes during the format of the disk, sometimes during the subsequent installation, I lose one or two OSDs and can't complete the virtual machine installation. Right now I can't finish any virtual machine installation because of this problem.

After a while all the OSDs come back up, but the installation still hangs.

 

I tried both the E1000 and VMXNET3 interfaces for the PetaSAN nodes, but nothing changed.

Is there some check I should run or some timeout I should raise?

 

Thank you.

What are your iSCSI disk settings? Also, are you using a separate network for the PetaSAN backend vs iSCSI?

My first suspicion is that there is an issue with your network. Please double-check that the connections are fine and maybe retry with a different setup.

 

This is what I suspect is happening:

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#flapping-osds

In addition to the setting changes in the link, there are various timeouts that control the OSD heartbeat which can help; however, I would not recommend making any changes, but rather identifying the cause of the network issues.
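For reference only (per the advice above, fix the network rather than tune these), these are the kinds of heartbeat settings the linked page discusses. The values shown in this ceph.conf sketch are illustrative, not recommendations:

```
[osd]
# grace period (seconds) before an OSD that misses heartbeats
# is reported down by its peers (commonly defaults to 20)
osd heartbeat grace = 20

[mon]
# how many OSDs must report a peer as down before the monitor
# actually marks it down
mon osd min down reporters = 2
```

Raising these only masks an unreliable network; flapping OSDs will usually return once the underlying connectivity or load problem is fixed.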

Also, I presume that when the OSDs are up again the iSCSI disk is functioning again, but the installation application has timed out.

Just to add: in an operational cluster, if 1 node or 1 OSD goes down, the iSCSI disks will keep working. It is the simultaneous failure of 2 nodes, or of 2 OSDs on separate nodes, that will cause the IO to pause, since there are not enough active data replicas. IO will resume once 1 of the failed nodes/OSDs is back up.
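If you want to confirm this replica rule on your own cluster, the pool's replica counts can be read back with the standard Ceph CLI (the pool name rbd here is just an example):

```
ceph osd pool get rbd size      # number of data replicas, e.g. 3
ceph osd pool get rbd min_size  # IO pauses if active replicas drop below this, e.g. 2
```

With size 3 and min_size 2, losing one replica leaves IO running, while losing two pauses it until a replica comes back, exactly the behaviour described above.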

In the case of network problems, the OSDs cannot heartbeat each other and will report each other as down.
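A quick way to see this from any PetaSAN node is the standard Ceph status commands, plus a plain ping across the backend network (the IP below is only an example, substitute another node's backend address):

```
ceph health detail   # shows which OSDs are reported down and why
ceph osd tree        # up/down state of every OSD, grouped by node
ping -c 5 10.0.2.12  # example: test reachability of another node's backend IP
```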

I have the same problem with my virtual PetaSAN test with 4 nodes installed on one ESXi 6.0.0 host. But I get the feeling the problem is in VMware, because when I try to install a new Ubuntu 16.04 VM and it starts to format the disk, it gets to 33% and then freezes! The whole ESXi GUI interface freezes, but the VMs continue to run, all except the new installation. And when I quit the GUI and try to log in again, it won't let me in. But after 30-60 minutes it is OK again and I can log in. During this time the PetaSAN web GUI is up, though a bit slow, and I can see that PetaSAN loses and restores some OSDs.

When checking the logs in the PetaSAN web GUI I get a lot of GlusterFS errors:

[log excerpt not preserved]
 

The new Ubuntu VM is stored on an iSCSI datastore provided by the 4 virtualized PetaSAN nodes, correct? Or is it on a local datastore?

Also, it will help to select:

  • VMWare Paravirtual for the SCSI Controller
  • VMXNET3 for the network adaptor

Also, it may be better to use just 3 nodes with 1 OSD each to limit the load on the ESXi host. And if you have several local physical disks, place the OSDs on separate local datastore disks, as well as the Ubuntu ISO image that you read from.

 

Just to add a little to the above: the GlusterFS errors you mentioned are intriguing. GlusterFS is a system totally separate from Ceph and iSCSI; we use it internally to share the statistic graphs we draw among the nodes. A failure there is another indication that the system is either overloaded or the network is not reliable. GlusterFS nodes talk over the Backend 1 network.

Correct, the new VM is being installed on the PetaSAN storage on the same ESXi host where the four PetaSAN servers are installed.

I found out that the main problem for me was memory. I had only 1 GB per PetaSAN server. Now they all have 4 GB, and when I mounted the storage on our second ESXi server and installed a VM (Ubuntu) on the PetaSAN storage residing on the first ESXi host, it worked flawlessly. After I had installed that VM I went back to the first ESXi and tried to install a second one there, but then it all fell apart again. So I left work, and when I came back today only one of six OSDs was online.

I also found out that the reason the VMware GUI freezes is that it (the ESXi host) can't access the PetaSAN storage, and when the storage is "stable" again it lets me log in.

By the way, I have tried both Paravirtual and LSI, but I have the same problem with both of them, and I'm always running VMXNET3 as the NIC.

So if I understand you right, it is better to have a few big OSDs instead of many small ones? I thought 'the more the merrier', but I'll redesign everything today. I'll reconfigure the system so that there are only three PetaSAN servers running three OSDs.

 

Yes, it does look like a resource issue; at least the change made a difference. 1 GB was very, very low. We do have a recommended hardware guide which will give you some indication of what to use in a production environment.

You are correct that Ceph loves more OSDs, which give it better concurrency, but that also requires more resources (memory as well as a CPU core per OSD, as per the guide); Ceph is a hungry beast. So in your setup it is not desirable to have many OSDs. Of course, if you have the option to distribute the PetaSAN cluster across 2 or more ESXi hosts, so much the better; also make sure you do not use the same local datastore for more than 1 OSD.

Also, do use MPIO (at least 2 paths) on the ESXi iSCSI client.
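As a sketch, assuming the standard ESXi pathing tools, the path selection policy for the PetaSAN LUN can be inspected and switched to Round Robin with esxcli (the naa identifier below is a placeholder for your actual device ID):

```
esxcli storage nmp device list       # show each device and its current path selection policy
esxcli storage nmp device set --device naa.6001405xxxxxxxxxxxx --psp VMW_PSP_RR
```

Round Robin spreads IO across both iSCSI paths instead of leaving one idle as a passive standby.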

We do a lot of testing on ESXi bare metal and it works very well. We have not done testing in the past on running PetaSAN VMs within ESXi. I will try to include this after our v1.3 release so that at least we have a recommended setup with the correct resource configuration.
