
A few ideas for next releases.

Hi all,

This is just to summarize a few thoughts on features that would be a great addition to PetaSAN. My apologies in advance for being too "enterprise" oriented. In fact I tried to build something similar from scratch in the past (about 2 years ago), but unfortunately LIO and CEPH were not stable enough and were missing critical aspects such as ALUA support, and the worst part was always the performance. During that attempt I also used the ZFS on Linux modules and Pacemaker, but that didn't work out too well and most of the time it suffered from kernel panics. Anyway, I hope this project works out much better than my own attempts. After an initial deployment in my home lab, I can certainly say that the system works nice and clean, but it does lack important features. So here are a few ideas off the top of my head that I would say should get implemented at some point:

* iSCSI Target granularity and settings - currently an iSCSI target seems to be assigned automatically to each virtual disk when the disk is created, but what I didn't like is that I can't assign multiple disks to the same iSCSI target. The main drawback is that you need to provide the iSCSI target settings for each disk on the initiator side as well, which can be a time-consuming task across multiple machines, especially when iSCSI targets are assigned statically (as in VMware, for example). A better option would be a separate iSCSI target section where you then choose which disks are assigned to which targets. This would also spare a lot of internal IP waste, and I'm saying this specifically because I work with many smaller LUNs rather than a few huge ones. The main reason is that both the initiators and the individual LUNs are often limited in how much performance they can handle in total. While I fully understand the simple concept of spreading the load of individual disks by assigning them across multiple nodes, the current method generates a lot of administrative overhead on the iSCSI initiator side, especially in environments where static assignment is a must.

* iSCSI Portals - generally speaking, all iSCSI target implementations have the concept of portals one way or another, and here I'm making the comparison with some widely used implementations like COMSTAR and FreeNAS. Assigning individual Ethernet adapters to specific portals and later binding those to iSCSI targets would make it possible to organize resources in a much cleaner way. In some cases such options are even mandatory; for example, on one of my FreeNAS boxes I use multiple iSCSI portals where the first portal presents the assigned LUNs to a VMware ESX cluster while the second portal presents the very same LUNs to a dedicated NetBackup machine. This construction lets me offload the backup traffic to a completely different network interface and spare the storage-dedicated connections, but without multiple portals with multiple targets and multiple LUNs such scenarios wouldn't be possible at all, so a lot of potential would be lost.

* LUN ID assignments - this is definitely something to consider regardless of the fact that virtual disks are currently bound to individual targets. In specific situations there is a definite need to specify an exact LUN ID number, so I would strongly suggest exposing this option somewhere in the GUI in one of the next releases.

* Jumbo frames - this is something that is pretty much required not only for the iSCSI connections but also for the CEPH replication. It would be nice to make this option available, especially for people who want to run performance tests prior to implementation in either test labs or production environments.

* More than two iSCSI subnets - this would have value in various environments, but in my case it is specifically tied to saturating two Gigabit links as well as having more than two paths available on the initiator side. With 10 Gbit links this is probably not much of a drawback, but having the option to specify more iSCSI subnets would provide better flexibility.

* CEPH SSD journaling / caching - I guess the current implementation deals only with a basic deployment of OSDs in the form of one OSD per HDD. Obviously there are a huge number of options for fine-tuning CEPH on its own, but perhaps the next release, which is stated to provide support for adding/removing OSDs, should also be extended with options for utilizing SSD drives purely for optimizing performance. My previous experience tells me that journaling can definitely make a difference in performance, and of course caching/tiering features would also be a nice-to-have addition.

* FC Support - the LIO target, to the best of my knowledge, is easily capable of handling FC targets, so technically it should be possible to implement this as a feature. This could really extend the usability of the product for larger environments with the appropriate equipment. Should PetaSAN prove itself stable enough, I can clearly see it as a potential replacement for smaller-scale SAN / NAS implementations, which rely heavily on dual controllers at most, while PetaSAN could potentially provide multi-head scalability. Combining this with FC as an option could make a huge difference for many storage users/administrators.

* Smaller size requirements for boot drive - while I haven't tested this myself yet, I intend to build a smaller PetaSAN home lab for testing purposes out of older machines. I plan to use 16 GB pendrives (that's what I have right now), however I noticed in my initial virtualized PetaSAN tests that the system drive has to be at least 32 GB. I'm not sure what the reason is for this hardcoded value, but I doubt it needs to be that big even if we consider logging activity. I assume even 4 GB would be more than enough to boot the footprint, but some clarification on this topic wouldn't hurt.

I'll put up a few more thoughts later on, but in the meantime I would be curious to hear back from others about their take on these ideas as well as their own concepts and requests.

Thank you for your valuable feedback,

1) iSCSI Target granularity and settings
If the use case is that each client initiator requires a large number of small LUNs, then yes, having multiple LUNs per target will be quicker to set up at the client. PetaSAN will require more time.
Can you tell me roughly how many LUNs you have per client in your setup? Also, is having many small LUNs related to performance limits on the LUN devices, i.e. to avoid I/O from many client connections?

Dynamic assignment/discovery in PetaSAN does work well with VMware. We have made a kernel patch for iSCSI discovery to send only the correct target info required by the client. The normal kernel discovery sends back all targets/portals found on the server, which may not be accessible by the client. In PetaSAN the server determines which target/LUN the client is interested in based on the virtual IP it is connecting to, and we send back the allowed/accessible IPs/TPGs for that target/LUN only.
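
To illustrate the idea only (this is a conceptual sketch, not PetaSAN's actual kernel patch or code), the discovery filtering could be pictured roughly like this; all names, IQNs and addresses below are hypothetical:

```python
# Conceptual sketch of filtering a SendTargets discovery response by the
# virtual IP the initiator connected to. Hypothetical data, not PetaSAN code.

# Map of virtual portal IP -> (target IQN, portals serving that target)
DISK_PORTALS = {
    "10.0.1.21": ("iqn.2016-05.com.petasan:00001", ["10.0.1.21:3260", "10.0.2.21:3260"]),
    "10.0.1.22": ("iqn.2016-05.com.petasan:00002", ["10.0.1.22:3260", "10.0.2.22:3260"]),
}

def discovery_response(connected_ip: str) -> list[str]:
    """Return discovery entries only for the target owning the IP the client
    connected to, instead of every target/portal known to the server."""
    iqn, portals = DISK_PORTALS[connected_ip]
    return [f"{iqn} {portal}" for portal in portals]

if __name__ == "__main__":
    # An initiator that connected to 10.0.1.21 only learns about disk 00001.
    print(discovery_response("10.0.1.21"))
```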

These are some advantages we see in the 1 LUN per target approach:
-Simpler/quicker creation of disks via the web application
-Better internal architecture: for example, the disk id appended to the iqn name is also the id of the internal Ceph rbd image (a 1:1 relation; see the sketch after this list), and it is also the disk id used in Consul for resource management. With multiple LUNs per target, a resource failover would probably require failing over all LUNs to another host.
-Ceph's scale-out nature makes it attractive to have larger LUNs that can be accessed concurrently by many clients. A Ceph LUN aggregates the performance of all resources in the Ceph cluster: the more nodes/disks you add, the faster your LUN becomes. Of course, to feel this aggregate performance you need many concurrent I/O operations on the LUN, which is typical of VMware/Hyper-V/MS SOFS deployments.
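
As an aside, the 1:1 relation mentioned in the second point could be pictured with a tiny sketch; the exact prefixes and key formats here are assumptions for illustration, not PetaSAN's verbatim internals:

```python
# Illustrative sketch of deriving every internal name from one disk id.
def names_for_disk(disk_id: str, base_iqn: str = "iqn.2016-05.com.petasan"):
    return {
        "iqn": f"{base_iqn}:{disk_id}",        # iSCSI target for this one disk
        "rbd_image": f"image-{disk_id}",       # backing Ceph RBD image (assumed prefix)
        "consul_resource": f"disk/{disk_id}",  # resource key used for failover (assumed)
    }

print(names_for_disk("00003"))
```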

2) iSCSI Portals
I understand the scenario of assigning a different portal to the same LUN so you can offload a backup application to a different network interface. I can see the advantages of this when using traditional SANs, where you are hitting the same box. In a scale-out architecture, I/O from the client and the backup app are hitting different machines/physical disks/network interfaces.

3) LUN ID assignments
That could be done.

4) Jumbo frames
Yes, we have it in our roadmap.

5) More than two iSCSI subnets
Our target requirement is 10G interfaces. We also have interface bonding / LACP in our roadmap.

6) CEPH SSD journaling / caching
Yes, we will support this soon, but not in this coming version 1.2. It does make a performance difference when using spinning HDDs.

7) FC Support
Cannot comment on this now; it was not in our initial roadmap, but your feedback is very valuable.

8) Smaller size requirements for boot drive
The boot drive is used by the PetaSAN system as well as the Ceph monitors. In addition, the features we plan to add in the future also require space on this drive; for example, the next release 1.2 includes Grafana/Graphite historical data that is saved on this disk. The Ceph monitor daemon and the Graphite data collector require reasonably fast drives.

 

/Maged

Thanks Maged,

A few more thoughts on my end regarding the first two points.

1) iSCSI Target granularity and settings

I certainly agree that there is a drawback in having to migrate multiple LUNs over to a different host when they are served through a single target; however, the performance factors for individual LUNs cannot always be overcome, simply because there are specific limits on how the clients process the data flow through the LUNs. For example, VMware and its software-based iSCSI initiator adapter use a queue depth of 128 per LUN, which is usually sufficient for the generic workloads of small environments, but in larger environments it becomes mandatory to utilize many smaller LUNs instead of a few big ones in order to optimize the performance metrics.
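
To make the queue depth argument concrete, here is a back-of-envelope sketch; the per-LUN queue depth of 128 is the figure mentioned above, while the LUN counts are purely hypothetical examples:

```python
# Illustration only: aggregate outstanding I/O per ESXi host scales with LUN count.
PER_LUN_QD = 128   # assumed per-LUN queue depth on the software iSCSI adapter

few_large_luns  = 4  * PER_LUN_QD   # 4 big LUNs   ->  512 outstanding I/Os max per host
many_small_luns = 24 * PER_LUN_QD   # 24 small LUNs -> 3072 outstanding I/Os max per host

print(few_large_luns, many_small_luns)
```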

While dynamic assignment in PetaSAN is a great way to begin the whole approach, I feel it will hit specific limits fairly soon, especially in an already deployed environment where multiple LUNs are utilized on a one-IP-per-LUN basis. For example, IP addressing could become a problem even though we are talking about internal IPs: if we consider, say, a single /24 subnet that gets exhausted at some point due to the number of allocated LUNs, then a migration of the whole subnet becomes imminent, and this gets even more difficult considering that the same changes have to be reflected on the client side as well. Surely an iSNS approach could help in such situations, but I don't think it's the best approach either, especially when we are talking about real production data.
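
For illustration, the /24 exhaustion concern is simple arithmetic; the number of addresses reserved for other purposes below is an assumption:

```python
# How many one-IP-per-LUN disks fit in a single /24 (hypothetical numbers).
prefix_len     = 24
usable_ips     = 2 ** (32 - prefix_len) - 2   # 254 host addresses (minus network/broadcast)
reserved       = 10                           # e.g. node/management addresses on the same subnet
luns_supported = usable_ips - reserved        # one virtual IP per LUN -> ~244 disks max
print(luns_supported)
```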

I can tell from many years of experience that most of the drawbacks of a particular product surface specifically during migration tasks and from poor organizational concepts; such migrations come in either planned or unplanned form, with the latter usually driven by drawbacks noticed too late. In any case it's a good thing to foresee these potential limits, and my whole idea here is just to draw some attention to what might become a potential culprit. Obviously we could have an endless discussion here, and in fact the nature of a product implementation depends highly on the environment's needs, which as we know vary drastically throughout the world of IT. Troubleshooting a problem in an already deployed environment may also raise additional questions when a particular LUN and its mapping have to be identified on both the SAN and the client side, so that's another point that may require some more thinking in my opinion.

So all in all, in my opinion, having just a few LUNs won't necessarily be the best implementation due to the limitations on the client side, and therefore I would say it's better to give the end user the option to implement his own environment without enforcing the single-LUN-per-single-IP concept, and rather provide the ability to organize LUNs via iSCSI targets and iSCSI portals. This would of course need to be combined with specific warnings about the increased time for migrating multiple LUNs, and I can also imagine that certain limits on LUNs per iSCSI target may be required in order to keep the implementation stable. All of this would require extensive testing, of course.

Perhaps another thing to keep in mind is that many administrators usually create their own naming conventions for the various targets / LUNs and even backend extents, which helps them administer their systems in an organized fashion, so bombarding an existing environment with too many new targets and LUNs may actually make things less organized. While CEPH and iSCSI are great on their own, iSCSI will still be treated by most administrators in the traditional SAN way. If there were a direct RBD or CEPH client for Windows and ESX, things would look completely different.

2) iSCSI Portals

While I agree that the load should be spread across multiple hosts and thereby optimize the I/O from the client's perspective, we should never forget that from the client's perspective the whole storage environment is still essentially seen as a traditional SAN, and there are many best practices which specifically state that the administrative load should be logically separated into its own segment, such as VLANs or even a VSAN (not VMware's VSAN, but SAN segmentation). This is just one point where iSCSI portals and iSCSI targets help out quite a lot. We should also not forget that many environments are not yet ready for 10 Gbit backend infrastructures and instead still use multiple 1 Gbit NICs with Etherchannel (my environments are no exception). Just as it was about 7-8 years ago, implementing 10 Gbit/s is not that expensive on the NIC side but is quite expensive on the switch side, which forces many IT environments to use workarounds instead.

Another problem with just relying on multiple nodes in PetaSAN is that one would need a high number of CEPH nodes in order to achieve equal I/O and bandwidth performance on the client side. One of the things people tend to forget with distributed storage systems is that the backend replication itself demands a lot of I/O, which of course is taken out of each individual node's total I/O capability; at the same time the same nodes have to process the I/O requests from all connected clients, which can become a big bottleneck, especially during potential OSD rebuild operations. So in order to achieve good I/O and bandwidth metrics one would need to think in terms of a much higher number of CEPH nodes compared to, say, a traditional two-head SAN; my rough guess would be that about 4-6 CEPH nodes may (or may not) provide roughly the same amount of raw I/O. This is of course arguable, but my point is that just adding nodes won't necessarily give the clients the desired results, and nodes may not be cheap either. Also, when it comes to nodes and OSDs, certain organizational concepts may be required to keep up with redundancy and performance requirements.
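
A rough illustration of the replication overhead argument, with all figures being assumed examples rather than benchmarks:

```python
# Back-of-envelope: how replication (and a journal double-write, where applicable)
# eats into the raw IOPS of a small cluster. Hypothetical numbers only.
nodes            = 6
osds_per_node    = 8
raw_iops_per_osd = 150        # e.g. a 7.2K spinning disk
replica_count    = 3          # typical replicated pool size
journal_penalty  = 2          # filestore-style double write, if applicable

raw_cluster_iops  = nodes * osds_per_node * raw_iops_per_osd
client_write_iops = raw_cluster_iops / (replica_count * journal_penalty)
print(raw_cluster_iops, int(client_write_iops))   # 7200 raw -> ~1200 client writes/s
```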

 

Anyway, I thought I'd just put a few more thoughts here. I don't keep my vision too traditional about new things, but transitioning between the traditional and the newer concepts usually doesn't go as smoothly as we'd like 🙂

Also thanks for the clarification on point 8. I guess it would be wiser to use a smaller SSD (something like 120 GB) instead of pen drives. Probably a smaller HDD would also do the trick, but I'd prefer to utilize HDDs specifically as OSDs.

You raise valid points, definitely very useful cases.

I wanted to know roughly: when you say high load in VMware, what is the total queue depth per ESXi host (across all LUNs)? The queue depth per LUN can be set to 256 max (raised in vSphere 5.5); how many do you use per host, again very roughly?

You touch on a very nice point: from a user perspective you want a system that can start small, delivering the same characteristics as a 2-node traditional SAN, yet has the advantage of scaling out. Realistically Ceph is excellent for scale-out, but at 1 or 2 nodes it will not be a match for traditional systems.

I am sure we will have more on this. Cheers /Maged

Sorry, I just realized I never answered your question about the number of LUNs.

I have different environments to maintain with different requirements, but the smallest one (which is a test lab, btw) has the following characteristics:

8 * 256 GB sized LUNs placed on a ZFS RAID-10 volume residing on 8 * 15K SAS HDDs

8 * 512 GB sized LUNs + 4 * 2 TB LUNs placed on a ZFS RAID-10 volume residing on 8 * 7.2K SATA HDDs

4 * 128 GB sized LUNs placed on a ZFS RAID-10 volume residing on 4 * SATA SSDs

So in total I have 24 LUNs allocated for a small environment consisting of 3 ESXi servers in a cluster. Of course we can't compare apples to oranges here, because my ZFS-based storage takes care of the performance factors via ARC caching as much as possible, but the point is that I had to spread the VMs (about 40 of them) across various LUNs in order to keep decent performance metrics. It works out pretty well for me, and compared to previous years, where I used to have about 8 large LUNs in total, it makes a huge performance difference for the individual VMs, simply because the LUNs themselves had become the bottleneck. I have other environments where the LUN numbers go much higher, where the reason for spreading the load is purely performance plus some organizational requirements. For my VMware-based environments VVOL would make a true difference, but non-commercial solutions are not there yet on the VVOL side, and I guess nobody really wants to develop a VVOL module for free.

While I'm aware of the tuning options for LUNs, I tend not to modify them without a valid reason, because the vendor's defaults are usually set to specific values for good reason. I'd touch the queue depth only if a vendor provided specific instructions to do so, but never without a valid reason. Data integrity and the safety of an environment are much more important to me than squeezing out a few more bytes by tuning unnecessarily.

As for the scale-out point and the comparison to a 2-node traditional SAN, perhaps my example was not the best one, but I can see you got my exact point. I was looking for a stable iSCSI-to-CEPH gateway solution for years and unfortunately never found a good one, which is also why I tried to build it myself without much success (unfortunately I'm not a developer). I'm happy to see that PetaSAN is taking a good approach, so I have high hopes that I can finally implement a solution that provides me with the necessary redundancy (meaning more than 4 storage head nodes), even if the cost is deploying more machines. Obviously CEPH is not designed for two nodes and will never perform well enough without a decent hardware investment, but the battle of cost versus functionality will probably stay with us for a long time; however, it doesn't have to be exactly either one of them.

Here I'm not speaking only about commercial SAN infrastructure but also about personal home labs. For example, one of my long-standing problems is that I could never make my storage fully redundant toward the client side, because almost all dual-head solutions are paid commercial products, not to mention that many of them provide terrible performance and are tied to internally developed, often poorly documented proprietary software modules. I've tested quite a few of those solutions in the past, and frankly I would rather go for EMC / NetApp as a paid solution before I would spend any money on badly designed but "super fast and flexible"... marketing bla, bla, bla solutions. Should someone create a usable solution, it hardly stays free or without capacity limitations on the storage side, but I guess the business of the storage world will remain like that for a while. CEPH's scalability has huge potential, especially once it is finally stable enough, and PetaSAN makes the difference here primarily as the iSCSI gateway solution. I'll see if I can build up a proper environment in my lab and test the performance metrics with my current load, but that's going to take some time as I'm usually quite overloaded with things 🙂

 

Thank you for sending your configuration info.

I am happy you are considering using PetaSAN in the future (hopefully soon 🙂).

 

Looking forward to it my friend 🙂

At some point I may start testing PetaSAN with one of my VMware vCloud Director based IaaS environments too, where I not only have to use multiple LUNs but also large LUNs organized through storage policies into a single entity (relevant only for VMware). This environment is a semi-production one and it also has a heavy requirement for offsite replication, but I wonder if a cross-DC storage implementation of PetaSAN would work well enough after fine-tuning some of the CEPH parameters. I've never tried anything like this before, but I'm definitely looking forward to trying it out. Since I have a fast link between the DCs, I assume it may work well enough, or perhaps similarly to VMware VSAN.

First we need a stable PetaSAN product though... Let us know once a new version is released. Also, I'm interested to know if you have any specific procedures in place for upgrading a PetaSAN deployment. This is especially critical for already deployed CEPH nodes, and while the upgrade there should normally work without too much disruption or downtime, I personally had bad experiences with a similar approach in the past, which more or less originated from CEPH bugs. I guess things have changed drastically with CEPH over the years, but it would be nice to know what the official PetaSAN vision is on upgrading a complete environment.

We plan to release 1.2 in the first week of February. Release 1.3 will include upgrade via the installer.

We plan to support UI-based rbd mirroring in the later half of this year; however, with some cli commands it is probably easy to set up now, and PetaSAN on the other DC should pick up the image.
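
For readers who want to experiment before the UI support lands, one possible shape of those CLI steps (journal-based RBD mirroring) is sketched below, driven from Python purely for illustration. The pool and image names are placeholders, and the exact rbd syntax should be verified against the Ceph documentation for the release you are running:

```python
# Hedged sketch: enable journal-based mirroring for one RBD image.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

POOL, IMAGE = "rbd", "image-00001"    # hypothetical names

# On the primary cluster: enable journaling on the image and per-image mirroring.
run(["rbd", "feature", "enable", f"{POOL}/{IMAGE}", "journaling"])
run(["rbd", "mirror", "pool", "enable", POOL, "image"])
run(["rbd", "mirror", "image", "enable", f"{POOL}/{IMAGE}"])

# On the secondary cluster a peer is added and the rbd-mirror daemon replays
# the journal, e.g.:
#   rbd mirror pool peer add rbd client.primary@primary
```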

Going over some of the points:
-When you create a disk, the iqn name is automatically appended with the Ceph rbd image name, but you can always edit the iqn name and change it to whatever you want.
-If you do not like automatic IP assignment, you can choose manual and input the IPs yourself.
-The iSCSI 1 and iSCSI 2 subnets can be changed at any time; these affect the creation of new disks only and do not affect already created disks. For new disks they are used for automatic IP assignment and for validation of manually input IPs (we check that the manual IPs are not already used by other disks and are within the declared subnet; see the sketch after this list). So it is possible to have multiple subnets on each of the iSCSI interfaces if you wish.
-Internally we store all the iSCSI info (iqn, IPs, subnet) for a disk within the Ceph rbd image itself as metadata, so it can be totally independent of other disk images.
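
A small sketch of the kind of validation described in the third point above (membership in the declared subnet and no reuse across disks); the function and variable names are mine, not PetaSAN's:

```python
# Illustrative validation of a manually entered iSCSI IP for a new disk.
import ipaddress

def validate_manual_ip(ip: str, declared_subnet: str, ips_in_use: set[str]) -> bool:
    addr = ipaddress.ip_address(ip)
    if addr not in ipaddress.ip_network(declared_subnet):
        return False          # outside the iSCSI subnet configured for new disks
    if ip in ips_in_use:
        return False          # already assigned to another disk
    return True

print(validate_manual_ip("10.0.1.55", "10.0.1.0/24", {"10.0.1.21", "10.0.1.22"}))
```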

The Intel E5 V5 Skylake processors will be released this summer. 100 Gbps interfaces will become common. Most servers will likely be equipped with a single 100 Gbps interface.

PetaSAN will need a way to create alias interfaces, and probably to do VLAN tagging. Otherwise, it would have to be run on a hypervisor that creates virtual interfaces.