ZFS NetApp replacement

Suprnaut

I'm putting together a proposal to replace our NetApp. My company has complained about the cost of its disk shelves and about its performance. I would prefer to use a ZFS solution. The storage will mainly be used for an ESXi cluster.

The company prefers to buy Dell equipment, so that is what I have to work with. Please tell me whether you think this configuration is viable, and offer any suggestions. Thanks.

What I'm thinking is:
Server: Dell PowerEdge R710
Single 6-core Xeon processor
Max RAM (I think 128GB)
LSI SAS 9201-16e HBA
2 QLogic 8Gb Fibre Channel HBAs
Intel dual-port 10GbE copper NIC
Local hard drives to boot Solaris, and possibly SSDs or a PCIe card for cache.

Storage Shelves:
As many PowerVault MD1220 shelves, each filled with 24 1TB 2.5" drives, as we need.
 
ZFS can certainly use that RAM. What level of RAID were you thinking of? If you can afford the wasted space, I'd recommend RAID10. Hard to comment more on that aspect without knowing how many drives (ballpark) you are thinking of.
 
I'm not really concerned about RAID configurations. Mirrored (RAID1) vdevs are a possibility; I was thinking 6-disk RAIDZ2 vdevs, adding more to the pool as more disk shelves get added over time. I'm more concerned about the hardware being adequate and working well with ZFS.
 
Well, when I said RAID10, I really meant mirrored vdevs concatenated. You can keep adding them as you wish. 6-disk raidz2 will work too. The big advantage of raid1 (or 10) is that random reads perform much better than on any equivalent raidz*, since ZFS can pick data from either side of a mirror in parallel.
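For illustration, the two layouts look roughly like this at pool-creation time (a sketch only; 'tank' and the cXtYdZ device names are placeholders for your own):

# striped mirrors ("raid10"): every mirror pair is its own vdev
zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0
# grow it later one mirror pair at a time
zpool add tank mirror c1t4d0 c1t5d0

# versus a pool built from 6-disk raidz2 vdevs
zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
zpool add tank raidz2 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0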
 
Yep, big thing to remember: raidz2 doesn't perform like raid6.

A raid6 array will get good random read performance, but a raidz2 (or any raidz) will not, because it has to read from every data disk in the vdev for each block in order to reconstruct and checksum the data.

For your ESXi datastores I would use mirrors; if you also want to carve out a backup storage area, use raidz there.
 
On the other hand, it depends on your working set (disk-block-wise) from ESXi. If you really are throwing 128GB of RAM at the box, almost all of that can be used for ARC, and there is a good chance your ARC hit rate could be close to 100%, in which case random read performance is moot. Only you can answer that, though.
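If you want to check that once the box is live, the ARC hit rate can be read from the arcstats kstats on Solaris-derived systems (a sketch; the zfs:0:arcstats names are what I've seen, verify on your build):

kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses | awk '/:hits/{h=$2} /:misses/{m=$2} END{printf "%.1f%% ARC hit rate\n", 100*h/(h+m)}'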
 
Another point: you should get two 64GB SSDs and use them for a ZIL, especially if the client is ESXi and you are serving up NFS to it (ESXi forces sync mode for NFS). Also, with 128GB of RAM available for ARC, I'm not sure a cache device makes a lot of sense (this of course depends on how much storage you have - if you have 16TB or whatever of storage, an L2ARC makes sense, but that will be pretty expensive, since you would need 1TB or so for the cache).
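To make that concrete (a sketch; 'tank' and the SSD device names are placeholders): adding the pair as a mirrored log device looks like this, and it only ever absorbs the sync writes that ESXi's NFS mounts force:

zpool add tank log mirror c3t0d0 c3t1d0
zpool status tank    # the SSDs show up under a separate 'logs' section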
 
Another point: you should get two 64GB SSDs and use them for a ZIL, especially if the client is ESXi and you are serving up NFS to it (ESXi forces sync mode for NFS). Also, with 128GB of RAM available for ARC, I'm not sure a cache device makes a lot of sense (this of course depends on how much storage you have - if you have 16TB or whatever of storage, an L2ARC makes sense, but that will be pretty expensive, since you would need 1TB or so for the cache).

How did you come up with 1TB of cache? I was under the impression that any amount of read cache would be beneficial. I know that the ZIL should not be any larger than half the amount of memory you have installed, but I didn't realize there was a similar rule for L2ARC.
 
I'm putting together a proposal to replace our NetApp. My company has complained about the cost of its disk shelves and about its performance. I would prefer to use a ZFS solution. The storage will mainly be used for an ESXi cluster.

The company prefers to buy Dell equipment, so that is what I have to work with. Please tell me whether you think this configuration is viable, and offer any suggestions. Thanks.

What I'm thinking is:
Server: Dell PowerEdge R710
Single 6-core Xeon processor
Max RAM (I think 128GB)
LSI SAS 9201-16e HBA
2 QLogic 8Gb Fibre Channel HBAs
Intel dual-port 10GbE copper NIC
Local hard drives to boot Solaris, and possibly SSDs or a PCIe card for cache.

Storage Shelves:
As many PowerVault MD1220 shelves, each filled with 24 1TB 2.5" drives, as we need.

How much storage do you have with your current solution? How many VMs are there? How much growth do you expect to have? How are the VMware machines accessing the storage?
 
We currently have about 60TB in the NetApp; the new system will have substantially more storage. We have about 120 VMs running, connected to the NetApp via iSCSI over dual 8Gb QLogic cards.
 
The L2ARC wants to be a reasonable fraction of the actual disk space. If you think about it, 64GB of cache will not do you much good when you have 60TB of storage. That's about 1/10 of 1%. Yeah, sure, any amount helps, but the money for 1 or more cache SSDs could go to other, better use. And also, remember that you have 128GB of RAM, which will be used for the ARC (level 1 cache.) Generally, the L2ARC wants to be, say, 8-10X the size of the primary cache, hence my 1TB guess. Again, you are right, ANY cache won't hurt (but it may very well do no good whatsoever.)
 
We currently have about 60TB in the NetApp; the new system will have substantially more storage. We have about 120 VMs running, connected to the NetApp via iSCSI over dual 8Gb QLogic cards.
You might want to consider splitting the load between two systems. Even with 288GB of memory (the R710's maximum with two processors), that's only about 2GB of ARC per VM. I guess you should know how much working space your VMs use, or they will thrash the cache and your performance will suck. I'd order two processors just to get the extra memory capacity; the difference in speed between 192GB and 288GB probably isn't worth worrying about, so get the larger capacity.

How about reliability? Do you need any sort of dual-master capability? How much does it cost you (per hour, say) if this system goes down?

The L2ARC wants to be a reasonable fraction of the actual disk space. If you think about it, 64GB of cache will not do you much good when you have 60TB of storage. That's about 1/10 of 1%. Yeah, sure, any amount helps, but the money for 1 or more cache SSDs could go to other, better use. And also, remember that you have 128GB of RAM, which will be used for the ARC (level 1 cache.) Generally, the L2ARC wants to be, say, 8-10X the size of the primary cache, hence my 1TB guess. Again, you are right, ANY cache won't hurt (but it may very well do no good whatsoever.)
I think for this application lots of L2ARC is a good idea. You'll burn some of your RAM keeping track of it, but overall performance should go up. You should benchmark your system to determine the tradeoff between more money spent on SSDs for L2ARC and extra spindles. If you add a single L2ARC disk and don't improve performance, then it's not worth adding five more. ZIL is also worth playing with; I understand the old iscsitgtd hit ZIL hard, but COMSTAR doesn't. That may have changed in more recent releases; try adding a ZIL disk and see what happens.
 
"If you add a single L2ARC disk and don't improve performance, then it's not worth adding five more".

Not sure I agree with this - you are talking about a 6X increase in the size of that cache layer! If, say, one SSD gave you a 15% hit rate in the L2 cache, 6 of them should (hopefully) scale to about 90%, which would be a big win. As far as the ZIL goes, a 64GB SSD is only $100 or so, so mirroring it is still cheap. Also, keep in mind that given his RAM size, network bandwidth is going to be far more limiting than the RAM, so two much smaller ZILs would likely be fine.
 
Here's a thought vis-a-vis the L2ARC question. Assuming you have spare ports for several of them, buy, say, 8 128GB units and add one as a cache device. Run until the cache device is full, which you can tell by doing 'zpool iostat -v POOL' and watching the cache line. There are no spiffy L2ARC stats that I know of, so when I run 'zpool iostat -v' and see this:

cache                      -      -     -     -     -     -
  c0t500A07510324D633d0  59.6G    8M     1     1  159K  207K

I know it is full. Then do:

kstat | egrep '(l2_hits|l2_misses)'

which gives me something like:

l2_hits      1607972
l2_misses    3629773

So my L2ARC is only getting about a 30% hit rate - not so good (it is a 64GB unit, with 6GB of main RAM, 5GB of which is auto-sized for ARC). If the numbers suck, use 'zpool add' to add another SSD, wait until the cache is hot again, and re-measure. Keep going until you get good numbers or run out of SSDs. This is all predicated on 1TB of SSD being acceptable price-wise, and on you being able to RMA for refund any units you can't use...
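If you'd rather see a percentage than eyeball the raw counters, something like this works on Solaris-derived systems (a sketch; assumes the counters live under zfs:0:arcstats):

kstat -p zfs:0:arcstats:l2_hits zfs:0:arcstats:l2_misses | awk '/l2_hits/{h=$2} /l2_misses/{m=$2} END{printf "%.1f%% L2ARC hit rate\n", 100*h/(h+m)}'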
 
"If you add a single L2ARC disk and don't improve performance, then it's not worth adding five more".

Not sure I agree with this - you are talking about a 6X increase in the size of that cache layer! If, say, one SSD gave you a 15% hit rate in the L2 cache, 6 of them should (hopefully) scale to about 90%, which would be a big win. As far as the ZIL goes, a 64GB SSD is only $100 or so, so mirroring it is still cheap. Also, keep in mind that given his RAM size, network bandwidth is going to be far more limiting than the RAM, so two much smaller ZILs would likely be fine.

15% isn't zero. I agree with your conclusion, though; add L2ARC until you're not getting anything out of it, then stop.

If he's using 8Gb FC, then hitting L2ARC instead of in-RAM ARC could be a substantial performance hit; RAM is many GB/s, SSD is only hundreds of MB/s. If the VMs are mostly CPU-bound (little I/O), or do mostly writes, then using RAM to track L2ARC entries is wasteful. That RAM could be used for write caching, which will improve performance more.

So, long story short, the answer is Try It. Dell should be able to give you hardware for 30 days for free or cheap (assuming you have an established relationship); get in a server full of RAM, a couple shelves of disks, and a few SSDs. Play with it for 30 days, then buy the pieces you need.
 
It's not going to scale as 6 × 15%.

It's going to be more like 15%, 13%, 10%, 6%, 3%, 2%.

Each time you add another cache device, the remaining I/O is more and more random and harder to cache.

But then, this is all hypothetical without knowing his actual workload and working set size.

My system is very happy with 50GB of SSD cache and 16GB of RAM for a 6TB data size.
But I get a 92% ARC hit rate and a 7% L2ARC hit rate, leaving only 1% of reads going to the disks.

For ZIL size, you only need enough to hold about 5 seconds of writes. I haven't had any luck using a $100 60GB SSD for the ZIL; they keep crapping out on me, so I have just stuck to Intel SSDs for the ZIL and anything else for L2ARC.
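To put a rough number on that 5-second rule (my arithmetic, not the poster's): the ZIL only has to absorb what the clients can push at it, so at 10GbE line rate (~1.25GB/s) 5 seconds is only about 6GB, and dual 8Gb FC (~1.6GB/s combined) works out to around 8GB. Capacity-wise almost any SSD is big enough; the reason to spend more is write latency and endurance, not size.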
 
Yes, patrickdk, I see your point. I was being optimistic that much or most of his working set would fit in L2ARC. No way to know for now, and that much L2ARC is going to be pricey, but until he finds out, it's all guesswork. How are you computing the L2ARC hit rate?
 
Hmm, I can't remember how I calculated it; I thought I used arc_summary, but it doesn't seem to print that info.

I guess I just used the ARC hit/miss rates to get the % missed, then split that percentage up using the read IOPS ratio between my disks and my cache SSDs.
 
My management had a question:
The NetApp has hardware fault tolerance/load balancing. Is there any way to do that with this system? I assume I would need two servers to act as head units; how would I configure that?

As for RAM or cache, that shouldn't be an issue. We have many 240GB OCZ Deneva 2 SSDs, and we could always move to a bigger server with up to 1TB of RAM if need be.
 
You might be better off looking at Nexenta then. It's a commercial offering that uses ZFS and supports auto-tiering and failover. That's getting out of my area, though. If it's that critical, I don't think a home-rolled whitebox is the way to go... even if the physical server is not a whitebox, it sounds like the software will be. Check nexenta.com...
 
My management had a question:
The NetApp has hardware fault tolerance/load balancing. Is there any way to do that with this system? I assume I would need two servers to act as head units; how would I configure that?

Depending on the OS, you have some different options. If you go with FreeBSD you can look at HAST. You could also look at Nexenta, which has an HA plugin. I imagine there are similar things for Solaris 11 and its spawn. It will definitely add to the complexity of the setup, though, and you need to make sure both servers have paths to the disks, as well as handle takeover at the network layer.

In most NetApp setups that I've seen, both controllers have two paths to all disks (which they call multipath HA), giving a very high level of redundancy. They also make setting up the cluster for failover very simple.
 
OK, I'll look into Nexenta. I would rather use Solaris 11, though.

Another question I got was about data loss: if there were a power outage, what would happen to the data cached in RAM? Typically a RAID card would have a battery to protect its cache.
 
There is never pending data cached in RAM.

The only time ZFS would have pending write data in RAM is when you use a ZIL, but then that data isn't pending in RAM anymore, because it's on your ZIL.
 
Correct. This is why L2ARC disks do not need to be mirrored - if one fails, no big deal. Mirroring a ZIL, on the other hand, is critical, particularly for a mission-critical app. Keep in mind that the upcoming version of Nexenta will be based on the new illumos open-source Solaris kernel. Solaris 11 is fine if you are willing to pay for support. Again, though, my big concern here is that you not try to band-aid something together without understanding all the issues - if it blows up in your face, management is NOT going to be happy.
 
OK, so writes aren't cached unless I create a ZIL. The ZIL would be on physical SSDs; I've heard it is best to use RAID1.
 
Usually ZIL SSDs are mirrored (RAID1), correct. Actually, your first statement was not quite right: the ZIL is only used for sync-mode writes, and if you have no explicit log device, ZFS uses space on the pool itself for the ZIL, which can kill performance. Also, the ZIL isn't really a cache per se; it's a log device, used to ensure that consistent data gets to the disks. L2ARC devices ARE cache devices, but for reads only.
 
My understanding is that writes are cached in memory unless the write was requested as a sync write. So as long as your application does sync writes when it actually needs to guarantee data is on disk, you're fine. If it does not, you risk losing data currently in RAM during a power outage.

The ZIL only comes into play during a sync write. Without it, the write isn't complete until the data is written to disk; if you add a ZIL device, the data can be written there to "complete" the write, so it's safe until ZFS gets around to writing it to the pool disks.
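As an illustration of how that behavior is exposed (a sketch; the 'sync' dataset property, with 'tank/vms' as a hypothetical dataset name):

zfs get sync tank/vms            # 'standard' (the default) honors the application's sync requests
zfs set sync=always tank/vms     # treat every write as sync - safest, slow without a slog
zfs set sync=disabled tank/vms   # never sync - fast, but a crash can lose the last few seconds of writes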
 
You might be better off looking at Nexenta then. It's a commercial offering that uses ZFS and supports auto-tiering and failover. That's getting out of my area, though. If it's that critical, I don't think a home-rolled whitebox is the way to go... even if the physical server is not a whitebox, it sounds like the software will be. Check nexenta.com...

Although I understand about 5% of this conversation, this is one of the most important comments. Regardless of the technology and whether it is better, you are going from something where (I assume) you have commercial cover (and maybe existing skills within your company to support, manage, and grow your NetApp solution) to one that could be completely self-supported. Not sure I'd want to rely on the likes of danswartz and patrickdk being online if you have a critical error - maybe you could pay them :)

I'd also be asking where you are going to be with this system in, say, 3 years' time. What you are putting in now should last for a specific period, so understand that period and plan for it. How critical is your data to the business? That could angle the decision one way. What are your RTO/RPO requirements? That could influence your solution. I wouldn't assume that the present incumbent (the NetApp) necessarily meets these today - I would ask again.

Sorry for sounding so boring!!
 
One thing to consider is whether this storage solution will be supported by VMware. It's great to save money, but not if you lose support for your production VM environment. Just my .02.
 
If it is for production, maybe you should buy a product instead of doing it yourself.

1) Oracle has ZFS servers for sale.

2) I think Nexenta also has ZFS servers for sale - and cheaper than Oracle. A lot of ZFS developers from Sun quit when Oracle bought Sun and joined Nexenta instead, so Nexenta has good ZFS expertise.

3) Do-it-yourself might not be an option for production servers. You could use it for smaller servers first, as a test setup. Then, if it works fine, you can buy Solaris or Nexenta.
 
Absolutely this system is going to be a test for a while. The NetApp will stay in production for at least a few more years.
 
Have you looked at Sun/Oracle ZFS appliances at all (7xxx models)?
Especially in a setup like yours you can probably get them down to something like 30% of the list price.
 
Absolutely this system is going to be a test for a while. The NetApp will stay in production for at least a few more years.

Well... then I would suggest you wait to buy everything until six months before you are looking at upgrading. If you buy it now, you'd be swapping over to hardware that is already over three years old.

What you should do now is buy a second-hand server off eBay, start playing with ZFS in its different forms (Nexenta, OpenIndiana, Solaris, etc.), and get some experience with it.
 
I have played with ZFS before. I have an 18TB napp-it all-in-one ESXi Solaris box at home. I also have a 12TB Solaris test box at work.

You are right about the hardware; Dell is bringing out its 12th-generation hardware in March, as I'm sure other vendors are as well. We have a ton of hardware lying around doing nothing - not old, just not currently doing anything - which is why my proposal to actually use it is appealing. The hardware should be more than adequate.
 
It is important to remember that the L2ARC and ZIL are per pool. So if you plan on having a raidz2 pool and a RAID10 pool, you will need to factor in SSDs for each pool.
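In other words (a sketch with placeholder pool and device names), each pool gets its own devices:

zpool add vmpool log mirror c2t0d0 c2t1d0    # slog for the mirrored VM pool
zpool add vmpool cache c2t2d0                # L2ARC for the VM pool
zpool add backup cache c2t3d0                # the raidz2 backup pool needs its own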
 
So, hello, Nexenta support person here -- I've somehow managed to never create an account on this forum despite randomly finding myself reading it off and on for years... so thanks to Suprnaut for finally giving me a motivation to register.

I received a ticket from a coworker today here at Nexenta with a request to look at this page and comment on the proposed hardware design, and in reading through this post I felt compelled to touch on a few things I've seen in here.

First, a URL -- and please bear in mind I never finished this, it is incomplete and very wordy (maybe some day I'll have time to both finish it and squash its length), so while I welcome feedback/comments, I may not get around to acting on them for quite some time: http://www.nex7.com/readme1st

A lot of what you're discussing in here I have tried to provide a decent answer for on that page -- things like ZIL, ARC, L2ARC and so on. I see some not-so-great and some great advice being bandied around in this thread, but often without any real context around it, so hopefully that can provide some.

So, attempting not to touch on anything I cover at that link, I'd like to hit on a few of the posts here:

1. There was a lot of talk about L2ARC sizing, and how much L2ARC you need relative to how much pool space you have. Fact is, that's not a good measurement. If I have a 100 TB pool of entirely backup data that gets written and rarely read, but I also use the same pool for one single VM using 1 GB for a SQL DB (potentially stupid, but it's an example), then technically my feasible WSS (Working Set Size) is probably only a couple of GB (the hot SQL GB of data plus some metadata, etc.), easily handled by ARC, with no need for L2ARC.

Conversely, I can have a simple little 2 TB pool that is absolutely nothing but hot data read constantly by 250 VMs, and in that environment even a full 128 GB of ARC would still get a terrible hit rate. The catch here is to try to gauge Working Set Size (there is, of course, no easy tool or math to figure this out). My general rule of thumb is to max out RAM first, leave some slots free for potential use as L2ARC, and then add L2ARC if and only if the ARC statistics prove that it is necessary. Remember also that L2ARC is never as fast as ARC, AND L2ARC requires ARC space to address it -- the more L2ARC you have, the smaller your effective ARC will be (since the headers needed to address the L2ARC eat into it).

2. Hardware -- here I'm going to speak specifically to Nexenta, on the assumption that you want an HA system and a Gold or Platinum license from Nexenta. To do this, you have to go off Nexenta's HSL (hardware support list), located here: http://www.nexenta.com/corp/support/resources/hardware-compatibility-list

I do not recommend just cherry-picking components from this list and praying they work together -- either go through one of the many good Nexenta partners/resellers and buy one of their often cookie-cutter SKUs that are known-good builds, or work with a Nexenta SE to build something sane (or reach out to me). The original post indicated the following:

What I'm thinking is:
Server: Dell PowerEdge R710
Single 6-core Xeon processor
Max RAM (I think 128GB)
LSI SAS 9201-16e HBA
2 QLogic 8Gb Fibre Channel HBAs
Intel dual-port 10GbE copper NIC
Local hard drives to boot Solaris, and possibly SSDs or a PCIe card for cache.

I would comment on that as follows:

The Dell R710 is supported, but not its onboard Broadcom NICs. You mention Intel 10Gb NICs -- those are supported, but bear in mind they'll be the only supported ones. In general the Broadcom NICs on Dells tend to either work or not work on Nexenta, and even when they work, I'd recommend they be used solely for management (web GUI, SSH) and not for data access.

Max RAM -- there's a formula I give people when discussing RAM use in production enterprise Nexenta environments; it is this:

8 GB RAM
+ 1 GB RAM per raw TB of data disk (disks going into pools as data, not as spares/cache/slog)
+ 1 GB RAM per usable TB of data that you're planning to use dedupe on (yes, this is in addition to the above line)
Always round up to the nearest allotment your motherboard supports, never down (so if your math leads to 33 GB and your motherboard supports 32 GB or 48 GB, you go to 48 GB!)

For 60 TB (raw) with no dedupe, that would equate to 8 + 60, or 68 GB, which would likely round up to 72 GB of RAM on a Nehalem box. More is ALWAYS OK, let me be clear. If you've got the budget, max it out. Do not be concerned with 'RAM speed' -- quantity is far more important than maintaining Nehalem's maximum RAM bus speed.

LSI card -- in a Nexenta environment, LSI HBAs are probably the single most common type of card; we support most of them well, and there's a list in the HSL PDF. I bring it up, however, because you then mention the Dell MD1220 -- last I checked, Dell does not support plugging an MD1xxx (JBOD) array into an LSI SAS HBA, only into a Dell PERC RAID card (which Nexenta does not support). This has been an ongoing point of frustration for Nexenta and potential customers looking at Dell. Please note that if Dell won't support your hardware setup, there's nothing Nexenta can do about that...

Local disks -- obviously OK for the syspool (required, really), but be wary of using local in-chassis disks for cache or local in-chassis cards for slog on an HA system. Technically speaking you can get away with cache disks being local to each chassis in an HA system (though we don't recommend it), but you cannot use an in-chassis ZIL device at all: it must be accessible from both systems, since the slog devices are a required part of the pool in order to import it, so in-chassis disks or PCIe cards are right out if HA is desired.

3. I want to strongly +1000 the person who said use 'raid10' (a bunch of mirrored vdevs) if your intended use case includes virtualization. I cover this in the link I sent, but it bears special mention -- the random IOPS potential of ZFS is based almost entirely on how many vdevs you have, and almost not at all on the number of disks in each vdev. Because of this, pools of mirrored pairs are best, because by design they give you the most vdevs a pool can have (while still having redundancy, anyway).

4. I also want to +1000 the person who said you need a ZIL. First of all, I highly recommend using NFS for virtualization if possible; iSCSI is what you use when you can't use NFS, in my opinion. In either event, for virtualization, a good slog device is going to both improve write latency and offload a lot of load from the pool disks. In a large build like this, there is only ONE device I would recommend you get, and you need to get two (to mirror them): the STEC ZeusRAM. Don't even look at any other device. I know they're expensive, but you really do get what you pay for.

I will send this to you in the ticket you opened with Nexenta Support as well, but I will try to keep an eye on this thread for the next few days in case you or anyone else have any questions about ZFS, or building/sizing a ZFS system.
 