ENTERPRISE ZFS build: Intel 320 SSDs vs 15k HDD

hatchi

Hi,

First of all, I would really like to thank everyone who has contributed to this forum. The articles and discussions have helped me greatly since I decided to build this ZFS system.

I am in the process of building ZFS storage for my company. I am faced with the following choice and need some advice to help me decide.

I will build my ZFS system on NexentaStor Enterprise.
The node will have plenty of RAM (512 GB, and I am looking into whether I can secure a deal on the newer Intel platform that supports even more).

The only issue I am facing now is choosing the storage devices.
Based on the NexentaStor HCL I can go with either of the following (I am not keen on following the HCL in every detail, but for HDDs, HBAs and network adapters I guess I must follow it):
1- Hitachi HUS156060VLS600 (15k HDD)
2- Intel 320 600 GB SSD

If I go with the 15k HDDs I will use them in a layout similar to RAID 10: each vdev is a 2-disk mirror, and data is striped across the vdevs in the pool (26 drives in total).

If I go with the SSDs I will use the following config:
2 x 8-drive RAIDZ-2 (each RAIDZ-2 vdev will have 6 SSDs for data and 2 for parity).
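
Roughly, the two layouts would be built like this (just a sketch; the cXtYdZ device names are placeholders and only the first few of the 13 mirror pairs are written out):

Code:
# HDD option: a pool of 13 two-way mirrors (26 drives); ZFS stripes across all vdevs
zpool create tank \
    mirror c2t0d0 c2t1d0 \
    mirror c2t2d0 c2t3d0 \
    mirror c2t4d0 c2t5d0
    # ...one "mirror" pair per vdev, 13 pairs in total

# SSD option: a pool of 2 x 8-drive RAIDZ-2 vdevs (16 drives)
zpool create flash \
    raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0 c3t7d0 \
    raidz2 c3t8d0 c3t9d0 c3t10d0 c3t11d0 c3t12d0 c3t13d0 c3t14d0 c3t15d0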

Info about the system
1- My application is a virtualized environment with hundreds of small VMs.
2- Both solutions are financially acceptable (the SSDs are a bit more expensive, but I am willing to invest if the performance gain is 50% or more).
3- I will start with 16 SSDs in the SSD solution and 26 HDDs in the HDD solution.
4- The SSD solution will not have an L2ARC.
5- For the HDD solution, if I can secure more than 512 GB of RAM I will not use an L2ARC; if less, I will use one (is this correct?). See the sketch below on adding a cache device later.
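
For what it's worth, an L2ARC cache device can be added to or removed from a live pool at any time, so the decision does not have to be final. A minimal sketch, assuming a placeholder pool "tank" and cache device c4t0d0:

Code:
# add an SSD as an L2ARC (read cache) device to an existing pool
zpool add tank cache c4t0d0

# remove it again later if it turns out not to help
zpool remove tank c4t0d0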

My questions are:
1- Will I gain any performance benefit if I use the SSDs in RAIDZ-2 compared to the HDDs in RAID 10?
2- Will the SSD configuration give me enough redundancy if one or more SSDs fail? (Same question for the HDD configuration.)
3- Are the Intel 320 SSDs good enough for a server environment (24/7, 50% write / 50% read)?
4- If you can suggest a better configuration for either option, please advise.

Thanks
 
I do not use as many disks or as much RAM, but I have pools with 10k disks and pools with SSDs. Benchmarks aside, the SSD pools feel much more responsive than my disk pools.

I now use SSDs for all of my high-speed datastores. The only sore point was my bad experience with my first batch of SSDs (SandForce-based, 120 GB), which had nearly a 10% failure rate in the first year. I now use Intel 320s with only one failure up to last summer (I have 20 of them) in a 4 x 3-way mirror config (I started with the 3-way mirrors and extended the pool).

Because MLC may develop problems later on, I decided to use them only in 3-way mirrors, RAID-Z2 or RAID-Z3, and for a maximum of 3-4 years. But if money is not the question I would always use SSDs because of the much, much higher I/O, even without TRIM. I also try not to fill them above 70%.
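
The CAP column of zpool list is an easy way to watch that fill level:

Code:
# CAP shows how full each pool is; I try to keep it below roughly 70%
zpool list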
 
The problem with SSDs is that many lie about flushing data. This can cause the whole filesystem to break.
You need to find an SSD where you're 100% sure it doesn't lie. If you can't, you should go for an enterprise 15k SAS drive.
 
Thanks for the replies,

I am glad that someone has hands-on experience with both configs. Based on your answer, I would be interested to know how much wear your SSDs have accumulated since you started using them.
The problem with SSDs is that many lie about flushing data. This can cause the whole filesystem to break.
You need to find an SSD where you're 100% sure it doesn't lie. If you can't, you should go for an enterprise 15k SAS drive.
Do you have any information about the Intel SSDs? (From what I see, they are the only vendor that publishes a lot of technical detail about their SSDs.)
Also, it seems that users like their stability.

I have one question though:
Do the Intel SSDs require many firmware updates? And if they do, can the update be applied while the SSD is online, or does it have to be done offline?

I have another question, though maybe it is better to ask it in a new thread:
How much space saving, on average, will I gain if I enable data deduplication and compression?
Is it safe to assume that I will gain 3x the drive space?

Thanks
 
Thanks, lightp2, for the article.

Based on this article, once one SSD starts to fail due to wear-out, all the other SSDs will follow shortly.

While we currently don't have Differential RAID in place, we could work around this issue manually by replacing the SSDs before they reach their 5,000-erase-cycle limit.
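
One rough way to track how close the drives are getting, assuming smartmontools is available on the box (the device path is a placeholder and may need a -d option depending on the OS and HBA), is to watch the SMART wear attribute. On the Intel SSDs the Media_Wearout_Indicator starts at a normalized value of 100 and counts down as erase cycles are consumed:

Code:
# print the SMART attribute table for one SSD
smartctl -A /dev/rdsk/c3t0d0
# watch the Media_Wearout_Indicator (ID 233) value and plan replacement
# well before it approaches its threshold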

I would like to see some data on how fast that happens in a real-life server system (is it 1, 3 or 5 years?).
 
I have no experience with Diff-RAID implementations, and I am not aware of any hardware that provides Diff-RAID either.

There might be enterprise-grade controllers that take these issues into consideration, but I cannot confirm that.

However, there is a safer way (safer, not safest). What you can do is, for example (this is just one example; there are many variations):
1. Intel SSD 330 - 120 GB - one unit
2. OCZ Vertex 4 - 128 GB - one unit
3. Crucial M4 - 128 GB - one unit
4. Samsung 830 - 128 GB - one unit
(You can use other brands - Kingston, SanDisk, RunCore, Mushkin, etc. - as long as they are generally good. Obviously, do not deliberately pick models with well-known problems.)

You can then form a RAID 5 from the SSD 330 / OCZ Vertex 4 / Crucial M4 / Samsung 830.
1. Because they are different vendors' models, it is very likely they have different write behavior due to different internal firmware implementations.
2. This ensures there will never be a uniform pattern, so the write wear will likely be quite different for each drive.
3. Due to the different capacities, the wear level is altered again: some drives will have extra capacity left over, which effectively becomes additional reserved (over-provisioned) area.
3.1 Most standard RAID 5 implementations expect the involved drives to have the same capacity, otherwise they take the smallest size as the starting point.

4. If you want, you can also mix capacities, for example:
4.1 90 GB / 120 GB / 128 GB / 120 GB: since the smallest drive sets the usable size per disk, 4 drives at 90 GB each in RAID 5 give 270 GB usable (see the sketch after this list).
4.2 The critical point is that 90 GB is the starting point, so you waste 30 GB on each 120 GB drive and 38 GB on the 128 GB drive when they form a generic RAID 5.
4.3 For the earlier example, 120 / 128 / 128 / 128, you get 360 GB usable in a generic RAID 5.

5. So, from the examples above, for safer wear protection you use a many-vendors, many-models scheme, and finally many different capacities.
5.1 The problem with "many different capacities" is that most RAID implementations prefer whole-disk management: they do not want you to partition the drive, and the RAID engine takes the entire disk. Couple this with the smallest-common-size rule and you have the issue of wasted space. Strictly speaking it is not wasted but becomes part of the reserved area in a practical sense; however, some users want to use the maximum capacity.
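
To make the arithmetic in 4.1-4.3 concrete, here is a tiny shell sketch (the sizes are the hypothetical 90/120/128/120 GB mix from 4.1):

Code:
# Usable capacity of a generic RAID 5 is (number of drives - 1) x smallest drive.
sizes="90 120 128 120"
n=0; min=""
for s in $sizes; do
    n=$((n + 1))
    if [ -z "$min" ] || [ "$s" -lt "$min" ]; then min=$s; fi
done
echo "usable: $(( (n - 1) * min )) GB"    # prints "usable: 270 GB"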

Finally, I have something that all vendors will like to hear:

1. Under this scheme, it is BEST for you to use multiple vendors' products.

2. As such, everybody has an opportunity to present their products for consideration. You do not have to choose one over the other.

However, think through the logic in this post: do you agree it is likely more sensible?

3. Looks like this will keep everybody happy :)

Cheers
 
Thanks for the replies,
I have another question, though maybe it is better to ask it in a new thread:
How much space saving, on average, will I gain if I enable data deduplication and compression?
Is it safe to assume that I will gain 3x the drive space?

Thanks

It depends on the dataset; on mine (virtual machines) I see a compressratio of 1.5x.

We've found dedup on ZFS not to be worthwhile: the performance degradation is large for minimal savings (1.07x, IIRC, on 12 TB of mixed Windows and Linux VM images).

The worry is that if you ever reach the point where the DDT grows larger than available RAM (unlikely with 512 GB, I would guess), performance drops off a cliff and the array becomes unusable.
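
For what it's worth, both numbers can be measured on existing data before committing to anything (a sketch; "tank" and "tank/vms" are placeholder names):

Code:
# compression ratio actually achieved on a dataset
zfs get compressratio tank/vms

# simulate deduplication across an existing pool: prints the projected dedup
# ratio and a DDT histogram you can use to estimate its RAM footprint
zdb -S tank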
 
lightp2:
Your solution looks interesting, although I don't think it is a good fit for an enterprise deployment.
I have two problems with your proposed solution:
1- The main cause of the SSD wear-out problem is usually the NAND flash used to build the MLC SSD, and from my reading most consumer-grade SSDs use the same NAND (Intel NAND) or other NAND with similar characteristics.
So changing the controller does not resolve the wear-out issue; the drives will wear out at the same or a similar rate.
2- When using more than one SSD controller in a RAID vdev, we inherit all of the controllers' problems: we get a vdev that combines every drive's issues, and its performance will be similar to that of the slowest SSD in the group.

It depends on the dataset; on mine (virtual machines) I see a compressratio of 1.5x.

We've found dedup on ZFS not to be worthwhile: the performance degradation is large for minimal savings (1.07x, IIRC, on 12 TB of mixed Windows and Linux VM images).

The worry is that if you ever reach the point where the DDT grows larger than available RAM (unlikely with 512 GB, I would guess), performance drops off a cliff and the array becomes unusable.

This info is exactly what I was looking for.
At only 1.5x, I guess the SSDs are still an expensive solution (I was hoping for a 3x saving; only then would the SSDs make sense).

Agreed. Stay away from dedup until it has been patched considerably.
Thanks. It seems this feature looks better on paper than in a real implementation.

Why not go with an enterprise SSD solution if it's in the budget?
Enterprise SSDs are much more expensive (around $10 or more per GB), and when you add the RAID overhead and the spare drives the price reaches more than $15 per usable GB, which is way above the budget.

I have a question which I should have asked at the beginning:
Can we use SATA-to-SAS interposer boards with the Intel SSDs, and how would performance/stability be affected?
We need all the drives to be SAS so that we can support an HA configuration, either from the beginning or in the future.

Thanks
 
lightp2:
This info is exactly what I was looking for.
At only 1.5x, I guess the SSDs are still an expensive solution (I was hoping for a 3x saving; only then would the SSDs make sense).


I should add that this is using the default compression algorithm.
You would probably get a better ratio using gzip-9, but there is a speed tradeoff, whereas the default algorithm is seemingly free.
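
If you want to test that, the algorithm is set per dataset, so it is easy to compare on a copy of the data (a sketch; "tank/vms" is a placeholder, and only blocks written after the change are compressed with the new setting):

Code:
zfs set compression=gzip-9 tank/vms
zfs get compression,compressratio tank/vms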
 
512GB RAM!!!
YUP :)
I found that adding 512 GB of RAM costs almost the same as a good SSD read cache (4x Intel SSD 710), and RAM is much better. (Actually the RAM route is a bit more expensive, since you have to invest in newer hardware and more processors, but I think it is worth it :) )
I should add that this is using the default compression algorithm.
You would probably get a better ratio using gzip-9, but there is a speed tradeoff, whereas the default algorithm is seemingly free.
I don't think gzip-9 could jump the ratio from 1.5x to 3x. However, if gzip-9 did give us a 3x compression ratio I would go with it, as I think the performance tradeoff would be balanced by the fast SSD performance, and the total performance of the system would still be higher than with 15k HDDs.

I wish someone here had tried the SSDs with SATA-to-SAS interposers and could tell us about their experience.

About the HDDs: is it possible that, since the HDDs will be arranged in mirrored vdevs (each pair of drives forming one mirror vdev), they will give us performance similar to the SSDs, given that the SSDs will be in RAIDZ-2 vdevs?

In the HDD solution we will have 13 vdevs with data striped across them, while in the SSD solution we will have only 2 vdevs.

So the vdev ratio between the HDD and SSD solutions is approximately 6:1.
I was hoping that this vdev-ratio difference would close the gap in read/write performance between the SSDs and the HDDs.
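
Since random IOPS scale roughly with the number of vdevs (a RAIDZ-2 vdev behaves more or less like a single drive for small random I/O), one way to see how the load actually spreads is to watch per-vdev activity while the VMs are running (a sketch; "tank" is a placeholder pool name):

Code:
# per-vdev bandwidth and IOPS, refreshed every 5 seconds
zpool iostat -v tank 5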

Thanks
 