Deduplication vs Compression

When we were looking at storage devices (SANs), one question I had to reason through was whether data should be deduplicated or compressed. Both options are gaining popularity, both on storage devices and in operating systems.

For anybody reading this who is unfamiliar with compression: in a nutshell, it means encoding data so that it uses fewer bits than the original file. There is more to the topic (lossy versus lossless compression, for instance), but I would suggest looking that up, as other people have covered it in far more depth than I will here. Compression can also be done in two ways: compressing a file after it has been written, or in-line (real-time) compression as the data is written (which is what most storage boxes do).
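
As a quick illustration (just a sketch using Python's zlib, not what a storage appliance does internally), lossless compression on repetitive data looks like this:

```python
import zlib

# Very repetitive data compresses extremely well.
original = b"the quick brown fox jumps over the lazy dog\n" * 1000
compressed = zlib.compress(original, level=6)

print("original:  ", len(original), "bytes")
print("compressed:", len(compressed), "bytes")

# Lossless means decompression returns the exact original bytes.
assert zlib.decompress(compressed) == original
```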

Deduplication eliminates duplicate copies of repeating data, and it usually happens at the block level. As with compression, there is inline deduplication and “post-process” deduplication. Inline deduplication happens as the data is being written, and with most storage appliances I have found that it requires quite a bit of memory. Some claim 1GB of RAM per 1TB of data; I think that is a gross underestimate, though it really depends on your appliance. If you think about how inline deduplication works, it has to check the file system (or MFT / inode structures) for a duplicate before it even writes a block, and if the block already exists, it only writes a reference to the full block that contains the data. This means that if you have a lot of similar data, you can potentially save a LOT of space.
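
To make that concrete, here is a rough sketch of the inline idea in Python. The block size, SHA-256 hashing, and in-memory dictionaries are my own assumptions for illustration, not how any particular appliance actually stores things:

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; real appliances vary


def inline_dedupe_write(data, block_store, file_map):
    """Hash each block before writing it; if an identical block was already
    stored, record only a reference instead of writing the data again."""
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:   # unseen block: store it once
            block_store[digest] = block
        file_map.append(digest)         # always record a reference


block_store, file_map = {}, []
inline_dedupe_write(b"A" * 8192 + b"B" * 4096, block_store, file_map)
print(len(block_store), "unique blocks stored,", len(file_map), "references")
# -> 2 unique blocks stored, 3 references
```

The lookup table (block_store here) is also why inline dedupe is so memory-hungry: it has to be consulted on every single write.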

So, on to the nitty-gritty. I’ve found that deduplication is excellent for data that just sits there; in other words, data that is not being accessed frequently. Good candidates I’ve used deduplication on are backup volumes, VDI deployments (which I’ll get to in a moment), and other storage where you have a lot of potential duplicate blocks (duplicate files count too, but since most dedupe works at the block level, it is really duplicate blocks that matter).

Most deduplication appliances provide inline deduplication. Some are post-process (such as Windows Server’s Data Deduplication feature, which actually works pretty well), but which approach fits best really depends on your usage. Post-process requires more space because the data sits there at full size until either a schedule kicks off the dedupe job or the system calms down enough to run it in the background (background dedupe). Scheduled or background post-process dedupe also generally creates a lot of disk activity (thrashing); see the image below of the overnight dedupe processing (a rough sketch of the post-process idea follows it):

[Image: disk activity during the overnight dedupe run]
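
For contrast, here is the post-process version of the earlier sketch (same made-up structures): everything is written in full first, and a later pass collapses the duplicates, which is where the extra space requirement and the overnight thrashing come from:

```python
import hashlib


def post_process_dedupe(written_blocks):
    """Scheduled pass over blocks that were already written at full size;
    duplicates are collapsed into references after the fact."""
    block_store, file_map = {}, []
    for block in written_blocks:
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)   # keep the first copy only
        file_map.append(digest)
    return block_store, file_map


# Three blocks sitting on disk at full size, two of them identical.
written = [b"A" * 4096, b"B" * 4096, b"A" * 4096]
store, refs = post_process_dedupe(written)
print(len(written), "blocks before the pass,", len(store), "after")
```

Having to read every block back just to hash it is a big part of the disk activity you see during the scheduled run.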

Compression, on the other hand, puts a bit more stress on the processor (and perhaps memory as well), since the processor is what re-encodes the data.

So… on to VDI and deduplication. At face value it seems like a great idea, and I have tried it thinking it would be, but for some reason, both on Nexenta and on Windows Server (Windows iSCSI Target with deduplication enabled), it absolutely brings the volumes to a crawl. At the time I didn’t have time to do much research on it because the complaints were coming in and I was in a hurry to get the VMs onto a volume that was not deduped, so perhaps this needs a bit more research.

All in all, I’m a bigger fan of compression because I’d rather the disks not be hit so hard. Today’s processors can handle most types of compression without much problem (unless you’re using gzip at level 8 or 9), and decompression with a well-written compression algorithm can happen at RAM speed on most multi-core systems (which can be pretty fast).
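
If you want to see the CPU trade-off for yourself, a rough sketch is to time zlib at a low and a high level (the sample data here is made up; real results depend heavily on the workload and hardware):

```python
import time
import zlib

# Semi-repetitive sample data, loosely imitating log lines.
data = b"".join(
    b"2015-06-01 12:00:00 INFO request %08d served in %3d ms\n" % (i, i % 997)
    for i in range(200000)
)

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level=level)
    elapsed = time.perf_counter() - start
    print("level %d: %.3fs, %.1f%% of original size"
          % (level, elapsed, 100.0 * len(compressed) / len(data)))
```

The higher levels generally cost disproportionately more CPU time for a fairly small improvement in ratio, which is exactly the trade-off I mean.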

Below is an image of Nexenta using LZ4 compression; the CPU is not being hit too hard (this particular storage box has about 15 VMs on it that are at least fairly active).

[Image: Nexenta CPU usage with LZ4 compression enabled]