Data Compression Improvements in Amazon Redshift Bring Compression Ratios Up to 4x
- April 6, 2017
Maor Kleider, Senior Product Manager with Amazon Redshift, wrote today’s guest post.
Amazon Redshift, is a fast, fully managed, petabyte-scale data warehousing service that makes it simple and cost-effective to analyze all of your data. Many of our customers, including Scholastic, King.com, Electronic Arts, TripAdvisor and Yelp, migrated to Amazon Redshift and achieved agility and faster time to insight, while dramatically reducing costs.
Columnar compression is an important technology in Amazon Redshift. It both helps reduce customer costs by increasing the effective storage capacity of our nodes and improves performance by reducing I/O needed to process SQL requests. Improving I/O efficiency is very important for data warehousing. Last year, our I/O enhancements doubled query throughput. Let’s talk about some of the new compression improvements we’ve recently added to Amazon Redshift.
First, we added support for the Zstandard compression algorithm, which offers a good balance between a high compression ratio and speed in build 1.0.1172. When applied to raw data in the standard TPC-DS, 3 TB benchmark, Zstandard achieves 65% reduction in disk space. Zstandard is broadly applicable. You can apply it to any of the following data types: SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE PRECISION, BOOLEAN, CHAR, VARCHAR, DATE, TIMESTAMP and TIMESTAMPTZ.
Second, we’ve improved the automation of compression on tables created by the CREATE TABLE AS, CREATE TABLE or ALTER TABLE ADD COLUMN commands. Starting with Build 1.0.1161, Amazon Redshift automatically chooses a default compression for the columns created by those commands. Automated compression happens when we estimate that we can reduce disk space without degrading query performance. Our customers have seen up to 40% reduction in disk space.
Third, we’ve been optimizing our internal on-disk data structures. Our preview customers averaged a 7% reduction in disk space usage with this improvement. This feature is delivered starting with Build 1.0.1271.
Finally, we have enhanced the ANALYZE COMPRESSION command to estimate disk space reduction. You can now easily identify opportunities to further compress data and improve performance. Behind the scenes, we sample your data and suggest the most effective compression. You can then specify the recommended encodings or your preferred encodings based on your own evaluation.
“Before all the recent compression features, our largest table was over 7 TB. It’s now only 4.85 TB, which is an additional 30.7% reduction in disk space. This allows us to reduce our disk space by 4X in total and our effective cost to less than $250/TB/Year on an uncompressed data basis. We’re now able to analyze more data with Amazon Redshift, and our query performance has gotten even better.” Chuong Do, Director of Analytics, Coursera
Of course, the actual benefits you see on your clusters will depend upon your workload and your data. In combination, these improvements may reduce your data sets by up to 4x vs. the 3x most of our customers saw before.
You may have heard us talk about how an Amazon Redshift data warehouse can cost as little as $1,000 per terabyte per year. It is important to realize that we’re talking about compressed data in this number. After all, that’s what we store. Not all vendors do this – many compress your data under the covers but describe per-terabyte costs in terms of uncompressed data. That’s unfortunate – the difference between talking in terms of uncompressed data and compressed data can be a significant overstatement.