Launch: Amazon Athena adds support for Querying Encrypted Data
- April 14, 2017
In November of last year, we brought a service to the market that we hoped would be a major step toward helping those who have the need to securely access and examine massive amounts of data on a daily basis. This service is none other than Amazon Athena which I think of as a managed service that is attempting “to leap tall queries in a single bound” with querying of object storage. A service that provides AWS customers the power to easily analyze and query large amounts of data stored in Amazon S3.
Amazon Athena is a serverless interactive query service that enables users to easily analyze data in Amazon S3 using standard SQL. At Athena’s core is Presto, a distributed SQL engine to run queries with ANSI SQL support and Apache Hive which allows Athena to work with popular data formats like CSV, JSON, ORC, Avro, and Parquet and adds common Data Definition Language (DDL) operations like create, drop, and alter tables. Athena enables the performant query access to datasets stored in Amazon Simple Storage Service (S3) with structured and unstructured data formats.
You can write Hive-compliant DDL statements and ANSI SQL statements in the Athena Query Editor from the AWS Management Console, from SQL clients such as SQL Workbench by downloading and taking advantage of the Athena JDBC driver. Additionally, by using the JDBC driver you can run queries programmatically from your desired BI tools. You can read more about the Amazon Athena service from Jeff’s blog post during the service release in November.
After releasing the initial features of the Amazon Athena service, the Athena team kept with the Amazon tradition of focusing on the customer by working diligently to make your customer experience with the service better. Therefore, the team has added a feature that I am excited to announce; Amazon Athena now provides support for Querying Encrypted data in Amazon S3. This new feature not only makes it possible for Athena to provide support for querying encrypted data in Amazon S3, but also enables the encryption of data from Athena’s query results. Businesses and customers who have requirements and/or regulations to encrypt sensitive data stored in Amazon S3 are able to take advantage of the serverless dynamic queries Athena offers with their encrypted data.
Before we dive into the using the new Athena feature, let’s take some time to review the supported encryption options that S3 and Athena supports for customers needing to secure and encrypt data. Currently, S3 supports encrypting data with AWS Key Management Service (KMS). AWS KMS is a managed service for the creation and management of encryption keys used to encrypt data. In addition, S3 supports customers using their own encryption keys to encrypt data. Since it is important to understand the encrypted options that Athena supports for datasets stored on S3, in the chart below I have provided a breakdown of the encryption options supported with S3 and Athena, as well as, noted when the new Athena table property, has_encrypted_data, is required for encrypted data access.
For more information on Amazon S3 encryption with AWS KMS or Amazon S3 Encryption options, review the information in the AWS KMS Developer Guide on How Amazon Simple Storage Service (Amazon S3) Uses AWS KMS and Amazon S3 Developer Guide on Protecting Data Using Encryption respectively.
Creating & Accessing Encrypted Databases and Tables
As I noted before, there are a couple of ways to access Athena. Of course, you can access Athena through the AWS Management Console, but you also have the option to use the JDBC driver with SQL clients like SQL Workbench and other Business Intelligence tools. In addition, the JDBC driver allows for programmatic query access.
Enough discussion, it is time to dig into this new Athena service feature by creating a database and some tables, running queries from the table and encryption of the query results. We’ll accomplish all this by using encrypted data stored in Amazon S3.
If this is your first time logging into the service, you will see the Amazon Athena Getting Started screen as shown below. You would need to click the Get Started button to be taken the Athena Query Editor.
Now that we are in the Athena Query Editor, let’s create a database. If the sample database is shown when you open your Query Editor you would simply start typing your query statement in the Query Editor window to clear the sample query and create the new database.
I will now issue the Hive DDL Command, CREATE DATABASE
Once I receive the confirmation that my query execution was successful in the Results tab of Query Editor, my database should be created and available for selection in the dropdown.
I now will change my selected database in the dropdown to my newly created database, tara_customer_db.
With my database created, I am able to create tables from my data stored in S3. Since I did not have data encrypted with the various encryption types, the product group was kind enough to give me some sample data files to place in my S3 buckets. The first batch of sample data that I received was encrypted with SSE-KMS which if you recall from the encryption table matrix we discussed above is encryption type, Server-Side Encryption with AWS KMS–Managed Keys. I stored this set of encrypted data in my S3 bucket aptly named: aws-blog-tew-posts/SSE_KMS_EncryptionData. The second batch of sample data was encrypted with CSE-KMS, which is the encryption type, Client-Side Encryption with AWS, and is stored in my aws-blog-tew-posts/ CSE_KMS_EncryptionData S3 bucket. The last batch of data I received is just good old-fashioned plain text, and I have stored this data in the S3 bucket, aws-blog-tew-posts/PlainText_Table.
Remember to access my data in the S3 buckets from the Athena service, I must ensure that my data buckets have the correct permissions to allow Athena access each bucket and data contained therein. In addition, working with AWS KMS encrypted data requires users to have roles that include the appropriate KMS key policies. It is important to note that to successfully read KMS encrypted data, users must have the correct permissions for access to S3, Athena, and KMS collectively.
There are several ways that I can provide the appropriate access permissions between S3 and the Athena service:
To learn more about the Amazon Athena access permissions and/or the Amazon S3 permissions by reviewing the Athena documentation on Setting User and Amazon S3 Bucket Permissions.
Since my data is ready and setup in my S3 buckets, I just need to head over to Athena Query Editor and create my first new table from the SSE-KMS encrypted data. My DDL commands that I will use to create my new table, sse_customerinfo, is as follows:
CREATE EXTERNAL TABLE sse_customerinfo( c_custkey INT, c_name STRING, c_address STRING, c_nationkey INT, c_phone STRING, c_acctbal DOUBLE, c_mktsegment STRING, c_comment STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://aws-blog-tew-posts/SSE_KMS_EncryptionData';
I will enter my DDL command statement for the sse_customerinfo table creation into my Athena Query Editor and click the Run Query button. The Results tab will note that query was run successfully and you will see my new table show up under the tables available for the tara_customer_db database.
I will repeat this process to create my cse_customerinfo table from the CSE-KMS encrypted batch of data and then the plain_customerinfo table from the unencrypted data source stored in my S3 bucket. The DDL statements used to create my cse_customerinfo table are as follows:
CREATE EXTERNAL TABLE cse_customerinfo ( c_custkey INT, c_name STRING, c_address STRING, c_nationkey INT, c_phone STRING, c_acctbal DOUBLE, c_mktsegment STRING, c_comment STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://aws-blog-tew-posts/CSE_KMS_EncryptionData' TBLPROPERTIES ('has_encrypted_data'='true');
Again, I will enter my DDL statements above into the Athena Query Editor and click the Run Query button. If you review the DDL statements used to create the cse_customerinfo table carefully, you will notice a new table property (TBLPROPERTIES) flag, has_encrypted_data, was introduced with the new Athena encryption capability. This flag is used to tell Athena that the data in S3 to be used with queries for the specified table is encrypted data. If take a moment and refer back to the encryption matrix table we I reviewed earlier for the Athena and S3 encryption options, you will see that this flag is only required when you are using the Client-Side Encryption with AWS KMS–Managed Keys option. Once the cse_customerinfo table has been successfully created, a key symbol will appear next to the table identifying the table as an encrypted data table.
Finally, I will create the last table, plain_customerinfo, from our sample data. Same steps as we performed for the previous tables. The DDL commands for this table are:
CREATE EXTERNAL TABLE plain_customerinfo( c_custkey INT, c_name STRING, c_address STRING, c_nationkey INT, c_phone STRING, c_acctbal DOUBLE, c_mktsegment STRING, c_comment STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://aws-blog-tew-posts/PlainText_Table';
Great! We have successfully read encrypted data from S3 with Athena, and created tables based on the encrypted data. I can now run queries against my newly created encrypted data tables.
Running Queries against our new database tables is very simple. Again, common DDL statements and commands can be used to create queries against your data stored in Amazon S3. For our query review, I am going to use Athena’s preview data feature. In the list of tables, you will see two icons beside the tables. One icon is a table property icon, selecting this will bring up the selected table properties, however, the other icon, displayed as an eye symbol, and is the preview data feature that will generate a simple SELECT query statement for the table.
To demonstrate running queries with Athena, I have selected to preview data for my plain_customerinfo by selecting the eye symbol/icon next to the table. The preview data feature creates the following DDL statement:
SELECT * FROM plain_customerinfo limit 10;
The query results from using the preview data feature with my plain_customerinfo table are displayed in the Results tab of the Athena Query Editor and provides the option to download the query results by clicking the file icon.
The new Athena encrypted data feature also supports encrypting query results and storing these results in Amazon S3. To take advantage of this feature with my query results, I will now encrypt and save my query data in a bucket of my choice. You should note that the data table that I have selected is currently unencrypted.
First, I’ll select the Athena Settings menu and the review the current storage settings for my query results. Since I do not have a KMS key to use for encryption, I will select the Create KMS key hyperlink and create a KMS key for use in encrypting my query results with Athena and S3. For details on how to create a KMS key and configure the appropriate user permissions, please see http://docs.aws.amazon.com/kms/latest/developerguide/create-keys.html.
After successfully creating my s3encryptathena KMS key and copying the key ARN for use in my Athena settings, I return to the Athena console Settings dialog and select the Encrypt query results textbox. I, then update the Query result location textbox point to my s3 bucket, aws-athena-encrypted, which will be the location for storing my encrypted query results.
The only thing that is left is to select the Encryption type and enter my KMS key. I can do this by either selecting the s3encryptathena key from the Encryption key dropdown or enter its ARN in the KMS key ARN textbox. In this example, I have chosen to use SSE-KMS for the encryption type. You can see both examples of selecting the KMS key below. Clicking the Save button completes the process.
Now I will rerun my current query for my plain_customerinfo table. Remember this table is not encrypted, but with the Athena settings changes made for adding encryption for the query results, I have enabled the query results run against this table to be stored with SSE-KMS encryption using my KMS key.
After my query rerun, I can see the fruits of my labor by going to the Amazon S3 console and viewing the CSV data files saved in my designated bucket, aws-athena-encrypted, and the SSE-KMS encryption of the bucket and files.
Needless to say, this Athena launch has several benefits for those needing to secure data via encryption while still retaining the ability to perform queries and analytics for data stored in varying data formats. Additionally, this release includes improvements I did not dive into with this blog post.
See the release documentation in the Athena user guide to more details and start leveraging Athena to query your encrypted data stored in Amazon S3 now, by reviewing the Configuring Encryption Options section in the Athena documentation.
Learn more about Athena and serverless queries on Amazon S3 by visiting the Athena product page or reviewing the Athena User Guide. In addition, you can dig deeper on the functionality of Athena and data encryption with S3 by reviewing the AWS Big Data Blog post: Analyzing Data in S3 using Amazon Athena and the AWS KMS Developer Guide.