How To Configure And Query The Retirement Archive Athena

When users have completed the retirement process a job will archive their retirement rows to an encrypted S3 bucket for auditing purposes. The retirement should be locked down to just those with a need to access that sensitive information. The process for setting this up from the Athena end and querying that data is detailed herein.

Configure S3

  • Create a new S3 bucket (we use Terraform for this) in the region of your choice
    • Take note of the bucket name and region you choose
  • Make sure to turn on "automatically encrypt objects when they are stored in S3", this is what keeps our learner data safe at rest!
    • You will need to choose an encryption algorithm. You must use AES-256 otherwise uploading from Tubular will not work.
  • You may wish to turn on versioning or other features, but they are not required
  • You may wish to add permissions for other accounts at this time, but make sure public permissions and system permissions are not granted

Gather Some Data

  • The production process of getting files into S3 will be:
    • A Tubular script is run on a cron via Jenkins
    • The script gathers old retirements from LMS and creates a gzipped JSON archive file
    • The archive file is uploaded to S3
    • The archived retirements are permanently deleted from the LMS database
  • For testing the Platform team can provide some properly formatted files

Configure Athena

  • I used the AWS Console for this, though database creation can be handled by Terraform
  • Make sure you are in the same region as the S3 bucket you created (check top right of the console). Encrypted buckets can only be queried in the same region.
  • To connect to a data source you must go through the "connect data source" workflow by clicking on the "Connect Data Source" link and selecting "S3 - AWS Glue Data Catalog"
  • Step 1: Name & Location
    • Under database select "Create new database"
    • Put in your database and table names and the S3 bucket in "Location of input data set" (ex: s3://athena-test-2018-10-23/)
    • Check the "Encrypted data set" box!
    • Leave the "External" box checked. You can't uncheck it anyway.
  • Step 2: Data Format
    • Select JSON
  • Step 3: Columns
    • Click "Bulk Add Columns"
    • Paste this in:
      • user_id int, original_username string, original_email string, original_name string, retired_username string, retired_email string, retirement_request_date timestamp, last_modified_date timestamp
    • Click "Add"
  • Step 4: Partitions
    • I did not add any
  • Click "Create Table" and you're done.

Querying Athena

If everything is set up correctly you should now be able to issue queries against the encrypted, archived files securely. Here are some useful sample queries to run:

SELECT * FROM "your database name"."your table name"

SELECT * FROM "your database name"."your table name"
WHERE original_username = 'username';

SELECT * FROM "your database name"."your table name"
WHERE original_email = 'deleteme7@foo.invalid';

SELECT * FROM "your database name"."your table name"
WHERE retirement_request_date < timestamp '2019-05-29 12:38:51.059'