Adding a foreign key to an existing large table is a complex process that needs to be done very carefully to avoid downtime.
Before continuing here, you should fully read Everything About Database Migrations, paying special attention to the section about large tables.
For large tables, you generally cannot “just” do a migration that adds a foreign key. ALTER TABLE operations in MySQL - and that includes AWS Aurora for MySQL - can have a significant performance impact on large tables, which causes downtime. In the Open edX space, we avoid downtime.
MySQL 8.0, and with it Aurora MySQL version 3, supports Fast DDL. Fast DDL provides options for specific operations to alter a table while keeping the database online and performing well. So to add a foreign key to a large table, we need to use Fast DDL and make sure that we do not perform slow operations, such as anything that rebuilds the table.
To do everything correctly, you should acquire an in-depth understanding of the topics involved. Carefully read these documents:
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.FastDDL.html
https://dev.mysql.com/doc/refman/8.0/en/innodb-online-ddl-operations.html
https://dev.mysql.com/doc/refman/8.0/en/alter-table.html#alter-table-foreign-key
For every operation performed by the migration, you want to ensure that it does not rebuild the table, and that it is performed in place.
Some of the operations that incur downtime can be problematic even for smaller existing tables (adding primary keys, etc.).
Online DDL Operations, Instant DDL (Aurora), and Fast DDL cover roughly the same concepts; they are key to performance and avoiding downtime.
Directly adding a foreign key incurs downtime.
Making a foreign key non-nullable incurs downtime.
Adding a new nullable column is fine.
Adding an index to an existing column where all values are NULL can be done without downtime using ALGORITHM=INPLACE, LOCK=NONE. Make sure there are no non-null values first.
While the database may choose these options by default, add them explicitly, so that the statement fails loudly if they are not supported rather than being silently ignored and falling back to a slower algorithm.
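As an illustration, an index on the new column could be added like this (a sketch with hypothetical table and column names; in a Django migration this SQL would typically go into migrations.RunSQL):

```sql
-- Hypothetical table and column names; adjust to your schema.
-- If InnoDB cannot satisfy INPLACE/NONE, this statement fails
-- immediately instead of silently falling back to a table rebuild.
ALTER TABLE my_app_mytable
    ADD INDEX my_app_mytable_new_field_idx (new_field),
    ALGORITHM=INPLACE, LOCK=NONE;
```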
When doing bulk updates, such as backfilling values in management commands, avoid reading or updating all records at once; otherwise you load everything into memory or run a single slow query.
Use batching for reading and writing
MySQL does not support streaming results. If you use <queryset>.iterator(), the driver will ignore the hint and still load all rows into memory without batching. Instead, use something like the batch_by_pk method in the edx-enterprise repo.
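As a rough sketch of what pk-based batching looks like (the fetch_page callable below is a hypothetical stand-in for a Django queryset filtered with pk__gt, not the actual edx-enterprise implementation):

```python
def batch_by_pk(fetch_page, batch_size=1000):
    """Yield lists of rows in primary-key order, paging with pk > last_pk.

    `fetch_page(last_pk, limit)` is a hypothetical stand-in for an ORM call
    such as:
        list(Model.objects.filter(pk__gt=last_pk).order_by('pk')[:limit])
    Each row is assumed to expose its primary key under the "pk" key.
    Paging by pk keeps every query an indexed range scan and keeps memory
    usage bounded by batch_size, regardless of table size.
    """
    last_pk = 0
    while True:
        page = fetch_page(last_pk, batch_size)
        if not page:
            break
        yield page
        last_pk = page[-1]["pk"]


# Simulated table of 10 rows to demonstrate the paging behaviour.
rows = [{"pk": i} for i in range(1, 11)]

def fetch_page(last_pk, limit):
    return [r for r in rows if r["pk"] > last_pk][:limit]

batches = list(batch_by_pk(fetch_page, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```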
Querysets are lazy: the SQL query runs and the contents are loaded into memory only when the queryset is evaluated. Understand when evaluation happens and make sure it does not happen prematurely.
Use transaction.atomic.
Check the generated SQL queries. With bulk_update, for example, you generate a pretty big query with a lot of CASE/WHEN statements. That is fine, but make sure that the query size does not exceed the permitted query size for the database (max_allowed_packet in MySQL).
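For illustration, the SQL that bulk_update generates for a batch is shaped roughly like this (hypothetical table, column, and values):

```sql
UPDATE my_app_mytable
SET new_field = CASE
    WHEN (id = 1) THEN 101
    WHEN (id = 2) THEN 102
    WHEN (id = 3) THEN 103
END
WHERE id IN (1, 2, 3);
```

Note that bulk_update accepts a batch_size argument, which splits one oversized statement into several smaller ones.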
Coordinate with 2U SRE when releasing any tricky migrations to edx-platform.
This work has not been tested on stage+prod yet, so follow at your own risk. I’ll update this soon.
This PR includes a carefully crafted strategy for adding a foreign key to a large table. You can use this as a boilerplate example.
Here’s a rundown of what works without downtime. Each step takes a separate PR, and they should be deployed in order, not all at once.
Add new INT field first, all NULL values
Create index on top of new field with ALGORITHM=INPLACE, LOCK=NONE
Backfill values in batches for field via management command
Adjust the save() method, Django signals, or something similar to make sure that new records will not receive NULL values before the next step
Ensure there are no more NULL values in the db field
Add foreign key to field with ALGORITHM=INPLACE, LOCK=NONE. The field still needs to be nullable.
Add any desired constraints that prevent null values to the Django code, but not to the database.
Adding the foreign key constraint itself for nullable values requires wrapping the SQL in SET FOREIGN_KEY_CHECKS = 0; and SET FOREIGN_KEY_CHECKS = 1; to temporarily disable checks on foreign key links to the referenced model, so that the in-place algorithm is supported.