Background

We've recently run into some issues with migrations on some large tables that were endlessly growing with data we don't need or want. The effort to remove the unwanted rows proved more challenging than expected, so we're capturing the lessons learned here.

Best Practices

There are a few different classes of problems that can each be addressed differently. This document is a starting point; as new classes of problems pop up, please add them!

Considerations

  1. Does the table need to exist at all?
    1. If not, consider using the usual migrations path to drop the table.
    2. If the table is large, you may need to slow-prune it before dropping it (DevOps has run into issues where truncating large MySQL tables caused EBS queuing that impacted the rest of the RDS instance).
  2. Is there any data in the table that must remain during the delete operation?
    1. If so, you will probably need to use the Batch Delete Process below.
  3. Are there any foreign keys pointing to the data (in platform or other external dependencies), or important Django signals listening to delete events on the models?
    1. If so, you may need to use the ORM Delete Process below.
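For the simple case where the table doesn't need to exist at all, dropping it through the usual migrations path might look like this minimal sketch (app label, dependency, and model name are all hypothetical; in practice, removing the model class from models.py and running makemigrations generates this for you):

```python
# Hypothetical migration sketch: drop a table we no longer need via the
# usual migrations path. Names here are made up for illustration.
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ("myapp", "0042_previous_migration"),
    ]

    operations = [
        # DeleteModel drops the table and removes the model from migration
        # state. If the table is very large, slow-prune the rows first (see
        # the Considerations above) so the drop doesn't hammer the RDS.
        migrations.DeleteModel(name="UnwantedLog"),
    ]
```
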

Avoid - Raw SQL

In general we should avoid writing raw SQL for these purposes if we can. It skips some important Django checks, and it widens the disconnect between where we test the code (generally SQLite) and where it runs in production (MySQL), a disconnect the ORM helps smooth out (a little).

Batch Delete Process

If you need to keep some data in the table, but clear a lot of it while keeping your auto-increment key intact, this process might work for you. MySQL has some issues with large deletes, and the Django ORM also has some limitations with batch deletes (especially on MySQL). One we ran across: any sliced queryset results in Django generating SQL that works on SQLite (and thus in unit tests) but errors on MySQL, because it produces a DELETE with a subselect in the WHERE clause. This seems to happen any time you pass a list to a filter's "__in" lookup, even if it's just a list of integer ids. To work around these issues we tried a few things before coming up with the solution here.

General best practices for deleting in batches: delete by primary key in small, fixed-size batches, and materialize each batch of ids into a plain Python list before passing it to the delete filter, so MySQL never sees a DELETE with a subselect.
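As a concrete illustration, here is a minimal sketch of the batch delete loop (the helper name is ours, not from any library). The key detail is the list() call: it forces the sliced queryset of ids to run as its own SELECT, so the ids reach the DELETE as literal values rather than as the subselect MySQL rejects:

```python
def delete_in_batches(queryset, batch_size=1000):
    """Delete every row matched by `queryset`, batch_size rows at a time.

    Returns the total number of rows deleted. Unlike TRUNCATE, this leaves
    the table's auto-increment counter untouched.
    """
    total = 0
    while True:
        # list() materializes the sliced ids in Python; passing a plain
        # list to pk__in keeps MySQL from seeing a DELETE with a subselect.
        pks = list(queryset.values_list("pk", flat=True)[:batch_size])
        if not pks:
            return total
        deleted, _ = queryset.model.objects.filter(pk__in=pks).delete()
        total += deleted
```

Note that filter(...).delete() may still take Django's fast-delete path; if cascades or signals matter, use the ORM Delete Process below instead.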

ORM Delete Process

Sometimes you need Django to cascade your delete to other tables that have foreign keys to the data you are deleting, or you need important pre/post delete signals to run. In this case your life is simple: use the usual Django filter() to get your target rows and delete() them one at a time. This will be much slower, but it performs all of the necessary checks. Note that the same code that runs a fast delete on one model can be slow on another because of those FKs or signals. See here for details.
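A minimal sketch of that slow path (the helper name is ours): re-slice a small batch each round and call delete() on each instance, so Django can follow on_delete cascades and fire the pre_delete/post_delete signals that a bulk fast delete would skip:

```python
def delete_with_cascades(queryset, batch_size=500):
    """Delete matched rows one instance at a time.

    Each obj.delete() fires pre_delete/post_delete signals and follows
    on_delete cascades. Re-slicing the queryset each round re-runs the
    query against the shrinking table and keeps memory bounded.
    """
    total = 0
    while True:
        batch = list(queryset[:batch_size])
        if not batch:
            return total
        for obj in batch:
            obj.delete()
            total += 1
```
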

Resources