MySQL vs PostgreSQL

Currently MySQL is the only relational database supported for use with Open edX, but there are solid reasons to seriously consider adding support for PostgreSQL also. There are also very good reasons why this hasn’t been done already. This page attempts to summarize the reasons on both sides and detect if/when we reach a point when it becomes worthwhile to prioritize adding PostgreSQL support.

History

Why was MySQL chosen in the first place?

Almost exclusively because when edX was started back in 2011, AWS RDS supported MySQL but not PostgreSQL. (PostgreSQL support was added in November 2013). General consensus among the developers at the time was that PostgreSQL would have been a better long-term choice, but RDS offered vastly simpler operational overhead compared to the other database hosting options available at the time. Using RDS freed up developer bandwidth to work on the new features needed in the startup phase to ensure that edX would survive long enough for anyone to still care about the choice of database.

Why didn’t we switch to PostgreSQL once RDS offered it?

First, edX developers were too busy adding the aforementioned new features. Then in November 2014, AWS Aurora was released with functionality even more appealing for operating sites of the scale edx.org was becoming. PostgreSQL support again lagged, not becoming available until October 2017.

So, why didn’t we switch once Aurora offered it?

By that point we had a pretty large database with 6 years of history for one of the highest-traffic educational web sites in the world. The migration would not have been trivial, and the arguments for it boiled down to “we could stop wasting time on this assortment of MySQL bugs and quirks that we’ve mostly learned to work around and live with”.

Why hasn’t anybody outside edX/2U added PostgreSQL support?

Most of the other development organizations in the Open edX ecosystem also have large legacy MySQL databases that would need to be migrated or maintained in parallel with newer PostgreSQL databases. And even more problematic, they would have a pretty foundational difference in their software stack from the edx.org site driving most development on the code, leaving them exposed to potential bugs related to new commits not correctly accounting for the different database (or having bug reports de-prioritized as being potentially due to a custom choice of database that doesn’t impact most site operators).

Why add PostgreSQL support?

At first the gaps in functionality and reliability between MySQL and PostgreSQL were perceived as annoying nuisances, but as the scale of Open edX and the sites running it have grown, so has the scale of the pain in dealing with MySQL’s foibles. Some of the more notable ones:

  • Transactional DDL. If a database schema migration fails in MySQL, it has a nontrivial risk of leaving the database in a broken state that triggers a CAT-1 incident to restore site functionality. In PostgreSQL, such a failure simply results in the transaction being rolled back so the database is left in the state it was in before the migration attempt.

  • 1st-class Django support. Django has many useful features that only work with PostgreSQL, and many more cases where “this works on MySQL, but not as well as on PostgreSQL”. And this is a logical prioritization choice for the project’s maintainers - according to the 2022 JetBrains survey of Django developers, 79% of them work with PostgreSQL while only 28% use MySQL in any capacity.

  • Better full text search. MySQL comes with a rudimentary full text search engine, but the one in PostgreSQL is good enough to replace dedicated tools like Elasticsearch for many use cases (potentially including many Open edX installations).

  • Better JSON support. MySQL can store JSON data as text, but its ability to work with it as structured data is very limited. PostgreSQL does this well enough to outperform dedicated document databases like MongoDB in many use cases (potentially allowing yet another Open edX dependency to be dropped for many or all installations).

  • Better ecosystem of related software. As PostgreSQL has mostly surpassed MySQL in functionality and now in popularity, it’s becoming the first (and often only) database supported by an assortment of useful supporting software. (AWS is a notable exception for historical reasons.) For example:

  • Proper Unicode behavior. MySQL made the odd choice to make its default “utf8” encoding only support 3 of the 4 available bytes in the UTF-8 standard. The full 4-byte encoding was only added later as “utf8mb4”, and the default character set was still the much more restrictive latin1 until MySQL 8.0. This is forcing a difficult migration process to correct. The migration (and reasoning about Unicode behavior in MySQL in general) is complicated by the fact that MySQL has some hard-coded constraints involving index sizes, etc. that are defined in bytes instead of characters; trying to change the character encoding can fail depending on the content of indexed text fields.

  • Remove blocker to adoption. Organizations that are already strongly invested in PostgreSQL may be reluctant to use Open edX because it only works with MySQL.

  • More rapid development of new features. PostgreSQL is a pretty high-quality codebase under a liberal software license, and adds feature enhancements on a regular basis. MySQL was acquired by Oracle, which has a vested interest in upselling users to a much more expensive proprietary database product, so features which could reduce the incentive to upgrade are typically not added. This led to the creation of a fork called MariaDB, which served to further fragment and slow down development effort.

  • Developer Experience/Hiring/Retention. For many reasons like the ones noted above, PostgreSQL has become strongly preferred over MySQL by a solid majority of developers. StackOverflow’s 2023 developer survey notes that about 43% of developers would like to be working with PostgreSQL vs. only 24% for MySQL (and 71% of PostgreSQL users are happy with it, vs. only 51% for MySQL users). Consequently, despite MySQL’s early lead in popularity for historical reasons, PostgreSQL is now more widely used. (The MariaDB fork of MySQL fares even worse on most of these metrics.). This may not be a determining factor in software engineers deciding where to work, but it lands database selection solidly in the “con” column for many such decisions involving jobs involving Open edX and could help tip the scales one way or the other.

Why not add PostgreSQL support?

As noted in the history section above, there are some pretty good reasons why PostgreSQL support hasn’t been added to Open edX yet despite its advantages.

  • Development effort opportunity cost. Although we primarily use database-agnostic Django functionality, we have hard-coded references to MySQL in devstack, Tutor, some migration steps with custom SQL, etc. We don’t yet have a solid grasp of the total scope of effort, and any development effort put into it will come at the expense of other valuable initiatives.

  • More maintenance overhead. Supporting two databases takes more work than supporting one, and it’s unlikely that all current Open edX sites would promptly switch from MySQL to PostgreSQL once the support is added.

  • Operational unknowns. We collectively have a lot of experience with running MySQL at scale and have tweaked various indexes and queries to accommodate MySQL (and occasionally Aurora quirks). While it’s clear that Postgres can also scale, we may see performance regressions during the process of migrating any large site.

After adding PostgreSQL support, should we drop MySQL support?

To allow for sane migration plans, we’d need at least some span of time when MySQL and PostgreSQL are both supported. But would we want to continue supporting both indefinitely, or drop MySQL support after allowing a reasonable amount of time for migration?

Reasons to drop MySQL support

  • Lower maintenance overhead. If we drop MySQL support, we get back to only needing to support a single database, reducing the long-term maintenance burden.

  • Freedom to use PostgreSQL-only features. Officially supporting MySQL would essentially prevent usage of many of the nicer PostgreSQL-only features in the core Open edX code. We’d only be able to use it in optional features/extensions, if at all.

  • No need to work around MySQL quirks. Several prominent open source projects have declined to support MySQL despite code contributions to add it, because their experience has been that resolving bugs involving its use takes valuable development effort away from other high priority tasks (and is simply unpleasant to deal with). Having experienced many of these problems ourselves (as described above), we might well make the same decision once PostgreSQL is a viable option.

Reasons to keep MySQL support

  • Community pressure. A significant percentage of the Open edX community may be reluctant to migrate from MySQL to PostgreSQL, especially if they already have significant in-house MySQL experience or use other software which only works with MySQL. We’d have to poll the community to gauge sentiment regarding this.

  • Fewer blockers to adoption. Organizations that have already invested heavily in MySQL might be more reluctant to adopt Open edX if it only works with PostgreSQL.