This initiative is to investigate and, if found to be useful, help implement one or more package-caching solutions for (at least) edx-platform. We are looking at options like Artifactory and DevPI to speed up Python-related builds and testing, gain some potential security benefits, and solve some problems related to forks that we run from but don't want to push to PyPI. Additional gains could be made for Node packages, Docker images, etc., but those are not the primary focus of this investigation (yet!).
The work and details for this investigation can be found in these Jira epics:
One thing we will need to sort out is how we want to get our problem packages into the cache. Presumably anything on PyPI should be cacheable normally. The steps to do that are:
The two ways of dealing with this (off hand) are:
Triggered (either by hand or by CI) when a problematic package is updated, much like pushing to PyPI. This is better for a global cache, where updates are infrequent and can be carefully controlled. However, several of the repositories in question are not under edX control, so in those cases there may be some confusion around who is responsible for updating the cache, and when. It also creates extra work when a developer bumps the version of one of these packages and wants to test locally.
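As a rough illustration of this push model with DevPI, updating a problem package might look like the following. The server URL, index name, and user are assumptions for the sketch, not an existing setup; the `devpi use`/`login`/`upload` commands are from the devpi-client tool.

```shell
# Hypothetical DevPI server and index; URL, user, and index name are assumptions.
devpi use https://devpi.example.com/edx/forks
devpi login some-user --password="$DEVPI_PASSWORD"

# From a checkout of the forked package at the pinned commit,
# build and upload a release to the index:
devpi upload
```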
Triggered when a system (Devstack build, Jenkins build, etc.) cannot find the necessary version in the cache. If it has the necessary information (repo, commit, version, write permissions to the cache), it can populate the cache itself. This makes sense for a local cache, where there is less likelihood of conflict between automated systems updating the cache, and where permissions can be more lax.
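The hit/miss decision at the heart of this option can be sketched in Python. This is a hypothetical helper, not existing tooling: it checks a PEP 503 "simple" index page for a given project and version, which is the check an automated system would make before cloning the repo, building, and uploading to the cache. The regex-based parsing is a deliberate simplification; real tooling would also fetch the page and handle authentication.

```python
import re

def needs_populate(index_html: str, project: str, version: str) -> bool:
    """Return True if no file for `project==version` appears in the
    simple-index HTML, i.e. the cache would need to be populated.

    Hypothetical helper for illustration only.
    """
    # PEP 503 normalization: runs of -, _, . collapse to one separator.
    norm = re.sub(r"[-_.]+", "-", project).lower()
    # File names use '-' (sdists) or '_' (wheels) inside the project name,
    # then '-<version>' followed by '.' (edx-foo-1.2.3.tar.gz) or
    # '-' (edx_foo-1.2.3-py3-none-any.whl). The lookahead keeps version
    # 1.2 from matching a 1.2.3 release.
    name_pat = "[-_.]".join(re.escape(part) for part in norm.split("-"))
    pattern = re.compile(
        name_pat + "-" + re.escape(version) + r"[-.](?!\d)", re.IGNORECASE
    )
    return pattern.search(index_html) is None
```

A build could call this against the cache's index page and, on `True`, kick off the clone-build-upload step.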
For systems like Devstack, and perhaps Jenkins testing, it might make sense to build the cache right into the Docker container when the other Devstack images are built. It would make for a larger download but could save a lot of testing time. If a version is bumped in development, the cache would still pull it through PyPI as normal, and it would also let devstack run tests offline out of the box in most situations (I think; not sure whether any non-Python dependencies are pulled at test time).
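One way to sketch the bake-it-into-the-image idea: during the image build, start a throwaway DevPI server and warm its PyPI mirror by resolving the pinned requirements through it. The ports, paths, and requirements-file location below are assumptions for illustration, not an existing devstack build step (and devpi-server >= 6 splits initialization into a separate `devpi-init` command).

```shell
# Hypothetical image-build steps; ports, paths, and the requirements
# file location are assumptions.
pip install devpi-server devpi-client

# Initialize server state, then start the server in the background:
devpi-init --serverdir /opt/devpi
devpi-server --serverdir /opt/devpi --port 3141 &
sleep 5  # crude wait for startup

# Resolve pinned requirements through the root/pypi mirror so the
# packages end up cached inside the image:
pip download -r requirements/edx/base.txt \
    --index-url http://localhost:3141/root/pypi/+simple/ \
    --dest /tmp/warm
```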
Another big question is whether we want numerous small caches or a few (perhaps just one) big shared caches. Here are some preliminary thoughts on those options:
| | DevPI | Artifactory |
|---|---|---|
Cost | OSS (MIT license) | OSS version seems not to support PyPI? Probably in the $50–$100/month range based on competitor costs |
Devstack local cache? | Yes | Probably not due to licensing |
Speed | Test times were slightly faster with DevPI, probably because it was running on the host machine rather than in Docker, not because of the tool itself. The two are likely comparable speed-wise. | |
UI | Web UI exists, but was not tried | Full featured: easy to browse packages, manage user permissions, etc. |
Command line | Full featured | Seems mostly limited to pip-style functionality; did not dig into it deeply, though |
Ease of Setup | Easy for a local setup | Easy for a local setup |
High Availability / Global cache | Provides single-writer, multiple-reader replication functionality, seems pretty new but probably robust enough for our use cases. Designed for geographically distributed systems, so we could place servers in different locations. | Provides localized cluster functionality. Requires Enterprise licensing. All servers need to be on the same LAN and share the same database server. |
Database | sqlite3 | MySQL, Oracle, MS SQL, PostgreSQL |
Filesystem | local, PostgreSQL (pls no), other plugins? | local (synchronizable in HA configuration), S3, NFS |
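For anyone trying the DevPI option locally (the "Devstack local cache" row above), pointing pip at a local server is a small config change. The URL below assumes DevPI's default port (3141) and its built-in `root/pypi` mirror index; the config path is the legacy per-user location, which may differ on your platform.

```shell
# Hypothetical local DevPI; host/port and index name are assumptions.
# One-off install through the cache:
pip install --index-url http://localhost:3141/root/pypi/+simple/ requests

# Or set it as the default index in ~/.pip/pip.conf:
#   [global]
#   index-url = http://localhost:3141/root/pypi/+simple/
```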