This initiative will investigate and, if they prove useful, help implement one or more package caching solutions for (at least) edx-platform. We are looking at options like Artifactory and DevPI to speed up Python-related builds and testing, gain some potential security benefits, and solve some problems with forks that we depend on but don't want to push to PyPI. Additional gains can be made for Node packages, Docker, etc., but those are not the primary focus of this investigation (yet!).

The work for this investigation and details can be found in these Jira epics:

Findings

Populating Caches

One thing we will need to sort out is how we want to get our problem packages into the cache. Presumably anything in PyPI should be able to cache normally. The steps to do that being:

The ways of dealing with this (offhand) are:

Push

Triggered (either by hand or by CI) when a problematic package is updated, much like pushing to PyPI. This is better for a global cache, where updates are infrequent and can be carefully controlled. However, several of the repositories in question are not under edX control, so in those cases there may be some confusion about who is responsible for updating the cache, and when. It also makes for some extra work when a developer is bumping the version of one of these packages and wants to test locally.
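As a rough illustration of the push approach, the sketch below assembles the two commands a developer or CI job might run: build a wheel from a fork at a given commit, then upload it to the cache. The repository URL, commit, and index URL are hypothetical placeholders, and using `pip wheel` plus `twine upload` is an assumption about tooling, not a decided workflow.

```python
from typing import List


def push_commands(repo_url: str, commit: str, index_url: str,
                  wheel_dir: str = "dist") -> List[List[str]]:
    """Return the commands to build a fork at a commit and push it to a cache.

    All arguments are illustrative; this function only builds the command
    lines and does not run anything.
    """
    build = [
        "pip", "wheel", "--no-deps", "--wheel-dir", wheel_dir,
        f"git+{repo_url}@{commit}",
    ]
    # The glob is left for the shell to expand when the line is run by hand.
    upload = [
        "twine", "upload", "--repository-url", index_url,
        f"{wheel_dir}/*",
    ]
    return [build, upload]


if __name__ == "__main__":
    for cmd in push_commands(
        "https://github.com/example-org/example-fork.git",  # hypothetical fork
        "abc1234",                                          # hypothetical commit
        "http://localhost:3141/example/dev/",               # hypothetical index
    ):
        print(" ".join(cmd))
```

Keeping the step as a pair of explicit commands makes it easy to run by hand for a local test and to drop into a CI job unchanged.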

Pull

Triggered when a system (Devstack build, Jenkins build, etc.) cannot find the necessary version in the cache. If it has the necessary information (repo, commit, version, write permissions to the cache), it can populate the cache itself. This makes sense for a local cache, where there is less likelihood of a conflict between automated systems updating the cache, and where permissions can be more lax.
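The pull behaviour described above can be sketched as a check-then-populate step. This is a minimal illustration that uses an in-memory dictionary as a stand-in for the real cache; `ensure_cached` and the `build` callback are hypothetical names, and a real implementation would talk to DevPI or Artifactory instead.

```python
from typing import Callable, Dict, Tuple

# In-memory stand-in for the real cache; keys are (package name, version).
Cache = Dict[Tuple[str, str], bytes]


def ensure_cached(cache: Cache, name: str, version: str,
                  build: Callable[[str, str], bytes]) -> bool:
    """Make sure (name, version) is present in the cache.

    Returns True if the cache had to be populated, False on a cache hit.
    """
    key = (name, version)
    if key in cache:
        return False
    # Cache miss: build the artifact (e.g. from repo + commit) and store it.
    cache[key] = build(name, version)
    return True


if __name__ == "__main__":
    def fake_build(name: str, version: str) -> bytes:
        # Placeholder for "check out the fork and build a wheel".
        return f"{name}-{version}".encode()

    cache: Cache = {}
    print(ensure_cached(cache, "example-fork", "1.2.3", fake_build))
    print(ensure_cached(cache, "example-fork", "1.2.3", fake_build))
```

With several automated systems writing to one shared cache, the check and the write would need to be atomic, which is part of why this approach fits a local cache better.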

Prepopulate & Pull

For systems like Devstack, and perhaps Jenkins testing, it might make sense to build the cache right into the Docker container when other Devstack images are built. It would make for a larger download, but potentially save a lot of testing time. If a version is bumped in development, the cache would still pull it through PyPI as normal, but it would also let Devstack run tests offline out of the box in most situations (I think; not sure if any non-Python dependencies are pulled at test time).
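One way to picture the prepopulation step is a small helper that runs at image build time: it separates requirements that resolve through the index from "problem" requirements (VCS or direct-URL references), then builds a `pip download` command for the first group. This is a sketch only; the function names are hypothetical, the classification rule is a simplification, and other option lines such as `-r` are skipped rather than followed.

```python
from typing import Iterable, List, Tuple


def split_requirements(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split requirement lines into (from_index, problem) lists.

    Lines that point at a VCS or direct URL are treated as "problem"
    packages; everything else is assumed to resolve through the index.
    """
    from_index: List[str] = []
    problem: List[str] = []
    for raw in lines:
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#"):
            continue
        if line.startswith("-e ") or "://" in line:
            problem.append(line)
        elif line.startswith("-"):
            continue  # other pip options are not handled in this sketch
        else:
            from_index.append(line)
    return from_index, problem


def prefetch_command(requirements: List[str], dest: str) -> List[str]:
    """Build a `pip download` command that fetches requirements into dest."""
    return ["pip", "download", "--dest", dest, *requirements]


if __name__ == "__main__":
    sample = [
        "Django==1.11  # pinned",
        "git+https://github.com/example-org/example-fork.git@abc1234#egg=example",
    ]
    from_index, problem = split_requirements(sample)
    print(prefetch_command(from_index, "/cache"))
    print("needs push or pull handling:", problem)
```

The problem list is exactly the set of packages that would still need one of the push or pull mechanisms above.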

Local Caches vs. Global Cache

Another big question is whether we want to have numerous small caches or a few big shared caches, maybe just one. Here are some preliminary thoughts on those options:

Local

Global

DevPI vs. Artifactory



| | DevPI | Artifactory |
|---|---|---|
| Cost | OSS (MIT license) | OSS version seems not to support PyPI? Probably in the $50 - $100 month range based on competitor costs |
| Devstack local cache? | Yes | Probably not due to licensing |
| Speed | Test times were slightly faster in DevPI, probably more to do with it being on the host machine instead of Docker more than the package itself. They are likely comparable speed-wise. | (see DevPI column; applies to both) |
| UI | Web, did not try. | Full featured, easy to browse packages, user permissions, etc |
| Command line | Full featured | Seems pretty limited to pip functionality, did not dig into it though |
| Ease of Setup | Easy for a local setup | Easy for a local setup |
| High Availability / Global cache | Provides single-writer, multiple-reader replication functionality, seems pretty new but probably robust enough for our use cases. Designed for geographically distributed systems, so we could place servers in different locations. | Provides localized cluster functionality. Requires Enterprise licensing. All servers need to be on the same LAN and share the same database server. |
| Database | sqlite3 | MySQL, Oracle, MS SQL, PostgreSQL |
| Filesystem | local, PostgreSQL (pls no), other plugins? | local (synchronizable in HA configuration), S3, NFS |


Open Questions