Package Caching

This initiative is to investigate and, if found to be useful, help implement one or more package caching solutions for (at least) edx-platform. We are looking at options like Artifactory and DevPI to help speed up Python-related builds and testing, as well as gain some potential security benefits, and solve some problems related to forks we run off of but don't want to push to PyPI. Additional gains can be made for Node packages, Docker, etc. but are not the primary focus of this investigation (yet!).

The work and details for this investigation can be found in these Jira epics:

PLAT-1696

PLAT-1901

Findings

  • Which github.txt Python packages could move to a package cache, and which could be handled via other means, is written up in PLAT-1907
  • Notes on setting up Artifactory and some timing / test run information are here
  • Notes on setting up DevPI and timing results are here

Populating Caches

One thing we will need to sort out is how we want to get our problem packages into the cache. Presumably anything on PyPI can be cached normally; for the problem packages, the steps would be:

  • Clone the correct repository
  • Check out the correct commit
  • Build the package
  • Upload the built package to the cache
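
A minimal sketch of those steps, assuming a cache that accepts twine-compatible uploads (true of both devpi and Artifactory PyPI repositories); the repository URL, commit, and index URL would be supplied by whatever triggers the build:

```python
# Sketch only: assumes git, twine, and the wheel package are available, and
# that twine credentials are supplied via TWINE_USERNAME / TWINE_PASSWORD.
import glob
import os
import subprocess
import tempfile


def populate_cache(repo_url, commit, index_url):
    """Clone a fork at a pinned commit, build it, and upload it to the cache."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(["git", "clone", repo_url, workdir], check=True)
        subprocess.run(["git", "checkout", commit], cwd=workdir, check=True)
        # Build an sdist and a wheel into workdir/dist/
        subprocess.run(["python", "setup.py", "sdist", "bdist_wheel"],
                       cwd=workdir, check=True)
        # Push the built artifacts to the cache's upload endpoint
        dist_files = glob.glob(os.path.join(workdir, "dist", "*"))
        subprocess.run(["twine", "upload", "--repository-url", index_url,
                        *dist_files], check=True)
```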

Offhand, the two basic ways of dealing with this, plus a hybrid, are:

Push

Triggered (either by hand or by CI) when a problematic package is updated, much like pushing to PyPI. This is better for a global cache, where updates are infrequent and can be carefully controlled. However, several of the repositories in question are not under edX control, so in those cases there may be some confusion around who is responsible for updating the cache, and when. It also makes for some extra work when a developer is bumping the version of one of these packages and wants to test locally.

Pull

Triggered when a system (Devstack build, Jenkins build, etc.) cannot find the necessary version in the cache. If it has the necessary information (repo, commit, version, write permissions to the cache), it can populate the cache itself. This makes sense for a local cache, where there is less likelihood of conflicts between automated systems updating the cache, and where permissions can be more lax.
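
As a rough illustration of the miss-then-populate flow, a pull-style populator might check the cache's PEP 503 "simple" index page for the pinned version before building; the exact index layout varies by server, so the URL pattern below is an assumption:

```python
# Sketch only: populate_cache() is the sketch from "Populating Caches" above,
# and the /simple/ URL layout is an assumption about the chosen cache.
import urllib.error
import urllib.request


def ensure_cached(package, version, repo_url, commit, index_url):
    """Build and upload the pinned version only if the cache lacks it."""
    simple_url = f"{index_url}/simple/{package}/"
    try:
        with urllib.request.urlopen(simple_url) as resp:
            listing = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError:
        listing = ""  # the package is not in the cache at all
    # PEP 503 pages link to files named like "<package>-<version>.tar.gz"
    if f"{package}-{version}" not in listing:
        populate_cache(repo_url, commit, index_url)
```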

Prepopulate & Pull

For systems like Devstack, and perhaps Jenkins testing, it might make sense to build the cache right into a Docker container when the other Devstack images are built. It would make for a larger download, but could save a lot of testing time. If a version were bumped in development, the cache would still pull it through from PyPI as normal, but this would also let Devstack run tests offline out of the box in most situations (I think; not sure whether any non-Python dependencies are pulled at test time).
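
For example, an image build step could walk a pinned manifest of the problem packages and prepopulate the baked-in cache up front; the manifest format here is invented for illustration, and ensure_cached() is the sketch from the Pull section above:

```python
# Sketch only: intended to run while building the Devstack cache image.
import json


def prepopulate(manifest_path, index_url):
    """Warm the cache from a manifest of pinned problem packages."""
    # Manifest: [{"package": ..., "version": ..., "repo": ..., "commit": ...}]
    with open(manifest_path) as f:
        entries = json.load(f)
    for entry in entries:
        ensure_cached(entry["package"], entry["version"],
                      entry["repo"], entry["commit"], index_url)
```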

Local Caches vs. Global Cache

Another big question is whether we want numerous small caches or a few (maybe just one) big shared caches. Here are some preliminary thoughts on those options:

Local

  • (plus) Fastest, closest cache
  • (plus) Allows, or at least brings us closer to, an offline Devstack
  • (plus) Smaller (or no) DevOps burden
  • (plus) / (minus) Developers would need to understand / manage the local cache if they are bumping the version of a problematic package, but would have greater control
  • (minus) If we wanted to use it for Jenkins builds / tests we might still need to stand up a shared instance
    • This brings with it the overhead of somehow having to keep it populated with up-to-date dependencies
  • (minus) Somewhat slower build times for Devstack either when we build the cache image or the first time a user populates the local cache

Global

  • (plus) Easier to maintain one cache, less chance of it falling out of sync on a pull
  • (plus) Little-to-no special handling needed for Jenkins builds / tests
  • (plus) / (minus) Developer burden is lighter, but updating problematic packages may require new permissions / processes, or at least some new automation
  • (minus) Slower in general use due to having to hit the internet to download packages
  • (minus) Still requires Devstacks to be online
  • (minus) Heavier DevOps burden - this becomes another critical system that needs to be highly available and (depending on solution chosen) may involve new clustering setups and additional database servers to maintain

DevPI vs. Artifactory


|  | DevPI | Artifactory |
| --- | --- | --- |
| Cost | OSS (MIT license) | The OSS version seems not to support PyPI? Probably in the $50 - $100 / month range based on competitor costs |
| Devstack local cache? | Yes | Probably not, due to licensing |
| Speed | Test times were slightly faster with DevPI, though that probably has more to do with it running on the host machine instead of in Docker than with the package itself | Likely comparable speed-wise |
| UI | Web; did not try it | Full featured: easy to browse packages, manage user permissions, etc. |
| Command line | Full featured | Seems pretty limited to pip functionality; did not dig into it, though |
| Ease of Setup | Easy for a local setup | Easy for a local setup |
| High Availability / Global cache | Provides single-writer, multiple-reader replication; it seems pretty new but is probably robust enough for our use cases. Designed for geographically distributed systems, so we could place servers in different locations. | Provides localized cluster functionality; requires Enterprise licensing. All servers need to be on the same LAN and share the same database server. |
| Database | sqlite3 | MySQL, Oracle, MS SQL, PostgreSQL |
| Filesystem | local, PostgreSQL (pls no), other plugins? | local (synchronizable in an HA configuration), S3, NFS |

Open Questions

  • How will our choices here impact the OSS / partner communities? What can we do to make sure they can benefit from any gains we see here? How can we keep from complicating things for them?