Course Sharing Design Thoughts

The Course Sharing feature is part of a larger Social Sharing initiative that encourages our learners to share and publicize their edX experiences with other users outside of the edX ecosystem, using social media such as Twitter and Facebook.  This is a valuable feature for edX as it should be a low-cost tool that promotes viral marketing and adoption of our products and services.

Current State

Open edX mobile clients already have latent support for Course Sharing as part of an earlier effort to implement this for an Open edX partner.  That work entailed updating Mobile's Enrollment API to include, with each enrolled course, a new field for the URL to the Course About page.  The Course About URL pointed to the Course About Page on the LMS.  The mobile clients were updated, via a feature flag, to allow learners to share this URL using the client's native Sharing features.

This works well for Open edX mobile clients.

However, for edx.org or other Open edX instances that have a separate marketing site, the Sharing URL to the Course About page needs to point to a page on the marketing site rather than one on the LMS.  This Course-Marketing-URL (CMU) is better suited for sharing as that is the marketable, searchable (SEO), long-term URL for the Course - unlike the LMS's URL, which is specific to the Course-run and (currently) doesn't contain all the marketing information.

Today, the CMU is not available to the LMS in a performant way.  There is a Catalog API that provides the CMUs for requested Course-runs.  It supports both single and bulk paginated requests.

Design

This section describes a proposal to make CMUs available for use to the LMS.

Design Principles

The following design principles are strongly encouraged for ongoing reliability, scalability, and performance of the LMS:

  1. Keep the LMS service decoupled from that of the Catalog Service.  So, if the Catalog service is down, it doesn't impact core LMS features.
  2. Maintain the SLA of Mobile's enrollment API.  The Mobile team's prior investment in optimizing this endpoint resulted in a significant performance improvement.  Our Service-level Objective (SLO), set in consultation with edx/devOps, was to keep the response time under 2 seconds.  This was achieved by creating a persistent cache of Course Metadata in a SQL table (CourseOverview), as direct access to Mongo for that data was not performant at the time.
  3. Avoid blocking calls on other services within a Web worker process when responding to a client request.  Keeping dependencies on external systems, including other edX services, in background processes allows web processes to respond in real-time and achieve their SLAs with minimal (to no) complexity and concerns for Graceful Degradation and Circuit Breakers.

Proposed Design

Currently, the Catalog Service pulls data from various data sources to create a single-source-of-truth for all Course Metadata.  It does this with a refresh_course_metadata management command that is run every 4 hours by a background Jenkins job via a discovery_refresh_metadata Ansible playbook.

The proposal is to implement a background task that performs the reverse synchronization from the Catalog service back to the LMS.  In the future, we may retrieve additional data from the Catalog service.  For now, we focus on just synchronizing the CMUs.

  1. Update the CourseOverview table to include a new field for marketing_url under the course-catalog metadata section.
    1. Note: there is already a social_sharing_url field in this table.  That seems to have been added by the Solutions team, but seems distinct from the CMU since it can be specified by course-teams in the Advanced settings of the course (but only if SOCIAL_SHARING_SETTINGS and CUSTOM_COURSE_URLS feature flags are enabled).
    2. Make sure to inform edx/devops on the PR for this change so the migration can be monitored when it goes through stage and prod.
  2. Create a django management command (update_catalog_data) that:
    1. Calls the Catalog service's course-marketing URL API to get a paginated response of CMUs for all courses in the CourseOverview table.
      1. Explicitly set the page size to a reasonable amount (the current default is only 20), given the Catalog service's SLA.
      2. Make the page size a configurable setting, possibly as an input parameter to the command.  Check with edx/devops on the best way to dynamically configure this setting.
    2. Sets the found CMUs in the new marketing_url field in the CourseOverview table.
    3. Logs to Splunk:
      1. when the management command begins and ends
      2. simple statistics, such as the number of courses retrieved from the API
      3. whenever there are any failures in connecting with the Catalog service
  3. Create a Jenkins job (LMS_update_catalog_data) to run the new management command.
    1. Will need to consult with edx/devops to implement this.
  4. Fix the implementation of get_lms_link_for_about_page (it currently returns the incorrect value for edx.org), which is used by the Mobile API.
    1. Update the method to take in a CourseOverview object (already available to the Mobile API).
    2. Have it return course_overview.social_sharing_url or course_overview.marketing_url or reverse('about_course', args=[str(course_overview.id)]), in that priority order.
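
The synchronization in step 2 and the priority order in step 4 can be sketched as follows.  This is a minimal illustration under stated assumptions, not the actual implementation: plain dictionaries stand in for CourseOverview rows, fetch_page is a hypothetical callable wrapping the Catalog request, and the response shape (a "results" list of items with "key" and "marketing_url", plus a "next" indicator) is assumed rather than taken from the Catalog API.  The function names sync_marketing_urls and sharing_url_for are also made up for this sketch.

```python
import logging

log = logging.getLogger("update_catalog_data")


def sync_marketing_urls(overviews, fetch_page, page_size=100):
    """Copy CMUs from the Catalog service into local overview records.

    overviews:  dict of course id -> dict (stand-in for CourseOverview rows).
    fetch_page: hypothetical callable (page, page_size) -> response dict.
    Returns the number of records whose marketing_url was set.
    """
    log.info("update_catalog_data: starting (page_size=%d)", page_size)
    retrieved = updated = 0
    page = 1
    while True:
        try:
            response = fetch_page(page, page_size)
        except ConnectionError:
            # Log the failure and stop; existing values are left untouched.
            log.exception("update_catalog_data: Catalog request failed on page %d", page)
            break
        for item in response["results"]:
            retrieved += 1
            overview = overviews.get(item["key"])
            if overview is not None and item.get("marketing_url"):
                overview["marketing_url"] = item["marketing_url"]
                updated += 1
        if not response.get("next"):
            break
        page += 1
    log.info("update_catalog_data: finished (retrieved=%d, updated=%d)", retrieved, updated)
    return updated


def sharing_url_for(overview, lms_about_url):
    """Return the URL to share, in the priority order given in step 4."""
    return (
        overview.get("social_sharing_url")
        or overview.get("marketing_url")
        or lms_about_url
    )
```

Because the web process only ever reads the locally stored value through something like sharing_url_for, a Catalog outage in this sketch leaves the previously synchronized URLs in place and never blocks a client request.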

Anticipated Issues

  • Pagination tuning. The default page size of 20 will result in an unnecessarily large number of calls to the Catalog API, since production currently has over 3200 courses in the CourseOverview table (more than 160 requests at 20 per page).  We will need to determine a good approach to find the ideal value.  Making the page size dynamically configurable (e.g., via django admin) will help with tweaking this number.  (Note that ECOM-6834 will allow the page size to be changed dynamically, per-request.)
  • CCX courses. It's unclear whether the Catalog service's course-marketing URL API will return the correct value for CCX courses.  If it doesn't, the management command will have to explicitly get the parent course of each CCX course before sending the request to the API.  Handling CCX courses can be implemented as an iterative change and need not be included as part of the initial rollout of this change, but should be handled.
  • Course overview invalidation.  As currently implemented, a course's entry in the course-overview table is immediately deleted upon course publish.  This was fine until now since there was only a single source for the cache: the modulestore.  But by introducing another source of data (Catalog service), this logic should be revisited.  When the entry is deleted and re-created, any previously synchronized data from the catalog service will automatically get wiped out - and will be re-populated only when the background synchronization process is next run.  A few solutions to this would be: (1) no longer delete upon course publish - only update, or (2) make a blocking call (with graceful degradation) to the Catalog service to populate the CMU when the entry is re-created.  The issue with option 2 (currently) is that this sometimes happens in a user-facing web-worker process.
  • Inter-team dependencies. Early conversations with, and support from, an edx/devOps member and ECOM are recommended to help this project go faster.
  • Pagination bug (fix in progress). To date, the LMS has not done a bulk query of the Catalog service in production.  We found a pagination issue in our previous attempt at this feature, which is now being fixed by ECOM.
  • Mobile enrollment-list API performance (orthogonal bug created). In writing this document, I noticed that the Mobile enrollment-list API is currently not making bulk SQL queries where it should.  So the expected SLA of under 2 seconds is not being met (NewRelic link).  However, this is an orthogonal issue and should be fixed separately.
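
Option (1) under course overview invalidation above (update on publish rather than delete) could look roughly like the sketch below.  It is only an illustration: a dictionary stands in for the course-overview table, and the names CATALOG_OWNED_FIELDS and refresh_overview_on_publish are hypothetical.  The idea is that fields populated by the Catalog synchronization are carried over when the modulestore-derived fields are rebuilt.

```python
# Fields populated by the background Catalog sync rather than by the
# modulestore (an assumption made for this sketch).
CATALOG_OWNED_FIELDS = ("marketing_url",)


def refresh_overview_on_publish(table, course_id, modulestore_fields):
    """Rebuild a course's overview entry without dropping Catalog data.

    table:              dict of course id -> dict (stand-in for the SQL table).
    modulestore_fields: freshly computed values from the modulestore.
    """
    existing = table.get(course_id, {})
    refreshed = dict(modulestore_fields)
    for field in CATALOG_OWNED_FIELDS:
        # Keep the last synchronized value; None if never synchronized.
        refreshed[field] = existing.get(field)
    table[course_id] = refreshed
    return refreshed
```

With this approach a publish never leaves a window in which the CMU is missing, so no blocking call to the Catalog service is needed in a user-facing process.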

Rejected Alternative

An earlier attempt at this feature resulted in unanticipated issues and was terminated for the following reasons:

  • The implementation required caching the CMUs in memcached and used the same TTL of 5 minutes used by other Catalog integrations, which is too short for our purpose.
  • Having a user-facing API synchronously block on a dependent service is bound to have issues and unpredictable response times.
  • The Catalog client's timeout value of 5 seconds exceeds the SLO of 2 seconds for the Mobile Enrollment API.
  • Tweaking the page size is even more complex, given that different users have different numbers of enrollments, the API call is blocking, and the TTL is only 5 minutes.
  • Even with graceful degradation and circuit breakers, the mobile user experience is far from ideal: a user may be able to share a course one moment and then be unable to the next, because of memcached TTL expirations and backend inter-service communication failures.