edX Analytics Installation

This page describes how to set up the analytics stack in a small-scale, production-like setup: everything on one box, with no AWS-specific features. The setup is mostly automated using Ansible, but a few pieces are still manual.


Install overview:

  1. Set up a new box and ensure it can connect to the LMS and the LMS mysql DB
  2. Run ansible to install all the things and do most of the configuration
  3. Manually finish a few bits of configuration (in particular, OAuth config on the LMS side)
  4. Copy over tracking logs and run some test jobs
  5. Automate loading of tracking logs and schedule jobs to run regularly

TL;DR – just give me the script

This is a bash script to install all the things.
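Roughly, it runs the same steps as the detailed instructions below. A sketch (the playbook name and paths are assumptions, and the manual OAuth step still has to happen on the LMS):

    #!/usr/bin/env bash
    set -e

    # prerequisites
    sudo apt-get update
    sudo apt-get install -y git build-essential python-dev python-pip libmysqlclient-dev libffi-dev libssl-dev

    # ansible in a virtualenv
    sudo pip install virtualenv
    virtualenv ~/venvs/ansible
    . ~/venvs/ansible/bin/activate

    # run the single-box analytics playbook from edx/configuration
    git clone https://github.com/edx/configuration.git
    pip install -r configuration/requirements.txt
    (cd configuration/playbooks && ansible-playbook -i 'localhost,' -c local analytics_single.yml)

    # load the sample tracking log into HDFS (expects ~/tracking.log; you may need
    # to run the hdfs commands as the hadoop user created by the playbook)
    hdfs dfs -mkdir -p /data/logs
    hdfs dfs -put ~/tracking.log /data/logs/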

Notes:

  1. It expects to find a tracking.log file in the home directory – put an LMS log there before you run this.
  2. You'll need to manually run the OAuth management command on your LMS system – see below.
  3. You may need to do some network config to make sure your machines have the right ports open. See below.

Run on a new Ubuntu 12.04 box as a user that can sudo.



Detailed steps to get a basic single-box install:

  1. Gather information:
    1. URL of your LMS, e.g. lms.mysite.org
    2. URL and credentials for your LMS DB, e.g. mysql.mysite.org
  2. Create a box to use for the analytics stack. e.g. analytics.mysite.org.
    1. We started with a blank Ubuntu 12.04 AMI on AWS. (NOTE: there are known issues upgrading to 14.04 – changed package names, etc. They are probably easily solvable, but we haven't done it yet.)
    2. Ensure that this box can talk to the LMS via HTTP: 
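      For example, using the LMS heartbeat endpoint (use https if that's how your LMS is served):

        curl -i http://lms.mysite.org/heartbeat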

    3. Ensure that this box can connect to the DB: 
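      For example, with the read-only credentials you gathered above (the user name here is a placeholder):

        sudo apt-get install -y mysql-client
        mysql -h mysql.mysite.org -u read_only -p -e "SHOW DATABASES;"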

    4. Ensure the box has the following ports open: 
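      As an assumption, this means at least SSH (22) plus whatever ports Insights and the data API are exposed on (80/443 if they sit behind nginx; 18110 and 18100 are the defaults used elsewhere in the Open edX configuration repo). To see what ended up listening on the box:

        sudo netstat -tlnp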

    5. Install git, python, and other tools:
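      For example (the exact package list may vary; the MySQL/SSL dev headers are needed to build the Python MySQL and crypto libraries):

        sudo apt-get update
        sudo apt-get install -y git build-essential python-dev python-pip libmysqlclient-dev libffi-dev libssl-dev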

    6. Create a virtualenv 
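      For example:

        sudo pip install virtualenv
        virtualenv ~/venvs/ansible
        . ~/venvs/ansible/bin/activate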

  3. Run ansible to set up most of the services. Command is:
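    A sketch (the playbook name and any --extra-vars are assumptions; check the edx/configuration repo for the current single-box analytics playbook):

      git clone https://github.com/edx/configuration.git
      pip install -r configuration/requirements.txt    # installs ansible into the active virtualenv
      cd configuration/playbooks
      ansible-playbook -i 'localhost,' -c local analytics_single.yml    # plus any --extra-vars your setup needs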

    It will do the following:

    1. Install and configure hadoop, hive and sqoop
    2. Configure SSH daemon on the hadoop master node
    3. Configure the result store database
      1. Setup databases
      2. Setup users
    4. Configure data API
      1. Shared secret
      2. Database connection
    5. Configure Insights
      1. API shared secret
      2. Tell insights where the LMS is
  4. Check it:
    1. Run the built-in "compute pi" hadoop job 
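      For example, using the bundled examples jar (the jar path depends on the Hadoop version the playbook installed; you may need to run this as the hadoop user it created):

        hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100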

    2. Make sure you can run hive 
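      For example (run as the hadoop user if hive isn't on your PATH):

        hive -e 'SHOW DATABASES;'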

    3. The API should be up. 
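      For example, from the analytics box itself (the port is an assumption; it's the default used elsewhere in Open edX configuration for the data API behind nginx):

        curl -i http://localhost:18100/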

    4. The Insights app should be up: go to insights.mysite.org and make sure the home page is there. You won't be able to log in yet.
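      Or, from the box itself (the port is an assumption):

        curl -I http://localhost:18110/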

  5. Get some test logs into HDFS
    1. Copy some log files into HDFS:
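      For example (the HDFS paths are placeholders; use whatever you point the pipeline's event-log source at below):

        hdfs dfs -mkdir -p /data/logs
        hdfs dfs -put ~/tracking.log /data/logs/
        hdfs dfs -ls /data/logs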

    2. Set up the pipeline 
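      As a sketch, this means letting the current user SSH to itself (remote-task logs in over SSH even on a single box) and getting the pipeline code:

        # ssh self-login
        ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
        cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
        ssh localhost true    # should not prompt for a password

        # pipeline code and its console scripts
        git clone https://github.com/edx/edx-analytics-pipeline.git
        virtualenv ~/venvs/pipeline && . ~/venvs/pipeline/bin/activate
        pip install -e edx-analytics-pipeline    # gives you the launch-task / remote-task commands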

    3. Check the pipeline install by running a simple job that counts events per day. The first run passes a lot of parameters to set up the pipeline; once that's done, later runs can use --skip-setup (see below). The --user should be set to the current user (the one with SSH self-login set up).
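      A sketch (the dates, output path, and config file path are assumptions; adjust to the logs you loaded):

        remote-task --host localhost --user $USER --remote-name analyticstack --wait \
            --repo https://github.com/edx/edx-analytics-pipeline \
            --override-config $HOME/edx-analytics-pipeline/config/devstack.cfg \
            TotalEventsDailyTask \
            --interval 2015 \
            --output-root hdfs://localhost:9000/output/ \
            --local-scheduler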

  6. Finish the rest of the pipeline config:
    1. Write config files for the pipeline so that it knows where the LMS database is: 
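      A sketch of the two pieces involved (file locations and the section/option names are assumptions based on the pipeline's example configs; check your checkout): a credentials JSON file for a read-only LMS MySQL user, and an override config pointing the import tasks at it.

        # /home/ubuntu/input-credentials.json  (read-only LMS MySQL user)
        {
            "host": "mysql.mysite.org",
            "port": "3306",
            "username": "read_only",
            "password": "secret"
        }

        # override.cfg excerpt: point the import tasks at those credentials
        [database-import]
        credentials = /home/ubuntu/input-credentials.json
        database = edxapp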

    2. Test it: 
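      For example, a small task that imports one table from the LMS database (the task name is an assumption; any of the pipeline's sqoop import tasks exercises the same config):

        remote-task --host localhost --user $USER --remote-name analyticstack --skip-setup --wait \
            --override-config $HOME/edx-analytics-pipeline/override.cfg \
            ImportAuthUserTask --local-scheduler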

      If it succeeds, you'll see: 

  7. Finish the LMS -> Insights SSO config: LMS OAuth Trusted Client Registration.
    1. You'll be setting up the connection between Insights and the LMS, so that single sign-on works.
      1. Run the following management command on the LMS machine
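        A sketch (URLs, key, and secret are placeholders; the key/secret must match what ansible configured Insights with). Run it from the edx-platform directory as the edxapp user, with the edxapp virtualenv active:

          ./manage.py lms --settings=aws create_oauth2_client \
              http://insights.mysite.org \
              http://insights.mysite.org/complete/edx-oidc/ \
              confidential \
              --client_name insights \
              --client_id YOUR_OAUTH2_KEY \
              --client_secret YOUR_OAUTH2_SECRET \
              --trusted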

      2. Check it: go to insights.mysite.org and sign in; you should be redirected to the LMS to authenticate and land back in Insights, logged in.


  8. Automate copying of logs. You probably don't want to do it manually all the time. Options:
    1. A cron job that just copies all your logs from the LMS servers regularly (a cron sketch follows this list)
    2. A job that copies logs to S3, using S3 as your HDFS store (update the config to match...)
  9. Schedule launch-task jobs to actually run all the pipeline tasks regularly (also covered in the cron sketch after this list)
    1. Here's the list: https://github.com/edx/edx-analytics-pipeline/wiki/Tasks-to-Run-to-Update-Insights
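A cron sketch covering both (hosts, paths, and the wrapper script are placeholders; pick times that match your traffic and the task list above):

    # /etc/cron.d/analytics
    # 1am: pull new tracking logs from the LMS and push them into HDFS
    0 1 * * * ubuntu rsync -a lms.mysite.org:/edx/var/log/tracking/ /home/ubuntu/incoming-logs/ && hdfs dfs -put -f /home/ubuntu/incoming-logs/* /data/logs/
    # 2am: run the Insights pipeline tasks (wrap the remote-task invocations in a script)
    0 2 * * * ubuntu /home/ubuntu/run-pipeline-tasks.sh >> /home/ubuntu/pipeline-cron.log 2>&1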


Resources

Desired end state:

The Installing and Configuring Open edX guide in the docs repo should include our best stab at these instructions.

TODO:

Start with https://github.com/edx/edx-documentation/pull/216

Further improvements

  • Clean up the old pipeline tasks to take standard params from the main config
  • Get an install that uses EMR for Hadoop working and document it