This page describes how to set up the analytics stack in a small-scale, production-like setup: everything on one box, with no AWS-specific features. The setup is mostly automated using ansible, but a few pieces are still manual.
- Set up a new box and ensure it can connect to the LMS and the LMS mysql DB
- Run ansible to install all the things and do most of the configuration
- Manually finish a few bits of configuration (in particular, OAuth config on the LMS side)
- Copy over tracking logs and run some test jobs
- Automate loading of tracking logs and schedule jobs to run regularly
TL;DR – just give me the script
This is a bash script to install all the things.
- It expects to find a tracking.log file in the home directory – put an LMS log there before you run this.
- You'll need to manually run the OAuth management command on your LMS system – see below.
- You may need to do some network config to make sure your machines have the right ports open. See below.
Run on a new Ubuntu 12.04 box as a user that can sudo.
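A rough sketch of what such a script looks like (the repo URL, playbook name, and package list are assumptions, not the canonical script; see the detailed steps below for what each piece does):

```bash
#!/usr/bin/env bash
# Sketch of the one-shot installer; repo URL, playbook name, and
# variables are assumptions -- see the detailed steps below.
set -e

sudo apt-get update
sudo apt-get install -y git python-dev python-pip python-virtualenv libmysqlclient-dev

virtualenv ~/venv
. ~/venv/bin/activate
pip install ansible

git clone https://github.com/edx/edx-analytics-configuration.git
cd edx-analytics-configuration
ansible-playbook -i 'localhost,' -c local analytics_single.yml \
    --extra-vars "..."    # see the ansible step below

# Load the tracking log the script expects in the home directory:
hadoop fs -mkdir -p hdfs://localhost:9000/data
hadoop fs -put ~/tracking.log hdfs://localhost:9000/data/
```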
Detailed steps to get a basic single-box install:
- Gather information:
- URL of your LMS, e.g. lms.mysite.org
- URL of and credentials for your LMS DB, e.g. mysql.mysite.org
- Create a box to use for the analytics stack. e.g. analytics.mysite.org.
- We started with a blank Ubuntu 12.04 AMI on AWS (NOTE: there are known issues upgrading to 14.04 – changed package names, etc. They are probably easy to solve, but we haven't done it yet)
Ensure that this box can talk to the LMS via HTTP:
Ensure that this box can connect to the DB:
Ensure the box has the following ports open:
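For example (hostnames are the examples from above; the heartbeat path, user, and schema names are assumptions about your install):

```bash
# HTTP to the LMS -- any URL that returns a 200 will do:
curl -I http://lms.mysite.org/heartbeat

# MySQL to the LMS DB, with whatever credentials you gathered:
mysql -h mysql.mysite.org -u read_only -p -e "SELECT 1" edxapp

# Inbound ports: at minimum SSH, plus whatever ports you serve Insights
# and the data API on. From another machine:
nc -zv analytics.mysite.org 22
```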
Install git, python, and other tools
Create a virtualenv
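For example (the package list is a minimal assumption for Ubuntu 12.04):

```bash
sudo apt-get update
sudo apt-get install -y git python-dev python-pip python-virtualenv libmysqlclient-dev

# The virtualenv holds ansible and the pipeline tooling:
virtualenv ~/venv
. ~/venv/bin/activate
```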
Run ansible to set up most of the services. Command is:
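The exact invocation depends on your checkout. As a sketch, assuming the playbook runs locally against this box (the repo, playbook name, and variable name below are assumptions):

```bash
. ~/venv/bin/activate
pip install ansible

git clone https://github.com/edx/edx-analytics-configuration.git
cd edx-analytics-configuration

# 'localhost,' (with the trailing comma) is an inline one-host inventory.
ansible-playbook -i 'localhost,' -c local analytics_single.yml \
    --extra-vars "lms_base_url=https://lms.mysite.org"    # hypothetical variable name
```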
It will do the following:
- Install and configure hadoop, hive and sqoop
- Configure SSH daemon on the hadoop master node
- Configure the result store database
- Set up databases
- Set up users
- Configure data API
- Shared secret
- Database connection
- Configure Insights
- API shared secret
- Tell insights where the LMS is
- Check it:
Run the built-in "compute pi" hadoop job
Make sure you can run hive
The API should be up.
The Insights app should be up: go to insights.mysite.org and make sure the home page loads. You won't be able to log in yet.
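Concretely, something like this (the jar path and API port are assumptions for a default install):

```bash
# Built-in "compute pi" job; the examples jar location varies by hadoop version:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100

# Hive starts and can reach its metastore:
hive -e 'SHOW TABLES;'

# The data API answers on its port (18100 is an assumption):
curl -I http://localhost:18100/
```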
- Get some test logs into HDFS
Copy some log files into HDFS:
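For example, using the tracking.log from your home directory (the /data path is just a convention; match whatever your pipeline config expects):

```bash
hadoop fs -mkdir -p hdfs://localhost:9000/data
hadoop fs -put ~/tracking.log hdfs://localhost:9000/data/
hadoop fs -ls hdfs://localhost:9000/data
```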
Set up the pipeline
Check the pipeline install by running a simple job to count events per day. The first run takes a lot of parameters to set up the pipeline; once that setup has run, later invocations can pass --skip-setup (used below). The user should be set to the current user (the one with SSH self-login set up).
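Something along these lines; remote-task is the pipeline's launcher entry point, but the task name, flags, and paths below are a sketch to check against the pipeline repo:

```bash
. ~/venv/bin/activate
pip install -e git+https://github.com/edx/edx-analytics-pipeline.git#egg=edx-analytics-pipeline

# First run: no --skip-setup, so the pipeline is installed on the target
# (this same box) before the job runs. All names are illustrative.
remote-task --host localhost --user $USER --remote-name analyticstack --wait \
    TotalEventsDailyTask \
    --interval 2014 \
    --output-root hdfs://localhost:9000/output/ \
    --local-scheduler
```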
- Finish the rest of the pipeline config:
Write config files for the pipeline so that it knows where the LMS database is:
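A sketch of the two files, following the pattern of the pipeline's example configs (section and option names should be verified against config/ in the pipeline repo):

```bash
# JSON credentials for the LMS database gathered earlier:
cat > ~/lms-creds.json <<'EOF'
{
  "host": "mysql.mysite.org",
  "port": "3306",
  "username": "read_only",
  "password": "CHANGEME"
}
EOF

# Override config pointing the import tasks at those credentials:
cat > ~/override.cfg <<'EOF'
[database-import]
credentials = /home/ubuntu/lms-creds.json
database = edxapp
destination = hdfs://localhost:9000/import
EOF
```

Pass the override file to jobs with --override-config (check remote-task --help for the exact flag).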
If it succeeds, you'll see:
- Finish the LMS -> Insights SSO config: LMS OAuth Trusted Client Registration.
- You'll be setting up the connection between Insights and the LMS so that single sign-on works.
Run the following management command on the LMS machine:
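A sketch using edx-platform's create_oauth2_client management command (paths and the settings flag vary by install; the client id and secret must match what Insights was configured with):

```bash
sudo -u www-data /edx/bin/python.edxapp /edx/bin/manage.edxapp lms --settings=aws \
    create_oauth2_client \
    http://insights.mysite.org \
    http://insights.mysite.org/complete/edx-oidc/ \
    confidential \
    --client_name insights \
    --client_id YOUR_OAUTH2_KEY \
    --client_secret YOUR_OAUTH2_SECRET \
    --trusted
```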
- Check it: you should now be able to log in to Insights using your LMS account.
- Automate copying of logs. You probably don't want to do this by hand every time. Options (an example cron entry follows the list):
- a cron job that regularly copies all your logs from the LMS servers
- a job that copies logs to S3, using S3 as your HDFS store (update the config to match...)
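For the first option, an example crontab entry on the analytics box (paths, times, and hostnames are illustrative, and rsync assumes SSH keys to the LMS host):

```bash
# m h  dom mon dow  command
# Naive version: re-putting a file that already exists in HDFS will fail,
# so a real job should only sync files that are new since the last run.
30 2 * * * rsync -a lms.mysite.org:/edx/var/log/tracking/ ~/tracking-logs/ \
    && hadoop fs -put ~/tracking-logs/* hdfs://localhost:9000/data/
```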
- Schedule launch-task jobs to actually run all the pipeline tasks regularly (an example crontab entry follows)
- Here's the list: https://github.com/edx/edx-analytics-pipeline/wiki/Tasks-to-Run-to-Update-Insights
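For example, a nightly crontab entry that runs one of those tasks, passing --skip-setup now that the first run has done the setup (the task name is from the list above; the interval and flags are illustrative):

```bash
# m h  dom mon dow  command
0 4 * * * . ~/venv/bin/activate && remote-task --host localhost --user ubuntu \
    --skip-setup --wait ImportEnrollmentsIntoMysql --interval 2014 --local-scheduler
```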
Desired end state:
The Installing and Configuring Open edX guide in the docs repo should include our best stab at these instructions.
Start with https://github.com/edx/edx-documentation/pull/216
- Clean up the old pipeline tasks to take standard params from the main config
- Get an install that uses EMR for hadoop working, and document that