edX Analytics Installation
This page describes how to set up the analytics stack in a small-scale, production-like setup, all on one box, with no use of AWS-specific features. The setup is mostly automated using Ansible, but a few steps are still manual.
This is a "production-like" setup, but requires some tweaking to be truly production quality.
For example, we strongly recommend that you use HTTPS instead of HTTP for both Insights and the LMS in production environments. These instructions do not yet fully cover an HTTPS setup, but they include some pointers to help you get started.
Installation advice
For reference see:
yarn nodemanager default port -> https://goo.gl/1uy6kx
xqueue default port -> https://goo.gl/Ra5g3R
Install overview:
- Set up a new box and ensure it can connect to the LMS and the LMS mysql DB
- Run ansible to install all the things and do most of the configuration
- Manually finish a few bits of configuration (in particular, OAuth config on the LMS side)
- Copy over tracking logs and run some test jobs
- Automate loading of tracking logs and schedule jobs to run regularly
TL;DR – just give me the script
This is a bash script to install all the things.
- It expects to find a tracking.log file in the home directory; put an LMS log there before you run this.
- You'll need to manually run the OAuth management command on your LMS system – see below.
- You may need to do some network config to make sure your machines have the right ports open. See below.
Run on a new Ubuntu 12.04 box as a user that can sudo.
#!/bin/bash
LMS_HOSTNAME="https://mulby.sandbox.edx.org"
INSIGHTS_HOSTNAME="" # Change this to the externally visible domain and scheme for your Insights install, ideally HTTPS
DB_USERNAME="read_only"
DB_HOST="localhost"
DB_PASSWORD="password"
DB_PORT="3306"

# Run this script to set up the analytics pipeline
echo "Assumes that there's a tracking.log file in \$HOME"
sleep 2

echo "Create ssh key"
ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
echo >> ~/.ssh/authorized_keys # Make sure there's a newline at the end
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# check: ssh localhost "echo It worked!" -- make sure it works.

echo "Install needed packages"
sudo apt-get update
sudo apt-get install -y git python-pip python-dev libmysqlclient-dev
sudo pip install virtualenv

echo 'create an "ansible" virtualenv and activate it'
virtualenv ansible
. ansible/bin/activate
git clone https://github.com/edx/configuration.git
cd configuration/
make requirements
cd playbooks/
echo "running ansible -- it's going to take a while"
ansible-playbook -i localhost, -c local analytics_single.yml --extra-vars "INSIGHTS_LMS_BASE=$LMS_HOSTNAME INSIGHTS_BASE_URL=$INSIGHTS_HOSTNAME"

echo "-- Set up pipeline"
cd $HOME
sudo mkdir -p /edx/var/log/tracking
sudo cp ~/tracking.log /edx/var/log/tracking
sudo chown hadoop /edx/var/log/tracking/tracking.log
echo "Waiting 70 seconds to make sure the logs get loaded into HDFS"
# Hack hackity hack hack -- cron runs every minute and loads data from /edx/var/log/tracking
sleep 70

# Make a new virtualenv -- otherwise will have conflicts
echo "Make pipeline virtualenv"
virtualenv pipeline
. pipeline/bin/activate

echo "Check out pipeline"
git clone https://github.com/edx/edx-analytics-pipeline
cd edx-analytics-pipeline
make bootstrap

# HACK: make ansible do this
# String values are quoted so that the file is valid JSON
cat <<EOF > /edx/etc/edx-analytics-pipeline/input.json
{"username": "$DB_USERNAME", "host": "$DB_HOST", "password": "$DB_PASSWORD", "port": $DB_PORT}
EOF

echo "Run the pipeline"
# Ensure you're in the pipeline virtualenv
remote-task --host localhost --repo https://github.com/edx/edx-analytics-pipeline --user ubuntu \
  --override-config $HOME/edx-analytics-pipeline/config/devstack.cfg \
  --wheel-url http://edx-wheelhouse.s3-website-us-east-1.amazonaws.com/Ubuntu/precise \
  --remote-name analyticstack --wait \
  TotalEventsDailyTask --interval 2016 --output-root hdfs://localhost:9000/output/ --local-scheduler

echo "If you got this far without error, you should try running the real pipeline tasks listed/linked below"
Detailed steps to get a basic single-box install:
- Gather information:
- URL of your LMS, e.g. lms.mysite.org
- URL and credentials for your LMS DB, e.g. mysql.mysite.org
- Create a box to use for the analytics stack, e.g. analytics.mysite.org.
- We started with a blank Ubuntu 12.04 AMI on AWS (NOTE: there are known issues upgrading to 14.04: changed package names, etc. They are probably easily solvable, but we haven't done it yet)
Ensure that this box can talk to the LMS via HTTP:
curl lms.mysite.org
Ensure that this box can connect to the DB:
telnet mysql.mysite.org 3306
Ensure the box has the following ports open:
80 -- for insights (actually 18110 at the moment -- should be changed) # what else?
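The connectivity checks above can also be scripted. The following is a minimal sketch, assuming bash and the coreutils timeout command are available; check_port is a hypothetical helper name, and the hostnames in the usage comments are the placeholders used on this page.

```bash
#!/bin/bash
# Sketch: test whether a TCP port accepts connections, using bash's
# built-in /dev/tcp redirection (so telnet is not required).
# check_port is a hypothetical helper, not part of any edX tooling.
# Returns 0 if the port is open, 1 if not, 2 on missing arguments.
check_port() {
    local host="$1" port="$2"
    if [ -z "$host" ] || [ -z "$port" ]; then
        echo "usage: check_port HOST PORT" >&2
        return 2
    fi
    # timeout (coreutils) gives up after 5 seconds if nothing answers.
    if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "open: $host:$port"
    else
        echo "closed or unreachable: $host:$port"
        return 1
    fi
}

# Example usage with the placeholder hostnames from this page:
#   check_port mysql.mysite.org 3306
#   check_port lms.mysite.org 80
```

This only shows that something is listening on the port; it does not verify credentials or that the service behind it is healthy.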
Install git, Python, and other tools
sudo apt-get update
sudo apt-get install git
sudo apt-get install python-pip
sudo apt-get install python-dev
sudo pip install virtualenv
Create a virtualenv
# create an "ansible" virtualenv and activate it
virtualenv ansible
. ansible/bin/activate
Run ansible to set up most of the services. Command is:
git clone https://github.com/edx/configuration.git
cd configuration/
make requirements
cd playbooks/
ansible-playbook -i localhost, -c local analytics_single.yml --extra-vars "INSIGHTS_LMS_BASE=mysite.org"
# (If your site uses https, change the scheme and set the oauth flag to true. Enforce_secure means "insist on https".)
# wait for a while :)
It will do the following:
- Install and configure hadoop, hive and sqoop
- Configure SSH daemon on the hadoop master node
- Configure the result store database
- Set up databases
- Set up users
- Configure data API
- Shared secret
- Database connection
- Configure Insights
- API shared secret
- Tell insights where the LMS is
- Check it:
Run the built-in "compute pi" hadoop job
sudo su - hadoop
cd /edx/app/hadoop
hadoop jar hadoop-2.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar pi 2 100
# it should compute something -- I got pi = 3.12. Close enough :)
Make sure you can run hive
/edx/app/hadoop/hive/bin/hive
# it should work
# ^D to get back to your regular user
The API should be up.
How to check?
The Insights app should be up: go to insights.mysite.org, make sure home page is there. You won't be able to log in yet.
# Insights gunicorn is on 8110
curl localhost:8110
# Insights nginx (the externally facing view) should be 18110
curl mybox.org:18110
# TODO: switch nginx port to 80
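A bare curl prints the page body, which makes it hard to tell a healthy response from an error page. As a rough sketch, the check can look at the HTTP status code instead. http_status and status_is_up are hypothetical helper names; the ports are the ones mentioned on this page, and treating any 2xx or 3xx response as "up" is an assumption.

```bash
#!/bin/bash
# Print only the HTTP status code for a URL.
# curl prints "000" when it cannot connect at all.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Assumption: any 2xx or 3xx response counts as "up".
status_is_up() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage (gunicorn on 8110, nginx on 18110):
#   for url in http://localhost:8110 http://localhost:18110; do
#       code="$(http_status "$url" || true)"
#       if status_is_up "$code"; then
#           echo "UP   $url ($code)"
#       else
#           echo "DOWN $url ($code)"
#       fi
#   done
```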
- Get some test logs into HDFS
Copy some log files into HDFS:
# scp tracking.log onto the machine from the LMS. Then...
sudo mkdir /edx/var/log/tracking
sudo cp tracking.log /edx/var/log/tracking
sudo chown hadoop /edx/var/log/tracking/tracking.log
# wait a minute -- ansible creates a cron job to load files in that dir every minute
# Check it
hdfs dfs -ls /data
Found 1 items
-rw-r--r-- 1 hadoop supergroup 308814 2015-10-15 14:31 /data/tracking.log
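Instead of waiting a fixed amount of time for the cron job, a script can poll until the file shows up in HDFS. This is a minimal sketch; wait_until is a hypothetical helper, and the /data/tracking.log path is taken from the listing above.

```bash
#!/bin/bash
# Sketch: retry a command until it succeeds or a timeout expires.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the command succeeds, 1 if the timeout is reached.
wait_until() {
    local timeout="$1" interval="$2"
    shift 2
    local waited=0
    while ! "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example usage: wait up to 2 minutes for the cron job to load the log.
# "hdfs dfs -test -e PATH" exits 0 when PATH exists in HDFS.
#   wait_until 120 5 hdfs dfs -test -e /data/tracking.log \
#       || echo "tracking.log did not appear in HDFS"
```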
Set up the pipeline
ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ''
echo >> ~/.ssh/authorized_keys # Make sure there's a newline at the end
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# check: ssh localhost "echo It worked!" -- make sure it works.

# Make a new virtualenv -- otherwise will have conflicts
virtualenv pipeline
. pipeline/bin/activate
git clone https://github.com/edx/edx-analytics-pipeline
cd edx-analytics-pipeline
make bootstrap
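Running the key setup more than once appends the same public key to authorized_keys each time. The sketch below makes that step safe to repeat; add_key_once is a hypothetical helper, not something the playbook provides.

```bash
#!/bin/bash
# Sketch: append a public key to an authorized_keys file only if that
# exact line is not already present.
# Usage: add_key_once PUBLIC_KEY_FILE AUTHORIZED_KEYS_FILE
add_key_once() {
    local pub_file="$1" auth_file="$2" key
    key="$(cat "$pub_file")"
    touch "$auth_file"
    # Make sure the file ends with a newline before appending to it.
    if [ -s "$auth_file" ] && [ -n "$(tail -c 1 "$auth_file")" ]; then
        echo >> "$auth_file"
    fi
    # -x: match whole lines, -F: fixed string (no regex), -q: quiet.
    if ! grep -qxF -- "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
}

# Example usage:
#   add_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
#   ssh localhost "echo It worked!"
```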
Check the pipeline install by running a simple job that counts events per day. This first run needs many parameters because it also sets up the pipeline before running the job; later runs can use --skip-setup instead. Set --user to the current user (the one with the ssh self-login set up).
# Ensure you're in the pipeline virtualenv
remote-task --host localhost --repo https://github.com/edx/edx-analytics-pipeline --user ubuntu \
  --override-config $HOME/edx-analytics-pipeline/config/devstack.cfg \
  --wheel-url http://edx-wheelhouse.s3-website-us-east-1.amazonaws.com/Ubuntu/precise \
  --remote-name analyticstack --wait \
  TotalEventsDailyTask --interval 2015 --output-root hdfs://localhost:9000/output/ --local-scheduler
- Finish the rest of the pipeline config:
Write config files for the pipeline so that it knows where the LMS database is:
sudo vim /edx/etc/edx-analytics-pipeline/input.json
# put in the right url and credentials for your LMS database
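If you would rather generate the file than edit it by hand, something like the following sketch could work. The field names (username, host, password, port) are the ones used in the install script on this page; render_db_config is a hypothetical helper, and it assumes the values contain no double quotes or backslashes, since it does no JSON escaping.

```bash
#!/bin/bash
# Sketch: print the JSON credentials document for the pipeline.
# Assumes the values need no JSON escaping (no quotes or backslashes).
# Usage: render_db_config USERNAME HOST PASSWORD PORT
render_db_config() {
    local username="$1" host="$2" password="$3" port="$4"
    printf '{"username": "%s", "host": "%s", "password": "%s", "port": %s}\n' \
        "$username" "$host" "$password" "$port"
}

# Example usage (sudo tee is used because the target directory may not
# be writable by your regular user):
#   render_db_config read_only mysql.mysite.org 'password' 3306 \
#       | sudo tee /edx/etc/edx-analytics-pipeline/input.json > /dev/null
```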
Test it:
remote-task --host localhost --user ubuntu --remote-name analyticstack --skip-setup --wait ImportEnrollmentsIntoMysql --interval 2016 --local-scheduler
If it succeeds, you'll see:
sudo mysql
SELECT * FROM reports.course_enrollment_daily;
# Should give enrollments over time. Note that this only counts enrollments in the event logs -- if you manually created users / enrollments in the DB, they won't be counted.
- Finish the LMS -> Insights SSO config: LMS OAuth Trusted Client Registration.
- You'll be setting up the connection between Insights and the LMS, so single sign on works.
Run the following management command on the LMS machine:
sudo su edxapp
/edx/bin/python.edxapp /edx/bin/manage.edxapp lms --setting=production create_oauth2_client confidential --client_name insights --client_id YOUR_OAUTH2_KEY --client_secret secret --trusted
# Replace "secret", "YOUR_OAUTH2_KEY", and the url of your Insights box.
# TODO: make the ansible script override these
# INSIGHTS_BASE_URL
# INSIGHTS_OAUTH2_KEY
# INSIGHTS_OAUTH2_SECRET
# Also set other secrets to more secret values.
# Ensure that JWT_ISSUER and OAUTH_OIDC_ISSUER on the LMS in /edx/app/edxapp/lms.env.json match the url root in
# /edx/etc/insights.yml (SOCIAL_AUTH_EDX_OIDC_URL_ROOT). This should be the case unless your environment is weird
# (ala edx sandboxes are really username.sandbox.edx.org but the setting is "int.sandbox.edx.org")
- Check it:
Log into LMS as a staff user. Ensure you can log into Insights and see all courses you have staff access to.
- Automate copying of logs. You probably don't want to do it manually all the time. Options:
- cron job just copying all your logs from the LMS servers regularly
- job to copy logs to S3, use S3 as your HDFS store. (update config to match...)
- Schedule launch-task jobs to actually run all the pipeline tasks regularly
- Here's the list: https://github.com/edx/edx-analytics-pipeline/wiki/Tasks-to-Run-to-Update-Insights
# Ensure you're in the pipeline virtualenv
remote-task --host localhost --user ubuntu --remote-name analyticstack --skip-setup --wait CourseActivityWeeklyTask --local-scheduler \
  --end-date $(date +%Y-%m-%d -d "today") \
  --weeks 24 \
  --n-reduce-tasks 1 # number of reduce slots in your cluster -- we only have 1
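One way to schedule this is a small wrapper script called from cron. The following is only a sketch: course_activity_cmd, the script path, the log path, and the cron schedule are all hypothetical, and the remote-task flags are copied from the command above.

```bash
#!/bin/bash
# Hypothetical wrapper, e.g. saved as $HOME/run_course_activity.sh

# Print the remote-task command line for a given end date (YYYY-MM-DD).
course_activity_cmd() {
    local end_date="$1"
    echo "remote-task --host localhost --user ubuntu --remote-name analyticstack" \
         "--skip-setup --wait CourseActivityWeeklyTask --local-scheduler" \
         "--end-date $end_date --weeks 24 --n-reduce-tasks 1"
}

# To run the task from the wrapper (inside the pipeline virtualenv):
#   . "$HOME/pipeline/bin/activate"
#   $(course_activity_cmd "$(date +%Y-%m-%d)")
#
# Example crontab entry (hypothetical schedule and paths), daily at 04:00:
#   0 4 * * * /home/ubuntu/run_course_activity.sh >> /home/ubuntu/course_activity.log 2>&1
```

Computing the date inside the wrapper rather than in the crontab line avoids having to escape the % characters, which cron treats specially.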
- Link to ansible playbook we use: https://github.com/edx/configuration/blob/master/playbooks/edx-east/analytics_single.yml
- Devstack docs: http://edx.readthedocs.org/projects/edx-installing-configuring-and-running/en/latest/devstack/analytics_devstack.html
- https://github.com/edx/edx-analytics-configuration
- http://edx.readthedocs.io/projects/edx-installing-configuring-and-running/en/latest/installation/analytics/index.html (where this doc should live)
- https://github.com/edx/edx-analytics-pipeline/wiki/Tasks-to-Run-to-Update-Insights
- Mailing list: https://groups.google.com/forum/#!forum/openedx-analytics
Desired end state:
The installing + configuring Open edX guide in the docs repo should include our best stab at these instructions.
Start with https://github.com/edx/edx-documentation/pull/216
- Get a single machine install working. Get instructions into the docs.
- Try to include ways to test that things are working along the way
- Get https://github.com/edx/configuration/pull/2362 merged
- Improve the ansible script to do more of the setup automatically
- include cron to run tasks daily? Leave "get tracking logs to box" to the end user
- Make insights run on port 80 rather than 18110
- Replace all other docs with pointers to the One True Doc (so that documentation ends up under http://edx.readthedocs.io/projects/edx-installing-configuring-and-running/en/latest/installation/analytics/index.html).
- Analytics pipeline wiki: https://github.com/edx/edx-analytics-pipeline/wiki
- Analytics pipeline and dashboard repo README docs
- Send update to openedx-analytics list.
- Update the docs PR with the info on this wiki page: https://github.com/edx/edx-documentation/pull/216
Further improvements
- Clean up the old pipeline tasks to take standard params from main config
- Get an install that uses EMR for Hadoop working, and document it