SE-1032 Adds option to store anonymous user events in module_engagement hive table

Description

Since the Public Courseware feature was released, our client's engagement data has dropped significantly, so they would like a way to count anonymous hits to this content. They use the per-learner `module_engagement` hive table for their own internal reporting, so they only need the data recorded there.

Add this configuration to store anonymous user records under the given replacement username, e.g.
```cfg
[module-engagement]
store_anonymous_username = ANONYMOUS USER
```
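
To make the intended effect concrete, here is a minimal sketch of the behaviour. It is not the pipeline's code. It assumes tab-separated records with the username in the second column, and that an anonymous event is one whose username is empty. The function name `apply_anonymous_username` is hypothetical.

```bash
#!/usr/bin/env bash
# Illustration only: mimics the effect of store_anonymous_username on
# tab-separated records (course_id <TAB> username <TAB> ...).
# Assumption: an anonymous event is one whose username field is empty.
apply_anonymous_username() {
    local replacement="${1:-}"
    awk -F'\t' -v OFS='\t' -v repl="$replacement" '
        $2 != ""   { print; next }          # named users always pass through
        repl != "" { $2 = repl; print }     # anonymous: stored under the replacement name
                                            # anonymous with no replacement: dropped
    '
}
```

Called with no argument, the sketch drops the anonymous row, as the default does. Called with `'ANONYMOUS USER'`, it keeps the row under that name.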

*JIRA tickets*: OSPR

*Merge deadline*: None, though we would appreciate feedback on the approach ASAP so we can decide whether to merge to our client's branch now.

*Testing instructions*:

1. Set up the [Analytics Pipeline docker devstack](https://github.com/edx/devstack#getting-started-on-analytics)
1. Open the pipeline shell:
```bash
make analytics-pipeline-shell
```
1. Copy the anonymous user tracking logs into the hdfs /data dir:

  • [anon_tracking.log](https://github.com/edx/edx-analytics-pipeline/files/3289758/anon_tracking.log)

  • [honor_tracking.log](https://github.com/edx/edx-analytics-pipeline/files/3289759/honor_tracking.log)
    ```
    hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ hdfs dfs -put anon_tracking.log /data/
    hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ hdfs dfs -put honor_tracking.log /data/
    hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ hdfs dfs -ls /data
    Found 2 items
    -rw-r--r-- 3 hadoop hadoop 12897 2019-06-11 06:32 /data/anon_tracking.log
    -rw-r--r-- 3 hadoop hadoop 12972 2019-06-11 06:42 /data/honor_tracking.log
    ```
    1. Check the hive database. A clean install won't have any tables in it, but even if yours does, there shouldn't be any anonymous user records in the `module_engagement` table yet.
    ```bash
    hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ hive
    hive> show tables;
    hive> select * from module_engagement;
    ```
    1. Check out this branch and install it:
    ```bash
    hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ make install
    ```
    1. Add custom configuration for the module-engagement tasks. By default, these tasks ignore anonymous user records:
    ```bash
    hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ echo "[module-engagement]
    allow_empty_insert = true" > override.cfg
    ```
    1. Process the tracking data loaded into hdfs above.
    ```bash
    hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ START_DATE=2019-06-10; TODAY=`date +%Y-%m-%d`

    # 1. Load enrollments for the interval
    hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ launch-task ImportEnrollmentsIntoMysql --interval $START_DATE-$TODAY --local-scheduler

    # 2. Process per-learner data for the interval
    launch-task ModuleEngagementIntervalTask --interval $START_DATE-$TODAY --local-scheduler
    ```
    1. Check that only the honor user records are in hive
    ```bash
    hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ hive
    hive> select * from module_engagement;
    edX/DemoX/Demo_Course_2 honor 2019-06-10 video i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f viewed 2 2019-06-10
    edX/DemoX/Demo_Course honor 2019-06-10 video i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f viewed 2 2019-06-10
    Time taken: 1.929 seconds, Fetched: 2 row(s)
    ```
    1. Update config to store anonymous users under the username, `ANONYMOUS USER`
    ```bash
    echo "[module-engagement]
    store_anonymous_username = ANONYMOUS USER
    allow_empty_insert = true" > override.cfg
    ```
    1. Overwrite the module engagement data to include these anonymous user records
    ```bash
    launch-task ModuleEngagementIntervalTask --interval $START_DATE-$TODAY --local-scheduler --overwrite-from-date $START_DATE --overwrite-mysql
    ```
    1. Check that there are now both honor and anonymous user records present in the table.
    ```bash
    hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ hive
    hive> select * from module_engagement;
    edX/DemoX/Demo_Course_2 ANONYMOUS USER 2019-06-10 video i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f viewed 2 2019-06-10
    edX/DemoX/Demo_Course_2 honor 2019-06-10 video i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f viewed 2 2019-06-10
    edX/DemoX/Demo_Course ANONYMOUS USER 2019-06-10 video i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f viewed 2 2019-06-10
    edX/DemoX/Demo_Course honor 2019-06-10 video i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f viewed 2 2019-06-10
    Time taken: 1.929 seconds, Fetched: 4 row(s)
    ```
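
For larger tables, reading the rows by eye gets tedious. A rough helper for summarising the query output by username could look like the sketch below. It assumes the hive CLI emits tab-separated columns with the username in the second column, as in the sample rows above. The function name `count_by_username` is hypothetical.

```bash
#!/usr/bin/env bash
# Summarise tab-separated module_engagement rows by username.
# Assumption: the username is the second tab-separated column.
count_by_username() {
    awk -F'\t' '
        { counts[$2]++ }
        END { for (u in counts) printf "%s\t%d\n", u, counts[u] }
    ' | LC_ALL=C sort
}

# Example (inside the pipeline shell):
#   hive -e 'select * from module_engagement;' | count_by_username
```

After the overwrite step, the summary should list `ANONYMOUS USER` alongside `honor`. Before it, only `honor` should appear.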

*Author notes and concerns*:

1. Acceptance tests are in the works.
1. This change does not modify the various summary tables generated by module_engagement, so the data will only be visible in the hive table itself. The summary tables also require that the per-learner data be linked to an enrollment record, which anonymous records cannot be, so future enhancements would be required to make this information visible as appropriate in Insights.
1. Any logs with anonymous user events will need to be re-processed with this option enabled in order to fix the historical data records.
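
As a sketch of that re-processing step, the helper below builds the overwrite command for a chosen start date. It uses only the task name and flags shown in the testing instructions. The function name `reprocess_command` and its print-only behaviour are illustrative, and it assumes `store_anonymous_username` is already set in the active configuration.

```bash
#!/usr/bin/env bash
# Sketch: print the re-processing command for a given start date.
# Only flags that appear in the testing instructions are used.
# The end date defaults to today when it is not supplied.
reprocess_command() {
    local start_date="$1"
    local end_date="${2:-$(date +%Y-%m-%d)}"
    printf 'launch-task ModuleEngagementIntervalTask --interval %s-%s --local-scheduler --overwrite-from-date %s --overwrite-mysql\n' \
        "$start_date" "$end_date" "$start_date"
}

# Print the command for review before running it by hand:
#   reprocess_command 2019-06-10
```

Printing the command rather than running it leaves room to check the interval before overwriting existing data.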

*Reviewers*

  • [ ] @gr4yscale

  • [ ] edX reviewer[s] TBD

Analytics Pipeline Pull Request

Make sure that the following steps are done before merging:

  • [ ] If you have a migration, please contact the data engineering team before merging.

  • [ ] Before merging, run the full acceptance test suite and provide the URL for the acceptance test run.

  • [ ] A member of the data engineering team has approved the pull request.

*Status*:

*Assignee*: Brian Wilson

*Reporter*: Open Source Pull Request Bot

*Labels*:

*Contributor Name*: Jillian Vogel

*Repo*: edx/edx-analytics-pipeline

*Customer*:

*Epic Link*: None

*OSCM Assignee*: None

*Platform Map Area (Levels 1 & 2)*: Data & Analytics

*Platform Map Area (Levels 3 & 4)*: None

*Priority*: Unset