Since the Public Courseware feature was released, our client's engagement data has dropped significantly, so they would like a way to count anonymous hits to this content. They use the per-learner `module_engagement` Hive tables for their own internal reporting, so they only need the data recorded there.
This change adds a configuration option that stores anonymous user records under a given replacement username, e.g.:
```cfg
[module-engagement]
store_anonymous_username = ANONYMOUS USER
```
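To make the intended behaviour concrete, here is a minimal Python sketch. It is not the pipeline's actual implementation: it assumes an anonymous event is one whose `username` field is empty or missing, and the helper names `load_replacement_username` and `resolve_username` are hypothetical. The option is read with the standard-library `configparser` purely for illustration; the pipeline loads its configuration through its own mechanism.

```python
import configparser

CONFIG_TEXT = """
[module-engagement]
store_anonymous_username = ANONYMOUS USER
"""


def load_replacement_username(config_text):
    """Return the configured replacement username, or None if the option is unset."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.get(
        "module-engagement", "store_anonymous_username", fallback=None
    )


def resolve_username(event, replacement=None):
    """Hypothetical helper: choose the username to record for a tracking event.

    Assumption: an anonymous event has an empty or missing `username`.
    Returns None when the event should be skipped.
    """
    username = (event.get("username") or "").strip()
    if username:
        return username
    # With no replacement configured, anonymous events are ignored (the default).
    return replacement


replacement = load_replacement_username(CONFIG_TEXT)
events = [{"username": "honor"}, {"username": ""}]
recorded = [resolve_username(event, replacement) for event in events]
print(recorded)
```

Leaving the option unset keeps the existing behaviour, since `resolve_username` then returns `None` for anonymous events and the caller can drop them.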
*JIRA tickets*: OSPR
*Merge deadline*: None, though we would appreciate feedback on the approach ASAP so we can decide whether to merge to our client's branch now.
*Testing instructions*:
1. Set up the [Analytics Pipeline docker devstack](https://github.com/edx/devstack#getting-started-on-analytics)
1. Open the pipeline shell:
```bash
make analytics-pipeline-shell
```
1. Copy the anonymous and honor user tracking logs into the HDFS `/data` directory:
[anon_tracking.log](https://github.com/edx/edx-analytics-pipeline/files/3289758/anon_tracking.log)
[honor_tracking.log](https://github.com/edx/edx-analytics-pipeline/files/3289759/honor_tracking.log)
```
hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ hdfs dfs -put anon_tracking.log /data/
hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ hdfs dfs -put honor_tracking.log /data/
hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ hdfs dfs -ls /data
Found 2 items
-rw-r--r--   3 hadoop hadoop 12897 2019-06-11 06:32 /data/anon_tracking.log
-rw-r--r--   3 hadoop hadoop 12972 2019-06-11 06:42 /data/honor_tracking.log
```
1. Check the Hive database. A clean install won't have any tables in it, but even if yours does, there shouldn't be any anonymous user records in the `module_engagement` table yet.
```bash
hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ hive
hive> show tables;
hive> select * from module_engagement;
```
1. Check out this branch and install it:
```bash
hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ make install
```
1. Add custom configuration for the module-engagement tasks. By default, these tasks ignore anonymous user records:
```bash
hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ echo "[module-engagement]
allow_empty_insert = true" > override.cfg
```
1. Process the tracking data loaded into HDFS above:
```bash
hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ START_DATE=2019-06-10; TODAY=`date +%Y-%m-%d`
# Load enrollments for the interval
hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ launch-task ImportEnrollmentsIntoMysql --interval $START_DATE-$TODAY --local-scheduler
# Process per-learner data for the interval
launch-task ModuleEngagementIntervalTask --interval $START_DATE-$TODAY --local-scheduler
```
1. Check that only the honor user records are in Hive:
```bash
hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ hive
hive> select * from module_engagement;
edX/DemoX/Demo_Course_2 honor 2019-06-10 video i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f viewed 2 2019-06-10
edX/DemoX/Demo_Course honor 2019-06-10 video i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f viewed 2 2019-06-10
Time taken: 1.929 seconds, Fetched: 2 row(s)
```
1. Update the config to store anonymous users under the username `ANONYMOUS USER`:
```bash
echo "[module-engagement]
store_anonymous_username = ANONYMOUS USER
allow_empty_insert = true" > override.cfg
```
1. Overwrite the module engagement data to include these anonymous user records:
```bash
launch-task ModuleEngagementIntervalTask --interval $START_DATE-$TODAY --local-scheduler --overwrite-from-date $START_DATE --overwrite-mysql
```
1. Check that there are now both honor and anonymous user records present in the table.
```bash
hadoop@analyticspipeline:/edx/app/analytics_pipeline/analytics_pipeline$ hive
hive> select * from module_engagement;
edX/DemoX/Demo_Course_2 ANONYMOUS USER 2019-06-10 video i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f viewed 2 2019-06-10
edX/DemoX/Demo_Course_2 honor 2019-06-10 video i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f viewed 2 2019-06-10
edX/DemoX/Demo_Course ANONYMOUS USER 2019-06-10 video i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f viewed 2 2019-06-10
edX/DemoX/Demo_Course honor 2019-06-10 video i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f viewed 2 2019-06-10
Time taken: 1.929 seconds, Fetched: 4 row(s)
```
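For larger log files, eyeballing the query output gets tedious. The following is a rough Python sketch for tallying rows per username from saved Hive CLI output. It assumes the output is tab-separated (the Hive CLI default) and that the columns appear in the order seen in the sample rows above; the column names in `COLUMNS` and the function name `count_rows_by_username` are illustrative guesses, not taken from the table schema.

```python
from collections import Counter

# Assumed column order, inferred from the sample rows; the names are illustrative.
COLUMNS = (
    "course_id", "username", "date", "entity_type",
    "entity_id", "event", "count", "dt",
)


def count_rows_by_username(hive_output):
    """Count result rows per username in tab-separated Hive CLI output."""
    counts = Counter()
    for line in hive_output.splitlines():
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip prompts, "Time taken" lines, and blank lines
        row = dict(zip(COLUMNS, fields))
        counts[row["username"]] += 1
    return counts


ENTITY_ID = "i4x-edX-DemoX-video-8c0028eb2a724f48a074bc184cd8635f"
sample_output = "\n".join([
    "\t".join(["edX/DemoX/Demo_Course", "ANONYMOUS USER", "2019-06-10",
               "video", ENTITY_ID, "viewed", "2", "2019-06-10"]),
    "\t".join(["edX/DemoX/Demo_Course", "honor", "2019-06-10",
               "video", ENTITY_ID, "viewed", "2", "2019-06-10"]),
    "Time taken: 1.929 seconds, Fetched: 2 row(s)",
])

counts = count_rows_by_username(sample_output)
print(dict(counts))
```

Splitting on tabs rather than whitespace matters here because the replacement username `ANONYMOUS USER` contains a space.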
*Author notes and concerns*:
1. Acceptance tests are in the works.
1. This change does not modify the various summary tables generated by module_engagement, so the data will only be visible in the Hive table itself.
The summary tables also require that the per-learner data be linked to an enrollment record, which anonymous records cannot be, so future enhancements would be required to make this information visible as appropriate in Insights.
1. Any logs with anonymous user events will need to be re-processed with this option enabled in order to fix the historical data records.
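As a rough illustration of the re-processing mentioned in the last note, the Python sketch below assembles the same `launch-task` invocation used in the testing instructions for an arbitrary historical interval. The helper name `build_reprocess_command` is hypothetical, and `--local-scheduler` is carried over from the devstack steps; a production deployment may schedule the task differently.

```python
from datetime import date


def build_reprocess_command(start, end):
    """Build the launch-task argument list for re-processing an interval.

    Uses only the flags shown in the testing instructions above.
    """
    interval = f"{start.isoformat()}-{end.isoformat()}"
    return [
        "launch-task", "ModuleEngagementIntervalTask",
        "--interval", interval,
        "--local-scheduler",
        "--overwrite-from-date", start.isoformat(),
        "--overwrite-mysql",
    ]


command = build_reprocess_command(date(2019, 6, 10), date(2019, 6, 12))
print(" ".join(command))
```

The option `store_anonymous_username` must already be present in the configuration before the command is run, otherwise the overwrite will again drop the anonymous records.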
*Reviewers*
- [ ] @gr4yscale
- [ ] edX reviewer[s] TBD
Analytics Pipeline Pull Request
---
Make sure that the following steps are done before merging:
- [ ] If you have a migration, please contact the data engineering team before merging.
- [ ] Before merging, run the full acceptance test suite and provide the URL for the acceptance test run.
- [ ] A member of the data engineering team has approved the pull request.