Initial tagstore implementation

Description

This is an implementation of a tagging system, within blockstore but easily separable if the need should arise.

Originally I started coding this to test the ability to use both Neo4j and AWS Neptune as tag storage backends (so that a taxonomy can be an arbitrary graph of tags). It turns out that there is no common protocol that works with both externally hosted Neo4j and Neptune, so I coded this as a pluggable backend model instead. Then, as I was doing that, I realized that (A) the Gremlin API used with AWS Neptune is poorly documented (or not documented at all) for use with Python and makes me want to tear my hair out compared to Neo4j Cypher, and (B) if we narrow the scope to only supporting taxonomies that contain free-form tags and simple hierarchical trees of tags, then we can use MySQL as a backend. Since the only use cases I'm aware of so far fit within that scope, I ran with it and got this prototype tagging service.

It:

  • Defines a "Taxonomy" as a collection of tags

  • Allows tags to optionally be hierarchical - tags can exist in a tree with parent-child relationships, e.g. all dogs are mammals, all mammals are animals. This is designed to support learning outcome hierarchies in particular.

  • Allows any "entity" to be tagged, where an entity could be a user, a block, a collection, etc.

  • Allows rich searching for entities by tags. e.g. "Find all large animals" will return an entity that was tagged with "large" and "mammal", since it knows that the "mammal" tag is a type of "animal" tag.

  • Has a pluggable backend that can store tags in either Neo4j or any SQL database supported by Django's ORM (we can probably ditch the Neo4j backend going forward if we deem the limitations of the Django backend to be insignificant).

  • Has a very simple Python API

  • Has no runtime dependencies other than the Python driver for your backend of choice (Neo4j or Django)

  • Uses `asyncio` for its API. This made sense when considering Neo4j / AWS Neptune as the primary backends, but offers no advantage when using the Django ORM backend, as it is [not yet](https://www.aeracode.org/2018/06/04/django-async-roadmap/) asyncio-capable. If we stick with the Django backend, it may make sense to make the API synchronous-only. Alternatively, we could bypass the Django ORM for DB reads/writes (using it only for migrations) and use `aiomysql`, which should let tagstore serve many more concurrent users.

  • Does *not* allow relationships between tags other than organizing them into a hierarchy (no support for arbitrary relationships like "dog is similar to wolf"; such advanced graph relationships - which enable other types of taxonomies and fuzzy searches - could be added later but would mean we can't use SQL backends)

  • Does *not* implement "private tags" (user A applies tag T to entity E, but only user A sees that tag). I'm not sure if we want private tags or private taxonomies. One use case for private tags could be users tagging blocks as "favorite", but I can't think of any other use cases.

  • Does *not* allow manipulating tag hierarchies once they are created, other than by adding new tags to the tree. i.e. you cannot remove tags from a hierarchy, nor change their position in the tree etc. We assume that hierarchical tags will usually be created via import/export of externally developed taxonomies.
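To illustrate why restricting taxonomies to simple trees makes a SQL backend workable, here is a minimal sketch using a "materialized path" column and the standard-library `sqlite3` module. This is not the schema used by the Django backend in this PR; the table names, the `path` column, and the helper functions are all assumptions made up for illustration.

```python
import sqlite3

# Hypothetical schema: each tag stores the ids of its ancestors plus itself
# in `path` (e.g. "4/5/6/"), so a descendant lookup is a simple prefix match.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (
        id INTEGER PRIMARY KEY,
        taxonomy TEXT NOT NULL,
        name TEXT NOT NULL,
        path TEXT NOT NULL
    );
    CREATE TABLE entity_tag (
        entity TEXT NOT NULL,
        tag_id INTEGER NOT NULL REFERENCES tag(id)
    );
""")


def add_tag(taxonomy, name, parent_id=None):
    """Insert a tag, optionally as a child of an existing tag."""
    parent_path = ""
    if parent_id is not None:
        row = conn.execute("SELECT path FROM tag WHERE id = ?", (parent_id,)).fetchone()
        parent_path = row[0]
    cur = conn.execute(
        "INSERT INTO tag (taxonomy, name, path) VALUES (?, ?, '')", (taxonomy, name)
    )
    tag_id = cur.lastrowid
    # The path ends with "/" so that "4/" never prefix-matches "40/".
    conn.execute("UPDATE tag SET path = ? WHERE id = ?", (f"{parent_path}{tag_id}/", tag_id))
    return tag_id


def tag_entity(entity, tag_id):
    conn.execute("INSERT INTO entity_tag (entity, tag_id) VALUES (?, ?)", (entity, tag_id))


def entities_tagged_with_all(tag_ids):
    """Entities that carry every given tag, either directly or via a descendant tag."""
    result = None
    for tag_id in tag_ids:
        path = conn.execute("SELECT path FROM tag WHERE id = ?", (tag_id,)).fetchone()[0]
        rows = conn.execute(
            "SELECT DISTINCT et.entity FROM entity_tag et "
            "JOIN tag t ON t.id = et.tag_id WHERE t.path LIKE ? || '%'",
            (path,),
        )
        matches = {row[0] for row in rows}
        result = matches if result is None else result & matches
    return result or set()


small = add_tag("sizes", "small")
med = add_tag("sizes", "med")
large = add_tag("sizes", "large")
animal = add_tag("biology", "animal")
mammal = add_tag("biology", "mammal", parent_id=animal)
canine = add_tag("biology", "canine", parent_id=mammal)

tag_entity("elephant", large)
tag_entity("elephant", mammal)
tag_entity("dog", med)
tag_entity("dog", canine)

large_animals = entities_tagged_with_all({large, animal})
print(large_animals)
```

A prefix match like this only works because each tag has exactly one parent. Arbitrary relationships such as "dog is similar to wolf" have no single path to store, which is one way to see why they would push the design back toward a graph database.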

API example:
```python
sizes = await self.tagstore.create_taxonomy("sizes", owner_id=1)
small = await self.tagstore.add_tag_to_taxonomy('small', sizes)
med = await self.tagstore.add_tag_to_taxonomy('med', sizes)
large = await self.tagstore.add_tag_to_taxonomy('large', sizes)

biology = await self.tagstore.create_taxonomy("Biology", owner_id=1)

animal = await self.tagstore.add_tag_to_taxonomy('animal', biology)
mammal = await self.tagstore.add_tag_to_taxonomy('mammal', biology, parent_tag=animal)
canine = await self.tagstore.add_tag_to_taxonomy('canine', biology, parent_tag=mammal)

# Create some entities:
elephant = EntityId(entity_type='thing', external_id='elephant')
await self.tagstore.add_tag_to(large, elephant)
await self.tagstore.add_tag_to(mammal, elephant)
dog = EntityId(entity_type='thing', external_id='dog')
await self.tagstore.add_tag_to(med, dog)
await self.tagstore.add_tag_to(canine, dog)

# large animals:
self.tagstore.get_entities_tagged_with_all({large, animal})
# result: {elephant}
# notice it knows that an elephant is an animal, even though we only tagged elephant with "mammal"
```

For more examples, see [`tests.py`](https://github.com/open-craft/blockstore/blob/tagstore/tagstore/backends/tests.py)
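
The `await` calls in the example only pay off if the backend itself is non-blocking. As a rough sketch of one way an `asyncio` API could sit on top of a blocking backend (such as a synchronous ORM), the blocking calls can be pushed onto a thread pool with `loop.run_in_executor`. The `BlockingBackend` and `AsyncTagstore` classes and the `get_tags_applied_to` method below are hypothetical stand-ins, not the classes in this PR.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BlockingBackend:
    """Hypothetical synchronous backend; time.sleep stands in for a DB round trip."""

    def __init__(self):
        self._tags = {}
        self._lock = threading.Lock()  # calls arrive from several worker threads

    def add_tag_to(self, tag, entity):
        time.sleep(0.01)
        with self._lock:
            self._tags.setdefault(entity, set()).add(tag)

    def get_tags_applied_to(self, entity):
        time.sleep(0.01)
        with self._lock:
            return set(self._tags.get(entity, set()))


class AsyncTagstore:
    """Hypothetical async facade that runs blocking backend calls in a thread pool."""

    def __init__(self, backend, max_workers=4):
        self._backend = backend
        self._executor = ThreadPoolExecutor(max_workers=max_workers)

    async def _run(self, func, *args):
        loop = asyncio.get_running_loop()
        # The event loop stays free while a worker thread waits on the backend.
        return await loop.run_in_executor(self._executor, func, *args)

    async def add_tag_to(self, tag, entity):
        await self._run(self._backend.add_tag_to, tag, entity)

    async def get_tags_applied_to(self, entity):
        return await self._run(self._backend.get_tags_applied_to, entity)


async def main():
    tagstore = AsyncTagstore(BlockingBackend())
    # Both writes are in flight at once, each on its own worker thread.
    await asyncio.gather(
        tagstore.add_tag_to("large", "elephant"),
        tagstore.add_tag_to("mammal", "elephant"),
    )
    return await tagstore.get_tags_applied_to("elephant")


tags = asyncio.run(main())
print(tags)
```

This keeps the event loop responsive, but concurrency is capped by the pool size, so it is a stopgap compared with a natively asynchronous driver such as `aiomysql`.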

The intended path forward would be that this lives in blockstore for now and is only used to tag XBlocks and collections, with taxonomies like "subject area" and "learning outcome".

Note to self: TODOs if we want to proceed in this direction:

  • [x] Implement more of the API (e.g. list tags hierarchically, remove tags)

  • [x] Possibly make API sync-only, possibly remove Neo4j backend

  • [x] Determine if we want to support private tags or private taxonomies - no for now

  • [ ] Make taxonomy IDs UUIDs so we get a better import/export story (when course content is imported/exported across instances, tag data could be better preserved)

  • [ ] ~Add a case-insensitive index to Tag.tag, in addition to the case-sensitive one (slight oversight)~

  • [x] Perhaps allow "/" in tags, since people may want to tag things with phrases like "accepted/approved"

Done

Reporter

Open Source Pull Request Bot

Contributor Name

Braden MacDonald

Repo

edx/blockstore

Github Lines Added

1098

Github Lines Deleted

27
