Warning

This documentation is now stale and superseded by the implementation and documentation in the Code Annotations repository.



OEP-30 outlines our intentions regarding PII annotations in code that runs on edx.org and other Open edX instances. This article collects decisions about implementation details and the outputs of OEP-30 discovery tasks.

...

Annotating 3rd Party Django Models (PLAT-2344)

Model Inheritance and Mixins

...

Implementation Decision: Treat forked repositories as 3rd party.  Do not annotate models in them directly, but rather use the 3rd party annotation mechanism (the safelist).

Generating RST docs (PLAT-2346)

In short, the Python libraries that offer RST generation are very basic, and no single one covers even the basic features we need for annotation reports. Fortunately, raw RST is meant to be human-readable and is not painful to work with directly, so we should focus on constructing raw RST ourselves for annotation reporting. See PLAT-2346 for more details.
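
As a rough illustration of constructing raw RST by hand, the sketch below renders a report dict (in the shape described under Reporting Output) as RST text. The function names `rst_heading` and `render_report`, the report title, and the bullet layout are assumptions made for this example, not part of the tooling.

```python
def rst_heading(text, underline="="):
    """Return an RST section title: the text underlined to the same length."""
    return "{}\n{}\n".format(text, underline * len(text))


def render_report(report):
    """Render a report dict (filename -> list of annotations) as raw RST."""
    lines = [rst_heading("Annotations Report")]
    for filename in sorted(report):
        lines.append(rst_heading(filename, "-"))
        for entry in report[filename]:
            # Grouped annotations nest their statements under "annotations";
            # a stand-alone annotation is itself the only statement.
            statements = entry.get("annotations", [entry])
            for statement in statements:
                data = statement["annotation_data"]
                if isinstance(data, list):  # enum annotations carry a list
                    data = ", ".join(data)
                lines.append("* ``{}`` (line {}): {}".format(
                    statement["annotation_token"],
                    statement["line_number"],
                    data,
                ))
        lines.append("")  # blank line before the next section title
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {
        "a.py": [{
            "annotation_data": "No PII is stored here",
            "annotation_token": ".. no_pii::",
            "line_number": 2,
            "found_by": ["python"],
        }],
    }
    print(render_report(sample))
```

Plain string formatting is enough here because the report only needs section titles and bullet lists.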

Extensions (PLAT-2365)

There are several languages that may need to be searched, each with their own unique comment style and challenges. Not every repository will need every language, so our goals are to:

...

test_extensions.py (Python):
import os
import yaml
from stevedore import named

test_config = """
annotations:
    pii:
        - ".. pii::"
        - ".. pii_types::":
            - id
            - name
            - other
        - ".. pii_retirement::":
            - retained
            - local_api
            - consumer_api
            - third_party
    nopii: ".. no_pii::"

extensions:
    python:
        - py
    javascript:
        - js
        - jsx
"""


def load_failed_handler(*args, **kwargs):
    """
    Callback for when we fail to load an extension, otherwise it fails silently
    """
    print(args)
    print(kwargs)


def search(ext, file_handle, file_extensions_map, filename_extension):
    """
    Executes a search on the given file, only if it is configured for this
    extension
    """
    if filename_extension not in file_extensions_map[ext.name]:
        print('{} does not support {}. Skipping.'.format(ext.name, filename_extension))
        return (ext.name, [])

    return ext.name, ext.obj.search(file_handle)


if __name__ == '__main__':
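    # safe_load only builds plain Python objects; plain yaml.load() also
    # requires an explicit Loader argument in PyYAML 6+
    config = yaml.safe_load(test_config)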

    print(config)

    # These are the names of all of our configured extensions
    configured_extension_names = list(config['extensions'].keys())

    print(configured_extension_names)

    # Load Stevedore extensions that we are configured for (and only those)
    mgr = named.NamedExtensionManager(
        names=configured_extension_names,
        namespace='annotation_finder.searchers',
        invoke_on_load=True,
        on_load_failure_callback=load_failed_handler,
        invoke_args=(config['annotations'],),  # This is temporary
    )

    # Output all found extension entry points (whether or not they were loaded)
    print(mgr.list_entry_points())

    # Output all extensions that were actually able to load
    for extension in mgr.extensions:
        print(extension)

    # Index the results by extension name
    file_extensions_map = {}
    known_extensions = set()
    for extension_name in config['extensions']:
        file_extensions_map[extension_name] = config['extensions'][extension_name]
        known_extensions.update(config['extensions'][extension_name])

    source_path = '/foo/bar/'

    # From here we could begin the actual file searching and reporting...
    # This is not optimized, but without the prints or any actual searching
    # it walks all of edx-platform in 1.18 seconds.
    for root, dirs, files in os.walk(source_path):
        for filename in files:
            filename_extension = os.path.splitext(filename)[1][1:]

            if filename_extension not in known_extensions:
                print("{} is not a known extension, skipping.".format(filename_extension))
                continue

            full_name = os.path.join(root, filename)
            print(full_name)

            with open(full_name, 'r') as file_handle:
                try:
                    # Run search() on every loaded extension; each one skips
                    # files whose extension it is not configured for
                    results = mgr.map(search, file_handle, file_extensions_map, filename_extension)
                    print(results)
                except IndexError:
                    # Should we define a catchall in config?
                    print("No file extension in {}, skipping.".format(full_name))
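
The script above only assumes that each loaded extension object accepts the annotations config on construction (via `invoke_args`) and exposes a `search(file_handle)` method. A minimal sketch of such an extension might look like the following. The class name `SimpleSearcher`, the helper `flatten_tokens`, and the naive substring matching are all assumptions made for illustration; a real searcher would be registered as an entry point in the `annotation_finder.searchers` namespace and would need language-aware comment parsing.

```python
import io


def flatten_tokens(annotations):
    """Map each annotation token to its allowed enum values (None = free text)."""
    tokens = {}
    for value in annotations.values():
        # A stand-alone annotation is a bare string; a group is a list.
        statements = [value] if isinstance(value, str) else value
        for statement in statements:
            if isinstance(statement, dict):  # enum statement: {token: [values]}
                tokens.update(statement)
            else:
                tokens[statement] = None
    return tokens


class SimpleSearcher:
    """Hypothetical searcher; Stevedore would construct it with invoke_args."""

    def __init__(self, annotations):
        self.tokens = flatten_tokens(annotations)

    def search(self, file_handle):
        """Return one dict per annotation token found in the file."""
        found = []
        for line_number, line in enumerate(file_handle, start=1):
            for token, enum_values in self.tokens.items():
                if token not in line:
                    continue
                data = line.split(token, 1)[1].strip()
                found.append({
                    "annotation_token": token,
                    # Enum statements yield a list of space-separated values
                    "annotation_data": data.split() if enum_values else data,
                    "line_number": line_number,
                })
        return found


if __name__ == "__main__":
    annotations = {
        "pii": [".. pii::", {".. pii_types::": ["id", "name", "other"]}],
        "nopii": ".. no_pii::",
    }
    source = io.StringIO("# .. pii:: Stores names\n# .. pii_types:: id name\n")
    print(SimpleSearcher(annotations).search(source))
```

This sketch does not validate enum values or group ordering, and it leaves the `found_by` field to whatever code collects results from the extension manager.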

Configuration (PLAT-2361)

Configuration for the annotation tooling needs to handle the following things:

...

Config file format (YAML):
# This section describes the known annotations
annotations:
    # An annotation can be a single statement that stands alone
    nopii: ".. no_pii::"

    # Or it can describe a group of statements, in which case 
    # the statements must appear in the same order as listed here
    pii:
        # A statement can be a simple value, in which case the
        # text that follows it will be captured
        - ".. pii::"

        # Or it can be an enum list, in which case only the values
        # included will be allowed. In this case a ".. pii::" 
        # annotation must be followed immediately by a 
        # ".. pii_types::" statement which must then be followed
        # immediately by a ".. pii_retirement::" statement.
        - ".. pii_types::":
            # Multiple enum values can be given on an annotation
            # as long as they are separated by spaces such as:
            # .. pii_types:: name username ip
            # An enum annotation must include at least one enum
            # value
            - id
            - name
            - username
            - password
            - location
            - phone_number
            - email_address
            - birth_date
            - ip
            - external_service
            - biography
            - gender
            - sex
            - image
            - video
            - other
        - ".. pii_retirement::":
            - retained
            - local_api
            - consumer_api
            - third_party

# This section is for extension configuration. Each
# sub-section is the name of a Stevedore extension
# that must be installed. Under each extension name
# is a list of the file extensions it will be used
# for.
extensions:
    python:
        - py
        - py3
        - pyw
        - rpy
        - pyt
    javascript:
        - js
        - jsx
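
The enum and ordering rules described in the config comments could be enforced along the following lines. This is a sketch under stated assumptions: the function names `parse_enum_statement` and `check_group_order` are hypothetical, and the choice to raise `ValueError` on bad input is an illustration rather than a decided behavior.

```python
def parse_enum_statement(line, token, allowed):
    """Return the enum values that follow `token`, or raise ValueError."""
    if token not in line:
        raise ValueError("{} not found in line".format(token))
    # Enum values are separated by spaces, e.g. ".. pii_types:: name ip"
    values = line.split(token, 1)[1].split()
    if not values:
        raise ValueError("{} needs at least one value".format(token))
    unknown = [value for value in values if value not in allowed]
    if unknown:
        raise ValueError(
            "{} has unknown values: {}".format(token, ", ".join(unknown))
        )
    return values


def check_group_order(found_tokens, group_tokens):
    """True if the found statements are exactly the group's, in config order."""
    return list(found_tokens) == list(group_tokens)


if __name__ == "__main__":
    allowed = ["id", "name", "username", "ip", "other"]
    print(parse_enum_statement(
        "# .. pii_types:: name username ip", ".. pii_types::", allowed
    ))
    group = [".. pii::", ".. pii_types::", ".. pii_retirement::"]
    print(check_group_order([".. pii::", ".. pii_retirement::"], group))
```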

Reporting Output (PLAT-2350)

Reporting output from the tools should match the following format:


Reporting Output:
# Top level is a dict; keys are filenames relative to the search path
{
    '/openedx/core/djangoapps/pii_enforcer/pii_searcher.py':
    # Under each key is a list of annotations
    [
        # Stand-alone annotations are formatted as follows:
        {   
            'annotation_data': 'No PII is stored here',
            'annotation_token': '.. no_pii::',
            'line_number': 2,
            'found_by': ['python']  # These are the names of the extensions or scripts that found this annotation
        },                                                         
        {
            'annotation_data': 'We do not store PII in this model',
            'annotation_token': '.. no_pii::',
            'line_number': 17,
            'found_by': ['python']
        }
    ],
    '/openedx/core/djangoapps/user_api/legacy_urls.py': 
    [
        # Annotation groups are represented differently
        {
            'annotation_group': 'pii', # This is the name given to the group in configuration
            'annotations': 
            [
                {
                    'annotation_data': 'This model stores user addresses and phone numbers',
                    'annotation_token': '.. pii::',
                    'line_number': 16,
                    'found_by': ['python']
                },
                {
                    # In cases where the annotation type is an enum, "annotation_data" becomes a list
                    'annotation_data': ['address', 'phone_number'],
                    'annotation_token': '.. pii_types::',
                    'line_number': 17,
                    'found_by': ['python']
                },
                {
                    'annotation_data': ['local_api', 'consumer_api'],
                    'annotation_token': '.. pii_retirement::',
                    'line_number': 18,
                    'found_by': ['python']
                }
            ]
        }
    ]
}
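
To show how a consumer might walk this structure, the sketch below collects the declared PII types per file from a report in the format above. The function name `pii_types_by_file` is an assumption for this example; it only looks inside annotation groups, since stand-alone annotations carry no `annotations` list.

```python
def pii_types_by_file(report):
    """Map each filename to the set of PII types declared in its groups."""
    index = {}
    for filename, entries in report.items():
        for entry in entries:
            # Stand-alone annotations have no "annotations" key and are skipped
            for statement in entry.get("annotations", []):
                if statement["annotation_token"] == ".. pii_types::":
                    index.setdefault(filename, set()).update(
                        statement["annotation_data"]
                    )
    return index


if __name__ == "__main__":
    report = {
        "/openedx/core/djangoapps/user_api/legacy_urls.py": [{
            "annotation_group": "pii",
            "annotations": [{
                "annotation_data": ["address", "phone_number"],
                "annotation_token": ".. pii_types::",
                "line_number": 17,
                "found_by": ["python"],
            }],
        }],
    }
    print(pii_types_by_file(report))
```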