Django 1.11 - changing zh-cn to zh-hans

As part of the Django 1.11 upgrade we need to address the fact that Django dropped support for the zh-cn language code in 1.9 in favor of zh-hans. These are the results of my efforts to figure out where and how we store language codes so that we can plan the update. I was more or less starting from scratch with edx-platform localization, so a lot of this is beginner material, but it may be helpful to other people trying to understand the whole system end to end.


Translation packages in play

EdX Platform Translation Python Packages has a list of packages involved in Platform translations as of the time of this writing. Having at least a cursory understanding of them will help with reading this doc.

End-to-end Translation Process

  1. English strings are created in a bunch of different places
    1. Django templates
    2. Mako templates
    3. Python code
    4. Javascript
    5. underscore.js templates
    6. Other places?
  2. English strings are gathered
    1. paver i18n_extract
      1. shells out "i18n_tool extract"
    2. reads config info from conf/locale/config.yaml
    3. Execute platform child extractions
      1. mako
      2. underscore
      3. django-admin makemessages django
      4. django-admin makemessages djangojs
    4. Execute 3rd party extractions
      1. wiki
    5. All of this generates different .po files for different "segments" to be translated at Transifex by priority
      1. django-partial.po
      2. django-studio.po
      3. mako.po
      4. mako-studio.po
      5. wiki.po
      6. djangojs-partial.po
      7. djangojs-studio.po
      8. underscore.po
      9. underscore-studio.po
  3. Upload English strings to Transifex
    1. paver i18n_transifex_push
      1. shells out "i18n_tool transifex push"
        1. shells out "tx push -s"
  4. Strings are translated at Transifex
  5. Update strings
    1. Ned manually does the following two steps every two weeks
    2. "paver i18n_robot_pull" does a bunch of things
      1. git clean local files
      2. download translated .po files from Transifex
        1. paver i18n_transifex_pull
          1. By default pulls all translated segments of all languages listed in conf/locale/config.yaml with at least 10% reviewed translations
          2. shells out "tx pull -f --mode=reviewed -l {lang}" for each language
          3. Can be used to only pull some
          4. config.yaml stores language codes in standard format (zh_CN)
      3. extract new po files
        1. paver i18n_extract
      4. generate dummy strings
        1. Dummy strings are 2 translations we have for testing. One left-to-right, one right-to-left. More info can be found here.
        2. paver i18n_dummy
        3. runs the static javascript generation command "compilejsi18n"
        4. This generates the compiled dummy languages for testing, with the side effect of compiling all other language files too
      5. compile translated strings
        1. paver i18n_generate_strict
        2. Shells out "i18n_tool generate"
        3. Merge segmented files back together and perform some cleanup
          1. config.yaml tells i18n_tool how to merge the segments back into two .po files per language with a result like:
            1. conf/locale/zh_CN/LC_MESSAGES/django.po
            2. conf/locale/zh_CN/LC_MESSAGES/djangojs.po
        4. django-admin.py compilemessages
          1. Takes the human-readable .po files and compiles them to the tighter .mo files that Django reads in for gettext to use
    3. "paver i18n_robot_push"
      1. This is the command that actually performs steps 2 and 3 above 
  6. Django reads .mo files for non-js files
    1. https://docs.djangoproject.com/en/1.11/topics/i18n/translation/#how-django-discovers-translations
      1. tl;dr: settings.LOCALE_PATHS prioritized in the order they're listed
    2. Most importantly it still uses the standard locale names to find files here
  7. We do not use Django's javascript i18n functionality
    1. We serve up static pre-localized javascript files generated in "generate dummy strings" above
    2. See: https://github.com/zyegfryed/django-statici18n
  8. Use translated strings
    1. In a web page
      1. Language selection for a page seems to be handled by the LanguagePreferenceMiddleware (openedx/core/djangoapps/lang_pref/middleware.py)
      2. This uses a combination of cookies, request headers, and user preferences to manipulate Django's accept header parsing into displaying the correct thing
      3. Django uses the re-touched accept header to find the correct localization to load up for gettext to translate
        1. It will spin through the list of accepted languages and try to find one that we support (from settings.LANGUAGES)
        2. If it cannot find an exact match (zh-tw) it will fall back on "sister" languages (zh-hans) or the macrolanguage code (zh)
          1. See https://github.com/django/django/blob/1.11.3/django/utils/translation/trans_real.py#L449
          2. This means we may actually be ok, provided the checking is not strict?
      4. The path for javascript translations is included in the top level page templates for scripts to use
        1. Studio - base.tpl
        2. LMS - main.html
    2. As part of a celery task (some celery tasks use translations, if only for error messaging)
      1. bulk_email
        1. Uses the course_language, if set. At present there do not seem to be any courses using zh_CN
      2. import_olx / export_olx 
        1. Store the language from the request that started the task and use that
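
The fallback behavior described under "Use translated strings" can be sketched in plain Python. This is a simplified illustration, not Django's actual code: pick_supported_variant and FALLBACKS are hypothetical names, FALLBACKS lists only the two Chinese entries assumed to be in Django 1.11's LANG_INFO table, and the real lookup also checks that a translation catalog exists for each candidate.

```python
# Simplified sketch of the lookup order; not Django's implementation.
# FALLBACKS is an assumed subset of Django's LANG_INFO "fallback" entries.
FALLBACKS = {
    "zh-cn": ["zh-hans"],
    "zh-tw": ["zh-hant"],
}


def pick_supported_variant(lang_code, supported, strict=False):
    """Return the first supported code for lang_code, or raise LookupError."""
    generic = lang_code.split("-")[0]
    # Exact match first, then explicit fallbacks, then the macrolanguage code.
    candidates = [lang_code] + FALLBACKS.get(lang_code, []) + [generic]
    for code in candidates:
        if code in supported:
            return code
    if not strict:
        # Non-strict mode: accept any "sister" variant of the same language.
        for code in supported:
            if code.startswith(generic + "-"):
                return code
    raise LookupError(lang_code)
```

Under these assumptions a request for zh-cn resolves to zh-hans when only zh-hans is supported, and a request for zh-hans resolves to zh-cn only when the check is not strict, which is why the strictness question above matters.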

Potential Changes Needed For zh-cn Update

Update settings

  • Dark / Dark Database (do we need to do both? what do they each do?)
  • Settings files (update settings.LANGUAGES)
  • Transifex (conf/locale)
    • Need to add a lang_map entry to the Transifex client config to make sure zh_CN gets correctly downloaded as zh_HANS
      • In the [main] section of .tx/config we need this line:
        • lang_map: zh_CN : zh_HANS
        • This is a 1:1 mapping: we can't download the Chinese translations to both zh_CN and zh_HANS. It's unclear whether the download process will delete the old zh_CN translation files or whether we can just leave them in place during the transition. If they are deleted, we may need to add some shim code to copy the files after they are pulled.
  • What else?
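
If shim code does turn out to be necessary, it could be as small as the sketch below. This is only an illustration under assumptions: mirror_locale is a hypothetical helper, and the layout (a locale root containing {locale}/LC_MESSAGES directories) is taken from the conf/locale paths mentioned earlier.

```python
import shutil
from pathlib import Path


def mirror_locale(locale_root, source="zh_HANS", target="zh_CN"):
    """Copy every .po/.mo file from the source locale into the target locale.

    Hypothetical helper: returns the sorted list of file names it copied.
    """
    src_dir = Path(locale_root) / source / "LC_MESSAGES"
    dst_dir = Path(locale_root) / target / "LC_MESSAGES"
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src_dir.iterdir()):
        # Only translation catalogs are mirrored; anything else is left alone.
        if path.suffix in (".po", ".mo"):
            shutil.copy2(path, dst_dir / path.name)
            copied.append(path.name)
    return copied
```

Running something like mirror_locale("conf/locale") after each pull would keep both directories populated during the transition, at the cost of carrying duplicate files until zh_CN is removed.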

Update code

  • Some tests have hard coded zh_CN as the lang
  • Lots of hits in common_static / staticfiles. I assume those are generated by django-statici18n and would be fixed by just rebuilding after settings.LANGUAGES is updated?
  • pdfjs is localized with zh-CN and zh-TW
  • tinymce is localizable, are we doing it? Files aren't in the locale dir.
  • moment uses zh-cn, how to work around that?
  • video transcripts?
  • help-tokens looks like it's configured to only use English, so that should be ok

Update mysql

  • darklangconfig - definitely has target rows
    • Slightly complicated since it's a comma separated list (~10 rows)
    • Probably easier just to hand edit in the admin?
  • userprofile: no update needed, as it seems to only store unsanitized data hand-entered by the user
  • courseoverview: does not currently have target rows
  • teams_courseteam: has a language field that might need to be updated (~1800 rows), but it does not currently contain zh_CN; it uses zh_HANS
  • languageproficiency: does not currently have target rows; it uses zh_HANS
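
If hand editing the darklangconfig rows proves too error-prone, the comma-separated value could be rewritten with something like the sketch below. rewrite_released_languages is a hypothetical helper, and the exact code spelling stored in those rows (for example zh-cn versus zh-CN) is an assumption, so the comparison ignores case.

```python
def rewrite_released_languages(value, old="zh-cn", new="zh-hans"):
    """Replace one language code in a comma-separated list, dropping duplicates.

    Hypothetical helper; the order of the remaining codes is preserved.
    """
    result = []
    for raw in value.split(","):
        code = raw.strip()
        if not code:
            continue  # tolerate stray commas and blanks
        if code.lower() == old:
            code = new
        if code not in result:
            result.append(code)
    return ",".join(result)


print(rewrite_released_languages("en, zh-CN, fr, zh-hans"))  # en,zh-hans,fr
```

The duplicate check matters because a row may already list the new code alongside the old one.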

Update mongo

  • course language, looks like there are several in zh_CN
  • anything else?

Celery tasks

  • bulk_email uses the course language. Tasks could be sitting in the queue with zh_CN during the deploy and shortly after, so we will probably need to support both language codes briefly
  • import / export olx tasks use the language from the request that starts them

Caches

  • cache_if_anonymous (at least) uses the language in the cache key, so this change would effectively invalidate everything cached for zh_CN. That is probably not a huge performance hit, and Dave Ormsbee reports it is a non-issue.

Misc

  • Branding's cached footer uses get_supported_language_variant and so should be resilient
  • video summaries / transcripts?

Cookies

  • In theory the language cookie should get reset by LanguagePreferenceMiddleware on the first request after the code change

Rough outline of upgrade work

  1. Figure out if we can just leave the translations as zh_CN and let Django fall back to that, leaving everything else the same
    1. This will probably fail quickly and spectacularly
  2. Figure out if we can just rename the translations
    1. This will also probably fail
  3. Learn more about how 3rd party packages are using loc and how to integrate them
    1. pdfjs
    2. tinymce
    3. moment
      1. Andy has discovered that Wikipedia had to map zh_hans to zh_cn for moment; we'll likely have to find the right place to do the same
    4. what else?
  4. Figure out how we're handling video descriptions / transcripts and what work needs to be done there
  5. Figure out what we need to do to get Transifex changed over
  6. Create a branch to work on the changes
    1. Copy (don't move) translation .po / .mo files from zh_CN to zh_HANS 
    2. Copy (don't move) 3rd party translations to zh_HANS? (pdfjs, ...?) and update any config we need to
      1. Potentially upgrade the packages if they've moved over
    3. Make sure static files will get rebuilt appropriately, and with the correct new loc
    4. Update tests to use zh_HANS and start fixing breakages
    5. Write new tests to make sure zh_CN and zh_HANS can co-exist
    6. Update the Transifex config for the new mapping (see Update Settings above)
      1. Note that this is a 1:1 mapping, and pulling future strings from Transifex won't update zh_CN, and may delete it
    7. Do whatever we need to do for video summaries / transcripts *hand wave*
    8. Write and test migration scripts to update mysql and mongo as necessary
    9. Merge
    10. Fix whatever breaks
    11. Remove zh_CN and shims / tests around it
    12. Merge
    13. Fix whatever breaks
    14. Profit
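
For the moment item in step 3, the mapping itself is tiny; the open question is where it should live. A minimal sketch, assuming the lookup happens server side before the locale name is handed to the page (moment_locale and MOMENT_LOCALE_OVERRIDES are hypothetical names):

```python
# Hypothetical override table: Django language code -> moment locale name.
# Only the zh-hans entry comes from the notes above; others would be added here.
MOMENT_LOCALE_OVERRIDES = {
    "zh-hans": "zh-cn",
}


def moment_locale(django_code):
    """Return the locale name to pass to moment for a Django language code."""
    code = django_code.lower().replace("_", "-")
    return MOMENT_LOCALE_OVERRIDES.get(code, code)


print(moment_locale("zh_HANS"))  # zh-cn
print(moment_locale("fr"))       # fr
```

Keeping the overrides in one table means other 3rd party packages that still expect zh-CN, such as pdfjs, could reuse the same pattern with their own table.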

Video Transcripts

Video transcripts are tricky. We have the current video_module xblock and a new version in design right now. In the current code there are HTML5 and YouTube videos. video_module offers a choice of languages built from the transcript languages that exist for the video and are also included in the LANGUAGES setting. There are also alternative sources for videos, such as CDNs in China, chosen based on user locale. It looks like these actually use the ALL_LANGUAGES setting and have nothing to do with Django translations, so they're probably ok. ALL_LANGUAGES already has entries for zh_HANS / zh_HANT.
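
To make the transcript risk concrete, here is a rough sketch of the filtering described above. It is not the video_module code: transcript_language_choices is a hypothetical function, and it assumes transcripts are keyed by the same style of language code that appears in LANGUAGES.

```python
def transcript_language_choices(transcript_langs, languages_setting):
    """Keep only transcript languages that also appear in the LANGUAGES setting."""
    supported = {code for code, _name in languages_setting}
    return [lang for lang in transcript_langs if lang in supported]


# Assumed example values, not taken from edx-platform settings.
LANGUAGES = [("en", "English"), ("zh-hans", "Simplified Chinese")]
print(transcript_language_choices(["en", "zh-cn"], LANGUAGES))  # ['en']
```

If that assumption holds, a transcript stored under zh-cn would silently drop out of the choices once LANGUAGES lists only zh-hans, so existing transcript keys would need checking as part of the migration.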


Ned Batchelder
July 14, 2017

I run "paver i18n_robot_pull" and "_push" manually every two weeks.