Django 1.11 - changing zh-cn to zh-hans
As part of the Django 1.11 upgrade we need to address the fact that Django dropped support for the zh-cn language code in 1.9 in favor of zh-hans. These are the results of my efforts to figure out where and how we store language codes, so that we can plan the update. I was more or less starting from scratch with edx-platform localization, so a lot of this is beginner material, but it may be helpful to other people trying to understand the whole system end to end.
Translation packages in play
EdX Platform Translation Python Packages has a list of packages involved in Platform translations as of the time of this writing. Having at least a cursory understanding of them will help with reading this doc.
End-to-end Translation Process
English strings are created in a bunch of different places
Django templates
Mako templates
Python code
Javascript
underscore.js templates
Other places?
English strings are gathered
paver i18n_extract
shells out "i18n_tool extract"
reads config info from conf/locale/config.yaml
Execute platform child extractions
mako
underscore
django-admin makemessages django
django-admin makemessages djangojs
Execute 3rd party extractions
wiki
All of this generates different .po files for different "segments" to be translated at Transifex by priority
django-partial.po
django-studio.po
mako.po
mako-studio.po
wiki.po
djangojs-partial.po
djangojs-studio.po
underscore.po
underscore-studio.po
Upload English strings to Transifex
paver i18n_transifex_push
shells out "i18n_tool transifex push"
shells out "tx push -s"
Strings are translated at Transifex
Update strings
Ned manually does the following two steps every two weeks
"paver i18n_robot_pull" does a bunch of things
git clean local files
download translated .po files from Transifex
paver i18n_transifex_pull
By default pulls all translated segments of all languages listed in conf/locale/config.yaml with at least 10% reviewed translations
shells out "tx pull -f --mode=reviewed -l {lang}" for each language
Can be used to only pull some
config.yaml stores language codes in standard format (zh_CN)
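The per-language pull can be sketched roughly as below. This is only an illustration, not the actual i18n_tool code: build_pull_commands and the sample language list are hypothetical names, and the "--mode=reviewed" spelling assumes the usual Transifex client flag.

```python
def build_pull_commands(languages, mode="reviewed"):
    """Return one `tx pull` argument list per language code.

    `languages` stands in for the locales listed in conf/locale/config.yaml;
    reading that file is left out of this sketch.
    """
    commands = []
    for lang in languages:
        # -f forces the download; -l restricts the pull to a single language.
        commands.append(["tx", "pull", "-f", "--mode={}".format(mode), "-l", lang])
    return commands


if __name__ == "__main__":
    for argv in build_pull_commands(["zh_CN", "fr"]):
        print(" ".join(argv))
```

Each argument list could be handed to subprocess.check_call, which is presumably close to what "shells out" means here.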
extract new po files
paver i18n_extract
generate dummy strings
Dummy strings are 2 translations we have for testing. One left-to-right, one right-to-left. More info can be found here.
paver i18n_dummy
runs the static javascript generation command "compilejsi18n"
This generates the compiled dummy languages for testing, with the side effect of compiling all other language files too
compile translated strings
paver i18n_generate_strict
Shells out "i18n_tool generate"
Merge segmented files back together and perform some cleanup
config.yaml tells Transifex how to merge the segments back into two .po files per language with a result like:
conf/locale/zh_CN/LC_MESSAGES/django.po
conf/locale/zh_CN/LC_MESSAGES/djangojs.po
django-admin.py compilemessages
Takes the human-readable .po files and compiles them to the tighter .mo files that Django reads in for gettext to use
"paver i18n_robot_push"
This is the command that actually performs steps 2 and 3 above
Django reads .mo files for non-js files
https://docs.djangoproject.com/en/1.11/topics/i18n/translation/#how-django-discovers-translations
tl;dr: settings.LOCALE_PATHS prioritized in the order they're listed
Most importantly it still uses the standard locale names to find files here
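To make the "standard locale names" point concrete, here is a small sketch of how a language code turns into a locale directory name. It is modeled on my understanding of Django 1.11's django.utils.translation.to_locale and is not a copy of it; to_locale_sketch is a hypothetical name, and the casing rule for longer subtags should be verified against the real function.

```python
def to_locale_sketch(language):
    """Convert a language code such as 'zh-cn' into a locale directory name."""
    lang, sep, region = language.partition("-")
    if not sep:
        # No subtag at all, e.g. 'en'.
        return lang.lower()
    if len(region) > 2:
        # Assumption: script-style subtags ('hans', 'latn') are title-cased.
        return lang.lower() + "_" + region[0].upper() + region[1:].lower()
    # Two-letter region subtags are upper-cased.
    return lang.lower() + "_" + region.upper()


if __name__ == "__main__":
    for code in ("en", "zh-cn", "zh-hans"):
        print(code, "->", to_locale_sketch(code))
```

If the real function behaves the same way, the directory Django looks for would be spelled zh_Hans rather than zh_HANS, which is worth checking on a case-sensitive filesystem.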
We do not use Django's javascript i18n functionality
We serve up static pre-localized javascript files generated in "generate dummy strings" above
Use translated strings
In a web page
Language selection for a page seems to be handled by the LanguagePreferenceMiddleware (openedx/core/djangoapps/lang_pref/middleware.py)
This uses a combination of cookies, request headers, and user preferences to manipulate Django's accept header parsing into displaying the correct thing
Django uses the re-touched accept header to find the correct localization to load up for gettext to translate
It will spin through the list of accepted languages and try to find one that we support (from settings.LANGUAGES)
If it cannot find an exact match (zh-tw) it will fall back on "sister" languages (zh-hans) or the macrolanguage code (zh)
See https://github.com/django/django/blob/1.11.3/django/utils/translation/trans_real.py#L449
This means we may actually be ok, provided the checking is not strict?
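The fallback order described above can be sketched as follows. This is a simplified paraphrase of the linked trans_real.py logic rather than Django's code: pick_supported_variant and the FALLBACKS table are hypothetical, the table is only a two-entry assumption about Django's per-language fallback data, and the check that translation files actually exist is omitted.

```python
# Hypothetical subset of the per-language fallback table.
FALLBACKS = {
    "zh-cn": ["zh-hans"],
    "zh-tw": ["zh-hant"],
}


def pick_supported_variant(lang_code, supported, strict=False):
    """Return the first supported code that can stand in for lang_code.

    `supported` is an ordered list of lower-case codes, standing in for the
    keys of settings.LANGUAGES.
    """
    generic = lang_code.split("-")[0]
    # 1. exact match, 2. explicit fallbacks, 3. the macrolanguage code.
    candidates = [lang_code] + FALLBACKS.get(lang_code, []) + [generic]
    for code in candidates:
        if code in supported:
            return code
    if not strict:
        # 4. any "sister" variant sharing the same macrolanguage prefix.
        for code in supported:
            if code.startswith(generic + "-"):
                return code
    raise LookupError(lang_code)


if __name__ == "__main__":
    supported = ["en", "zh-hans"]
    print(pick_supported_variant("zh-cn", supported))
    print(pick_supported_variant("zh-tw", supported))
```

Under these assumptions a zh-cn request resolves to zh-hans even with strict checking, while zh-tw only resolves to zh-hans when the check is not strict.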
The path for javascript translations is included in the top level page templates for scripts to use
As part of a celery task (some celery tasks use translations, if only for error messaging)
bulk_email
Uses the course_language, if set. At present there do not seem to be any courses using zh_CN
import_olx / export_olx
Store the language from the request that started the task and use that
Potential Changes Needed For zh-cn Update
Update settings
Dark / Dark Database (do we need to do both? what do they each do?)
Settings files (update settings.LANGUAGES)
Transifex (conf/locale)
Need to update config.yaml to have a lang_map entry to make sure zh_CN gets correctly downloaded as zh_HANS
In the [main] section of .tx/config we need this line:
lang_map: zh_CN : zh_HANS
This is a 1:1 mapping, so we can't download the Chinese translations to both zh_CN and zh_HANS. It's unclear whether the download process will delete the old zh_CN translation files or whether we can just leave them in place during the transition. If they are deleted, we may need to add some shim code to copy the files after they are pulled.
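If we do end up needing the copy shim, it could look roughly like this. It is a sketch under assumptions: copy_locale_dir is a hypothetical helper, the default direction (freshly pulled zh_HANS copied back to zh_CN) is my reading of the transition plan, and locale_root stands in for conf/locale.

```python
import os
import shutil


def copy_locale_dir(locale_root, src="zh_HANS", dst="zh_CN"):
    """Copy one locale directory to another name, replacing any stale copy."""
    src_path = os.path.join(locale_root, src)
    dst_path = os.path.join(locale_root, dst)
    if not os.path.isdir(src_path):
        raise FileNotFoundError(src_path)
    if os.path.isdir(dst_path):
        # Remove the old copy so deleted files do not linger.
        shutil.rmtree(dst_path)
    shutil.copytree(src_path, dst_path)
    return dst_path
```

It would have to run after the pull and before compilemessages so that both directories end up with matching .mo files.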
What else?
Update code
Some tests have hard coded zh_CN as the lang
Lots of common_static / staticfiles, I assume those are generated by django-static-i18n and would be fixed by just rebuilding after settings.LANGUAGES is updated?
pdfjs is localized with zh-CN and zh-TW
tinymce is localizable, are we doing it? Files aren't in the locale dir.
moment uses zh-cn, how to work around that?
video transcripts?
help-tokens looks like it's configured to only use English, so that should be ok
Update mysql
darklangconfig - definitely has target rows
Slightly complicated since it's a comma separated list (~10 rows)
Probably easier just to hand edit in the admin?
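If hand editing turns out to be too error-prone, the per-row change is small enough to script. The sketch below only handles the string manipulation; replace_language_code is a hypothetical helper, and the exact spelling of the codes stored in those rows (zh-cn vs. zh_CN) is an assumption to check against the real data.

```python
def replace_language_code(language_list, old="zh-cn", new="zh-hans"):
    """Swap one code for another in a comma separated list of language codes.

    Order is preserved and duplicates are dropped, so a row that already
    contains the new code does not end up listing it twice.
    """
    codes = [code.strip() for code in language_list.split(",") if code.strip()]
    updated = []
    for code in codes:
        replacement = new if code.lower() == old.lower() else code
        if replacement not in updated:
            updated.append(replacement)
    return ", ".join(updated)


if __name__ == "__main__":
    print(replace_language_code("en, zh-cn, fr"))
```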
userprofile: no update needed, as it seems to only store unsanitized data hand-entered by the user
courseoverview: does not currently have target rows
teams_courseteam: has a language field that might need to be updated (~1800 rows), but it does not currently contain zh_CN; it uses zh_HANS
languageproficiency: does not currently have target rows; it uses zh_HANS
Update mongo
course language, looks like there are several in zh_CN
anything else?
Celery tasks
bulk_email uses the course language. Tasks could be sitting in the queue with zh_CN during the deploy and shortly after, so we will probably need to support both language codes briefly
import / export olx tasks use the language from the request that starts them
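One way to support both codes for a while is to normalize the language at the point where a task activates it. This is a minimal sketch, not existing platform code: normalize_language_code and LEGACY_LANGUAGE_CODES are hypothetical names, and where exactly it would be called in bulk_email or the olx tasks is still to be worked out.

```python
# Hypothetical map of retired codes to their replacements.
LEGACY_LANGUAGE_CODES = {
    "zh-cn": "zh-hans",
}


def normalize_language_code(code):
    """Map a retired language code to its replacement, leaving others alone."""
    if not code:
        return code
    # Accept both 'zh_CN' and 'zh-cn' spellings from queued tasks.
    key = code.replace("_", "-").lower()
    return LEGACY_LANGUAGE_CODES.get(key, code)


if __name__ == "__main__":
    for queued in ("zh_CN", "zh-cn", "fr", None):
        print(queued, "->", normalize_language_code(queued))
```

Once the queues have drained after the deploy, the map and the call sites can be removed along with the other zh_CN shims.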
Caches
cache_if_anonymous (at least) uses the language in the cache key, so this change would effectively invalidate everything cached for zh_CN. That is probably not a huge performance hit, but I checked with Dave Ormsbee, who reports this is a non-issue.
Misc
Branding's cached footer uses get_supported_language_variant and so should be resilient
video summaries / transcripts?
Cookies
In theory the language cookie should get reset by LanguagePreferenceMiddleware on the first request after the code change
Rough outline of upgrade work
Figure out if we can just leave the translations as zh_CN and let Django fall back to that, leaving everything else the same
This will probably fail quickly and spectacularly
Figure out if we can just rename the translations
This will also probably fail
Learn more about how 3rd party packages are using loc and how to integrate them
pdfjs
tinymce
moment
Andy has discovered that Wikipedia had to map zh_hans to zh_cn for moment; we'll likely have to find the right place to do the same
what else?
Figure out how we're handling video descriptions / transcripts and what work needs to be done there
Figure out what we need to do to get Transifex changed over
Create a branch to work on the changes
Copy (don't move) translation .po / .mo files from zh_CN to zh_HANS
Copy (don't move) 3rd party translations to zh_HANS? (pdfjs, ...?) and update any config we need to
Potentially upgrade the packages if they've moved over
Make sure static files will get rebuilt appropriately, and with the correct new loc
Update tests to use zh_HANS and start fixing breakages
Write new tests to make sure zh_CN and zh_HANS can co-exist
Update the Transifex config for the new mapping (see Update Settings above)
Note that this is a 1:1 mapping, and pulling future strings from Transifex won't update zh_CN, and may delete it
Do whatever we need to do for video summaries / transcripts *hand wave*
Write and test migration scripts to update mysql and mongo as necessary
Merge
Fix whatever breaks
Remove zh_CN and shims / tests around it
Merge
Fix whatever breaks
Profit
Video Transcripts
Video transcripts are tricky. We have the current video_module xblock and a new version in design right now. In the current code there are HTML5 and YouTube videos. video_module offers a choice of languages built from the transcript languages that exist for the video and are also included in the LANGUAGES setting. There are also alternative sources for videos, such as CDNs in China, chosen based on user locale. It looks like these actually use the ALL_LANGUAGES setting and have nothing to do with Django translations, so they're probably ok. ALL_LANGUAGES already has entries for zh_HANS / zh_HANT.
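The language choice described above amounts to an intersection, sketched below. This is not the video_module code: selectable_transcript_languages is a hypothetical helper, and the sample values only imitate the (code, name) pairs of a LANGUAGES-style setting.

```python
def selectable_transcript_languages(transcript_langs, languages_setting):
    """Keep only transcript languages that also appear in the setting.

    `languages_setting` is a sequence of (code, display name) pairs, in the
    same shape as Django's LANGUAGES setting.
    """
    supported = dict(languages_setting)
    return [(code, supported[code]) for code in transcript_langs if code in supported]


if __name__ == "__main__":
    languages = [("en", "English"), ("zh-hans", "Simplified Chinese")]
    print(selectable_transcript_languages(["zh-hans", "de", "en"], languages))
```

The practical consequence is that a transcript stored under zh_CN would drop out of the chooser as soon as LANGUAGES only lists the new code, unless the stored transcript language is migrated or mapped as well.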