Current status
Completed mitigations around slow message index rebuild. The mitigations allow system to work as expected even while waiting message index to be rebuild. MessageIndex rebuild was optimised to take less than 20 seconds on MetaWiki.
Original report
Sometimes translators are unable to save translations due to error "This namespace is reserved for content page translations". This error should not happen in regular use, only in edge cases such as someone keeping a page open so long that the translatable page has changed in meanwhile, or manually editing URLs.
General causes
The underlying cause is a mismatch of the state in MessageIndex versus MessageCollection. MessageIndex is a reverse map of translation unit names to message groups. MessageCollection asks the message group directly what are the names of translation units in that group.
Every time there is a change to translation units, the code is supposed to run a MessageIndexRebuildJob. If this job is not run, or is delayed, or fails, then we get a mismatch of state.
There is additional check on the Translations namespace that prevents editing of translation units that do not correspond to any translatable pages. This is just to keep things tidy to avoid mistakes when people manipulate URLs by hand. Nothing bad would happen if this check was not there.
This issue happens on newly created translation units, or pages newly added for translations.
Current causes
We believe that currently this issue manifests when MessageIndexRebuildJob fails due to Exception: MessageIndex: unable to acquire lock. This means that some other MessageIndexRebuildJob is keeping the table locked for a long time. https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&from=now%2FM&to=now%2FM&var-site=All&var-type=MessageIndexRebuildJob provides some insight how long the job takes when run in JobQueue (but it is often run as part of other jobs as well).
This is happening when a lot of pages are marked for translations in a short time period. Given each update takes about 20 seconds, even three pages can cause the lock to time out after 30 seconds (first job 0-20s, second job 20-40s, third job fails at 30s).
Another case were this can happen is when MessageIndex is out of date, which can easily spawn a lot of jobs. Further debugging is needed to see if there is anything to fix there.
Options to explore
- Increase the lock timeout
- Only makes the problem rarer, could run in job execution time limit
- Make it faster
- Would need a profile from production
- Re-architect to allow concurrency or incremental updates or combining multiple jobs
- Likely a lot of work
Currently being done:
- Some low hanging fruit to make the rebuild faster
- Enabled de-duplication of jobs that are run via the job queue and using the job queue in more places
- Investigating message index out of date condition frequency and causes
Workaround
Any other activity that makes the MessageIndexRebuildJob run again will "heal" the MessageIndex by making it up to date again. This is most easily accomplished by doing a dummy edit on any translatable page and marking it for translation. Sometimes just waiting a bit can help if the job is just slow or delayed.
We added a front-cache that temporarily caches newly added keys until the job finishes running. This mostly alleviates the user facing issues about the error, however message group stats and translation pages get out of sync every time the job fails in middle due to the lock timeout exception.
QA plan
Affected projects: Multilingual Wikimedia projects
Pre-deployment testing: skip
Post-deployment testing:
- monitor Logstash for this error for 7 days after deployment to confirm the error MessageIndex: unable to acquire lock is fully gone or extremely infrequent
- Check for performance improvements per T221119#6162764
- Check Translate channel in Logstash for messages about MessageIndex
Outcome
Translators are able to translate pages recently marked for translation and Translate has less production issues related to the message index.