Parsoid
Parsoid is a library that converts between wikitext and HTML. The HTML contains additional metadata that allows it to be converted back ("round-tripped") to wikitext.
Uses
- VisualEditor fetches the HTML for a given page from Parsoid, edits it, then delivers the modified HTML to Parsoid, which converts it back to wikitext.
- Flow (as configured on WMF wikis with $wgFlowContentFormat = 'html') works the other way around. When a user creates a post, Flow uses Parsoid to convert the wikitext to HTML, and Flow stores the HTML in ExternalStore. If someone later edits a post, Flow uses Parsoid to convert the HTML back to wikitext for editing.
- Many more!
Monitoring
- Monitoring is done through LVS ProbeDown, mw-on-k8s alertmanager, and httpbb hourly runs.
- Logs in logstash (Parsoid/PHP): https://logstash.wikimedia.org/app/dashboards#/view/AW4Y6bumP44edBvO7lRc
- Logging in Parsoid starts at "warn" level (see https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L5459)
- Useful links for Parsoid deployers
- Currently running Parsoid version:
- In beta: https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version#mw-version-library-wikimedia/parsoid
- In production: https://en.wikipedia.org/wiki/Special:Version#mw-version-library-wikimedia/parsoid
- On parsoidtest1001: See https://www.mediawiki.org/wiki/Parsoid/Round-trip_testing
- (Be aware that the parsoid cluster is behind restbase and although the cluster *should* be running the same version of Parsoid as the mediawiki frontends, if puppet or scap are broken (esp in beta) things could diverge.)
Machine overview
These are the machines involved in a Parsoid deploy:
- In the beta/wmflabs cluster:
- deployment-deploy01.deployment-prep.eqiad.wmflabs: staging host in beta; no longer used.
- deployment-parsoid12.deployment-prep.eqiad.wmflabs: parsoid server in beta
- deployment-restbase02.deployment-prep.eqiad.wmflabs: restbase server in beta
- In the production cluster:
- deployment.eqiad.wmnet: Server used to run deployment
- mw-parsoid: kubernetes namespace where parsoid code is deployed
- restbase1xxx: restbase servers in eqiad cluster
- restbase2xxx: restbase servers in codfw cluster
- parsoidtest1001.eqiad.wmnet: Parsoid testing host, has read-only access to the production database.
Deploying changes
Parsoid is deployed as part of the MediaWiki train. See How to deploy code for an overview, Heterogeneous deployment for a more technical description of the directory structures involved, and Heterogeneous deployment/Train deploys for the steps to do a train deploy. When code changes are needed outside the train schedule, a backport window is required. Generally Parsing team members won't be doing train deploys or backport deploys directly; we will tag a Parsoid version (which releases it to packagist to make it available via composer) and merge a version bump into the mediawiki/vendor repository. Once the patch is merged into vendor, the new version of Parsoid goes live in beta (almost) immediately; it will then be rolled out to production on the next train.
Deploying Parsoid
Communicate changes
If there are changes to generated HTML or accepted wikitext, please see the communication plan for the Parsing team to ensure they have been appropriately announced.
Test the version you hope to deploy
- See mw:Parsoid/Round-trip testing for details.
- Check http://parsoid-rt-tests.wikimedia.org/regressions/between/{from}/{to} where {from} is the last deployed hash from mw:Parsoid/Deployments and {to} is the latest tested commit (which we're about to deploy)
- http://parsoid-rt-tests.wikimedia.org/commits gives you a nice radio-button interface to create this URL
- BEWARE: if you get the output "total regressions between selected revisions: 0", it is extremely likely that you mistyped the hash or that we didn't actually run round-trip tests for that particular hash. (This is a bug; we should probably give a better message in this case.)
- Since we are using the current revision of titles in round-trip testing, edits to pages can show up as false regressions. tools/regression-testing.php in the Parsoid repo is useful in filtering those out. Running it with the right parameters (use --help for usage) will get a list of pages to look at more closely, if necessary.
- Check that there are no concerning notices or errors in logstash from the rt run
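The regressions URL described above can be assembled with a small helper. This is a minimal sketch: the function name rt_regressions_url is hypothetical and is not part of the Parsoid tooling; only the URL pattern comes from this page.

```shell
# Hypothetical helper: build the round-trip regressions URL from two hashes.
#   $1 = last deployed hash (from mw:Parsoid/Deployments)
#   $2 = latest tested commit (the one about to be deployed)
rt_regressions_url() {
    # Refuse empty arguments; this page warns that a bad hash silently
    # reports "total regressions between selected revisions: 0".
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: rt_regressions_url <from-hash> <to-hash>" >&2
        return 1
    fi
    printf 'http://parsoid-rt-tests.wikimedia.org/regressions/between/%s/%s\n' "$1" "$2"
}
```

Copy the hashes from the deployment log and the rt-testing commit list rather than typing them, since a typo still produces a well-formed URL.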
Prepare the vendor patch
$ tools/prepare_vendor_patch.sh v0.23.0-a{N-1} v0.23.0-a{N} <phab task id> <git hash for v0.23.0-a{N}> /path/to/mediawiki-vendor /path/to/mediawiki-core
Note that the v0.23.0-a{N} tag does not need to have been created yet. The <phab task id> refers to the Content Transform Team chore task for the release.
- If you were late and just missed the train branch, be sure to check the "If the train branch has already been cut" section below.
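The first argument to the script is the previous alpha tag, v0.23.0-a{N-1}. A minimal sketch of deriving it from the new tag follows; prev_alpha_tag is a hypothetical helper, not something shipped in the Parsoid repo, and it assumes tags of the form <version>-a<N> with no zero padding.

```shell
# Hypothetical helper: given v0.23.0-a4, print v0.23.0-a3.
prev_alpha_tag() {
    prefix="${1%-a*}"   # everything before the final "-a", e.g. v0.23.0
    n="${1##*-a}"       # everything after the final "-a", e.g. 4
    # Reject anything whose suffix is not a plain integer
    # (stable tags like v0.20.0, cherry-pick tags like v0.23.0-a3.1).
    case "$n" in
        ''|*[!0-9]*) echo "not a plain alpha tag: $1" >&2; return 1 ;;
    esac
    # -a1 has no earlier alpha; its predecessor is a stable tag.
    if [ "$n" -le 1 ]; then
        echo "no earlier alpha tag for $1" >&2
        return 1
    fi
    echo "${prefix}-a$((n - 1))"
}
```

Check the result against mw:Parsoid/Deployments before using it, since the sketch assumes no alpha number was skipped.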
Verify deployment version on beta after the vendor patch is merged
- Check that this is live on the mediawiki front ends in beta by watching the version number listed on https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version#mw-version-library-wikimedia/parsoid
- If ever you need to, you can check the version on parsoid12 as well:
$ ssh deployment-parsoid12.deployment-prep.eqiad.wmflabs
user@deployment-parsoid12$ curl -x deployment-parsoid12:80 'http://en.wikipedia.beta.wmflabs.org/wiki/Special:Version' | fgrep wikimedia/parsoid -C0
- If beta cluster is down or visual editor is down in beta cluster, do not continue with routine deployments.
- On beta cluster (eg en.wikipedia.beta.wmflabs.org), perform manual VisualEditor editing tests. This requires you to have an account on the beta cluster wiki. Test with non-ASCII content too to catch encoding issues. Be particularly alert to integration issues: library conflicts, etc.
- Watch the logs on beta: https://wikitech.wikimedia.org/wiki/Logstash#Beta_Cluster_Logstash
Logs and dashboards to monitor
- Logstash: Parsoid
- wt2html perf
- html2wt perf
- parser cache usage
- Parsoid Health is a dashboard that tracks health across different components (REST API endpoints, VE stashing, ParserCache, Job Queue).
- See also the Chore documentation
Post-deploy checks
- Test VE editing on enwiki and non-latin wikis
- For example, open it:Luna (or other complex page), start the visual editor, make some random vandalism, click save -> review changes, then verify that the wikitext reflects your changes and was not corrupted. Hit cancel to abort the edit.
- Reading through the recent edits (frwiki, enwiki) can also be a good check.
Testing a version bump
If the deployed version of Parsoid updates the Parsoid DOM version and/or will exercise the html2html "down convert" endpoint, the following test procedure will ensure that clients are getting the appropriate DOM version:
- First and foremost, mocha tests should already be present that cover both downgrading the HTML and serializing it with and without selser.
- Create a test page on the beta cluster containing the features that merited the major version bump.
- Deploy the desired commit to the beta cluster and, as a sanity check, make requests for the above test page from Parsoid directly (via deployment-parsoid12.deployment-prep.eqiad.wmflabs) accepting the various specs that are available. The inline meta tag and aforementioned features should indicate that it worked. Example requests might be:
- For the old version,
curl -x deployment-parsoid12:80 'http://en.wikipedia.beta.wmflabs.org/w/rest.php/en.wikipedia.beta.wmflabs.org/v3/page/html/Test_Page' -H'Accept: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/1.7.0"'
- For the new version,
curl -x deployment-parsoid12:80 'http://en.wikipedia.beta.wmflabs.org/w/rest.php/en.wikipedia.beta.wmflabs.org/v3/page/html/Test_Page' -H'Accept: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/2.0.0"'
- Confirm that VE on the beta cluster is still tied to the older content version and will be needing a downgrade (see the commit in Special:Version for the extension and compare with the header defined in includes/ApiVisualEditor.php)
- At this point, two scenarios need to be tested: an edit starting from the older content version stored in RESTBase (which won't require a downgrade) and one starting from the new content version, which will.
- Note that, for extra points, there are potentially several version numbers stored in RESTBase that satisfy the VE request based on caret semantics and it might be worthwhile to confirm that edits starting from those versions work as well.
- Once you've found stored content in RESTBase with an appropriate version for your test it's prudent to confirm that VE is actually editing what you expect. This can be achieved by dumping the various DOMs: the original copy(ve.init.target.doc.body.outerHTML) and the edited copy(ve.init.target.docToSave.body.outerHTML)
- In each case, try to confirm that the features can be edited directly as well as being ignored by selser (usually because no normalizations occur). Unfortunately, testing here is a bit more art than science.
- Finally, open up the various testing dashboards for logging and metrics to verify that no unexpected errors are present and that the downgrades are accounted for.
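The old-version and new-version requests in the procedure above differ only in the profile version of the Accept header. As a sketch, that header can be generated from the spec version; parsoid_accept_header is a hypothetical helper name, and the header format is copied from the example requests above.

```shell
# Hypothetical helper: print the Accept header for a given HTML spec version.
#   $1 = spec version, e.g. 1.7.0 or 2.0.0
parsoid_accept_header() {
    printf 'Accept: text/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/%s"\n' "$1"
}

# Example use on deployment-parsoid12 (not run here; requires beta access):
#   url='http://en.wikipedia.beta.wmflabs.org/w/rest.php/en.wikipedia.beta.wmflabs.org/v3/page/html/Test_Page'
#   for v in 1.7.0 2.0.0; do
#       curl -x deployment-parsoid12:80 "$url" -H "$(parsoid_accept_header "$v")"
#   done
```

Looping over both versions with the same URL makes it easier to diff the two responses and spot where the downgrade took effect.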
Testing on parsoidtest1001
When on parsoidtest1001, use this command to test Parsoid directly:
NO_PROXY="" no_proxy="" curl -x parsoidtest1001.eqiad.wmnet:80 http://<domain>/w/rest.php/<domain>/v3/page/html/<title>/<revid>
Note: yes, it's really http and not https.
The NO_PROXY="" no_proxy="" business is to ignore the environment variables that are set, which override the explicit proxy from -x.
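Because the domain appears twice in the request URL, it is easy to mistype. The sketch below builds the URL from its parts; parsoidtest_url is a hypothetical helper, and the title and revision ID in the usage comment are placeholders, not real test targets.

```shell
# Hypothetical helper: build the Parsoid REST URL used on parsoidtest1001.
#   $1 = wiki domain, $2 = URL-encoded page title, $3 = revision ID
parsoidtest_url() {
    # Plain http is intentional, as noted above.
    printf 'http://%s/w/rest.php/%s/v3/page/html/%s/%s\n' "$1" "$1" "$2" "$3"
}

# Example use on parsoidtest1001 (placeholder title and revid):
#   NO_PROXY="" no_proxy="" curl -x parsoidtest1001.eqiad.wmnet:80 \
#       "$(parsoidtest_url en.wikipedia.org Main_Page 12345)"
```

The helper does not URL-encode the title; pass it already encoded (for example Cscott%2FSrTest).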
Testing LanguageConverter
LanguageConverter can be tested on beta in a manner similar to testing a version bump.
- Create a test page on the beta cluster containing the language converter features you wish to touch. Either the page language for the article must be set to a language w/ variants, or else the article must take place on a wiki where the main language has variants. We'll use the SrTest page on beta srwiki in our examples below.
- Deploy the desired commit to the beta cluster and, as a sanity check, make requests for the above test page from Parsoid directly (via ssh to deployment-parsoid12.deployment-prep.eqiad.wmflabs) specifying the desired variant language. Verify that the result has been converted appropriately. Example requests might be:
curl -H'Accept-Language: sr-ec' -x deployment-parsoid12:80 http://sr.wikipedia.beta.wmflabs.org/w/rest.php/sr.wikipedia.beta.wmflabs.org/v3/page/html/User:Cscott%2FSrTest/23
curl -H'Accept-Language: sr-el' -x deployment-parsoid12:80 http://sr.wikipedia.beta.wmflabs.org/w/rest.php/sr.wikipedia.beta.wmflabs.org/v3/page/html/User:Cscott%2FSrTest/23
- To test in production, try something like:
curl -X GET --header 'Accept-Language: sr-el' 'https://sr.wikipedia.org/api/rest_v1/page/html/%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B0/21280369'
See https://phabricator.wikimedia.org/T241146#5810424 for some more examples.
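When several variants need checking, the beta requests above can be generated in a loop. This is a sketch only: variant_requests is a hypothetical helper that prints the curl commands for review instead of running them, so it works without access to the beta cluster.

```shell
# Hypothetical helper: print one curl command per language variant.
#   $1 = page URL, remaining arguments = variant codes (e.g. sr-ec sr-el)
variant_requests() {
    url="$1"
    shift
    for variant in "$@"; do
        printf "curl -H 'Accept-Language: %s' -x deployment-parsoid12:80 '%s'\n" "$variant" "$url"
    done
}
```

For example, `variant_requests "$url" sr-ec sr-el` prints the two requests shown for the SrTest page; paste them into a shell on deployment-parsoid12 to run them.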
Deploying a cherry-picked patch
One way to do this is to create a new branch in the Parsoid repo and cherry-pick your patches to that. For example:
git checkout v0.23.0-a3 # this is the commit on the master branch that you want to cherry pick on top of
git checkout -b deploy-20251112 # give it a name (go ahead and use the date of your deploy)
git cherry-pick f274c3f54f385a6ac159a47209d279b9040a161c # patch number 1
git cherry-pick de087b106be48fc6e97f2ebc4644f9d297ecdfed # patch number 2
git push gerrit deploy-20251112 # create the branch in gerrit (DON'T USE SLASHES HERE)
For that last step you'll have to give yourself unusual permissions in gerrit. Go to https://gerrit.wikimedia.org/r/admin/repos/mediawiki/services/parsoid,access, click "Edit" then either clone one of the existing refs/heads/deploy-YYYYMMDD configurations or just change the date in the most recent one to match your deploy branch, then click "Save" at the bottom of the page.
Now do the usual steps to tag a release and prepare a vendor branch patch (see above) using `.N` (for some suitable N) after the tag corresponding to the branch base. For example, in the example above the branch was made from v0.23.0-a3, so the new tag would be v0.23.0-a3.1. Use the phab task ID of the branch base (ie, the task used to create v0.23.0-a3 in this example). You can use the deploy branch name as the "git hash" to name the tip of the deploy branch with your cherry-picked patch(es):
# first argument is the branch point (-a3) and second argument is the
# new tag for the tip of the deploy branch (-a3.1)
tools/prepare_vendor_patch.sh v0.23.0-a3 v0.23.0-a3.1 <phab task id> deploy-20251112 /path/to/mediawiki-vendor /path/to/mediawiki-core
This creates patches to core and vendor against the master branch.
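The naming conventions for a cherry-pick deploy (a deploy-YYYYMMDD branch and a .N suffix on the base tag) can be sketched as two small helpers. Both function names are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: tag for the Nth cherry-pick release on top of a base tag.
#   $1 = base tag (e.g. v0.23.0-a3), $2 = N (e.g. 1)
cherry_pick_tag() {
    printf '%s.%s\n' "$1" "$2"
}

# Hypothetical helper: deploy branch name without slashes.
#   $1 = optional date as YYYYMMDD; defaults to today
deploy_branch_name() {
    printf 'deploy-%s\n' "${1:-$(date +%Y%m%d)}"
}
```

With the values from the example above, `cherry_pick_tag v0.23.0-a3 1` gives v0.23.0-a3.1 and `deploy_branch_name 20251112` gives deploy-20251112.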
If you are cherry picking before the next train has branched (the usual case), you should review these patches and merge them to the master branch as usual before proceeding. In the example, the master branch (beta, WMF CI, etc) had v0.23.0-a3, and you want to update that to v0.23.0-a3.1 before proceeding.
When this is merged into mediawiki-vendor it will (shortly) go live on beta; you should verify that everything looks good there. See #Verify deployment version on beta after the vendor patch is merged. Typically you'll want to backport this to production; if so skip ahead to the section, "If the train branch has already been cut" and follow the instructions there.
If you are cherry picking after the next train has branched, then master already (presumably) has the fixes and you don't want to downgrade master. Don't merge these to the master branch, but instead rebase the mediawiki-core and mediawiki-vendor patches to the appropriate wmf.X branch before uploading. (It's recommended not to upload the patches against master to gerrit, to avoid confusing jenkins. You can also recreate the patches from scratch on the wmf.X branch instead of rebasing.) Do not merge them! Skip ahead to the section, "If the train branch has already been cut" and follow the instructions there, which also include instructions on rebasing the vendor patch.
Making a stable release
Forking for the major release
For a major release from (for example) 0.19 to 0.20, corresponding to MW 1.42 and 1.43:
- Before you start: sync the parser tests! It makes things much more tidy if the released, stable version of MediaWiki and the released, stable version of Parsoid have matching parser tests.
- Create a phab task for the release (the same way we have a phab task for the weekly deploy). This doesn't have to be very fancy; see T378130.
- Before you create a new task check to see if one already exists, usually as a child of the main release task. For example T408461 is the main release task for MW 1.45, and the release managers had already created a phab task for the new parsoid version.
- Create a tag for the branch point. Use the same process as the weekly deploy patch, but omit the -aXX suffix. For example, if the last alpha release was v0.20.0-a27, the branch point was 0d78... and the release task was T378130:
tools/prepare_vendor_patch.sh v0.20.0-a27 v0.20.0 T378130 0d78ea31be8aa <path to mediawiki-vendor> <path to mediawiki-core>
This will create the v0.20.0 tag. Sometimes we want to squeeze an extra patch or two into the release branch that weren't tagged in the previous week's train deploy, but don't worry if this new tag points to exactly the same hash as the previous alpha release.
- As usual add the deploy log to mw:Parsoid/Deployments, but don't upload the patches to core and vendor; we need to retarget them to the branch.
- Create a new REL1_43 parsoid branch, corresponding to the mediawiki-core release version.
git branch REL1_43 v0.20.0
git push origin REL1_43
- Bump mediawiki-vendor:composer.json on the REL1_43 branch with the new 0.20.0 tag. If you are quick, you might be able to take the mediawiki-vendor patch prepared above and just rebase it (ie git rebase -i origin/REL1_43) but if other changes have landed on mediawiki-vendor you'll have to create the patch manually:
cd ..../mediawiki-vendor
git checkout REL1_43 ; git pull origin
.. edit composer.json ..
# don't actually use 'composer update' use the docker command from README.md
composer update --no-dev # or equivalent
.. git add, git commit etc ..
git review -u REL1_43
You can/should use the docker command from mediawiki-vendor:README.md instead of composer update, and your git commit message should use the same Bug: lines as the draft commit prepared by tools/prepare_vendor_patch.sh above.
- Bump mediawiki-core:composer.json on the REL1_43 branch with the new 0.20.0 tag. Again, you might be able to take the existing mediawiki-core patch prepared above and rebase it on to the branch before submitting. Be sure to check that the Depends-On matches the Change-Id of the mediawiki-vendor patch submitted in the previous step. If you need to create the patch manually:
cd ..../mediawiki-core
git checkout REL1_43 ; git pull origin
.. edit composer.json ..
.. git add, git commit etc ..
git review -u REL1_43
- Vote C+2 on the mediawiki-core patch, then ask for review on the mediawiki-vendor patch.
- Update mw:Parsoid/Releases to include the new 0.20.0 release.
- Probably no one will have remembered to update this page when the last ".0" version of the previous release was tagged, so check the history of the REL1_42 branch (aka current release minus one) to see what non-alpha versions of parsoid were released and update the releases page accordingly; you can use the date of the mediawiki-vendor patch as the "release date".
- This completes the 0.20.0 release. But you probably want to make sure that the next Parsoid deployer remembers to bump the version to 0.21, so let's tag an initial -a1 release with the same phab task and commit hash as we used for the stable release:
tools/prepare_vendor_patch.sh v0.20.0 v0.21.0-a1 T378130 0d78ea31be8aa <path to mediawiki-vendor> <path to mediawiki-core>
Add contents to mw:Parsoid/Deployments as usual (the change list will be empty, but there will be a tracking template) and upload the patches to core and vendor. C+2 the core patch, ask for review on the vendor patch.
- Update mw:Parsoid/Releases to include the new "not yet released" 0.21.0 branch.
Making the first/another minor release
In the early stages of the release cycle, you'll probably keep cherry-picking patches back onto your REL1_42 branch -- you might even make new alpha releases in the 0.19 series. But around the time of the -rc1 or -rc2 release of the new MediaWiki version, you're going to want to tag a non-alpha .0 version. Continuing the example above, we'll want to tag v0.19.0 on the REL1_42 branch. This may end up identical to one of the pre-existing alpha tags, or it might have an extra patch or two. The process for tagging and releasing a non-alpha version is very similar to the above, but we're going to be working on the REL1_42 branch exclusively, for both the parsoid tag as well as the mediawiki-vendor and mediawiki-core patches. And if you need to release a bug fix v0.19.1, you'll use this same process for that as well.
- (If necessary:) Tag the branch point, aka 0.19.0.
git tag v0.19.0 <some hash>
git push origin v0.19.0
- Bump mediawiki-vendor:composer.json on the REL1_42 branch with the new 0.19.0 tag.
cd ..../mediawiki-vendor
git checkout REL1_42 ; git pull origin
.. edit composer.json ..
# don't actually use 'composer update' use the docker command from README.md
composer update --no-dev # or equivalent
.. git add, git commit etc ..
git review -u
- Bump mediawiki-core:composer.json in mediawiki-core to "0.19.0" (note no caret) and make this patch Depends-On: <change-id-for-mediawiki-vendor-patch>.
cd ..../mediawiki-core
git checkout REL1_42 ; git pull origin
.. edit composer.json ..
git add composer.json
git commit # INCLUDE THE DEPENDS-ON
git review -u
- Update mw:Parsoid/Releases to include the new ".0" version.
Edge case deployment scenarios
If the train branch has already been cut
IF THE NEXT TRAIN BRANCH HAS ALREADY BEEN CUT (aka you're cherry picking to 1.43.0-wmf.4 but 1.43.0-wmf.5 already exists) you can manually rebase the mediawiki-core patch to the wmf.4 branch, but you will probably need to recreate the vendor patch from scratch using the steps below. Ignore the warnings in the next two clauses about merging to master, because presumably the next train already has the fixes. Ensure that your Change-Ids are unique: if you have a patch against master with the same Change-Id sitting in gerrit, even if (especially since) you don't plan to merge it to master, you can confuse jenkins.
IF THE TRAIN BRANCH HAS ALREADY BEEN CUT (aka the wmf/1.XX.0-wmf.YY branch exists) then after you merge to master of mediawiki-vendor you will also need to cherry-pick a patch to the appropriate branch of mediawiki-vendor, for example wmf/1.42.0-wmf.3.
BEFORE YOU CHERRY PICK TO THE WMF/ BRANCH be sure the new version of Parsoid has been merged to the master branch of mediawiki-vendor and mediawiki-core (if you are merging to master) or (otherwise) that no patches against master with the same Change-Ids exist. The CI stack expects the master branch to have merged before the cherry-picks are created, and the Depends-On in the core patch gets confused if there are multiple unmerged patches with the same Change-Ids---like if you were trying to merge to master and the branch in parallel. (Yes, in *theory* it should know to depend on just the patches on the same branch, but it doesn't know that the branches in different repositories correspond.)
In some cases you can use gerrit to cherry-pick the vendor patch to the branch, but in practice most updates to vendor conflict with each other due to the presence of content hashes, so you'll most likely need to repeat the steps above:
# from mediawiki/vendor
git remote update # if needed
git checkout wmf/1.42.0-wmf.3
edit composer.json # set wikimedia/parsoid to v0.19.0-a21
# don't actually run `composer update`, use the multiline command from
# mediawiki-vendor:README.md starting with `docker` and ending with `update --no-dev`
composer update --no-dev # or equivalent
git add -u
git commit -m "Bump wikimedia/parsoid to v0.19.0-a21"
git review -u
(Don't forget that you also have a cherry-picked patch to mediawiki-core which depends on this patch and should already be C+2'ed.)
Now, before you merge these cherry-picks onto the branch, you need to check one of three possible cases:
- If the train branch is new and the "branch commit" has not yet been merged (it looks like this; here is a gerrit search) -- wait! Do not merge the cherry-pick into mediawiki-vendor until the branch commit has landed, or the git submodules in mediawiki-core will be left out of sync (T259832). You might want to add a Depends-On clause to the cherry-pick patch to enforce this. If you accidentally merged this, see below for how to fix it.
- If the branch commit has been merged, but the train has not been deployed anywhere (check Deployments and the status page on versions.toolforge.org), then it's safe to just C+2 the cherry-picks. But be sure to ping #wikimedia-operations and get clearance before C+2 and merge, since (a) the deployer may have already checked out the branch in preparation for the train, and (b) jenkins can take a while to complete the merge and they need to know to wait for it. Probably worth leaving a comment on the phab task for the blocker bug for the train release as well.
- If the train has already been deployed, then you will need to backport this cherry-pick; it is considered bad form to leave code committed on the branch which isn't deployed. Don't merge the cherry-pick until the backport window. You'll want to backport the vendor patch and the core patch together; if using spiderpig just input both patches when prompted for the patch to deploy.
- Spiderpig is smart about deploying to both the "current train" branch and "previous train" branch as long as the patches are based on the proper branches. Nothing special needs to be done if you're backporting to "last week's train" before "this week's train" has finished deploying.
If you accidentally merged into vendor before the branch commit has been merged
Merging a patch onto a branch in the mediawiki-vendor repository will automatically update the git submodules in core, but only after the branch commit is in place. See phab:T259832 for details. If you think you might have merged onto vendor before the branch commit was merged, check the appropriate vendor branch history for core, aka https://gerrit.wikimedia.org/g/mediawiki/core/+/refs/heads/wmf/1.42.0-wmf.3. Verify that the submodule hash for vendor corresponds to the tip of the branch of mediawiki-vendor. If it's not correct, after the branch commit has been merged into mediawiki-core you need to manually bump the submodules:
cd .../mediawiki-core
# note that the below will clobber your vendor, extensions, and skins directories
# you might want to use a new clean checkout of core
git checkout wmf/1.42.0-wmf.3
git submodule update --init
git submodule update --remote vendor
git add vendor
git commit -m "Update git submodules"
git review -u
Review and merge that.
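The check that the submodule hash for vendor matches the tip of the vendor branch can also be done from a local checkout. The sketch below parses `git ls-tree` output, where a submodule appears as a `commit` entry; gitlink_hash is a hypothetical helper, and the remote name `origin` in the usage comment is an assumption about your checkout.

```shell
# Hypothetical helper: read `git ls-tree` output on stdin and print the
# commit hash recorded for the submodule at path $1.
# ls-tree lines look like: "<mode> <type> <hash><TAB><path>"
gitlink_hash() {
    awk -v path="$1" -F'\t' '
        $2 == path {
            split($1, f, " ")
            if (f[2] == "commit") print f[3]   # submodules have type "commit"
        }'
}

# Example use from a mediawiki-core checkout (assumes remote "origin"):
#   recorded=$(git ls-tree origin/wmf/1.42.0-wmf.3 vendor | gitlink_hash vendor)
#   tip=$(git -C vendor ls-remote origin refs/heads/wmf/1.42.0-wmf.3 | cut -f1)
#   [ "$recorded" = "$tip" ] && echo "in sync" || echo "submodule bump needed"
```

If the two hashes differ after the branch commit has landed, the manual submodule bump above is needed.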
Misc stuff
- mw-on-k8s deployment documentation
- To see the list of parsoid hosts in beta:
cat /srv/deployment/parsoid/deploy/scap/betacluster
- See also /srv/deployment/parsoid/deploy/scap/scap.cfg in general
Data flow
Parsoid runs entirely on an internal subnet, so requests to it are proxied through the ve-parsoid API module. This module is implemented in extensions/VisualEditor/ApiVisualEditor.php and is invoked with a POST request to /w/api.php?action=ve-parsoid. The API module then sends a request to Parsoid, either GET /$prefix/$pagename to get the HTML for a page, or POST /$prefix/$pagename to submit HTML and get wikitext back. Parsoid itself also issues requests to /w/api.php to get the wikitext of the requested page and to do template expansion.
Once the ve-parsoid API module receives a response from Parsoid, it either relays it back to the client (when requesting HTML), or saves the returned wikitext to the page (when submitting HTML).
                  (POST /w/api.php?action=ve-parsoid)         (GET /en/Barack_Obama?oldid=1234)           (requests for page content and template expansions)
Client browser ----------------------------------------> API ----------------------------------> Parsoid ----------------------------------------------------> API
      ^                                                  | ^                                       | ^                                                          |
      |                    (response)                    | |                (HTML)                 | |                       (responses)                        |
      +--------------------------------------------------+ +---------------------------------------+ +----------------------------------------------------------+

                  (POST /w/api.php?action=ve-parsoid)        (POST /en/Barack_Obama; oldid=1234)
Client browser ----------------------------------------> API ----------------------------------> Parsoid
                                                         | ^                                       |
                                             (save page) | |              (wikitext)               |
                                                         | +---------------------------------------+
                                                         |
                                                      Database