Skip to content

Sanitize HTML for documentation from translators markdown#16043

Merged
seanbudd merged 3 commits intobetafrom
sanitise
Jan 15, 2024
Merged

Sanitize HTML for documentation from translators markdown#16043
seanbudd merged 3 commits intobetafrom
sanitise

Conversation

@seanbudd
Copy link
Copy Markdown
Member

@seanbudd seanbudd commented Jan 15, 2024

Link to issue number:

Follow up to #15945

Summary of the issue:

Translators can add arbitrary HTML to markdown translations files.
This is a stored XSS risk

Description of user facing changes

Should be none - however see known issues

Description of development approach

Used the python bindings for the Rust Ammonia sanitization library

Testing strategy:

Tested building docs, diffed HTML results between this PR and beta.

Known issues with pull request:

The sanitization deletes any HTML tags that are not recognized.
This includes using angle brackets around words, e.g.: <minor>.
If these are wrapped by code formatting these are correctly escaped: e.g. `<minor>`
While english files are mostly correct, many translated files do not wrap text with angle brackets with code formatting.
This means certain parts of translated documentation will be stripped, i.e. <major>.<minor>.<patch> becomes ..

Code Review Checklist:

  • Documentation:
    • Change log entry
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English
  • API is compatible with existing add-ons.
  • Security precautions taken.

@seanbudd seanbudd requested review from a team as code owners January 15, 2024 01:26
@seanbudd seanbudd requested review from Qchristensen, gerald-hartig and michaelDCurran and removed request for a team January 15, 2024 01:26
michaelDCurran
michaelDCurran previously approved these changes Jan 15, 2024
Co-authored-by: Michael Curran <mick@nvaccess.org>
@seanbudd seanbudd added this to the 2024.1 milestone Jan 15, 2024
@seanbudd seanbudd merged commit f759c21 into beta Jan 15, 2024
@seanbudd seanbudd deleted the sanitise branch January 15, 2024 03:55
@CyrilleB79
Copy link
Copy Markdown
Contributor

@seanbudd, you write:

The sanitization deletes any HTML tags that are not recognized. This includes using angle brackets around words, e.g.: <minor>. If these are wrapped by code formatting these are correctly escaped: e.g. `<minor>` While english files are mostly correct, many translated files do not wrap text with angle brackets with code formatting. This means certain parts of translated documentation will be stripped, i.e. <major>.<minor>.<patch> becomes ..

I can see that you have fixed such issues in the Russian change log. But I guess other languages are impacted.
Is there an easy way for translators to figure which part of the file has been modified/deleted so that they can fix it? Else, we will probably end up with many information removed from the translated documents.

Though, I understand the need to sanitize documentation's HTML; my remark is not a criticism of this approach.

@seanbudd
Copy link
Copy Markdown
Member Author

Is there an easy way for translators to figure which part of the file has been modified/deleted so that they can fix it?

Searching for angle brackets in markdown files should make it easy to find these issues.
Some early bad news: Translators will be needing to re-check the entire translated documents line by line when these files go up in crowdin. This means every line will need to be manually checked anyway.

The iframe tag in the Russian source caused the rest of the document to be escaped - i.e. half the document appeared as HTML source code in the browser. For other instances, it is just the fake "tag" that was removed.

Adriani90 pushed a commit to Adriani90/nvda that referenced this pull request Mar 13, 2024
…6043)

Follow up to nvaccess#15945

Summary of the issue:
Translators can add arbitrary HTML to markdown translations files.
This is a stored XSS risk

Description of user facing changes
Should be none - however see known issues

Description of development approach
Used the python bindings for the Rust Ammonia sanitization library

Testing strategy:
Tested building docs, diffed HTML results between this PR and beta.

Known issues with pull request:
The sanitization deletes any HTML tags that are not recognized.
This includes using angle brackets around words, e.g.: <minor>.
If these are wrapped by code formatting these are correctly escaped: e.g. `<minor>`
While english files are mostly correct, many translated files do not wrap text with angle brackets with code formatting.
This means certain parts of translated documentation will be stripped, i.e. <major>.<minor>.<patch> becomes ..
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants