Skip to content

Cleanup sanitize core#285

Closed
otherdaniel wants to merge 10 commits intoWICG:mainfrom
otherdaniel:cleanup-sanitize-core
Closed

Cleanup sanitize core#285
otherdaniel wants to merge 10 commits intoWICG:mainfrom
otherdaniel:cleanup-sanitize-core

Conversation

@otherdaniel
Copy link
Collaborator

@otherdaniel otherdaniel commented Apr 10, 2025

Re-work the handling of empty vs missing lists, along the lines discussed in #268. Empty allow-lists now effectively block everything; while a missing allow-list does nothing. For remove-lists, there is no difference and neither form will remove anything.

Examples:

  • {} => everything passes
  • {elements: []} => nothing passes

The error handling is updated to disallow both allow and block lists (for elements & attributes, each). For per-element attributes no such restriction exists; unlike what #281 wants.

Example:
{ attributes: ["src", "srcset", "alt"], elements: [{name: "img", attributes: ["width", "height"], removeAttributes: ["srcset"]}]}
This would allow <img src=... alt=... width=... height=...>, but would drop a srcset attribute from the element.

Fixes #268 and #281


Preview | Diff

index.bs Outdated
Comment on lines +478 to +480
1. If |configuration|["{{SanitizerConfig/elements}}"] [=map/exists=] and
|configuration|["{{SanitizerConfig/elements}}"]["{{SanitizerElementNamespaceWithAttributes/removeAttributes}}"]
[=map/exists=] and [=SanitizerConfig/contains=] |attrName|,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slightly ambiguous. Perhaps a comma before the first "and" removes it. (A more verbose version that also works would have you repeat what contains attrName and remove the duplicate "and".)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, by seperating the clauses with , and if.

@evilpie
Copy link
Collaborator

evilpie commented Apr 16, 2025

Indirectly related to this, but I don't want to forget about this.

Don't we also need to handle missing properties in set a configuration when handling the input config? We just access configuration["elements"] etc.

Similarly in allow an element etc. we are just appending/removing from potentially missing lists.

@evilpie
Copy link
Collaborator

evilpie commented Apr 16, 2025

Oh, and the way it's currently specified, we can't actually tell the difference between { elements: [] } and {}, because we don't differentiate the two cases properly in set a configuration.

Both cases don't enter the for each, so we can't tell the difference. I guess we need to do something like:

1. If configurations["elements"] exists:
   1. Set sanitizer's configuration["elements"] to empty list
   2. For each configurations["elements"]
      1. Allow an element
// 2. else (sanitizer's configuration["elements"] is still missing)

For the other lists (removeElements etc.) we should maybe just always explicitly initialize the sanitizer's configuration to an empty list?

@otherdaniel
Copy link
Collaborator Author

Thanks for the review!

Similarly in allow an element etc. we are just appending/removing from potentially missing lists.

Fixed this in the latest push.

(Other fixes are still coming...)

@otherdaniel
Copy link
Collaborator Author

"set a configuration" now explicitly initializes any list that exists in the input. So absent and empty lists should now be accurately reflected in the internal config.

@evilpie
Copy link
Collaborator

evilpie commented Apr 29, 2025

"set a configuration" now explicitly initializes any list that exists in the input. So absent and empty lists should now be accurately reflected in the internal config.

I would have actually preferred to always initialize everything but elements/attributes, but I can also totally see how only having those be optional would be confusing. (It does emphasize the difference between missing and empty being special for these though ...)

I think we are still missing an initialization of "attributes" and "remove attributes" in canonicalize a sanitizer element with attributes.

index.bs Outdated
1. If all of the following are false,
then [=/remove an attribute|remove=] |attribute| from |child|.
- |configuration|["{{SanitizerConfig/attributes}}"] [=map/exists=] and
does not [=SanitizerConfig/contain=] |attrName|
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't "does not" wrong? If the attribute is contained in SanitizerConfig/attributes (i.e. it should be allowed) this expression would result in false, which would mean removal.

This step might be the most complicated in the whole specification, we should try to brainstorm something to simplify it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't "does not" wrong? If the attribute is contained in SanitizerConfig/attributes (i.e. it should be allowed) this expression would result in false, which would mean removal.

Yes, I think so. I'll update the PR.

This step might be the most complicated in the whole specification, we should try to brainstorm something to simplify it.

I also agree. I tried to keep it simple, but... this is the result. :-)

Alternatives or improvements I can think of:

  • Be more aggressive about factoring out common sub-expression. In particular, if map/exists and SanitizerConfig/contains occurs a lot.
  • This is basically an if-then with 3 sub-conditions. One could either move all the conditions into a seperate algorithm. (Which I'm usually not that fond of, because that mostly only introduces an indirection, rather than a proper abstraction.)
  • We could resolve the 3-condition if-then into three seperate if-thens. That would repeat the action 3 times, though.
  • We could resolve the 3-condition if-then into three seperate if-thens, which skip over the remaining steps of this sequence of steps. I.e., "If [...], then skip over the remaining steps in this sequence," sort of like continue or breakdoes for loops. The infra spec defines suitable languages for loops, but not for overly long if-thens. I think we can just describe it in English, though.

wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make at least one change. Write it like

If all of the following are false:

  • a is true;
  • b is true; and
  • c is true,

then ...

to make it flow a bit better (and be more aligned with Infra). (Might be harder with Markdown, not sure.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed the 'flow' suggestion here, and simplified the expressions a little bit by merging existance check + lookup into one operation.

index.bs Outdated
1. [=Continue=].
1. If |configuration|["{{SanitizerConfig/removeElements}}"] [=SanitizerConfig/contains=] |elementName|, or if |configuration|["{{SanitizerConfig/elements}}"] is not [=list/empty=] and does not [=SanitizerConfig/contain=] |elementName|:
1. If |configuration|["{{SanitizerConfig/removeElements}}"] [=map/exists=] and
[=SanitizerConfig/contains=] |elementName|, or if
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you sometimes use a nested "if". The usual specification style does not have that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed "or if" and "and if".

index.bs Outdated
1. If all of the following are false,
then [=/remove an attribute|remove=] |attribute| from |child|.
- |configuration|["{{SanitizerConfig/attributes}}"] [=map/exists=] and
does not [=SanitizerConfig/contain=] |attrName|
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make at least one change. Write it like

If all of the following are false:

  • a is true;
  • b is true; and
  • c is true,

then ...

to make it flow a bit better (and be more aligned with Infra). (Might be harder with Markdown, not sure.)

index.bs Outdated
Comment on lines +549 to +550
1. [=SanitizerConfig/Remove=] |element| from the "{{SanitizerConfig/elements}}"
key of |configuration|.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This notation is very confusing. Should this be

Remove element from configuration["elements"]

?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is confusing indeed.

The problem is that all those keys are optional and might not be there. The intent is that I can move the existence check into the algorithm that get's called: If configuration doesn't actually have an "elements", then Remove element from configuration["elements"] would run into the assertion at the beginning of [=map/get the value of an entry=]. Here, I handle this by moving the map operation into the algorithm, and can do the existence-check at the beginning of remove.

One alternative would be to just do the existence check plus default value at every calling site. That strikes me as very repetetive.

Ideal would be if there'd be an a[b] operator that would return None (or somesuch). Then the calling algorithm could just treat None in whichever fashion it likes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've prepared a version based on the latter, based on an a[b,default] operation. Not sure how much that improves things, and whether that's at all palatable for eventual inclusion into HTML. On the pro side, it helps with this problem; with the size problem below, and occurs again in the core algorithm issue above. On the con side, I'm inventing another new notation. :-/

Please let my know what you all think. I'm still unsure how to proceed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I like that size can return None. We should probably do something more explicit for "set a configuration"?

Like, roughly:

Did list change?

  1. If a and b is some, return number of items in a and b are identical.
  2. Return true if both a and b are none
  3. Return false

index.bs Outdated
<div algorithm>
To determine the <dfn for="SanitizerConfig">size</dfn> of a [=/list=] |list|:

1. If |list| does not [=map/exist=], return None.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using map/exists on a list doesn't make sense.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The intent was that this gets called with configuration["bla"] and I'm hoisting the existance check into the caller. I don't think that was valid. I've now prepared a version that's explicit.

aarongable pushed a commit to chromium/chromium that referenced this pull request May 8, 2025
Add WPT testcases for parseHTML + parseHTMLUnsafe.

Fix <!DOCTYPE> handling, as a follow-up to issue #288 / PR #285.
Previously, it was assumed Doc-Type nodes couldn't make it into
the sanitizer implementation; but with the change to parseHTML
context they can.

Spec: WICG/sanitizer-api#288
Spec: WICG/sanitizer-api#285
Bug: 356601280
Change-Id: Iec797b929eb3b90d5f08318bb28964d2f683acb4
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6523521
Reviewed-by: Yifan Luo <lyf@chromium.org>
Commit-Queue: Daniel Vogelheim <vogelheim@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1457595}
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this pull request May 8, 2025
Add WPT testcases for parseHTML + parseHTMLUnsafe.

Fix <!DOCTYPE> handling, as a follow-up to issue #288 / PR #285.
Previously, it was assumed Doc-Type nodes couldn't make it into
the sanitizer implementation; but with the change to parseHTML
context they can.

Spec: WICG/sanitizer-api#288
Spec: WICG/sanitizer-api#285
Bug: 356601280
Change-Id: Iec797b929eb3b90d5f08318bb28964d2f683acb4
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6523521
Reviewed-by: Yifan Luo <lyf@chromium.org>
Commit-Queue: Daniel Vogelheim <vogelheim@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1457595}
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this pull request May 9, 2025
Add WPT testcases for parseHTML + parseHTMLUnsafe.

Fix <!DOCTYPE> handling, as a follow-up to issue #288 / PR #285.
Previously, it was assumed Doc-Type nodes couldn't make it into
the sanitizer implementation; but with the change to parseHTML
context they can.

Spec: WICG/sanitizer-api#288
Spec: WICG/sanitizer-api#285
Bug: 356601280
Change-Id: Iec797b929eb3b90d5f08318bb28964d2f683acb4
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6523521
Reviewed-by: Yifan Luo <lyf@chromium.org>
Commit-Queue: Daniel Vogelheim <vogelheim@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1457595}
@evilpie
Copy link
Collaborator

evilpie commented May 12, 2025

Could we move the DocumentType change to another PR?

Comment on lines +874 to +890
<div algorithm>
To <dfn for="map">maybe get the value</dfn> in an [=ordered map=]
|map| given a [=map/key=] |key| and an optional |default|:

1. If |default| is given and |map| does not [=map/contain=] |key|,
return |default|.
1. Return [=get the value=] in |map| given |key|.

</div>

We can also denote [=maybe get the value|maybe getting the value=] using an indexing syntax with two parameters, by providing a [=map/key=] and a default value seperated by comma, inside square brackets directly following a map.

<div class=example>
`map["test",None]` would return
`"ok"` for a [=map=] &laquo;["test"&rightarrow;"ok"]&raquo;,
and `None` for a [=map=] &laquo;["ok"&rightarrow;"test"]&raquo;.
</div>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great idea, but let's discuss this over in https://github.com/whatwg/infra. I suppose we could merge this for now, but I think this needs a bit more scrutiny and wider buy-in before we use it in a standard.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +456 to +457
1. If |configuration|["{{SanitizerConfig/replaceWithChildrenElements}}"]
[=map/exists=] and [=SanitizerConfig/contains=] |elementName|:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be better to define an operation for these kind of checks.

In particular these contains checks seem somewhat problematic as they don't perform a by-reference comparison but expect a by-value comparison. But we haven't really nailed down comparisons in Infra yet and even the sketch we have at whatwg/infra#664 doesn't want it to be used for maps as you are doing throughout the document.

Comment on lines +845 to +846
1. [=Assert=]: |list| is a [=/list=] or None.
1. If |list| is None, return.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why None and not null? Null is the usual value for this kind of thing. See also Infra.

@otherdaniel
Copy link
Collaborator Author

Could we move the DocumentType change to another PR?

Done: #289 Will also remove those bits from this PR.

@evilpie
Copy link
Collaborator

evilpie commented May 12, 2025

(Sorry, All these comments are getting confusing)

Coming back to dreaded attributes removal condition. I still don't think it really works. Imagine 1. { attributes: [] } vs. 2. { }. 1. should probably remove all attributes, while 2. keeps all of them?

Edit: I think I would recommend refactoring this to another pseudo method like this:

Should remove attribute with configuration, elementName and attrName?

  1. If configuration["attributes"] does not exists or configuraton["attributes"] contains attrName, return false.
  2. If configuration["dataAttributes"] is true and "data-" is a code unit prefix of attrName["name"] and attrName["namespace"] is null, then return false.
  3. If configuration["elements"][elementName] exists and configuration["elements"][elementName]["attributes"] exists and configuration["elements"][elementName]["attributes"] contains attrName, then return false.
  4. Otherwise, return true.

Mhm. I guess that definition makes it impossible to define a strict list of attributes per element with just { elements: [{name: "div", attributes: ["title"]}] }, instead { elements: [{name: "div", attributes: ["title"] }], attributes: [] } would be required 💥

lando-prod-mozilla bot pushed a commit to mozilla-firefox/firefox that referenced this pull request May 14, 2025
… fix DOCTYPE handling.,

Automatic update from web-platform-tests
[Sanitizer] Add parseHTML testcases, and fix DOCTYPE handling.

Add WPT testcases for parseHTML + parseHTMLUnsafe.

Fix <!DOCTYPE> handling, as a follow-up to issue #288 / PR #285.
Previously, it was assumed Doc-Type nodes couldn't make it into
the sanitizer implementation; but with the change to parseHTML
context they can.

Spec: WICG/sanitizer-api#288
Spec: WICG/sanitizer-api#285
Bug: 356601280
Change-Id: Iec797b929eb3b90d5f08318bb28964d2f683acb4
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6523521
Reviewed-by: Yifan Luo <lyf@chromium.org>
Commit-Queue: Daniel Vogelheim <vogelheim@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1457595}

--

wpt-commits: 95a60666dc001951821439078be97c0bcb421476
wpt-pr: 52406

Differential Revision: https://phabricator.services.mozilla.com/D249335
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request May 15, 2025
… fix DOCTYPE handling.,

Automatic update from web-platform-tests
[Sanitizer] Add parseHTML testcases, and fix DOCTYPE handling.

Add WPT testcases for parseHTML + parseHTMLUnsafe.

Fix <!DOCTYPE> handling, as a follow-up to issue #288 / PR #285.
Previously, it was assumed Doc-Type nodes couldn't make it into
the sanitizer implementation; but with the change to parseHTML
context they can.

Spec: WICG/sanitizer-api#288
Spec: WICG/sanitizer-api#285
Bug: 356601280
Change-Id: Iec797b929eb3b90d5f08318bb28964d2f683acb4
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6523521
Reviewed-by: Yifan Luo <lyf@chromium.org>
Commit-Queue: Daniel Vogelheim <vogelheim@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1457595}

--

wpt-commits: 95a60666dc001951821439078be97c0bcb421476
wpt-pr: 52406

Differential Revision: https://phabricator.services.mozilla.com/D249335
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this pull request May 16, 2025
… fix DOCTYPE handling.,

Automatic update from web-platform-tests
[Sanitizer] Add parseHTML testcases, and fix DOCTYPE handling.

Add WPT testcases for parseHTML + parseHTMLUnsafe.

Fix <!DOCTYPE> handling, as a follow-up to issue #288 / PR #285.
Previously, it was assumed Doc-Type nodes couldn't make it into
the sanitizer implementation; but with the change to parseHTML
context they can.

Spec: WICG/sanitizer-api#288
Spec: WICG/sanitizer-api#285
Bug: 356601280
Change-Id: Iec797b929eb3b90d5f08318bb28964d2f683acb4
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6523521
Reviewed-by: Yifan Luo <lyfchromium.org>
Commit-Queue: Daniel Vogelheim <vogelheimchromium.org>
Cr-Commit-Position: refs/heads/main{#1457595}

--

wpt-commits: 95a60666dc001951821439078be97c0bcb421476
wpt-pr: 52406

Differential Revision: https://phabricator.services.mozilla.com/D249335

UltraBlame original commit: 7033756c58b04d96f527d92e4b01d6ba3d4542e6
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request May 16, 2025
… fix DOCTYPE handling.,

Automatic update from web-platform-tests
[Sanitizer] Add parseHTML testcases, and fix DOCTYPE handling.

Add WPT testcases for parseHTML + parseHTMLUnsafe.

Fix <!DOCTYPE> handling, as a follow-up to issue #288 / PR #285.
Previously, it was assumed Doc-Type nodes couldn't make it into
the sanitizer implementation; but with the change to parseHTML
context they can.

Spec: WICG/sanitizer-api#288
Spec: WICG/sanitizer-api#285
Bug: 356601280
Change-Id: Iec797b929eb3b90d5f08318bb28964d2f683acb4
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6523521
Reviewed-by: Yifan Luo <lyfchromium.org>
Commit-Queue: Daniel Vogelheim <vogelheimchromium.org>
Cr-Commit-Position: refs/heads/main{#1457595}

--

wpt-commits: 95a60666dc001951821439078be97c0bcb421476
wpt-pr: 52406

Differential Revision: https://phabricator.services.mozilla.com/D249335

UltraBlame original commit: 7033756c58b04d96f527d92e4b01d6ba3d4542e6
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this pull request May 16, 2025
… fix DOCTYPE handling.,

Automatic update from web-platform-tests
[Sanitizer] Add parseHTML testcases, and fix DOCTYPE handling.

Add WPT testcases for parseHTML + parseHTMLUnsafe.

Fix <!DOCTYPE> handling, as a follow-up to issue #288 / PR #285.
Previously, it was assumed Doc-Type nodes couldn't make it into
the sanitizer implementation; but with the change to parseHTML
context they can.

Spec: WICG/sanitizer-api#288
Spec: WICG/sanitizer-api#285
Bug: 356601280
Change-Id: Iec797b929eb3b90d5f08318bb28964d2f683acb4
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/6523521
Reviewed-by: Yifan Luo <lyfchromium.org>
Commit-Queue: Daniel Vogelheim <vogelheimchromium.org>
Cr-Commit-Position: refs/heads/main{#1457595}

--

wpt-commits: 95a60666dc001951821439078be97c0bcb421476
wpt-pr: 52406

Differential Revision: https://phabricator.services.mozilla.com/D249335

UltraBlame original commit: 7033756c58b04d96f527d92e4b01d6ba3d4542e6
otherdaniel added a commit to otherdaniel/purification that referenced this pull request Jun 3, 2025
This re-writes config handling and sanitize-core based on 2025-05-28
meeting discussion. This would replace WICG#285.
@evilpie
Copy link
Collaborator

evilpie commented Jul 21, 2025

I think we can close this in favor of #296 already.

@otherdaniel otherdaniel deleted the cleanup-sanitize-core branch October 1, 2025 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handling of omitted vs empty lists

3 participants