Improved doc search #6785

Merged
Alkarex merged 2 commits into FreshRSS:edge from Alkarex:doc-search
Sep 7, 2024

Conversation

@Alkarex (Member) commented Sep 7, 2024

No description provided.

@Alkarex added this to the 1.25.0 milestone Sep 7, 2024
@Alkarex mentioned this pull request Sep 7, 2024
Additional reading: [De Morgan’s laws](https://en.wikipedia.org/wiki/De_Morgan%27s_laws).

-> ℹ️ Searches are applied to the raw HTML content
+> ℹ️ Searches are applied to the HTML content, and are automatically XML-encoded (so one can search for `'A & B'` without having to encode the `&`).

@mtalexan (Contributor) commented:

You call it "XML-encoded" here, but "HTML-encoded" down in the regex section. It's the same encoding, but XML is probably also worth mentioning in the regex section, since searching even a plain-text article is affected (i.e. matching a plain-text title containing `Q&A:` needs to be done with `/Q&amp;A:/`). The HTML example down there is still relevant, though.
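
The difference between plain and regex searches raised in this comment can be sketched as follows. This is a minimal Python illustration, not FreshRSS code: `stored_title`, `plain_search`, and `regex_search` are hypothetical names, and the sketch assumes that stored fields are HTML-encoded, that plain search terms are encoded automatically, and that regex patterns are applied to the encoded text unchanged.

```python
import html
import re

# Assumption: the title is stored in its encoded form, so "&" becomes "&amp;".
stored_title = html.escape("Q&A: feed readers", quote=False)


def plain_search(term: str, stored: str) -> bool:
    """Plain search: encode the term first, then look for it in the stored text."""
    return html.escape(term, quote=False) in stored


def regex_search(pattern: str, stored: str) -> bool:
    """Regex search: the pattern is matched against the encoded text as written."""
    return re.search(pattern, stored) is not None


if __name__ == "__main__":
    print(plain_search("Q&A:", stored_title))
    print(regex_search(r"Q&A:", stored_title))
    print(regex_search(r"Q&amp;A:", stored_title))
```

Under these assumptions the plain search for `Q&A:` matches, the regex `Q&A:` does not, and the regex `Q&amp;A:` does, which is why the regex section of the documentation needs to mention the encoding as well.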

@Alkarex (Member, Author) commented:

Fixed

@Alkarex (Member, Author) commented:

For the record, the title is also an HTML field, hence the same syntax

@mtalexan (Contributor) commented Sep 7, 2024:
The reason will be displayed to describe this comment to others. Learn more.

> For the record, the title is also an HTML field, hence the same syntax

If I'm not mistaken, it's supported, but up to the actual feed to decide whether it will specify it in plain text or HTML, correct? Plain text is just a subset of HTML with the exception of the 4 characters that have to be escaped, and those also have to be escaped for XML.

I guess I never thought too much about it, but I suppose HTML in the RSS XML probably isn't being double-escaped, is it?

@Alkarex (Member, Author) commented:

When we sanitize the title (and other text fields), the end result is always HTML. Otherwise we would not be able to display those different fields safely.

@mtalexan (Contributor) commented:

Oh duh, or CDATA in the XML.
So when FreshRSS is importing the XML content, does it require CDATA sections for the title and content, or does it unwrap CDATA and decode non-CDATA fields?

@Alkarex (Member, Author) commented:

All that is handled by the sanitization / normalisation.
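
As a rough illustration of the CDATA question in this thread, the sketch below shows that a standard XML parser yields the same text for an entity-escaped title and a CDATA-wrapped one, after which a single normalisation step can produce the HTML-encoded form. It uses Python's standard library rather than FreshRSS's actual sanitization code, and `title_text`, `normalise`, and the two sample items are hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Two hypothetical feed items carrying the same title in different XML forms.
escaped = "<item><title>Q&amp;A: feeds</title></item>"
cdata = "<item><title><![CDATA[Q&A: feeds]]></title></item>"


def title_text(xml_source: str) -> str:
    """Return the decoded title text; the parser handles entities and CDATA alike."""
    return ET.fromstring(xml_source).findtext("title")


def normalise(text: str) -> str:
    """Assumed normalisation step: re-encode the decoded text as HTML."""
    return html.escape(text, quote=False)


if __name__ == "__main__":
    print(title_text(escaped))
    print(title_text(cdata))
    print(normalise(title_text(cdata)))
```

Both inputs decode to the same string, so whichever form the feed uses, the normalised result is identical and a search only ever sees one encoding.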

@Alkarex merged commit af37d88 into FreshRSS:edge Sep 7, 2024
@Alkarex deleted the doc-search branch September 7, 2024 21:25