A function to extract text from HTML and XML

**Use case**

We have a web crawl database with HTML content in a column.
We need to extract the text content for analysis.

**Describe the solution you'd like**

It does not have to be 100% correct, but it must be fast.

For HTML and XHTML:
- remove `script` and `style` elements (and maybe `meta`) together with all their content, assuming `</script>` is properly escaped inside JS string literals (counterexample: `<script>var x = "</script>"</script>`);
- unwrap CDATA;
- remove tags, assuming `>` is properly escaped as an entity inside attribute values (counterexample: `<test test=">">`);
- collapse whitespace.

For XML:
- unwrap CDATA;
- remove tags, assuming `>` is properly escaped as an entity inside attribute values (counterexample: `<test test=">">`);
- collapse whitespace.

(Everything should be done in a single pass, but the logical order of the steps matters.)
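
To make the steps above concrete, here is a minimal Python sketch of one way they could be combined into a single scan over the input. It is not the proposed implementation: the name `extract_text` and the `html` flag are hypothetical, `meta` is left out, and two details the request does not specify are assumed here: each removed tag is replaced by a space so adjacent words do not fuse, and whitespace is collapsed on the output buffer at the end for brevity rather than inline.

```python
import re

# All names below (extract_text, the html flag) are hypothetical, for illustration only.
_TAG_NAME = re.compile(r"<([A-Za-z][A-Za-z0-9]*)")
_RAW_CLOSE = {
    name: re.compile(r"</" + name + r"\s*>", re.IGNORECASE | re.ASCII)
    for name in ("script", "style")
}
_WHITESPACE = re.compile(r"\s+")


def extract_text(doc: str, html: bool = True) -> str:
    """Approximate text extraction; html=False skips the script/style rule (XML mode)."""
    out = []
    i = 0
    while i < len(doc):
        lt = doc.find("<", i)
        if lt == -1:
            out.append(doc[i:])          # no more markup: keep the tail
            break
        out.append(doc[i:lt])            # text before the next '<'
        if doc.startswith("<![CDATA[", lt):
            end = doc.find("]]>", lt + 9)
            if end == -1:                # unterminated CDATA: keep the rest as text
                out.append(doc[lt + 9:])
                break
            out.append(doc[lt + 9:end])  # unwrap CDATA content verbatim
            i = end + 3
            continue
        gt = doc.find(">", lt + 1)       # naive: the first '>' ends the tag
        if gt == -1:
            break                        # unterminated tag: drop the rest
        out.append(" ")                  # assumption: a tag separates words
        i = gt + 1
        if html:
            m = _TAG_NAME.match(doc, lt)
            close = _RAW_CLOSE.get(m.group(1).lower()) if m else None
            if close is not None:        # <script>/<style>: skip to the closing tag
                found = close.search(doc, i)
                if found is None:
                    break
                i = found.end()
    # Assumption: collapse whitespace at the end instead of while scanning.
    return _WHITESPACE.sub(" ", "".join(out)).strip()
```

Because the scan treats the first `>` as the end of a tag, the counterexample `<test test=">">` leaks `">` into the output, which is the accepted trade-off described above.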

We will not support custom XML entities. We will not decode HTML and XML entities (there will be a separate function for that). We will not process the meta charset declaration. Whether comments should be processed is an open question.

**Describe alternatives you've considered**

[Parse HTML with regular expressions](https://i.redd.it/k6sded6b9mkz.png):
```
replaceRegexpAll(replaceRegexpAll(content, '(?s)<(script|style)[^>]*>.*?</(script|style)>', ''), '<[^>]+>', '')
```
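
For comparison, the same two-step substitution can be reproduced with Python's `re` module to see what it leaves uncovered. This is only a sketch: `regex_strip` is a hypothetical name, and Python's regex engine is not the one behind `replaceRegexpAll`, although both patterns use syntax that is also valid here.

```python
import re

# Hypothetical helper mirroring the nested replaceRegexpAll calls above.
_SCRIPT_STYLE = re.compile(r"(?s)<(script|style)[^>]*>.*?</(script|style)>")
_ANY_TAG = re.compile(r"<[^>]+>")


def regex_strip(content: str) -> str:
    # Step 1: drop script/style elements together with their content.
    without_raw = _SCRIPT_STYLE.sub("", content)
    # Step 2: drop every remaining tag.
    return _ANY_TAG.sub("", without_raw)
```

This handles the common case, but it does not collapse whitespace, and a CDATA section without a `>` in its body is matched as a single tag and deleted together with its content instead of being unwrapped.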

