Use case
We have a web crawl database with HTML content in a column.
We need to extract the text content for analysis.
Describe the solution you'd like
It does not need to be 100% correct, but it must be fast.
For HTML and XHTML:
- remove script and style elements (and maybe meta) with all their content (assuming </script> is properly escaped in JS string literals as expected; counterexample: <script>var x = "</script>"</script>);
- unwrap CDATA;
- remove tags (assuming entities are properly escaped in attributes; counterexample: <test test=">">);
- collapse whitespace.
For XML:
- unwrap CDATA;
- remove tags (assuming entities are properly escaped in attributes; counterexample: <test test=">">);
- collapse whitespace.
(Everything should be done in a single pass, but the logical order of the steps matters.)
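To make the intended order of steps concrete, here is a minimal Python sketch of a forward-only scanner. It is only an illustration of the rules listed above, not a proposed implementation: the name `extract_text` and the `is_html` flag are hypothetical, tags are removed without inserting a separator, and comments are treated like any other tag (the issue leaves that open).

```python
import re

# Closing-tag matchers for elements whose entire content is dropped (HTML/XHTML only).
_CLOSERS = {name: re.compile(r"</" + name + r"\s*>", re.IGNORECASE)
            for name in ("script", "style")}
_OPEN_NAME = re.compile(r"<([A-Za-z][A-Za-z0-9]*)")


def extract_text(markup: str, is_html: bool = True) -> str:
    """Hypothetical helper: strip markup in one forward scan, then collapse whitespace."""
    out = []
    i, n = 0, len(markup)
    while i < n:
        if markup.startswith("<![CDATA[", i):
            # Unwrap CDATA: keep the payload verbatim, drop the delimiters.
            end = markup.find("]]>", i + 9)
            if end == -1:
                out.append(markup[i + 9:])
                break
            out.append(markup[i + 9:end])
            i = end + 3
        elif markup[i] == "<":
            m = _OPEN_NAME.match(markup, i)
            name = m.group(1).lower() if m else ""
            close = markup.find(">", i)  # assumes '>' is escaped inside attributes
            if close == -1:
                break  # unterminated tag: drop the rest
            i = close + 1
            if is_html and name in _CLOSERS:
                # Skip everything up to and including the first closing tag.
                end_tag = _CLOSERS[name].search(markup, i)
                i = n if end_tag is None else end_tag.end()
        else:
            # Plain text: copy up to the next '<'.
            nxt = markup.find("<", i)
            if nxt == -1:
                nxt = n
            out.append(markup[i:nxt])
            i = nxt
    # Collapse runs of whitespace and trim the ends.
    return " ".join("".join(out).split())


print(extract_text("<p>Hello,   <b>world</b>!</p><script>var a = 1;</script>"))
# prints: Hello, world!
```

Both counterexamples from the list behave as expected for a fast, imperfect approach: `<script>var x = "</script>"</script>after` yields `"after`, and `<test test=">">text` yields `">text`.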
We will not support custom XML entities. We will not decode HTML and XML entities (there will be a separate function for that). We will not process the meta charset declaration... Whether comments should be processed is still an open question.
Describe alternatives you've considered
Parse HTML with regular expressions:
replaceRegexpAll(replaceRegexpAll(content, '(?s)<(script|style)[^>]*>.*?</(script|style)>', ''), '<[^>]+>', '')
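For comparison outside the database, the same two-step regex approach can be sketched in Python. This is a rough port for experimentation only; `strip_tags_regex` is a hypothetical name, and `re.DOTALL` stands in for the `(?s)` flag.

```python
import re

# Step 1: drop script/style elements together with their content.
_SCRIPT_STYLE = re.compile(r"<(script|style)[^>]*>.*?</(script|style)>", re.DOTALL)
# Step 2: drop every remaining tag.
_TAG = re.compile(r"<[^>]+>")


def strip_tags_regex(content: str) -> str:
    """Hypothetical helper mirroring the nested replaceRegexpAll calls."""
    return _TAG.sub("", _SCRIPT_STYLE.sub("", content))


print(strip_tags_regex("<p>Hi</p><script>\nvar a;\n</script><b>there</b>"))
# prints: Hithere
```

Unlike the requested function, this variant needs two passes over the data, does not collapse whitespace, and removes a CDATA section whole (when its payload contains no `>`) instead of unwrapping it.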