com_finder: FinderIndexerParserHtml: parse() too aggressive #7927
Description
administrator/components/com_finder/helpers/indexer/parser/html.php, line 63:
$input = str_replace('>', '> ', $input);
This is too aggressive. I discovered a case where one of my articles contains the text for an expanded acronym using <b> tags to highlight the relevant letters. E.g. (example from Wikipedia):
the onset of <b>c</b>ongestive <b>h</b>eart <b>f</b>ailure (CHF)
In the search output after indexing this becomes:
the onset of c ongestive h eart f ailure (CHF)
Which is broken. The comment in the code describes a problem with adjacent block tags:
// This fixes issues such as '<h1>Title</h1><p>Paragraph</p>'
// being transformed into 'TitleParagraph' with no space.
With that in mind I propose a solution that only adds the space after block tags, not all tags:
$block_els = 'address|article|aside|blockquote|canvas|dd|div|dl|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|header|hgroup|hr|main|nav|noscript|ol|output|p|pre|section|table|tfoot|ul|video';
$input = preg_replace('#</(' . $block_els . ')><#', '</$1> <', $input);
Or, if that's too much, just add a space between adjacent tags instead of putting a space after every tag:
$input = str_replace('><', '> <', $input);
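To compare the three approaches side by side, here is a minimal standalone sketch. The function names are hypothetical and not part of com_finder, and strip_tags() only stands in for whatever tag removal the indexer performs afterwards. The acronym markup is my reading of the example above.

```php
<?php
// Hypothetical helper names for illustration only; the str_replace() and
// preg_replace() calls are the ones quoted in this issue.

// Current behaviour: a space after every '>'.
function spaceAfterEveryTag(string $input): string
{
    return str_replace('>', '> ', $input);
}

// Proposal 1: a space only after a closing block-level tag that is
// immediately followed by another tag.
function spaceAfterBlockTags(string $input): string
{
    $block_els = 'address|article|aside|blockquote|canvas|dd|div|dl|fieldset|'
        . 'figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|header|hgroup|hr|'
        . 'main|nav|noscript|ol|output|p|pre|section|table|tfoot|ul|video';

    return preg_replace('#</(' . $block_els . ')><#', '</$1> <', $input);
}

// Proposal 2: a space between any two directly adjacent tags.
function spaceBetweenAdjacentTags(string $input): string
{
    return str_replace('><', '> <', $input);
}

$acronym = 'the onset of <b>c</b>ongestive <b>h</b>eart <b>f</b>ailure (CHF)';
$blocks  = '<h1>Title</h1><p>Paragraph</p>';

// strip_tags() is an assumption standing in for the indexer's tag removal.
foreach (['spaceAfterEveryTag', 'spaceAfterBlockTags', 'spaceBetweenAdjacentTags'] as $fn) {
    echo $fn, PHP_EOL;
    echo '  acronym: ', strip_tags($fn($acronym)), PHP_EOL;
    echo '  blocks:  ', strip_tags($fn($blocks)), PHP_EOL;
}
```

Both proposals leave the acronym intact and still separate the heading from the paragraph. Two limitations: proposal 2 would still split letters wrapped in directly adjacent inline tags (e.g. <b>C</b><b>H</b>), and proposal 1 does nothing when a closing block tag is followed by bare text rather than another tag.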
I'm happy to discuss further or raise a PR if needed.
Thanks,
Andy