com_finder: FinderIndexerParserHtml: parse() too aggressive #7927

@andykirk

administrator/components/com_finder/helpers/indexer/parser/html.php, line 63:

$input = str_replace('>', '> ', $input);

This is too aggressive. I discovered a case where one of my articles contains the text for an expanded acronym using <b> tags to highlight the relevant letters. E.g. (example from Wikipedia):

the onset of <b>c</b>ongestive <b>h</b>eart <b>f</b>ailure (CHF)

In the search output after indexing this becomes:

the onset of c ongestive h eart f ailure (CHF)
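
To make the failure concrete, here is a minimal sketch that reproduces the effect. It assumes the relevant part of the parser can be approximated by padding every `>` and then calling `strip_tags()`; the real `parse()` method does more than this. `pad_all_and_strip()` is a hypothetical helper name, not part of com_finder, and the whitespace collapsing is my own addition to keep the output readable.

```php
<?php
// Hypothetical helper, NOT the actual com_finder code: a simplified
// stand-in for the step described above.
function pad_all_and_strip($input)
{
    // The line from html.php: add a space after every '>'.
    $input = str_replace('>', '> ', $input);

    // Assumption: tags are removed afterwards with strip_tags().
    $input = strip_tags($input);

    // Assumption: collapse runs of whitespace for readability.
    return trim(preg_replace('#\s+#', ' ', $input));
}

$html = 'the onset of <b>c</b>ongestive <b>h</b>eart <b>f</b>ailure (CHF)';

echo pad_all_and_strip($html), "\n";
// prints "the onset of c ongestive h eart f ailure (CHF)"
```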

This is broken. The comment in the code describes a problem with adjacent block tags:

// This fixes issues such as '<h1>Title</h1><p>Paragraph</p>'
// being transformed into 'TitleParagraph' with no space.

With that in mind I propose a solution that only adds the space after block tags, not all tags:

$block_els = 'address|article|aside|blockquote|canvas|dd|div|dl|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|header|hgroup|hr|main|nav|noscript|ol|output|p|pre|section|table|tfoot|ul|video';
$input = preg_replace('#</(' . $block_els . ')><#', '</$1> <', $input);
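
As a rough check of the block-tag proposal, the sketch below applies that `preg_replace()` call and then strips tags. `pad_block_and_strip()` is a hypothetical wrapper written for this issue, and using `strip_tags()` as the follow-up step is an assumption about what happens next in the parser.

```php
<?php
// Hypothetical wrapper around the proposed replacement, for illustration only.
function pad_block_and_strip($input)
{
    $block_els = 'address|article|aside|blockquote|canvas|dd|div|dl|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|header|hgroup|hr|main|nav|noscript|ol|output|p|pre|section|table|tfoot|ul|video';

    // Add a space only between a closing block tag and the tag that follows it.
    $input = preg_replace('#</(' . $block_els . ')><#', '</$1> <', $input);

    // Assumption: tags are removed afterwards with strip_tags().
    return strip_tags($input);
}

// Adjacent block tags still get a separating space.
echo pad_block_and_strip('<h1>Title</h1><p>Paragraph</p>'), "\n";
// prints "Title Paragraph"

// Inline <b> tags inside words are left alone.
echo pad_block_and_strip('the onset of <b>c</b>ongestive <b>h</b>eart <b>f</b>ailure (CHF)'), "\n";
// prints "the onset of congestive heart failure (CHF)"
```

Note that the pattern as written is case-sensitive and only matches a closing block tag immediately followed by `<`, so `</P><P>` or `</p>text` would not be padded. Whether that matters depends on the content being indexed.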

Or, if that's too much, just add a space between adjacent tags instead of putting a space after every tag:
$input = str_replace('><', '> <', $input);
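
For comparison, here is the same kind of sketch for the simpler adjacent-tags variant. Again, `pad_adjacent_and_strip()` is a hypothetical helper and the `strip_tags()` step is an assumption. The last case shows one trade-off: directly adjacent inline tags would still be separated.

```php
<?php
// Hypothetical helper for the simpler alternative, for illustration only.
function pad_adjacent_and_strip($input)
{
    // Add a space only where one tag is immediately followed by another.
    $input = str_replace('><', '> <', $input);

    // Assumption: tags are removed afterwards with strip_tags().
    return strip_tags($input);
}

echo pad_adjacent_and_strip('<h1>Title</h1><p>Paragraph</p>'), "\n";
// prints "Title Paragraph"

echo pad_adjacent_and_strip('the onset of <b>c</b>ongestive <b>h</b>eart <b>f</b>ailure (CHF)'), "\n";
// prints "the onset of congestive heart failure (CHF)"

// Trade-off: adjacent inline tags are still split.
echo pad_adjacent_and_strip('<b>C</b><b>H</b>F'), "\n";
// prints "C HF"
```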

I'm happy to discuss further or raise a PR if needed.

Thanks,
Andy
