Optimization of mb_str_split #199

Merged
nicolas-grekas merged 1 commit into symfony:master from kamil-tekiela:Optimization-of-mb_str_split on Nov 27, 2019

@kamil-tekiela
Contributor

mb_strlen + mb_substr can be very slow for large strings. In the common case of a split length of 1 and UTF-8 encoding, we can use preg_split instead.

Added an early exit to the function:

if (1 === $split_length && 'UTF-8' === $encoding) {
    return preg_split('//u', $string, null, PREG_SPLIT_NO_EMPTY);
}

See https://3v4l.org/U6PPf
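For readers who want to see the early exit in context, here is a minimal sketch of how the fast path and the slower fallback might fit together. The function name `split_chars_sketch` is hypothetical and this is not the polyfill's actual implementation. It assumes the mbstring extension is available for the fallback, and it passes `-1` instead of `null` as the limit, because passing `null` to that parameter is deprecated as of PHP 8.1.

```php
<?php
// Hypothetical helper for illustration only; not the polyfill's real code.
function split_chars_sketch(string $string, int $split_length = 1, string $encoding = 'UTF-8'): array
{
    if ($split_length < 1) {
        throw new InvalidArgumentException('The split length must be greater than zero.');
    }

    // Fast path described in the PR: an empty pattern with the "u" modifier
    // matches between every UTF-8 code point, so one regex call splits the
    // whole string. Note that preg_split returns false on invalid UTF-8;
    // this sketch maps that case to an empty array.
    if (1 === $split_length && 'UTF-8' === $encoding) {
        $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

        return false === $chars ? [] : $chars;
    }

    // Slow path: each mb_substr call has to locate its offset in the string,
    // which is what makes this loop expensive for large inputs.
    $result = [];
    $length = mb_strlen($string, $encoding);
    for ($i = 0; $i < $length; $i += $split_length) {
        $result[] = mb_substr($string, $i, $split_length, $encoding);
    }

    return $result;
}
```

Calling `split_chars_sketch('héllo')` takes the fast path, while `split_chars_sketch('héllo', 2)` falls through to the loop.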

@kamil-tekiela kamil-tekiela mentioned this pull request Nov 10, 2019
@BackEndTea
Contributor

This probably needs a test case with exactly this combination (split length of 1 and 'UTF-8' encoding).

@kamil-tekiela
Contributor Author

kamil-tekiela commented Nov 10, 2019

The first test kind of handles this, because 1 and UTF-8 should be the default values.

We could add the following test:

$this->assertSame(array('e', '́', '💩', '𐍈'), mb_str_split('é💩𐍈', 1, 'UTF-8'));
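The expected array in that assertion lists 'e' and a combining accent as separate elements, which suggests the 'é' in the input is in decomposed form (an assumption based on the expected output). A small sketch of why that matters: splitting on `//u` works per code point, not per visible character.

```php
<?php
// "é" written two ways: decomposed (e + U+0301 combining acute) and precomposed (U+00E9).
$decomposed = "e\u{0301}";
$precomposed = "\u{00E9}";

// The split happens between code points, so the decomposed form yields two elements.
$decomposedParts = preg_split('//u', $decomposed, -1, PREG_SPLIT_NO_EMPTY);
$precomposedParts = preg_split('//u', $precomposed, -1, PREG_SPLIT_NO_EMPTY);

var_dump(count($decomposedParts));  // number of code points in the decomposed form
var_dump(count($precomposedParts)); // number of code points in the precomposed form
```

Both strings usually render identically, so a test that spells out the combining mark documents the code-point behavior explicitly.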

@BackEndTea
Contributor

I'd say 👍 for a more explicit test for this case

@voku

voku commented Nov 15, 2019

preg_match_all seems to be faster, doesn't it?

-> https://3v4l.org/8XPK7 (preg_match_all)
-> https://3v4l.org/Dl2ml (preg_split)

PS: I think for short strings it's even faster to use substr? See e.g. voku/portable-utf8@e70be38#diff-890805f35966ea20340b9076d7b47329R7212
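As a rough sketch of the alternative raised here, a `preg_match_all` variant could look like the following. The pattern `/./us` and the function name are assumptions chosen for illustration; the linked snippets are not reproduced here, and no timing claims are made.

```php
<?php
// Hypothetical preg_match_all variant; the pattern is an assumption, not taken from the links above.
function split_chars_match_all_sketch(string $string): array
{
    // "u" treats the subject as UTF-8; "s" lets "." match newlines too,
    // so every code point becomes one full-pattern match in $matches[0].
    if (false === preg_match_all('/./us', $string, $matches)) {
        return []; // preg_match_all failed, e.g. on invalid UTF-8
    }

    return $matches[0];
}

// Both approaches should agree on valid UTF-8 input.
$viaSplit = preg_split('//u', "a\nb\u{00E9}", -1, PREG_SPLIT_NO_EMPTY);
$viaMatchAll = split_chars_match_all_sketch("a\nb\u{00E9}");
var_dump($viaSplit === $viaMatchAll);
```

Any speed difference between the two would need to be measured on the target PHP version and input sizes.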

@nicolas-grekas
Member

Thank you @kamil-tekiela.

nicolas-grekas added a commit that referenced this pull request Nov 27, 2019
This PR was squashed before being merged into the 1.13-dev branch.

Discussion
----------

Optimization of mb_str_split

`mb_strlen` + `mb_substr` can be very slow for large strings. In the common case of a split length of 1 and UTF-8 encoding, we can use `preg_split` instead.

Added an early exit to the function:

    if (1 === $split_length && 'UTF-8' === $encoding) {
        return preg_split('//u', $string, null, PREG_SPLIT_NO_EMPTY);
    }

See https://3v4l.org/U6PPf

Commits
-------

8dd88d7 Optimization of mb_str_split
@nicolas-grekas nicolas-grekas merged commit 8dd88d7 into symfony:master Nov 27, 2019