Skip to content

Default to using strtolower and only use mb_strtolower when available and necessary#92

Merged
allejo merged 6 commits into
9.18from
feature/strtolower-fallback
Oct 24, 2021
Merged

Default to using strtolower and only use mb_strtolower when available and necessary#92
allejo merged 6 commits into
9.18from
feature/strtolower-fallback

Conversation

@allejo

@allejo allejo commented Oct 16, 2021

Copy link
Copy Markdown
Collaborator

Currently, ext-mbstring is a hard Composer dependency, if this plugin is bundled and redistributed, like in a WordPress plugin, then that can't always be met. This has brought into question the necessity of using mb_strtolower.

After some discussion, this PR changes ext-mbstring into a suggested dependency for when you need to highlight code snippets with Unicode characters. We now default to using strtolower since it is much faster than the mb_ equivalent.

The new behavior is as follows: if Unicode is detected in the code snippet, then we look to see if mb_strtolower is available and use it. If it is not available, it will throw a warning once indicating that the extension or a polyfill should be installed and continue to use strtolower as to fail gracefully.

/cc @westonruter

@allejo allejo force-pushed the feature/strtolower-fallback branch from 1e32678 to 6705488 Compare October 16, 2021 05:15
Comment thread Highlight/Highlighter.php Outdated
Comment on lines +331 to +340
$kwd = $this->language->case_insensitive ? mb_strtolower($match[0]) : $match[0];
$kwd = $this->language->case_insensitive ? self::strToLower($match[0]) : $match[0];

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any language that has a keyword with multibyte characters? If not, maybe using mb_strtolower() was overkill in the first place?

@westonruter westonruter Oct 18, 2021

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you said 1c does.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the warning about mb_strtolower() not being available should be conditional based whether the language has multibyte keywords?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility would be to check if the string contains any non-printable ASCII like preg_match( '/[^\x20-\x7E]/', $match[0] ) and then only try to use mb_strtolower() if that is the case.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up using /[^\x00-\x7F]/, courtesy of a SO answer, to check if the code we're highlighting has any Unicode characters. Right now, I'm defaulting to using mb_strtolower if it's available. Is there a performance gain if I default to using strtolower?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, very much so: https://3v4l.org/gRXk3

mb_strtolower() is 10x slower than strtolower().

@westonruter westonruter Oct 22, 2021

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even throwing in a preg_match() call for each iteration with strtolower(), mb_strtolower() is still slower: https://3v4l.org/Rj9bo

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, I did not expect mb_strtolower() to be such a performance hit. I appreciate the testcase and example!

I've switched the logic to default to strtolower and only use mb_strtolower if needed and when available.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found strtolower being up to 50x faster ;)

@allejo allejo changed the title Gracefully fail when mbstring isn't installed Default to using strtolower and only use mb_strtolower when available and necessary Oct 22, 2021

@westonruter westonruter left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, other than the test failure.

@allejo allejo merged commit 6d5049c into 9.18 Oct 24, 2021
@allejo allejo deleted the feature/strtolower-fallback branch October 24, 2021 00:28
@allejo

allejo commented Oct 24, 2021

Copy link
Copy Markdown
Collaborator Author

I have no idea what's going on with the PHP 7.3 environment, it was originally passing but re-running passed tests now fail. I ran locally with two different 7.3 versions and all tests passed, so I'm not too concerned about merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants