-
Notifications
You must be signed in to change notification settings - Fork 8k
Fix #53823 - empty matches could result in garbled UTF-8 despite u modifier #1327
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ugh the u modifier was given
|
@cmb69 i haven't test it yet, but weren't it safer to use something like php_next_utf8_char() if you wanted to get a valid UTF-8 char? Like strlen(php_next_utf8_char(...)) or advance by 1 byte, if not. The 'u' modifier expects an UTF-8 string per se. If the charset is broken, so the behavior is undefined. IMHO there should be more error check, so the behavior were more predictable at the end. Thanks. |
|
@weltling I wasn't aware of the existance of php_next_utf8_char(). I'll have a look at it, and add some tests for broken UTF-8 strings. |
|
@cmb69, great, let's see to which conclusion you came, i'll be testing subsequently then. Thanks. |
|
@weltling At this point |
|
Patch looks good to me. I'd only suggest extracting the duplicate code into an inline function. |
|
@weltling @nikic I've did some very quick experiments with php_next_utf8_char() a while ago, but that didn't work out well (the API is overly complex for this purpose). And if UTF-8 validity is guaranteed at this point, the simple while loop is sufficient, and faster anyway. Wrt. inlining: is zend_always_inline the proper way to declare this? |
|
@nikic, to me it looks like it operates on the plain subject passed. Thanks. |
|
@weltling For php_pcre_match_impl() the UTF-8 validity is supposed to be guaranteed further above. For php_pcre_replace_impl() it's here. |
|
@cmb69, ack ) |
|
@cmb69 tested and don't see any issue. I'd still suggest to add a test with invalid utf8 though. Fe, i've modified the string 'áéíóú' in one of the tests just appending 'ü' encoded with cp1252, and that already delivers NULL. Maybe that's nitpicking though :) But otherwise tests are good. Thanks. |
|
@weltling I've added some tests for invalid UTF-8 sequences. You're right: better safe than sorry. :) I suppose the patch should be merged into PHP 5.6 and master. |
|
@cmb69, if there are no other comments, i guess it can be merged. Thanks for the effort :) |
|
Comment on behalf of cmb at php.net: Merged to 5.5, 5.6 and master. |
When advancing after empty matches,
php_pcre_match_impl()as well asphp_pcre_replace_impl()always advance to the next byte instead of the next code point when theumodifier is given. This may result in garbled UTF-8 results, and maybe in match errors. This PR fixes that.Note that this issue has been fixed for
php_pcre_split_impl()long ago. I've noticed that, only after having already implemented another solution: instead of using an additional PCRE, the string is iterated over looking for UTF-8 start sequences. The latter solution is more lightweight than the former, so it seems approrpiate to do it this way. The already implemented solution forphp_pcre_split_impl()could be replaced, but as it is not broken, it seems reasonable to keep it – at least for PHP 5.I've tested the patches on Windows 7 and Ubuntu 14.04 (bundled PCRE lib only).