Skip to content

Conversation

@cmb69
Copy link
Member

@cmb69 cmb69 commented Jun 5, 2015

When advancing after empty matches, php_pcre_match_impl() as well as php_pcre_replace_impl() always advance to the next byte instead of the next code point when the u modifier is given. This may result in garbled UTF-8 results, and maybe in match errors. This PR fixes that.

Note that this issue has been fixed for php_pcre_split_impl() long ago. I've noticed that, only after having already implemented another solution: instead of using an additional PCRE, the string is iterated over looking for UTF-8 start sequences. The latter solution is more lightweight than the former, so it seems approrpiate to do it this way. The already implemented solution for php_pcre_split_impl() could be replaced, but as it is not broken, it seems reasonable to keep it – at least for PHP 5.

I've tested the patches on Windows 7 and Ubuntu 14.04 (bundled PCRE lib only).

@weltling
Copy link
Contributor

weltling commented Jun 8, 2015

@cmb69 i haven't test it yet, but weren't it safer to use something like php_next_utf8_char() if you wanted to get a valid UTF-8 char? Like strlen(php_next_utf8_char(...)) or advance by 1 byte, if not.

The 'u' modifier expects an UTF-8 string per se. If the charset is broken, so the behavior is undefined. IMHO there should be more error check, so the behavior were more predictable at the end.

Thanks.

@cmb69
Copy link
Member Author

cmb69 commented Jun 8, 2015

@weltling I wasn't aware of the existance of php_next_utf8_char(). I'll have a look at it, and add some tests for broken UTF-8 strings.

@weltling
Copy link
Contributor

weltling commented Jun 8, 2015

@cmb69, great, let's see to which conclusion you came, i'll be testing subsequently then.

Thanks.

@laruence laruence added the Bug label Jun 17, 2015
@nikic
Copy link
Member

nikic commented Jun 17, 2015

@weltling At this point pcre_exec will have already checked that the string is valid UTF-8.

@nikic
Copy link
Member

nikic commented Jun 17, 2015

Patch looks good to me. I'd only suggest extracting the duplicate code into an inline function.

@cmb69
Copy link
Member Author

cmb69 commented Jun 17, 2015

@weltling @nikic I've did some very quick experiments with php_next_utf8_char() a while ago, but that didn't work out well (the API is overly complex for this purpose). And if UTF-8 validity is guaranteed at this point, the simple while loop is sufficient, and faster anyway.

Wrt. inlining: is zend_always_inline the proper way to declare this?

@weltling
Copy link
Contributor

@cmb69, but did you at least ensured that if you pass an invalid utf8, it works as you expect? @nikic, where the check happens then?

@weltling
Copy link
Contributor

@nikic, to me it looks like it operates on the plain subject passed.

Thanks.

@cmb69
Copy link
Member Author

cmb69 commented Jun 17, 2015

@weltling For php_pcre_match_impl() the UTF-8 validity is supposed to be guaranteed further above. For php_pcre_replace_impl() it's here.

@weltling
Copy link
Contributor

@cmb69, ack )

@cmb69
Copy link
Member Author

cmb69 commented Jun 18, 2015

@weltling @nikic Please review.

@weltling
Copy link
Contributor

@cmb69 tested and don't see any issue. I'd still suggest to add a test with invalid utf8 though. Fe, i've modified the string 'áéíóú' in one of the tests just appending 'ü' encoded with cp1252, and that already delivers NULL. Maybe that's nitpicking though :) But otherwise tests are good.

Thanks.

@cmb69
Copy link
Member Author

cmb69 commented Jun 18, 2015

@weltling I've added some tests for invalid UTF-8 sequences. You're right: better safe than sorry. :)

I suppose the patch should be merged into PHP 5.6 and master.

@weltling
Copy link
Contributor

@cmb69, if there are no other comments, i guess it can be merged.

Thanks for the effort :)

@php-pulls
Copy link

Comment on behalf of cmb at php.net:

Merged to 5.5, 5.6 and master.

@php-pulls php-pulls closed this Jun 23, 2015
@cmb69 cmb69 deleted the pcre-offset branch July 12, 2015 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants