Fix #53823 - empty matches could result in garbled UTF-8 despite u modifier #1327

cmb69 · 2015-06-05T13:12:15Z

When advancing after an empty match, php_pcre_match_impl() and php_pcre_replace_impl() always advance to the next byte instead of the next code point, even when the u modifier is given. This may produce garbled UTF-8 output, and possibly match errors. This PR fixes that.

Note that this issue was fixed for php_pcre_split_impl() long ago. I noticed that only after I had already implemented a different solution: instead of using an additional PCRE, the string is iterated over, looking for UTF-8 start sequences. The latter solution is more lightweight than the former, so it seems appropriate to do it this way. The existing solution for php_pcre_split_impl() could be replaced, but as it is not broken, it seems reasonable to keep it, at least for PHP 5.
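To illustrate the idea of looking for UTF-8 start sequences, here is a minimal sketch in C. It is not the code from the patch: the name `utf8_unit_length` and the `utf8_mode` flag are hypothetical, and the function assumes the subject is NUL-terminated, already known to be valid UTF-8, and that `start` does not point at the terminator.

```c
#include <stddef.h>

/* Hypothetical helper (not the patch's code): number of bytes to advance
 * past the character that begins at `start`.
 * Assumptions: the string is NUL-terminated, valid UTF-8, and `*start`
 * is not the terminating NUL. */
static size_t utf8_unit_length(const char *start, int utf8_mode)
{
    const char *end = start;

    if (!utf8_mode) {
        return 1; /* without the u modifier, one unit is one byte */
    }

    /* Step over the lead byte, then skip every continuation byte
     * (bit pattern 10xxxxxx) until the next start byte or the NUL. */
    do {
        end++;
    } while (((unsigned char)*end & 0xC0) == 0x80);

    return (size_t)(end - start);
}
```

With this sketch, an empty match in front of 'á' (bytes 0xC3 0xA1) would advance the offset by 2 in UTF-8 mode rather than by 1, so the next match attempt does not start in the middle of a code point.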

I've tested the patches on Windows 7 and Ubuntu 14.04 (bundled PCRE lib only).


weltling · 2015-06-08T13:12:49Z

@cmb69 I haven't tested it yet, but wouldn't it be safer to use something like php_next_utf8_char() if you want to get a valid UTF-8 char? Like strlen(php_next_utf8_char(...)), or advance by 1 byte if not.

The 'u' modifier expects a UTF-8 string per se. If the charset is broken, the behavior is undefined. IMHO there should be more error checking, so that the behavior is more predictable in the end.

Thanks.

cmb69 · 2015-06-08T14:35:20Z

@weltling I wasn't aware of the existence of php_next_utf8_char(). I'll have a look at it, and add some tests for broken UTF-8 strings.

weltling · 2015-06-08T14:55:37Z

@cmb69, great, let's see what conclusion you come to; I'll test afterwards.

Thanks.

nikic · 2015-06-17T20:32:47Z

@weltling At this point pcre_exec will have already checked that the string is valid UTF-8.

nikic · 2015-06-17T20:42:14Z

Patch looks good to me. I'd only suggest extracting the duplicate code into an inline function.

cmb69 · 2015-06-17T20:55:36Z

@weltling @nikic I did some very quick experiments with php_next_utf8_char() a while ago, but that didn't work out well (the API is overly complex for this purpose). And if UTF-8 validity is guaranteed at this point, the simple while loop is sufficient, and faster anyway.

Wrt. inlining: is zend_always_inline the proper way to declare this?

weltling · 2015-06-17T21:18:33Z

@cmb69, but did you at least ensure that it works as you expect if you pass invalid UTF-8? @nikic, where does the check happen then?

weltling · 2015-06-17T21:21:46Z

@nikic, to me it looks like it operates on the plain subject passed.

Thanks.

cmb69 · 2015-06-17T21:36:02Z

@weltling For php_pcre_match_impl() the UTF-8 validity is supposed to be guaranteed further above. For php_pcre_replace_impl() it's here.

weltling · 2015-06-17T21:45:35Z

@cmb69, ack )

cmb69 · 2015-06-18T00:25:20Z

@weltling @nikic Please review.

weltling · 2015-06-18T07:53:50Z

@cmb69 tested and don't see any issues. I'd still suggest adding a test with invalid UTF-8 though. E.g., I modified the string 'áéíóú' in one of the tests by just appending 'ü' encoded in cp1252, and that already delivers NULL. Maybe that's nitpicking though :) But otherwise the tests are good.

Thanks.
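For readers wondering why a cp1252 'ü' breaks the subject: in cp1252 that character is the single byte 0xFC, which cannot start a UTF-8 sequence. The sketch below is a simplified structural check written for illustration only. It is not PCRE's validator, the name `looks_like_utf8` is hypothetical, and it deliberately does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/* Hypothetical, simplified structural check (NOT PCRE's validation):
 * verifies only that every lead byte is followed by the right number of
 * continuation bytes. Overlong forms and surrogates are not rejected. */
static int looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need; /* continuation bytes expected after this lead byte */

        if (c < 0x80) {
            need = 0;                     /* ASCII */
        } else if ((c & 0xE0) == 0xC0) {
            need = 1;                     /* 110xxxxx */
        } else if ((c & 0xF0) == 0xE0) {
            need = 2;                     /* 1110xxxx */
        } else if ((c & 0xF8) == 0xF0) {
            need = 3;                     /* 11110xxx */
        } else {
            return 0; /* stray continuation byte or invalid lead (e.g. 0xFC) */
        }

        if (need > len - i - 1) {
            return 0; /* sequence is truncated */
        }
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return 0; /* expected a continuation byte (10xxxxxx) */
            }
        }
        i += need + 1;
    }
    return 1;
}
```

Under these assumptions, the UTF-8 bytes of 'áéíóú' pass the check, while the same bytes followed by 0xFC fail it, which is consistent with the NULL result reported for the modified test string.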

cmb69 · 2015-06-18T09:58:37Z

@weltling I've added some tests for invalid UTF-8 sequences. You're right: better safe than sorry. :)

I suppose the patch should be merged into PHP 5.6 and master.

weltling · 2015-06-18T10:40:46Z

@cmb69, if there are no other comments, i guess it can be merged.

Thanks for the effort :)

php-pulls · 2015-06-23T17:47:43Z

Comment on behalf of cmb at php.net:

Merged to 5.5, 5.6 and master.

cmb69 added 2 commits June 5, 2015 14:40

added failing tests

b91fffb

fixed bug, where empty matches could result in garbled UTF-8 even though the u modifier was given

edbbb7c

laruence added the Bug label Jun 17, 2015

extracted calculate_unit_length()

8d10a2f

added tests for invalid UTF-8

c3a4037

php-pulls closed this Jun 23, 2015

cmb69 deleted the pcre-offset branch July 12, 2015 22:46
