@masakielastic
Contributor

This pull request adds validation for encoding. A legacy UTF-8 decoder that does not check trailing bytes can convert some ill-formed byte sequences into dangerous ASCII characters. htmlspecialchars and the PCRE functions reject such sequences. The Unicode Security Guide shows an example: "C2 22 3C" can be treated as "C2 22" + "3C", and "3C" is "<".

// http://websec.github.io/unicode-security-guide/character-transformations/#handling
//
// Code point        First byte  Second byte  Third byte  Fourth byte
// U+0000..U+007F    00..7F
// U+0080..U+07FF    C2..DF      80..BF
// U+0800..U+0FFF    E0          A0..BF       80..BF

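// Ill-formed: lead byte C2 is followed by 22 ("), which is not a trailing byte (80..BF).
$str = "\xC2\x22\x3C";
var_dump(
    "" === htmlspecialchars($str),        // rejected: returns an empty string
    false === preg_match('/./u', $str)    // rejected: returns false
);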
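
For illustration only, and not the code in this patch: a minimal sketch of how a decoder that skips the trailing-byte check would misread the sequence, next to a strict well-formedness check. The names `lax_decode_utf8` and `is_well_formed_utf8` are hypothetical, and the lax decoder handles only one- and two-byte forms to keep the sketch short.

```php
<?php
// Hypothetical helper: decodes like a legacy decoder that trusts the lead byte
// and never verifies that the following byte is in 80..BF.
function lax_decode_utf8(string $bytes): array
{
    $codePoints = [];
    $n = strlen($bytes);
    $i = 0;
    while ($i < $n) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            // Single-byte (ASCII) form.
            $codePoints[] = $b;
            $i += 1;
        } elseif ($b >= 0xC2 && $b <= 0xDF && $i + 1 < $n) {
            // Deliberately unsafe: the second byte is consumed without
            // checking that it is a trailing byte.
            $codePoints[] = (($b & 0x1F) << 6) | (ord($bytes[$i + 1]) & 0x3F);
            $i += 2;
        } else {
            // Anything else is replaced with U+FFFD in this sketch.
            $codePoints[] = 0xFFFD;
            $i += 1;
        }
    }
    return $codePoints;
}

// Hypothetical helper: strict check using PCRE's UTF-8 mode, which fails
// (returns false) on ill-formed input.
function is_well_formed_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// The lax decoder swallows 0x22 (") into a bogus U+00A2 and leaves "<" behind.
var_dump(lax_decode_utf8("\xC2\x22\x3C") === [0xA2, 0x3C]);
// The strict check rejects the same input.
var_dump(is_well_formed_utf8("\xC2\x22\x3C"));
```

Running the strict check before any decoding or escaping step is the design choice the sketch is meant to show: the input is rejected as a whole rather than partially interpreted.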

@ghost

ghost commented Mar 7, 2015

Can one of the admins verify this patch?

@smalyshev added the Bug label Mar 9, 2015
@weltling
Contributor

Merged to master. Thanks!

@nikic closed this Jul 17, 2016