UTF8::icompare unexpected behavior #254

@pqvst

Let's say I have two null-terminated C strings containing UTF-8 encoded text. The case-insensitive UTF8 compare function does not compare these two strings correctly. Test case:

// Create test string "A"
const wchar_t* wa = L"åäö";
std::string sa;
UnicodeConverter::toUTF8(wa, sa);
const char* ca = sa.c_str();

// Create test string "B"
const wchar_t* wb = L"ÅÄÖ";
std::string sb;
UnicodeConverter::toUTF8(wb, sb);
const char* cb = sb.c_str();

// Comparing the std::strings works as expected
bool sr = UTF8::icompare(sa, sb) == 0;
poco_assert (sr);

// Comparing the raw null-terminated strings does not work as expected
bool cr = UTF8::icompare(ca, cb) == 0;
poco_assert (cr);

The reason is clear from the source code of the icompare function. In the first case (both arguments are std::string), a TextIterator is used for each argument, so both are iterated by character (not by byte).

In the second case (null-terminated strings), a TextIterator is used only for the first argument. The first argument is therefore iterated by character while the second is iterated by byte, which cannot work for multi-byte UTF-8 characters. This is the overload that ends up doing the comparison:

int UTF8::icompare(const std::string& str, 
    std::string::size_type pos, 
    std::string::size_type n, 
    const std::string::value_type* ptr)

Perhaps this is "as-designed", but it seems a bit strange to me.
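
To make the mismatch concrete without depending on POCO, here is a minimal standalone sketch. It is not POCO's implementation: decodeNext, toLowerLatin1, icompareMixed and icompareCodePoints are hypothetical names invented for illustration, the decoder assumes well-formed UTF-8, and the case folding only covers ASCII and Latin-1 letters. icompareMixed mimics the behavior described above (first argument by character, second by byte), while icompareCodePoints decodes both sides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode one UTF-8 code point at s[i] and advance i.
// Assumes well-formed input; no validation is done to keep the sketch short.
std::uint32_t decodeNext(const std::string& s, std::size_t& i)
{
    assert(i < s.size());
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;                       // single-byte (ASCII)
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    std::uint32_t cp = c & (0x3F >> extra);       // payload bits of the lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Hypothetical helper: lower-case ASCII and Latin-1 letters only.
std::uint32_t toLowerLatin1(std::uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z') return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 0x20;
    return cp;
}

// Mimics the reported behavior: a is read by code point, b by byte.
int icompareMixed(const std::string& a, const char* b)
{
    std::size_t i = 0;
    while (i < a.size() && *b)
    {
        std::uint32_t c1 = toLowerLatin1(decodeNext(a, i));
        std::uint32_t c2 = toLowerLatin1(static_cast<unsigned char>(*b++));
        if (c1 != c2) return c1 < c2 ? -1 : 1;
    }
    if (i < a.size()) return 1;
    return *b ? -1 : 0;
}

// Decodes both sides, so multi-byte characters line up.
int icompareCodePoints(const std::string& a, const std::string& b)
{
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size())
    {
        std::uint32_t c1 = toLowerLatin1(decodeNext(a, i));
        std::uint32_t c2 = toLowerLatin1(decodeNext(b, j));
        if (c1 != c2) return c1 < c2 ? -1 : 1;
    }
    if (i < a.size()) return 1;
    return j < b.size() ? -1 : 0;
}
```

With the UTF-8 bytes of "åäö" ("\xC3\xA5\xC3\xA4\xC3\xB6") and "ÅÄÖ" ("\xC3\x85\xC3\x84\xC3\x96"), icompareCodePoints reports equality, while icompareMixed compares the decoded U+00E5 against the raw lead byte 0xC3 and reports a difference. Pure ASCII input such as "abc" and "ABC" still compares equal in both, which is probably why this goes unnoticed. Until the overload is changed, wrapping the raw pointers in std::string before calling UTF8::icompare should sidestep the problem, since the std::string case works as shown in the test case.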
