Use correct class for UTF-16 to/from UTF-8 conversion #559
Conversation
Pull request overview
Adjusts the standard-library Unicode conversion utilities to use the correct UTF-8↔UTF-16 conversion facet, and adds a currently disabled integration test covering multi-language UTF text.
Changes:
- Switched UTF-8 conversion facets for char16_t/wchar_t to std::codecvt_utf8_utf16
- Removed char32_t-based conversion helpers/specializations
- Added a disabled integration test for UTF string round-tripping through ODBC
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| driver/utils/conversion_std.h | Updates conversion facets used for UTF-8↔UTF-16 and removes char32_t conversions |
| driver/test/misc_it.cpp | Adds a disabled UTF-character integration test that binds/inserts/selects Unicode strings |
driver/utils/conversion_std.h
Outdated
  std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> UCS2_converter_char16;
  #endif
- std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> UCS2_converter_wchar;
+ std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> UCS2_converter_wchar;
std::codecvt_utf8_utf16<wchar_t> is only correct when wchar_t stores UTF-16 code units (typically Windows where sizeof(wchar_t)==2). On platforms where wchar_t is 32-bit (common on Linux), this facet will interpret the input as UTF-16 and produce incorrect conversions. Consider selecting the facet based on sizeof(wchar_t) (UTF-16: codecvt_utf8_utf16<wchar_t>, UTF-32: codecvt_utf8<wchar_t>) or routing wide conversions through char16_t/char32_t explicitly.
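The suggested facet selection can be sketched as follows. This is a minimal sketch, not the driver's actual code; `wide_to_utf8` is a hypothetical helper name:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Minimal sketch of the suggested fix: pick the UTF-8 facet from the width
// of wchar_t. A 2-byte wchar_t (Windows) holds UTF-16 code units, so
// codecvt_utf8_utf16 is correct; a 4-byte wchar_t (most Unix-like systems)
// holds UTF-32 code points, so codecvt_utf8 is correct.
// Note: these facets are deprecated since C++17 but still available.
inline std::string wide_to_utf8(const std::wstring & src) {
    if constexpr (sizeof(wchar_t) == 2) {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
        return convert.to_bytes(src);
    } else {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
        return convert.to_bytes(src);
    }
}
```

Because the branch is decided at compile time, the wrong facet is never instantiated for the platform's wchar_t width.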
      return {};

-     std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
+     std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
Same issue as above: std::codecvt_utf8_utf16<wchar_t> assumes wchar_t holds UTF-16 code units, which produces incorrect conversions on platforms with a 32-bit wchar_t. Select the facet based on sizeof(wchar_t), or route wide conversions through char16_t/char32_t explicitly.
  inline std::string toUTF8(const char16_t * src, SQLLEN length = SQL_NTS) {
      if (!src || (length != SQL_NTS && length <= 0))
          return {};

-     std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> convert;
+     std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
      return (length == SQL_NTS ? convert.to_bytes(src) : convert.to_bytes(src, src + length));
  }

- inline std::string toUTF8(const char32_t * src, SQLLEN length = SQL_NTS) {
-     if (!src || (length != SQL_NTS && length <= 0))
-         return {};
-
-     std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
-     return (length == SQL_NTS ? convert.to_bytes(src) : convert.to_bytes(src, src + length));
- }
This change removes the char32_t overload/specialization for UTF-8 conversion (previously supported via std::codecvt_utf8<char32_t>). That’s a breaking API surface change within this header. If char32_t support is still needed, keep the char32_t overload/specialization using std::codecvt_utf8<char32_t> (UTF-8↔UTF-32), or explicitly deprecate and document the removal if it’s intentional.
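If char32_t support is retained, the removed path can be sketched like this. This is a hypothetical free function mirroring the removed overload, with SQL_NTS handling omitted for brevity:

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Minimal sketch of the removed char32_t path: std::codecvt_utf8<char32_t>
// converts between UTF-8 and UTF-32 (one char32_t per code point, including
// code points beyond the BMP). Deprecated since C++17 but still available.
inline std::string utf32_to_utf8(const char32_t * src, std::size_t length) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.to_bytes(src, src + length);
}
```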
driver/utils/conversion_std.h
Outdated
      return context.UCS2_converter_char16.from_bytes(src);
  }

- template <>
- inline decltype(auto) fromUTF8<char32_t>(const std::string & src, UnicodeConversionContext & context) {
-     return context.UCS2_converter_char32.from_bytes(src);
- }
  #endif
As above: this removes the char32_t overload/specialization. If char32_t support is still needed, keep it via std::codecvt_utf8<char32_t> (UTF-8↔UTF-32); otherwise explicitly deprecate and document the removal.
driver/utils/conversion_std.h
Outdated
  std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> UCS2_converter_char16;
  #endif
- std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> UCS2_converter_wchar;
+ std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> UCS2_converter_wchar;
The member names UCS2_converter_* are misleading now that std::codecvt_utf8_utf16 is used (UTF-16 with surrogate pairs, not UCS-2). Renaming these to reflect UTF-16 (e.g., UTF16_converter_* or utf8_utf16_converter_*) will reduce confusion and prevent incorrect assumptions about supported code points.
  STMT_OK(SQLFetch(hstmt));
  STMT_OK(SQLGetData(hstmt, 1, SQL_C_BINARY, buffer, sizeof(buffer), &bytes_read));
  ASSERT_TRUE(std::equal(std::begin(buffer), std::begin(buffer) + bytes_read, std::begin(string)));
  STMT_OK(SQLFetch(hstmt));
  STMT_OK(SQLGetData(hstmt, 1, SQL_C_BINARY, buffer, sizeof(buffer), &bytes_read));
  ASSERT_TRUE(std::equal(std::begin(buffer), std::begin(buffer) + bytes_read, std::begin(string)));
bytes_read from SQLGetData can be SQL_NULL_DATA (negative) or SQL_NO_TOTAL, and it represents bytes (not necessarily a safe std::equal range length). Using std::begin(buffer) + bytes_read without validating bytes_read >= 0 (and not handling SQL_NO_TOTAL) can lead to invalid iterator arithmetic / UB when/if this test is enabled. Add checks for these sentinel values and only compare a validated length (and ideally verify the result is NUL-terminated if you expect a C-string).
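One way to guard the comparison, sketched under the assumption of the standard ODBC sentinel values (SQL_NULL_DATA == -1, SQL_NO_TOTAL == -4, normally taken from <sqlext.h>); `validated_length` is a hypothetical helper, not part of the test:

```cpp
#include <cstddef>

// Fallback definitions for the sketch; the real test should use <sql.h>/<sqlext.h>.
#ifndef SQL_NULL_DATA
#define SQL_NULL_DATA (-1)
#endif
#ifndef SQL_NO_TOTAL
#define SQL_NO_TOTAL (-4)
#endif

// Returns a byte count that is safe to use as a std::equal range length,
// or -1 when the indicator is a sentinel (NULL value, unknown total) or
// exceeds the buffer, in which case the comparison must be skipped.
inline long validated_length(long bytes_read, std::size_t buffer_size) {
    if (bytes_read == SQL_NULL_DATA || bytes_read == SQL_NO_TOTAL || bytes_read < 0)
        return -1;
    if (static_cast<std::size_t>(bytes_read) > buffer_size)
        return -1;
    return bytes_read;
}
```

The test would then only call std::equal when validated_length returns a non-negative value.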
driver/test/misc_it.cpp
Outdated
  STMT_OK(SQLBindParameter(
      hstmt, 1,
      SQL_PARAM_INPUT,
      getCTypeFor<decltype(param.data())>(),
      SQL_VARCHAR,
      1024,
      0,
      param.data(),
      param.size(),
      NULL));
The SQLBindParameter call hardcodes SQL_VARCHAR and passes BufferLength = param.size(). If PTChar is a wide type (Unicode driver), this mismatches the SQL type (SQL_WVARCHAR is expected) and BufferLength should be in bytes (typically param.size() * sizeof(PTChar)), plus you generally want a non-NULL StrLen_or_IndPtr (or SQL_NTS) so the driver knows the actual length. As written, enabling this test is likely to produce incorrect binding behavior for UTF-16 parameters.
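The type and length selection the comment asks for can be sketched as follows, assuming the standard ODBC type codes (SQL_VARCHAR == 12, SQL_WVARCHAR == -9, normally from <sqlext.h>); `ParamBinding` is a hypothetical helper, not part of the driver:

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Fallback definitions for the sketch; real code should use <sqlext.h>.
#ifndef SQL_VARCHAR
#define SQL_VARCHAR 12
#endif
#ifndef SQL_WVARCHAR
#define SQL_WVARCHAR (-9)
#endif

// Narrow parameters bind as SQL_VARCHAR; wide (UTF-16) parameters must bind
// as SQL_WVARCHAR, and BufferLength is counted in bytes, not characters.
template <typename CharT>
struct ParamBinding {
    static constexpr int sql_type =
        std::is_same<CharT, char>::value ? SQL_VARCHAR : SQL_WVARCHAR;

    static std::size_t buffer_length_bytes(const std::basic_string<CharT> & param) {
        return param.size() * sizeof(CharT);
    }
};
```

A non-NULL StrLen_or_IndPtr (e.g. pointing at a variable holding SQL_NTS or the byte length) would then tell the driver the actual parameter length.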