Skip to content

Use correct class for UTF-16 to/from UTF-8 conversion#559

Merged
slabko merged 1 commit intomasterfrom
utf-16-conversion-fix
Feb 26, 2026
Merged

Use correct class for UTF-16 to/from UTF-8 conversion#559
slabko merged 1 commit intomasterfrom
utf-16-conversion-fix

Conversation

@slabko
Copy link
Copy Markdown
Contributor

@slabko slabko commented Feb 25, 2026

No description provided.

@slabko slabko force-pushed the utf-16-conversion-fix branch 2 times, most recently from 9cbaf0c to 0e99a3e Compare February 25, 2026 18:24
@slabko slabko marked this pull request as ready for review February 25, 2026 21:08
@slabko slabko requested a review from mzitnik as a code owner February 25, 2026 21:08
@mshustov mshustov requested a review from Copilot February 26, 2026 10:33
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts the standard-library Unicode conversion utilities to use the correct UTF-8↔UTF-16 conversion facet, and adds an (currently disabled) integration test covering multi-language UTF text.

Changes:

  • Switched UTF-8 conversion facets for char16_t/wchar_t to std::codecvt_utf8_utf16
  • Removed char32_t-based conversion helpers/specializations
  • Added a disabled integration test for UTF string round-tripping through ODBC

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
driver/utils/conversion_std.h Updates conversion facets used for UTF-8↔UTF-16 and removes char32_t conversions
driver/test/misc_it.cpp Adds a disabled UTF-character integration test that binds/inserts/selects Unicode strings

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +17 to +19
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> UCS2_converter_char16;
#endif
std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> UCS2_converter_wchar;
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> UCS2_converter_wchar;
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::codecvt_utf8_utf16<wchar_t> is only correct when wchar_t stores UTF-16 code units (typically Windows where sizeof(wchar_t)==2). On platforms where wchar_t is 32-bit (common on Linux), this facet will interpret the input as UTF-16 and produce incorrect conversions. Consider selecting the facet based on sizeof(wchar_t) (UTF-16: codecvt_utf8_utf16<wchar_t>, UTF-32: codecvt_utf8<wchar_t>) or routing wide conversions through char16_t/char32_t explicitly.

Copilot uses AI. Check for mistakes.
return {};

std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::codecvt_utf8_utf16<wchar_t> is only correct when wchar_t stores UTF-16 code units (typically Windows where sizeof(wchar_t)==2). On platforms where wchar_t is 32-bit (common on Linux), this facet will interpret the input as UTF-16 and produce incorrect conversions. Consider selecting the facet based on sizeof(wchar_t) (UTF-16: codecvt_utf8_utf16<wchar_t>, UTF-32: codecvt_utf8<wchar_t>) or routing wide conversions through char16_t/char32_t explicitly.

Copilot uses AI. Check for mistakes.
Comment on lines 57 to 63
inline std::string toUTF8(const char16_t * src, SQLLEN length = SQL_NTS) {
if (!src || (length != SQL_NTS && length <= 0))
return {};

std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> convert;
return (length == SQL_NTS ? convert.to_bytes(src) : convert.to_bytes(src, src + length));
}

inline std::string toUTF8(const char32_t * src, SQLLEN length = SQL_NTS) {
if (!src || (length != SQL_NTS && length <= 0))
return {};

std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
return (length == SQL_NTS ? convert.to_bytes(src) : convert.to_bytes(src, src + length));
}
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change removes the char32_t overload/specialization for UTF-8 conversion (previously supported via std::codecvt_utf8<char32_t>). That’s a breaking API surface change within this header. If char32_t support is still needed, keep the char32_t overload/specialization using std::codecvt_utf8<char32_t> (UTF-8↔UTF-32), or explicitly deprecate and document the removal if it’s intentional.

Copilot uses AI. Check for mistakes.
Comment on lines 132 to 135
return context.UCS2_converter_char16.from_bytes(src);
}

template <>
inline decltype(auto) fromUTF8<char32_t>(const std::string & src, UnicodeConversionContext & context) {
return context.UCS2_converter_char32.from_bytes(src);
}
#endif
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change removes the char32_t overload/specialization for UTF-8 conversion (previously supported via std::codecvt_utf8<char32_t>). That’s a breaking API surface change within this header. If char32_t support is still needed, keep the char32_t overload/specialization using std::codecvt_utf8<char32_t> (UTF-8↔UTF-32), or explicitly deprecate and document the removal if it’s intentional.

Copilot uses AI. Check for mistakes.
Comment on lines +17 to +19
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> UCS2_converter_char16;
#endif
std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> UCS2_converter_wchar;
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> UCS2_converter_wchar;
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The member names UCS2_converter_* are misleading now that std::codecvt_utf8_utf16 is used (UTF-16 with surrogate pairs, not UCS-2). Renaming these to reflect UTF-16 (e.g., UTF16_converter_* or utf8_utf16_converter_*) will reduce confusion and prevent incorrect assumptions about supported code points.

Copilot uses AI. Check for mistakes.
Comment on lines +69 to +74
STMT_OK(SQLFetch(hstmt));
STMT_OK(SQLGetData(hstmt, 1, SQL_C_BINARY, buffer, sizeof(buffer), &bytes_read));
ASSERT_TRUE(std::equal(std::begin(buffer), std::begin(buffer) + bytes_read, std::begin(string)));
STMT_OK(SQLFetch(hstmt));
STMT_OK(SQLGetData(hstmt, 1, SQL_C_BINARY, buffer, sizeof(buffer), &bytes_read));
ASSERT_TRUE(std::equal(std::begin(buffer), std::begin(buffer) + bytes_read, std::begin(string)));
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bytes_read from SQLGetData can be SQL_NULL_DATA (negative) or SQL_NO_TOTAL, and it represents bytes (not necessarily a safe std::equal range length). Using std::begin(buffer) + bytes_read without validating bytes_read >= 0 (and not handling SQL_NO_TOTAL) can lead to invalid iterator arithmetic / UB when/if this test is enabled. Add checks for these sentinel values and only compare a validated length (and ideally verify the result is NUL-terminated if you expect a C-string).

Copilot uses AI. Check for mistakes.
Comment on lines +51 to +60
STMT_OK(SQLBindParameter(
hstmt, 1,
SQL_PARAM_INPUT,
getCTypeFor<decltype(param.data())>(),
SQL_VARCHAR,
1024,
0,
param.data(),
param.size(),
NULL));
Copy link

Copilot AI Feb 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SQLBindParameter call hardcodes SQL_VARCHAR and passes BufferLength = param.size(). If PTChar is a wide type (Unicode driver), this mismatches the SQL type (SQL_WVARCHAR is expected) and BufferLength should be in bytes (typically param.size() * sizeof(PTChar)), plus you generally want a non-NULL StrLen_or_IndPtr (or SQL_NTS) so the driver knows the actual length. As written, enabling this test is likely to produce incorrect binding behavior for UTF-16 parameters.

Copilot uses AI. Check for mistakes.
@slabko slabko force-pushed the utf-16-conversion-fix branch from 0e99a3e to 4ad8de6 Compare February 26, 2026 20:06
@slabko slabko merged commit 574876c into master Feb 26, 2026
20 checks passed
@slabko slabko deleted the utf-16-conversion-fix branch February 26, 2026 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants