Use correct class for UTF-16 to/from UTF-8 conversion #559
Conversation
Pull request overview
Adjusts the standard-library Unicode conversion utilities to use the correct UTF-8↔UTF-16 conversion facet, and adds a currently disabled integration test covering multi-language UTF text.
Changes:
- Switched UTF-8 conversion facets for char16_t/wchar_t to std::codecvt_utf8_utf16
- Removed char32_t-based conversion helpers/specializations
- Added a disabled integration test for UTF string round-tripping through ODBC
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| driver/utils/conversion_std.h | Updates conversion facets used for UTF-8↔UTF-16 and removes char32_t conversions |
| driver/test/misc_it.cpp | Adds a disabled UTF-character integration test that binds/inserts/selects Unicode strings |
driver/utils/conversion_std.h
Outdated
  std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> UCS2_converter_char16;
  #endif
- std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> UCS2_converter_wchar;
+ std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> UCS2_converter_wchar;
std::codecvt_utf8_utf16<wchar_t> is only correct when wchar_t stores UTF-16 code units (typically Windows where sizeof(wchar_t)==2). On platforms where wchar_t is 32-bit (common on Linux), this facet will interpret the input as UTF-16 and produce incorrect conversions. Consider selecting the facet based on sizeof(wchar_t) (UTF-16: codecvt_utf8_utf16<wchar_t>, UTF-32: codecvt_utf8<wchar_t>) or routing wide conversions through char16_t/char32_t explicitly.
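The suggested facet selection can be sketched as follows. This is a minimal sketch, not the driver's actual code; `wide_to_utf8` is a hypothetical helper name:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Minimal sketch of the suggested fix: pick the UTF-8 facet from the width
// of wchar_t. A 2-byte wchar_t (Windows) holds UTF-16 code units, so
// codecvt_utf8_utf16 is correct; a 4-byte wchar_t (most Unix-like systems)
// holds UTF-32 code points, so codecvt_utf8 is correct.
// Note: these facets are deprecated since C++17 but still available.
inline std::string wide_to_utf8(const std::wstring & src) {
    if constexpr (sizeof(wchar_t) == 2) {
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
        return convert.to_bytes(src);
    } else {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
        return convert.to_bytes(src);
    }
}
```

Because the branch is decided at compile time, the wrong facet is never instantiated for the platform's wchar_t width.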
      return {};

-     std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> convert;
+     std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
Same issue as above: std::codecvt_utf8_utf16<wchar_t> assumes wchar_t holds UTF-16 code units, which produces incorrect conversions on platforms with a 32-bit wchar_t. Select the facet based on sizeof(wchar_t), or route wide conversions through char16_t/char32_t explicitly.
  inline std::string toUTF8(const char16_t * src, SQLLEN length = SQL_NTS) {
      if (!src || (length != SQL_NTS && length <= 0))
          return {};

-     std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> convert;
+     std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
      return (length == SQL_NTS ? convert.to_bytes(src) : convert.to_bytes(src, src + length));
  }

- inline std::string toUTF8(const char32_t * src, SQLLEN length = SQL_NTS) {
-     if (!src || (length != SQL_NTS && length <= 0))
-         return {};
-
-     std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
-     return (length == SQL_NTS ? convert.to_bytes(src) : convert.to_bytes(src, src + length));
- }
This change removes the char32_t overload/specialization for UTF-8 conversion (previously supported via std::codecvt_utf8<char32_t>). That’s a breaking API surface change within this header. If char32_t support is still needed, keep the char32_t overload/specialization using std::codecvt_utf8<char32_t> (UTF-8↔UTF-32), or explicitly deprecate and document the removal if it’s intentional.
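If char32_t support is retained, the removed path can be sketched like this. This is a hypothetical free function mirroring the removed overload, with SQL_NTS handling omitted for brevity:

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Minimal sketch of the removed char32_t path: std::codecvt_utf8<char32_t>
// converts between UTF-8 and UTF-32 (one char32_t per code point, including
// code points beyond the BMP). Deprecated since C++17 but still available.
inline std::string utf32_to_utf8(const char32_t * src, std::size_t length) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
    return convert.to_bytes(src, src + length);
}
```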
driver/utils/conversion_std.h
Outdated
      return context.UCS2_converter_char16.from_bytes(src);
  }

- template <>
- inline decltype(auto) fromUTF8<char32_t>(const std::string & src, UnicodeConversionContext & context) {
-     return context.UCS2_converter_char32.from_bytes(src);
- }
  #endif
As above: this removes the char32_t overload/specialization. If char32_t support is still needed, keep it via std::codecvt_utf8<char32_t> (UTF-8↔UTF-32); otherwise explicitly deprecate and document the removal.
driver/utils/conversion_std.h
Outdated
  std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> UCS2_converter_char16;
  #endif
- std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> UCS2_converter_wchar;
+ std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> UCS2_converter_wchar;
The member names UCS2_converter_* are misleading now that std::codecvt_utf8_utf16 is used (UTF-16 with surrogate pairs, not UCS-2). Renaming these to reflect UTF-16 (e.g., UTF16_converter_* or utf8_utf16_converter_*) will reduce confusion and prevent incorrect assumptions about supported code points.
  STMT_OK(SQLFetch(hstmt));
  STMT_OK(SQLGetData(hstmt, 1, SQL_C_BINARY, buffer, sizeof(buffer), &bytes_read));
  ASSERT_TRUE(std::equal(std::begin(buffer), std::begin(buffer) + bytes_read, std::begin(string)));
  STMT_OK(SQLFetch(hstmt));
  STMT_OK(SQLGetData(hstmt, 1, SQL_C_BINARY, buffer, sizeof(buffer), &bytes_read));
  ASSERT_TRUE(std::equal(std::begin(buffer), std::begin(buffer) + bytes_read, std::begin(string)));
bytes_read from SQLGetData can be SQL_NULL_DATA (negative) or SQL_NO_TOTAL, and it represents bytes (not necessarily a safe std::equal range length). Using std::begin(buffer) + bytes_read without validating bytes_read >= 0 (and not handling SQL_NO_TOTAL) can lead to invalid iterator arithmetic / UB when/if this test is enabled. Add checks for these sentinel values and only compare a validated length (and ideally verify the result is NUL-terminated if you expect a C-string).
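One way to guard the comparison, sketched under the assumption of the standard ODBC sentinel values (SQL_NULL_DATA == -1, SQL_NO_TOTAL == -4, normally taken from <sqlext.h>); `validated_length` is a hypothetical helper, not part of the test:

```cpp
#include <cstddef>

// Fallback definitions for the sketch; the real test should use <sql.h>/<sqlext.h>.
#ifndef SQL_NULL_DATA
#define SQL_NULL_DATA (-1)
#endif
#ifndef SQL_NO_TOTAL
#define SQL_NO_TOTAL (-4)
#endif

// Returns a byte count that is safe to use as a std::equal range length,
// or -1 when the indicator is a sentinel (NULL value, unknown total) or
// exceeds the buffer, in which case the comparison must be skipped.
inline long validated_length(long bytes_read, std::size_t buffer_size) {
    if (bytes_read == SQL_NULL_DATA || bytes_read == SQL_NO_TOTAL || bytes_read < 0)
        return -1;
    if (static_cast<std::size_t>(bytes_read) > buffer_size)
        return -1;
    return bytes_read;
}
```

The test would then only call std::equal when validated_length returns a non-negative value.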
driver/test/misc_it.cpp
Outdated
  STMT_OK(SQLBindParameter(
      hstmt, 1,
      SQL_PARAM_INPUT,
      getCTypeFor<decltype(param.data())>(),
      SQL_VARCHAR,
      1024,
      0,
      param.data(),
      param.size(),
      NULL));
The SQLBindParameter call hardcodes SQL_VARCHAR and passes BufferLength = param.size(). If PTChar is a wide type (Unicode driver), this mismatches the SQL type (SQL_WVARCHAR is expected) and BufferLength should be in bytes (typically param.size() * sizeof(PTChar)), plus you generally want a non-NULL StrLen_or_IndPtr (or SQL_NTS) so the driver knows the actual length. As written, enabling this test is likely to produce incorrect binding behavior for UTF-16 parameters.
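The type and length selection the comment asks for can be sketched as follows, assuming the standard ODBC type codes (SQL_VARCHAR == 12, SQL_WVARCHAR == -9, normally from <sqlext.h>); `ParamBinding` is a hypothetical helper, not part of the driver:

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Fallback definitions for the sketch; real code should use <sqlext.h>.
#ifndef SQL_VARCHAR
#define SQL_VARCHAR 12
#endif
#ifndef SQL_WVARCHAR
#define SQL_WVARCHAR (-9)
#endif

// Narrow parameters bind as SQL_VARCHAR; wide (UTF-16) parameters must bind
// as SQL_WVARCHAR, and BufferLength is counted in bytes, not characters.
template <typename CharT>
struct ParamBinding {
    static constexpr int sql_type =
        std::is_same<CharT, char>::value ? SQL_VARCHAR : SQL_WVARCHAR;

    static std::size_t buffer_length_bytes(const std::basic_string<CharT> & param) {
        return param.size() * sizeof(CharT);
    }
};
```

A non-NULL StrLen_or_IndPtr (e.g. pointing at a variable holding SQL_NTS or the byte length) would then tell the driver the actual parameter length.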