Skip to content

Fix unicode support for python binding on windows#21961

Open
liuxingbaoyu wants to merge 10 commits intoopencv:4.xfrom
liuxingbaoyu:fix-python-unicode-support
Open

Fix unicode support for python binding on windows#21961
liuxingbaoyu wants to merge 10 commits intoopencv:4.xfrom
liuxingbaoyu:fix-python-unicode-support

Conversation

@liuxingbaoyu
Copy link
Copy Markdown

@liuxingbaoyu liuxingbaoyu commented May 9, 2022

The python binding always uses utf8 encoding to handle strings between c and python.

But the opencv c library encoding on windows is mbcs.

Therefore, parameters or returns containing non-ascii characters will become garbled or throw exceptions.

This PR makes python binding use mbcs encoding on windows to process strings in and out of the c library, making it support unicode.

Fixes #4242
Fixes #17600
Fixes #18305
Fixes #21960

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
  • The PR is proposed to the proper branch
  • There is a reference to the original bug report and related work
  • There is accuracy test, performance test and test data in opencv_extra repository, if applicable
    Patch to opencv_extra has the same branch name.
  • The feature is well documented and sample code can be built with the project CMake

Copy link
Copy Markdown
Contributor

@VadimLevin VadimLevin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for PR. Can you verify if this solution works if MBCS is not enabled (Based on this comment the issue is not always reproducible)?

@liuxingbaoyu
Copy link
Copy Markdown
Author

Thank you for PR. Can you verify if this solution works if MBCS is not enabled (Based on this comment the issue is not always reproducible)?

Thank you so much for your review!

Do you mean to test the utf8 mode introduced since Windows 10?

On my computer, when this feature is enabled, python starts abnormally (because the username contains unicode characters).

I checked python's source and it should work fine in utf8 mode.

With one exception, utf8 mode was introduced in April 2018, and CP_ACP was made equal to CP_UTF8 in May 2019 (1903).

Very few people use utf8 mode though, especially before version 1903 (because that would break most software).

In #18305 (comment) I explained why the code in #18305 (comment) cannot be reproduced .

You can try the tests in this PR, it should be 100% reproducible.

@VadimLevin
Copy link
Copy Markdown
Contributor

I don't have an access to the computer with Windows right now, but overall idea is to check if this patch don't introduce the regression to Windows users with enabled UTF-8 mode.
I would like to suggest the following schema. For Windows users first try MBCS conversion and if it fails then fallback to UTF8. For all other users - nothing new.

static inline PyObject* PyString_FromString(const char* u)
{
#if defined(_WIN32)
    PyObject* str = PyUnicode_DecodeMBCS(u, strlen(u), NULL);
    if (str) {
        return str;
    }
    // clears error indicator from failed conversion
    PyErr_Clear();
#endif
    return PyUnicode_FromString(u);
}

// PyString_FromStringAndSize looks similar to the above implementation.

and

static inline PyObject* PyUnicode_AsBytes(PyObject* unicode) {
#if defined(_WIN32)
    PyObject* bytes = PyUnicode_AsMBCSString(obj);
    if (bytes) {
        return bytes;
    }
     // clears error indicator from failed conversion
    PyErr_Clear();
#endif
    return PyUnicode_AsUTF8String(obj);
}

@liuxingbaoyu
Copy link
Copy Markdown
Author

You are right!
But I'm worried that sometimes it will convert incorrectly without throwing an exception, maybe we can use https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp.

@VadimLevin
Copy link
Copy Markdown
Contributor

VadimLevin commented May 16, 2022

it will convert incorrectly without throwing an exception

Do you mean the case when MBCS conversion fails, but UTF8 produces wrong result? But in this case it means that user provided a string in unnatural encoding for the OS. In my opinion, the only way to resolve such situation is to require to provide an encoding.

Can you check whenever MBCS encoding fails if provided string is in UTF8 on you Windows machine?

@liuxingbaoyu
Copy link
Copy Markdown
Author

liuxingbaoyu commented May 17, 2022

Can you check whenever MBCS encoding fails if provided string is in UTF8 on you Windows machine?

That's what I meant! I'm worried about garbled characters instead of throwing exceptions when utf8 mode is enabled (when passing a utf8 string to PyUnicode_DecodeMBCS).

But it's solved now!
And I tested it with windows 1803 + utf8 mode using a virtual machine, it works fine!

@tommyyliu
Copy link
Copy Markdown

I've run into the same issue, and it looks like this PR just got left hanging. @VadimLevin , do you have the bandwidth to take another look since you looked at it in the past?

@liuxingbaoyu
Copy link
Copy Markdown
Author

The CI failure after rebasing doesn't seem to be related to this PR, it succeeds on windows, and this PR only changes things under the _WIN32 macro.🤔

@liuxingbaoyu liuxingbaoyu changed the base branch from 3.4 to 4.x January 8, 2024 02:24
@liuxingbaoyu liuxingbaoyu force-pushed the fix-python-unicode-support branch from ff457dd to 9c2cc54 Compare January 8, 2024 02:28
@liuxingbaoyu liuxingbaoyu force-pushed the fix-python-unicode-support branch from 792dcdf to 94d04eb Compare January 9, 2024 05:25
@liuxingbaoyu
Copy link
Copy Markdown
Author

@asmorkalov Hi, sorry to bother you, can you approve the CI to run? Thanks!

@liuxingbaoyu
Copy link
Copy Markdown
Author

Thanks!
The cv.VideoWriter test for linux fails, this seems to be a different problem that exists in c.
As mentioned in https://github.com/opencv/opencv/pull/21961/files#r871041445, other tests ensure that the conversion between OpenCV -> Python -> OpenCV is correct.
Maybe just mark this test as skipped on linux so it can be fixed in the future?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

4 participants