Support of Unicode file paths under Windows by Baum55 · Pull Request #13368 · opencv/opencv

Baum55 · 2018-12-05T14:32:13Z

On Windows the file API (open, fopen) assumes a ANSI coding of the std::string (const char*). Other operating systems assumes a UTF-8 coding of the std::string. To fix this issue it is necessary to use the Unicode format with the std::wstring (const wchar*) variant of the functions.
No functional change for other operating systems.

The file names ("imgcodecs.hpp") are now UTF-8 encoded on all operating systems:

imread
imreadmulti
imwrite
imdecode
imencode
haveImageReader
haveImageWriter

Internally, on Windows the file paths are std::wstring (Unicode).
In other operating systems the file paths are std::string (UTF-8).

resolves #4292
resolves #5631

force_builders=linux,windows,macosx,docs

On Windows the file API (open, fopen) assumes a ANSI coding of the std::string (const char*). Other operating systems assumes a UTF-8 coding of the std::string. To fix this issue it is necessary to use the Unicode format with the std::wstring (const wchar*) variant of the functions. No functional change for other operating systems. Changes in the interface of "imgcodecs.hpp": - imread - imreadmulti - imwrite - imdecode - imencode - haveImageReader - haveImageWriter Functions overloaded with a std::wstring (Unicode) variant for the file paths. Changes in the interface of "imgcodecs.hpp" (only Windows): - setcodepage Possibility to change the code page for the std::string variant of the imgcodecs function. So it is possible that openCV assumes UTF-8 coding for std::string file paths. Internally, on Windows the file paths are std::wstring (Unicode). In other operating systems the file paths are std::string (UTF-8).

Linux/Max: - Copy and Paste error: function was twice declared Windows: - Static cast conversion from 'size_t' to 'int' - The second argument (dwFlags) from MultiByteToWideChar have to 0 if the code page is changed to UTF-8 Doc: - Tab in indent removed - Trailing whitespaces removed - Documentation extended with an example

Linux/Max - Wrong name of a variable All - Removed new setcodepage function

Baum55 · 2018-12-06T13:13:14Z

I have problems with the generation of the python bindings generated by the buildbot.

To solve the issue, it is necessary to add std::wstring (at least for Windows). When I read the error messages from the buildbot, I see that this causes troubles.

Is there a way to fix this?

alalek · 2018-12-06T13:37:16Z

OpenCV is cross-platform library. Public API should be cross-platform, so no "#ifdef _WIN32" are allowed in public headers.
Also bindings parser doesn't work properly with "#if" conditions.

Start from writing a sample to demonstrate this functionality. Without excessive "#if"s.

Baum55 · 2018-12-06T14:00:23Z

OpenCV is cross-platform library. Public API should be cross-platform, so no "#ifdef _WIN32" are allowed in public headers.
Also bindings parser doesn't work properly with "#if" conditions.

Start from writing a sample to demonstrate this functionality. Without excessive "#if"s.

Ok. I can remove the "#if"s from the public headers:

modules/core/include/opencv2/core/cvstd.hpp
modules/core/include/opencv2/core/utility.hpp
modules/imgcodecs/include/opencv2/imgcodecs.hpp

The support of Unicode paths on Windows will not be possible without excessive "#if"s in the source files.

In the same way boost::filesystem::path handles file paths. std::wstring on Windows and std::string on everything else with a typedef for a "path" type who change depending on the operating system.

The Windows file API for char* use an ANSI coding, only the wchar_t variant use Unicode.

ifdef removed from public header - modules/core/include/opencv2/core/cvstd.hpp - modules/core/include/opencv2/core/utility.hpp

Baum55 · 2018-12-07T10:09:18Z

I remove all #ifdef from the public headers, but I still got the same message from the buildbot:

/build/precommit_linux64/opencv/modules/python/src2/cv2.cpp: In instantiation of 'bool pyopencv_to(PyObject*, T&, const char*) [with T = std::__cxx11::basic_string<wchar_t>; PyObject = _object]': /build/precommit_linux64/build/modules/python_bindings_generator/pyopencv_generated_funcs.h:10511:69: required from here /build/precommit_linux64/opencv/modules/python/src2/cv2.cpp:24:105: error: 'to' is not a member of 'PyOpenCV_Converter<std::__cxx11::basic_string<wchar_t>, void>' bool pyopencv_to(PyObject* obj, T& p, const char* name = "<unknown>") { return PyOpenCV_Converter<T>::to(obj, p, name); } ^ make[2]: *** [modules/python3/CMakeFiles/opencv_python3.dir/__/src2/cv2.cpp.o] Error 1 make[1]: *** [modules/python3/CMakeFiles/opencv_python3.dir/all] Error 2

I assume that the reason is because I have overloaded imread, imwrite, etc. with a std::wstring variant.

I have again thought about it intensively.
The behaviour of "fopen", "open" etc. are not equal over different operating systems. At least I do not see a way to support non ANSI characters on Windows without #ifdef.

Is there even a chance that the pull request will be accepted with the many #ifdef?

mshabunin · 2018-12-07T10:27:01Z

What about single lower-level interface: imread(std::basic_istream<char> &)? It can unify imdecode and imread.

Baum55 · 2018-12-07T11:17:03Z

What about single lower-level interface: imread(std::basic_istream<char> &)? It can unify imdecode and imread.

Interesting approach. I will try it next week.
Thank you very much.

ifdef concentrated in the new class Path The majority of the ifdef is now concentrated in the new Path class.

Mac: - 2 Function with the same signature Doc: - Trailing Whitespaces

Trailing Whitespaces removed

Baum55 · 2018-12-11T09:17:33Z

What about single lower-level interface: imread(std::basic_istream<char> &)? It can unify imdecode and imread.

Interesting approach. I will try it next week.
Thank you very much.

I tried it with imread(std::basic_istream<char> &), but basic_istream is not overloaded for std::string or const char*.
A function call like imread("path") would not work anymore. I doubt that would have been wanted.

Comments of the Path class extended

Baum55 · 2018-12-13T07:01:07Z

Despite that the buildbot has successfully passed the test, it still shows "Waiting for builds" and the status is yellow. "How to contribute" describe only the status red and green.
Is that normal?

Maybe it's because I accidentally clicked on "Close pull request" instead "Comment" button?

mshabunin · 2018-12-13T12:13:43Z

@Baum55 , I meant that we could add lower-level interface allowing users to provide thier own content streams, not filenames. Pseudocode:

Mat imread(std::istream & input, ...)
{
  input.seekg(0, std::ios_base::end);
  size_t fsize = input.tellg();
  Mat bitstream(Size(fsize, 1), CV_8U);
  input.seekg(0);
  input.read(bitstream.data, fsize);
  return imdecode(bitstream, ...);
}

// existing function can be reimplemented as follows
Mat imread(const char * filename, ...)
{
  fstream input(filename, std::ios::binary);
  if (!input) CV_Error(..., "File can not be opened");
  return imread(input);
}

Thus users will be able to open files using thier own methods, e.g. https://docs.microsoft.com/en-us/cpp/standard-library/basic-ifstream-class?view=vs-2017 :

wchar * fname = L"example.png";
std::ifstream input(fname);
Mat image = imread(input);

Baum55 · 2018-12-14T11:23:02Z

@mshabunin I misunderstood that, sorry,

I have tested your proposal and it works fine.

That would mean much smaller changes. Only in the files:

loadsave.cpp
imcodecs.hpp
python/src2/cv2.cpp

Currently, imwrite supports multi page images and imencode does not. imreadmulti should also take into account.

But there are some disadvantage. The function imdecode and imencode creates a temporary file for the operation.
new imread:

read the data from the file stream to cv::Mat
save the cv::Mat with the file API of the operating system
read the temporary file with the openCV decoder classes

In a similar way that would work for imwrite. The temporary file would mean a performance loss for every imread and imwrite. In additional there would be an unnecessary writing process on the hard drive.
I do not see a good way to avoid the temporary file. The grfmt_*.*pp files for the decoder and encoder heavily use file descriptors and file descriptors are not compatible with std file streams.

I am not sure that the simplifications justifies the disadvantages.
What do you think about that?

mshabunin · 2018-12-14T12:43:16Z

I believe changes in cpp files and python bindings are not necessary. New imread can be made inline serving as a proxy for imdecode for now.

I think multipage support should be out of scope for now.

imdecode and imencode will create temp files only in case if backend does not support working with memory buffer. All popular backends (png, tiff, jpeg, bmp) support it - m_buf_supported = true. File descriptors used in backends are their own and will not be opened if memory buffer is passed.

Baum55 · 2018-12-17T11:46:01Z

After my experiences when I added std::wstring to the public headers. I got an error on the buildbot, because the template function pyopencv_to was not specialized for std::wstring (The entire error message is in comment 4). Since I do not see a specialization for std::istream, std::ostream or std::basic_istream<char>, I assume that I will get the same error again.

The simplifications come at a price and the purpose for which I made the pull request would omit.
All im* functions should be Unicode compatible and not only imread. On Windows, the same functionality as on Linux and Mac should be possible with the same performance.

Disadvantages of the file stream solution:

Some less popular image formats require a temporary file (the difference should at least be documented)
imencode always loads the entire image into the memory also for popular formats (bmp, webp etc.)
haveImageReader(std::string) does not need to load the entire image, but haveImageReader(std::istream) would need it.
imwrite(std::string, ...) supports multi page images and imwrite(std::ostream, ...) as a wrapper for imencode does not.
no new functionality that you could not already write in your application (Wrapper function for imdecode and imencode)
no multi page image support

alalek · 2018-12-17T12:07:14Z

std::istream is good option for C++.

Python/Java should be covered by separate activity if needed.

Baum55 · 2018-12-19T13:15:02Z

I still do not understand the significant advantage of file streams in this case???

In the current solution with std::wstring, all functions of imgcodecs.hpp work for non-ANSI chracters under Windows. The functions have the same behavior and performance on different operating systems. All image formats behave the same for both overloaded function variants.

With the proposal of @mshabunin, only imread and imwrite would work with non-ANSI characters. The std::string and file stream function variants would behave differently. (No support for multi-page images in the file stream version, sometimes it creates temporary files, ...)

It's not that I have invented the solution with std::wstring. The standard library goes the same way with std::filesystem::path and boost also uses the same method.

What are the significant benefits of file streams that offset the disadvantages compared to the current solution?

alalek

Sample and test are necessary.

3rdparty/libjasper/jas_stream.c

Changes in the 3rdparty code reverted. Unicode compatibility is preserved under Windows. jasper/grfmt_jpeg_2000: - 3rdparty lib can be called with a pointer to a FILE object openexr/grfmt_exr: - 3rdparty lib can be called with a filestream

Sample code for the read and write process of images added. Test added. The test is based on the existing code and uses the previously tested std::string variants to check the std::wstring functions.

hrnr · 2019-01-14T22:50:34Z

Hi,

first of all thanks. This solves some issue for windows for me as well.

However, I'm not that happy about the implementation. This PR basically adds W overloads, that are de-facto windows specific. Therefore in portable applications all the burden of selecting which one of the functions to call is on me as the user. Not to mention double path handling in std::string/std::wstring to Windows and rest of the world. In fact, I would need to implement my own wrappers over OpenCV to make this portable. It would be much nicer if OpenCV could do this job for me and for everyone, so we all don't need to write our own wrappers.

I see two better options than the current implementation:

Don't change the API at all. Expect std::string to be UTF-8 encoded and convert it internally to UTF-16 on Windows. This would be preferable for me, because my portable applications would magically work on Windows too. Might break some current rare use-cases, but probably more theoretical problem than a real one.
Introduce overloads taking some Path class instead of a std::string. Preferably keeping the same name of the functions. Path class should support subset of std::filesystem::path, so it can be later typedefed to std::filesystem::path, like @alalek did with cv::String and almost with cv::Ptr for OpenCV 4.0. This would position OpenCV nicely for C++17 support.

Note that both approaches can be nicely used by multi-platform applications and it would also nicely work from Python. If you want, you can even do both.

Cheers,
Jiri

Baum55 · 2019-01-15T07:44:24Z

@hrnr @alalek @mshabunin

I can do both if there is consensus that this is desired.
I was thinking about a solution like point 1, but I was not sure if openCV would accept the changed behavior under Windows (openCV 3.x std::string filenames ANSI-coded and openCV 4.x UTF-8).

- I can remove the W overloads and treat all std::string on all OS as UTF-8.
- I can separate Path to a public and an internal header. In the public part I can remove the OS ifdefs and I would only use functions that std::filesystem::path have, too.

Instead of 2 overloads for each. It would look like:
Mat imread( const cv::FilesystemPath& filename, ...)
bool imwrite( const cv::FilesystemPath& filename, ...)

Honestly, this solution would fit even better to the needs of my application. Thanks for the suggestion.
Is this Ok?

Edit for point 2:
I must rename Path to something else (Path -> FilesystemPath?). A other module has already a Path structure.

hrnr · 2019-01-15T22:54:48Z

Great. Thanks for considering this.

I agree that the decision now needs to be made by OpenCV core team.

For option 1: It is non-compatible change, but in my experience it never breaks for reasonable strings in practice, paths especially. Of course, you need to do a fallback back to char* functions when UTF-8 -> UTF-16 conversion fails. This way it worked very nicely for me in multiple projects. Reasonable locale-encoded strings are almost never also valid UTF-8. Of course, this behaviour needs to be documented.

For option 2: You would need to probably keep the std::string functions in OpenCV 4.0, to prevent breaking the ABI.

Thanks again for leading this effort and implementing it.

Cheers,
Jiri

vpisarev · 2019-02-19T11:08:55Z

@Baum55, first of all, I join @hrnr in thanking you for raising the issue and providing the solution.

My vote is for keeping the API as-is and treating all the strings as UTF-8. If needed, we can introduce extra convenience function(s) to convert file paths encoded as wide strings to std::string and back.

- API change in imgcodecs reverted - Undo changes that were no longer needed - Doku extended

Baum55 · 2019-03-01T13:02:02Z

@vpisarev You're welcome.

In the version that I use for myself I already treat all the strings as UTF-8. I agree that this makes it easier and is better for multi-platform software. The UTF-8 approach also reduces the necessary changes. I have committed this version.

In the file "opencv2/core/utility.hpp" I have added the functions toString and toWString to convert std::string <=> std::wstring. That avoids copy-and-paste of the conversion functions that are used in different modules. (core, imgcodecs, python)

huangqinjin · 2019-07-08T13:31:08Z

I would like to leave a comment here. Since UCRT 10.0.17134, i.e. Windows 10 April 2018 Update version 1803, you can issue setlocale(LC_CTYPE, ".utf8"), then the filename to fopen is treated as UTF-8 encoded. See https://github.com/huangqinjin/wmain#ucrt-and-utf-8 .

Anyway, +1 for std::istream solution.

risa2000 · 2019-11-06T14:19:08Z

Just found this issue while looking for non-ASCII support for cv::imread and cv::imwrite. Reading this thread gave me an impression that both functions could be made to accept UTF-8 strings (which should not impact anyone using pure ASCII).

Now what I did not get, whether the internal implicit conversion from UTF-8 could be problem? Is it possible to use the current implementation with non-ASCII characters? Could there be a misinterpretation in such situation?

If not, or not a problem, I would prefer, if OpenCV implements implicit conversion from UTF-8 internally, and let the client code pass UTF-8 strings.

Using std::ifstream instead looks more like a workaround than a final solution, because it requires a user to handle the stream, while also relying on std::filesystem (which seems to be the only straightforward way to open an ifstream on UTF-8 path), which consequently requires C++17 support in the toolchain, etc. i.e. a lot of assumption on the user environment.

Keeping only "string" interface ensures that everything mentioned above can happen "under the hood" of the lib, and the lib can implement it the best way it would fit (using fstreams, fopen, etc.), without requiring any particular toolchain feature from the user's environment.

I am writing this in particular with regard to (#15675) which seems to be going the opposite way.

asmorkalov · 2020-03-20T13:08:28Z

OpenCV team discussed the solution on weekly meeting. @alalek @vpisarev @asmorkalov @vladimir-dudnik @VadimLevin decided to close thie PR. The reason of the decision and alternatives are discussed in #4292 (comment) and #5631.

risa2000 · 2020-03-20T14:08:53Z

@asmorkalov Since you closed all the other discussions as well, I am not sure where to post, so I am leaving it here:
What about providing function overrides for file I/O which would accept std::filesystem::path instead of std::string?
I do not know what is current status of C++17 support in the toolchains opencv mostly uses, (I know std::filesystem is supported by latest MSVC and clang equally), but choosing this "standard" way should ensure forward compatibility while remaining platform agnostic.

Baum55 added 3 commits December 5, 2018 15:26

Fix Regression

7198f2a

Linux/Max - Wrong name of a variable All - Removed new setcodepage function

ifdef removed from public header

863b53f

ifdef removed from public header - modules/core/include/opencv2/core/cvstd.hpp - modules/core/include/opencv2/core/utility.hpp

Baum55 added 3 commits December 10, 2018 15:51

ifdef concentrated

4313add

ifdef concentrated in the new class Path The majority of the ifdef is now concentrated in the new Path class.

Fix Mac und Documentation error

45029ca

Mac: - 2 Function with the same signature Doc: - Trailing Whitespaces

Trailing Whitespaces removed

68a1a16

Trailing Whitespaces removed

Baum55 closed this Dec 11, 2018

Baum55 reopened this Dec 11, 2018

Comments extended

6892c26

Comments of the Path class extended

alalek requested changes Dec 24, 2018

View reviewed changes

3rdparty/libjasper/jas_stream.c Outdated Show resolved Hide resolved

Baum55 added 2 commits December 29, 2018 17:49

3rdparty reverted

3816431

Changes in the 3rdparty code reverted. Unicode compatibility is preserved under Windows. jasper/grfmt_jpeg_2000: - 3rdparty lib can be called with a pointer to a FILE object openexr/grfmt_exr: - 3rdparty lib can be called with a filestream

Sample and test added

0ff3f09

Sample code for the read and write process of images added. Test added. The test is based on the existing code and uses the previously tested std::string variants to check the std::wstring functions.

vpisarev self-assigned this Feb 19, 2019

API in imgcodecs reverted

3ad06c3

- API change in imgcodecs reverted - Undo changes that were no longer needed - Doku extended

alalek mentioned this pull request Oct 9, 2019

Support I/O functionality through std::istream/ostream #15675

Open

asmorkalov added the pr: Discussion Required label Feb 21, 2020

asmorkalov self-assigned this Mar 6, 2020

asmorkalov removed the pr: Discussion Required label Mar 12, 2020

asmorkalov unassigned vpisarev Mar 12, 2020

asmorkalov closed this Mar 20, 2020

liuxingbaoyu mentioned this pull request Sep 9, 2020

imread supports Non-ASCII filename in opencv-python+Windows #18305

Open

4 tasks

honglei mentioned this pull request Nov 27, 2020

Suport Chinese Characters in file Path breezedeus/CnOCR#130

Closed

Uh oh!

Conversation

Baum55 commented Dec 5, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Baum55 commented Dec 6, 2018

Uh oh!

alalek commented Dec 6, 2018

Uh oh!

Baum55 commented Dec 6, 2018

Uh oh!

Baum55 commented Dec 7, 2018

Uh oh!

mshabunin commented Dec 7, 2018

Uh oh!

Baum55 commented Dec 7, 2018

Uh oh!

Baum55 commented Dec 11, 2018

Uh oh!

Baum55 commented Dec 13, 2018

Uh oh!

mshabunin commented Dec 13, 2018

Uh oh!

Baum55 commented Dec 14, 2018

Uh oh!

mshabunin commented Dec 14, 2018

Uh oh!

Baum55 commented Dec 17, 2018

Uh oh!

alalek commented Dec 17, 2018

Uh oh!

Baum55 commented Dec 19, 2018

Uh oh!

alalek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

hrnr commented Jan 14, 2019

Uh oh!

Baum55 commented Jan 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hrnr commented Jan 15, 2019

Uh oh!

vpisarev commented Feb 19, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Baum55 commented Mar 1, 2019

Uh oh!

huangqinjin commented Jul 8, 2019

Uh oh!

risa2000 commented Nov 6, 2019

Uh oh!

asmorkalov commented Mar 20, 2020

Uh oh!

risa2000 commented Mar 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Baum55 commented Dec 5, 2018 •

edited

Loading

Baum55 commented Jan 15, 2019 •

edited

Loading

vpisarev commented Feb 19, 2019 •

edited

Loading

risa2000 commented Mar 20, 2020 •

edited

Loading