Skip to content

Support of Unicode file paths under Windows#13368

Closed
Baum55 wants to merge 11 commits intoopencv:masterfrom
Baum55:master
Closed

Support of Unicode file paths under Windows#13368
Baum55 wants to merge 11 commits intoopencv:masterfrom
Baum55:master

Conversation

@Baum55
Copy link
Copy Markdown

@Baum55 Baum55 commented Dec 5, 2018

On Windows the file API (open, fopen) assumes a ANSI coding of the std::string (const char*). Other operating systems assumes a UTF-8 coding of the std::string. To fix this issue it is necessary to use the Unicode format with the std::wstring (const wchar*) variant of the functions.
No functional change for other operating systems.

The file names ("imgcodecs.hpp") are now UTF-8 encoded on all operating systems:

  • imread
  • imreadmulti
  • imwrite
  • imdecode
  • imencode
  • haveImageReader
  • haveImageWriter

Internally, on Windows the file paths are std::wstring (Unicode).
In other operating systems the file paths are std::string (UTF-8).

resolves #4292
resolves #5631

force_builders=linux,windows,macosx,docs

On Windows the file API (open, fopen) assumes a ANSI coding of the std::string (const char*). Other operating systems assumes a UTF-8 coding of the std::string. To fix this issue it is necessary to use the Unicode format with the std::wstring (const wchar*) variant of the functions.
No functional change for other operating systems.

Changes in the interface of "imgcodecs.hpp":
- imread
- imreadmulti
- imwrite
- imdecode
- imencode
- haveImageReader
- haveImageWriter
Functions overloaded with a std::wstring (Unicode) variant for the file paths.

Changes in the interface of "imgcodecs.hpp" (only Windows):
- setcodepage
Possibility to change the code page for the std::string variant of the imgcodecs function. So it is possible that openCV assumes UTF-8 coding for std::string file paths.

Internally, on Windows the file paths are std::wstring (Unicode).
In other operating systems the file paths are std::string (UTF-8).
Linux/Max:
- Copy and Paste error: function was twice declared

Windows:
- Static cast conversion from 'size_t' to 'int'
- The second argument (dwFlags) from MultiByteToWideChar have to 0 if the code page is changed to UTF-8

Doc:
- Tab in indent removed
- Trailing whitespaces removed
- Documentation extended with an example
Linux/Max
- Wrong name of a variable

All
- Removed new setcodepage function
@Baum55
Copy link
Copy Markdown
Author

Baum55 commented Dec 6, 2018

I have problems with the generation of the python bindings generated by the buildbot.

To solve the issue, it is necessary to add std::wstring (at least for Windows). When I read the error messages from the buildbot, I see that this causes troubles.

Is there a way to fix this?

@alalek
Copy link
Copy Markdown
Member

alalek commented Dec 6, 2018

OpenCV is cross-platform library. Public API should be cross-platform, so no "#ifdef _WIN32" are allowed in public headers.
Also bindings parser doesn't work properly with "#if" conditions.

Start from writing a sample to demonstrate this functionality. Without excessive "#if"s.

@Baum55
Copy link
Copy Markdown
Author

Baum55 commented Dec 6, 2018

OpenCV is cross-platform library. Public API should be cross-platform, so no "#ifdef _WIN32" are allowed in public headers.
Also bindings parser doesn't work properly with "#if" conditions.

Start from writing a sample to demonstrate this functionality. Without excessive "#if"s.

Ok. I can remove the "#if"s from the public headers:

  • modules/core/include/opencv2/core/cvstd.hpp
  • modules/core/include/opencv2/core/utility.hpp
  • modules/imgcodecs/include/opencv2/imgcodecs.hpp

The support of Unicode paths on Windows will not be possible without excessive "#if"s in the source files.

In the same way boost::filesystem::path handles file paths. std::wstring on Windows and std::string on everything else with a typedef for a "path" type who change depending on the operating system.

The Windows file API for char* use an ANSI coding, only the wchar_t variant use Unicode.

ifdef removed from public header
- modules/core/include/opencv2/core/cvstd.hpp
- modules/core/include/opencv2/core/utility.hpp
@Baum55
Copy link
Copy Markdown
Author

Baum55 commented Dec 7, 2018

I remove all #ifdef from the public headers, but I still got the same message from the buildbot:

/build/precommit_linux64/opencv/modules/python/src2/cv2.cpp: In instantiation of 'bool pyopencv_to(PyObject*, T&, const char*) [with T = std::__cxx11::basic_string<wchar_t>; PyObject = _object]': /build/precommit_linux64/build/modules/python_bindings_generator/pyopencv_generated_funcs.h:10511:69: required from here /build/precommit_linux64/opencv/modules/python/src2/cv2.cpp:24:105: error: 'to' is not a member of 'PyOpenCV_Converter<std::__cxx11::basic_string<wchar_t>, void>' bool pyopencv_to(PyObject* obj, T& p, const char* name = "<unknown>") { return PyOpenCV_Converter<T>::to(obj, p, name); } ^ make[2]: *** [modules/python3/CMakeFiles/opencv_python3.dir/__/src2/cv2.cpp.o] Error 1 make[1]: *** [modules/python3/CMakeFiles/opencv_python3.dir/all] Error 2

I assume that the reason is because I have overloaded imread, imwrite, etc. with a std::wstring variant.

I have again thought about it intensively.
The behaviour of "fopen", "open" etc. are not equal over different operating systems. At least I do not see a way to support non ANSI characters on Windows without #ifdef.

Is there even a chance that the pull request will be accepted with the many #ifdef?

@mshabunin
Copy link
Copy Markdown
Contributor

What about single lower-level interface: imread(std::basic_istream<char> &)? It can unify imdecode and imread.

@Baum55
Copy link
Copy Markdown
Author

Baum55 commented Dec 7, 2018

What about single lower-level interface: imread(std::basic_istream<char> &)? It can unify imdecode and imread.

Interesting approach. I will try it next week.
Thank you very much.

ifdef concentrated in the new class Path

The majority of the ifdef is now concentrated in the new Path class.
Mac:
- 2 Function with the same signature

Doc:
- Trailing Whitespaces
Trailing Whitespaces removed
@Baum55
Copy link
Copy Markdown
Author

Baum55 commented Dec 11, 2018

What about single lower-level interface: imread(std::basic_istream<char> &)? It can unify imdecode and imread.

Interesting approach. I will try it next week.
Thank you very much.

I tried it with imread(std::basic_istream<char> &), but basic_istream is not overloaded for std::string or const char*.
A function call like imread("path") would not work anymore. I doubt that would have been wanted.

@Baum55 Baum55 closed this Dec 11, 2018
@Baum55 Baum55 reopened this Dec 11, 2018
Comments of the Path class extended
@Baum55
Copy link
Copy Markdown
Author

Baum55 commented Dec 13, 2018

Despite that the buildbot has successfully passed the test, it still shows "Waiting for builds" and the status is yellow. "How to contribute" describe only the status red and green.
Is that normal?

Maybe it's because I accidentally clicked on "Close pull request" instead "Comment" button?

@mshabunin
Copy link
Copy Markdown
Contributor

@Baum55 , I meant that we could add lower-level interface allowing users to provide thier own content streams, not filenames. Pseudocode:

Mat imread(std::istream & input, ...)
{
  input.seekg(0, std::ios_base::end);
  size_t fsize = input.tellg();
  Mat bitstream(Size(fsize, 1), CV_8U);
  input.seekg(0);
  input.read(bitstream.data, fsize);
  return imdecode(bitstream, ...);
}

// existing function can be reimplemented as follows
Mat imread(const char * filename, ...)
{
  fstream input(filename, std::ios::binary);
  if (!input) CV_Error(..., "File can not be opened");
  return imread(input);
}

Thus users will be able to open files using thier own methods, e.g. https://docs.microsoft.com/en-us/cpp/standard-library/basic-ifstream-class?view=vs-2017 :

wchar * fname = L"example.png";
std::ifstream input(fname);
Mat image = imread(input);

@Baum55
Copy link
Copy Markdown
Author

Baum55 commented Dec 14, 2018

@mshabunin I misunderstood that, sorry,

I have tested your proposal and it works fine.

That would mean much smaller changes. Only in the files:

  • loadsave.cpp
  • imcodecs.hpp
  • python/src2/cv2.cpp

Currently, imwrite supports multi page images and imencode does not. imreadmulti should also take into account.

But there are some disadvantage. The function imdecode and imencode creates a temporary file for the operation.
new imread:

  • read the data from the file stream to cv::Mat
  • save the cv::Mat with the file API of the operating system
  • read the temporary file with the openCV decoder classes

In a similar way that would work for imwrite. The temporary file would mean a performance loss for every imread and imwrite. In additional there would be an unnecessary writing process on the hard drive.
I do not see a good way to avoid the temporary file. The grfmt_*.*pp files for the decoder and encoder heavily use file descriptors and file descriptors are not compatible with std file streams.

I am not sure that the simplifications justifies the disadvantages.
What do you think about that?

@mshabunin
Copy link
Copy Markdown
Contributor

I believe changes in cpp files and python bindings are not necessary. New imread can be made inline serving as a proxy for imdecode for now.

I think multipage support should be out of scope for now.

imdecode and imencode will create temp files only in case if backend does not support working with memory buffer. All popular backends (png, tiff, jpeg, bmp) support it - m_buf_supported = true. File descriptors used in backends are their own and will not be opened if memory buffer is passed.

@Baum55
Copy link
Copy Markdown
Author

Baum55 commented Dec 17, 2018

After my experiences when I added std::wstring to the public headers. I got an error on the buildbot, because the template function pyopencv_to was not specialized for std::wstring (The entire error message is in comment 4). Since I do not see a specialization for std::istream, std::ostream or std::basic_istream<char>, I assume that I will get the same error again.

The simplifications come at a price and the purpose for which I made the pull request would omit.
All im* functions should be Unicode compatible and not only imread. On Windows, the same functionality as on Linux and Mac should be possible with the same performance.

Disadvantages of the file stream solution:

  • Some less popular image formats require a temporary file (the difference should at least be documented)
  • imencode always loads the entire image into the memory also for popular formats (bmp, webp etc.)
  • haveImageReader(std::string) does not need to load the entire image, but haveImageReader(std::istream) would need it.
  • imwrite(std::string, ...) supports multi page images and imwrite(std::ostream, ...) as a wrapper for imencode does not.
  • no new functionality that you could not already write in your application (Wrapper function for imdecode and imencode)
  • no multi page image support

@alalek
Copy link
Copy Markdown
Member

alalek commented Dec 17, 2018

std::istream is good option for C++.

Python/Java should be covered by separate activity if needed.

@Baum55
Copy link
Copy Markdown
Author

Baum55 commented Dec 19, 2018

I still do not understand the significant advantage of file streams in this case???

In the current solution with std::wstring, all functions of imgcodecs.hpp work for non-ANSI chracters under Windows. The functions have the same behavior and performance on different operating systems. All image formats behave the same for both overloaded function variants.

With the proposal of @mshabunin, only imread and imwrite would work with non-ANSI characters. The std::string and file stream function variants would behave differently. (No support for multi-page images in the file stream version, sometimes it creates temporary files, ...)

It's not that I have invented the solution with std::wstring. The standard library goes the same way with std::filesystem::path and boost also uses the same method.

What are the significant benefits of file streams that offset the disadvantages compared to the current solution?

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sample and test are necessary.

Changes in the 3rdparty code reverted. Unicode compatibility is preserved under Windows.

jasper/grfmt_jpeg_2000:
- 3rdparty lib can be called with a pointer to a FILE object

openexr/grfmt_exr:
- 3rdparty lib can be called with a filestream
Sample code for the read and write process of images added.

Test added. The test is based on the existing code and uses the previously tested std::string variants to check the std::wstring functions.
@hrnr
Copy link
Copy Markdown
Contributor

hrnr commented Jan 14, 2019

Hi,

first of all thanks. This solves some issue for windows for me as well.

However, I'm not that happy about the implementation. This PR basically adds W overloads, that are de-facto windows specific. Therefore in portable applications all the burden of selecting which one of the functions to call is on me as the user. Not to mention double path handling in std::string/std::wstring to Windows and rest of the world. In fact, I would need to implement my own wrappers over OpenCV to make this portable. It would be much nicer if OpenCV could do this job for me and for everyone, so we all don't need to write our own wrappers.

I see two better options than the current implementation:

  • Don't change the API at all. Expect std::string to be UTF-8 encoded and convert it internally to UTF-16 on Windows. This would be preferable for me, because my portable applications would magically work on Windows too. Might break some current rare use-cases, but probably more theoretical problem than a real one.
  • Introduce overloads taking some Path class instead of a std::string. Preferably keeping the same name of the functions. Path class should support subset of std::filesystem::path, so it can be later typedefed to std::filesystem::path, like @alalek did with cv::String and almost with cv::Ptr for OpenCV 4.0. This would position OpenCV nicely for C++17 support.

Note that both approaches can be nicely used by multi-platform applications and it would also nicely work from Python. If you want, you can even do both.

Cheers,
Jiri

@Baum55
Copy link
Copy Markdown
Author

Baum55 commented Jan 15, 2019

@hrnr @alalek @mshabunin

I can do both if there is consensus that this is desired.
I was thinking about a solution like point 1, but I was not sure if openCV would accept the changed behavior under Windows (openCV 3.x std::string filenames ANSI-coded and openCV 4.x UTF-8).

    • I can remove the W overloads and treat all std::string on all OS as UTF-8.
    • I can separate Path to a public and an internal header. In the public part I can remove the OS ifdefs and I would only use functions that std::filesystem::path have, too.

Instead of 2 overloads for each. It would look like:
Mat imread( const cv::FilesystemPath& filename, ...)
bool imwrite( const cv::FilesystemPath& filename, ...)

Honestly, this solution would fit even better to the needs of my application. Thanks for the suggestion.
Is this Ok?

Edit for point 2:
I must rename Path to something else (Path -> FilesystemPath?). A other module has already a Path structure.

@hrnr
Copy link
Copy Markdown
Contributor

hrnr commented Jan 15, 2019

Great. Thanks for considering this.

I agree that the decision now needs to be made by OpenCV core team.

For option 1: It is non-compatible change, but in my experience it never breaks for reasonable strings in practice, paths especially. Of course, you need to do a fallback back to char* functions when UTF-8 -> UTF-16 conversion fails. This way it worked very nicely for me in multiple projects. Reasonable locale-encoded strings are almost never also valid UTF-8. Of course, this behaviour needs to be documented.

For option 2: You would need to probably keep the std::string functions in OpenCV 4.0, to prevent breaking the ABI.

Thanks again for leading this effort and implementing it.

Cheers,
Jiri

@vpisarev vpisarev self-assigned this Feb 19, 2019
@vpisarev
Copy link
Copy Markdown
Contributor

vpisarev commented Feb 19, 2019

@Baum55, first of all, I join @hrnr in thanking you for raising the issue and providing the solution.

My vote is for keeping the API as-is and treating all the strings as UTF-8. If needed, we can introduce extra convenience function(s) to convert file paths encoded as wide strings to std::string and back.

- API change in imgcodecs reverted
- Undo changes that were no longer needed
- Doku extended
@Baum55
Copy link
Copy Markdown
Author

Baum55 commented Mar 1, 2019

@vpisarev You're welcome.

In the version that I use for myself I already treat all the strings as UTF-8. I agree that this makes it easier and is better for multi-platform software. The UTF-8 approach also reduces the necessary changes. I have committed this version.

In the file "opencv2/core/utility.hpp" I have added the functions toString and toWString to convert std::string <=> std::wstring. That avoids copy-and-paste of the conversion functions that are used in different modules. (core, imgcodecs, python)

@huangqinjin
Copy link
Copy Markdown
Contributor

I would like to leave a comment here. Since UCRT 10.0.17134, i.e. Windows 10 April 2018 Update version 1803, you can issue setlocale(LC_CTYPE, ".utf8"), then the filename to fopen is treated as UTF-8 encoded. See https://github.com/huangqinjin/wmain#ucrt-and-utf-8 .

Anyway, +1 for std::istream solution.

@risa2000
Copy link
Copy Markdown

risa2000 commented Nov 6, 2019

Just found this issue while looking for non-ASCII support for cv::imread and cv::imwrite. Reading this thread gave me an impression that both functions could be made to accept UTF-8 strings (which should not impact anyone using pure ASCII).

Now what I did not get, whether the internal implicit conversion from UTF-8 could be problem? Is it possible to use the current implementation with non-ASCII characters? Could there be a misinterpretation in such situation?

If not, or not a problem, I would prefer, if OpenCV implements implicit conversion from UTF-8 internally, and let the client code pass UTF-8 strings.

Using std::ifstream instead looks more like a workaround than a final solution, because it requires a user to handle the stream, while also relying on std::filesystem (which seems to be the only straightforward way to open an ifstream on UTF-8 path), which consequently requires C++17 support in the toolchain, etc. i.e. a lot of assumption on the user environment.

Keeping only "string" interface ensures that everything mentioned above can happen "under the hood" of the lib, and the lib can implement it the best way it would fit (using fstreams, fopen, etc.), without requiring any particular toolchain feature from the user's environment.

I am writing this in particular with regard to (#15675) which seems to be going the opposite way.

@asmorkalov
Copy link
Copy Markdown
Contributor

OpenCV team discussed the solution on weekly meeting. @alalek @vpisarev @asmorkalov @vladimir-dudnik @VadimLevin decided to close thie PR. The reason of the decision and alternatives are discussed in #4292 (comment) and #5631.

@asmorkalov asmorkalov closed this Mar 20, 2020
@risa2000
Copy link
Copy Markdown

risa2000 commented Mar 20, 2020

@asmorkalov Since you closed all the other discussions as well, I am not sure where to post, so I am leaving it here:
What about providing function overrides for file I/O which would accept std::filesystem::path instead of std::string?
I do not know what is current status of C++17 support in the toolchains opencv mostly uses, (I know std::filesystem is supported by latest MSVC and clang equally), but choosing this "standard" way should ensure forward compatibility while remaining platform agnostic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Supply imwrite/imread overloads to work with wstring, wchar_t Unicode Path/Filename for imread and imwrite

8 participants