Support of Unicode file paths under Windows#13368
Support of Unicode file paths under Windows#13368Baum55 wants to merge 11 commits intoopencv:masterfrom
Conversation
On Windows the file API (open, fopen) assumes a ANSI coding of the std::string (const char*). Other operating systems assumes a UTF-8 coding of the std::string. To fix this issue it is necessary to use the Unicode format with the std::wstring (const wchar*) variant of the functions. No functional change for other operating systems. Changes in the interface of "imgcodecs.hpp": - imread - imreadmulti - imwrite - imdecode - imencode - haveImageReader - haveImageWriter Functions overloaded with a std::wstring (Unicode) variant for the file paths. Changes in the interface of "imgcodecs.hpp" (only Windows): - setcodepage Possibility to change the code page for the std::string variant of the imgcodecs function. So it is possible that openCV assumes UTF-8 coding for std::string file paths. Internally, on Windows the file paths are std::wstring (Unicode). In other operating systems the file paths are std::string (UTF-8).
Linux/Max: - Copy and Paste error: function was twice declared Windows: - Static cast conversion from 'size_t' to 'int' - The second argument (dwFlags) from MultiByteToWideChar have to 0 if the code page is changed to UTF-8 Doc: - Tab in indent removed - Trailing whitespaces removed - Documentation extended with an example
Linux/Max - Wrong name of a variable All - Removed new setcodepage function
|
I have problems with the generation of the python bindings generated by the buildbot. To solve the issue, it is necessary to add std::wstring (at least for Windows). When I read the error messages from the buildbot, I see that this causes troubles. Is there a way to fix this? |
|
OpenCV is cross-platform library. Public API should be cross-platform, so no "#ifdef _WIN32" are allowed in public headers. Start from writing a sample to demonstrate this functionality. Without excessive "#if"s. |
Ok. I can remove the "#if"s from the public headers:
The support of Unicode paths on Windows will not be possible without excessive "#if"s in the source files. In the same way boost::filesystem::path handles file paths. std::wstring on Windows and std::string on everything else with a typedef for a "path" type who change depending on the operating system. The Windows file API for char* use an ANSI coding, only the wchar_t variant use Unicode. |
ifdef removed from public header - modules/core/include/opencv2/core/cvstd.hpp - modules/core/include/opencv2/core/utility.hpp
|
I remove all #ifdef from the public headers, but I still got the same message from the buildbot:
I assume that the reason is because I have overloaded imread, imwrite, etc. with a std::wstring variant. I have again thought about it intensively. Is there even a chance that the pull request will be accepted with the many #ifdef? |
|
What about single lower-level interface: |
Interesting approach. I will try it next week. |
ifdef concentrated in the new class Path The majority of the ifdef is now concentrated in the new Path class.
Mac: - 2 Function with the same signature Doc: - Trailing Whitespaces
Trailing Whitespaces removed
I tried it with |
Comments of the Path class extended
|
Despite that the buildbot has successfully passed the test, it still shows "Waiting for builds" and the status is yellow. "How to contribute" describe only the status red and green. Maybe it's because I accidentally clicked on "Close pull request" instead "Comment" button? |
|
@Baum55 , I meant that we could add lower-level interface allowing users to provide thier own content streams, not filenames. Pseudocode: Mat imread(std::istream & input, ...)
{
input.seekg(0, std::ios_base::end);
size_t fsize = input.tellg();
Mat bitstream(Size(fsize, 1), CV_8U);
input.seekg(0);
input.read(bitstream.data, fsize);
return imdecode(bitstream, ...);
}
// existing function can be reimplemented as follows
Mat imread(const char * filename, ...)
{
fstream input(filename, std::ios::binary);
if (!input) CV_Error(..., "File can not be opened");
return imread(input);
}Thus users will be able to open files using thier own methods, e.g. https://docs.microsoft.com/en-us/cpp/standard-library/basic-ifstream-class?view=vs-2017 : wchar * fname = L"example.png";
std::ifstream input(fname);
Mat image = imread(input); |
|
@mshabunin I misunderstood that, sorry, I have tested your proposal and it works fine. That would mean much smaller changes. Only in the files:
Currently, imwrite supports multi page images and imencode does not. imreadmulti should also take into account. But there are some disadvantage. The function imdecode and imencode creates a temporary file for the operation.
In a similar way that would work for imwrite. The temporary file would mean a performance loss for every imread and imwrite. In additional there would be an unnecessary writing process on the hard drive. I am not sure that the simplifications justifies the disadvantages. |
|
I believe changes in cpp files and python bindings are not necessary. New I think multipage support should be out of scope for now.
|
|
After my experiences when I added The simplifications come at a price and the purpose for which I made the pull request would omit. Disadvantages of the file stream solution:
|
|
Python/Java should be covered by separate activity if needed. |
|
I still do not understand the significant advantage of file streams in this case??? In the current solution with With the proposal of @mshabunin, only It's not that I have invented the solution with What are the significant benefits of file streams that offset the disadvantages compared to the current solution? |
alalek
left a comment
There was a problem hiding this comment.
Sample and test are necessary.
Changes in the 3rdparty code reverted. Unicode compatibility is preserved under Windows. jasper/grfmt_jpeg_2000: - 3rdparty lib can be called with a pointer to a FILE object openexr/grfmt_exr: - 3rdparty lib can be called with a filestream
Sample code for the read and write process of images added. Test added. The test is based on the existing code and uses the previously tested std::string variants to check the std::wstring functions.
|
Hi, first of all thanks. This solves some issue for windows for me as well. However, I'm not that happy about the implementation. This PR basically adds I see two better options than the current implementation:
Note that both approaches can be nicely used by multi-platform applications and it would also nicely work from Python. If you want, you can even do both. Cheers, |
|
I can do both if there is consensus that this is desired.
Instead of 2 overloads for each. It would look like: Honestly, this solution would fit even better to the needs of my application. Thanks for the suggestion. Edit for point 2: |
|
Great. Thanks for considering this. I agree that the decision now needs to be made by OpenCV core team. For option 1: It is non-compatible change, but in my experience it never breaks for reasonable strings in practice, paths especially. Of course, you need to do a fallback back to char* functions when UTF-8 -> UTF-16 conversion fails. This way it worked very nicely for me in multiple projects. Reasonable locale-encoded strings are almost never also valid UTF-8. Of course, this behaviour needs to be documented. For option 2: You would need to probably keep the std::string functions in OpenCV 4.0, to prevent breaking the ABI. Thanks again for leading this effort and implementing it. Cheers, |
|
@Baum55, first of all, I join @hrnr in thanking you for raising the issue and providing the solution. My vote is for keeping the API as-is and treating all the strings as UTF-8. If needed, we can introduce extra convenience function(s) to convert file paths encoded as wide strings to std::string and back. |
- API change in imgcodecs reverted - Undo changes that were no longer needed - Doku extended
|
@vpisarev You're welcome. In the version that I use for myself I already treat all the strings as UTF-8. I agree that this makes it easier and is better for multi-platform software. The UTF-8 approach also reduces the necessary changes. I have committed this version. In the file |
|
I would like to leave a comment here. Since UCRT 10.0.17134, i.e. Windows 10 April 2018 Update version 1803, you can issue Anyway, +1 for |
|
Just found this issue while looking for non-ASCII support for Now what I did not get, whether the internal implicit conversion from UTF-8 could be problem? Is it possible to use the current implementation with non-ASCII characters? Could there be a misinterpretation in such situation? If not, or not a problem, I would prefer, if OpenCV implements implicit conversion from UTF-8 internally, and let the client code pass UTF-8 strings. Using Keeping only "string" interface ensures that everything mentioned above can happen "under the hood" of the lib, and the lib can implement it the best way it would fit (using fstreams, fopen, etc.), without requiring any particular toolchain feature from the user's environment. I am writing this in particular with regard to (#15675) which seems to be going the opposite way. |
|
OpenCV team discussed the solution on weekly meeting. @alalek @vpisarev @asmorkalov @vladimir-dudnik @VadimLevin decided to close thie PR. The reason of the decision and alternatives are discussed in #4292 (comment) and #5631. |
|
@asmorkalov Since you closed all the other discussions as well, I am not sure where to post, so I am leaving it here: |
On Windows the file API (open, fopen) assumes a ANSI coding of the std::string (const char*). Other operating systems assumes a UTF-8 coding of the std::string. To fix this issue it is necessary to use the Unicode format with the std::wstring (const wchar*) variant of the functions.
No functional change for other operating systems.
The file names ("imgcodecs.hpp") are now UTF-8 encoded on all operating systems:
Internally, on Windows the file paths are std::wstring (Unicode).
In other operating systems the file paths are std::string (UTF-8).
resolves #4292
resolves #5631