Use UTF-8 encoding for nginx plugin#5341
Use UTF-8 encoding for nginx plugin#5341g6123 wants to merge 4 commits intocertbot:masterfrom g6123:master
Conversation
|
Is there a manual workaround? I have tried keeping the Nginx config file as small as possible, but still get the error on Arch Linux. |
|
@greencopper It is not really clean way, though you can manually apply the changes to the package. The path would be |
|
I was running into this problem for weeks, got the workaround thrown at me to find unicode characters in files. Used the command |
|
@ohemorange, would you like to review this instead? |
|
Just installed via instruction But, get the same error... 2018-06-10 21:03:05,538:DEBUG:certbot.error_handler:Calling registered functions ========= How to upgrade the certbot to 0.26.0? |
|
I'm a bit nervous about the possibility that some people's nginx files might also contain binary data or textual data in an encoding other than UTF-8, in which case the Python UTF-8 codec might still crash when trying to read these files. For example, using Markus Kuhn's stress test file at https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt if we say we will get a crash because, indeed, the file isn't valid UTF-8. We might want to say that nginx configuration files must be valid UTF-8 in order for Certbot to parse them (which seems like a perfectly valid kind of requirement), but in that case we'd probably still want to give a more useful error in this context, maybe like telling the user I could imagine that there might be people using other character sets in comments. For instance, suppose a Chinese speaker uses BIG5 encoding in comments. In that case the line
would be encoded with bytes [35, 32, 166, 64, 10]. Unfortunately this is not only not UTF-8 but also has no interpretation as UTF-8.
ends up giving
😢 |
|
@g6123, thank you for doing this work. Do you think you could revise this to catch errors when the file contains non-UTF-8 data? I think we can assume that we can reject such files (rather than adding further complexity about detection of encodings) but we will probably want to give a useful error message to the user in such cases. I will also ask my colleagues to confirm that they agree with this path. |
schoen
left a comment
There was a problem hiding this comment.
I think we'll still need to explicitly catch errors when the files aren't valid as UTF-8, even though perhaps we don't need to proceed in such cases.
|
@schoen does nginx itself crash on that test file? we just want to match nginx's behavior, not catch everything |
|
@ohemorange, sad to say, I've just created an nginx configuration file with a comment containing a Chinese character in BIG5 encoding instead of UTF-8, and nginx happily parsed and accepted this file. But Python3 refuses to read the content of that same file into a UTF-8 string and instead crashes with a So I think we do need to deal with this case after all, or else at least some users may continue to get Unicode-related errors even with the improvement made by this PR! (I would still be OK with having Certbot refuse to process web server configuration files that aren't valid UTF-8, but I think it needs to give a useful error in this situation explaining what happened.) |
|
So, we've concluded that we don't necessarily need to match nginx's behavior but we should at least give a friendly error message rather than an error traceback when a non-UTF-8-valid string is encountered in a configuration file. After all, nginx does accept these strings at least inside comments (and maybe they can also be served in web headers or something). The error message would explain that Certbot is only able to modify nginx configuration files that are valid UTF-8, and indicate which particular file failed to be parsed. Maybe we could also open a wishlist issue to ask for more closely matching nginx's behavior in this regard (especially if nginx turns out to accept other string encodings outside of comment lines). @g6123, are you still interested in working on this issue? |
|
Fully agree with @schoen analysis. We cannot reasonably create a fully compatible nginx conf parser+writer in Python, this would be too much work to handle all the corner cases of the original nginx parser in C. A pragmatic fallback though, is to fail gracefully with a useful explanation for the user. Having a line number would be a neat addition. |
|
I also agree with @schoen. @ohemorange I'm working on it, but it would take some time, since I'm not actively developing Python programs. I've added commits that display warning message and return empty result, with some naive tests. If anyone is afford to take this task over, that's perfectly okay. |
|
Ok, be sure to let us know if there's anything we could do to help move it along, including questions or thoughts or anything. |
|
@ohemorange okay, I guess I'm not. You can take it over. |
|
Sorry if I let you folks under the wrong impression but it wasn't my intention to contribute on this. Hope this can be fixed someday though! |
Updated nginx plugin parser to always use UTF-8 encoding with Python (3.x, 2.x) standard library
codecs.This fixes the issue #5337.