gh-124188: Fix PyErr_ProgramTextObject() #124189
serhiy-storchaka merged 3 commits into python:main
Conversation
* Detect source file encoding.
* Use the "replace" error handler even for UTF-8 (default) encoding.
* Remove the BOM.
* Fix detection of too long lines if they contain NUL.
* Return the head rather than the tail for truncated long lines.
```python
if isinstance(source, str):
    with open(TESTFN, 'w', encoding='utf-8') as testfile:
        testfile.write(dedent(source))
else:
    with open(TESTFN, 'wb') as testfile:
        testfile.write(source)
```
Should this be replaced by a call to script_helper.make_script?
I considered this option, but I doubt it would make the code simpler. This code calls dedent() for string sources, so you need branching in any case. The rest of make_script() is not useful here: you would need to pass os.curdir as the first argument, the .py suffix is not relevant here, and neither is the importlib cache invalidation.
```c
} while (*pLastChar != '\0' && *pLastChar != '\n');
const char *line = linebuf;
/* Skip BOM. */
if (lineno == 1 && line_size >= 3 && memcmp(line, "\xef\xbb\xbf", 3) == 0) {
```
Maybe #define BOM_PREFIX "\xef\xbb\xbf"?
This string occurs only once in the sources. I do not think that defining a name for it (potentially in a place far from its use) would make the code clearer. If the comment is not enough, suggest a better comment.
```c
extern char* _PyTokenizer_FindEncodingFilename(int, PyObject *);


PyObject *
_PyErr_ProgramDecodedTextObject(PyObject *filename, int lineno, const char* encoding)
```
Should this be _PyErr_DecodedProgramTextObject?
This was already here.
```python
    ],
),
('assert 1 > 2, "message"',
('assert 1 > 2, "messäge"',
```
These tests are not relevant to compiler errors. I updated/added them just to test that tracebacks for other errors also extract non-ASCII lines correctly.
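A standalone way to exercise that traceback path (a sketch, not the PR's actual test code): run a script containing a failing assert with a non-ASCII message in a subprocess and check that the traceback reproduces the source line intact.

```python
import os
import subprocess
import sys
import tempfile

# Sketch (not the PR's actual test): run a failing assert with a
# non-ASCII message and check the traceback shows the source line.
src = 'assert 1 > 2, "messäge"\n'
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False,
                                 encoding="utf-8") as f:
    f.write(src)
    path = f.name
try:
    # Force UTF-8 stdio in the child so stderr decodes predictably.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    proc = subprocess.run([sys.executable, path], capture_output=True,
                          encoding="utf-8", env=env)
finally:
    os.unlink(path)
assert proc.returncode != 0
# The traceback must contain the non-ASCII source line verbatim.
assert 'assert 1 > 2, "messäge"' in proc.stderr
```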
A test fails on WASI:
Oh, it's a bug in unittest. It calls str() on a bytes object indirectly. Workaround:

```diff
diff --git a/Lib/unittest/case.py b/Lib/unittest/case.py
index 55c79d35353..4e00267e77c 100644
--- a/Lib/unittest/case.py
+++ b/Lib/unittest/case.py
@@ -1455,6 +1455,8 @@ class _SubTest(TestCase):
     def __init__(self, test_case, message, params):
         super().__init__()
+        if message is not _subtest_msg_sentinel and not isinstance(message, str):
+            message = repr(message)
         self._message = message
         self.test_case = test_case
         self.params = params
```
Thanks @serhiy-storchaka for the PR 🌮🎉. I'm working now to backport this PR to 3.12 and 3.13.
(cherry picked from commit e2f7107)
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Sorry, @serhiy-storchaka, I could not cleanly backport this to

GH-124423 is a backport of this pull request to the 3.13 branch.
GH-124426 is a backport of this pull request to the 3.12 branch.
That's a big piece of work. While the _PyErr_ProgramDecodedTextObject() change is short, you added tons of tests for it: thanks.
Because it contained tons of bugs. Not all tests were backported to 3.12, because they exposed other bugs not fixed in 3.12.