gh-130197: Test various encodings with pygettext #132244
tomasr8 wants to merge 4 commits into python:main from
Conversation
serhiy-storchaka
left a comment
I do not think that duplicating this test with multiple encodings is needed. It is enough to test with one encoding, and it should not be Latin-1 or Windows-1252, which are often the default encoding. The CPU time can be spent on other tests.
Please also add non-ASCII comments.
Finally, we need to add tests for non-ASCII filenames on a non-UTF-8 locale. I am afraid that i18n_data cannot be used for this; we need to try several locales with different encodings and generate an input file with a corresponding name.
We also need to test the stderr output for files with a non-ASCII file name and a non-ASCII source encoding on a non-UTF-8 locale. It contains a file name and may contain a fragment of the source text.
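The per-locale file generation suggested above could look roughly like the following sketch. Note this is a hypothetical helper, not code from the PR: `make_input_file` and the sample strings are illustrative, and a real test would iterate over candidate locales and skip those whose encoding cannot represent the name.

```python
import os
import tempfile

# Hypothetical helper (not part of the test suite): create an input file
# whose non-ASCII name and contents use a given locale encoding, skipping
# encodings that cannot represent the file name.
def make_input_file(dirname, stem, encoding):
    try:
        stem.encode(encoding)              # can the name be represented?
    except UnicodeEncodeError:
        return None                        # skip this locale/encoding
    path = os.path.join(dirname, stem + ".py")
    with open(path, "w", encoding=encoding) as f:
        f.write('_("h\u00e9llo")  # non-ASCII source text\n')
    return path

with tempfile.TemporaryDirectory() as tmp:
    print(make_input_file(tmp, "f\u00efle", "latin-1"))  # path is created
    print(make_input_file(tmp, "f\u00efle", "ascii"))    # None: name not representable
```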
For context: #131902 (comment)
We currently set the charset of the POT file to the default encoding on the system (`fp.encoding`):

cpython/Tools/i18n/pygettext.py, lines 574 to 576 in f5639d8
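The platform dependence can be illustrated with a small standalone sketch (this is not pygettext's code, just the same mechanism): a file opened in text mode without an explicit encoding gets the platform default, which is what `fp.encoding` then reports. `codecs.lookup` is used only to normalize encoding-name spellings for comparison.

```python
import codecs
import locale
import tempfile

# A text-mode file opened without an explicit encoding uses the
# platform default; fp.encoding exposes the resolved value.
with tempfile.TemporaryFile("w") as fp:
    default = fp.encoding

# Under `python -X utf8` (or PYTHONUTF8=1) this is always UTF-8;
# otherwise it follows the current locale.
print(default)
```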
To have reproducible tests regardless of the OS they are running on, we set `-X utf8` in the tests. As a consequence, the POT charset is always set to `utf-8`. I don't think there's an easy way to control that if we want to test other output encodings. At least with these tests we know that non-UTF-8 input files can be read correctly.

cc @serhiy-storchaka Let me know if this is what you had in mind for the tests!