Update Python scripts by Martin005 · Pull Request #155 · unicode-rs/unicode-segmentation

Martin005 · 2026-03-18T13:13:43Z

Update unicode.py and unicode_gen_breaktests.py to write explicitly UTF-8 files with LF new lines
Remove semicolons from unicode.py

- Update to Unicode 17 and generate `tables.rs` + `testdata/mod`
- Update `unicode.py` and `unicode_gen_breaktests.py` to explicitly work with UTF-8 files with LF new lines
- Update `unicode_gen_breaktests.py` to write into `testdata/mod.rs` as the `testdata.rs` was moved into this new path
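For illustration, writing a generated file as UTF-8 with LF line endings regardless of platform can look roughly like the sketch below. The helper name `write_utf8_lf` is hypothetical and is not taken from `unicode.py`; only the `encoding` and `newline` arguments of the built-in `open()` are the point here.

```python
def write_utf8_lf(path, text):
    """Write text to path as UTF-8, keeping every "\n" as a bare LF.

    Hypothetical helper for illustration; not the code from this PR.
    """
    # newline="\n" turns off newline translation on write, so Windows
    # does not rewrite "\n" as "\r\n"; encoding pins the file to UTF-8
    # instead of the locale default.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)
```

Both keyword arguments are accepted by `open()` on every Python 3 release, so this part does not depend on Python 3.10.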

Manishearth · 2026-03-18T22:31:24Z

Hmm, the unicode.py update doesn't work on my local Ubuntu machine because it doesn't have Python 3.10 (which is needed for the encoding parameter). Can we skip that part?

Martin005 · 2026-03-19T11:17:41Z

> Hmm, the unicode.py update doesn't work on my local Ubuntu machine because it doesn't have Python 3.10 (which is needed for the encoding parameter). Can we skip that part?

The encoding and newline parameters of io.open() have been available since Python 3.0 - are you still running Python 2 on your machine? 😮

Manishearth · 2026-03-19T15:17:25Z

@Martin005 They're new on fileinput.input

Martin005 · 2026-03-19T16:03:56Z

> @Martin005 They're new on fileinput.input

Oh, you are right, it was added to fileinput.input() in Python 3.10. Sorry, I didn't check that, as I expected it to just pass the keywords to io.open(). I just removed it in 4e276b9 (this PR).
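For reference, one way to read input as UTF-8 through `fileinput` on interpreters older than 3.10 is the `openhook` argument together with `fileinput.hook_encoded()`, both of which predate the `encoding` keyword. The sketch below is only an illustration of that alternative; the helper name `read_lines_utf8` is hypothetical, and this PR did not take this route (the explicit encoding was simply removed).

```python
import fileinput


def read_lines_utf8(paths):
    """Return all lines from the given files, decoded as UTF-8.

    Hypothetical helper for illustration; not the code from this PR.
    """
    # hook_encoded() builds an opener that passes encoding="utf-8" to
    # open(), so no encoding= keyword on fileinput.input() is needed.
    with fileinput.input(files=paths,
                         openhook=fileinput.hook_encoded("utf-8")) as f:
        return [line.rstrip("\n") for line in f]
```

Note that `openhook` cannot be combined with `inplace=True`, and that an empty `paths` makes `fileinput` fall back to the command-line arguments or standard input.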

Manishearth · 2026-03-19T16:13:41Z

Either way, this isn't mergeable until the state machine is updated to fix the bugs. I think we have two bugs in the reverse state machine, one around the handling of ZWJ and emoji in word segmentation, the other around graphemes.

I started investigating in #156. I don't know if I'll have time to finish it; help appreciated.

I recommend splitting out the Python 3 changes into a landable PR, and we can separately keep looking into the test failures.

Martin005 · 2026-03-19T16:55:20Z

> Either way, this isn't mergeable until the state machine is updated to fix the bugs. I think we have two bugs in the reverse state machine, one around the handling of ZWJ and emoji in word segmentation, the other around graphemes.
>
> I started investigating in #156. I don't know if I'll have time to finish it; help appreciated.
>
> I recommend splitting out the Python 3 changes into a landable PR, and we can separately keep looking into the test failures.

@Manishearth Oh, yeah. Thanks for starting to work on the bugs uncovered by Unicode 17.
Makes sense - I updated the branch so that this PR only brings the updates to the Python scripts. Should I rework the branch so that there is only 1 commit, or are you going to squash these commits?
I will take a look at #156 and see if I can help 🙂

Manishearth · 2026-03-19T17:16:12Z

I'll squash

Manishearth approved these changes Mar 18, 2026

Fix path in test data verification script

12ff1d9

Manishearth approved these changes Mar 18, 2026

Manishearth reviewed Mar 18, 2026

Comment thread scripts/unicode_gen_breaktests.py Outdated

Fix test data file path in unicode break tests script

d3d87c8

Remove explicit UTF-8 encoding from fileinput.input() calls

4e276b9

Revert to Unicode 16

743a547

Martin005 changed the title ~~Support Unicode 17.0.0~~ Update Python scripts Mar 19, 2026

Manishearth approved these changes Mar 19, 2026

Manishearth merged commit 1441d3d into unicode-rs:master Mar 19, 2026
2 checks passed

Martin005 deleted the unicode-17 branch March 19, 2026 21:03

Manishearth merged 5 commits into unicode-rs:master from Martin005:unicode-17
