Update Python scripts #155

Merged
Manishearth merged 5 commits into unicode-rs:master from Martin005:unicode-17 on Mar 19, 2026
@Martin005 Martin005 commented Mar 18, 2026

Contributor
  • Update unicode.py and unicode_gen_breaktests.py to explicitly write UTF-8 files with LF line endings
  • Remove semicolons from unicode.py

- Update to Unicode 17 and generate `tables.rs` + `testdata/mod`
- Update `unicode.py` and `unicode_gen_breaktests.py` to explicitly work with UTF-8 files with LF line endings
- Update `unicode_gen_breaktests.py` to write into `testdata/mod.rs`, because `testdata.rs` was moved to this new path
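
For illustration, writing a generated file as UTF-8 with LF line endings regardless of platform could look roughly like the sketch below. The helper name `write_utf8_lf` is hypothetical and is not taken from the scripts in this PR; it only shows the `encoding` and `newline` arguments of the built-in `open()`.

```python
def write_utf8_lf(path, text):
    """Write text to path as UTF-8, keeping "\n" line endings as-is.

    Hypothetical helper for illustration only.
    """
    # newline="\n" turns off the translation of "\n" to os.linesep,
    # so the output is byte-identical on Linux, macOS, and Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)
```

Without `newline="\n"`, a run on Windows would emit CRLF and produce a noisy diff in the generated files.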
@Manishearth

Member

Hmm, the unicode.py update doesn't work on my local Ubuntu machine because it doesn't have Python 3.10 (which is needed for the encoding parameter). Can we skip that part?

Comment thread scripts/unicode_gen_breaktests.py Outdated
@Martin005

Contributor Author

> Hmm, the unicode.py update doesn't work on my local Ubuntu machine because it doesn't have Python 3.10 (which is needed for the encoding parameter). Can we skip that part?

The encoding and newline parameters of io.open() have been in Python since Python 3.0 - are you still running Python 2 on your machine? 😮

@Manishearth

Member

@Martin005 They're new on fileinput.input

@Martin005

Contributor Author

> @Martin005 They're new on fileinput.input

Oh, you are right, it was added in Python 3.10. Sorry, I didn't check that as I expected it to just pass the keywords to io.open(). Just removed it in 4e276b9 (this PR)
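
For reference, one way to keep UTF-8 decoding with `fileinput` on interpreters older than 3.10 is the `openhook` argument together with `fileinput.hook_encoded`. This is only a sketch of that alternative, not what the PR ended up doing (the `encoding` argument was simply removed); `open_utf8_input` and `read_utf8_lines` are hypothetical names.

```python
import fileinput
import sys


def open_utf8_input(files):
    """Return a fileinput stream that decodes the given files as UTF-8.

    Hypothetical helper for illustration only.
    """
    if sys.version_info >= (3, 10):
        # The encoding keyword of fileinput.input() exists from Python 3.10 on.
        return fileinput.input(files, encoding="utf-8")
    # On older interpreters, hook_encoded() supplies the encoding instead.
    return fileinput.input(files, openhook=fileinput.hook_encoded("utf-8"))


def read_utf8_lines(files):
    """Read all lines from the given files, decoded as UTF-8."""
    with open_utf8_input(files) as stream:
        return list(stream)
```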

@Manishearth

Member

Either way, this isn't mergeable until the state machine is updated to fix the bugs. I think we have two bugs in the reverse state machine, one around the handling of ZWJ and emoji in word segmentation, the other around graphemes.

I started investigating in #156. I don't know if I'll have time to finish it, help appreciated.

I recommend splitting out the Python3 changes into a landable PR, and we can separately keep looking into the test failures.

@Martin005 Martin005 changed the title from "Support Unicode 17.0.0" to "Update Python scripts" on Mar 19, 2026
@Martin005

Contributor Author

> Either way, this isn't mergeable until the state machine is updated to fix the bugs. I think we have two bugs in the reverse state machine, one around the handling of ZWJ and emoji in word segmentation, the other around graphemes.
>
> I started investigating in #156. I don't know if I'll have time to finish it, help appreciated.
>
> I recommend splitting out the Python3 changes into a landable PR, and we can separately keep looking into the test failures.

@Manishearth Oh, yeah. Thanks for starting to work on the bugs uncovered by Unicode 17.
Makes sense - I updated the branch so that this PR only brings the updates to the Python scripts. Should I squash the commits into a single commit, or are you going to squash them on merge?
I will take a look at #156 and see if I can help 🙂

@Manishearth

Member

I'll squash

@Manishearth Manishearth merged commit 1441d3d into unicode-rs:master Mar 19, 2026
2 checks passed
@Martin005 Martin005 deleted the unicode-17 branch March 19, 2026 21:03