Update Python scripts #155
- Update to Unicode 17 and generate `tables.rs` + `testdata/mod`
- Update `unicode.py` and `unicode_gen_breaktests.py` to explicitly work with UTF-8 files with LF new lines
- Update `unicode_gen_breaktests.py` to write into `testdata/mod.rs` as the `testdata.rs` was moved into this new path
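For readers unfamiliar with what "explicitly work with UTF-8 files with LF new lines" means in Python, here is a minimal sketch. It is not the code from this PR: the helper names `write_utf8_lf` and `read_utf8` are hypothetical and only illustrate the idea. It relies on the built-in `open()`, whose `encoding` and `newline` parameters are available in all Python 3 versions.

```python
import os
import tempfile


def write_utf8_lf(path, text):
    """Write text as UTF-8, keeping "\n" as-is on every platform.

    Hypothetical helper for illustration; not taken from the PR.
    """
    # newline="\n" turns off the platform-specific translation of "\n"
    # (for example to "\r\n" on Windows) when writing.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)


def read_utf8(path):
    """Read a UTF-8 file without translating its line endings."""
    # newline="" returns line endings exactly as stored in the file.
    with open(path, "r", encoding="utf-8", newline="") as f:
        return f.read()


if __name__ == "__main__":
    # Small demonstration using a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.txt")
        write_utf8_lf(target, "caf\u00e9\nsecond line\n")
        print(repr(read_utf8(target)))
```

Passing both parameters explicitly makes the generated files independent of the locale and operating system of whoever runs the scripts.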
Hmm, the `unicode.py` update doesn't work on my local Ubuntu machine because it doesn't have Python 3.10 (which is needed for the encoding parameter). Can we skip that part?
The |
@Martin005 They're new on |
Oh, you are right, it was added in Python 3.10. Sorry, I didn't check that as I expected it to just pass the keywords to |
Either way, this isn't mergeable until the state machine is updated to fix the bugs. I think we have two bugs in the reverse state machine, one around the handling of ZWJ and emoji in word segmentation, the other around graphemes. I started investigating in #156. I don't know if I'll have time to finish it, help appreciated. I recommend splitting out the Python 3 changes into a landable PR, and we can separately keep looking into the test failures.
@Manishearth Oh, yeah. Thanks for starting to work on the bugs uncovered by Unicode 17. |
I'll squash |
`unicode.py` and `unicode_gen_breaktests.py` to write explicitly UTF-8 files with LF new lines
`unicode.py`