Fix broken CP949 state machine#268
Conversation
2efbe00 to
b83279c
Compare
|
Do you have any test files that you could contribute that show that it failed before and would work now? |
|
I have added tests for coding state machines ( Drawbacks
Please let me know if there are something to improve, or if I have made some mistakes in the test code. |
|
I have limited testing charsets for the test case, which is added by this PR, as it would be better to be handled in another PR. |
a6fa082 to
9159b2c
Compare
|
Sorry for the long delay in getting back to this PR, but the additional tests you added here are really handy, and I am planning on using them to fix any other broken parts of the state machine. I have another PR that's kind of a catch-all for a big overhaul I'm doing on chardet at the moment (#99), so I'll likely change some things in this before the next release, but this is a great contribution so I'll finally merge it! |
Abstract
Current CP949 state machine has some false positives, and incorrectly marks valid CP949 texts as an error.
This PR rewrites the state transition table, to comply the CP949 Specification.
Details
These are some cases, which a false-positive error can occur in the current implementation.
춉(0xAD68)The first byte is classified as the class 8, as it is 0xAD. And in the START state, the class 8 makes an transition to the ERROR state. But this is a valid CP949.
힣(0xC652)The first byte is classified as the class 9, and the second byte is classified as the class 5. In the START state, the class 9 makes an transition to the State 6, and in the State 6, the class 5 makes an transition to the ERROR state. But this is a valid CP949.
Test
I have tested the state machine (To-Be) for the all characters in the CP949 with following code, and it successfully returned
Success.When I have tested it against the current implementation (As-Is), it shows
Error! at byte 15479.I couldn't upload the cp949 characters to the test fixtures folder, as it will make the test fail because of the frequency-based probing, which will not successfully mark it as the CP949. (Because it is just a plain listing of the all possible characters of the CP949.)