Skip to content

Fix broken CP949 state machine#268

Merged
dan-blanchard merged 5 commits intochardet:mainfrom
HelloWorld017:feature/cp949-class8
Nov 13, 2025
Merged

Fix broken CP949 state machine#268
dan-blanchard merged 5 commits intochardet:mainfrom
HelloWorld017:feature/cp949-class8

Conversation

@HelloWorld017
Copy link
Copy Markdown
Contributor

@HelloWorld017 HelloWorld017 commented Jul 21, 2022

Abstract

Current CP949 state machine has some false positives, and incorrectly marks valid CP949 texts as an error.
This PR rewrites the state transition table, to comply the CP949 Specification.

Details

These are some cases, which a false-positive error can occur in the current implementation.

  • (0xAD68)
    The first byte is classified as the class 8, as it is 0xAD. And in the START state, the class 8 makes an transition to the ERROR state. But this is a valid CP949.

  • (0xC652)
    The first byte is classified as the class 9, and the second byte is classified as the class 5. In the START state, the class 9 makes an transition to the State 6, and in the State 6, the class 5 makes an transition to the ERROR state. But this is a valid CP949.

Test

I have tested the state machine (To-Be) for the all characters in the CP949 with following code, and it successfully returned Success.
When I have tested it against the current implementation (As-Is), it shows Error! at byte 15479.

from chardet.codingstatemachine import CodingStateMachine
from chardet.mbcssm import CP949_SM_MODEL

sm = CodingStateMachine(CP949_SM_MODEL)

with open('./path/to/cp949-chars.txt', 'rb') as f:
    data = f.read()

for i, byte in enumerate(data):
    state = sm.next_state(byte)

    if state == 1:
        print("Error! at byte %d" % i)
        break

if state != 1:
  print("Success! :)")

I couldn't upload the cp949 characters to the test fixtures folder, as it will make the test fail because of the frequency-based probing, which will not successfully mark it as the CP949. (Because it is just a plain listing of the all possible characters of the CP949.)

@dan-blanchard dan-blanchard force-pushed the feature/cp949-class8 branch from 2efbe00 to b83279c Compare July 22, 2022 19:58
@dan-blanchard
Copy link
Copy Markdown
Member

Do you have any test files that you could contribute that show that it failed before and would work now?

@HelloWorld017
Copy link
Copy Markdown
Contributor Author

HelloWorld017 commented Jul 25, 2022

I have added tests for coding state machines (test_coding_state_machine).
It's working for the most machines and points out the bug of the current CP949 state machine.

Drawbacks

  • I have commented out the ISO-2022-CN and EUCTW as the python does not supports them.
  • Actually, for the unicode machines, it seems that it fails. I'm not certain whether it is the bug of the state machine, or the bug of the test code.
    These are the case where they fails:
    • State 5 + u+1b00 for UTF-16LE
    • State 6 + u+1b00 for UTF-16BE
    • State START + u+10000 for UTF-8

Please let me know if there are something to improve, or if I have made some mistakes in the test code.

@HelloWorld017
Copy link
Copy Markdown
Contributor Author

I have limited testing charsets for the test case, which is added by this PR, as it would be better to be handled in another PR.

@dan-blanchard
Copy link
Copy Markdown
Member

Sorry for the long delay in getting back to this PR, but the additional tests you added here are really handy, and I am planning on using them to fix any other broken parts of the state machine. I have another PR that's kind of a catch-all for a big overhaul I'm doing on chardet at the moment (#99), so I'll likely change some things in this before the next release, but this is a great contribution so I'll finally merge it!

@dan-blanchard dan-blanchard merged commit 1355907 into chardet:main Nov 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants