Tokenizer might encounter some encoding errors #18

@mesax1

I got the following error when attempting to receive a response:

'charmap' codec can't encode character '\u0100' in position 2452: character maps to <undefined>

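The fix that works in my case:

Modify tokenizer.py to pass encoding="utf-8" when opening tokenizer_file for reading or writing. Without it, open() falls back to the platform's locale encoding, which on Windows is often a charmap codec such as cp1252 that cannot represent '\u0100'.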

    tokenizer_file = _get_tokenizer_filename()
    if not os.path.exists(tokenizer_file):
        # Download the tokenizer once and cache it on disk as UTF-8.
        response = httpx.get(CLAUDE_TOKENIZER_REMOTE_FILE)
        response.raise_for_status()
        with open(tokenizer_file, 'w', encoding="utf-8") as f:
            f.write(response.text)

    # Read the cached file back with the same explicit encoding.
    with open(tokenizer_file, 'r', encoding="utf-8") as f:
        return f.read()
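
For anyone who wants to reproduce this outside the library, here is a minimal sketch. It assumes the failing machine defaults to a charmap codec such as cp1252; that codec is passed explicitly here so the behaviour is the same on any platform. The names `SAMPLE`, `can_write`, and `round_trip_utf8` are hypothetical helpers for illustration and are not part of tokenizer.py.

```python
import os
import tempfile

# U+0100 is the character from the error message; cp1252 has no mapping for it.
SAMPLE = "token \u0100 example"


def can_write(path: str, text: str, encoding: str) -> bool:
    """Return True if `text` can be written to `path` using `encoding`."""
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        return True
    except UnicodeEncodeError:
        # Same failure class as the "'charmap' codec can't encode" error.
        return False


def round_trip_utf8(path: str, text: str) -> str:
    """Write `text` as UTF-8 and read it back with the same encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "sample.json")
        print("cp1252 write ok:", can_write(target, SAMPLE, "cp1252"))
        print("utf-8 write ok:", can_write(target, SAMPLE, "utf-8"))
        print("round trip matches:", round_trip_utf8(target, SAMPLE) == SAMPLE)
```

Passing the encoding on both the write and the read matters: a file written as UTF-8 but read back with the locale default can still fail or return garbled text.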
