Tokenizer might encounter some encoding errors

For example, I obtained the following error when attempting to receive a response:

> 'charmap' codec can't encode character '\u0100' in position 2452: character maps to <undefined>


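The fix that works in my case:

Modify the `tokenizer.py` file by passing `encoding="utf-8"` to `open()` when reading or writing the `tokenizer_file`. Without an explicit encoding, `open()` falls back to the platform's locale encoding, which on Windows is typically a 'charmap' code page such as cp1252 that cannot represent characters like '\u0100'.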

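```python
def _get_cached_tokenizer_file_as_str() -> str:
    tokenizer_file = _get_tokenizer_filename()
    if not os.path.exists(tokenizer_file):
        # Download the tokenizer definition once and cache it on disk.
        response = httpx.get(CLAUDE_TOKENIZER_REMOTE_FILE)
        response.raise_for_status()
        # Write as UTF-8 explicitly instead of relying on the locale default.
        with open(tokenizer_file, 'w', encoding="utf-8") as f:
            f.write(response.text)

    # Read with the same explicit encoding used for writing.
    with open(tokenizer_file, 'r', encoding="utf-8") as f:
        return f.read()
```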
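
To show the failure mode outside the library, here is a minimal sketch. It assumes cp1252 as a stand-in for the locale default (the actual code page on an affected machine may differ), and the file name `tokenizer_demo.json` and the sample text are made up for the demo.

```python
import os
import tempfile

# Sample text containing U+0100, which has no mapping in cp1252.
text = "token \u0100 example"

# Hypothetical file name, written to a throwaway temp directory.
path = os.path.join(tempfile.mkdtemp(), "tokenizer_demo.json")

# Name cp1252 explicitly to imitate a Windows-style locale default.
try:
    with open(path, "w", encoding="cp1252") as f:
        f.write(text)
    failed = False
    message = ""
except UnicodeEncodeError as exc:
    failed = True
    message = str(exc)

# With UTF-8 the same text is written and read back unchanged.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

print("cp1252 write failed:", failed)
print("utf-8 round trip ok:", round_tripped == text)
```

The cp1252 write raises `UnicodeEncodeError`, the same kind of 'charmap' error quoted above, while the UTF-8 write and read succeed. Passing the encoding on both the write and the read matters: a file written as UTF-8 but read with the locale default can still fail or be decoded incorrectly.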

Tokenizer might encounter some encoding errors #18
