Skip to content

Add type annotations to the project and run mypy on CI#261

Merged
dan-blanchard merged 1 commit intochardet:masterfrom
jdufresne:mypy
Jun 29, 2022
Merged

Add type annotations to the project and run mypy on CI#261
dan-blanchard merged 1 commit intochardet:masterfrom
jdufresne:mypy

Conversation

@jdufresne
Copy link
Copy Markdown
Contributor

This helps consumers of chardet ensure that they are using the provided
API correctly. The project includes a py.typed file for PEP-561
compliance.

This also helps ensure internal correctness and consistency. Adding the
static type checking caught at least one suspect pattern in
chardet/hebrewprober.py where an int value was compared to
string (probably leftover from Python 2 support).

Type checking will run on all pull requests through GitHub actions and
pre-commit.

Copy link
Copy Markdown
Member

@dan-blanchard dan-blanchard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a great contribution, but I think some of the types aren't quite right. Once those are fixed, we should be good to go though.



def detect(byte_str):
def detect(byte_str: str) -> ResultDict:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be Union[bytes, bytearray], not str. This function explicitly does not accept str.

self._freq_chars = 0

def feed(self, char, char_len):
def feed(self, char: Sequence[int], char_len: int) -> None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most of the places you have Sequence[int] should be Union[bytes, bytearray].

return self._best_guess_prober.language

def feed(self, byte_str):
def feed(self, byte_str: bytes) -> int:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every byte_str argument should be Union[bytes, bytearray].

self._state = None
_state: int

def __init__(self, lang_filter: int = 0) -> None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lang_filter should be LanguageFilter. Although, the default of 0 probably makes that tricky. Honestly, all the enums in the chardet.enums module should probably be overhauled and made into proper Python 3 Enum or Flag types.

self._freq_chars = None
# Mapping table to get frequency order from char order (get from
# GetOrder())
_char_to_freq_order: Tuple[int, ...]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you getting rid of all the attribute initializations from __init__ (here and elsewhere)? In local testing, that would make it an AttributeError to try to access any of those.

return "Japanese"

def feed(self, byte_str):
def feed(self, byte_str: bytes) -> int:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This int is really a ProbingState.

@jdufresne
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review and great feedback! I believe I have applied all suggestions and this is ready for another round.

Copy link
Copy Markdown
Member

@dan-blanchard dan-blanchard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting so much time into this project! This is mostly looking good. Just a few minor complaints at this point.

Also, you probably want to rebase from master, because I added a new prober file.

self.probers = []
self._best_guess_prober = None
self.probers: List[CharSetProber] = []
self.active: Dict[CharSetProber, bool] = {}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this dictionary when the probers already all have an active property?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One was not defined in charsetprober.py, so mypy reported an error. I reverted this change and added the missing property to the parent class.

filtered = bytearray()
in_tag = False
prev = 0
buf = memoryview(buf).cast("c")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this was before was added specifically as a performance improvement in #252. Calling ord() every time will slow this down a little.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks. This was done to workaround typeshed bug python/typeshed#8182

I'll revert the change and then ignore the false positive.

chardet/enums.py Outdated


class ProbingState:
class ProbingState(Flag):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and the other ones in this file except for LanguageFilter) should inherit from Enum, because they're not meaningfully combinable by bitwise operations.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I changed this to an Enum in the latest revision.

This helps consumers of chardet ensure that they are using the provided
API correctly. The project includes a py.typed file for PEP-561
compliance.

This also helps ensure internal correctness and consistency. Adding the
static type checking caught at least one suspect pattern in
chardet/hebrewprober.py where an int value was compared to
string (probably leftover from Python 2 support).

Type checking will run on all pull requests through GitHub actions and
pre-commit.
@dan-blanchard dan-blanchard merged commit c4f7057 into chardet:master Jun 29, 2022
@jdufresne jdufresne deleted the mypy branch July 8, 2022 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants