add cchardet support and use chardet as fallback#4115
add cchardet support and use chardet as fallback#41152mf wants to merge 7 commits intopsf:masterfrom 2mf:master
Conversation
Lukasa
left a comment
There was a problem hiding this comment.
Yeah, we need to try to deduplicate this logic a lot. Do you mind doing that?
|
sure :-) |
requests/packages.py
Outdated
| if isinstance(_package, tuple): | ||
| package = _package[1] | ||
| try: | ||
| locals()[package] = __import__(package[0]) |
There was a problem hiding this comment.
Doesn't look like this line is correct, should be __import__(_package[0])? Might want to unpack the tuple right away and then use proper variable names so the logic is easier to follow.
Lukasa
left a comment
There was a problem hiding this comment.
There's an issue here with the warning that we're raising.
requests/__init__.py
Outdated
| assert minor >= 1 | ||
| assert patch >= 0 | ||
| except AssertionError: | ||
| warnings.warn('Requests dependency \'cchardet\' must be version >= 2.1.0! Falling back to chardet') |
There was a problem hiding this comment.
This isn't specifying the upper pin, which is < 3.0.0.
nateprewitt
left a comment
There was a problem hiding this comment.
Hey @2mf, thanks for putting this together! While I'm not opposed to supporting cchardet, I do have some questions/comments on the current implementation.
tox.ini
Outdated
| [testenv] | ||
|
|
||
| deps = | ||
| py{26,27,33,34,35,36}-chardet: chardet |
There was a problem hiding this comment.
I'm not super fond of doubling the number of test instances we're running.
There was a problem hiding this comment.
I'm not quite sure, how I can test against ccharset and charset using existing cases
requests/__init__.py
Outdated
|
|
||
| import warnings | ||
|
|
||
| def get_version(package): |
There was a problem hiding this comment.
If we are going to abstract this out to a function, maybe we add it to the urllib3 import to reduce the duplicated code to a single place?
requests/__init__.py
Outdated
| warnings.warn('Requests dependency \'cchardet\' must be version >= 2.1.0! Falling back to chardet') | ||
| raise | ||
|
|
||
| except (ImportError, SyntaxError, AssertionError) as e: |
There was a problem hiding this comment.
What's the scenario that raises a SyntaxError? Also if we're not using e for anything, we probably don't need to set it.
There was a problem hiding this comment.
I copied that from other places in requests, but from my view point it is not needed
| locals()[package] = __import__(package) | ||
| for package in ('urllib3', 'idna', ('chardet', 'cchardet')): | ||
| if isinstance(package, tuple): | ||
| package, alt_package = package |
There was a problem hiding this comment.
we need to be extremely careful about touching this code, at all.
There was a problem hiding this comment.
I agree. This looks safe, but it'd be so much better if this were testable.
|
This whole thing scares me. |
|
I for one am not wedded to doing it: I'm nervous about the amount of code it adds as well. |
|
@2mf do you have any metrics on how much faster cchardet is than chardet? e.g. is it worth the trouble? |
|
@kennethreitz a benchmark can be seen at https://pypi.python.org/pypi/cchardet
|
|
@kennethreitz @Lukasa shouldn't be the version matching in the requirements of setup.py? and |
|
That is significantly faster! |
remove unnecessary SyntaxError pin cchardet to >=2.1.0 <2.2.0
This reverts commit b231590.
No, @2mf it shouldn't. Requests needs very specific versions of dependencies. Some of those dependencies are popular and we need a way to tell the user "You're using a version that will horribly break your version of Requests" |
| assert minor < 2 | ||
| assert patch >= 0 | ||
| except AssertionError: | ||
| warnings.warn('Requests dependency \'cchardet\' must be version >= 2.1.0, < 2.2.0! Falling back to chardet') |
There was a problem hiding this comment.
This is not a dependency of Requests, it's an optional speed-up. Please revise
| # cchardet >= 2.1.0 | ||
| try: | ||
| assert major == 2 | ||
| assert minor >= 1 |
There was a problem hiding this comment.
Why not:
assert 1 <= minor < 2| except (ImportError, AssertionError): | ||
| # Check chardet for compatibility. | ||
| import chardet | ||
| major, minor, patch = get_version(chardet) |
There was a problem hiding this comment.
chardet shouldn't be imported in this except block. This will cause extremely confusing tracebacks for users if they can't import either.
| cmdclass={'test': PyTest}, | ||
| tests_require=test_requirements, | ||
| extras_require={ | ||
| 'cchardet:(python_version == "2.7" or python_version >= "3.4")': ['cchardet>=2.1.0'], |
There was a problem hiding this comment.
If we're enforcing (in __init__.py) that cchardet be between 2.1.0 and 2.2.0 then that should be written here.
| locals()[package] = __import__(package) | ||
| for package in ('urllib3', 'idna', ('chardet', 'cchardet')): | ||
| if isinstance(package, tuple): | ||
| package, alt_package = package |
There was a problem hiding this comment.
I agree. This looks safe, but it'd be so much better if this were testable.
And harder to install and the benchmark, as I understand it, is based off a very old version of chardet and hasn't been updated since. |
|
It sounds like we're against this change overall. Sorry @2mf. |
|
I'm on board with doing this, but the current methodology isn't 100% correct and I'm merely pointing out that those benchmarks are older and potentially misleading. @dan-blanchard has done a lot of fantastic work on chardet recently |
|
I feel the need to point out that chardet is in the midst of gaining support for more codecs and languages than cchardet supports. I'm retraining all the models for greater accuracy too. Speed isn't everything with encoding detection. |
|
Just for the sake of a recent benchmark
anyway, It is not worth to continue on this one while running against so many walls |
|
I hadn't actually looked at the benchmark code cChardet uses before, but it just runs the same small SJIS text file through both modules 100 times. I know cChardet will pretty much always be faster than chardet—in fact, if you look at old chardet issues, you'll even see that I once toyed with the idea of trying to merge the projects—but that is not an accurate benchmark, especially for something like requests where a good amount of the time is spent downloading the page content anyway. A requests-based benchmark would make sense for deciding if it makes a difference here. |
I created this pull-request for #4112