Convert single-byte charset probers to use nested dicts for language models by dan-blanchard · Pull Request #121 · chardet/chardet

dan-blanchard · 2017-04-24T19:52:47Z

This isolates one of the major changes in #99: changing our single-byte charset prober language model format to use nested dicts instead of giant lists and offset math. This makes the code much easier to understand, and language model access takes about 60% of the time it used to (99.4 µs versus 168 µs in the timings below).

# Current
In [40]: %timeit [lang_model_tuple[(i * 64) + j % 64] for i, j in random_index_tuples]
10000 loops, best of 3: 168 µs per loop

# Same as current but language model is a list instead of a tuple
In [41]: %timeit [lang_model_list[(i * 64) + j % 64] for i, j in random_index_tuples]
10000 loops, best of 3: 170 µs per loop

# Single dictionary, but keys are tuples
In [42]: %timeit [lang_model_tup_dict[(i, j)] for i, j in random_index_tuples]
10000 loops, best of 3: 140 µs per loop

# Nested dictionary like in this PR
In [43]: %timeit [lang_model_nested_dict[i][j] for i, j in random_index_tuples]
10000 loops, best of 3: 99.4 µs per loop
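For anyone who wants to rerun a comparison like the one above, here is a rough sketch of how the four containers could be built. The table contents, the 64x64 shape (inferred from the `i * 64` offset math), and the helper names `lookup_flat`, `lookup_tup_dict`, `lookup_nested`, and `run_benchmark` are assumptions for illustration; only the `lang_model_*` and `random_index_tuples` names come from the session above. Absolute timings will differ by machine and Python version.

```python
import random
import timeit

WIDTH = 64  # assumed row width, inferred from the `i * 64` offset math

random.seed(0)
# Hypothetical stand-in data: a 64x64 table of small integer categories.
flat = [random.randint(0, 3) for _ in range(WIDTH * WIDTH)]

lang_model_tuple = tuple(flat)
lang_model_list = list(flat)
lang_model_tup_dict = {
    (i, j): flat[i * WIDTH + j] for i in range(WIDTH) for j in range(WIDTH)
}
lang_model_nested_dict = {
    i: {j: flat[i * WIDTH + j] for j in range(WIDTH)} for i in range(WIDTH)
}

random_index_tuples = [
    (random.randrange(WIDTH), random.randrange(WIDTH)) for _ in range(1000)
]


def lookup_flat(model):
    # Flat sequence access with offset math, as in the "Current" timing.
    return [model[(i * 64) + j % 64] for i, j in random_index_tuples]


def lookup_tup_dict():
    # Single dictionary keyed by (i, j) tuples.
    return [lang_model_tup_dict[(i, j)] for i, j in random_index_tuples]


def lookup_nested():
    # Nested dictionary: outer key i, inner key j.
    return [lang_model_nested_dict[i][j] for i, j in random_index_tuples]


def run_benchmark(number=200):
    cases = [
        ("tuple", lambda: lookup_flat(lang_model_tuple)),
        ("list", lambda: lookup_flat(lang_model_list)),
        ("tuple-keyed dict", lookup_tup_dict),
        ("nested dict", lookup_nested),
    ]
    for label, fn in cases:
        seconds = timeit.timeit(fn, number=number)
        print(f"{label}: {seconds / number * 1e6:.1f} µs per loop")


if __name__ == "__main__":
    run_benchmark()
```

All four layouts return the same values for the same index pairs, so any difference the benchmark shows is purely access cost.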

The language model conversion script I've included in this PR does not need to stick around in master long term; I just wanted it here for review, since looking through the code that converts the language models and seeing if that looks right is much easier than visually comparing giant language model files.
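To give a sense of what such a conversion involves, here is a minimal sketch, not the script included in this PR. It assumes the old format is a flat sequence read row by row with a fixed width of 64; `flat_to_nested` and `nested_to_flat` are hypothetical helper names. The round trip back to the flat form is one way to check a converted model without comparing the files by eye.

```python
def flat_to_nested(flat_model, width=64):
    """Turn a flat, row-major sequence into a dict of dicts.

    Assumes flat_model[(i * width) + j] holds the value for the pair (i, j).
    """
    if len(flat_model) % width != 0:
        raise ValueError("flat model length must be a multiple of width")
    rows = len(flat_model) // width
    return {
        i: {j: flat_model[(i * width) + j] for j in range(width)}
        for i in range(rows)
    }


def nested_to_flat(nested_model, width=64):
    """Inverse of flat_to_nested, used here only to verify the conversion."""
    return tuple(
        nested_model[i][j]
        for i in range(len(nested_model))
        for j in range(width)
    )


# Hypothetical two-row model, just to exercise the round trip.
old_model = tuple(range(128))
new_model = flat_to_nested(old_model)
assert new_model[1][3] == old_model[(1 * 64) + 3]
assert nested_to_flat(new_model) == old_model
```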

I'm still seeing some test failures on this branch where Hungarian is being over-predicted, so this isn't quite ready to merge yet, but I figured that if I put it up here, someone might notice something I missed.

sigmavirus24

So, the only performance benefit to using iteritems anywhere comes when you have a dictionary with millions of items. If that is where we are (GitHub won't show these diffs), then that's fine; otherwise, I'd rather we just use .items() everywhere. Either way, this looks great. 🎉 🍰 ✨
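As a small illustration of the `.items()` suggestion: on Python 2, `dict.iteritems()` returns an iterator while `dict.items()` builds a list, and on Python 3 only `.items()` exists and returns a view. A loop written with `.items()` therefore runs unchanged on both. The sketch below uses a tiny made-up model and a hypothetical `count_categories` helper; it is not code from this PR.

```python
# Tiny hypothetical nested model: outer key -> inner key -> category.
lang_model_nested_dict = {
    0: {0: 3, 1: 1},
    1: {0: 0, 1: 3},
}


def count_categories(model):
    """Count how often each category value appears in a nested model."""
    counts = {}
    # .items() works on both Python 2 and Python 3; for a table of a few
    # thousand entries the extra list built on Python 2 is negligible.
    for _first, row in model.items():
        for _second, category in row.items():
            counts[category] = counts.get(category, 0) + 1
    return counts


category_counts = count_categories(lang_model_nested_dict)
```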

…models (#121)
* Convert single byte charset modules to use dicts of dicts for language modules
  - Also provide conversion script
* Fix debug logging check
* Keep Hungarian commented out until we retrain

gimyjendirx · 2021-01-20T07:38:14Z

This isolates one of the major changes in #99, which is changing our single-byte charset prober language model format to use nested dicts instead of giant lists and offset math. This makes the code much easier to understand and language model access takes about 60% of the time it used to.

# Current
In [40]: %timeit [lang_model_tuple[(i * 64) + j % 64] for i, j in random_index_tuples]
10000 loops, best of 3: 168 µs per loop

# Same as current but language model is a list instead of a tuple
In [41]: %timeit [lang_model_list[(i * 64) + j % 64] for i, j in random_index_tuples]
10000 loops, best of 3: 170 µs per loop

# Single dictionary, but keys are tuples
In [42]: %timeit [lang_model_tup_dict[(i, j)] for i, j in random_index_tuples]
10000 loops, best of 3: 140 µs per loop

# Nested dictionary like in this PR
In [43]: %timeit [lang_model_nested_dict[i][j] for i, j in random_index_tuples]
10000 loops, best of 3: 99.4 µs per loop

The language model conversion script I've included in this PR does not need to stick around in master long term; I just wanted it here for review, since looking through the code that converts the language models and seeing if that looks right is much easier than visually comparing giant language model files.

I'm still seeing some test failures on this branch where Hungarian is being over-predicted, so this isn't quite ready to merge yet, but I figured that if I put it up here, someone might notice something I missed.

Marcopolo

Convert single byte charset modules to use dicts of dicts for languag…

02066af

…e modules - Also provide conversion script

dan-blanchard requested a review from sigmavirus24 April 24, 2017 19:52

dan-blanchard changed the title ~~Convert single-byte charset probers to use dicts of dicts for language models~~ Convert single-byte charset probers to use nested dicts for language models Apr 24, 2017

Fix debug logging check

b3ef05e

sigmavirus24 approved these changes Apr 25, 2017

Keep Hungarian commented out until we retrain

a49fdf5

dan-blanchard merged commit 6aeaeb4 into master Apr 27, 2017

dan-blanchard deleted the feature/new_style_sbcs_models branch April 27, 2017 17:44

This was referenced Mar 8, 2021

Bump chardet from 3.0.4 to 4.0.0 thermondo/stanley#816

Closed

build(deps): bump chardet from 3.0.4 to 4.0.0 negillett/exodus-gw#67

Closed

dan-blanchard merged 3 commits into master from feature/new_style_sbcs_models
