
Retraining and storing data #48

@dan-blanchard

Currently we have a ton of encoding-specific data stored as constants all over the place. This was done to mirror the C code initially, but I propose that we start diverging from the C code in a more substantial way than we have in the past.

The problems I see with the current approach are:

  1. Storing large amounts of data in code makes it much more difficult to read and separate out which files are just data from those that contain actual encoding/prober-specific code.
  2. Retraining the models we have (which are currently based on data from the late 90s) is difficult, because we would have to write a script that generates Python code. Yuck.
  3. It makes the barrier to entry for adding support for new encodings higher than it should be. We should be able to have a tool that takes a bunch of text in a given encoding, generates the tables we need, and automatically determines things like the typical "positive ratio" (which is really the ratio of the token frequency of the 512 most common character bigram types to the total number of bigram tokens in a "typical" corpus). The current layout of the code is very confusing to a new contributor (see point 1).
  4. Because retraining is difficult, chardet is going to get less accurate over time. Speaking as an NLP researcher, I can confidently say that the genre of a text plays a big role in how likely certain character sequences are, and as time goes on the typical web text we see looks less and less like it did when Mozilla collected their original data. Also, our accuracy for text that isn't from webpages is probably not that great.
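To make the "positive ratio" in point 3 concrete, here is a minimal sketch of that calculation as defined above. This is just an illustration of the definition, not chardet's actual training code, and `positive_ratio` is a hypothetical name:

```python
from collections import Counter

def positive_ratio(text, top_k=512):
    """Ratio of the token frequency of the top_k most common character
    bigram types to the total number of bigram tokens in the text."""
    # Count every overlapping character bigram (type -> token count).
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(bigrams.values())
    if total == 0:
        return 0.0
    # Sum the token counts of the top_k most frequent bigram types.
    top = sum(count for _, count in bigrams.most_common(top_k))
    return top / total
```

A real training tool would run this over a large corpus per encoding and store the result alongside the frequency tables.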

So if we're in agreement that the current approach is bad, how do we want to fix it?

I propose that we:

  1. Store the data in either JSON or YAML formats in the GitHub repository. This would potentially allow us to share our data with chardet ports written in other languages (if they wanted to support our format).
  2. As part of the setup.py install process, convert the files to pickled dictionaries.
  3. Modify the prober initializers to take a path to either a pickled dictionary or a JSON/YAML file and load up that data at run-time. Supporting both types of file would simplify development, since we could play around with models without having to constantly convert them to pickles.
  4. Modify chardet.detect to cache its UniversalDetector object so that we don't constantly create new prober objects and reload the pickles.

The only problem I see with this approach is that it will slow down import chardet, but loading pickles is usually pretty fast.

@sigmavirus24, what do you think?
