
Retraining and storing data #48

@dan-blanchard

Currently we have a ton of encoding-specific data stored as constants all over the place. This was done to mirror the C code initially, but I propose that we start diverging from the C code in a more substantial way than we have in the past.

The problems I see with the current approach are:

  1. Storing large amounts of data in code makes it much more difficult to read and separate out which files are just data from those that contain actual encoding/prober-specific code.
  2. Retraining the models we have (which are currently based on data from the late 90s) is difficult, because we would have to write a script that generates Python code. Yuck.
  3. It makes the barrier to entry for adding support for new encodings higher than it should be. We should be able to have a tool that takes a bunch of text in a given encoding, generates the tables we need, and automatically determines things like the typical "positive ratio" (which is really the ratio of the token frequency of the 512 most common character bigram types to the total number of bigram tokens in a "typical" corpus). The current layout of the code is very confusing to a new contributor (see point 1).
  4. Because retraining is difficult, chardet is going to get less accurate over time. Speaking as an NLP researcher, I can confidently say that the genre of a text plays a big role in how likely certain character sequences are, and as time goes on the typical web text we see looks less and less like it did when Mozilla collected their original data. Also, our accuracy for text that isn't from webpages is probably not that great.
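To make the "positive ratio" in point 3 concrete, here is a minimal sketch of that calculation as defined above. This is just an illustration of the definition, not chardet's actual training code, and `positive_ratio` is a hypothetical name:

```python
from collections import Counter

def positive_ratio(text, top_k=512):
    """Ratio of the token frequency of the top_k most common character
    bigram types to the total number of bigram tokens in the text."""
    # Count every overlapping character bigram (type -> token count).
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(bigrams.values())
    if total == 0:
        return 0.0
    # Sum the token counts of the top_k most frequent bigram types.
    top = sum(count for _, count in bigrams.most_common(top_k))
    return top / total
```

A real training tool would run this over a large corpus per encoding and store the result alongside the frequency tables.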

So if we're in agreement that the current approach is bad, how do we want to fix it?

I propose that we:

  1. Store the data in either JSON or YAML formats in the GitHub repository. This would potentially allow us to share our data with chardet ports written in other languages (if they wanted to support our format).
  2. As part of the setup.py install process, convert the files to pickled dictionaries.
  3. Modify the prober initializers to take a path to either a pickled dictionary or a JSON/YAML file and load up that data at run-time. Supporting both types of file would simplify development, since we could play around with models without having to constantly convert them to pickles.
  4. Modify chardet.detect to cache its UniversalDetector object so that we don't constantly create new prober objects and reload the pickles.

The only problem I see with this approach is that it will slow down import chardet, but loading pickles is usually pretty fast.

@sigmavirus24, what do you think?
