Improve Model serialization/deserialization by n1t0 · Pull Request #620 · huggingface/tokenizers

n1t0 · 2021-02-04T14:08:46Z

Because we implement Serialize and Deserialize manually for the various models, we never included the #[serde(tag = "type")] attribute we use everywhere else. When deserializing, we can therefore only infer which Model we are looking at from the fields that are present.
This worked as long as the models were different enough, but that is no longer the case: WordPiece and WordLevel can both be deserialized from the same serialized JSON.

This PR fixes the problem by writing the type during serialization and using it during deserialization whenever it is defined. The change stays backward compatible because the type is not mandatory, and we add a layer of verification based on which fields are present (mainly for WordPiece and WordLevel).
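
To illustrate the dispatch idea described above, here is a minimal sketch in Python. It is not the Rust code from this PR. The field sets, the `REQUIRED_FIELDS` table and the `detect_model_type` helper are all assumptions made for the example. It prefers an explicit "type" tag and falls back to field presence for older files that have no tag.

```python
import json

# Hypothetical field sets, assumed for illustration only; the real models
# may use different or additional fields.
REQUIRED_FIELDS = {
    "WordPiece": {
        "vocab",
        "unk_token",
        "continuing_subword_prefix",
        "max_input_chars_per_word",
    },
    "WordLevel": {"vocab", "unk_token"},
}


def detect_model_type(serialized: str) -> str:
    """Return the model name for a serialized model (hypothetical helper)."""
    data = json.loads(serialized)

    # New-style files: trust the explicit tag when it is present.
    tag = data.get("type")
    if tag is not None:
        if tag not in REQUIRED_FIELDS:
            raise ValueError(f"unknown model type: {tag!r}")
        return tag

    # Old-style files: no tag, so verify which models' fields are all present
    # and keep the most specific match (the one requiring the most fields).
    keys = set(data)
    candidates = [
        name for name, fields in REQUIRED_FIELDS.items() if fields <= keys
    ]
    if not candidates:
        raise ValueError("no known model matches the serialized fields")
    return max(candidates, key=lambda name: len(REQUIRED_FIELDS[name]))


legacy_word_level = json.dumps({"vocab": {"a": 0}, "unk_token": "[UNK]"})
legacy_word_piece = json.dumps(
    {
        "vocab": {"a": 0},
        "unk_token": "[UNK]",
        "continuing_subword_prefix": "##",
        "max_input_chars_per_word": 100,
    }
)
tagged_word_level = json.dumps(
    {"type": "WordLevel", "vocab": {"a": 0}, "unk_token": "[UNK]"}
)
```

In this sketch the untagged WordPiece payload also satisfies the WordLevel field set, which is the kind of overlap the tag removes. The fallback only has to resolve files written before the tag existed.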

Improve Model serialization/deserialization

7a83428

n1t0 force-pushed the fix-model-serde branch from 6102814 to 7a83428 Compare February 4, 2021 14:26

n1t0 merged commit a8f7564 into master Feb 4, 2021

n1t0 deleted the fix-model-serde branch February 4, 2021 14:59

n1t0 mentioned this pull request Feb 8, 2021

Prepare for python v0.10.1 #625

Merged


