Changing Decoder trait to be more composable.#938
Conversation
McPatate
left a comment
There was a problem hiding this comment.
Few questions for comprehension.
Looks good overall !
tokenizers/src/decoders/wordpiece.rs
Outdated
| *token = format!(" {}", token); | ||
| } | ||
| if self.cleanup { | ||
| *token = cleanup(token.clone()); |
There was a problem hiding this comment.
Isn't it better API-wise to pass in a reference, &token and call a .to_owned() or .clone() inside the cleanup() method ?
Feels cleaner, the responsibility of the clone is imo cleanup's.
There was a problem hiding this comment.
I think it's almost always preferable to not clone on a user's behalf and let the user decide how best to pass owned instances (if you really require ownership).
It seems maybe this function doesn't actually require ownership, I'll change to reflect that.
There was a problem hiding this comment.
I see what you mean. W/out the clone, you will have to create a new instance of a String, which is somehow similar to cloning ^^
There was a problem hiding this comment.
Yes, but at least I am following the std::str::replace signature so.. :)
There was a problem hiding this comment.
If that is the rust way, so be it 😄
|
I'm sorry, I didn't have time to do a proper review. Nevertheless, I would be very interested to understand a bit better the changes that have been made 🙂 In particular, I see that the test target has changed (e.g. |
|
No not the final thing, basically the output of the tokenizer will be Only breaking changes are if you are using decoders directly. |
|
Thank you very much for your answer! It's really great! To put all the pieces together, am I right to understand that this last step is done here 😄 : tokenizers/tokenizers/src/tokenizer/mod.rs Line 781 in 2620531 |
|
Indeed ! You see Rust is not that hard ^_^ |
Fix #872