Conversation
ArthurZucker left a comment:
Very nice! This is missing a Python API plus an example of how to use it in Python.
pub fn decode_stream(&self, skip_special_tokens: bool) -> DecodeStream<'_, M, N, PT, PP, D> {
    DecodeStream::new(self, skip_special_tokens)
}
I think I'd rather we explicitly create this DecodeStream with DecodeStream::new(tokenizer, ...), without adding this to the tokenizer's functions!
As you wish. This follows the .iter() pattern in regular Rust, as it's more convenient given the lifetime bound of the DecodeStream object.
https://doc.rust-lang.org/src/alloc/collections/vec_deque/mod.rs.html#1204
It's really just sugar; I can happily remove it.
Got it. No, sounds good; I was thinking more about the coming Python API as well, but in Rust it makes sense for sure.
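To illustrate the trade-off being discussed, here is a minimal, self-contained sketch of the two construction styles; the toy `Tokenizer` struct and the simplified single-lifetime `DecodeStream` are stand-ins for illustration, not the crate's real types.

```rust
// Toy stand-in for the real tokenizer type (assumption for illustration).
struct Tokenizer;

// Simplified DecodeStream: borrows the tokenizer, hence the lifetime bound.
struct DecodeStream<'a> {
    tokenizer: &'a Tokenizer,
    skip_special_tokens: bool,
}

impl<'a> DecodeStream<'a> {
    // Explicit construction, as preferred in the review comment:
    fn new(tokenizer: &'a Tokenizer, skip_special_tokens: bool) -> Self {
        DecodeStream { tokenizer, skip_special_tokens }
    }
}

impl Tokenizer {
    // Sugared construction, following the `.iter()`-style helper pattern:
    fn decode_stream(&self, skip_special_tokens: bool) -> DecodeStream<'_> {
        DecodeStream::new(self, skip_special_tokens)
    }
}

fn main() {
    let tok = Tokenizer;
    let explicit = DecodeStream::new(&tok, true);
    let sugared = tok.decode_stream(true);
    // Both styles produce an equivalent borrow-bound stream:
    assert_eq!(explicit.skip_special_tokens, sugared.skip_special_tokens);
    assert!(std::ptr::eq(explicit.tokenizer, sugared.tokenizer));
}
```

The helper is pure sugar: the method form just fixes the borrow to `self`, which is slightly more ergonomic at call sites precisely because of the lifetime parameter.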
    self.prefix_index = new_prefix_index;
    Ok(Some(new_text.to_string()))
} else {
    Ok(None)
Returning '�' might be more expected (at least it's not None, so people can still print it?), or ''.
No, it's wrong to do so. Invalid UTF-8 is perfectly normal and nothing should be returned before enough tokens are accumulated (see the accent example) to produce valid UTF-8. If valid UTF-8 follows invalid UTF-8, then both will be returned at the same time.
Producing "" is also wrong, since the token really didn't produce anything; it did not produce the empty string.
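The accent example can be made concrete with a minimal Python sketch (not the crate's real code; the toy byte-level "tokenizer" and all names here are illustrative). Each token id maps to one raw UTF-8 byte, and nothing is emitted until the accumulated bytes decode cleanly:

```python
class ToyByteTokenizer:
    """Toy tokenizer: each token id is one raw UTF-8 byte (an assumption)."""
    def decode(self, ids):
        # Invalid or incomplete sequences become U+FFFD, like lossy decoding.
        return bytes(ids).decode("utf-8", errors="replace")

class StreamDecoder:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.ids = []
        self.emitted = 0  # number of characters already yielded

    def step(self, token_id):
        self.ids.append(token_id)
        text = self.tokenizer.decode(self.ids)
        # Only emit once the tail is valid UTF-8 (no trailing U+FFFD).
        if len(text) > self.emitted and not text.endswith("\ufffd"):
            new_text = text[self.emitted:]
            self.emitted = len(text)
            return new_text
        return None  # nothing produced yet: not '' and not '\ufffd'

stream = StreamDecoder(ToyByteTokenizer())
# 'é' is two UTF-8 bytes (0xC3 0xA9) split across two "tokens":
print(stream.step(0xC3))  # None: incomplete sequence, nothing to show yet
print(stream.step(0xA9))  # é: both bytes arrived, valid UTF-8 is emitted
```

Note the `None` on the first step: the token genuinely produced no text yet, and the following step returns the full character once it becomes valid.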
}
let new_text = &string[self.prefix.len()..].to_string();
let new_prefix_index = self.ids.len() - self.prefix_index;
self.ids = self.ids.drain(self.read_index..).collect();
nice, the state is bound to be quite small with this!
This is what we have in TGI; the overhead is indeed quite low. You're decoding twice as much (prefix + new text) and you keep only a handful of extra tokens.
If you're OK with that, I'd keep 2 separate PRs, to keep this PR's logic small enough.
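The bounded-state behavior can be sketched in a few lines of Python. This is a simplified version (all names are assumptions): the real Rust code keeps its decode context via `prefix`, `prefix_index` and `read_index`, while this toy keeps just the ids since the last emit, with token ids standing in for raw UTF-8 bytes:

```python
def decode(ids):
    # Toy byte-level decode; lossy so incomplete tails become U+FFFD.
    return bytes(ids).decode("utf-8", errors="replace")

class TrimmingStream:
    def __init__(self):
        self.ids = []      # pending ids plus a small decode-context window
        self.prefix = ""   # text the retained ids decode to (already emitted)
        self.read_index = 0

    def step(self, token_id):
        self.ids.append(token_id)
        text = decode(self.ids)
        if len(text) > len(self.prefix) and not text.endswith("\ufffd"):
            new_text = text[len(self.prefix):]
            # Drop ids no longer needed as context: we decode roughly twice
            # as much (prefix + new text), but keep only a handful of ids.
            self.ids = self.ids[self.read_index:]
            self.prefix = decode(self.ids)
            self.read_index = len(self.ids)
            return new_text
        return None

s = TrimmingStream()
out = [s.step(b) for b in "aé".encode("utf-8")]
print(out)         # ['a', None, 'é']
print(len(s.ids))  # 2: only the last emit's ids remain as context
```

However long the generation runs, the buffer never grows beyond the current decode context plus the not-yet-valid tail, which is the "state is bound to be quite small" point above.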
let string = self
    .tokenizer
    .decode(self.ids.as_slice(), self.skip_special_tokens)?;
if string.len() > self.prefix.len() && !string.ends_with('�') {
I ran into streamed decoding issues as well and had the same solution in mind. However, I came to the conclusion that this solution has its own flaw: if you want to actually decode the � character because it is part of the completion, this code would assume it's an incomplete UTF-8 marker and not yield anything.
The only clean solution is to offer a Decoder::decode_u8 method. I could help out here, if desired.
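The ambiguity behind this flaw is easy to demonstrate: after lossy decoding, a genuine U+FFFD in the completion is byte-for-byte identical to the replacement character produced by a truncated multi-byte sequence, so a string-level `ends_with('�')` check cannot tell them apart. A short sketch (illustrative, not the crate's code):

```python
# A real U+FFFD, validly encoded as three UTF-8 bytes:
genuine = bytes([0xEF, 0xBF, 0xBD]).decode("utf-8", errors="replace")
# The lone first byte of 'é' (0xC3): an incomplete sequence:
truncated = bytes([0xC3]).decode("utf-8", errors="replace")

print(genuine == "\ufffd")    # True: the completion really contains '�'
print(truncated == "\ufffd")  # True: same string, but here more bytes follow
# Both strings end with '�', so an ends_with('�') check holds both back;
# only at the byte level (e.g. the proposed Decoder::decode_u8) can the
# caller distinguish a finished character from a pending one.
```

This is why pushing the decision down to bytes, as proposed, is the clean fix: the byte stream still distinguishes the two cases even though the lossy string form does not.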
Fixes #1666