
Adding an API for decode streaming. #1677

Merged
Narsil merged 9 commits into main from decode_stream
Nov 15, 2024

Conversation

@Narsil
Contributor

@Narsil Narsil commented Nov 6, 2024

Fixes #1666

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment


Very nice, missing a Python API + example of how to use in python!

Comment on lines +910 to +912
pub fn decode_stream(&self, skip_special_tokens: bool) -> DecodeStream<'_, M, N, PT, PP, D> {
DecodeStream::new(self, skip_special_tokens)
}
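For readers, here is a standalone sketch of the streaming pattern this API enables. The `ToyDecodeStream` type and its fixed id-to-string table are illustrative stand-ins, not the crate's actual API; the real `DecodeStream` borrows the tokenizer and applies the same prefix comparison shown in the diff hunks of this review:

```rust
// Toy stand-in for Tokenizer::decode: maps ids to fixed strings.
fn decode(ids: &[u32]) -> String {
    ids.iter()
        .map(|id| match id {
            0 => "Hello",
            1 => ",",
            2 => " world",
            _ => "?",
        })
        .collect()
}

// Minimal streaming state mirroring the DecodeStream idea: remember
// the previously decoded text and emit only the new suffix per token.
struct ToyDecodeStream {
    ids: Vec<u32>,
    prefix: String,
}

impl ToyDecodeStream {
    fn new() -> Self {
        Self { ids: Vec::new(), prefix: String::new() }
    }

    /// Push one token id; return the newly produced text, if any.
    fn step(&mut self, id: u32) -> Option<String> {
        self.ids.push(id);
        let string = decode(&self.ids);
        // Same condition as in the PR: only emit once the decoded text
        // grew and does not end in an incomplete-UTF-8 marker.
        if string.len() > self.prefix.len() && !string.ends_with('�') {
            let new_text = string[self.prefix.len()..].to_string();
            self.prefix = string;
            Some(new_text)
        } else {
            None
        }
    }
}

fn main() {
    let mut stream = ToyDecodeStream::new();
    for id in [0, 1, 2] {
        if let Some(chunk) = stream.step(id) {
            print!("{chunk}"); // streams "Hello", ",", " world"
        }
    }
    println!();
}
```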
Collaborator


I think I'd rather we explicitly create this DecodeStream with DecodeStream::new(tokenizer, ...)
without adding this to the tokenizer funcs!

Contributor Author


As you wish. This follows the `.iter()` pattern in regular Rust, as it's more convenient given the lifetime bound of the `DecodeStream` object.

https://doc.rust-lang.org/src/alloc/collections/vec_deque/mod.rs.html#1204

It's really just sugar; I can happily remove it.

Collaborator


Got it. No, sounds good; I was thinking more about the coming Python API as well, but in Rust it makes sense for sure.

self.prefix_index = new_prefix_index;
Ok(Some(new_text.to_string()))
} else {
Ok(None)
Collaborator


Returning '�' might be more expected (at least it's not None, so people can still print it?), or ''.

Contributor Author


No, it's wrong to do so. Invalid UTF-8 is perfectly normal and should not be returned before enough tokens are accumulated (see the accent example) to provide valid UTF-8. If valid UTF-8 follows invalid UTF-8, then both will be returned at the same time.

Contributor Author


Producing "" is also wrong: the token really didn't produce anything yet, not the empty string.
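To make the accent example concrete, here is a standalone sketch (plain byte handling, no tokenizer involved) of why an incomplete sequence should yield nothing: a multi-byte character decoded before all of its bytes have arrived produces only the replacement character '�', which disappears once the remaining bytes show up.

```rust
fn main() {
    // '€' (U+20AC) encodes to three bytes in UTF-8.
    let euro = "€".as_bytes();
    assert_eq!(euro, [0xE2, 0x82, 0xAC]);

    // With only the first two bytes available, lossy decoding
    // yields the replacement character, not a real character:
    let partial = String::from_utf8_lossy(&euro[..2]);
    assert!(partial.ends_with('�'));

    // Once the final byte arrives, the character decodes cleanly,
    // so emitting '�' (or "") earlier would have been wrong.
    assert_eq!(String::from_utf8_lossy(euro), "€");
}
```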

}
let new_text = &string[self.prefix.len()..].to_string();
let new_prefix_index = self.ids.len() - self.prefix_index;
self.ids = self.ids.drain(self.read_index..).collect();
Collaborator


nice, the state is bound to be quite small with this!

Contributor Author


This is what we have in TGI; the overhead is indeed quite low. You're decoding twice as much (prefix + new text), and you have only a handful of extra tokens.
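The windowing that keeps this state small is the `drain` call in the diff above: only the ids from `read_index` onward are carried forward. A standalone illustration (the buffer contents here are made up):

```rust
fn main() {
    // Suppose five token ids have accumulated and the first three
    // have already been fully emitted as text.
    let mut ids: Vec<u32> = vec![10, 11, 12, 13, 14];
    let read_index = 3;

    // Keep only the unread tail, as in the PR's
    // `self.ids = self.ids.drain(self.read_index..).collect();`
    ids = ids.drain(read_index..).collect();
    assert_eq!(ids, vec![13, 14]);
}
```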

@Narsil
Contributor Author

Narsil commented Nov 7, 2024

> Very nice, missing a Python API + example of how to use in python!

If you're OK with that, I'd keep 2 separate PRs to keep this PR with logic small enough.

@Narsil Narsil mentioned this pull request Nov 7, 2024
Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks!

@Narsil Narsil merged commit 500db28 into main Nov 15, 2024
@Narsil Narsil deleted the decode_stream branch November 15, 2024 05:02
let string = self
.tokenizer
.decode(self.ids.as_slice(), self.skip_special_tokens)?;
if string.len() > self.prefix.len() && !string.ends_with('�') {

@ArthurZucker @Narsil

I ran into streamed decoding issues as well and had the same solution in mind. However, I came to the conclusion that this solution has its own flaw: if the replacement character '�' is actually part of the completion text, this code would assume it marks an incomplete UTF-8 sequence and not yield anything.

The only clean solution is to offer a Decoder::decode_u8 method. I could help out here, if desired.



Development

Successfully merging this pull request may close these issues.

Incremental Detokenization
