Conversation
ArthurZucker left a comment:
Very nice! This is missing a Python API plus an example of how to use it in Python.
pub fn decode_stream(&self, skip_special_tokens: bool) -> DecodeStream<'_, M, N, PT, PP, D> {
    DecodeStream::new(self, skip_special_tokens)
}
I think I'd rather we explicitly create this DecodeStream with DecodeStream::new(tokenizer, ...), without adding this to the tokenizer's functions!
As you wish. This follows the .iter() pattern in regular Rust, as it's more convenient given the lifetime bound of the DecodeStream object.
https://doc.rust-lang.org/src/alloc/collections/vec_deque/mod.rs.html#1204
It's really just sugar; I can happily remove it.
Got it. No, sounds good; I was thinking more about the coming Python API as well, but in Rust it makes sense for sure.
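To illustrate the trade-off being discussed, here is a minimal, self-contained sketch of the two construction styles; the toy `Tokenizer` struct and the simplified single-lifetime `DecodeStream` are stand-ins for illustration, not the crate's real types.

```rust
// Toy stand-in for the real tokenizer type (assumption for illustration).
struct Tokenizer;

// Simplified DecodeStream: borrows the tokenizer, hence the lifetime bound.
struct DecodeStream<'a> {
    tokenizer: &'a Tokenizer,
    skip_special_tokens: bool,
}

impl<'a> DecodeStream<'a> {
    // Explicit construction, as preferred in the review comment:
    fn new(tokenizer: &'a Tokenizer, skip_special_tokens: bool) -> Self {
        DecodeStream { tokenizer, skip_special_tokens }
    }
}

impl Tokenizer {
    // Sugared construction, following the `.iter()`-style helper pattern:
    fn decode_stream(&self, skip_special_tokens: bool) -> DecodeStream<'_> {
        DecodeStream::new(self, skip_special_tokens)
    }
}

fn main() {
    let tok = Tokenizer;
    let explicit = DecodeStream::new(&tok, true);
    let sugared = tok.decode_stream(true);
    // Both styles produce an equivalent borrow-bound stream:
    assert_eq!(explicit.skip_special_tokens, sugared.skip_special_tokens);
    assert!(std::ptr::eq(explicit.tokenizer, sugared.tokenizer));
}
```

The helper is pure sugar: the method form just fixes the borrow to `self`, which is slightly more ergonomic at call sites precisely because of the lifetime parameter.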
    self.prefix_index = new_prefix_index;
    Ok(Some(new_text.to_string()))
} else {
    Ok(None)
Returning '�' might be more expected (at least it's not None, so people can still print it?), or ''.
No, it's wrong to do so. Invalid UTF-8 is perfectly normal and nothing should be returned before enough tokens are accumulated (see the accent example) to produce valid UTF-8. If valid UTF-8 follows invalid UTF-8, then both will be returned at the same time.
Producing "" is also wrong, since the token really didn't produce anything; it did not produce the empty string.
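The accent example can be made concrete with a minimal Python sketch (not the crate's real code; the toy byte-level "tokenizer" and all names here are illustrative). Each token id maps to one raw UTF-8 byte, and nothing is emitted until the accumulated bytes decode cleanly:

```python
class ToyByteTokenizer:
    """Toy tokenizer: each token id is one raw UTF-8 byte (an assumption)."""
    def decode(self, ids):
        # Invalid or incomplete sequences become U+FFFD, like lossy decoding.
        return bytes(ids).decode("utf-8", errors="replace")

class StreamDecoder:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.ids = []
        self.emitted = 0  # number of characters already yielded

    def step(self, token_id):
        self.ids.append(token_id)
        text = self.tokenizer.decode(self.ids)
        # Only emit once the tail is valid UTF-8 (no trailing U+FFFD).
        if len(text) > self.emitted and not text.endswith("\ufffd"):
            new_text = text[self.emitted:]
            self.emitted = len(text)
            return new_text
        return None  # nothing produced yet: not '' and not '\ufffd'

stream = StreamDecoder(ToyByteTokenizer())
# 'é' is two UTF-8 bytes (0xC3 0xA9) split across two "tokens":
print(stream.step(0xC3))  # None: incomplete sequence, nothing to show yet
print(stream.step(0xA9))  # é: both bytes arrived, valid UTF-8 is emitted
```

Note the `None` on the first step: the token genuinely produced no text yet, and the following step returns the full character once it becomes valid.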
}
let new_text = &string[self.prefix.len()..].to_string();
let new_prefix_index = self.ids.len() - self.prefix_index;
self.ids = self.ids.drain(self.read_index..).collect();
nice, the state is bound to be quite small with this!
This is what we have in TGI; the overhead is indeed quite low. You're decoding twice as much (prefix + new text) and you keep only a handful of extra tokens.
If you're OK with that, I'd keep 2 separate PRs, to keep this PR's logic small enough.
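The bounded-state behavior can be sketched in a few lines of Python. This is a simplified version (all names are assumptions): the real Rust code keeps its decode context via `prefix`, `prefix_index` and `read_index`, while this toy keeps just the ids since the last emit, with token ids standing in for raw UTF-8 bytes:

```python
def decode(ids):
    # Toy byte-level decode; lossy so incomplete tails become U+FFFD.
    return bytes(ids).decode("utf-8", errors="replace")

class TrimmingStream:
    def __init__(self):
        self.ids = []      # pending ids plus a small decode-context window
        self.prefix = ""   # text the retained ids decode to (already emitted)
        self.read_index = 0

    def step(self, token_id):
        self.ids.append(token_id)
        text = decode(self.ids)
        if len(text) > len(self.prefix) and not text.endswith("\ufffd"):
            new_text = text[len(self.prefix):]
            # Drop ids no longer needed as context: we decode roughly twice
            # as much (prefix + new text), but keep only a handful of ids.
            self.ids = self.ids[self.read_index:]
            self.prefix = decode(self.ids)
            self.read_index = len(self.ids)
            return new_text
        return None

s = TrimmingStream()
out = [s.step(b) for b in "aé".encode("utf-8")]
print(out)         # ['a', None, 'é']
print(len(s.ids))  # 2: only the last emit's ids remain as context
```

However long the generation runs, the buffer never grows beyond the current decode context plus the not-yet-valid tail, which is the "state is bound to be quite small" point above.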
let string = self
    .tokenizer
    .decode(self.ids.as_slice(), self.skip_special_tokens)?;
if string.len() > self.prefix.len() && !string.ends_with('�') {
I ran into streamed decoding issues as well and had the same solution in mind. However, I came to the conclusion that this solution has its own flaw: if you want to actually decode the � character because it is part of the completion, this code would assume it's an incomplete UTF-8 marker and not yield anything.
The only clean solution is to offer a Decoder::decode_u8 method. I could help out here, if desired.
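The ambiguity behind this flaw is easy to demonstrate: after lossy decoding, a genuine U+FFFD in the completion is byte-for-byte identical to the replacement character produced by a truncated multi-byte sequence, so a string-level `ends_with('�')` check cannot tell them apart. A short sketch (illustrative, not the crate's code):

```python
# A real U+FFFD, validly encoded as three UTF-8 bytes:
genuine = bytes([0xEF, 0xBF, 0xBD]).decode("utf-8", errors="replace")
# The lone first byte of 'é' (0xC3): an incomplete sequence:
truncated = bytes([0xC3]).decode("utf-8", errors="replace")

print(genuine == "\ufffd")    # True: the completion really contains '�'
print(truncated == "\ufffd")  # True: same string, but here more bytes follow
# Both strings end with '�', so an ends_with('�') check holds both back;
# only at the byte level (e.g. the proposed Decoder::decode_u8) can the
# caller distinguish a finished character from a pending one.
```

This is why pushing the decision down to bytes, as proposed, is the clean fix: the byte stream still distinguishes the two cases even though the lossy string form does not.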
Fixes #1666