Decode regression #1564

This is not a recent regression, and perhaps it won't be fixed for that reason, but I thought I'd file it anyway.

I maintain Go bindings for this library, and by sheer luck I had benchmarks in place from the start. At some point I noticed a [regression](https://github.com/daulet/tokenizers/issues/14) in decoding, but only now got around to investigating it. Long story short, I bisected this repo and root-caused it to [this PR](https://github.com/huggingface/tokenizers/pull/938). Below is the benchmark I used to find it.

Regression details:
```
decode                  time:   [3.9277 µs 3.9409 µs 3.9558 µs]
                        change: [+241.37% +242.64% +244.06%] (p = 0.00 < 0.05)
                        Performance has regressed.
```

While decode is pretty fast (on the order of microseconds), a +240% slowdown is fairly big, and I wonder if we can gain back that performance.
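To put the percentage in absolute terms, here is a rough sketch that back-computes the pre-regression time from the numbers above. It assumes criterion's `change` is the relative difference `(new - old) / old`; the helper name `baseline_from_change` is made up for illustration and is not part of criterion or tokenizers.

```rust
/// Hypothetical helper: recover the baseline time from a new measurement and
/// a relative change, assuming change = (new - old) / old, in percent.
fn baseline_from_change(new_time_us: f64, change_percent: f64) -> f64 {
    new_time_us / (1.0 + change_percent / 100.0)
}

fn main() {
    // Point estimates copied from the criterion output above.
    let new_time_us = 3.9409;
    let change_percent = 242.64;

    let old_time_us = baseline_from_change(new_time_us, change_percent);
    let slowdown = new_time_us / old_time_us;

    // prints "estimated baseline: 1.15 µs, slowdown factor: 3.43x"
    println!(
        "estimated baseline: {:.2} µs, slowdown factor: {:.2}x",
        old_time_us, slowdown
    );
}
```

Under that assumption, decode went from roughly 1.15 µs to 3.94 µs per call, i.e. about 3.4x slower.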

Benchmark code (`tokenizers/benches/decode_benchmark.rs`):
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tokenizers::tokenizer::Tokenizer;

// Thin wrapper around `Tokenizer::decode` that panics on failure.
fn decode(tokenizer: &Tokenizer, ids_slice: Vec<u32>, skip_special_tokens: bool) -> String {
    tokenizer
        .decode(ids_slice, skip_special_tokens)
        .expect("failed to decode input")
}

fn criterion_benchmark(c: &mut Criterion) {
    // Load the tokenizer once, outside the measured closure.
    let tokenizer = Tokenizer::from_file("./test/data/bert-base-uncased.json")
        .expect("failed to create tokenizer");
    c.bench_function("decode", |b| {
        b.iter(|| {
            decode(
                &tokenizer,
                black_box([2829, 4419, 14523, 2058, 1996, 13971, 3899].to_vec()),
                black_box(true),
            )
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```

Add this to `Cargo.toml` and run with `cargo bench decode`.
```toml
[[bench]]
name = "decode_benchmark"
harness = false
```

The tokenizer file is copied from [here](https://huggingface.co/google-bert/bert-base-uncased/raw/main/tokenizer.json).

