This is not a recent regression, and perhaps it won't be fixed for that reason, but I thought I'd file it anyway.
I maintain Go bindings for this library, and by sheer luck I had benchmarks in place when I started. At some point I noticed a regression in decoding, but only now got around to investigating it. Long story short, I bisected this repo and root-caused the regression to this PR. Below is the benchmark I used to find it.
Regression details:
decode time: [3.9277 µs 3.9409 µs 3.9558 µs]
change: [+241.37% +242.64% +244.06%] (p = 0.00 < 0.05)
Performance has regressed.
While decode is still pretty fast in absolute terms (on the order of microseconds), a +240% slowdown is substantial, and I wonder whether we can win back that performance.
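For context, the criterion output above implies a pre-regression baseline of roughly 1.15 µs per decode. This is just a back-of-envelope calculation from the reported midpoints, not a measurement:

```rust
// Recover the implied pre-regression decode time from criterion's report:
// new_time = baseline * (1 + change), so baseline = new_time / (1 + change).
fn main() {
    let new_time_us = 3.9409; // midpoint of the reported decode time
    let change_pct = 242.64; // midpoint of the reported change
    let baseline_us = new_time_us / (1.0 + change_pct / 100.0);
    println!("implied pre-regression decode time: {:.2} µs", baseline_us);
}
```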
Benchmark code (tokenizers/benches/decode_benchmark.rs):
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tokenizers::tokenizer::Tokenizer;

fn decode(tokenizer: &Tokenizer, ids_slice: Vec<u32>, skip_special_tokens: bool) -> String {
    tokenizer
        .decode(ids_slice, skip_special_tokens)
        .expect("failed to decode input")
}

fn criterion_benchmark(c: &mut Criterion) {
    let tokenizer = Tokenizer::from_file("./test/data/bert-base-uncased.json")
        .expect("failed to create tokenizer");
    c.bench_function("decode", |b| {
        b.iter(|| {
            decode(
                &tokenizer,
                black_box([2829, 4419, 14523, 2058, 1996, 13971, 3899].to_vec()),
                black_box(true),
            )
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
Add the following to Cargo.toml and run with cargo bench decode:
[[bench]]
name = "decode_benchmark"
harness = false
The tokenizer file is copied from here.