Token level timestamps for long-form generation in Whisper #29148
gante merged 7 commits into huggingface:main from
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
gante
left a comment
In general it looks good to me, but I'd like to defer the approval to our Whisper expert @sanchit-gandhi :D
sanchit-gandhi
left a comment
Thanks @zucchini-nlp - sounds good to me aligning the word-level timestamps on a per-segment level (rather than on the full sequence-level). We'll probably need to propagate some changes forward to the ASR pipeline so that it's compatible with these generation related changes, but we can do this in a follow-up PR
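The per-segment alignment described above can be sketched roughly as follows (hypothetical names and data layout; the actual implementation lives in Whisper's generation code):

```python
# Rough sketch (hypothetical structure): offset each segment's token
# timestamps by that segment's start time, so timestamps are aligned
# per segment rather than over the full concatenated sequence.
def align_segments(segments):
    aligned = []
    for seg in segments:
        start = seg["start"]  # segment start time in seconds
        aligned.append(
            {
                "start": start,
                "token_timestamps": [start + t for t in seg["token_timestamps"]],
            }
        )
    return aligned


segments = [
    {"start": 0.0, "token_timestamps": [0.0, 0.5, 1.2]},
    {"start": 30.0, "token_timestamps": [0.0, 0.4, 1.0]},
]
print(align_segments(segments)[1]["token_timestamps"])  # [30.0, 30.4, 31.0]
```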
```python
self.assertListEqual(tokens_shape, token_timestamps_shape)

# fmt: off
EXPECTED_OUTPUT = [
```
Are these values taken from the original repo? Or inspected from Transformers Whisper and deemed to be correct?
Inspected from Transformers Whisper and deemed to be correct. I am not sure how the tests for short-form timestamps were written, so I just took an audio clip and copied the model outputs.
Also cc @ArthurZucker for Whisper-related timestamps
ArthurZucker
left a comment
Let's add a small end to end test with explicit outputs, will help us in the long run! Otherwise LGTM
```python
for segment, exp_segment in zip(generate_outputs["segments"][0], EXPECTED_OUTPUT):
    self.assertTrue(torch.allclose(segment["token_timestamps"], exp_segment))
```
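The allclose comparison in the snippet above tolerates small floating-point drift between predicted and expected timestamps. A pure-Python equivalent of that check, for illustration only (not the test's actual helper):

```python
import math


def float_lists_close(a, b, atol=1e-4):
    # Mirrors the spirit of torch.allclose for flat lists: same length,
    # and every pair of elements within an absolute tolerance.
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=atol) for x, y in zip(a, b)
    )


print(float_lists_close([0.5, 1.2], [0.50001, 1.2]))  # True
print(float_lists_close([0.5, 1.2], [0.5, 1.3]))  # False
```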
Could we add a small test like `test_return_timestamps_in_preprocess` in `tests/pipelines/test_pipelines_automatic_speech_recognition.py`, just to make sure we have something explicit like:

```python
'chunks': [
    {'text': ' Conquered', 'timestamp': (0.5, 1.2)},
    {'text': ' returned', 'timestamp': (1.2, 1.64)},
    {'text': ' to', 'timestamp': (1.64, 1.84)},
    {'text': ' its', 'timestamp': (1.84, 2.02)},
    {'text': ' place', 'timestamp': (2.02, 2.28)},
    {'text': ' amidst', 'timestamp': (2.28, 2.8)},
    {'text': ' the', 'timestamp': (2.8, 2.98)},
    {'text': ' tents.', 'timestamp': (2.98, 3.48)},
],
```
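An explicit end-to-end check along those lines could take roughly this shape (a sketch only; in the real test, `output` would come from running the ASR pipeline with word-level timestamps rather than being constructed by hand):

```python
# Hypothetical expected values, mirroring the chunk structure shown above.
EXPECTED_CHUNKS = [
    {"text": " Conquered", "timestamp": (0.5, 1.2)},
    {"text": " returned", "timestamp": (1.2, 1.64)},
    {"text": " to", "timestamp": (1.64, 1.84)},
]


def assert_chunks_equal(output, expected):
    # Compare each predicted chunk's text and (start, end) tuple
    # against the hard-coded expectation.
    assert len(output["chunks"]) == len(expected)
    for got, exp in zip(output["chunks"], expected):
        assert got["text"] == exp["text"]
        assert got["timestamp"] == exp["timestamp"]


# Here we just exercise the comparison helper with the expected values.
assert_chunks_equal({"chunks": list(EXPECTED_CHUNKS)}, EXPECTED_CHUNKS)
print("chunks match")
```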
Sorry, a mistag. We got a review from Arthur, so we merged it :)
What does this PR do?
Continuation of PR #28984. Adds token level timestamps for long-form generation. The previous PR took a quite different approach to adding timestamps, specifically calling `extract_timestamps` for each segment and each batch separately. I believe it can be done in one batch, and then divided into segments the same way the sequences are divided. The final timestamps are already aligned with the total length, so there is no need to add a start_time for each segment. I am not sure if that is what we want to have, though, so I can remove this "total duration alignment" if needed.
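The "one batch, then divide into segments" idea from the description can be sketched like this (a hypothetical helper; the actual segmentation lives in Whisper's long-form generation code):

```python
# Sketch: token timestamps are computed once for the whole generated
# sequence, then sliced into segments using the same per-segment token
# counts used to slice the generated sequences themselves.
def split_timestamps(token_timestamps, segment_lengths):
    segments, idx = [], 0
    for n in segment_lengths:
        segments.append(token_timestamps[idx:idx + n])
        idx += n
    return segments


timestamps = [0.0, 0.5, 1.2, 1.6, 2.0, 2.3]
print(split_timestamps(timestamps, [3, 3]))
# [[0.0, 0.5, 1.2], [1.6, 2.0, 2.3]]
```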
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sanchit-gandhi
@patrickvonplaten
@gante ?