Merged
Conversation
Majdoddin
reviewed
Mar 9, 2023
Majdoddin
reviewed
Mar 10, 2023
Contributor
I am aware of at least one university running Whisper on PowerPC, so their upgrade path will be blocked until tiktoken supports more architectures.
zackees
pushed a commit
to zackees/whisper
that referenced
this pull request
May 5, 2023
* use tiktoken==0.3.0
* formatting
* tuple should be safer
* Update whisper/tokenizer.py
* use tiktoken 0.3.1
* reflecting suggestions
* cleanup
* bypassing load_tiktoken_bpe to avoid blobfile dep

Co-authored-by: Ruhollah Majdoddin <r.majdodin@gmail.com>
ilanit1997
pushed a commit
to ilanit1997/whisper
that referenced
this pull request
May 16, 2023
wsysuper
reviewed
Jul 2, 2023
    vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
    ranks = {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in open(vocab_path) if line)
    }
Here, the file is opened but the handle is never closed.
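One way to address the review comment above is to wrap the read in a context manager so the handle is closed deterministically. A minimal sketch, assuming the same `<base64 token> <rank>` line format as the quoted code (`load_ranks` is a hypothetical helper name, not part of the PR):

```python
import base64


def load_ranks(vocab_path):
    # "with" guarantees the file handle is closed, even if a line
    # fails to parse; each line is "<base64 token> <rank>".
    with open(vocab_path) as f:
        return {
            base64.b64decode(token): int(rank)
            for token, rank in (line.split() for line in f if line.strip())
        }
```

In CPython the original comprehension usually gets away with it because the handle is garbage-collected promptly, but on other runtimes (or under `ResourceWarning` checks) the explicit close is the safer pattern.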
abyesilyurt
pushed a commit
to abyesilyurt/whisper
that referenced
this pull request
Nov 13, 2023
heejipark23
pushed a commit
to heejipark23/whisper
that referenced
this pull request
Sep 21, 2025
Using tiktoken to replace HuggingFace Tokenizers allows faster tokenization and removes TensorFlow as a transitive dependency.
A downside is that tiktoken does not yet provide aarch64 Linux wheels, while tokenizers is built even for ppc64le and s390x, so this may be a blocker for some users.