Skip to content

Add CSV formatted output in transcript, using integer start/end times in milliseconds.#228

Merged
jongwook merged 11 commits intoopenai:mainfrom
NielsMayer:main
Jan 22, 2023
Merged

Add CSV formatted output in transcript, using integer start/end times in milliseconds.#228
jongwook merged 11 commits intoopenai:mainfrom
NielsMayer:main

Conversation

@NielsMayer
Copy link
Copy Markdown
Contributor

@NielsMayer NielsMayer commented Oct 2, 2022

This PR adds CSV output to Whisper transcription similar to the way #102 added SRT subtitle formatted output.

Each line of the resulting CSV file is formatted like:
<startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, "<transcript-including-commas>"

One of the reasons for using integer millisecond timings is to avoid regional incompatibilities with writing and reading floating point timings across language regions which use different characters - either "." or "," - as the decimal separator (c.f. #197). The CSV format with integer millisecond timings also allows for more efficient parsing and storage of Whisper results when read into other applications, in other languages like C++.

…ormatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas>
column of the CSV file is delimited by quotes, and any quote
characters that might be in the transcript (which would interfere with
parsing the third column as a string) are converted to "''".
@Yoyoma22
Copy link
Copy Markdown

Yoyoma22 commented Oct 27, 2022

The CSV seems to just multiply the range by 1000 to get ms resolution. Could we please have actual ms resolution for the speakers? If doing text-to-speech of a natural conversation, and trying to combine it with speaker recognition (pytorch for example), two persons may speak in the same second. We don't know who said what. Therefore, if we had ms resolution out of whisper, we could easily know who said what sentence in a natural conversation.

@ghost
Copy link
Copy Markdown

ghost commented Oct 29, 2022

To use Whisper for subtitles, Millisecond resolution would be a big plus.

@NielsMayer
Copy link
Copy Markdown
Contributor Author

@ksn-systems that is precisely why I made this change. See #233 for details.

@NielsMayer
Copy link
Copy Markdown
Contributor Author

@jongwook would it be possible to "approve the workflow" as I see this PR is stuck at "1 workflow awaiting approval".

Is there anything I should change or clarify in order to get this PR merged or get the "workflow awaiting approval" to be satisfied?

Please note a similar PR was merged for whisper.cpp ( ggml-org/whisper.cpp#340 ). Having this functionality in place for whisper allows for easier comparison of results between implementations, via easier importing into spreadsheets and databases supporting the CSV format.

@jongwook
Copy link
Copy Markdown
Collaborator

jongwook commented Jan 21, 2023

@NielsMayer Thanks for the PR! Would it work for you if I make this write_tsv instead? CSV format is not standardized and csv.reader and pandas.read_csv often create headache parsing quotes.

I should probably merge #333 first with some modifications as the number of output files is becoming unwieldly.

@NielsMayer
Copy link
Copy Markdown
Contributor Author

@NielsMayer Thanks for the PR! Would it work for you if I make this write_tsv instead? CSV format is not standardized and csv.reader and pandas.read_csv often create headache parsing quotes.

I should probably merge #333 first with some modifications as the number of output files is becoming unwieldly.

Yes, merging #333 makes sense. If you do that first, I will update this PR to conform to the changes made, e.g. add additional csv (or tsv) output format keyword.

W/r/t changing from CSV to TSV, that is fine by me as it would require a trivial change on my end. The reason why I chose CSV is that it seems more "standard"; although most of the programs automatically importing from CSV just as easily handle TSV.

I'm not familiar with the comma issues you mention in csv.reader or pandas.read_csv, however, do note that I updated the code for compatibility with importing into, e.g. openoffice, where string type is automatically recognized if delimited by '"' character. To prevent issues, Internal " in each CSV text line is replaced by two consecutive single quotes '' ... Note: print('"' + segment['text'].strip().replace('"', "''") + '"'

Chances are, such formatting and lack of special escape character means the existing solution would work with the readers you mention, @jongwook .

PS: I updated my repo for this PR to the latest head of repository from whisper, so once again, there is a "workflow awaiting approval" message...

@NielsMayer
Copy link
Copy Markdown
Contributor Author

FYI here's a whisper CSV file read into libreoffice on a linux desktop. Note the " replaced by ''
(orig source: https://rumble.com/v2619vq-bills-proliferate-to-criminalize-speech-darren-beattie-on-lex-and-brazil-fa.html )

3171020 | 3172820 | and he is up right now.
3172820 | 3176820 | [''System Updates,'' main theme music playing.]
3182620 | 3183760 | Great to be here.

@jongwook how on earth did whisper figure out that [''System Updates,'' main theme music playing.] -- I mean how did it figure out the name of the show ("remembered" the announcement of the show name at the beginning, sometimes an hour earlier?)

Sometimes however it gets the theme music wrong on the same show, but different episode: ( https://rumble.com/v24mywg-what-really-happened-in-brazil-yesterday-system-update-18.html )
`"[''The Daily Show Theme'']"

Likewise how is whisper figuring out where quotes start and end? It's kind of spooky actually! :-)

@jongwook jongwook merged commit f5bfe00 into openai:main Jan 22, 2023
@jongwook
Copy link
Copy Markdown
Collaborator

Thanks for accommodating the TSV suggestion! I merged a refactored version of #333 and edited this PR accordingly.

The issue about CSV is that, although CSV is more widely used and well known, even the simplest possible case like:

start, end, text
1234, 12345, "hello, world!"

results in very inconsistent user experience according to the program because of the lack of standardization around the quotation marks:

Apple Quick Look:
image

Apple Numbers:
image

csv.reader():
image

pandas.read_csv()
image

The latter two are the most common way to read CSV files in Python -- there are some combination of options to read the file as intended, but it's inconvenient and not practical to expect the users to use the "correct" configuration for all reader implementations.

Meanwhile, TSV doesn't need to deal with quoting because the field values are not allowed to contain tab characters.


Re: the second comment, because of the way that Whisper was trained, the model must have encountered the exact music and the text ["System Updates" main theme music playing.] multiple times during training. It's usually an undesired behavior, and we tried to mitigate this (rather hackily) by suppressing the [ character by default.

zackees pushed a commit to zackees/whisper that referenced this pull request May 5, 2023
… in milliseconds. (openai#228)

* Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas>

* for easier reading by spreadsheets importing CSV, the third

column of the CSV file is delimited by quotes, and any quote
characters that might be in the transcript (which would interfere with
parsing the third column as a string) are converted to "''".

* fix syntax error

* docstring edit

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>
ilanit1997 pushed a commit to ilanit1997/whisper that referenced this pull request May 16, 2023
… in milliseconds. (openai#228)

* Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas>

* for easier reading by spreadsheets importing CSV, the third

column of the CSV file is delimited by quotes, and any quote
characters that might be in the transcript (which would interfere with
parsing the third column as a string) are converted to "''".

* fix syntax error

* docstring edit

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>
abyesilyurt pushed a commit to abyesilyurt/whisper that referenced this pull request Nov 13, 2023
… in milliseconds. (openai#228)

* Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas>

* for easier reading by spreadsheets importing CSV, the third

column of the CSV file is delimited by quotes, and any quote
characters that might be in the transcript (which would interfere with
parsing the third column as a string) are converted to "''".

* fix syntax error

* docstring edit

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>
heejipark23 pushed a commit to heejipark23/whisper that referenced this pull request Sep 21, 2025
… in milliseconds. (openai#228)

* Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas>

* for easier reading by spreadsheets importing CSV, the third

column of the CSV file is delimited by quotes, and any quote
characters that might be in the transcript (which would interfere with
parsing the third column as a string) are converted to "''".

* fix syntax error

* docstring edit

Co-authored-by: Jong Wook Kim <jongwook@openai.com>
Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants