Add CSV formatted output in transcript, using integer start/end times in milliseconds.#228
Add CSV formatted output in transcript, using integer start/end times in milliseconds.#228jongwook merged 11 commits intoopenai:mainfrom
Conversation
…ormatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas>
column of the CSV file is delimited by quotes, and any quote characters that might be in the transcript (which would interfere with parsing the third column as a string) are converted to "''".
|
The CSV seems to just multiply the range by 1000 to get ms resolution. Could we please have actual ms resolution for the speakers? If doing text-to-speech of a natural conversation, and trying to combine it with speaker recognition (pytorch for example), two persons may speak in the same second. We don't know who said what. Therefore, if we had ms resolution out of whisper, we could easily know who said what sentence in a natural conversation. |
|
To use Whisper for subtitles, Millisecond resolution would be a big plus. |
|
@ksn-systems that is precisely why I made this change. See #233 for details. |
|
@jongwook would it be possible to "approve the workflow" as I see this PR is stuck at "1 workflow awaiting approval". Is there anything I should change or clarify in order to get this PR merged or get the "workflow awaiting approval" to be satisfied? Please note a similar PR was merged for whisper.cpp ( ggml-org/whisper.cpp#340 ). Having this functionality in place for whisper allows for easier comparison of results between implementations, via easier importing into spreadsheets and databases supporting the CSV format. |
|
@NielsMayer Thanks for the PR! Would it work for you if I make this I should probably merge #333 first with some modifications as the number of output files is becoming unwieldly. |
Yes, merging #333 makes sense. If you do that first, I will update this PR to conform to the changes made, e.g. add additional csv (or tsv) output format keyword. W/r/t changing from CSV to TSV, that is fine by me as it would require a trivial change on my end. The reason why I chose CSV is that it seems more "standard"; although most of the programs automatically importing from CSV just as easily handle TSV. I'm not familiar with the comma issues you mention in csv.reader or pandas.read_csv, however, do note that I updated the code for compatibility with importing into, e.g. openoffice, where string type is automatically recognized if delimited by '"' character. To prevent issues, Internal Chances are, such formatting and lack of special escape character means the existing solution would work with the readers you mention, @jongwook . PS: I updated my repo for this PR to the latest head of repository from whisper, so once again, there is a "workflow awaiting approval" message... |
|
FYI here's a whisper CSV file read into libreoffice on a linux desktop. Note the " replaced by '' @jongwook how on earth did whisper figure out that Sometimes however it gets the theme music wrong on the same show, but different episode: ( https://rumble.com/v24mywg-what-really-happened-in-brazil-yesterday-system-update-18.html ) Likewise how is whisper figuring out where quotes start and end? It's kind of spooky actually! :-) |
|
Thanks for accommodating the TSV suggestion! I merged a refactored version of #333 and edited this PR accordingly. The issue about CSV is that, although CSV is more widely used and well known, even the simplest possible case like: start, end, text
1234, 12345, "hello, world!"results in very inconsistent user experience according to the program because of the lack of standardization around the quotation marks: The latter two are the most common way to read CSV files in Python -- there are some combination of options to read the file as intended, but it's inconvenient and not practical to expect the users to use the "correct" configuration for all reader implementations. Meanwhile, TSV doesn't need to deal with quoting because the field values are not allowed to contain tab characters. Re: the second comment, because of the way that Whisper was trained, the model must have encountered the exact music and the text |
… in milliseconds. (openai#228) * Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas> * for easier reading by spreadsheets importing CSV, the third column of the CSV file is delimited by quotes, and any quote characters that might be in the transcript (which would interfere with parsing the third column as a string) are converted to "''". * fix syntax error * docstring edit Co-authored-by: Jong Wook Kim <jongwook@openai.com> Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>
… in milliseconds. (openai#228) * Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas> * for easier reading by spreadsheets importing CSV, the third column of the CSV file is delimited by quotes, and any quote characters that might be in the transcript (which would interfere with parsing the third column as a string) are converted to "''". * fix syntax error * docstring edit Co-authored-by: Jong Wook Kim <jongwook@openai.com> Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>
… in milliseconds. (openai#228) * Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas> * for easier reading by spreadsheets importing CSV, the third column of the CSV file is delimited by quotes, and any quote characters that might be in the transcript (which would interfere with parsing the third column as a string) are converted to "''". * fix syntax error * docstring edit Co-authored-by: Jong Wook Kim <jongwook@openai.com> Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>
… in milliseconds. (openai#228) * Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas> * for easier reading by spreadsheets importing CSV, the third column of the CSV file is delimited by quotes, and any quote characters that might be in the transcript (which would interfere with parsing the third column as a string) are converted to "''". * fix syntax error * docstring edit Co-authored-by: Jong Wook Kim <jongwook@openai.com> Co-authored-by: Jong Wook Kim <jongwook@nyu.edu>




This PR adds CSV output to Whisper transcription similar to the way #102 added SRT subtitle formatted output.
Each line of the resulting CSV file is formatted like:
<startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, "<transcript-including-commas>"One of the reasons for using integer millisecond timings is to avoid regional incompatibilities with writing and reading floating point timings across language regions which use different characters - either "." or "," - as the decimal separator (c.f. #197). The CSV format with integer millisecond timings also allows for more efficient parsing and storage of Whisper results when read into other applications, in other languages like C++.