A Python tool for extracting transcripts from video and audio files.
- Extracts text and transcripts from video and audio sources.
- Downloads and converts transcripts from SharePoint/Stream links.
- Automatically creates a structured directory for processing media and storing results.
- Includes utilities for data processing.
- Supports web scraping for gathering transcripts from online sources.
- Can be built as a standalone executable using the provided Makefile.
- On first run, the bundled executable automatically generates all necessary folders.
- `main.py`: Main entry point for running extraction scripts.
- `DataProcessing/`: Scripts and tools for processing extracted data.
- `WebScraper/`: Tools for scraping transcripts from the web.
- `Utility/`: Additional helper scripts.
- `requirement.txt`: List of Python dependencies.
- `Makefile`: Automates build, run, clean, install, and package operations.
- Video: `.mp4`, `.mov`, `.3gp`, `.avi`, `.mkv`
- Audio: `.mp3`, `.wav`, `.m4a`
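
As a rough illustration of how the extension lists above could be used to decide where an incoming file belongs, here is a minimal sketch. The function name `classify_media` and its return labels are hypothetical and are not taken from the project's code.

```python
from pathlib import Path
from typing import Optional

# Extensions listed above as supported.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".3gp", ".avi", ".mkv"}
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a"}


def classify_media(path: str) -> Optional[str]:
    """Return "video", "audio", or None for an unsupported extension."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    return None
```

A caller could use the result to pick between `1.2-RawVIDEO/` and `1.1-RawAUDIO/`, and skip files for which the function returns `None`.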
The system supports four distinct pipelines for transcript generation.
On the first run, the bundled executable automatically creates all necessary folders so you can drop your files into the right place.
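
The folder creation step might look roughly like the sketch below. The folder list is collected from the names used in this README (the HTML folder appears as both `3-HTML/` and `3.1-HTML/`; the sketch assumes the latter), and `ensure_work_dirs` is a hypothetical helper, not the tool's actual function.

```python
from pathlib import Path

# Folder names as they appear in this README; the real tool may use a
# slightly different set (assumption).
WORK_DIRS = [
    "1.1-RawAUDIO",
    "1.2-RawVIDEO",
    "1.3-RawLINK",
    "2.1-SplittedAUDIO",
    "2.2-SplittedVIDEO",
    "EXTRA-EnhancedAUDIO",
    "3.1-HTML",
    "3.1-JSON",
    "4-Transcript",
]


def ensure_work_dirs(root: str = ".") -> list[Path]:
    """Make sure every working folder exists under root; return all their paths."""
    folders = []
    for name in WORK_DIRS:
        folder = Path(root) / name
        folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        folders.append(folder)
    return folders
```

Because `mkdir` is called with `exist_ok=True`, running the helper on every start is harmless: existing folders and their contents are left untouched.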
```mermaid
flowchart LR
VR1[1.2-RawVIDEO]
AR1[1.1-RawAUDIO]
VR2[1.2-RawVIDEO]
AR2[1.1-RawAUDIO]
SP[1.3-RawLINK]
HTML_IN[3.1-HTML]
VR1 --> AR1
AR2 --> VR2
VR2 --> SV[2.2-SplittedVIDEO]
AR1 --> SA[2.1-SplittedAUDIO]
SA --> HTML[3.1-HTML]
SV --> HTML
SP --> JSON[3.1-JSON]
HTML --> Transcript[4-Transcript]
JSON --> Transcript
HTML_IN --> Transcript
```
This is the standard workflow when starting from audio or video files:
- Video Conversion: Place raw video files in `1.2-RawVIDEO/`. → Audio is automatically extracted into `1.1-RawAUDIO/`.
- Audio Input: You can also place audio files directly in `1.1-RawAUDIO/`; they are processed in the same way as the extracted ones.
- Audio Splitting: Files in `1.1-RawAUDIO/` are split into segments and stored in `2.1-SplittedAUDIO/`. Each group of split files is accompanied by a metadata file that records information such as the language of the audio.
- Audio Enhancement (optional): Enhanced audio (noise reduction, etc.) is stored in `EXTRA-EnhancedAUDIO/`.
- Web Scraping & Conversion: Split audio files from `2.1-SplittedAUDIO/` are processed into HTML outputs in `3-HTML/`.
- Transcript Generation: HTML files are converted into transcripts and saved in `4-Transcript/`.
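
To make the video conversion step more concrete, the sketch below shows one way audio could be pulled out of a raw video file. It assumes the `ffmpeg` command-line tool is used and that extracted audio is written as `.wav` with the same base name; neither detail is confirmed by this README, and both helper functions are hypothetical.

```python
from pathlib import Path

RAW_VIDEO_DIR = Path("1.2-RawVIDEO")
RAW_AUDIO_DIR = Path("1.1-RawAUDIO")


def audio_target_for(video_path: Path, audio_dir: Path = RAW_AUDIO_DIR) -> Path:
    """Map a raw video file to the audio file that would be extracted from it."""
    # Assumption: the audio keeps the video's base name and uses .wav.
    return audio_dir / (video_path.stem + ".wav")


def build_extract_command(video_path: Path, audio_path: Path) -> list[str]:
    """Build an ffmpeg command that drops the video stream and keeps the audio."""
    # -y overwrites an existing output, -i names the input, -vn disables video.
    return ["ffmpeg", "-y", "-i", str(video_path), "-vn", str(audio_path)]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.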
This alternate workflow starts from audio and goes through a video stage:
- Audio Input: Place audio files in `1.1-RawAUDIO/`.
- Audio-to-Video Conversion: Audio files are converted into video format and stored in `1.2-RawVIDEO/`.
- Video Splitting: Files in `1.2-RawVIDEO/` are split into segments and stored in `2.2-SplittedVIDEO/`. Each group of split files is accompanied by a metadata file that records information such as the language of the video content.
- Web Scraping & Conversion: Split video files from `2.2-SplittedVIDEO/` are processed into HTML outputs in `3-HTML/`.
- Transcript Generation: HTML files are converted into transcripts and saved in `4-Transcript/`.
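
The splitting step and its metadata file can be pictured with the sketch below. It only plans segment boundaries and builds a metadata record; the fixed segment length, the function names, and the metadata field names are assumptions for illustration, since this README does not document the real format.

```python
import json
import math


def plan_segments(duration_s: float, segment_s: float) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds that cover the whole recording."""
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration and segment length must be positive")
    count = math.ceil(duration_s / segment_s)
    # The last segment is clipped so it never runs past the end of the file.
    return [
        (i * segment_s, min((i + 1) * segment_s, duration_s))
        for i in range(count)
    ]


def build_metadata(source_name: str, language: str, segments: list) -> str:
    """Serialize per-group metadata as JSON (field names are hypothetical)."""
    record = {
        "source": source_name,
        "language": language,
        "segment_count": len(segments),
    }
    return json.dumps(record, indent=2)
```

For a 125-second file and 60-second segments, `plan_segments(125.0, 60.0)` yields three segments, the last one covering seconds 120 to 125.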
This pipeline downloads and converts transcripts directly from SharePoint/Stream links:
- Link Input: Prepare a JSON file with SharePoint/Stream links in `1.3-RawLINK/`. The JSON can be in one of these formats:
  - An object mapping names to links: `{ "Video Name 1": "https://...", "Video Name 2": "https://..." }`
  - A list of objects: `[ {"name": "Video Name 1", "url": "https://..."}, {"name": "Video Name 2", "url": "https://..."} ]`
  - A list of pairs: `[["Video Name 1", "https://..."], ["Video Name 2", "https://..."]]`
- SharePoint Transcript Download: The system opens each link in a browser, extracts the transcript using Selenium, and saves the structured data as JSON in `3.1-JSON/`.
- Login Handling: If you're not already logged in, the system will wait for you to log in to SharePoint/Stream (default timeout: 600 seconds).
- Transcript Conversion: JSON transcript files are converted into concatenated markdown transcripts and saved in `4-Transcript/`.
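
Since the link file can take three different shapes, a loader has to normalize them before opening any link. The sketch below shows one possible way to do that; `normalize_links` and `load_links` are hypothetical names, and the tool's own parsing may differ.

```python
import json
from typing import Any


def normalize_links(data: Any) -> list[tuple[str, str]]:
    """Convert any of the three accepted JSON shapes into (name, url) pairs."""
    if isinstance(data, dict):
        # Shape 1: {"Video Name": "https://..."}
        return [(str(name), str(url)) for name, url in data.items()]
    if isinstance(data, list):
        pairs = []
        for entry in data:
            if isinstance(entry, dict):
                # Shape 2: {"name": ..., "url": ...}
                pairs.append((str(entry["name"]), str(entry["url"])))
            elif isinstance(entry, (list, tuple)) and len(entry) == 2:
                # Shape 3: ["Video Name", "https://..."]
                pairs.append((str(entry[0]), str(entry[1])))
            else:
                raise ValueError(f"Unrecognized link entry: {entry!r}")
        return pairs
    raise ValueError("Link file must contain a JSON object or a JSON list")


def load_links(path: str) -> list[tuple[str, str]]:
    """Read a link file from disk and return normalized (name, url) pairs."""
    with open(path, encoding="utf-8") as handle:
        return normalize_links(json.load(handle))
```

Normalizing up front means the download step only ever deals with one shape, and a malformed entry fails early with a clear error instead of partway through a browser session.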
This lightweight pipeline skips all media processing and goes directly from HTML files to transcripts:
- HTML Input: Place pre-processed HTML files in `3.1-HTML/`. (These might come from a previous run, external sources, or manual uploads.)
- Transcript Generation: HTML files are parsed and converted into transcripts, which are saved in `4-Transcript/`.
This is useful when you already have HTML files and just need to extract and format the transcript text.
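
As a simplified picture of the HTML-to-transcript step, the sketch below collects the visible text of an HTML document with the standard library's `html.parser`. The structure of the tool's real HTML files is not described here, so this assumes the transcript is simply the page's visible text; `TextCollector` and `html_to_transcript` are hypothetical names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped element.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_transcript(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A real converter would likely target the specific elements that hold transcript lines rather than all visible text, but the overall flow (parse, collect, join, write to `4-Transcript/`) would be similar.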
```
git clone https://github.com/VitoBarra/TextExtractorFromMedia.git
cd TextExtractorFromMedia
pip install -r requirements.txt
```

The included Makefile can build a standalone executable for your OS.
| Target | Description | Notes |
|---|---|---|
| `all` | Default target; builds the executable. | Equivalent to `make build`. |
| `check` | Checks if PyInstaller is installed. | Exits with an error if PyInstaller is missing. |
| `build` | Builds the standalone executable. | Skips if the executable already exists. Adds `--strip` and `--noconsole` in release mode. |
| `run` | Runs the executable (or builds it first if missing). | Uses the OS-specific executable (`MediaTranscriber.exe` on Windows, `MediaTranscriber` on Linux/macOS). |
| `clean` | Removes only the executable. | Does not delete build folders or spec files. |
| `install` | Copies the executable to `/usr/local/bin`. | Requires `sudo`. |
| `package` | Creates a tar.gz archive containing the executable. | Uses the `dist` folder and names the archive as `MediaTranscriber-<VERSION>.tar.gz`. |
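
The naming rules in the table can be expressed in a few lines of Python, which may be handy for scripts that need to locate the build output. This is only a sketch of the naming convention described above; the Makefile's own OS detection is not shown in this README, and both function names are hypothetical.

```python
import platform
from typing import Optional

APP_NAME = "MediaTranscriber"


def executable_name(system: Optional[str] = None) -> str:
    """Return the OS-specific executable name described in the table."""
    # platform.system() reports "Windows", "Linux", or "Darwin" (macOS).
    system = system or platform.system()
    return f"{APP_NAME}.exe" if system == "Windows" else APP_NAME


def archive_name(version: str) -> str:
    """Return the archive name produced by the package target for a version."""
    return f"{APP_NAME}-{version}.tar.gz"
```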