PodTextify

The internet is large, but not unlimited. To train the next generation of AI, we need to move beyond basic text scraping. To this end, we propose PodTextify, the first and only dataset of podcasts.

Installation:

Clone the repo
Install whisper.cpp with OpenVINO support
Install the requirements:

pip3 install -r requirements.txt

Usage:

python3 main.py --name='(name of podcast, or "random" for a random podcast)' --number=(number of episodes to download)
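
As an illustration of the two flags above, here is a minimal sketch of how they could be parsed with Python's argparse. It is not taken from main.py: the helper name parse_cli, the default of one episode, and the check that --number is positive are all assumptions.

```python
import argparse


def parse_cli(argv=None):
    """Parse --name and --number (hypothetical helper, not from main.py)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument(
        "--name",
        required=True,
        help="podcast name, or 'random' to pick a random podcast",
    )
    parser.add_argument(
        "--number",
        type=int,
        default=1,  # assumed default; the README does not state one
        help="number of episodes to download",
    )
    args = parser.parse_args(argv)
    if args.number < 1:
        # Assumed validation: reject zero or negative episode counts.
        parser.error("--number must be at least 1")
    return args


# Example call with an explicit argument list instead of sys.argv.
example = parse_cli(["--name=random", "--number=3"])
print(f"podcast={example.name!r} episodes={example.number}")
```

Passing the argument list explicitly keeps the sketch easy to try in an interpreter; calling parse_cli() with no arguments would read the real command line instead.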

Description

PodTextify is the world's first podcast-to-database program. The world is running out of high-accuracy, human-written text to train LLMs on. With PodTextify, you can pass in a podcast name and download an arbitrary number of episodes from a scraped iTunes listing directory. The program then converts the downloaded episodes to text files, cleans and grammar-checks them, and writes them to a model dataset parquet file.
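
The cleaning step described above might look roughly like the sketch below. It is an illustration, not the code in main.py. It assumes the transcript arrives in whisper.cpp's default console format, where each segment is prefixed with a timestamp such as [00:00:00.000 --> 00:00:04.000]. The names clean_transcript and to_record and the record fields are hypothetical, and grammar checking is left out.

```python
import re

# Matches a leading segment timestamp like "[00:00:00.000 --> 00:00:04.000]".
TIMESTAMP = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\]\s*"
)


def clean_transcript(raw: str) -> str:
    """Strip timestamps and collapse whitespace (hypothetical helper)."""
    parts = []
    for line in raw.splitlines():
        text = TIMESTAMP.sub("", line.strip())
        text = re.sub(r"\s+", " ", text).strip()
        if text:  # skip blank lines and timestamp-only segments
            parts.append(text)
    return " ".join(parts)


def to_record(podcast: str, episode: str, raw: str) -> dict:
    """Build one dataset row; the field names are assumptions."""
    return {
        "podcast": podcast,
        "episode": episode,
        "text": clean_transcript(raw),
    }


sample = (
    "[00:00:00.000 --> 00:00:04.000]   Hello  and welcome\n"
    "\n"
    "[00:00:04.000 --> 00:00:06.500]  to the show.\n"
)
print(to_record("Example Podcast", "Episode 1", sample))
```

A list of such records could then be written to a parquet file, for example with pandas' DataFrame.to_parquet, which needs a parquet engine such as pyarrow installed.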
