Get audio from text with Python

Hello, let’s see how to make a script to get audio from text using Python.

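Once you run the script below, press the “Grab to audio” button, then click first on the top-left corner and then on the bottom-right corner of the screen region that contains the text.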

This script is a Python program for capturing a specific region of the screen, performing optical character recognition (OCR) on the captured image to extract text, and then converting that text into audio. Here’s a breakdown of the key components and functionalities:

Imports:

win32clipboard: A module for interacting with the Windows clipboard.

pyscreenshot: Used for taking screenshots.

os: Provides a way to interact with the operating system.

pynput.mouse.Listener: Monitors mouse events.

sys: Provides access to some variables used or maintained by the interpreter.

tkinter: GUI toolkit for creating a simple graphical user interface.

gtts (Google Text-to-Speech): Converts text into spoken words.

time: Provides various time-related functions.

glob: Finds all the pathnames matching a specified pattern.

PIL: The Python Imaging Library (installed as the Pillow package), used here for image processing.

pytesseract: Wrapper for Google’s Tesseract-OCR Engine.
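Since several of these modules are third-party, it can help to check that they are installed before running the script. The following is a minimal sketch, not part of the original script: the helper name `missing_packages` is hypothetical, and the pip package names in the mapping are the usual ones but should be verified for your setup.

```python
from importlib.util import find_spec

# Import name -> pip package name (assumed mapping; verify for your environment)
THIRD_PARTY = {
    "win32clipboard": "pywin32",
    "pyscreenshot": "pyscreenshot",
    "pynput": "pynput",
    "gtts": "gTTS",
    "PIL": "Pillow",
    "pytesseract": "pytesseract",
}


def missing_packages(modules=THIRD_PARTY):
    """Return the pip names of the modules that cannot be found."""
    return [pip_name for module, pip_name in modules.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All third-party modules were found.")
```

Only top-level module names are checked, so nothing is actually imported while looking them up.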

Functions:

grab(x, y, w, h): Captures the screen region whose top-left corner is (x, y) and whose bottom-right corner is (w, h); despite their names, w and h are coordinates, not a width and a height. Saves the screenshot as ‘im.png’ and runs OCR on it with MP3 output enabled.

save(im): Saves the provided image and opens it using the default image viewer.

ocr(image, mp3=0): Performs OCR on the specified image file. If mp3 is set to 1, it also creates an MP3 file from the extracted text using Google Text-to-Speech.

create_mp3(text, lang="en"): Converts the provided text to an MP3 file using Google Text-to-Speech.

clip(): Retrieves text data from the clipboard and attempts to create an MP3 file from it.

on_click(x, y, button, pressed): Callback function for mouse clicks. Records the coordinates of the first and second clicks and calls the grab function.

start(): Initiates the process by destroying the GUI window and prompting the user to click on the top-left and bottom-right corners of the region to capture.
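The script expects the first click to be on the top-left corner and the second on the bottom-right. As a sketch of how the two clicks could be made order-independent, the hypothetical helper below (not part of the original script) sorts the coordinates into the (left, top, right, bottom) order that a bounding box needs.

```python
def normalize_bbox(x1, y1, x2, y2):
    """Return (left, top, right, bottom) whatever the order of the two clicks."""
    left, right = sorted((int(x1), int(x2)))
    top, bottom = sorted((int(y1), int(y2)))
    return left, top, right, bottom


if __name__ == "__main__":
    # Second click is above and to the left of the first one
    print(normalize_bbox(300, 400, 100, 200))
```

The result could then be passed to grab in place of the raw click coordinates.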

GUI (tkinter):

Creates a simple GUI with two buttons: “Grab to audio” and “Audio from clipboard.”

The “Grab to audio” button triggers the start function, which in turn initiates the mouse click listener for capturing a specific region of the screen.

The “Audio from clipboard” button triggers the clip function, which attempts to create an MP3 file from the text in the clipboard.
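Both ocr and clip try again when MP3 creation fails, giving up after three attempts. One way to express that idea without global counters is a small bounded retry loop. This is only a sketch: the name `retry` and its `attempts` and `delay` parameters are assumptions, not something the script defines.

```python
import time


def retry(action, attempts=3, delay=1.0):
    """Call action() until it succeeds or the attempts run out; return its result."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            last_error = error
            print(f"Attempt {attempt} failed: {error}")
            if attempt < attempts:
                time.sleep(delay)  # short pause before trying again
    raise RuntimeError("All attempts failed") from last_error
```

With this helper, a call such as `retry(lambda: create_mp3(text))` would replace the recursive retries.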

Additional Notes:

The script also shows the first PNG image found in the current directory (the slides list) in the tkinter window; clicking the image runs OCR on it and prints the extracted text.

The script uses the pytesseract library for OCR, which relies on the Tesseract OCR engine. Make sure Tesseract is installed on the system.
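To check whether Tesseract is available before running OCR, something like the sketch below could be used. The helper name `find_tesseract` is hypothetical, and the default path with the `.exe` suffix is an assumption based on the install location the script hard-codes.

```python
import os
import shutil

# Assumed default Windows install location; adjust if yours differs
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(default=DEFAULT_TESSERACT):
    """Return a path to the tesseract executable, or None if it is not found."""
    on_path = shutil.which("tesseract")  # look on the PATH first
    if on_path:
        return on_path
    return default if os.path.isfile(default) else None


if __name__ == "__main__":
    path = find_tesseract()
    print(path or "Tesseract was not found; install it or fix the path.")
```

The returned path could be assigned to pytesseract.pytesseract.tesseract_cmd instead of the fixed string.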

Some parts of the code are commented out, and there’s a commented-out “Help” button in the GUI.

This script essentially provides a basic graphical interface for capturing a region of the screen, extracting text from it, and converting that text into an audio file.
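Between the OCR step and the audio step it may be worth tidying the text, since OCR output tends to contain hard line breaks and can be empty when nothing is recognized. The helper below is a hypothetical addition, not part of the script.

```python
import re


def prepare_text(raw):
    """Collapse runs of whitespace (including newlines) and trim the result.

    Returns an empty string when nothing readable is left.
    """
    return re.sub(r"\s+", " ", raw).strip()


if __name__ == "__main__":
    cleaned = prepare_text("Hello\nworld  \n\n")
    if cleaned:
        print(cleaned)
    else:
        print("No text to read aloud.")
```

Checking for an empty result first avoids calling the text-to-speech step with nothing to say.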

# grabscreen.py
import win32clipboard
import pyscreenshot as ImageGrab
import os
from pynput.mouse import Listener
import sys
import tkinter as tk
from gtts import gTTS
import time
from glob import glob
from PIL import Image, ImageTk
'''
        Grab text from a region of the screen,
        selected by clicking on the top-left corner
        and then on the bottom-right corner of the area
        that contains the text.
        The text is printed to the console
        and then converted into audio.

'''

import pytesseract


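def grab(x, y, w, h):
    # Despite their names, w and h are the x and y of the bottom-right corner:
    # bbox is (left, top, right, bottom)
    im = ImageGrab.grab(bbox=(x, y, w, h))
    save(im)
    ocr("im.png", mp3=1)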


def save(im):
    im.save('im.png')
    os.startfile('im.png')

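trycount = 0
def ocr(image, mp3=0):
    global trycount

    # Path to the Tesseract executable (default Windows install location)
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
    text = pytesseract.image_to_string(image)
    print(text)
    if mp3 == 1:
        try:
            create_mp3(text)
            trycount = 0
        except Exception:
            # MP3 creation failed (gTTS needs a connection): retry up to three times
            trycount += 1
            if trycount < 3:
                ocr(image, mp3)
            else:
                print("Some problems with connection maybe")
                trycount = 0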

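def create_mp3(text, lang="en"):
    # gTTS uses Google's online service, so this needs a network connection
    s = gTTS(text, lang=lang)
    print("Wait a second...")
    time.sleep(3)
    s.save("text.mp3")
    # Open the MP3 with the default player (Windows only, like save())
    os.startfile("text.mp3")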

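trycount2 = 0
def clip():
    global trycount2

    # Read the clipboard contents as text, always closing the clipboard afterwards
    win32clipboard.OpenClipboard()
    try:
        data = win32clipboard.GetClipboardData(win32clipboard.CF_UNICODETEXT)
    finally:
        win32clipboard.CloseClipboard()
    try:
        create_mp3(data)
        trycount2 = 0
    except Exception:
        # MP3 creation failed: retry up to three times
        trycount2 += 1
        if trycount2 < 3:
            clip()
        else:
            print("Some problems with connection maybe")
            trycount2 = 0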


click1 = 0
x1 = 0
y1 = 0
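def on_click(x, y, button, pressed):
    global click1, x1, y1

    if pressed:
        if click1 == 0:
            # First click: remember the top-left corner
            x1 = x
            y1 = y
            click1 = 1
        else:
            # Second click: capture the region, then stop the listener
            grab(x1, y1, x, y)
            click1 = 0
            return False  # returning False from a pynput callback stops the listener
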
def start():
    global listener

    root.destroy()
    print("Click once on top left and once on bottom right")
    # with Listener(on_move=on_move, on_click=on_click, on_scroll=on_scroll) as listener:
    with Listener(on_click=on_click) as listener:
        listener.join()
        # listener.stop()
        # sys.exit()

root = tk.Tk()
root.title("GRAUTESC 2 - Text to Audio APP")
root.geometry("600x500")
but = tk.Button(root, text="Grab to audio", command=start, width=20, height=3, bg="gold")
but.pack()
butclip = tk.Button(root, text="Audio from clipboard", command=clip, width=20,height=3, bg="gold")
butclip.pack()

# # HELP
# buthelp = tk.Button(root, text="Help", command=clip, width=20, height=3, bg="gold")
# buthelp.pack()

counter = 0
def lab_print(event):
    ocr(slides[0], mp3=0)

    # global counter, slides, label

    # counter += 1
    # print(counter)
    # if counter < len(slides) - 1:
    #     img = tk.PhotoImage(file=slides[counter])
    #     label["image"] = img
    #     label.image = img
    #     label.pack()
    # else:
    #     counter = 0


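# SLIDES: show the first PNG found in the current directory, if there is one
slides = glob("*.png")
if slides:
    image = Image.open(slides[0])
    print(slides[0])
    # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
    image = image.resize((200, 400), Image.LANCZOS)
    img = ImageTk.PhotoImage(image=image)
    label = tk.Label(root, image=img)
    label.pack()
    label.bind("<Button-1>", lab_print)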

root.mainloop()