Frontend repo that connects to this backend server: https://github.com/bradleyyang/hacktrent-app
This repository provides a complete system for translating American Sign Language (ASL) video frames into English text, with an optional text-to-speech output layer.
It includes a trained gesture recognition model, a FastAPI backend for inference, and a fully containerized runtime for easy deployment.
| Feature | Description |
|---|---|
| Custom-trained ASL recognition model | Converts ASL hand gestures to English tokens. |
| REST API (FastAPI) | `/predict` converts an ASL image to text with the MediaPipe model, then generates speech with ElevenLabs; `/get_transcription` transcribes speech to text with Gemini; `/health` reports service status. |
| Dockerized Deployment | Run locally or on cloud providers like Render, Railway, DigitalOcean, etc. |
| Optional Text-to-Speech | Uses ElevenLabs to convert recognized text into natural-sounding speech. |
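Assuming the server is running locally on port 8000 and `/predict` accepts a multipart file field named `file` (as in the curl example below) and returns JSON, the endpoint can be called from Python with only the standard library. This is a sketch, not code from this repo:

```python
import json
import mimetypes
import urllib.request
import uuid

BASE_URL = "http://localhost:8000"  # assumed local deployment


def build_multipart(field: str, filename: str, data: bytes):
    """Encode a single file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"


def predict(image_path: str):
    """POST an image to /predict and return the parsed JSON response."""
    with open(image_path, "rb") as f:
        body, content_type = build_multipart("file", image_path, f.read())
    req = urllib.request.Request(
        f"{BASE_URL}/predict",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Usage: `predict("path/to/your/image.jpg")` once the server is up. The exact shape of the JSON response depends on the server implementation.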
```
ASL_to_text/
├─ ASL_to_English/                    # FastAPI server + original MediaPipe model repo
│  ├─ routes.py                       # API endpoints
│  ├─ api_server.py                   # FastAPI app
│  ├─ api_calls.py                    # Helpers that call the ElevenLabs and Gemini APIs
│  ├─ signtalk.py                     # MediaPipe model: trains at server start-up and runs inference on image input
│  ├─ annotations/                    # Label data / TF records (if present)
│  ├─ my_ssd_mobnet/                  # Model pipeline / checkpoints / label_map
│  ├─ test/                           # Test images / clips
│  ├─ Signlangtranslator.ipynb        # Training / experimentation notebook
│  └─ realtime_image_collection.ipynb # Webcam image collection notebook
├─ requirements.txt                   # Original Python deps for training (only needed if you retrain the model)
├─ Dockerfile                         # Production container
├─ docker-compose.yml                 # Local orchestration (optional)
├─ .gitignore
├─ .dockerignore
└─ README.md                          # (this file)
```
```shell
python3 -m venv .venv
source .venv/bin/activate
pip install -r ASL_to_English/requirements_api.txt
pip install -r ASL_to_English/requirements_signtalk.txt
python server.py
```

Test the running server with a sample image:

```shell
curl -X POST "http://localhost:8000/predict" \
  -F "file=@path/to/your/image.jpg"
```

The fine-tuned model is forked from this repo: https://github.com/priiyaanjaalii0611/ASL_to_English
However, the training process, API backend, system architecture, and deployment workflow in this repository are independently developed.
```shell
docker build -t asl-recognition-api .
docker run -p 8000:8000 \
  -e ELEVENLABS_API_KEY="your_key_here" \
  -e GEMINI_API_KEY="your_key_here" \
  asl-recognition-api
```

The API will be available at: http://localhost:8000

If you want the recognized text to be spoken aloud:

```shell
export ELEVENLABS_API_KEY="your_key_here"
```

The `/predict` endpoint then automatically returns audio along with the prediction.
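The repo's docker-compose.yml is not shown here, but a compose file equivalent to the `docker run` command above could look like this sketch (the service name `asl-api` is illustrative; the actual file may differ):

```yaml
# Hypothetical docker-compose.yml mirroring the docker run command above
services:
  asl-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - ELEVENLABS_API_KEY=${ELEVENLABS_API_KEY}
      - GEMINI_API_KEY=${GEMINI_API_KEY}
```

With this in place, `docker compose up --build` starts the API with the keys taken from your shell environment.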
- Webcam real-time streaming translation
- Larger vocabulary training
- A model that takes video clips as input rather than still frames