🎬 AI-powered video transcription platform combining audio transcription, AI lip reading, and speaker diarization for accessible, accurate video captions.
- 🎤 Audio Transcription: AWS Transcribe with speaker diarization
- 👄 AI Lip Reading: LipCoordNet visual speech recognition via SageMaker
- 👤 Face Detection: AWS Rekognition face tracking
- 🎯 Speaker Fusion: Intelligent alignment of audio + visual + face data
- 🌐 Web Interface: Modern drag-and-drop upload + caption viewer
- ⚡ GPU-Accelerated: Fast inference on SageMaker Serverless
- 🏗️ Serverless Pipeline: AWS Lambda + Step Functions orchestration
- AWS Account with CLI configured
- Git, Terraform, Python 3.8+
- SSH client (for EC2 access)
```powershell
git clone https://github.com/gordowuu/GORGGLES2.git
cd GORGGLES2

# Package Lambda layers
cd scripts
.\package_lambda_layers.ps1 -Region us-east-1

# Apply Terraform
cd ..\infra\terraform
terraform init
terraform apply

# Build model artifact
python scripts/build_lipcoordnet_artifact.py

# Upload to S3
aws s3 cp artifacts/model-lipcoordnet.tar.gz s3://gorggle-dev-uploads/sagemaker-models/

# Deploy SageMaker endpoint
python scripts/deploy_lipcoordnet.py `
  --endpoint-name gorggle-lipcoordnet-dev `
  --role-arn arn:aws:iam::YOUR_ACCOUNT:role/service-role/AmazonSageMaker-ExecutionRole `
  --instance-type ml.g5.xlarge

# Test with video from S3
python scripts/test_lipcoordnet_endpoint.py `
  --video-bucket gorggle-dev-uploads `
  --video-key test-video.mov

# Open web interface
start web\index.html
```

| Document | Description |
|---|---|
| ARCHITECTURE.md | System design and data flow |
| DEPLOYMENT_CHECKLIST.md | Pre-flight checks and validation |
| web/README.md | Frontend usage guide |
| TODO.md | Roadmap and pending tasks |
```
User (uploads MP4)
        │
        ▼
S3 Upload Bucket (s3://gorggle-dev-uploads)
        │  S3 event trigger
        ▼
Step Functions State Machine (parallel processing):
  ├── Extract Media (FFmpeg/OpenCV) ──▶ LipCoordNet on SageMaker Serverless (GPU inference)
  ├── AWS Transcribe (audio + speaker labels)
  └── AWS Rekognition (face detection)
        │
        ▼
Fuse Results (align & merge)
  ├──▶ S3 Processed Bucket (JSON)
  └──▶ DynamoDB (job index)
        │
        ▼
API Gateway + Lambda: GET /results/{jobId}
        │
        ▼
Web Interface (caption viewer)
```
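The parallel stage of the pipeline can be sketched as a Step Functions definition. The dict below is an illustrative Amazon States Language (ASL) fragment, not the repo's actual state machine; state names and Lambda ARNs are hypothetical:

```python
import json

# Illustrative sketch of the pipeline's parallel stage as Step Functions ASL.
# State names and Lambda ARNs are hypothetical, not copied from the repo.
definition = {
    "StartAt": "ProcessVideo",
    "States": {
        "ProcessVideo": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "ExtractMedia", "States": {"ExtractMedia": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:ACCOUNT:function:extract_media",
                    "End": True}}},
                {"StartAt": "StartTranscribe", "States": {"StartTranscribe": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:ACCOUNT:function:start_transcribe",
                    "End": True}}},
                {"StartAt": "StartRekognition", "States": {"StartRekognition": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:ACCOUNT:function:start_rekognition",
                    "End": True}}},
            ],
            "Next": "FuseResults",
        },
        "FuseResults": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:ACCOUNT:function:fuse_results",
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))
```

All three branches receive the same input and their outputs are collected into an array that FuseResults consumes, which is what makes the fan-out/fan-in in the diagram work.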
LipCoordNet is a state-of-the-art visual speech recognition model:

- ✅ Trained on the GRID corpus for high-accuracy lip reading
- ✅ Lightweight, with fast inference (optimized for production)
- ✅ Works with 128×64 mouth ROI crops
- ✅ Pre-trained weights available on HuggingFace
- ✅ Integrates seamlessly with AWS SageMaker

- Deployment: Serverless SageMaker inference endpoints with auto-scaling
- Model: SilentSpeak/LipCoordNet on HuggingFace
```
GORGGLES2/
├── README.md                         # You are here
├── ARCHITECTURE.md                   # System design details
├── DEPLOYMENT_CHECKLIST.md           # Pre-deployment validation
├── TODO.md                           # Roadmap
│
├── sagemaker/                        # SageMaker model deployment
│   ├── inference_lipcoordnet.py      # Custom inference handler
│   ├── requirements_lipcoordnet.txt  # Model dependencies
│   └── container/                    # Docker container (optional)
│
├── infra/terraform/                  # Infrastructure as Code
│   ├── main.tf                       # Core resources
│   ├── security_groups.tf            # VPC networking
│   ├── lambda_layers.tf              # Lambda layers config
│   └── variables.tf                  # Configuration variables
│
├── lambdas/                          # Lambda functions
│   ├── extract_media/                # FFmpeg extraction
│   ├── invoke_lipreading/            # SageMaker caller
│   ├── fuse_results/                 # Result merger
│   ├── get_results/                  # API handler
│   ├── s3_trigger/                   # Pipeline starter
│   ├── start_transcribe/             # AWS Transcribe
│   └── start_rekognition/            # Face detection
│
├── scripts/                          # Deployment automation
│   ├── build_lipcoordnet_artifact.py # Build model.tar.gz
│   ├── deploy_lipcoordnet.py         # Deploy to SageMaker
│   ├── test_lipcoordnet_endpoint.py  # Test endpoint
│   └── package_lambda_layers.ps1     # Layer packager
│
└── web/                              # Frontend
    ├── index.html                    # Upload + viewer UI
    └── README.md                     # Frontend guide
```
| Layer | Technology |
|---|---|
| Frontend | HTML5, CSS3, Vanilla JavaScript |
| API | AWS API Gateway (HTTP API) |
| Compute | AWS Lambda (Python 3.11), SageMaker Serverless |
| ML Models | LipCoordNet, AWS Transcribe, AWS Rekognition |
| Storage | Amazon S3, DynamoDB |
| Orchestration | AWS Step Functions |
| IaC | Terraform |
| GPU | SageMaker ml.g5.xlarge (NVIDIA A10G) |
| ML Stack | PyTorch, Transformers, dlib, OpenCV |
Monthly cost for moderate usage (~50 videos/month, 5 min avg):
| Service | Usage | Cost |
|---|---|---|
| SageMaker Serverless | 50 invocations, ~5s each | ~$2-5 |
| Lambda | 50 executions | ~$1 |
| S3 | 100 GB storage | ~$2 |
| Transcribe | 4 hours audio | ~$10 |
| Rekognition | 2 hours video | ~$6 |
| Step Functions | 50 executions | ~$0.10 |
| API Gateway | 1000 requests | ~$0.01 |
| DynamoDB | On-demand | ~$0.50 |
| Total | | ~$21-24/month |
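As a quick sanity check, the per-service estimates above can be summed (figures copied from the table; the SageMaker line is the only one quoted as a range):

```python
# Per-service monthly estimates from the cost table above (USD).
fixed = {
    "Lambda": 1.00,
    "S3": 2.00,
    "Transcribe": 10.00,
    "Rekognition": 6.00,
    "Step Functions": 0.10,
    "API Gateway": 0.01,
    "DynamoDB": 0.50,
}
# SageMaker Serverless is the only line quoted as a range (~$2-5).
sagemaker_low, sagemaker_high = 2.00, 5.00

low = sagemaker_low + sum(fixed.values())
high = sagemaker_high + sum(fixed.values())
print(f"~${low:.2f} to ${high:.2f} per month")
```

The sum lands at roughly $21.61 to $24.61, consistent with the ~$21-24 total quoted in the table.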
Cost optimization tips:
- Use SageMaker Serverless (pay per inference, no idle costs)
- Set S3 lifecycle policies (delete old files after 30 days)
- Use Step Functions Express workflows for cheaper executions
- Batch multiple videos to reduce cold starts
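The 30-day lifecycle tip can be applied with a one-off boto3 call. The sketch below only builds and prints the rule so it runs without AWS access; the bucket name, rule ID, and `uploads/` prefix are illustrative assumptions:

```python
import json

# Lifecycle rule implementing the "delete old files after 30 days" tip.
# Bucket, rule ID, and prefix are illustrative, not taken from the repo.
lifecycle_config = {
    "Rules": [
        {
            "ID": "expire-uploaded-videos",
            "Status": "Enabled",
            "Filter": {"Prefix": "uploads/"},
            "Expiration": {"Days": 30},
            # Also clean up stray multipart uploads that never completed.
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

print(json.dumps(lifecycle_config, indent=2))

# To apply (requires AWS credentials; uncomment to run):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="gorggle-dev-uploads",
#     LifecycleConfiguration=lifecycle_config,
# )
```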
- ✅ IAM roles with least-privilege policies
- ✅ VPC security groups for network isolation
- ✅ Lambda VPC integration for EC2 access
- ✅ S3 encryption at rest (SSE-S3)
- ✅ API Gateway with CORS configuration
- ✅ SSH key-based EC2 access only
Best Practices:
- Never commit AWS credentials to Git
- Use AWS Secrets Manager for sensitive data
- Enable CloudTrail for audit logging
- Rotate SSH keys regularly
- Monitor with CloudWatch alarms
| Metric | Value | Notes |
|---|---|---|
| Inference Speed | ~3-5s per video | On SageMaker Serverless with LipCoordNet |
| Cold Start | ~30-60s | First invocation after idle period |
| End-to-End Latency | 2-5 minutes | For 5-minute video |
| Accuracy (Audio) | <5% WER (95%+ accuracy) | AWS Transcribe standard |
| Accuracy (Visual) | ~40% WER | LipCoordNet on GRID corpus |
| Throughput | Parallel processing | Multiple videos simultaneously |
```powershell
# Test with S3 video
python scripts/test_lipcoordnet_endpoint.py `
  --endpoint-name gorggle-lipcoordnet-dev `
  --video-bucket gorggle-dev-uploads `
  --video-key test-video.mov

# Upload test video
aws s3 cp test-video.mp4 s3://gorggle-dev-uploads/uploads/test-001.mp4

# Monitor Step Functions
aws stepfunctions list-executions `
  --state-machine-arn arn:aws:states:us-east-1:ACCOUNT:stateMachine:gorggle-dev-pipeline

# Fetch results
curl https://your-api-id.execute-api.us-east-1.amazonaws.com/results/test-001
```

Contributions welcome! Areas for improvement:
- Fine-grained word-level timestamps
- Multi-language support
- Real-time streaming processing
- Mobile app integration
- SRT/VTT export
- Speaker name assignment
- Batch processing API
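SRT/VTT export from the roadmap above is mostly timestamp formatting. A minimal converter from a caption list to SRT might look like the sketch below; the caption schema (`start`, `end`, `speaker`, `text`) is hypothetical, not the pipeline's actual output format:

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def captions_to_srt(captions) -> str:
    """captions: list of dicts with hypothetical keys start, end, speaker, text."""
    blocks = []
    for i, c in enumerate(captions, start=1):
        line = f"[{c['speaker']}] {c['text']}" if c.get("speaker") else c["text"]
        blocks.append(f"{i}\n{to_srt_time(c['start'])} --> {to_srt_time(c['end'])}\n{line}")
    return "\n\n".join(blocks) + "\n"

print(captions_to_srt([
    {"start": 0.0, "end": 2.5, "speaker": "spk_0", "text": "Hello there."},
]))
```

Prefixing the speaker label keeps the diarization output visible in players that render plain SRT.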
MIT License - see LICENSE for details.
- LipCoordNet: SilentSpeak
- Transformers: HuggingFace
- dlib: Davis King
- AWS: For Transcribe, Rekognition, SageMaker, and cloud infrastructure
- Issues: GitHub Issues
- Documentation: See project documentation files
- Model: LipCoordNet on HuggingFace
Made with ❤️ for accessible AI-powered video transcription

🎬 Start processing videos now! Follow the Quick Start guide above.
Serverless Inference (Recommended):
- Pay only for inference time
- Auto-scales from 0 to thousands of concurrent requests
- 30-60s cold start latency
- Ideal for sporadic workloads
Real-time Endpoint:
- Always-on, no cold starts
- Higher cost (~$1.41/hour for ml.g5.xlarge)
- Use for high-throughput production workloads
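The trade-off above maps directly onto SageMaker's endpoint-config API: a serverless variant carries a `ServerlessConfig` block, while a real-time variant carries an instance type and count. The variant dicts below are a sketch; model and variant names are illustrative, and the repo's actual values live in deploy_lipcoordnet.py:

```python
# Endpoint-config variants for the two deployment modes discussed above.
# Model/variant names are illustrative, not copied from the repo.
serverless_variant = {
    "VariantName": "AllTraffic",
    "ModelName": "gorggle-lipcoordnet",
    "ServerlessConfig": {
        "MemorySizeInMB": 6144,   # 1024-6144, in 1 GB increments
        "MaxConcurrency": 10,
    },
}

realtime_variant = {
    "VariantName": "AllTraffic",
    "ModelName": "gorggle-lipcoordnet",
    "InstanceType": "ml.g5.xlarge",
    "InitialInstanceCount": 1,
}

# To create an endpoint config (requires AWS credentials; uncomment to run):
# import boto3
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="gorggle-lipcoordnet-dev",
#     ProductionVariants=[serverless_variant],
# )
```

Note that a serverless variant specifies no instance type at all; capacity is expressed only through memory size and max concurrency.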
The invoke_lipreading Lambda requires the SageMaker endpoint name as an environment variable:
```
SAGEMAKER_ENDPOINT=gorggle-lipcoordnet-dev
```

Update this in lambdas/invoke_lipreading/handler.py or set it via Terraform.
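Inside the Lambda, reading that variable and calling the endpoint might look like the sketch below. The request payload shape and helper names are assumptions, not the repo's actual handler code:

```python
import json
import os

def get_endpoint_name() -> str:
    """Read the endpoint name configured via Terraform / the environment."""
    return os.environ.get("SAGEMAKER_ENDPOINT", "gorggle-lipcoordnet-dev")

def invoke_lipreading(frames_s3_uri: str) -> dict:
    """Send one inference request to the SageMaker endpoint (needs AWS creds)."""
    import boto3  # imported lazily so the module loads without AWS dependencies
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=get_endpoint_name(),
        ContentType="application/json",
        Body=json.dumps({"frames_s3_uri": frames_s3_uri}),  # payload shape assumed
    )
    return json.loads(response["Body"].read())
```

Reading the name at call time (rather than import time) means Terraform can change the endpoint without a code deploy.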
LipCoordNet requires specific preprocessing:
- Frame rate: 25 FPS
- Mouth ROI: 128×64 pixels
- Face detection: Uses dlib 68-point landmarks
- Crop region: Mouth landmarks (points 48-67)
The SageMaker inference handler automatically performs these steps.
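The crop step can be illustrated without dlib: given the 68 landmark points, take the bounding box of the mouth landmarks (indices 48-67), pad it, and resize the crop to 128×64. The padding factor below is an illustrative choice; the real logic lives in sagemaker/inference_lipcoordnet.py:

```python
def mouth_roi(landmarks, pad=0.10):
    """Padded bounding box around the dlib mouth landmarks (indices 48-67).

    landmarks: sequence of 68 (x, y) points from a face-landmark detector.
    Returns (x0, y0, x1, y1). The 10% padding is an assumption, not the
    repo's exact value.
    """
    mouth = landmarks[48:68]
    xs = [p[0] for p in mouth]
    ys = [p[1] for p in mouth]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    x0 = min(xs) - pad * w
    y0 = min(ys) - pad * h
    x1 = max(xs) + pad * w
    y1 = max(ys) + pad * h
    return x0, y0, x1, y1

# The crop would then be resized to the model's expected 128x64 input, e.g.:
# roi = cv2.resize(frame[int(y0):int(y1), int(x0):int(x1)], (128, 64))
```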
- Use Terraform for reproducible infrastructure
- Tag resources with project and environment labels
- Enable CloudWatch logging for all Lambda functions
- Set S3 lifecycle rules to auto-delete old videos
- Use IAM roles with least-privilege policies
- Monitor costs with AWS Cost Explorer and budgets
- Test with small videos first (~30 seconds)
- Enable X-Ray tracing for debugging Step Functions
Problem: Endpoint deployment fails
Solution: Check CloudWatch logs at /aws/sagemaker/Endpoints/gorggle-lipcoordnet-dev
Problem: Cold start timeout
Solution: Increase Lambda timeout to 300s or use async invocation
Problem: Out of memory errors
Solution: Increase SageMaker instance memory (use ml.g5.2xlarge)
Problem: FFmpeg not found in extract_media
Solution: Add FFmpeg Lambda layer ARN to Terraform configuration
Problem: Module import errors
Solution: Package dependencies with pip install -r requirements.txt -t .
Problem: VPC timeout errors
Solution: Ensure Lambda has VPC access and security groups allow outbound traffic
Problem: No face detected
Solution: Ensure person faces camera frontally, adequate lighting
Problem: Poor lip reading accuracy
Solution: LipCoordNet works best with clear frontal face views and minimal motion blur
Problem: Transcribe fails
Solution: Ensure video has audio track and is in supported format (MP4, MOV)
This project previously used AV-HuBERT on EC2. Current version uses LipCoordNet on SageMaker Serverless for better cost-efficiency and scalability.
Key changes:
- ✅ Replaced EC2 GPU instance with SageMaker Serverless
- ✅ Switched from AV-HuBERT to LipCoordNet (HuggingFace)
- ✅ Eliminated infrastructure management overhead
- ✅ Reduced costs by 85% ($150/mo vs $1,014/mo)
- ✅ Faster deployment (2-3 min vs 7-15 min)
Old EC2/AV-HuBERT code is available in git history if needed.