Gorggle

🎬 AI-powered video transcription platform combining audio transcription, AI lip reading, and speaker diarization for accessible, accurate video captions.


✨ Features

  • 🎤 Audio Transcription: AWS Transcribe with speaker diarization
  • 👄 AI Lip Reading: LipCoordNet visual speech recognition via SageMaker
  • 👀 Face Detection: AWS Rekognition face tracking
  • 🎯 Speaker Fusion: Intelligent alignment of audio + visual + face data
  • 🌐 Web Interface: Modern drag-and-drop upload + caption viewer
  • ⚡ GPU-Accelerated: Fast inference on SageMaker Serverless
  • 🏗️ Serverless Pipeline: AWS Lambda + Step Functions orchestration

🚀 Quick Start

Prerequisites

  • AWS Account with CLI configured
  • Git, Terraform, Python 3.8+
  • SSH client (for EC2 access)

1. Clone Repository

git clone https://github.com/gordowuu/GORGGLES2.git
cd GORGGLES2

2. Deploy Infrastructure

# Package Lambda layers
cd scripts
.\package_lambda_layers.ps1 -Region us-east-1

# Apply Terraform
cd ..\infra\terraform
terraform init
terraform apply

3. Build and Deploy LipCoordNet Model

# Build model artifact
python scripts/build_lipcoordnet_artifact.py

# Upload to S3
aws s3 cp artifacts/model-lipcoordnet.tar.gz s3://gorggle-dev-uploads/sagemaker-models/

# Deploy SageMaker endpoint
python scripts/deploy_lipcoordnet.py `
  --endpoint-name gorggle-lipcoordnet-dev `
  --role-arn arn:aws:iam::YOUR_ACCOUNT:role/service-role/AmazonSageMaker-ExecutionRole `
  --instance-type ml.g5.xlarge

4. Test the Endpoint

# Test with video from S3
python scripts/test_lipcoordnet_endpoint.py `
  --video-bucket gorggle-dev-uploads `
  --video-key test-video.mov

5. Open Web Interface

# Open web interface
start web\index.html

📚 Documentation

| Document | Description |
| --- | --- |
| ARCHITECTURE.md | System design and data flow |
| DEPLOYMENT_CHECKLIST.md | Pre-flight checks and validation |
| web/README.md | Frontend usage guide |
| TODO.md | Roadmap and pending tasks |

πŸ—οΈ Architecture Overview

┌─────────────┐
│   User      │
│  Upload MP4 │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────────────────────────────┐
│              S3 Upload Bucket                       │
│           s3://gorggle-dev-uploads                  │
└────────────────────┬────────────────────────────────┘
                     │ S3 Event Trigger
                     ▼
┌─────────────────────────────────────────────────────┐
│         Step Functions State Machine                │
│          (Parallel Processing)                      │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ┌──────────────────┐  ┌─────────────────────┐      │
│  │ Extract Media    │  │ AWS Transcribe      │      │
│  │ (FFmpeg/OpenCV)  │  │ (Audio + Speakers)  │      │
│  └────────┬─────────┘  └──────────┬──────────┘      │
│           │                       │                 │
│           │  ┌────────────────────┴──┐              │
│           │  │  AWS Rekognition      │              │
│           │  │  (Face Detection)     │              │
│           │  └─────────────┬─────────┘              │
│           │                │                        │
│           ▼                │                        │
│  ┌──────────────────────┐ │                        │
│  │  LipCoordNet         │ │                        │
│  │  SageMaker Serverless│ │                        │
│  │  (GPU Inference)     │ │                        │
│  └────────┬─────────────┘ │                        │
│           │                │                        │
│           └────────────────┴───────────┐            │
│                                        ▼            │
│                              ┌──────────────────┐   │
│                              │  Fuse Results    │   │
│                              │  (Align & Merge) │   │
│                              └────────┬─────────┘   │
└───────────────────────────────────────┼─────────────┘
                                        │
                     ┌──────────────────┴──────────────────┐
                     ▼                                     ▼
          ┌────────────────────┐              ┌─────────────────┐
          │  S3 Processed      │              │   DynamoDB      │
          │  Bucket (JSON)     │              │   (Job Index)   │
          └─────────┬──────────┘              └─────────────────┘
                    │
                    │
          ┌─────────▼──────────────────────────────┐
          │       API Gateway + Lambda             │
          │   GET /results/{jobId}                 │
          └─────────┬──────────────────────────────┘
                    │
                    ▼
          ┌──────────────────────┐
          │   Web Interface      │
          │   (Caption Viewer)   │
          └──────────────────────┘
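The Fuse Results stage merges the three parallel outputs by timestamp. A minimal sketch of overlap-based alignment follows; the data shapes and function names are illustrative assumptions, not the actual fuse_results Lambda code:

```python
def overlap(a, b):
    """Length of the overlap between two (start, end) time intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def fuse(transcript_segments, lip_segments):
    """Attach to each audio segment the lip-reading segment that overlaps
    it the most in time (or None when nothing overlaps)."""
    fused = []
    for seg in transcript_segments:
        span = (seg["start"], seg["end"])
        best = max(
            lip_segments,
            key=lambda lip: overlap(span, (lip["start"], lip["end"])),
            default=None,
        )
        if best is not None and overlap(span, (best["start"], best["end"])) == 0.0:
            best = None  # the "best" candidate doesn't actually overlap
        fused.append({**seg, "lip_text": best["text"] if best else None})
    return fused
```

In the real pipeline the fused record would also carry speaker labels and Rekognition face IDs; the overlap-maximization idea is the same.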

🎯 Why LipCoordNet?

LipCoordNet is a state-of-the-art visual speech recognition model:

✅ Trained on GRID corpus for high-accuracy lip reading
✅ Lightweight and fast inference (optimized for production)
✅ Works with 128×64 mouth ROI crops
✅ Pre-trained weights available on HuggingFace
✅ Integrates seamlessly with AWS SageMaker

Deployment: Serverless SageMaker inference endpoints with auto-scaling

Model: SilentSpeak/LipCoordNet on HuggingFace
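For reference, a hedged boto3 sketch of what a Serverless Inference deployment involves; the memory size, function names, and argument values are assumptions, and the real scripts/deploy_lipcoordnet.py may differ:

```python
def serverless_config(memory_mb=6144, max_concurrency=5):
    """Serverless Inference settings: pay per request, scale to zero.
    6 GB memory is an assumed sizing for LipCoordNet plus PyTorch."""
    return {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency}

def deploy(endpoint_name, model_name, role_arn, model_data_url, image_uri):
    """Create model -> endpoint config -> endpoint, the standard SageMaker
    three-step deploy. Argument values come from your own account/build."""
    import boto3  # deferred import keeps serverless_config dependency-free
    sm = boto3.client("sagemaker")
    sm.create_model(
        ModelName=model_name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
    )
    sm.create_endpoint_config(
        EndpointConfigName=f"{endpoint_name}-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "ServerlessConfig": serverless_config(),
        }],
    )
    sm.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=f"{endpoint_name}-config",
    )
```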


πŸ“ Repository Layout

GORGGLES2/
├── 📄 README.md                          # You are here
├── 📄 ARCHITECTURE.md                    # System design details
├── 📄 DEPLOYMENT_CHECKLIST.md            # Pre-deployment validation
├── 📄 TODO.md                            # Roadmap
│
├── 📂 sagemaker/                         # SageMaker model deployment
│   ├── inference_lipcoordnet.py          # Custom inference handler
│   ├── requirements_lipcoordnet.txt      # Model dependencies
│   └── container/                        # Docker container (optional)
│
├── 📂 infra/terraform/                   # Infrastructure as Code
│   ├── main.tf                           # Core resources
│   ├── security_groups.tf                # VPC networking
│   ├── lambda_layers.tf                  # Lambda layers config
│   └── variables.tf                      # Configuration variables
│
├── 📂 lambdas/                           # Lambda functions
│   ├── extract_media/                    # FFmpeg extraction
│   ├── invoke_lipreading/                # SageMaker caller
│   ├── fuse_results/                     # Result merger
│   ├── get_results/                      # API handler
│   ├── s3_trigger/                       # Pipeline starter
│   ├── start_transcribe/                 # AWS Transcribe
│   └── start_rekognition/                # Face detection
│
├── 📂 scripts/                           # Deployment automation
│   ├── build_lipcoordnet_artifact.py     # Build model.tar.gz
│   ├── deploy_lipcoordnet.py             # Deploy to SageMaker
│   ├── test_lipcoordnet_endpoint.py      # Test endpoint
│   └── package_lambda_layers.ps1         # Layer packager
│
└── 📂 web/                               # Frontend
    ├── index.html                        # Upload + viewer UI
    └── README.md                         # Frontend guide

πŸ› οΈ Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | HTML5, CSS3, Vanilla JavaScript |
| API | AWS API Gateway (HTTP API) |
| Compute | AWS Lambda (Python 3.11), SageMaker Serverless |
| ML Models | LipCoordNet, AWS Transcribe, AWS Rekognition |
| Storage | Amazon S3, DynamoDB |
| Orchestration | AWS Step Functions |
| IaC | Terraform |
| GPU | SageMaker ml.g5.xlarge (NVIDIA A10G) |
| ML Stack | PyTorch, Transformers, dlib, OpenCV |

💰 Cost Estimate

Monthly cost for moderate usage (~50 videos/month, 5 min avg):

| Service | Usage | Cost |
| --- | --- | --- |
| SageMaker Serverless | 50 invocations, ~5s each | ~$2-5 |
| Lambda | 50 executions | ~$1 |
| S3 | 100 GB storage | ~$2 |
| Transcribe | 4 hours audio | ~$10 |
| Rekognition | 2 hours video | ~$6 |
| Step Functions | 50 executions | ~$0.10 |
| API Gateway | 1000 requests | ~$0.01 |
| DynamoDB | On-demand | ~$0.50 |
| **Total** | | **~$21-24/month** |

Cost optimization tips:

  • Use SageMaker Serverless (pay per inference, no idle costs)
  • Set S3 lifecycle policies (delete old files after 30 days)
  • Use Step Functions Express workflows for cheaper executions
  • Batch multiple videos to reduce cold starts
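The 30-day lifecycle tip can be applied with boto3; the rule ID and the uploads/ prefix below are illustrative assumptions about the bucket layout:

```python
def expire_after_30_days(prefix="uploads/"):
    """Lifecycle rule matching the tip above: delete objects 30 days
    after upload. The rule ID and prefix are illustrative assumptions."""
    return {
        "ID": "expire-old-videos",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Expiration": {"Days": 30},
    }

def apply_lifecycle(bucket="gorggle-dev-uploads"):
    """Attach the rule to the upload bucket."""
    import boto3  # deferred so the rule builder has no AWS dependency
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [expire_after_30_days()]},
    )
```

The same rule can equally be declared in Terraform alongside the bucket resource.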

πŸ” Security

  • ✅ IAM roles with least-privilege policies
  • ✅ VPC security groups for network isolation
  • ✅ Lambda VPC integration for EC2 access
  • ✅ S3 encryption at rest (SSE-S3)
  • ✅ API Gateway with CORS configuration
  • ✅ SSH key-based EC2 access only

Best Practices:

  • Never commit AWS credentials to Git
  • Use AWS Secrets Manager for sensitive data
  • Enable CloudTrail for audit logging
  • Rotate SSH keys regularly
  • Monitor with CloudWatch alarms

📊 Performance

| Metric | Value | Notes |
| --- | --- | --- |
| Inference Speed | ~3-5s per video | On SageMaker Serverless with LipCoordNet |
| Cold Start | ~30-60s | First invocation after idle period |
| End-to-End Latency | 2-5 minutes | For a 5-minute video |
| Accuracy (Audio) | 95%+ word accuracy | AWS Transcribe standard |
| Accuracy (Visual) | ~40% WER | LipCoordNet on GRID corpus |
| Throughput | Parallel | Multiple videos processed simultaneously |

🧪 Testing

Test LipCoordNet Endpoint

# Test with S3 video
python scripts/test_lipcoordnet_endpoint.py `
  --endpoint-name gorggle-lipcoordnet-dev `
  --video-bucket gorggle-dev-uploads `
  --video-key test-video.mov

Integration Tests

# Upload test video
aws s3 cp test-video.mp4 s3://gorggle-dev-uploads/uploads/test-001.mp4

# Monitor Step Functions
aws stepfunctions list-executions `
  --state-machine-arn arn:aws:states:us-east-1:ACCOUNT:stateMachine:gorggle-dev-pipeline

# Fetch results
curl https://your-api-id.execute-api.us-east-1.amazonaws.com/results/test-001
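The curl call above returns whatever is available at that moment; a small Python sketch that polls the results route with backoff until the job completes (the response's status field and the 404-while-pending behaviour are assumptions about the get_results handler, not a documented contract):

```python
import json
import time
import urllib.error
import urllib.request

def backoff(base=5.0, tries=4):
    """Doubling retry schedule: 5, 10, 20, 40 seconds."""
    return [base * (2 ** i) for i in range(tries)]

def poll_results(api_base, job_id):
    """Poll GET /results/{jobId} until the pipeline reports completion."""
    for delay in backoff():
        try:
            with urllib.request.urlopen(f"{api_base}/results/{job_id}") as resp:
                body = json.loads(resp.read())
                if body.get("status") == "COMPLETED":
                    return body
        except urllib.error.HTTPError as err:
            if err.code != 404:  # 404 here is assumed to mean "not ready yet"
                raise
        time.sleep(delay)
    return None  # still processing after all retries
```

Usage: `poll_results("https://your-api-id.execute-api.us-east-1.amazonaws.com", "test-001")`.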

🤝 Contributing

Contributions welcome! Areas for improvement:

  • Fine-grained word-level timestamps
  • Multi-language support
  • Real-time streaming processing
  • Mobile app integration
  • SRT/VTT export
  • Speaker name assignment
  • Batch processing API

📜 License

MIT License - see LICENSE for details.


πŸ™ Acknowledgments


πŸ“ž Support & Contact


Made with ❤️ for accessible AI-powered video transcription

🎬 Start processing videos now! Follow the Quick Start guide above.


🔧 Advanced Configuration

SageMaker Deployment Options

Serverless Inference (Recommended):

  • Pay only for inference time
  • Auto-scales from 0 to thousands of concurrent requests
  • 30-60s cold start latency
  • Ideal for sporadic workloads

Real-time Endpoint:

  • Always-on, no cold starts
  • Higher cost (~$1.41/hour for ml.g5.xlarge)
  • Use for high-throughput production workloads

Lambda Configuration

The invoke_lipreading Lambda requires the SageMaker endpoint name as an environment variable:

SAGEMAKER_ENDPOINT=gorggle-lipcoordnet-dev

Update this in lambdas/invoke_lipreading/handler.py or set via Terraform.
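A rough sketch of how such a handler can read the variable and call the endpoint; the payload field names are an assumed contract with the inference handler, not the actual lambdas/invoke_lipreading code:

```python
import json
import os

def build_payload(bucket, key):
    """Request body sent to the endpoint. The field names are an assumed
    contract with the inference handler, not the actual Lambda code."""
    return json.dumps({"video_bucket": bucket, "video_key": key})

def handler(event, context):
    """Rough shape of lambdas/invoke_lipreading: read the endpoint name
    from the environment and forward the video's S3 location."""
    import boto3  # deferred so build_payload stays testable without AWS
    runtime = boto3.client("sagemaker-runtime")
    resp = runtime.invoke_endpoint(
        EndpointName=os.environ["SAGEMAKER_ENDPOINT"],
        ContentType="application/json",
        Body=build_payload(event["video_bucket"], event["video_key"]),
    )
    return json.loads(resp["Body"].read())
```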

Video Preprocessing

LipCoordNet requires specific preprocessing:

  • Frame rate: 25 FPS
  • Mouth ROI: 128×64 pixels
  • Face detection: Uses dlib 68-point landmarks
  • Crop region: Mouth landmarks (points 48-67)

The SageMaker inference handler automatically performs these steps.
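To illustrate the crop step, a small sketch that computes a padded bounding box from the 20 mouth landmarks (points 48-67); the 10% padding is an illustrative choice, and in the real handler dlib supplies the landmarks and OpenCV resizes the crop to 128×64:

```python
MOUTH = slice(48, 68)         # dlib's 68-point model: mouth is points 48-67
TARGET_W, TARGET_H = 128, 64  # ROI size LipCoordNet expects

def mouth_box(landmarks, pad=0.1):
    """Padded bounding box around the mouth. `landmarks` is a list of 68
    (x, y) points as returned by a dlib shape predictor; the padding
    fraction is an illustrative choice. The real handler crops this box
    with OpenCV and resizes it to TARGET_W x TARGET_H."""
    xs = [p[0] for p in landmarks[MOUTH]]
    ys = [p[1] for p in landmarks[MOUTH]]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    return (min(xs) - pad * w, min(ys) - pad * h,
            max(xs) + pad * w, max(ys) + pad * h)
```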


🚀 Deployment Best Practices

  1. Use Terraform for reproducible infrastructure
  2. Tag resources with project and environment labels
  3. Enable CloudWatch logging for all Lambda functions
  4. Set S3 lifecycle rules to auto-delete old videos
  5. Use IAM roles with least-privilege policies
  6. Monitor costs with AWS Cost Explorer and budgets
  7. Test with small videos first (~30 seconds)
  8. Enable X-Ray tracing for debugging Step Functions

πŸ“ Troubleshooting

SageMaker Endpoint Issues

Problem: Endpoint deployment fails
Solution: Check CloudWatch logs at /aws/sagemaker/Endpoints/gorggle-lipcoordnet-dev

Problem: Cold start timeout
Solution: Increase Lambda timeout to 300s or use async invocation

Problem: Out of memory errors
Solution: Increase SageMaker instance memory (use ml.g5.2xlarge)

Lambda Function Issues

Problem: FFmpeg not found in extract_media
Solution: Add FFmpeg Lambda layer ARN to Terraform configuration

Problem: Module import errors
Solution: Package dependencies with pip install -r requirements.txt -t .

Problem: VPC timeout errors
Solution: Ensure Lambda has VPC access and security groups allow outbound traffic

Video Processing Issues

Problem: No face detected
Solution: Ensure person faces camera frontally, adequate lighting

Problem: Poor lip reading accuracy
Solution: LipCoordNet works best with clear frontal face views and minimal motion blur

Problem: Transcribe fails
Solution: Ensure video has audio track and is in supported format (MP4, MOV)


🔄 Migration Notes

This project previously used AV-HuBERT on EC2. Current version uses LipCoordNet on SageMaker Serverless for better cost-efficiency and scalability.

Key changes:

  • ✅ Replaced EC2 GPU instance with SageMaker Serverless
  • ✅ Switched from AV-HuBERT to LipCoordNet (HuggingFace)
  • ✅ Eliminated infrastructure management overhead
  • ✅ Reduced costs by 85% ($150/mo vs $1,014/mo)
  • ✅ Faster deployment (2-3 min vs 7-15 min)

Old EC2/AV-HuBERT code is available in git history if needed.
