MoveMean: Natural Language Video Understanding Platform
Inspiration
We wanted to explore how machines can understand the meaning of videos, not just by detecting scenes or objects, but by truly interpreting the context and conversation within them. Our goal was to make it possible to talk with videos: to ask questions, get insights, and navigate directly to the moments that matter.
What It Does
MoveMean lets users upload a video and interact with it using natural language.
You can ask:
"Where does the speaker talk about AI ethics?"
"Summarize the discussion about machine learning."
The platform responds instantly, showing relevant timestamps, summaries, or answers, just like chatting with the video itself.
How We Built It
We designed a serverless architecture to keep things scalable and cost-effective:
Authentication & API Layer
- Users authenticate through Amazon Cognito.
- Requests are routed securely via API Gateway; a sketch of the token flow follows this list.
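As a rough illustration of that flow (not our actual client code), a caller might obtain a Cognito token and attach it to an API Gateway request like this; the client ID, region, and `/query` route are assumptions:

```python
import urllib.parse
import urllib.request

import boto3

cognito = boto3.client("cognito-idp", region_name="us-east-1")  # region assumed

def get_id_token(username: str, password: str) -> str:
    """Sign in against the Cognito user pool and return an ID token."""
    resp = cognito.initiate_auth(
        ClientId="YOUR_APP_CLIENT_ID",  # placeholder app client ID
        AuthFlow="USER_PASSWORD_AUTH",  # assumes this flow is enabled on the client
        AuthParameters={"USERNAME": username, "PASSWORD": password},
    )
    return resp["AuthenticationResult"]["IdToken"]

def ask(api_url: str, token: str, question: str) -> bytes:
    """Call the API Gateway endpoint with the Cognito ID token attached."""
    req = urllib.request.Request(
        f"{api_url}/query?q={urllib.parse.quote(question)}",  # hypothetical route
        headers={"Authorization": token},  # checked by the Cognito authorizer
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```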
Video Upload & Processing
- When a video is uploaded, it triggers an AWS Lambda function (sketched below) that:
  - Uses Amazon Transcribe for speech-to-text.
  - Sends results to S3 for storage.
  - Communicates asynchronously via SQS to manage load.
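A minimal sketch of that trigger, assuming the S3 bucket notification invokes the Lambda directly and that the output bucket and queue URL arrive via environment variables (both names are illustrative):

```python
import json
import os
import urllib.parse

import boto3

transcribe = boto3.client("transcribe")
sqs = boto3.client("sqs")

def handler(event, context):
    """Triggered by an S3 ObjectCreated event for each uploaded video."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys arrive URL-encoded.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        job_name = key.replace("/", "-")

        # Kick off speech-to-text; Transcribe writes its JSON output to S3.
        transcribe.start_transcription_job(
            TranscriptionJobName=job_name,
            Media={"MediaFileUri": f"s3://{bucket}/{key}"},
            MediaFormat="mp4",  # assumption: uploads are MP4
            LanguageCode="en-US",
            OutputBucketName=os.environ["TRANSCRIPTS_BUCKET"],  # illustrative
        )

        # Hand the job off to the analysis stage without blocking the upload path.
        sqs.send_message(
            QueueUrl=os.environ["JOBS_QUEUE_URL"],  # illustrative
            MessageBody=json.dumps({"job": job_name, "video": f"s3://{bucket}/{key}"}),
        )
```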
Analysis & Knowledge Storage
- Another Lambda controller processes both motion and transcript data.
- It stores structured insights into an Amazon Bedrock Knowledge Base.
- DynamoDB tracks job progress to prevent redundant executions; a sketch of this idempotency guard follows.
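The redundancy guard can be as simple as a conditional write; a minimal sketch, assuming a table keyed on `job_id` (table and attribute names are placeholders):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("movemean-jobs")  # illustrative name

def claim_job(job_id: str) -> bool:
    """Return True if this invocation claimed the job, False if it already ran."""
    try:
        table.put_item(
            Item={"job_id": job_id, "status": "PROCESSING"},
            # The write fails if another invocation recorded the job first,
            # which is what prevents redundant executions.
            ConditionExpression="attribute_not_exists(job_id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```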
Retrieval & Interaction
- When a user asks a question, the system (sketched below):
  - Searches the Bedrock knowledge base using semantic search.
  - Runs LLM inference (through Amazon Bedrock) on the retrieved chunks.
  - Returns the most relevant segments or responses.
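A minimal sketch of that round trip using Bedrock's retrieve-and-generate API; the knowledge base ID and model ARN are placeholders for whatever the deployment actually uses:

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def answer(question: str) -> dict:
    """Semantic retrieval over the knowledge base plus LLM generation in one call."""
    resp = agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "KB_ID_PLACEHOLDER",  # placeholder
                # Placeholder ARN; any Bedrock text model enabled in the account.
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
            },
        },
    )
    # The citations carry the retrieved chunks, which is where the matching
    # transcript segments (and therefore timestamps) come from.
    return {"text": resp["output"]["text"], "citations": resp["citations"]}
```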
Challenges We Ran Into
- Managing asynchronous processing across multiple AWS services.
- Handling large video files efficiently while staying within serverless limits.
- Fine-tuning the retrieval pipeline to align transcription text with video timestamps accurately; see the alignment sketch after this list.
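The alignment hinges on the fact that Amazon Transcribe emits word-level start_time/end_time values in its output JSON. A sketch of rolling those up into timestamped segments (the fixed-size word windows are an illustrative simplification, not our exact chunking):

```python
import json

def timestamped_segments(transcript_json: str, words_per_segment: int = 50):
    """Yield (start_seconds, end_seconds, text) tuples from a Transcribe result."""
    items = json.loads(transcript_json)["results"]["items"]
    # Only "pronunciation" items carry start_time/end_time; punctuation does not.
    words = [i for i in items if i["type"] == "pronunciation"]
    for n in range(0, len(words), words_per_segment):
        chunk = words[n:n + words_per_segment]
        yield (
            float(chunk[0]["start_time"]),
            float(chunk[-1]["end_time"]),
            " ".join(w["alternatives"][0]["content"] for w in chunk),
        )
```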
Accomplishments We're Proud Of
- Built a fully serverless, event-driven video understanding pipeline.
- Integrated speech-to-text, semantic search, and LLM reasoning in one workflow.
- Achieved real-time interaction with videos using natural language.
What We Learned
- How to orchestrate multiple AWS components (S3, Lambda, Bedrock, SQS, Cognito, DynamoDB) into a cohesive data pipeline.
- Deepened our understanding of retrieval-augmented generation (RAG) and video-based AI workflows.
- Learned how to handle asynchronous event-driven architectures for media intelligence use cases.
What's Next for MoveMean
- Add multi-language transcription and translation support.
- Integrate visual understanding models (e.g., scene detection, facial recognition).
- Enable real-time video Q&A during live streams.
- Develop a frontend dashboard for interactive video insights and summaries.
Tech Stack
- AWS Lambda – Serverless compute
- Amazon S3 – Video and transcript storage
- Amazon Transcribe – Speech-to-text
- Amazon Bedrock – Knowledge base and LLM inference
- Amazon SQS – Asynchronous communication
- Amazon DynamoDB – Job tracking
- Amazon Cognito – Authentication
- API Gateway – Request management
Built With
- amazon-dynamodb
- apigateway
- lambda
- nextjs
- s3
- sqs
- terraform
- transcribe