🎥 MoveMean – Natural Language Video Understanding Platform

🌟 Inspiration

We wanted to explore how machines can understand the meaning of videos, not just by detecting scenes or objects, but by truly interpreting the context and conversation within them. Our goal was to make it possible to talk with videos: to ask questions, get insights, and navigate directly to the moments that matter.


💡 What It Does

MoveMean lets users upload a video and interact with it using natural language.
You can ask:

“Where does the speaker talk about AI ethics?”
“Summarize the discussion about machine learning.”

The platform responds instantly, showing relevant timestamps, summaries, or answers, just like chatting with the video itself.


βš™οΈ How We Built It

We designed a serverless architecture to keep things scalable and cost-effective:

  1. Authentication & API Layer

    • Users authenticate through Amazon Cognito.
    • Requests are routed securely via API Gateway.
  2. Video Upload & Processing

    • When a video is uploaded, the S3 event triggers an AWS Lambda function that:
      • Uses Amazon Transcribe for speech-to-text.
      • Stores the results in S3.
      • Communicates asynchronously via SQS to manage load (see the ingest sketch after this list).
  3. Analysis & Knowledge Storage

    • Another Lambda controller processes both the motion and transcript data.
    • It stores structured insights in an Amazon Bedrock Knowledge Base.
    • DynamoDB tracks job progress to prevent redundant executions.
  4. Retrieval & Interaction

    • When a user asks a question, the system:
      • Searches the Bedrock Knowledge Base using semantic search.
      • Runs LLM inference (through Amazon Bedrock) on the retrieved chunks.
      • Returns the most relevant segments or responses (see the retrieval sketch after this list).
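
To make this concrete, here is a minimal sketch of what the upload-triggered Lambda (steps 2–3) could look like. The environment variable names (JOBS_TABLE, TRANSCRIPTS_BUCKET, ANALYSIS_QUEUE_URL), the job-ID scheme, and the message payload are illustrative assumptions, not our exact code.

```python
# Ingest sketch: S3 upload event -> Transcribe job -> DynamoDB job record -> SQS hand-off.
# All resource names come from (hypothetical) environment variables.
import json
import os
import urllib.parse

import boto3
from botocore.exceptions import ClientError

transcribe = boto3.client("transcribe")
dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")


def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])
    job_id = key.rsplit("/", 1)[-1].rsplit(".", 1)[0]  # derive a job ID from the file name

    # Conditional write in DynamoDB prevents redundant executions for the same video.
    try:
        dynamodb.Table(os.environ["JOBS_TABLE"]).put_item(
            Item={"job_id": job_id, "status": "TRANSCRIBING"},
            ConditionExpression="attribute_not_exists(job_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return {"skipped": job_id}  # another invocation already owns this job
        raise

    # Speech-to-text: Transcribe writes its JSON result to the transcripts bucket.
    transcribe.start_transcription_job(
        TranscriptionJobName=job_id,
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
        MediaFormat="mp4",
        LanguageCode="en-US",
        OutputBucketName=os.environ["TRANSCRIPTS_BUCKET"],
    )

    # Asynchronous hand-off to the analysis stage via SQS.
    sqs.send_message(
        QueueUrl=os.environ["ANALYSIS_QUEUE_URL"],
        MessageBody=json.dumps({"job_id": job_id, "video": f"s3://{bucket}/{key}"}),
    )
    return {"started": job_id}
```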
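
The retrieval side (step 4) can be sketched with Bedrock's RetrieveAndGenerate API, which runs the semantic search over the knowledge base and the LLM inference in a single call. The knowledge base ID, the example model ARN, and the event shape below are placeholders.

```python
# Retrieval sketch: question -> semantic search over the Bedrock Knowledge Base -> grounded answer.
import os

import boto3

bedrock_agent = boto3.client("bedrock-agent-runtime")


def handler(event, context):
    question = event["question"]

    response = bedrock_agent.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": os.environ["KNOWLEDGE_BASE_ID"],
                # Any Bedrock text model works here; Claude is just an example.
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
            },
        },
    )

    # Return the generated answer plus the transcript chunks it was grounded in,
    # so the client can map them back to segments of the video.
    sources = [
        ref["content"]["text"]
        for citation in response.get("citations", [])
        for ref in citation.get("retrievedReferences", [])
    ]
    return {"answer": response["output"]["text"], "sources": sources}
```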

🧩 Challenges We Ran Into

  • Managing asynchronous processing across multiple AWS services.
  • Handling large video files efficiently while staying within serverless limits.
  • Fine-tuning the retrieval pipeline so that transcription text aligns accurately with video timestamps (see the chunking sketch after this list).
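
One way to tackle that alignment is to keep Amazon Transcribe's word-level timings and fold the words into fixed-length, timestamped chunks before indexing them. A rough sketch of the idea; the 30-second window and the function name are arbitrary choices for illustration.

```python
# Chunking sketch: Transcribe word-level output -> timestamped text chunks for indexing.
import json


def chunk_transcript(transcribe_json: str, window_seconds: float = 30.0):
    """Group words from a Transcribe result JSON into fixed-length, timestamped chunks."""
    items = json.loads(transcribe_json)["results"]["items"]
    chunks, words = [], []
    chunk_start = last_end = None

    for item in items:
        content = item["alternatives"][0]["content"]
        if item["type"] != "pronunciation":
            if words:                      # glue punctuation onto the previous word
                words[-1] += content
            continue
        start, last_end = float(item["start_time"]), float(item["end_time"])
        if chunk_start is None:
            chunk_start = start
        words.append(content)
        if last_end - chunk_start >= window_seconds:
            chunks.append({"start": chunk_start, "end": last_end, "text": " ".join(words)})
            words, chunk_start = [], None

    if words:                              # flush the final partial chunk
        chunks.append({"start": chunk_start, "end": last_end, "text": " ".join(words)})
    return chunks
```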

πŸ† Accomplishments We’re Proud Of

  • Built a fully serverless, event-driven video understanding pipeline.
  • Integrated speech-to-text, semantic search, and LLM reasoning in one workflow.
  • Achieved real-time interaction with videos using natural language.

📚 What We Learned

  • How to orchestrate multiple AWS components (S3, Lambda, Bedrock, SQS, Cognito, DynamoDB) into a cohesive data pipeline.
  • Deepened our understanding of retrieval-augmented generation (RAG) and video-based AI workflows.
  • Learned how to handle asynchronous event-driven architectures for media intelligence use cases.

🚀 What's Next for MoveMean

  • Add multi-language transcription and translation support.
  • Integrate visual understanding models (e.g., scene detection, facial recognition).
  • Enable real-time video Q&A during live streams.
  • Develop a frontend dashboard for interactive video insights and summaries.

🧠 Tech Stack

  • AWS Lambda – Serverless compute
  • Amazon S3 – Video and transcript storage
  • Amazon Transcribe – Speech-to-text
  • Amazon Bedrock – Knowledge base and LLM inference
  • Amazon SQS – Asynchronous communication
  • Amazon DynamoDB – Job tracking
  • Amazon Cognito – Authentication
  • API Gateway – Request management
