A command-line multimodal AI video analysis tool that extracts frames, transcribes audio, and performs intelligent analysis using Llama 4 models. Features a two-step workflow: analyze videos then chat interactively about the results.
- 🖼️ Frame Extraction: Extract video frames at custom intervals using OpenCV
- 🎵 Audio Transcription: Transcribe video audio using OpenAI Whisper
- 🤖 Multimodal AI Analysis: Analyze both visual and audio content with Llama 4
- 💬 Interactive Chat: Natural language querying of analysis results
- 📊 Multiple Analysis Modes: Comprehensive, overview, frames-only, or transcript-only
- 🔐 Secure API Management: Environment-based API key configuration
- 📄 Dual Output: Human-readable text + machine-readable JSON results
- ⚡ CLI Interface: Simple command-line tools with flexible options
Prerequisites:

- Python 3.8+
- FFmpeg (required for Whisper audio processing):
```bash
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg

# Windows: Download from https://ffmpeg.org/
```
Install dependencies:

```bash
pip install -r requirements.txt
```

Create your environment file:
```bash
cp .env.example .env
```

Edit `.env` and add your Llama API key:

```
LLAMA4_API_KEY=your_api_key_here
```
python -c "
from openai import OpenAI
import os
from dotenv import load_dotenv
load_dotenv()
client = OpenAI(
api_key=os.getenv('LLAMA4_API_KEY'),
base_url='https://api.llama.com/compat/v1/'
)
response = client.chat.completions.create(
model='Llama-4-Maverick-17B-128E-Instruct-FP8',
messages=[{'role': 'user', 'content': 'Hello!'}]
)
print('β
API connection successful!')
"Try the interactive chat with real example data:
```bash
# Test the interactive chat interface immediately
python interactive_video_chat.py examples/videoNetworking_llama_analysis.json
```

This uses a pre-processed networking conversation analysis, so you can:

- ✅ Test the chat interface without API setup
- ✅ See sample questions and responses
- ✅ Understand output format before processing your own videos
- ✅ Demo the system to others instantly
Example session:

```
$ python interactive_video_chat.py examples/videoNetworking_llama_analysis.json

🎬 Video Analysis Chat - Ask me anything about the video!
Commands: 'quit', 'exit', 'clear', 'context', 'help'
============================================================

💬 You: What were the main topics discussed?
🤖 Llama: [Response based on the networking conversation analysis...]

💬 You: What networking advice would you give?
🤖 Llama: [Insights about the conversation effectiveness...]

💬 You: help
📋 Available Commands:
  - quit/exit/q: End the chat
  - clear: Clear conversation history
  - context: Show video details
  - help: Show this help

💡 Example Questions:
  - "What were the main topics discussed?"
  - "How did the participants' body language change?"
  - "What networking advice would you give?"
- "Summarize the key insights"examples/videoNetworking_llama_analysis.json- Complete analysis data for chat interfaceexamples/videoNetworking_llama_analysis.txt- Human-readable analysis resultsexamples/videoNetworking_transcript.txt- Raw transcript for reference
```bash
# Start the demo
python interactive_video_chat.py examples/videoNetworking_llama_analysis.json

# Try asking:
"What were the main topics discussed?"
"How effective was this networking conversation?"
"What follow-up actions were mentioned?"
"What could have been improved?"
"Summarize the key insights from this conversation"
```

Quick transcript analysis:
```bash
python llama_video_analyzer.py data/your_video.MOV --mode transcript_only
```

Visual frame analysis:

```bash
python llama_video_analyzer.py data/your_video.MOV --mode frames_only
```

Complete multimodal analysis:

```bash
python llama_video_analyzer.py data/your_video.MOV --mode comprehensive
```

Fast overview (recommended for demos):

```bash
python llama_video_analyzer.py data/your_video.MOV --mode overview
```

Full command reference:

```
python llama_video_analyzer.py <video_file> [options]

Required:
  video_file           Path to video file (MP4, MOV, AVI, etc.)

Options:
  --interval SECONDS   Frame extraction interval (default: 20)
  --whisper MODEL      Whisper model: tiny, base, small, medium, large (default: base)
  --mode MODE          Analysis mode: comprehensive, frames_only, transcript_only, overview
  --output FILE        Output file prefix
```

Examples:

```bash
# High-quality analysis
python llama_video_analyzer.py meeting.MOV --interval 10 --whisper medium

# Quick demo mode
python llama_video_analyzer.py presentation.MP4 --mode overview --interval 30

# Custom output filename
python llama_video_analyzer.py interview.MOV --output job_interview_analysis

# Transcript only for fast text analysis
python llama_video_analyzer.py call.MOV --mode transcript_only --whisper large
```

| Mode | Speed | API Calls | Use Case |
|---|---|---|---|
| transcript_only | ⚡ Fast | 1 | Text analysis, quick insights |
| overview | 🚀 Medium | 1 | Demo-ready multimodal analysis |
| frames_only | ⏱️ Medium | N frames | Visual-focused analysis |
| comprehensive | 🔍 Detailed | N+2 calls | Complete research analysis |

For comprehensive mode, N+2 means one call per extracted frame, plus one combined multimodal pass and one transcript-only pass (matching the `individual_frames`, `comprehensive`, and `transcript_only` fields in the JSON output below).
Each analysis generates two files:
- `filename_llama_analysis.txt` - Human-readable results
- `filename_llama_analysis.json` - Machine-readable data
```bash
# 1. Analyze networking video
python llama_video_analyzer.py data/networking_call.MOV --mode comprehensive

# 2. View results
cat networking_call_llama_analysis.txt

# 3. Process JSON data
python -c "import json; data=json.load(open('networking_call_llama_analysis.json')); print(f'Frames: {data[\"frames_extracted\"]}, Transcript: {data[\"transcript_length\"]} chars')"
```

Sample JSON output:

```json
{
  "video_path": "data/networking_video.MOV",
  "frames_extracted": 5,
  "transcript_length": 2196,
  "analysis": {
    "individual_frames": [...],
    "comprehensive": "...",
    "transcript_only": "..."
  }
}
```

Processing pipeline:

```
CLI Command → Video Input → [Frame Extractor] → Base64 Images
                                   ↓
                          [Whisper] → Transcript
                                   ↓
                          [Llama 4] → Analysis
                                   ↓
                      [Output] → .txt + .json files
```
Frame extraction:

- Resolution: Auto-resize 1920x1080 → 1280x720
- Format: JPEG with base64 encoding
- Timestamps: Precise frame timing metadata
- Intervals: Configurable extraction frequency
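A minimal sketch of this extraction step, assuming OpenCV is installed (the function name and defaults are illustrative, not the tool's actual internals):

```python
import base64

import cv2  # pip install opencv-python


def extract_frames(video_path, interval=20):
    """Grab one frame every `interval` seconds, downscale, and base64-encode it."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30   # fall back if FPS metadata is missing
    step = max(int(fps * interval), 1)      # frames to skip between captures
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frame = cv2.resize(frame, (1280, 720))   # e.g. 1920x1080 -> 1280x720
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append({
                    "timestamp": index / fps,        # seconds into the video
                    "image_b64": base64.b64encode(buf.tobytes()).decode("ascii"),
                })
        index += 1
    cap.release()
    return frames
```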
Audio transcription:

- Engine: OpenAI Whisper (local processing)
- Models: tiny, base, small, medium, large
- Formats: Supports all major video formats
- Quality: Automatic audio extraction via FFmpeg
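The transcription step, sketched with the openai-whisper package (the wrapper function is illustrative; the model size mirrors the `--whisper` flag):

```python
import whisper  # pip install openai-whisper (needs FFmpeg on your PATH)


def transcribe(video_path, model_size="base"):
    """Run Whisper locally; FFmpeg pulls the audio track out of the video."""
    model = whisper.load_model(model_size)   # tiny / base / small / medium / large
    result = model.transcribe(video_path)    # accepts video files directly
    return result["text"]


print(transcribe("data/your_video.MOV"))
```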
AI analysis:

- Context: Full transcript provided to each frame analysis
- Focus Areas: Networking, meetings, professional communication
- Output: Structured insights on dynamics, body language, effectiveness
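A hedged sketch of the analysis call itself, sending one frame plus the full transcript to Llama 4 through the OpenAI-compatible endpoint (the prompt wording and helper name are assumptions):

```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(
    api_key=os.getenv("LLAMA4_API_KEY"),
    base_url="https://api.llama.com/compat/v1/",
)


def analyze_frame(image_b64, timestamp, transcript):
    """Ask Llama 4 about one frame, passing the transcript as context."""
    response = client.chat.completions.create(
        model="Llama-4-Maverick-17B-128E-Instruct-FP8",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Transcript:\n{transcript}\n\n"
                         f"Describe the dynamics and body language at {timestamp:.0f}s."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```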
401 Authentication Error:

```bash
# Check API key is loaded
python -c "import os; from dotenv import load_dotenv; load_dotenv(); print('Key loaded:', bool(os.getenv('LLAMA4_API_KEY')))"
```

500 Inference Error (too many frames):

```bash
# Use fewer frames
python llama_video_analyzer.py video.MOV --mode overview --interval 60
```

FFmpeg Not Found:

```bash
# Install FFmpeg first
brew install ffmpeg      # macOS
sudo apt install ffmpeg  # Linux
```

File Not Found:

```bash
# Check video file path
ls -la data/your_video.MOV
```

Permission Issues:

```bash
# Make script executable
chmod +x llama_video_analyzer.py
```

Recommended two-step workflow:

```bash
# Step 1: Quick transcript check
python llama_video_analyzer.py data/meeting.MOV --mode transcript_only

# Step 2: If transcript looks good, run full analysis
python llama_video_analyzer.py data/meeting.MOV --mode comprehensive --whisper medium
```

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments:
- OpenAI Whisper for speech-to-text capabilities
- Llama 4 for multimodal AI analysis
- OpenCV for video frame processing
Pure CLI power for video analysis! 🚀