feat(tools): add video analysis via ffmpeg frame extraction + vision LLM by Add1ct1ve · Pull Request #2294 · NousResearch/hermes-agent

Add1ct1ve · 2026-03-21T11:28:46Z

What does this PR do?

Adds end-to-end video analysis support for the Hermes agent. When a user sends a video on Telegram, it is downloaded, cached locally, and the agent is given a context note with the file path and instructions to use the new video_analyze tool. The tool extracts frames using ffmpeg, encodes them as base64, and sends them to the configured vision LLM for multi-frame analysis.

This follows the same pattern as the existing image analysis pipeline (download → cache → enrich → tool), but intentionally skips auto-analysis since processing video frames is more expensive than a single image.
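The encode-and-send step described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper name `build_vision_content` is hypothetical, the OpenAI-style `image_url` content-part layout is an assumption about what the configured vision provider accepts, and the provider call itself is left out.

```python
import base64
from pathlib import Path


def build_vision_content(frame_paths: list[str], user_prompt: str) -> list[dict]:
    """Build one multi-image message body: the prompt followed by every frame.

    Hypothetical helper. The content-part shape follows the common
    OpenAI-style chat format and may differ from what Hermes actually sends.
    """
    content: list[dict] = [{"type": "text", "text": user_prompt}]
    for path in frame_paths:
        # Each JPEG frame is inlined as a base64 data URL.
        encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
        })
    return content
```

Because every frame is inlined, request size grows linearly with the frame count, which is one reason to skip auto-analysis and cap the number of frames.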

Type of Change

✨ New feature (non-breaking change that adds functionality)

Changes Made

gateway/platforms/base.py — Added VIDEO_CACHE_DIR, get_video_cache_dir(), cache_video_from_bytes(), cleanup_video_cache() following the existing image/audio/document cache pattern
gateway/platforms/telegram.py — Added elif msg.video: download block with 20MB Telegram Bot API size check, MIME-to-extension mapping, and cache write
tools/video_analysis_tool.py (new) — Frame extraction via ffmpeg (fps=1/<interval>, scaled to 768px, JPEG quality 2), tiered interval strategy (<1min: 1s, 1-5min: 5s, 5min+: 10s, hard cap 30 frames), async vision LLM call, tool registration
model_tools.py — Added "tools.video_analysis_tool" to _discover_tools()
toolsets.py — Added "video_analyze" to _HERMES_CORE_TOOLS, "vision" toolset, and "hermes-acp" preset
gateway/run.py — Added video context enrichment block (tells agent a video is available + how to analyze it), added cleanup_video_cache() to hourly cron ticker
tests/test_video_analysis_tool.py (new) — 29 unit tests covering frame interval calculation, ffmpeg detection, async tool flow, handler/schema/registry integration, and requirements checks
website/docs/reference/tools-reference.md — Added video_analyze to vision toolset documentation, fixed pre-existing duplicate sections
website/docs/reference/toolsets-reference.md — Added video_analyze to vision, hermes-acp, and hermes-cli toolset listings
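
For readers who have not seen the existing cache helpers, the video cache additions could look roughly like the sketch below. The function names come from the list above, but the bodies are assumptions: the `cache_dir` and `max_age_hours` parameters, the uuid-based file name, and the mtime-based expiry are illustrative choices, not necessarily what `gateway/platforms/base.py` does.

```python
import time
import uuid
from pathlib import Path

# Location taken from the PR description; the real constant may be built differently.
VIDEO_CACHE_DIR = Path.home() / ".hermes" / "video_cache"


def cache_video_from_bytes(data: bytes, ext: str = ".mp4",
                           cache_dir: Path = VIDEO_CACHE_DIR) -> str:
    """Write raw video bytes to a uniquely named file and return its path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"video_{uuid.uuid4().hex[:12]}{ext}"
    path.write_bytes(data)
    return str(path)


def cleanup_video_cache(max_age_hours: float = 24,
                        cache_dir: Path = VIDEO_CACHE_DIR) -> int:
    """Delete cached files older than max_age_hours and return how many were removed."""
    if not cache_dir.exists():
        return 0
    cutoff = time.time() - max_age_hours * 3600
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed += 1
    return removed
```

Passing the directory as a parameter keeps the sketch easy to test against a temporary folder.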

How to Test

Install ffmpeg (winget install ffmpeg / brew install ffmpeg / apt install ffmpeg)
Start the gateway with Telegram configured
Send a short video (<20MB) to the bot on Telegram
Verify it downloads to ~/.hermes/video_cache/
The agent should receive a context note and invoke video_analyze if the user's caption asks about the video
Verify frames are extracted, vision LLM is called, and analysis is returned

Edge cases:

Send a video >20MB → should get a size limit message, no download attempted
Remove ffmpeg from PATH → video_analyze tool should not appear in the agent's toolset (check_fn returns False)
Send a video with no caption → agent receives context note and can decide whether to analyze

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: Windows 11 Pro

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Videos sent on Telegram are now downloaded to ~/.hermes/video_cache/, and the agent receives a context note pointing it to the video_analyze tool. The tool extracts frames with ffmpeg (tiered interval, max 30 frames, scaled to 768px), encodes them as base64, and sends them to the configured vision model for multi-frame analysis. Gated on ffmpeg + ffprobe + vision provider availability — the tool silently hides itself from the agent's toolset when requirements are not met.

Add1ct1ve · 2026-03-21T11:29:42Z

disclaimer this is fully vibe coded based on a need i had like 4 hours ago

Add 29 unit tests covering frame interval calculation, ffmpeg detection, async tool flow, handler prompt construction, schema validation, registry integration, and requirements checks. Update tools-reference and toolsets-reference docs to include video_analyze in the vision toolset. Fix pre-existing duplicate vision/web sections in tools-reference.

nidhishgajjar · 2026-04-30T17:44:53Z

Orb Code Review (powered by GLM-4.7 on Orb Cloud)

Summary

This PR adds a new video_analyze tool to Hermes that extracts frames from videos using ffmpeg/ffprobe and analyzes them via a centralized vision LLM. The implementation includes: (1) video caching infrastructure in gateway/platforms/base.py, (2) Telegram platform integration for video downloads (with 20MB limit), (3) context enrichment for video messages, (4) automated video cache cleanup, (5) comprehensive test coverage, and (6) tool registration and documentation updates.

Architecture

The PR follows the existing Hermes architecture well:

Video caching infrastructure: Adds VIDEO_CACHE_DIR, get_video_cache_dir(), cache_video_from_bytes(), and cleanup_video_cache() to gateway/platforms/base.py, following the same pattern as image/audio/document caching
Telegram platform integration: Extends TelegramAdapter._handle_attachment() to download videos from Telegram Bot API, respecting the 20MB limit and detecting MIME types for proper file extensions
Context enrichment: Modifies GatewayRunner._build_message_context() to add a note for video messages, guiding the agent to use the video_analyze tool with the cached video path
Automated cleanup: Integrates video cache cleanup into the cron ticker, running every hour alongside image/document cache cleanup
Tool implementation: Creates tools/video_analysis_tool.py with well-structured helpers (_has_ffmpeg(), _has_ffprobe(), _get_video_duration(), _calculate_frame_interval(), _extract_frames()) and main video_analyze_tool() function
Comprehensive tests: Adds tests/test_video_analysis_tool.py with extensive coverage including frame interval calculation, ffmpeg detection, tool registration, and schema validation
Tool registration: Registers video_analyze in the vision toolset via model_tools.py and updates documentation

Issues

Major Issues

1. Hardcoded Frame Quality May Result in Loss of Important Details

Location: tools/video_analysis_tool.py:90

def _extract_frames(video_path: str, output_dir: str, interval: float) -> list[str]:
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", f"fps=1/{interval},scale=768:-1",
            "-q:v", "2",
            pattern,
        ],
        ...
    )

Issue: The frame extraction hardcodes scale=768:-1 (scale width to 768, preserve aspect ratio) and quality -q:v 2.

Impact:

Loss of detail: 768px width may be insufficient for detecting fine details (e.g., text in videos, small objects, subtle movements)
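Quality setting: -q:v 2 is close to the best value on ffmpeg's JPEG quantizer scale (lower numbers mean higher quality), so compression artifacts are unlikely; the concern is only that the value cannot be changed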
Fixed resolution: Users cannot adjust resolution based on video content or use case (e.g., medical videos might need higher resolution)
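
One way to expose these two knobs safely is to build the argument list in a small validated helper. The sketch below is an assumption-laden illustration: `build_ffmpeg_args` and the `frame_%04d.jpg` output pattern are hypothetical (the PR only shows a `pattern` variable), and the 2 to 31 clamp reflects the commonly cited usable range of `-q:v` for ffmpeg's JPEG encoder.

```python
import os


def build_ffmpeg_args(video_path: str, output_dir: str, interval: float,
                      scale_width: int = 768, quality: int = 2) -> list[str]:
    """Return the ffmpeg command line for sampling one frame every `interval` seconds."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    if scale_width <= 0:
        raise ValueError("scale_width must be positive")
    # For JPEG output a lower -q:v means better quality; clamp to the usual range.
    quality = max(2, min(31, quality))
    # Assumed file naming; the real pattern is defined elsewhere in the tool.
    pattern = os.path.join(output_dir, "frame_%04d.jpg")
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps=1/{interval},scale={scale_width}:-1",
        "-q:v", str(quality),
        pattern,
    ]
```

Keeping the defaults at 768 and 2 would preserve the current behavior for callers that pass nothing.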

Suggested Fix:

Make resolution and quality configurable:

def _extract_frames(
    video_path: str,
    output_dir: str,
    interval: float,
    scale_width: int = 1280,  # Default to HD
    quality: int = 2,  # ffmpeg -q:v for JPEG: lower is better; 2 matches the current setting
) -> list[str]:
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", f"fps=1/{interval},scale={scale_width}:-1",
            "-q:v", str(quality),
            pattern,
        ],
        ...
    )

And update the tool signature:

from typing import Optional

async def video_analyze_tool(
    video_path: str,
    user_prompt: str,
    model: Optional[str] = None,
    scale_width: int = 1280,  # Added
    quality: int = 2,  # Added (lower is better)
) -> str:

2. Timeout Values May Be Insufficient for Long Videos

Location: tools/video_analysis_tool.py:49 and tools/video_analysis_tool.py:95

def _get_video_duration(path: str) -> float:
    result = subprocess.run(
        ["ffprobe", "-v", "error", ...],
        timeout=30,  # Only 30 seconds
    )

def _extract_frames(video_path: str, output_dir: str, interval: float) -> list[str]:
    subprocess.run(
        ["ffmpeg", "-i", video_path, ...],
        timeout=120,  # Only 120 seconds
    )

Issue: The timeout values are fixed and may be insufficient:

ffprobe timeout: 30 seconds may be too short for very large videos (>1GB) or slow storage
ffmpeg timeout: 120 seconds may be insufficient for long videos at high quality or slow systems

Impact:

False errors: Legitimate videos may be reported as timeouts
Incomplete analysis: Users may get partial results without knowing the analysis failed due to timeout
Hard to debug: No way to distinguish between a real timeout and a slow system
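
Telling a timeout apart from an ordinary failure mostly comes down to catching `subprocess.TimeoutExpired` separately. A minimal sketch, assuming a hypothetical wrapper named `run_with_timeout` and an illustrative result-dict shape (neither is part of this PR):

```python
import subprocess


def run_with_timeout(cmd: list[str], timeout: float) -> dict:
    """Run cmd and report timeouts separately from non-zero exit codes."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        return {"ok": False, "reason": "timeout", "timeout_s": timeout}
    if result.returncode != 0:
        return {
            "ok": False,
            "reason": "exit_code",
            "code": result.returncode,
            "stderr": result.stderr[-500:],  # keep only the tail for logs
        }
    return {"ok": True, "stdout": result.stdout}
```

With a result like this, the tool can tell the user "extraction timed out after N seconds" instead of a generic failure.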

Suggested Fix:

Make timeout scale with video duration:

def _get_video_duration(path: str) -> float:
    # Probing only reads container metadata, so a fixed timeout is kept here
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration", "-of", "csv=p=0", path],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return float(result.stdout.strip())

def _extract_frames(video_path: str, output_dir: str, interval: float, estimated_duration: float) -> list[str]:
    # Scale timeout with the expected frame count
    num_frames = int(estimated_duration / interval)
    # Allow 5 seconds per frame plus 60 second overhead, never less than the current 120
    timeout = max(120, num_frames * 5 + 60)

    subprocess.run([...], timeout=timeout)

And update the call to _extract_frames to pass the estimated duration:

# In video_analyze_tool()
frames = await asyncio.to_thread(
    _extract_frames, video_path, tmp_dir, interval, duration,  # Pass duration
)

3. Video Download Limit Is Hardcoded

Location: gateway/platforms/telegram.py:88

# Telegram Bot API limit: 20 MB for file downloads
MAX_VIDEO_BYTES = 20 * 1024 * 1024
if msg.video.file_size and msg.video.file_size > MAX_VIDEO_BYTES:
    ...

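Issue: The 20MB video limit is hardcoded in the Telegram adapter. The figure comes from Telegram's hosted Bot API rather than from Hermes, so raising it only helps deployments that are not bound by that cap. Even so, a hardcoded value may be:

Too restrictive for deployments that run a self-hosted Bot API server, which can download larger files
Inconsistent with other platforms' limits
Not configurable per deployment or use case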

Impact:

User experience: Users with legitimate videos >20MB cannot analyze them
Platform inconsistency: Other platforms (Discord, WhatsApp) may have different limits
Hard to adjust: Requires code changes to change the limit

Suggested Fix:

Make the limit configurable:

# In gateway/platforms/base.py
import os

VIDEO_MAX_BYTES = int(os.getenv("VIDEO_MAX_MB", "20")) * 1024 * 1024

# In telegram.py
from gateway.platforms.base import VIDEO_MAX_BYTES

if msg.video.file_size and msg.video.file_size > VIDEO_MAX_BYTES:
    ...

Or document it as a platform-specific consideration:

# Telegram Bot API limit: 20 MB for file downloads
# Note: This is a Telegram-specific limitation. Other platforms may have
# different limits. Consider adjusting this value based on your deployment.
MAX_VIDEO_BYTES = 20 * 1024 * 1024
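
If the limit does become configurable, the environment value should be parsed defensively so that a typo does not crash the gateway at import time. A minimal sketch, assuming the `VIDEO_MAX_MB` variable suggested above; the helper names `video_limit_bytes` and `is_too_large` are hypothetical:

```python
import os
from typing import Mapping, Optional

DEFAULT_VIDEO_MAX_MB = 20  # the Telegram Bot API download limit cited in this PR


def video_limit_bytes(env: Mapping[str, str] = os.environ) -> int:
    """Read VIDEO_MAX_MB, falling back to the default on missing or invalid values."""
    try:
        megabytes = int(env.get("VIDEO_MAX_MB", ""))
    except ValueError:
        megabytes = DEFAULT_VIDEO_MAX_MB
    if megabytes <= 0:
        megabytes = DEFAULT_VIDEO_MAX_MB
    return megabytes * 1024 * 1024


def is_too_large(file_size: Optional[int], limit_bytes: int) -> bool:
    """Treat a missing file_size as unknown and let the download attempt proceed."""
    return file_size is not None and file_size > limit_bytes
```

Note that a value above 20 would only have an effect where the platform itself allows larger downloads.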

4. Frame Cap May Be Too Restrictive for Complex Videos

Location: tools/video_analysis_tool.py:523

def _calculate_frame_interval(duration: float) -> float:
    MAX_FRAMES = 30  # Hard cap
    
    if duration < 60:
        interval = 1.0
    elif duration < 300:
        interval = 5.0
    else:
        interval = 10.0
    
    # Enforce max-frames cap
    estimated_frames = duration / interval
    if estimated_frames > MAX_FRAMES:
        interval = duration / MAX_FRAMES

    return interval

Issue: The 30-frame cap means:

For a 10-minute video (600s): the 10s tier would give 60 frames, so the cap stretches the interval to 20s and only 30 frames are sampled
For a 30-minute video (1800s): interval = 60s, only 30 frames sampled
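
The effective interval is easy to check by running the tier-plus-cap logic quoted above over a few durations. The sketch below restates that logic as a standalone function (renamed `calculate_frame_interval` and given an explicit return, both for illustration only):

```python
MAX_FRAMES = 30  # hard cap from the PR


def calculate_frame_interval(duration: float, max_frames: int = MAX_FRAMES) -> float:
    """Pick the tier interval, then stretch it if it would exceed max_frames."""
    if duration < 60:
        interval = 1.0
    elif duration < 300:
        interval = 5.0
    else:
        interval = 10.0
    if duration / interval > max_frames:
        interval = duration / max_frames
    return interval


for seconds in (20, 59, 60, 299, 600, 1800):
    interval = calculate_frame_interval(seconds)
    frames = seconds / interval
    print(f"{seconds:>5}s -> one frame every {interval:.2f}s, about {frames:.0f} frames")
```

The cap already binds for any clip longer than 30 seconds in the first tier, so the tier intervals only apply unchanged to fairly short videos within each band.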

This may be insufficient for:

Detecting subtle changes over long periods
Understanding temporal relationships in long videos
Detecting events that occur at specific timestamps not aligned with the sampling interval

Impact:

Limited temporal resolution: Important moments may be missed
Context loss: Long videos may lack sufficient context
Deterministic sampling: Users cannot increase frame count even if they want to pay the computational cost

Suggested Fix:

Make the cap configurable or increase it significantly:

VIDEO_MAX_FRAMES = int(os.getenv("VIDEO_MAX_FRAMES", "100"))  # Default to 100, allow override

def _calculate_frame_interval(duration: float, max_frames: int = VIDEO_MAX_FRAMES) -> float:
    # ... existing logic ...
    
    # Use configured max instead of hardcoded 30
    estimated_frames = duration / interval
    if estimated_frames > max_frames:
        interval = duration / max_frames
    
    return interval

Or provide multiple quality tiers:

def _calculate_frame_interval(duration: float, quality: str = "medium") -> float:
    # quality: "low", "medium", "high", "max"
    quality_settings = {
        "low": {"max_frames": 30, "interval_multiplier": 1.0},
        "medium": {"max_frames": 60, "interval_multiplier": 5.0},
        "high": {"max_frames": 120, "interval_multiplier": 10.0},
        "max": {"max_frames": 300, "interval_multiplier": 30.0},
    }
    settings = quality_settings[quality]
    
    # ... calculate interval using settings ...

Warnings

5. Frame Extraction Pattern Matching May Miss Files

Location: tools/video_analysis_tool.py:98-99

frames = sorted(
    str(p) for p in Path(output_dir).glob("frame_*.jpg")
)
return frames

Issue: The glob pattern frame_*.jpg assumes all frame files will have the exact prefix. However, if ffmpeg creates files with slight variations in naming (e.g., due to concurrent runs, filesystem issues), some frames may be missed.

Impact:

Missing frames: Not all extracted frames may be returned
Inconsistent results: Frame count in results may not match actual extracted frames
Hard to debug: Difficult to detect when frames are silently missed
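
A related and more concrete risk with `sorted()` on path strings is ordering: if the frame numbers are not zero-padded, `frame_10.jpg` sorts before `frame_2.jpg`. A minimal sketch of a stricter listing, assuming frames are named `frame_<number>.jpg`; the helper name `list_frames` is hypothetical:

```python
import re
from pathlib import Path

_FRAME_RE = re.compile(r"frame_(\d+)\.jpg")


def list_frames(output_dir: str) -> list[str]:
    """Return frame paths ordered by numeric index, ignoring unrelated files."""
    indexed = []
    for path in Path(output_dir).iterdir():
        match = _FRAME_RE.fullmatch(path.name)
        if match:
            indexed.append((int(match.group(1)), str(path)))
    # Sorting the (index, path) tuples orders frames by their number.
    return [frame_path for _, frame_path in sorted(indexed)]
```

Comparing `len(list_frames(...))` with the expected frame count also gives a cheap signal when extraction produced fewer frames than planned.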

Suggested Fix:

Use a more robust pattern or check the actual ffmpeg output:

# Option 1: More robust glob pattern
frames = sorted(
    str(p) for p in Path(output_dir).glob("frame_[0-9]*.jpg")
)

# Option 2: Parse ffmpeg output to get exact frame paths
import re

def _extract_frames(video_path: str, output_dir: str, interval: float) -> list[str]:
    result = subprocess.run(
        [...],
        capture_output=True,
        text=True,
        check=True,
    )
    
    # Parse ffmpeg output to extract frame filenames
    frame_files = []
    for line in result.stderr.split('\n'):
        if 'frame_' in line:
            # Extract filename using regex
            match = re.search(r'(frame_\d+\.jpg)', line)
            if match:
                frame_files.append(output_dir + '/' + match.group(0))
    
    return sorted(frame_files)

6. Error Messages Could Be More Specific for Debugging

Location: tools/video_analysis_tool.py:626-629

if not frames:
    return json.dumps({
        "success": False,
        "error": "No frames extracted from video",
        "analysis": "ffmpeg did not produce any frames from the video.",
    })

Issue: The error message "No frames extracted from video" is generic and doesn't help with debugging. It doesn't indicate:

Whether ffmpeg ran successfully
What error ffmpeg encountered (if any)
Whether the video file was valid
What the actual ffmpeg output was

Impact:

Poor debugging: Users cannot determine why frame extraction failed
False errors: May be reported as "no frames" when there was actually a different error
Time waste: Debugging requires manual inspection of logs

Suggested Fix:

Provide more detailed error information:

if not frames:
    return json.dumps({
        "success": False,
        "error": (
            "No frames extracted from video. "
            "This could be due to: (1) ffmpeg execution error, "
            "(2) invalid video format, (3) timeout, or "
            "(4) insufficient disk space. Check logs for details."
        ),
        "ffmpeg_exit_code": exit_code,  # Add if available
        "ffmpeg_stderr": stderr_output,  # Add if available
        "analysis": None,
    })

Or capture and return ffmpeg's actual error:

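def _extract_frames(...):
    # No check=True here: it would raise CalledProcessError before the
    # return code could be inspected and reported with ffmpeg's stderr
    result = subprocess.run(..., capture_output=True, text=True)

    if result.returncode != 0:
        raise FrameExtractionError(
            f"ffmpeg failed with exit code {result.returncode}: {result.stderr}"
        )

    return sorted(str(p) for p in Path(output_dir).glob("frame_*.jpg"))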
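
Putting the two ideas together, the error payload can carry the exit code and the tail of ffmpeg's stderr. The sketch below is illustrative: `extraction_error_payload` is a hypothetical helper, the extra JSON keys mirror the ones suggested above, and a Python subprocess stands in for a failing ffmpeg run so the example works without ffmpeg installed.

```python
import json
import subprocess
import sys


def extraction_error_payload(result: subprocess.CompletedProcess, tail_lines: int = 5) -> str:
    """Serialize a failed extraction, keeping only the last lines of stderr."""
    stderr_tail = "\n".join((result.stderr or "").strip().splitlines()[-tail_lines:])
    return json.dumps({
        "success": False,
        "error": "No frames extracted from video",
        "ffmpeg_exit_code": result.returncode,
        "ffmpeg_stderr": stderr_tail,
    })


# Stand-in for a failing ffmpeg invocation.
failed = subprocess.run(
    [sys.executable, "-c", "import sys; sys.stderr.write('bad input\\n'); sys.exit(1)"],
    capture_output=True,
    text=True,
)
print(extraction_error_payload(failed))
```

Trimming stderr to a few lines keeps the tool result small enough to hand back to the agent while still pointing at the cause.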

7. Test Coverage for Error Paths Could Be Expanded

Location: tests/test_video_analysis_tool.py

The tests are comprehensive but could benefit from:

Testing with actually invalid video files (corrupt, wrong format)
Testing ffmpeg timeout scenarios
Testing with zero-length videos
Testing with videos at the exact tier boundaries (60s, 300s)
Testing with very long videos to verify cap behavior
Testing concurrent video analysis requests

Impact:

Untested edge cases: Certain error scenarios may not be caught
Regression risk: Changes to ffmpeg parameters may break existing behavior

Suggested Fix:

Add tests for edge cases:

class TestVideoAnalyzeTool:
    # ... existing tests ...
    
    @pytest.mark.asyncio
    @patch("tools.video_analysis_tool._has_ffmpeg", return_value=True)
    @patch("tools.video_analysis_tool._has_ffprobe", return_value=True)
    async def test_corrupt_video(self, _probe, _ff, tmp_path):
        """Test handling of corrupt video files."""
        # Create a file with invalid video data
        corrupt_video = tmp_path / "corrupt.mp4"
        corrupt_video.write_bytes(b"NOT_A_VIDEO_FILE")
        
        result = json.loads(await video_analyze_tool(str(corrupt_video), "describe"))
        assert result["success"] is False
        assert "invalid" in result["error"].lower()
    
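    @pytest.mark.asyncio
    @patch("tools.video_analysis_tool._has_ffmpeg", return_value=True)
    @patch("tools.video_analysis_tool._has_ffprobe", return_value=True)
    async def test_zero_length_video(self, _probe, _ff, tmp_path):
        """Test handling of zero-length videos."""
        zero_video = tmp_path / "zero.mp4"
        zero_video.write_bytes(b"")

        result = json.loads(await video_analyze_tool(str(zero_video), "describe"))
        assert result["success"] is False
        assert "empty" in result["error"].lower() or "zero" in result["error"].lower()

    # Decorators apply bottom-up, so the mocks arrive as _probe, _ff, _dur.
    # Frame extraction and the vision call would also need mocking to run offline.
    @pytest.mark.asyncio
    @patch("tools.video_analysis_tool._get_video_duration", return_value=60.0)
    @patch("tools.video_analysis_tool._has_ffmpeg", return_value=True)
    @patch("tools.video_analysis_tool._has_ffprobe", return_value=True)
    async def test_exact_boundary_60s(self, _probe, _ff, _dur, tmp_path):
        """Test video at exact 60s boundary."""
        video_path = _make_video(tmp_path)
        # Should use medium tier (5s interval)
        result = json.loads(await video_analyze_tool(video_path, "describe"))
        assert result["success"] is True
        assert result["duration"] == 60.0
        # 60/5 = 12 frames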

Cross-file Impact

Public API Changes

Modified:

gateway/platforms/base.py - Added video cache infrastructure (3 functions, 1 constant)
gateway/platforms/telegram.py - Added video download handling in _handle_attachment()
gateway/run.py - Modified _build_message_context() to enrich video messages; added video cache cleanup to cron ticker
model_tools.py - Added tools.video_analysis_tool to _discover_tools()
toolsets.py - Added video_analyze to vision toolset

Added:

tools/video_analysis_tool.py - New video analysis tool with 304 lines
tests/test_video_analysis_tool.py - New test file with 454 lines

Callers Affected

Direct Callers:

Platform adapters that call _build_message_context() in GatewayRunner
Users of Telegram platform with video attachments
Cron ticker cleanup function

Indirect Callers:

Vision LLM providers that process multi-image messages (now can handle video frames)
Future platform adapters that want video support
Users who configure Hermes with custom toolsets

Dependencies

Modified:

gateway/platforms/telegram.py - Imports cache_video_from_bytes from base
gateway/run.py - Imports cleanup_video_cache from base
model_tools.py - Imports tools.video_analysis_tool

No External Dependencies Added:

Uses subprocess for ffmpeg/ffprobe (assumes system installation)
No new Python packages required

Test Coverage

Added:

Frame interval calculation tests (10 test cases)
FFmpeg/FFprobe detection tests (4 test cases)
Video analyze tool tests (8 test cases including success, failure, and edge cases)
Schema validation tests (2 test cases)
Tool registration tests (2 test cases)
Requirement check tests (3 test cases)

Missing:

Tests for corrupt/invalid video files
Tests for zero-length videos
Tests for exact tier boundary cases (60s, 300s)
Tests for concurrent video analysis
Tests for timeout scenarios
Tests for very long videos (>30 minutes)
Integration tests with actual video files

Dependencies

No New External Dependencies: Uses system ffmpeg and ffprobe (assumes they are installed)

Modified:

gateway/platforms/telegram.py - Now depends on video cache infrastructure
gateway/run.py - Now depends on video cache cleanup

User Impact

Positive:

New capability: Users can now analyze videos via Hermes
Smart frame sampling: Tiered strategy balances detail vs computational cost
Video caching: Avoids repeated downloads of the same video
Automated cleanup: Prevents disk space bloat from cached videos
Well-integrated: Follows existing patterns for consistency
Comprehensive tests: Good test coverage for core functionality
Platform support: Works with Telegram video attachments

Negative:

External dependency: Requires ffmpeg/ffprobe to be installed on the system
Fixed quality: Cannot adjust frame resolution or quality for different use cases
Restrictive frame cap: 30 frames may miss important temporal information
Fixed download limit: 20MB limit may be too restrictive for some users
Potential timeouts: Fixed timeout values may fail on slow systems or large videos
Generic errors: Error messages could be more specific for debugging

Assessment

✅ Approve with suggestions

This is a well-implemented feature that adds valuable video analysis capabilities to Hermes. The code follows existing architectural patterns, includes comprehensive tests, and is properly integrated with the Telegram platform. The video caching infrastructure and automated cleanup are particularly good additions.

However, there are several opportunities for improvement that would make the feature more robust and flexible:

Improvements to consider:

Make frame quality/resolution configurable - Current hardcoded values may lose important details
Scale timeouts with video duration - Fixed timeouts may fail on legitimate long videos
Make video download limit configurable - Allow adjustment per deployment
Increase or make configurable the 30-frame cap - May miss important temporal information
Improve error specificity - More detailed error messages would help debugging
Add edge case tests - Corrupt videos, zero-length videos, boundary conditions

These are suggestions, not blockers. The current implementation is functional and well-tested. The feature is ready to merge, and the suggested improvements can be addressed in follow-up PRs.

Recommended Path Forward:

Merge this PR as it adds valuable functionality with good code quality
File follow-up issues for the suggested improvements, prioritized by use case
Consider user feedback after deployment to understand if the current parameters (resolution, frame cap, timeouts) work well in practice
Document external dependency in README or setup docs (ffmpeg/ffprobe requirement)

This is a solid addition to Hermes that extends its capabilities in a thoughtful and well-tested manner.

teknium1 · 2026-05-11T05:33:07Z

Superseded by @alt-glitch

alt-glitch added labels on Apr 30, 2026: type/feature (New feature or request), P3 (Low: cosmetic, nice to have), comp/tools (Tool registry, model_tools, toolsets), comp/gateway (Gateway runner, session dispatch, delivery), tool/vision (Vision analysis and image generation), platform/telegram (Telegram bot adapter)

alt-glitch mentioned this pull request May 3, 2026

feat: add video_analyze tool for native video understanding #19301

Merged

3 tasks

teknium1 closed this May 11, 2026

Add1ct1ve wants to merge 2 commits into NousResearch:main from Add1ct1ve:feat/video-analysis
