feat(tools): add video analysis via ffmpeg frame extraction + vision LLM #2294

Closed
Add1ct1ve wants to merge 2 commits into NousResearch:main from Add1ct1ve:feat/video-analysis

@Add1ct1ve Add1ct1ve commented Mar 21, 2026

What does this PR do?

Adds end-to-end video analysis support for the Hermes agent. When a user sends a video on Telegram, it is downloaded, cached locally, and the agent is given a context note with the file path and instructions to use the new video_analyze tool. The tool extracts frames using ffmpeg, encodes them as base64, and sends them to the configured vision LLM for multi-frame analysis.

This follows the same pattern as the existing image analysis pipeline (download → cache → enrich → tool), but intentionally skips auto-analysis since processing video frames is more expensive than a single image.
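The base64 step described above could look roughly like the sketch below. It assumes an OpenAI-style chat message whose content is a list of text and `image_url` parts; the PR does not show the actual payload format, and `frames_to_content_parts` is a hypothetical name used only for illustration.

```python
import base64
from pathlib import Path


def frames_to_content_parts(frame_paths: list[str], prompt: str) -> list[dict]:
    """Build one user-message content list: the prompt, then each frame as a data URL.

    Hypothetical helper; the message shape is an assumption, not the PR's code.
    """
    parts: list[dict] = [{"type": "text", "text": prompt}]
    for path in frame_paths:
        # Read the JPEG bytes and encode them as ASCII-safe base64 text.
        encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
        parts.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
        })
    return parts
```

Base64 inflates each frame by about a third, which is one reason the frame count and resolution are capped before the request is built.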

Type of Change

  • ✨ New feature (non-breaking change that adds functionality)

Changes Made

  • gateway/platforms/base.py — Added VIDEO_CACHE_DIR, get_video_cache_dir(), cache_video_from_bytes(), cleanup_video_cache() following the existing image/audio/document cache pattern
  • gateway/platforms/telegram.py — Added elif msg.video: download block with 20MB Telegram Bot API size check, MIME-to-extension mapping, and cache write
  • tools/video_analysis_tool.py (new) — Frame extraction via ffmpeg (fps=1/<interval>, scaled to 768px, JPEG quality 2), tiered interval strategy (<1min: 1s, 1-5min: 5s, 5min+: 10s, hard cap 30 frames), async vision LLM call, tool registration
  • model_tools.py — Added "tools.video_analysis_tool" to _discover_tools()
  • toolsets.py — Added "video_analyze" to _HERMES_CORE_TOOLS, "vision" toolset, and "hermes-acp" preset
  • gateway/run.py — Added video context enrichment block (tells agent a video is available + how to analyze it), added cleanup_video_cache() to hourly cron ticker
  • tests/test_video_analysis_tool.py (new) — 29 unit tests covering frame interval calculation, ffmpeg detection, async tool flow, handler/schema/registry integration, and requirements checks
  • website/docs/reference/tools-reference.md — Added video_analyze to vision toolset documentation, fixed pre-existing duplicate sections
  • website/docs/reference/toolsets-reference.md — Added video_analyze to vision, hermes-acp, and hermes-cli toolset listings
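As an illustration of the Telegram download block listed above, the size check and MIME-to-extension mapping might be sketched as follows. The mapping table and both function names are hypothetical, since the PR's actual table is not shown; only the 20MB constant comes from the description.

```python
from typing import Optional

# Telegram Bot API download ceiling quoted in the PR description.
MAX_VIDEO_BYTES = 20 * 1024 * 1024

# Hypothetical mapping for illustration; the PR's real table is not shown.
_VIDEO_EXTENSIONS = {
    "video/mp4": ".mp4",
    "video/quicktime": ".mov",
    "video/webm": ".webm",
    "video/x-matroska": ".mkv",
}


def video_extension(mime_type: Optional[str], default: str = ".mp4") -> str:
    """Pick a cache-file extension, falling back to .mp4 for unknown types."""
    if not mime_type:
        return default
    return _VIDEO_EXTENSIONS.get(mime_type.lower(), default)


def is_downloadable(file_size: Optional[int]) -> bool:
    """Return False only when a size is reported and it exceeds the ceiling."""
    return not (file_size and file_size > MAX_VIDEO_BYTES)
```

Treating a missing `file_size` as downloadable is a design choice in this sketch; a stricter adapter could refuse videos of unknown size instead.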

How to Test

  1. Install ffmpeg (winget install ffmpeg / brew install ffmpeg / apt install ffmpeg)
  2. Start the gateway with Telegram configured
  3. Send a short video (<20MB) to the bot on Telegram
  4. Verify it downloads to ~/.hermes/video_cache/
  5. The agent should receive a context note and invoke video_analyze if the user's caption asks about the video
  6. Verify frames are extracted, vision LLM is called, and analysis is returned

Edge cases:

  • Send a video >20MB → should get a size limit message, no download attempted
  • Remove ffmpeg from PATH → video_analyze tool should not appear in the agent's toolset (check_fn returns False)
  • Send a video with no caption → agent receives context note and can decide whether to analyze
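The ffmpeg edge case above relies on an availability check. A minimal sketch, assuming the check is a PATH lookup via `shutil.which`, is shown below; `has_binary` and `check_video_requirements` are illustrative names, and the vision-provider half of the gate is omitted because its API is not shown in this PR.

```python
import shutil


def has_binary(name: str) -> bool:
    """True when an executable with this name is found on PATH."""
    return shutil.which(name) is not None


def check_video_requirements() -> bool:
    """Both ffmpeg (frame extraction) and ffprobe (duration) must be present."""
    return has_binary("ffmpeg") and has_binary("ffprobe")
```

If a registry calls a function like this when building the toolset, removing ffmpeg from PATH is enough to hide the tool without any configuration change.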

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Windows 11 Pro

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Videos sent on Telegram are now downloaded to ~/.hermes/video_cache/,
and the agent receives a context note pointing it to the video_analyze
tool.  The tool extracts frames with ffmpeg (tiered interval, max 30
frames, scaled to 768px), encodes them as base64, and sends them to
the configured vision model for multi-frame analysis.

Gated on ffmpeg + ffprobe + vision provider availability — the tool
silently hides itself from the agent's toolset when requirements are
not met.
@Add1ct1ve

Author

disclaimer this is fully vibe coded based on a need i had like 4 hours ago

Add 29 unit tests covering frame interval calculation, ffmpeg detection,
async tool flow, handler prompt construction, schema validation, registry
integration, and requirements checks.

Update tools-reference and toolsets-reference docs to include video_analyze
in the vision toolset. Fix pre-existing duplicate vision/web sections in
tools-reference.
@alt-glitch alt-glitch added labels on Apr 30, 2026: type/feature (New feature or request), P3 (Low: cosmetic, nice to have), comp/tools (Tool registry, model_tools, toolsets), comp/gateway (Gateway runner, session dispatch, delivery), tool/vision (Vision analysis and image generation), platform/telegram (Telegram bot adapter)
@nidhishgajjar

Orb Code Review (powered by GLM-4.7 on Orb Cloud)

Summary

This PR adds a new video_analyze tool to Hermes that extracts frames from videos using ffmpeg/ffprobe and analyzes them via a centralized vision LLM. The implementation includes: (1) video caching infrastructure in gateway/platforms/base.py, (2) Telegram platform integration for video downloads (with 20MB limit), (3) context enrichment for video messages, (4) automated video cache cleanup, (5) comprehensive test coverage, and (6) tool registration and documentation updates.

Architecture

The PR follows the existing Hermes architecture well:

  1. Video caching infrastructure: Adds VIDEO_CACHE_DIR, get_video_cache_dir(), cache_video_from_bytes(), and cleanup_video_cache() to gateway/platforms/base.py, following the same pattern as image/audio/document caching

  2. Telegram platform integration: Extends TelegramAdapter._handle_attachment() to download videos from Telegram Bot API, respecting the 20MB limit and detecting MIME types for proper file extensions

  3. Context enrichment: Modifies GatewayRunner._build_message_context() to add a note for video messages, guiding the agent to use the video_analyze tool with the cached video path

  4. Automated cleanup: Integrates video cache cleanup into the cron ticker, running every hour alongside image/document cache cleanup

  5. Tool implementation: Creates tools/video_analysis_tool.py with well-structured helpers (_has_ffmpeg(), _has_ffprobe(), _get_video_duration(), _calculate_frame_interval(), _extract_frames()) and main video_analyze_tool() function

  6. Comprehensive tests: Adds tests/test_video_analysis_tool.py with extensive coverage including frame interval calculation, ffmpeg detection, tool registration, and schema validation

  7. Tool registration: Adds tools.video_analysis_tool to tool discovery in model_tools.py, registers video_analyze in the vision toolset in toolsets.py, and updates documentation
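For the automated cleanup step, an age-based sweep is one plausible shape. The sketch below is an assumption rather than the PR's `cleanup_video_cache()`: the function name, the 24-hour default, and the use of file modification time are all illustrative.

```python
import time
from pathlib import Path


def cleanup_cache(cache_dir: Path, max_age_hours: float = 24.0) -> int:
    """Delete regular files older than max_age_hours; return how many were removed."""
    if not cache_dir.is_dir():
        return 0
    cutoff = time.time() - max_age_hours * 3600
    removed = 0
    for entry in cache_dir.iterdir():
        try:
            if entry.is_file() and entry.stat().st_mtime < cutoff:
                entry.unlink()
                removed += 1
        except OSError:
            # Another process may have removed or locked the file; skip it.
            continue
    return removed
```

Returning the count makes it easy for an hourly ticker to log how much was reclaimed on each pass.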

Issues

Major Issues

1. Hardcoded Frame Quality May Result in Loss of Important Details

Location: tools/video_analysis_tool.py:90

def _extract_frames(video_path: str, output_dir: str, interval: float) -> list[str]:
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", f"fps=1/{interval},scale=768:-1",
            "-q:v", "2",
            pattern,
        ],
        ...
    )

Issue: The frame extraction hardcodes scale=768:-1 (scale width to 768, preserve aspect ratio) and quality -q:v 2.

Impact:

  • Loss of detail: 768px width may be insufficient for detecting fine details (e.g., text in videos, small objects, subtle movements)
  • Quality tradeoff: -q:v is fixed at 2, the high-quality end of ffmpeg's JPEG scale (2 is best, 31 is worst), so callers cannot trade quality for smaller frames
  • Fixed resolution: Users cannot adjust resolution based on video content or use case (e.g., medical videos might need higher resolution)

Suggested Fix:

Make resolution and quality configurable:

def _extract_frames(
    video_path: str, 
    output_dir: str, 
    interval: float, 
    scale_width: int = 1280,  # Default to HD
    quality: int = 3,  # ffmpeg -q:v for JPEG: lower is better (2 best, 31 worst)
) -> list[str]:
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", f"fps=1/{interval},scale={scale_width}:-1",
            "-q:v", str(quality),
            pattern,
        ],
        ...
    )

And update the tool signature:

async def video_analyze_tool(
    video_path: str,
    user_prompt: str,
    model: Optional[str] = None,
    scale_width: int = 1280,  # Added
    quality: int = 3,  # Added; ffmpeg -q:v, lower is better
) -> str:

2. Timeout Values May Be Insufficient for Long Videos

Location: tools/video_analysis_tool.py:49 and tools/video_analysis_tool.py:95

def _get_video_duration(path: str) -> float:
    result = subprocess.run(
        ["ffprobe", "-v", "error", ...],
        timeout=30,  # Only 30 seconds
    )

def _extract_frames(video_path: str, output_dir: str, interval: float) -> list[str]:
    subprocess.run(
        ["ffmpeg", "-i", video_path, ...],
        timeout=120,  # Only 120 seconds
    )

Issue: The timeout values are fixed and may be insufficient:

  • ffprobe timeout: 30 seconds may be too short for very large videos (>1GB) or slow storage
  • ffmpeg timeout: 120 seconds may be insufficient for long videos at high quality or slow systems

Impact:

  • False errors: Legitimate videos may be reported as timeouts
  • Incomplete analysis: Users may get partial results without knowing the analysis failed due to timeout
  • Hard to debug: No way to distinguish between a real timeout and a slow system

Suggested Fix:

Make timeout scale with video duration:

def _get_video_duration(path: str) -> float:
    # The probe keeps a fixed timeout; only frame extraction scales with duration
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration", "-of", "csv=p=0", path],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return float(result.stdout.strip())

def _extract_frames(video_path: str, output_dir: str, interval: float, estimated_duration: float) -> list[str]:
    # Scale timeout with estimated duration and frame count
    num_frames = int(estimated_duration / interval)
    # Allow 5 seconds per frame plus 60 second overhead
    timeout = max(120, num_frames * 5 + 60)
    
    subprocess.run([...], timeout=timeout)

And update the call to _extract_frames to pass the estimated duration:

# In video_analyze_tool()
frames = await asyncio.to_thread(
    _extract_frames, video_path, tmp_dir, interval, duration,  # Pass duration
)
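To address the debugging concern in the Impact list, timeouts can be reported separately from ordinary ffmpeg failures. The wrapper below is a sketch under that assumption; `run_with_timeout` and its result keys are hypothetical and not part of the PR.

```python
import subprocess


def run_with_timeout(cmd: list[str], timeout: float) -> dict:
    """Run cmd and report timeouts separately from non-zero exits."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return {"ok": False, "reason": "timeout", "timeout": timeout}
    except FileNotFoundError:
        # The executable itself is missing from PATH.
        return {"ok": False, "reason": "not_found"}
    if result.returncode != 0:
        return {
            "ok": False,
            "reason": "exit",
            "code": result.returncode,
            "stderr": result.stderr[-500:],  # keep only the tail of the log
        }
    return {"ok": True, "stdout": result.stdout}
```

A caller can then map `reason == "timeout"` to a message that names the limit that was hit, instead of a generic extraction failure.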

3. Video Download Limit Is Hardcoded

Location: gateway/platforms/telegram.py:88

# Telegram Bot API limit: 20 MB for file downloads
MAX_VIDEO_BYTES = 20 * 1024 * 1024
if msg.video.file_size and msg.video.file_size > MAX_VIDEO_BYTES:
    ...

Issue: The 20MB video limit is hardcoded in the Telegram adapter. This may be:

  • Too restrictive for users with high-bandwidth connections
  • Inconsistent with other platforms' limits
  • Not configurable per deployment or use case

Impact:

  • User experience: Users with legitimate videos >20MB cannot analyze them
  • Platform inconsistency: Other platforms (Discord, WhatsApp) may have different limits
  • Hard to adjust: Requires code changes to change the limit

Suggested Fix:

Make the limit configurable:

# In gateway/platforms/base.py
VIDEO_MAX_BYTES = int(os.getenv("VIDEO_MAX_MB", "20")) * 1024 * 1024

# In telegram.py
if msg.video.file_size and msg.video.file_size > VIDEO_MAX_BYTES:

Or document it as a platform-specific consideration:

# Telegram Bot API limit: 20 MB for file downloads
# Note: This is a Telegram-specific limitation. Other platforms may have
# different limits. Consider adjusting this value based on your deployment.
MAX_VIDEO_BYTES = 20 * 1024 * 1024
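One caveat for the configurable variant: the hosted Telegram Bot API itself only serves file downloads up to 20 MB, so raising `VIDEO_MAX_MB` above that would not make larger Telegram videos downloadable. A sketch that reads the environment variable but clamps it to the platform ceiling might look like this; `video_max_bytes` and the clamping policy are assumptions for illustration.

```python
import os
from typing import Mapping, Optional

TELEGRAM_DOWNLOAD_CEILING_MB = 20  # hosted Bot API download limit


def video_max_bytes(env: Optional[Mapping[str, str]] = None) -> int:
    """Read VIDEO_MAX_MB, fall back to the ceiling on bad input, clamp to 1..20 MB."""
    env = os.environ if env is None else env
    try:
        requested_mb = int(env.get("VIDEO_MAX_MB", ""))
    except ValueError:
        # Missing or non-numeric value: use the platform ceiling.
        requested_mb = TELEGRAM_DOWNLOAD_CEILING_MB
    requested_mb = max(1, min(requested_mb, TELEGRAM_DOWNLOAD_CEILING_MB))
    return requested_mb * 1024 * 1024
```

With this shape, the variable is useful for lowering the limit per deployment, while other platform adapters could define their own ceilings.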

4. Frame Cap May Be Too Restrictive for Complex Videos

Location: tools/video_analysis_tool.py:523

def _calculate_frame_interval(duration: float) -> float:
    MAX_FRAMES = 30  # Hard cap
    
    if duration < 60:
        interval = 1.0
    elif duration < 300:
        interval = 5.0
    else:
        interval = 10.0
    
    # Enforce max-frames cap
    estimated_frames = duration / interval
    if estimated_frames > MAX_FRAMES:
        interval = duration / MAX_FRAMES

    return interval

Issue: The 30-frame cap means:

  • For a 10-minute video (600s): the 10s tier would yield 60 frames, so the cap stretches the interval to 20s and only 30 frames are sampled
  • For a 30-minute video (1800s): interval = 60s, only 30 frames sampled

This may be insufficient for:

  • Detecting subtle changes over long periods
  • Understanding temporal relationships in long videos
  • Detecting events that occur at specific timestamps not aligned with the sampling interval
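To make the arithmetic concrete, the quoted tier logic can be run for a few durations. The sketch below re-implements the excerpt above as a standalone function (with an explicit return and a public name chosen for illustration), so it reflects the quoted snippet rather than the full file.

```python
MAX_FRAMES = 30  # hard cap from the quoted excerpt


def calculate_frame_interval(duration: float, max_frames: int = MAX_FRAMES) -> float:
    """Pick a tiered sampling interval, then stretch it if the cap would be exceeded."""
    if duration < 60:
        interval = 1.0
    elif duration < 300:
        interval = 5.0
    else:
        interval = 10.0
    # Enforce the frame cap by widening the interval.
    if duration / interval > max_frames:
        interval = duration / max_frames
    return interval


if __name__ == "__main__":
    for seconds in (30, 60, 300, 600, 1800):
        interval = calculate_frame_interval(seconds)
        print(f"{seconds:>5}s -> one frame every {interval:g}s")
```

Note that the cap already applies inside the first tier: any clip longer than 30 seconds at a 1s interval exceeds 30 frames, so its interval is stretched as well.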

Impact:

  • Limited temporal resolution: Important moments may be missed
  • Context loss: Long videos may lack sufficient context
  • Deterministic sampling: Users cannot increase frame count even if they want to pay the computational cost

Suggested Fix:

Make the cap configurable or increase it significantly:

VIDEO_MAX_FRAMES = int(os.getenv("VIDEO_MAX_FRAMES", "100"))  # Default to 100, allow override

def _calculate_frame_interval(duration: float, max_frames: int = VIDEO_MAX_FRAMES) -> float:
    # ... existing logic ...
    
    # Use configured max instead of hardcoded 30
    estimated_frames = duration / interval
    if estimated_frames > max_frames:
        interval = duration / max_frames
    
    return interval

Or provide multiple quality tiers:

def _calculate_frame_interval(duration: float, quality: str = "medium") -> float:
    # quality: "low", "medium", "high", "max"
    quality_settings = {
        "low": {"max_frames": 30, "interval_multiplier": 1.0},
        "medium": {"max_frames": 60, "interval_multiplier": 5.0},
        "high": {"max_frames": 120, "interval_multiplier": 10.0},
        "max": {"max_frames": 300, "interval_multiplier": 30.0},
    }
    settings = quality_settings[quality]
    
    # ... calculate interval using settings ...

Warnings

5. Frame Extraction Pattern Matching May Miss Files

Location: tools/video_analysis_tool.py:98-99

frames = sorted(
    str(p) for p in Path(output_dir).glob("frame_*.jpg")
)
return frames

Issue: The glob pattern frame_*.jpg assumes all frame files will have the exact prefix. However, if ffmpeg creates files with slight variations in naming (e.g., due to concurrent runs, filesystem issues), some frames may be missed.

Impact:

  • Missing frames: Not all extracted frames may be returned
  • Inconsistent results: Frame count in results may not match actual extracted frames
  • Hard to debug: Difficult to detect when frames are silently missed

Suggested Fix:

Use a more robust pattern or check the actual ffmpeg output:

# Option 1: More robust glob pattern
frames = sorted(
    str(p) for p in Path(output_dir).glob("frame_[0-9]*.jpg")
)

# Option 2: Parse ffmpeg output to get exact frame paths
import re

def _extract_frames(video_path: str, output_dir: str, interval: float) -> list[str]:
    result = subprocess.run(
        [...],
        capture_output=True,
        text=True,
        check=True,
    )
    
    # Parse ffmpeg output to extract frame filenames
    frame_files = []
    for line in result.stderr.split('\n'):
        if 'frame_' in line:
            # Extract filename using regex
            match = re.search(r'(frame_\d+\.jpg)', line)
            if match:
                frame_files.append(output_dir + '/' + match.group(0))
    
    return sorted(frame_files)
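A third option is to keep the glob-style listing but verify the numbering afterwards, which surfaces silently missing frames. The sketch below assumes the output pattern is `frame_<number>.jpg` with ffmpeg's default numbering starting at 1; the exact pattern is not shown in the excerpt, and `collect_frames` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed naming scheme: frame_0001.jpg, frame_0002.jpg, ...
_FRAME_RE = re.compile(r"^frame_(\d+)\.jpg$")


def collect_frames(output_dir: str) -> tuple[list[str], list[int]]:
    """Return (paths sorted by frame number, frame numbers missing from the sequence)."""
    numbered: dict[int, str] = {}
    for path in Path(output_dir).iterdir():
        match = _FRAME_RE.match(path.name)
        if match:
            numbered[int(match.group(1))] = str(path)
    if not numbered:
        return [], []
    ordered = [numbered[n] for n in sorted(numbered)]
    # Any gap between 1 and the highest number seen indicates a lost frame.
    missing = [n for n in range(1, max(numbered) + 1) if n not in numbered]
    return ordered, missing
```

Sorting by the parsed number rather than by string also stays correct if the zero-padding width ever changes.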

6. Error Messages Could Be More Specific for Debugging

Location: tools/video_analysis_tool.py:626-629

if not frames:
    return json.dumps({
        "success": False,
        "error": "No frames extracted from video",
        "analysis": "ffmpeg did not produce any frames from the video.",
    })

Issue: The error message "No frames extracted from video" is generic and doesn't help with debugging. It doesn't indicate:

  • Whether ffmpeg ran successfully
  • What error ffmpeg encountered (if any)
  • Whether the video file was valid
  • What the actual ffmpeg output was

Impact:

  • Poor debugging: Users cannot determine why frame extraction failed
  • False errors: May be reported as "no frames" when there was actually a different error
  • Time waste: Debugging requires manual inspection of logs

Suggested Fix:

Provide more detailed error information:

if not frames:
    return json.dumps({
        "success": False,
        "error": (
            "No frames extracted from video. "
            "This could be due to: (1) ffmpeg execution error, "
            "(2) invalid video format, (3) timeout, or "
            "(4) insufficient disk space. Check logs for details."
        ),
        "ffmpeg_exit_code": exit_code,  # Add if available
        "ffmpeg_stderr": stderr_output,  # Add if available
        "analysis": None,
    })

Or capture and return ffmpeg's actual error:

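def _extract_frames(video_path: str, output_dir: str, interval: float) -> list[str]:
    # check=False so the non-zero exit is handled below instead of raising CalledProcessError
    result = subprocess.run(..., capture_output=True, text=True, check=False)

    if result.returncode != 0:
        raise FrameExtractionError(
            f"ffmpeg failed with exit code {result.returncode}: {result.stderr}"
        )

    # Return strings to match the declared list[str] return type
    return sorted(str(p) for p in Path(output_dir).glob("frame_*.jpg"))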
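`FrameExtractionError` is not defined anywhere in the excerpt, so the suggestion needs an exception class and a way to turn it into the tool's JSON result. One possible shape, with all names and fields treated as assumptions:

```python
import json
from typing import Optional


class FrameExtractionError(RuntimeError):
    """Raised when ffmpeg fails or produces no frames (hypothetical helper)."""

    def __init__(self, message: str, exit_code: Optional[int] = None, stderr: str = ""):
        super().__init__(message)
        self.exit_code = exit_code
        self.stderr = stderr


def error_payload(exc: FrameExtractionError, max_stderr_chars: int = 2000) -> str:
    """Serialize the failure using the success/error/analysis keys shown above."""
    return json.dumps({
        "success": False,
        "error": str(exc),
        "ffmpeg_exit_code": exc.exit_code,
        # Keep only the tail of stderr, where ffmpeg prints the actual error.
        "ffmpeg_stderr": exc.stderr[-max_stderr_chars:],
        "analysis": None,
    })
```

Truncating stderr keeps the payload small enough to hand back to the agent while preserving the lines that usually explain the failure.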

7. Test Coverage for Error Paths Could Be Expanded

Location: tests/test_video_analysis_tool.py

The tests are comprehensive but could benefit from:

  • Testing with actually invalid video files (corrupt, wrong format)
  • Testing ffmpeg timeout scenarios
  • Testing with zero-length videos
  • Testing with videos at the exact tier boundaries (60s, 300s)
  • Testing with very long videos to verify cap behavior
  • Testing concurrent video analysis requests

Impact:

  • Untested edge cases: Certain error scenarios may not be caught
  • Regression risk: Changes to ffmpeg parameters may break existing behavior

Suggested Fix:

Add tests for edge cases:

class TestVideoAnalyzeTool:
    # ... existing tests ...

    @pytest.mark.asyncio
    @patch("tools.video_analysis_tool._has_ffmpeg", return_value=True)
    @patch("tools.video_analysis_tool._has_ffprobe", return_value=True)
    async def test_corrupt_video(self, _probe, _ff, tmp_path):
        """Test handling of corrupt video files."""
        # Create a file with invalid video data
        corrupt_video = tmp_path / "corrupt.mp4"
        corrupt_video.write_bytes(b"NOT_A_VIDEO_FILE")

        result = json.loads(await video_analyze_tool(str(corrupt_video), "describe"))
        assert result["success"] is False
        assert "invalid" in result["error"].lower()

    @pytest.mark.asyncio
    @patch("tools.video_analysis_tool._has_ffmpeg", return_value=True)
    @patch("tools.video_analysis_tool._has_ffprobe", return_value=True)
    async def test_zero_length_video(self, _probe, _ff, tmp_path):
        """Test handling of zero-length videos."""
        zero_video = tmp_path / "zero.mp4"
        zero_video.write_bytes(b"")

        result = json.loads(await video_analyze_tool(str(zero_video), "describe"))
        assert result["success"] is False
        assert "empty" in result["error"].lower() or "zero" in result["error"].lower()

    # Patches apply bottom-up, so mock arguments arrive in reverse decorator order
    @pytest.mark.asyncio
    @patch("tools.video_analysis_tool._has_ffmpeg", return_value=True)
    @patch("tools.video_analysis_tool._has_ffprobe", return_value=True)
    @patch("tools.video_analysis_tool._get_video_duration", return_value=60.0)
    async def test_exact_boundary_60s(self, _dur, _probe, _ff, tmp_path):
        """Test video at exact 60s boundary."""
        video_path = _make_video(tmp_path)
        # Should use medium tier (5s interval)
        result = json.loads(await video_analyze_tool(video_path, "describe"))
        assert result["success"] is True
        assert result["duration"] == 60.0
        # 60/5 = 12 frames

Cross-file Impact

Public API Changes

Modified:

  • gateway/platforms/base.py - Added video cache infrastructure (3 functions, 1 constant)
  • gateway/platforms/telegram.py - Added video download handling in _handle_attachment()
  • gateway/run.py - Modified _build_message_context() to enrich video messages; added video cache cleanup to cron ticker
  • model_tools.py - Added tools.video_analysis_tool to _discover_tools()
  • toolsets.py - Added video_analyze to vision toolset

Added:

  • tools/video_analysis_tool.py - New video analysis tool with 304 lines
  • tests/test_video_analysis_tool.py - New test file with 454 lines

Callers Affected

Direct Callers:

  • Platform adapters that call _build_message_context() in GatewayRunner
  • Users of Telegram platform with video attachments
  • Cron ticker cleanup function

Indirect Callers:

  • Vision LLM providers that process multi-image messages (now can handle video frames)
  • Future platform adapters that want video support
  • Users who configure Hermes with custom toolsets

Dependencies

Modified:

  • gateway/platforms/telegram.py - Imports cache_video_from_bytes from base
  • gateway/run.py - Imports cleanup_video_cache from base
  • model_tools.py - Imports tools.video_analysis_tool

No External Dependencies Added:

  • Uses subprocess for ffmpeg/ffprobe (assumes system installation)
  • No new Python packages required

Test Coverage

Added:

  • Frame interval calculation tests (10 test cases)
  • FFmpeg/FFprobe detection tests (4 test cases)
  • Video analyze tool tests (8 test cases including success, failure, and edge cases)
  • Schema validation tests (2 test cases)
  • Tool registration tests (2 test cases)
  • Requirement check tests (3 test cases)

Missing:

  • Tests for corrupt/invalid video files
  • Tests for zero-length videos
  • Tests for exact tier boundary cases (60s, 300s)
  • Tests for concurrent video analysis
  • Tests for timeout scenarios
  • Tests for very long videos (>30 minutes)
  • Integration tests with actual video files

Dependencies

No New External Dependencies: Uses system ffmpeg and ffprobe (assumes they are installed)

Modified:

  • gateway/platforms/telegram.py - Now depends on video cache infrastructure
  • gateway/run.py - Now depends on video cache cleanup

User Impact

Positive:

  • New capability: Users can now analyze videos via Hermes
  • Smart frame sampling: Tiered strategy balances detail vs computational cost
  • Video caching: Avoids repeated downloads of the same video
  • Automated cleanup: Prevents disk space bloat from cached videos
  • Well-integrated: Follows existing patterns for consistency
  • Comprehensive tests: Good test coverage for core functionality
  • Platform support: Works with Telegram video attachments

Negative:

  • External dependency: Requires ffmpeg/ffprobe to be installed on the system
  • Fixed quality: Cannot adjust frame resolution or quality for different use cases
  • Restrictive frame cap: 30 frames may miss important temporal information
  • Fixed download limit: 20MB limit may be too restrictive for some users
  • Potential timeouts: Fixed timeout values may fail on slow systems or large videos
  • Generic errors: Error messages could be more specific for debugging

Assessment

Approve with suggestions

This is a well-implemented feature that adds valuable video analysis capabilities to Hermes. The code follows existing architectural patterns, includes comprehensive tests, and is properly integrated with the Telegram platform. The video caching infrastructure and automated cleanup are particularly good additions.

However, there are several opportunities for improvement that would make the feature more robust and flexible:

Improvements to consider:

  1. Make frame quality/resolution configurable - Current hardcoded values may lose important details
  2. Scale timeouts with video duration - Fixed timeouts may fail on legitimate long videos
  3. Make video download limit configurable - Allow adjustment per deployment
  4. Increase or make configurable the 30-frame cap - May miss important temporal information
  5. Improve error specificity - More detailed error messages would help debugging
  6. Add edge case tests - Corrupt videos, zero-length videos, boundary conditions

These are suggestions, not blockers. The current implementation is functional and well-tested. The feature is ready to merge, and the suggested improvements can be addressed in follow-up PRs.

Recommended Path Forward:

  1. Merge this PR as it adds valuable functionality with good code quality
  2. File follow-up issues for the suggested improvements, prioritized by use case
  3. Consider user feedback after deployment to understand if the current parameters (resolution, frame cap, timeouts) work well in practice
  4. Document external dependency in README or setup docs (ffmpeg/ffprobe requirement)

This is a solid addition to Hermes that extends its capabilities in a thoughtful and well-tested manner.

@teknium1

Contributor

Superseded by @alt-glitch

@teknium1 teknium1 closed this May 11, 2026

4 participants