Improve filesystem based hook architecture #1720

pirate · 2025-12-27T20:18:22Z

Summary

Related issues

Changes these areas

Phase 1: Database migration for new ArchiveResult fields - Add output_str (TextField) for human-readable summary - Add output_json (JSONField) for structured metadata - Add output_files (JSONField) for dict of {relative_path: {}} - Add output_size (BigIntegerField) for total bytes - Add output_mimetypes (CharField) for CSV of mimetypes - Add binary FK to InstalledBinary (optional) - Migrate existing 'output' field to new split fields Phase 3: Update run_hook() for JSONL parsing - Support new JSONL format (any line with {type: 'ModelName', ...}) - Maintain backwards compatibility with RESULT_JSON= format - Add plugin metadata to each parsed record - Detect background hooks with .bg. suffix in filename - Add find_binary_for_cmd() helper function - Add create_model_record() for processing side-effect records Phase 6: Update ArchiveResult.run() - Handle background hooks (return immediately when result is None) - Process 'records' from HookResult for side-effect models - Use new output fields (output_str, output_json, output_files, etc.) - Call create_model_record() for InstalledBinary, Machine updates Phase 7: Add background hook support - Add is_background_hook() method to ArchiveResult - Add check_background_completed() to check if process exited - Add finalize_background_hook() to collect results from completed hooks - Update SnapshotMachine.is_finished() to check/finalize background hooks - Update _populate_output_fields() to walk directory and populate stats Also updated references to old 'output' field in: - admin_archiveresults.py - statemachines.py - templatetags/core_tags.py

Phase 4 Plugin Audit Progress: - Audited all 6 Dependency hooks (all already compliant) - Audited all 11 Crawl Validate hooks (all already compliant) - Updated 8 Python Snapshot hooks to clean JSONL format - Updated 1 JS Snapshot hook (title.js) to clean JSONL format Snapshot hooks updated to remove: - RESULT_JSON= prefix - Extra output lines (START_TS=, END_TS=, DURATION=, VERSION=, OUTPUT=, STATUS=) Now output clean JSONL: {"type": "ArchiveResult", "status": "...", "output_str": "..."} Added implementation report to TODO_hook_architecture.md documenting: - All completed phases (1, 3, 6, 7) - Plugin audit results with status tables - Remaining 13 JS hooks that need updating - Files modified list

- Update 12 remaining JS snapshot hooks to output clean JSONL - Remove RESULT_JSON= prefix, START_TS=, END_TS=, STATUS= output - Rename 3 background hooks with .bg. suffix: - consolelog -> on_Snapshot__21_consolelog.bg.js - ssl -> on_Snapshot__23_ssl.bg.js - responses -> on_Snapshot__24_responses.bg.js - Update TODO_hook_architecture.md with completion status

- Rename 13 on_Crawl__00_validate_* hooks to on_Crawl__00_install_* - This better reflects what these hooks actually do (check/install binaries) - Update TODO_hook_architecture.md to reflect renamed hooks

- All install hooks now respect their respective XYZ_BINARY env vars (e.g., WGET_BINARY, CHROME_BINARY, YTDLP_BINARY, etc.) - Support both absolute paths (/usr/bin/wget2) and binary names (wget2) - Dynamic bin_name used in Dependency JSONL output - Updated 11 install hooks to follow the new pattern - Mark checklist items as complete in TODO_hook_architecture.md

All snapshot hooks now: - Read XYZ_BINARY env vars and use in cmd - Output exactly one clean JSONL line (no RESULT_JSON= prefix) - No extra output lines (VERSION=, START_TS=, etc.) - Only provide allowed fields - Don't include computed fields - Python hooks include cmd array with binary path

- Add test_hooks.py with 31 unit tests covering: - Background hook detection (.bg. suffix) - JSONL parsing (clean format and legacy RESULT_JSON= format) - Install hook XYZ_BINARY env var handling - Hook discovery and sorting - get_extractor_name() function - Hook execution with real subprocesses - Install hook output format compliance - Snapshot hook output format compliance - Plugin metadata addition - Update TODO_hook_architecture.md to mark all tasks complete: - Tests: 31 tests in archivebox/tests/test_hooks.py - Migrations: 0029 and 0030 applied successfully All phases of the hook architecture implementation are now complete.

Replace old `output` field with new fields across the codebase: - output_str: Human-readable output summary - output_json: Structured metadata (optional) - output_files: Dict of output files with metadata - output_size: Total size in bytes - output_mimetypes: CSV of file mimetypes Files updated: - api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields - api/v1_core.py: Update ArchiveResultFilterSchema to search output_str - cli/archivebox_extract.py: Use output_str in CLI output - core/admin_archiveresults.py: Update admin fields, search, and fieldsets - core/admin_archiveresults.py: Fix output_html variable name bug in output_summary - misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields - plugins/extractor_utils.py: Update ExtractorResult helper class The embed_path() method already uses output_files and output_str, so snapshot detail page and template tags work correctly.

claude added 8 commits December 27, 2025 08:38

Rename validate hooks to install hooks

8c846b7

- Rename 13 on_Crawl__00_validate_* hooks to on_Crawl__00_install_* - This better reflects what these hooks actually do (check/install binaries) - Update TODO_hook_architecture.md to reflect renamed hooks

pirate merged commit a38624a into dev Dec 27, 2025
2 checks passed

pirate deleted the claude/implement-hook-architecture-ahCXy branch December 27, 2025 21:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Improve filesystem based hook architecture #1720

Improve filesystem based hook architecture #1720

pirate commented Dec 27, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Improve filesystem based hook architecture #1720

Improve filesystem based hook architecture #1720

Conversation

pirate commented Dec 27, 2025

Summary

Related issues

Changes these areas

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants