-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Improve filesystem based hook architecture #1720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
+2,007
−796
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Phase 1: Database migration for new ArchiveResult fields
- Add output_str (TextField) for human-readable summary
- Add output_json (JSONField) for structured metadata
- Add output_files (JSONField) for dict of {relative_path: {}}
- Add output_size (BigIntegerField) for total bytes
- Add output_mimetypes (CharField) for CSV of mimetypes
- Add binary FK to InstalledBinary (optional)
- Migrate existing 'output' field to new split fields
Phase 3: Update run_hook() for JSONL parsing
- Support new JSONL format (any line with {type: 'ModelName', ...})
- Maintain backwards compatibility with RESULT_JSON= format
- Add plugin metadata to each parsed record
- Detect background hooks with .bg. suffix in filename
- Add find_binary_for_cmd() helper function
- Add create_model_record() for processing side-effect records
Phase 6: Update ArchiveResult.run()
- Handle background hooks (return immediately when result is None)
- Process 'records' from HookResult for side-effect models
- Use new output fields (output_str, output_json, output_files, etc.)
- Call create_model_record() for InstalledBinary, Machine updates
Phase 7: Add background hook support
- Add is_background_hook() method to ArchiveResult
- Add check_background_completed() to check if process exited
- Add finalize_background_hook() to collect results from completed hooks
- Update SnapshotMachine.is_finished() to check/finalize background hooks
- Update _populate_output_fields() to walk directory and populate stats
Also updated references to old 'output' field in:
- admin_archiveresults.py
- statemachines.py
- templatetags/core_tags.py
Phase 4 Plugin Audit Progress:
- Audited all 6 Dependency hooks (all already compliant)
- Audited all 11 Crawl Validate hooks (all already compliant)
- Updated 8 Python Snapshot hooks to clean JSONL format
- Updated 1 JS Snapshot hook (title.js) to clean JSONL format
Snapshot hooks updated to remove:
- RESULT_JSON= prefix
- Extra output lines (START_TS=, END_TS=, DURATION=, VERSION=, OUTPUT=, STATUS=)
Now output clean JSONL:
{"type": "ArchiveResult", "status": "...", "output_str": "..."}
Added implementation report to TODO_hook_architecture.md documenting:
- All completed phases (1, 3, 6, 7)
- Plugin audit results with status tables
- Remaining 13 JS hooks that need updating
- Files modified list
- Update 12 remaining JS snapshot hooks to output clean JSONL - Remove RESULT_JSON= prefix, START_TS=, END_TS=, STATUS= output - Rename 3 background hooks with .bg. suffix: - consolelog -> on_Snapshot__21_consolelog.bg.js - ssl -> on_Snapshot__23_ssl.bg.js - responses -> on_Snapshot__24_responses.bg.js - Update TODO_hook_architecture.md with completion status
- Rename 13 on_Crawl__00_validate_* hooks to on_Crawl__00_install_* - This better reflects what these hooks actually do (check/install binaries) - Update TODO_hook_architecture.md to reflect renamed hooks
- All install hooks now respect their respective XYZ_BINARY env vars (e.g., WGET_BINARY, CHROME_BINARY, YTDLP_BINARY, etc.) - Support both absolute paths (/usr/bin/wget2) and binary names (wget2) - Dynamic bin_name used in Dependency JSONL output - Updated 11 install hooks to follow the new pattern - Mark checklist items as complete in TODO_hook_architecture.md
All snapshot hooks now: - Read XYZ_BINARY env vars and use in cmd - Output exactly one clean JSONL line (no RESULT_JSON= prefix) - No extra output lines (VERSION=, START_TS=, etc.) - Only provide allowed fields - Don't include computed fields - Python hooks include cmd array with binary path
- Add test_hooks.py with 31 unit tests covering: - Background hook detection (.bg. suffix) - JSONL parsing (clean format and legacy RESULT_JSON= format) - Install hook XYZ_BINARY env var handling - Hook discovery and sorting - get_extractor_name() function - Hook execution with real subprocesses - Install hook output format compliance - Snapshot hook output format compliance - Plugin metadata addition - Update TODO_hook_architecture.md to mark all tasks complete: - Tests: 31 tests in archivebox/tests/test_hooks.py - Migrations: 0029 and 0030 applied successfully All phases of the hook architecture implementation are now complete.
Replace old `output` field with new fields across the codebase: - output_str: Human-readable output summary - output_json: Structured metadata (optional) - output_files: Dict of output files with metadata - output_size: Total size in bytes - output_mimetypes: CSV of file mimetypes Files updated: - api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields - api/v1_core.py: Update ArchiveResultFilterSchema to search output_str - cli/archivebox_extract.py: Use output_str in CLI output - core/admin_archiveresults.py: Update admin fields, search, and fieldsets - core/admin_archiveresults.py: Fix output_html variable name bug in output_summary - misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields - plugins/extractor_utils.py: Update ExtractorResult helper class The embed_path() method already uses output_files and output_str, so snapshot detail page and template tags work correctly.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
Related issues
Changes these areas