v0.8 Release Candidate #1311
Closed
+75,085
−20,126
- Rename 13 on_Crawl__00_validate_* hooks to on_Crawl__00_install_*
- This better reflects what these hooks actually do (check/install binaries)
- Update TODO_hook_architecture.md to reflect renamed hooks

- All install hooks now respect their respective XYZ_BINARY env vars (e.g., WGET_BINARY, CHROME_BINARY, YTDLP_BINARY, etc.)
- Support both absolute paths (/usr/bin/wget2) and binary names (wget2)
- Dynamic bin_name used in Dependency JSONL output
- Updated 11 install hooks to follow the new pattern
- Mark checklist items as complete in TODO_hook_architecture.md

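
For readers following the XYZ_BINARY change, here is a minimal sketch of how an install hook might resolve its binary from an env var that holds either an absolute path or a bare name. The function names `resolve_binary` and `dependency_record`, and the exact keys of the emitted record, are assumptions made for illustration and are not taken from the PR.

```python
import json
import os
import shutil
from typing import Optional, Tuple


def resolve_binary(env_var: str, default_name: str) -> Tuple[str, Optional[str]]:
    """Return (bin_name, abspath) for the binary configured via env_var.

    The env var may hold an absolute path (/usr/bin/wget2) or a bare
    name (wget2). abspath is None when nothing usable is found.
    """
    configured = os.environ.get(env_var, "").strip() or default_name
    if os.path.isabs(configured):
        # Absolute path: accept it only if it points at an executable file.
        usable = os.path.isfile(configured) and os.access(configured, os.X_OK)
        abspath = configured if usable else None
    else:
        # Bare name: look it up on $PATH.
        abspath = shutil.which(configured)
    # The reported name follows whatever was configured, not the default.
    return os.path.basename(configured), abspath


def dependency_record(env_var: str, default_name: str) -> str:
    """Build one JSONL line describing the dependency (hypothetical shape)."""
    bin_name, abspath = resolve_binary(env_var, default_name)
    return json.dumps({"type": "Dependency", "bin_name": bin_name, "abspath": abspath})
```

With `WGET_BINARY=wget2` set, `resolve_binary("WGET_BINARY", "wget")` would report `wget2` as the bin_name, which is the "dynamic bin_name" behaviour the commit describes.
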
All snapshot hooks now:
- Read XYZ_BINARY env vars and use in cmd
- Output exactly one clean JSONL line (no RESULT_JSON= prefix)
- No extra output lines (VERSION=, START_TS=, etc.)
- Only provide allowed fields
- Don't include computed fields
- Python hooks include cmd array with binary path

- Add test_hooks.py with 31 unit tests covering:
  - Background hook detection (.bg. suffix)
  - JSONL parsing (clean format and legacy RESULT_JSON= format)
  - Install hook XYZ_BINARY env var handling
  - Hook discovery and sorting
  - get_extractor_name() function
  - Hook execution with real subprocesses
  - Install hook output format compliance
  - Snapshot hook output format compliance
  - Plugin metadata addition
- Update TODO_hook_architecture.md to mark all tasks complete:
  - Tests: 31 tests in archivebox/tests/test_hooks.py
  - Migrations: 0029 and 0030 applied successfully

All phases of the hook architecture implementation are now complete.

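
The two output formats mentioned above (clean JSONL and the legacy `RESULT_JSON=` prefix) could be parsed along these lines. This is only a sketch: `parse_hook_output` is a hypothetical name, and returning the last valid object when several are present is an assumption rather than documented behaviour.

```python
import json
from typing import Optional

LEGACY_PREFIX = "RESULT_JSON="


def parse_hook_output(stdout: str) -> Optional[dict]:
    """Return the last JSON object found in a hook's stdout, or None."""
    result = None
    for line in stdout.splitlines():
        line = line.strip()
        # Legacy hooks prefixed their result line; strip the prefix if present.
        if line.startswith(LEGACY_PREFIX):
            line = line[len(LEGACY_PREFIX):]
        # Skip anything that cannot be a JSON object (e.g. VERSION=... lines).
        if not line.startswith("{"):
            continue
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            result = parsed
    return result
```
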
Replace old `output` field with new fields across the codebase:
- output_str: Human-readable output summary
- output_json: Structured metadata (optional)
- output_files: Dict of output files with metadata
- output_size: Total size in bytes
- output_mimetypes: CSV of file mimetypes

Files updated:
- api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields
- api/v1_core.py: Update ArchiveResultFilterSchema to search output_str
- cli/archivebox_extract.py: Use output_str in CLI output
- core/admin_archiveresults.py: Update admin fields, search, and fieldsets
- core/admin_archiveresults.py: Fix output_html variable name bug in output_summary
- misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields
- plugins/extractor_utils.py: Update ExtractorResult helper class

The embed_path() method already uses output_files and output_str, so snapshot detail page and template tags work correctly.

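
To illustrate how the new fields relate to each other, the sketch below derives `output_size` and `output_mimetypes` from an `output_files` dict. The per-file metadata keys (`size`, `mimetype`) and the helper name `summarize_output_files` are assumptions; the commit only says that output_files is a dict of files with metadata.

```python
from typing import Any, Dict


def summarize_output_files(output_files: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Derive aggregate fields from a {relative_path: metadata} mapping.

    Assumes each metadata dict may carry "size" (bytes) and "mimetype".
    """
    total_size = sum(int(meta.get("size", 0)) for meta in output_files.values())
    # De-duplicate and sort so the CSV is stable between runs.
    mimetypes = sorted({meta["mimetype"] for meta in output_files.values() if meta.get("mimetype")})
    return {
        "output_size": total_size,
        "output_mimetypes": ",".join(mimetypes),
    }
```
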
… add chrome kill helper script
This implements the hook concurrency plan from TODO_hook_concurrency.md:

## Schema Changes
- Add Snapshot.current_step (IntegerField 0-9, default=0)
- Create migration 0034_snapshot_current_step.py
- Fix uuid_compat imports in migrations 0032 and 0003

## Core Logic
- Add extract_step(hook_name) utility - extracts step from __XX_ pattern
- Add is_background_hook(hook_name) utility - checks for .bg. suffix
- Update Snapshot.create_pending_archiveresults() to create one AR per hook
- Update ArchiveResult.run() to handle hook_name field
- Add Snapshot.advance_step_if_ready() method for step advancement
- Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready()

## Worker Coordination
- Update ArchiveResultWorker.get_queue() for step-based filtering
- ARs are only claimable when their step <= snapshot.current_step

## Hook Renumbering
- Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53, title→54, readability→55, headers→55, mercury→56, htmltotext→57
- Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg, forumdl→65.bg, papersdl→66.bg
- Step 7 (URL extraction): parse_* hooks moved to 70-75

Background hooks (.bg suffix) don't block step advancement, enabling long-running downloads to continue while other hooks proceed.

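
The step logic described above can be sketched as follows. `extract_step` and `is_background_hook` are named in the commit, but these bodies are a guess at their behaviour; `is_claimable` and `step_is_finished` are hypothetical helpers, the example hook filenames are illustrative, and the fallback step of 9 for names without a `__XX_` prefix is an arbitrary assumption.

```python
import re
from typing import Dict

# Two digits after a double underscore; the first digit is the step (0-9).
_STEP_RE = re.compile(r"__(\d)\d_")


def extract_step(hook_name: str) -> int:
    """Return the step encoded in a hook name like on_Snapshot__63_ytdlp.bg.py."""
    match = _STEP_RE.search(hook_name)
    # Assumption: hooks without a numeric prefix are treated as the last step.
    return int(match.group(1)) if match else 9


def is_background_hook(hook_name: str) -> bool:
    """Background hooks carry a .bg. marker in their filename."""
    return ".bg." in hook_name


def is_claimable(hook_name: str, current_step: int) -> bool:
    """A result may be claimed once its step <= the snapshot's current step."""
    return extract_step(hook_name) <= current_step


def step_is_finished(hooks: Dict[str, bool], current_step: int) -> bool:
    """hooks maps hook_name -> finished?; background hooks never block."""
    return all(
        done
        for name, done in hooks.items()
        if extract_step(name) == current_step and not is_background_hook(name)
    )
```

In this sketch a still-running `63_ytdlp.bg` download does not stop step 6 from being considered finished, which matches the stated intent that background hooks do not block step advancement.
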
All hook utility tests pass (extract_step, is_background_hook, discover_hooks). Model fields and methods verified (current_step, hook_name, advance_step_if_ready).
Restored 10 folder status functions that were accidentally removed:
- get_indexed_folders, get_archived_folders, get_unarchived_folders
- get_present_folders, get_valid_folders, get_invalid_folders
- get_duplicate_folders, get_orphaned_folders
- get_corrupted_folders, get_unrecognized_folders

These are required by archivebox_status.py for the status command. Added safety checks for non-existent archive directories.

This reverts commit 32bcf08.
Remove imports of deleted folder utility functions and rewrite status command to query Snapshot model directly. This aligns with the fs_version refactor where the DB is the single source of truth.
- Use Snapshot.objects queries for indexed/archived/unarchived counts
- Scan filesystem directly for present/orphaned directory counts
- Simplify output to focus on essential status information

- Rename archivebox/plugins/media/ → archivebox/plugins/ytdlp/
- Rename hook script on_Snapshot__63_media.bg.py → on_Snapshot__63_ytdlp.bg.py
- Update config.json: YTDLP_* as primary keys, MEDIA_* as x-aliases
- Update templates CSS classes: media-* → ytdlp-*
- Fix gallerydl bug: remove incorrect dependency on media plugin output
- Update all codebase references to use YTDLP_* and SAVE_YTDLP
- Add backwards compatibility test for MEDIA_ENABLED alias

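
As a rough illustration of the alias mechanism, the sketch below resolves config values by trying the primary key first and then any `x-aliases`. The schema layout (one entry per key with optional `default` and `x-aliases`), the key `YTDLP_ENABLED`, and the function name `resolve_config` are assumptions; only the `x-aliases` idea and the `MEDIA_ENABLED` alias come from the commit message.

```python
from typing import Any, Dict, Mapping


def resolve_config(schema: Mapping[str, Mapping[str, Any]], env: Mapping[str, str]) -> Dict[str, Any]:
    """Resolve each primary key from env, falling back to aliases, then defaults."""
    resolved: Dict[str, Any] = {}
    for key, spec in schema.items():
        # The primary key wins over any legacy alias when both are set.
        for name in [key, *spec.get("x-aliases", [])]:
            if name in env:
                resolved[key] = env[name]
                break
        else:
            if "default" in spec:
                resolved[key] = spec["default"]
    return resolved
```

Under these assumptions, an old deployment that still sets `MEDIA_ENABLED=false` would keep working, with the value surfacing under the new primary key.
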
- Move hardcoded default args from Python to config.json YTDLP_ARGS
- Add get_ytdlp_args() function to read from YTDLP_ARGS env var
- Keep format arg with max_size in code (depends on YTDLP_MAX_SIZE)
- YTDLP_ARGS can be overridden as JSON array in environment

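
Reading a JSON array of arguments from the environment might look like the sketch below. `get_ytdlp_args` is named in the commit, but this body, its optional parameters, and the fall-back-to-defaults behaviour on malformed input are assumptions for illustration only.

```python
import json
import os
from typing import List, Mapping, Optional, Sequence


def get_ytdlp_args(
    env: Optional[Mapping[str, str]] = None,
    defaults: Sequence[str] = (),
) -> List[str]:
    """Return extra CLI args from YTDLP_ARGS (a JSON array), else the defaults."""
    env = os.environ if env is None else env
    raw = env.get("YTDLP_ARGS", "").strip()
    if not raw:
        return list(defaults)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: malformed overrides are ignored rather than fatal.
        return list(defaults)
    if not isinstance(parsed, list):
        return list(defaults)
    # Coerce every element to str so the result can be passed to subprocess.
    return [str(arg) for arg in parsed]
```
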
I'm closing this as I've made a ton of changes on

Is there a PR I can follow?
WIP Release Candidate for ArchiveBox version 0.8.0.

Try this release early using `docker` or `pip`:

New Features
- `generic_jsonl` parser (thanks @jimwins!)
- `feedparser` for RSS parsing (thanks @jimwins!)
- `Snapshot` detail page header expanded/collapsed state
- `/data/archive/`, `/data`, and `/data/archive` in Docker and warn if running low on disk space
- `/browsers` chown on Docker
- `armv7` entrypoint failing
- `is_staff` and `is_superuser` flags during LDAP first auth

Bugfixes
- `RESOLUTION` being ignored when using Chrome headless in Docker
- `:80` or `:443` port is present in the original URL
- `/var/spool/cron/crontabs` permissions when mounting it via Docker

Warning

This release drops Docker support for `arm/v7` (e.g. older 32-bit Raspberry Pis). You can still run ArchiveBox using the `pip`-install method, or build your own Docker images, but we will no longer offer pre-built images for older CPUs.