v0.8 Release Candidate #1311

pirate · 2024-01-05T03:59:59Z

WIP Release Candidate for ArchiveBox version 0.8.0.

Try this release early using docker or pip:

# with docker (pre-built)
docker pull archivebox/archivebox:dev

# with docker (built from source)
docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev

# with pip (built from source)
pip install 'git+https://github.com/pirate/ArchiveBox@dev'

New Features

support for NFS/SMB/S3/B2/Google Drive/Dropbox/etc. Remote Storage
upgrade to Django 4.2 (thanks @jimwins!)
add new generic_jsonl parser (thanks @jimwins!)
switch to feedparser for RSS parsing (thanks @jimwins!)
remember Snapshot detail page header expanded/collapsed state
allow more restrictive NFS permission coercion on ./data/archive
check /, /data, and /data/archive in Docker and warn if running low on disk space
fix /browsers chown on Docker armv7 entrypoint failing
disable chrome automatic self-updating when running headless
Add ability to populate is_staff and is_superuser flags during LDAP first auth
add gitea and other domains to default GIT_DOMAINS list to run git archiving on
bump dependency versions

Bugfixes

fix RESOLUTION being ignored when using Chrome headless in Docker
fix sorting by Size / Files in the Admin Snapshots list page UI
fix spinner icon showing on some Snapshots instead of favicon when only a few extractors are enabled
fix yt-dlp sometimes failing to archive media due to filenames being too long or containing special characters
fix wget extractor not finding output when :80 or :443 port is present in the original URL
fix /var/spool/cron/crontabs permissions when mounting it via Docker

Warning

This release drops Docker support for arm/v7 (e.g. older 32-bit Raspberry Pis). You can still run ArchiveBox using the pip-install method, or build your own Docker images, but we will no longer offer pre-built images for older CPUs.

- Rename 13 on_Crawl__00_validate_* hooks to on_Crawl__00_install_*
- This better reflects what these hooks actually do (check/install binaries)
- Update TODO_hook_architecture.md to reflect renamed hooks

- All install hooks now respect their respective XYZ_BINARY env vars (e.g., WGET_BINARY, CHROME_BINARY, YTDLP_BINARY, etc.)
- Support both absolute paths (/usr/bin/wget2) and binary names (wget2)
- Dynamic bin_name used in Dependency JSONL output
- Updated 11 install hooks to follow the new pattern
- Mark checklist items as complete in TODO_hook_architecture.md
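As a rough illustration of the "absolute path or binary name" handling described above, here is a minimal sketch. `resolve_binary` and its return shape are hypothetical names chosen for this example, not the actual install hook code; only the env var convention (WGET_BINARY etc.) comes from the commit message.

```python
import os
import shutil
from pathlib import Path
from typing import Optional, Tuple


def resolve_binary(env_var: str, default_name: str) -> Tuple[str, Optional[str]]:
    """Return (bin_name, abspath) for the binary configured via env_var.

    Hypothetical helper for illustration only; not ArchiveBox's real API.
    abspath is None when the binary cannot be found.
    """
    configured = os.environ.get(env_var, "").strip() or default_name
    if os.path.isabs(configured):
        # Absolute path such as /usr/bin/wget2: accept it only if executable.
        abspath = configured if os.access(configured, os.X_OK) else None
        return Path(configured).name, abspath
    # Bare name such as wget2: look it up on $PATH.
    return configured, shutil.which(configured)
```

With `WGET_BINARY=/usr/bin/wget2` this would report `wget2` as the dynamic bin_name, and with `WGET_BINARY=wget2` it would search `$PATH` for the same name.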

All snapshot hooks now:
- Read XYZ_BINARY env vars and use in cmd
- Output exactly one clean JSONL line (no RESULT_JSON= prefix)
- No extra output lines (VERSION=, START_TS=, etc.)
- Only provide allowed fields
- Don't include computed fields
- Python hooks include cmd array with binary path

- Add test_hooks.py with 31 unit tests covering:
  - Background hook detection (.bg. suffix)
  - JSONL parsing (clean format and legacy RESULT_JSON= format)
  - Install hook XYZ_BINARY env var handling
  - Hook discovery and sorting
  - get_extractor_name() function
  - Hook execution with real subprocesses
  - Install hook output format compliance
  - Snapshot hook output format compliance
  - Plugin metadata addition
- Update TODO_hook_architecture.md to mark all tasks complete:
  - Tests: 31 tests in archivebox/tests/test_hooks.py
  - Migrations: 0029 and 0030 applied successfully

All phases of the hook architecture implementation are now complete.
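For readers unfamiliar with the two output formats mentioned above, the following sketch shows one way a parser could accept both a clean JSONL line and the legacy `RESULT_JSON=` prefix. The function name `parse_hook_output` and the `status` field in the usage example are assumptions made for illustration; the real parser in the codebase may differ.

```python
import json
from typing import Optional

LEGACY_PREFIX = "RESULT_JSON="


def parse_hook_output(stdout: str) -> Optional[dict]:
    """Return the last JSON object found in a hook's stdout, or None.

    Hypothetical sketch: accepts clean JSONL lines and legacy
    RESULT_JSON={...} lines, and skips everything else.
    """
    result = None
    for line in stdout.splitlines():
        line = line.strip()
        if line.startswith(LEGACY_PREFIX):
            line = line[len(LEGACY_PREFIX):]
        if not line.startswith("{"):
            continue  # skip non-JSON lines such as VERSION=... or START_TS=...
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore malformed lines rather than failing the hook
        if isinstance(parsed, dict):
            result = parsed
    return result
```

For example, `parse_hook_output('VERSION=1.2\nRESULT_JSON={"status": "failed"}')` and `parse_hook_output('{"status": "failed"}')` would both yield the same dict under this sketch.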

Replace old `output` field with new fields across the codebase:
- output_str: Human-readable output summary
- output_json: Structured metadata (optional)
- output_files: Dict of output files with metadata
- output_size: Total size in bytes
- output_mimetypes: CSV of file mimetypes

Files updated:
- api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields
- api/v1_core.py: Update ArchiveResultFilterSchema to search output_str
- cli/archivebox_extract.py: Use output_str in CLI output
- core/admin_archiveresults.py: Update admin fields, search, and fieldsets
- core/admin_archiveresults.py: Fix output_html variable name bug in output_summary
- misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields
- plugins/extractor_utils.py: Update ExtractorResult helper class

The embed_path() method already uses output_files and output_str, so snapshot detail page and template tags work correctly.
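To make the relationship between the new fields concrete, here is a small sketch of how output_size and output_mimetypes could be derived from output_files. The per-file metadata keys (`size`, `mimetype`) and the helper name `summarize_output_files` are assumptions for this example; the commit message does not specify the dict's exact shape.

```python
from typing import Dict, Tuple


def summarize_output_files(output_files: Dict[str, dict]) -> Tuple[int, str]:
    """Derive (output_size, output_mimetypes) from an output_files dict.

    Hypothetical sketch: assumes each value looks like
    {"size": <bytes>, "mimetype": "<type/subtype>"}.
    """
    # Total size in bytes across all output files.
    total_size = sum(int(meta.get("size", 0)) for meta in output_files.values())
    # De-duplicated, sorted mimetypes joined as a CSV string.
    mimetypes = sorted(
        {meta["mimetype"] for meta in output_files.values() if meta.get("mimetype")}
    )
    return total_size, ",".join(mimetypes)
```

Sorting the mimetypes keeps the CSV stable between runs, which is convenient if the value is stored and compared later; that choice is this sketch's, not necessarily the codebase's.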

… add chrome kill helper script

This implements the hook concurrency plan from TODO_hook_concurrency.md:

## Schema Changes
- Add Snapshot.current_step (IntegerField 0-9, default=0)
- Create migration 0034_snapshot_current_step.py
- Fix uuid_compat imports in migrations 0032 and 0003

## Core Logic
- Add extract_step(hook_name) utility - extracts step from __XX_ pattern
- Add is_background_hook(hook_name) utility - checks for .bg. suffix
- Update Snapshot.create_pending_archiveresults() to create one AR per hook
- Update ArchiveResult.run() to handle hook_name field
- Add Snapshot.advance_step_if_ready() method for step advancement
- Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready()

## Worker Coordination
- Update ArchiveResultWorker.get_queue() for step-based filtering
- ARs are only claimable when their step <= snapshot.current_step

## Hook Renumbering
- Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53, title→54, readability→55, headers→55, mercury→56, htmltotext→57
- Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg, forumdl→65.bg, papersdl→66.bg
- Step 7 (URL extraction): parse_* hooks moved to 70-75

Background hooks (.bg suffix) don't block step advancement, enabling long-running downloads to continue while other hooks proceed.
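The step-based scheduling above can be sketched roughly as follows. The function names extract_step and is_background_hook come from the commit message, but their bodies here are guesses: this sketch assumes the step is the first digit of the two-digit `__XX_` prefix (so 50-57 map to step 5), that None is returned when no prefix is found, and that `is_claimable` is a hypothetical stand-in for the queue filter.

```python
import re
from typing import Optional

# Matches the two-digit number in names like on_Snapshot__63_ytdlp.bg.py
STEP_PATTERN = re.compile(r"__(\d)(\d)_")


def extract_step(hook_name: str) -> Optional[int]:
    """Return the step (tens digit of the __XX_ prefix), or None if absent."""
    match = STEP_PATTERN.search(hook_name)
    return int(match.group(1)) if match else None


def is_background_hook(hook_name: str) -> bool:
    """Background hooks carry a .bg. marker before the file extension."""
    return ".bg." in hook_name


def is_claimable(hook_name: str, current_step: int) -> bool:
    """Hypothetical queue filter: claimable once step <= snapshot.current_step."""
    step = extract_step(hook_name)
    return step is not None and step <= current_step
```

Under these assumptions, a snapshot at current_step 5 would let workers claim `on_Snapshot__50_singlefile.py` but not `on_Snapshot__61_wget.py` until the snapshot advances to step 6.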

All hook utility tests pass (extract_step, is_background_hook, discover_hooks). Model fields and methods verified (current_step, hook_name, advance_step_if_ready).

Restored 10 folder status functions that were accidentally removed:
- get_indexed_folders, get_archived_folders, get_unarchived_folders
- get_present_folders, get_valid_folders, get_invalid_folders
- get_duplicate_folders, get_orphaned_folders
- get_corrupted_folders, get_unrecognized_folders

These are required by archivebox_status.py for the status command. Added safety checks for non-existent archive directories.

This reverts commit 32bcf08.

Remove imports of deleted folder utility functions and rewrite status command to query Snapshot model directly. This aligns with the fs_version refactor where the DB is the single source of truth.
- Use Snapshot.objects queries for indexed/archived/unarchived counts
- Scan filesystem directly for present/orphaned directory counts
- Simplify output to focus on essential status information
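The filesystem half of that status rewrite might look something like the sketch below. It leaves out the Django queries entirely and takes the set of indexed directory names as a plain argument. The helper name `scan_archive_dirs` is hypothetical, and "orphaned" is assumed here to mean a directory on disk with no matching DB entry.

```python
from pathlib import Path
from typing import Iterable, Set, Tuple


def scan_archive_dirs(
    archive_dir: Path, indexed_names: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (present, orphaned) directory names under archive_dir.

    Hypothetical sketch: present = directories that exist on disk,
    orphaned = present directories not listed in indexed_names.
    """
    if not archive_dir.is_dir():
        # Safety check for a missing archive directory.
        return set(), set()
    present = {entry.name for entry in archive_dir.iterdir() if entry.is_dir()}
    orphaned = present - set(indexed_names)
    return present, orphaned
```

In the real command the indexed names would presumably come from a Snapshot query, since the DB is treated as the single source of truth.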

…more fixes

- Rename archivebox/plugins/media/ → archivebox/plugins/ytdlp/
- Rename hook script on_Snapshot__63_media.bg.py → on_Snapshot__63_ytdlp.bg.py
- Update config.json: YTDLP_* as primary keys, MEDIA_* as x-aliases
- Update templates CSS classes: media-* → ytdlp-*
- Fix gallerydl bug: remove incorrect dependency on media plugin output
- Update all codebase references to use YTDLP_* and SAVE_YTDLP
- Add backwards compatibility test for MEDIA_ENABLED alias
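The primary-key-plus-alias behavior can be sketched like this. The `ALIASES` mapping and `get_config_value` helper are hypothetical and only illustrate the lookup order (new YTDLP_* key first, then the legacy MEDIA_* alias); how config.json's x-aliases are actually resolved is not shown in this PR text.

```python
from typing import Dict, Mapping, Optional, Sequence

# Hypothetical alias table: primary key -> legacy names still accepted.
ALIASES: Dict[str, Sequence[str]] = {
    "YTDLP_ENABLED": ("MEDIA_ENABLED",),
}


def get_config_value(
    key: str, environ: Mapping[str, str], default: Optional[str] = None
) -> Optional[str]:
    """Look up key, falling back to any legacy alias, then to default."""
    if key in environ:
        return environ[key]  # the primary key always wins
    for alias in ALIASES.get(key, ()):
        if alias in environ:
            return environ[alias]  # backwards-compatible legacy name
    return default
```

An existing setup that only sets `MEDIA_ENABLED=False` would keep working under this scheme, while a config that sets both would follow the YTDLP_ENABLED value.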

- Move hardcoded default args from Python to config.json YTDLP_ARGS
- Add get_ytdlp_args() function to read from YTDLP_ARGS env var
- Keep format arg with max_size in code (depends on YTDLP_MAX_SIZE)
- YTDLP_ARGS can be overridden as JSON array in environment
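A minimal sketch of reading YTDLP_ARGS as a JSON array follows. The function name get_ytdlp_args comes from the commit message, but this body, the optional `environ` parameter, the fallback behavior on invalid input, and the placeholder default list are all assumptions made for illustration.

```python
import json
import os
from typing import List, Mapping, Optional

# Placeholder default; the real defaults live in config.json per the commit.
DEFAULT_YTDLP_ARGS: List[str] = ["--write-info-json"]


def get_ytdlp_args(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return yt-dlp args from the YTDLP_ARGS env var (a JSON array).

    Falls back to DEFAULT_YTDLP_ARGS when the variable is unset, empty,
    not valid JSON, or not a JSON array (an assumption of this sketch).
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("YTDLP_ARGS", "").strip()
    if not raw:
        return list(DEFAULT_YTDLP_ARGS)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return list(DEFAULT_YTDLP_ARGS)
    if not isinstance(parsed, list):
        return list(DEFAULT_YTDLP_ARGS)
    # Subprocess arguments must be strings, so coerce numbers etc.
    return [str(arg) for arg in parsed]
```

An override would then look like `YTDLP_ARGS='["--no-progress", "--retries", 3]'` in the environment, with the size-dependent format argument still appended separately in code.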

pirate · 2025-12-29T19:53:42Z

I'm closing this as I've made a ton of changes on dev and am now targeting the next release to be 0.9.0 instead.

Finkregh · 2025-12-29T20:19:17Z

Is there a PR I can follow?

pirate added the status: wip and expected: next release labels Jan 10, 2024

pirate self-assigned this Jan 10, 2024

claude and others added 27 commits December 27, 2025 10:06

Rename validate hooks to install hooks

8c846b7

tweak concurrency for more speed

9b533ad

way better plugin hooks system wip

50e527e

rename extractor to plugin everywhere

bd265c0

move todos

d2e65cf

continue renaming extractor to plugin, add plan for hook concurrency,…

4ccb086

… add chrome kill helper script

minor bugfixes

b1e3546

Mark hook renumbering testing as complete in TODO

6b3c872

Revert "Restore missing folder utility functions"

767458e

This reverts commit 32bcf08.

fix final_status uneeded

6d991a0

wip

f0aa19f

improve plugin tests and config

1e4d3ff

use full dotted paths for all archivebox imports, add migrations and …

f4e7820

…more fixes

add ci for parallel tests

9487f8a

much better tests and add page ui

30c60ee

pirate closed this Dec 29, 2025

11 participants