@pirate pirate commented Jan 5, 2024

WIP Release Candidate for ArchiveBox version 0.8.0.

Try this release early using docker or pip:

# with docker (pre-built)
docker pull archivebox/archivebox:dev

# with docker (built from source)
docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev

# with pip (built from source)
pip install 'git+https://github.com/pirate/ArchiveBox@dev'

New Features

  • support for NFS/SMB/S3/B2/Google Drive/Dropbox/etc. Remote Storage
  • upgrade to Django 4.2 (thanks @jimwins!)
  • add new generic_jsonl parser (thanks @jimwins!)
  • switch to feedparser for RSS parsing (thanks @jimwins!)
  • remember Snapshot detail page header expanded/collapsed state
  • allow more restrictive NFS permission coercion on ./data/archive
  • check /, /data, and /data/archive in Docker and warn if running low on disk space
  • fix /browsers chown on Docker armv7 entrypoint failing
  • disable chrome automatic self-updating when running headless
  • add ability to populate is_staff and is_superuser flags during LDAP first auth
  • add gitea and other domains to default GIT_DOMAINS list to run git archiving on
  • bump dependency versions

Bugfixes

  • fix RESOLUTION being ignored when using Chrome headless in Docker
  • fix sorting by Size / Files in the Admin Snapshots list page UI
  • fix spinner icon showing on some Snapshots instead of favicon when only a few extractors are enabled
  • fix yt-dlp sometimes failing to archive media due to filenames being too long or containing special characters
  • fix wget extractor not finding output when :80 or :443 port is present in the original URL
  • fix /var/spool/cron/crontabs permissions when mounting it via Docker

Warning

This release drops Docker support for arm/v7 (e.g. older 32-bit Raspberry Pis). You can still run ArchiveBox using the pip-install method, or build your own Docker images, but we will no longer offer pre-built images for older CPUs.

@pirate pirate added status: wip Work is in-progress / has already been partially completed expected: next release labels Jan 10, 2024
@pirate pirate self-assigned this Jan 10, 2024
claude and others added 27 commits December 27, 2025 10:06
- Rename 13 on_Crawl__00_validate_* hooks to on_Crawl__00_install_*
- This better reflects what these hooks actually do (check/install binaries)
- Update TODO_hook_architecture.md to reflect renamed hooks
- All install hooks now respect their respective XYZ_BINARY env vars
  (e.g., WGET_BINARY, CHROME_BINARY, YTDLP_BINARY, etc.)
- Support both absolute paths (/usr/bin/wget2) and binary names (wget2)
- Dynamic bin_name used in Dependency JSONL output
- Updated 11 install hooks to follow the new pattern
- Mark checklist items as complete in TODO_hook_architecture.md
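The binary lookup described above (an `XYZ_BINARY` env var holding either an absolute path or a bare name) could look roughly like the sketch below. The function name `resolve_binary` and its return shape are assumptions made for illustration; they are not taken from the ArchiveBox codebase.

```python
import os
import shutil
from pathlib import Path
from typing import Optional, Tuple


def resolve_binary(env_var: str, default_name: str) -> Tuple[str, Optional[str]]:
    """Return (bin_name, abspath) for the binary configured in env_var.

    Hypothetical helper: the env var may hold an absolute path
    (/usr/bin/wget2) or a bare name (wget2). If it is unset or empty,
    default_name is used. abspath is None when nothing executable is found.
    """
    configured = os.environ.get(env_var, "").strip() or default_name

    if os.path.isabs(configured):
        # Absolute path: report the file name as bin_name, verify it is executable.
        path = Path(configured)
        found = str(path) if path.is_file() and os.access(path, os.X_OK) else None
        return path.name, found

    # Bare name: search $PATH for it.
    return configured, shutil.which(configured)
```

An install hook could then call something like `resolve_binary("WGET_BINARY", "wget")` and use the returned name as the dynamic `bin_name` in its JSONL output.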
All snapshot hooks now:
- Read XYZ_BINARY env vars and use in cmd
- Output exactly one clean JSONL line (no RESULT_JSON= prefix)
- No extra output lines (VERSION=, START_TS=, etc.)
- Only provide allowed fields
- Don't include computed fields
- Python hooks include cmd array with binary path
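A minimal sketch of how a caller might read that single JSONL line while still tolerating the legacy `RESULT_JSON=` prefix. The function name `parse_hook_stdout` and the `status` key used in the usage note are hypothetical; the real parser and record schema may differ.

```python
import json
from typing import Optional

LEGACY_PREFIX = "RESULT_JSON="


def parse_hook_stdout(stdout: str) -> Optional[dict]:
    """Return the first JSON object found in a hook's stdout, or None.

    Accepts both the clean format (a bare JSON object on one line) and
    the legacy format (the same object prefixed with RESULT_JSON=).
    """
    for line in stdout.splitlines():
        line = line.strip()
        if line.startswith(LEGACY_PREFIX):
            line = line[len(LEGACY_PREFIX):]
        if not line.startswith("{"):
            continue  # skip legacy noise such as VERSION= or START_TS= lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            return record
    return None
```

For example, `parse_hook_stdout('RESULT_JSON={"status": "succeeded"}')` and `parse_hook_stdout('{"status": "succeeded"}')` both yield the same dict.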
- Add test_hooks.py with 31 unit tests covering:
  - Background hook detection (.bg. suffix)
  - JSONL parsing (clean format and legacy RESULT_JSON= format)
  - Install hook XYZ_BINARY env var handling
  - Hook discovery and sorting
  - get_extractor_name() function
  - Hook execution with real subprocesses
  - Install hook output format compliance
  - Snapshot hook output format compliance
  - Plugin metadata addition

- Update TODO_hook_architecture.md to mark all tasks complete:
  - Tests: 31 tests in archivebox/tests/test_hooks.py
  - Migrations: 0029 and 0030 applied successfully

All phases of the hook architecture implementation are now complete.
Replace old `output` field with new fields across the codebase:
- output_str: Human-readable output summary
- output_json: Structured metadata (optional)
- output_files: Dict of output files with metadata
- output_size: Total size in bytes
- output_mimetypes: CSV of file mimetypes

Files updated:
- api/v1_core.py: Update MinimalArchiveResultSchema to expose new fields
- api/v1_core.py: Update ArchiveResultFilterSchema to search output_str
- cli/archivebox_extract.py: Use output_str in CLI output
- core/admin_archiveresults.py: Update admin fields, search, and fieldsets
- core/admin_archiveresults.py: Fix output_html variable name bug in output_summary
- misc/jsonl.py: Update archiveresult_to_jsonl() to include new fields
- plugins/extractor_utils.py: Update ExtractorResult helper class

The embed_path() method already uses output_files and output_str,
so snapshot detail page and template tags work correctly.
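To illustrate how the new fields relate to each other, here is a rough sketch in which `output_size` and `output_mimetypes` are derived from `output_files`. The class name `OutputSummary`, the per-file metadata keys (`size`, `mimetype`), and the choice to derive the two summary fields are all assumptions for illustration, not the actual model definition.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OutputSummary:
    """Hypothetical stand-in for the new ArchiveResult output fields."""

    output_str: str = ""                                # human-readable summary
    output_json: Optional[dict] = None                  # optional structured metadata
    output_files: dict = field(default_factory=dict)    # {relative_path: {"size": int, "mimetype": str}}

    @property
    def output_size(self) -> int:
        # Total size in bytes across all output files.
        return sum(int(meta.get("size", 0)) for meta in self.output_files.values())

    @property
    def output_mimetypes(self) -> str:
        # CSV of the distinct mimetypes, sorted for stable output.
        types = {meta["mimetype"] for meta in self.output_files.values() if meta.get("mimetype")}
        return ",".join(sorted(types))
```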
<!-- IMPORTANT: Do not submit PRs with only formatting / PEP8 / line
length changes. -->

# Summary

<!--e.g. This PR fixes ABC or adds the ability to do XYZ...-->

# Related issues

<!-- e.g. #123 or Roadmap goal #
https://github.com/pirate/ArchiveBox/wiki/Roadmap -->

# Changes these areas

- [ ] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk
This implements the hook concurrency plan from TODO_hook_concurrency.md:

## Schema Changes
- Add Snapshot.current_step (IntegerField 0-9, default=0)
- Create migration 0034_snapshot_current_step.py
- Fix uuid_compat imports in migrations 0032 and 0003

## Core Logic
- Add extract_step(hook_name) utility - extracts step from __XX_ pattern
- Add is_background_hook(hook_name) utility - checks for .bg. suffix
- Update Snapshot.create_pending_archiveresults() to create one AR per hook
- Update ArchiveResult.run() to handle hook_name field
- Add Snapshot.advance_step_if_ready() method for step advancement
- Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready()
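The two utilities named above can be sketched as follows, assuming the step is the first digit of the two-digit `__XX_` prefix (so hook 63 belongs to step 6). The fallback to step 0 for names without a numeric prefix is an assumption, as is the example hook name `on_Crawl__00_install_wget.py`.

```python
import re

# Matches the two-digit ordering prefix in names like on_Snapshot__63_ytdlp.bg.py
_STEP_RE = re.compile(r"__(\d)(\d)_")


def extract_step(hook_name: str) -> int:
    """Return the step (0-9) encoded as the first digit of the __XX_ prefix."""
    match = _STEP_RE.search(hook_name)
    # Assumption: hooks without a numeric prefix are treated as step 0.
    return int(match.group(1)) if match else 0


def is_background_hook(hook_name: str) -> bool:
    """Return True if the hook name carries the .bg. marker."""
    return ".bg." in hook_name
```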

## Worker Coordination
- Update ArchiveResultWorker.get_queue() for step-based filtering
- ARs are only claimable when their step <= snapshot.current_step

## Hook Renumbering
- Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53,
  title→54, readability→55, headers→55, mercury→56, htmltotext→57
- Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg,
  forumdl→65.bg, papersdl→66.bg
- Step 7 (URL extraction): parse_* hooks moved to 70-75

Background hooks (.bg suffix) don't block step advancement, enabling
long-running downloads to continue while other hooks proceed.
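The advancement rule described here can be modelled without Django, as in the sketch below. The `Result` dataclass and the free-standing `claimable` and `advance_step_if_ready` functions are simplified, hypothetical stand-ins for the model methods; only the rules themselves (ARs are claimable when their step <= current_step, and background hooks do not block advancement) come from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    """Simplified stand-in for one ArchiveResult (one per hook)."""
    step: int
    background: bool
    finished: bool


def claimable(results: List[Result], current_step: int) -> List[Result]:
    # A worker may only claim unfinished results whose step has been reached.
    return [r for r in results if not r.finished and r.step <= current_step]


def advance_step_if_ready(current_step: int, results: List[Result], max_step: int = 9) -> int:
    """Return the next step if no foreground result at or below current_step is pending."""
    blocking = [
        r for r in results
        if r.step <= current_step and not r.background and not r.finished
    ]
    if blocking or current_step >= max_step:
        return current_step
    return current_step + 1
```

With this model, an unfinished background download at step 6 leaves the snapshot free to move on to step 7 once every foreground hook at step 6 has finished.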
All hook utility tests pass (extract_step, is_background_hook, discover_hooks).
Model fields and methods verified (current_step, hook_name, advance_step_if_ready).
Restored 10 folder status functions that were accidentally removed:
- get_indexed_folders, get_archived_folders, get_unarchived_folders
- get_present_folders, get_valid_folders, get_invalid_folders
- get_duplicate_folders, get_orphaned_folders
- get_corrupted_folders, get_unrecognized_folders

These are required by archivebox_status.py for the status command.
Added safety checks for non-existent archive directories.
Remove imports of deleted folder utility functions and rewrite
status command to query Snapshot model directly. This aligns with
the fs_version refactor where the DB is the single source of truth.

- Use Snapshot.objects queries for indexed/archived/unarchived counts
- Scan filesystem directly for present/orphaned directory counts
- Simplify output to focus on essential status information
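The filesystem half of that status check might be sketched as below. It assumes each snapshot directory under the archive folder is named after an identifier the DB knows about (for example a timestamp), that "present" means a directory with a matching DB row, and that "orphaned" means a directory without one. The function name `scan_archive_dirs` is hypothetical.

```python
from pathlib import Path
from typing import Dict, Set


def scan_archive_dirs(archive_dir: Path, known_ids: Set[str]) -> Dict[str, int]:
    """Count directories on disk that do / do not match a known snapshot.

    known_ids would come from the database in real use, e.g. a
    Snapshot.objects query (assumption: directory names match a DB field).
    """
    # Safety check: a missing archive directory yields zero counts, not an error.
    if not archive_dir.is_dir():
        return {"present": 0, "orphaned": 0}

    on_disk = {entry.name for entry in archive_dir.iterdir() if entry.is_dir()}
    return {
        "present": len(on_disk & known_ids),   # on disk and in the DB
        "orphaned": len(on_disk - known_ids),  # on disk but unknown to the DB
    }
```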
- Rename archivebox/plugins/media/ → archivebox/plugins/ytdlp/
- Rename hook script on_Snapshot__63_media.bg.py → on_Snapshot__63_ytdlp.bg.py
- Update config.json: YTDLP_* as primary keys, MEDIA_* as x-aliases
- Update templates CSS classes: media-* → ytdlp-*
- Fix gallerydl bug: remove incorrect dependency on media plugin output
- Update all codebase references to use YTDLP_* and SAVE_YTDLP
- Add backwards compatibility test for MEDIA_ENABLED alias
- Move hardcoded default args from Python to config.json YTDLP_ARGS
- Add get_ytdlp_args() function to read from YTDLP_ARGS env var
- Keep format arg with max_size in code (depends on YTDLP_MAX_SIZE)
- YTDLP_ARGS can be overridden as JSON array in environment
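A rough sketch of reading `YTDLP_ARGS` as a JSON array from the environment. The fallback behaviour (return the supplied defaults when the variable is unset, malformed, or not a list) is an assumption, and the defaults passed in the usage note are placeholders rather than the values shipped in config.json.

```python
import json
import os
from typing import List, Sequence


def get_ytdlp_args(defaults: Sequence[str] = ()) -> List[str]:
    """Return yt-dlp args from the YTDLP_ARGS env var, or the given defaults.

    YTDLP_ARGS is expected to be a JSON array, e.g. '["--no-playlist"]'.
    """
    raw = os.environ.get("YTDLP_ARGS", "").strip()
    if not raw:
        return list(defaults)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return list(defaults)  # assumption: ignore malformed overrides
    if not isinstance(parsed, list):
        return list(defaults)
    # Coerce every element to str so the result is safe to pass to subprocess.
    return [str(arg) for arg in parsed]
```

For instance, with `YTDLP_ARGS='["--no-playlist", "--retries", 3]'` set in the environment, `get_ytdlp_args(["--write-info-json"])` returns the three overridden args as strings instead of the placeholder default.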
pirate commented Dec 29, 2025

I'm closing this as I've made a ton of changes on dev and am now targeting the next release to be 0.9.0 instead.

@pirate pirate closed this Dec 29, 2025
@Finkregh

Is there a PR I can follow?
