Skip to content

Conversation

@pirate
Copy link
Member

@pirate pirate commented Dec 28, 2025

Summary

Related issues

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

This implements the hook concurrency plan from TODO_hook_concurrency.md:

## Schema Changes
- Add Snapshot.current_step (IntegerField 0-9, default=0)
- Create migration 0034_snapshot_current_step.py
- Fix uuid_compat imports in migrations 0032 and 0003

## Core Logic
- Add extract_step(hook_name) utility - extracts step from __XX_ pattern
- Add is_background_hook(hook_name) utility - checks for .bg. suffix
- Update Snapshot.create_pending_archiveresults() to create one AR per hook
- Update ArchiveResult.run() to handle hook_name field
- Add Snapshot.advance_step_if_ready() method for step advancement
- Integrate with SnapshotMachine.is_finished() to call advance_step_if_ready()

## Worker Coordination
- Update ArchiveResultWorker.get_queue() for step-based filtering
- ARs are only claimable when their step <= snapshot.current_step

## Hook Renumbering
- Step 5 (DOM extraction): singlefile→50, screenshot→51, pdf→52, dom→53,
  title→54, readability→55, headers→55, mercury→56, htmltotext→57
- Step 6 (post-DOM): wget→61, git→62, media→63.bg, gallerydl→64.bg,
  forumdl→65.bg, papersdl→66.bg
- Step 7 (URL extraction): parse_* hooks moved to 70-75

Background hooks (.bg suffix) don't block step advancement, enabling
long-running downloads to continue while other hooks proceed.
All hook utility tests pass (extract_step, is_background_hook, discover_hooks).
Model fields and methods verified (current_step, hook_name, advance_step_if_ready).
Restored 10 folder status functions that were accidentally removed:
- get_indexed_folders, get_archived_folders, get_unarchived_folders
- get_present_folders, get_valid_folders, get_invalid_folders
- get_duplicate_folders, get_orphaned_folders
- get_corrupted_folders, get_unrecognized_folders

These are required by archivebox_status.py for the status command.
Added safety checks for non-existent archive directories.
Remove imports of deleted folder utility functions and rewrite
status command to query Snapshot model directly. This aligns with
the fs_version refactor where the DB is the single source of truth.

- Use Snapshot.objects queries for indexed/archived/unarchived counts
- Scan filesystem directly for present/orphaned directory counts
- Simplify output to focus on essential status information
@pirate pirate merged commit 54f91c1 into dev Dec 28, 2025
2 checks passed
@pirate pirate deleted the claude/review-snapshot-archive-RJOkr branch December 28, 2025 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants