feat(sync): sort files newest-first for faster salience on recent content #964

Closed
garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:feat/sync-newest-first

@garrytan-agents (Contributor) commented May 13, 2026

Problem

Sync processes files in git diff order (alphabetical), so meetings/2020-* is embedded before meetings/2026-*. After a burst of writes, new pages can stay invisible to search for hours while pages that sort earlier alphabetically are processed first.

Real-world impact: divorce attorney consultation pages written on May 11 were never found by gbrain search because the sync lock got stuck, and when sync finally ran, it would have processed them last (alphabetically after thousands of older pages).

Fix

Sort addsAndMods descending in both:

  • Incremental sync (sync.ts line 690) — the git-diff path
  • Full import (import.ts) — the runImport walker

Brain paths are date-prefixed by convention (meetings/2026-05-13-*, daily/2026-05-13.md), so lexicographic descending naturally prioritizes recent content.
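
A minimal sketch of the descending sort described above. The function name mirrors the helper mentioned later in this thread, but the body and the sample paths are assumptions for illustration, not the code in this PR:

```typescript
// Hypothetical helper: returns a copy of `paths` in descending lexicographic order.
function sortNewestFirst(paths: readonly string[]): string[] {
  // Plain < / > compares UTF-16 code units, so the order does not depend on locale.
  return [...paths].sort((a, b) => (a < b ? 1 : a > b ? -1 : 0));
}

// Sample paths are made up; they only follow the date-prefix convention.
const sample = [
  "meetings/2020-01-15-kickoff.md",
  "daily/2026-05-13.md",
  "meetings/2026-05-13-review.md",
  "meetings/2024-11-02-planning.md",
];

console.log(sortNewestFirst(sample));
// [
//   "meetings/2026-05-13-review.md",
//   "meetings/2024-11-02-planning.md",
//   "meetings/2020-01-15-kickoff.md",
//   "daily/2026-05-13.md"
// ]
```

Note that the directory name is compared before the date, so the result is newest-first within each directory rather than globally: here every meetings/ path precedes daily/2026-05-13.md.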

Impact

  • Zero behavior change for completed syncs (same pages processed, just different order)
  • Interrupted or slow syncs process the newest pages first, so recent content becomes searchable soonest
  • Two lines of actual logic (.sort() calls), rest is comments

Testing

  • Verified sort order with sample brain paths
  • No existing tests assert processing order (order was undefined/alphabetical before)

…tent

Problem: sync processes files in git-diff order (alphabetical), so
meetings/2020-* embeds before meetings/2026-*. After a burst of writes,
new pages can be invisible to search for hours while older pages process first.

Fix: sort addsAndMods descending in both incremental sync and full import.
Brain paths are date-prefixed by convention, so lexicographic descending
naturally prioritizes recent content.

This ensures the most relevant pages become searchable first.
@garrytan (Owner) commented

Cherry-picked into #988 on the garrytan/bangalore-v4 branch so CI runs fully (garrytan-agents PRs don't get full CI). Closing this PR; continuing review/merge there.

@garrytan garrytan closed this May 14, 2026
garrytan added a commit that referenced this pull request May 15, 2026
Replace gbrain import's positional `processedIndex` checkpoint with a
path-set checkpoint via `src/core/import-checkpoint.ts`. A file is only
"done" when its processFile returns success — failed files never enter
the set, parallel workers can't lose slow files, and sort-order changes
don't drop the newest N files on resume.

Three bug classes fixed:
- Parallel import + slow worker = silent file drop on crash-resume
- Failed file = checkpoint advanced past it, never retried until manual clear
- Sort-order flip (v0.33.x) = cross-version resume drops newest N files
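
To make the sort-flip and failed-file cases concrete, here is a small sketch comparing a positional resume with a path-set resume. All names and paths are hypothetical and do not come from `src/core/import-checkpoint.ts`:

```typescript
// Hypothetical illustration; none of these names come from import-checkpoint.ts.
const files = ["notes/2024-01-01.md", "notes/2025-01-01.md", "notes/2026-01-01.md"];

// Earlier run: ascending order, crashed after finishing two files.
const ascending = [...files].sort();
const processedIndex = 2;
const finished = ascending.slice(0, processedIndex);

// Later run after the sort flip: same files, descending order.
const descending = [...ascending].reverse();

// Positional resume skips the first `processedIndex` entries of the NEW order,
// which are the two newest files, one of which was never processed.
const positionalResume = descending.slice(processedIndex);

// Path-set resume skips exactly the files that finished, whatever the order.
const done = new Set(finished);
const pathSetResume = descending.filter((p) => !done.has(p));

// A path enters the set only after processing succeeds, so failures stay retryable.
async function processAll(
  paths: string[],
  processFile: (path: string) => Promise<boolean>,
  completed: Set<string>,
): Promise<void> {
  for (const path of paths) {
    if (completed.has(path)) continue;
    if (await processFile(path)) completed.add(path);
  }
}

console.log({ positionalResume, pathSetResume });
// positionalResume: ["notes/2024-01-01.md"]  (re-does an old file, never reaches 2026)
// pathSetResume:    ["notes/2026-01-01.md"]  (only the unfinished file)
```

Because membership is keyed by path rather than position, the same set also stays correct when parallel workers finish out of order.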

Old positional checkpoints are detected on first resume and discarded
with a stderr log line. Re-walking is cheap because content_hash
short-circuits unchanged files.
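
One way the legacy-checkpoint detection could look, as a sketch only. The JSON shapes (`processedIndex`, `completedPaths`, `dir`), the function name, and the log text are all assumptions; the real on-disk format is not shown in this thread:

```typescript
// Assumed shape of a path-set checkpoint (hypothetical field names).
type PathCheckpoint = { dir: string; completedPaths: string[] };

// Returns the set of completed paths, or an empty set when the checkpoint
// is unreadable, legacy (positional), or written for a different directory.
function loadCompletedPaths(raw: string, dir: string): Set<string> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return new Set();
  }
  if (typeof parsed !== "object" || parsed === null) return new Set();

  const cp = parsed as Partial<PathCheckpoint> & { processedIndex?: unknown };
  if (!Array.isArray(cp.completedPaths)) {
    if ("processedIndex" in cp) {
      // Old positional format: discard it and let the walker start over.
      console.error("import: discarding legacy positional checkpoint; re-walking");
    }
    return new Set();
  }
  // Assumed guard: ignore a checkpoint that belongs to another import directory.
  if (cp.dir !== dir) return new Set();
  return new Set(cp.completedPaths.filter((p) => typeof p === "string"));
}

const legacy = loadCompletedPaths('{"processedIndex": 120}', "/tmp/example-brain");
console.log(legacy.size); // 0
```

Returning an empty set is safe under the assumption stated in the commit message: a full re-walk is cheap because unchanged files are skipped by content hash.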

Also extracts the descending-lex sort into src/core/sort-newest-first.ts
so import.ts and sync.ts share a single source of truth.

Tests:
- test/sort-newest-first.test.ts (5 hermetic cases)
- test/import-checkpoint.test.ts (18 unit cases over the helpers)
- test/import-resume.test.ts (refactored — GBRAIN_HOME isolation,
  drives runImport against PGLite, 5 integration cases including
  SLUG_MISMATCH retry regression)

Includes the original sort-newest-first contribution from
@garrytan-agents's PR #964 (commit 8dbcf6a).
garrytan added a commit that referenced this pull request May 15, 2026
…drop + failed-file-skip + sort-flip bugs (#988)

* feat(sync): sort files newest-first for faster salience on recent content

* feat(import): path-based checkpoint resume + sort-newest-first helper

* chore: bump version and changelog (v0.34.2.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: update project documentation for v0.34.2.0

Add CLAUDE.md Key Files entries for the path-based import checkpoint
work: new entries for src/core/import-checkpoint.ts and
src/core/sort-newest-first.ts, plus a dedicated src/commands/import.ts
entry covering the v0.34.2.0 refactor. Update src/commands/sync.ts
entry to reference sortNewestFirst. Regenerate llms-full.txt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(tests): swap banned /data/brain placeholder for /tmp/example-brain

scripts/check-privacy.sh banlist includes /data/brain/ (legacy private
OpenClaw fork layout). New test files must not use it — CI privacy
guard caught this on PR #988's first push.

No behavior change. test/import-checkpoint.test.ts is unit-level with
no fs access; the dir string is just an identity marker for the
loadCheckpoint dir-mismatch guard.

---------

Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>