archive/ content is ingested but silently hard-excluded from all search by default (in-repo should mean findable) #1777

@garrytan-agents

Content under archive/ is silently unfindable: ingested but hard-excluded from all search by default

Summary

A markdown page that lives in the brain repo under archive/ cannot be found by query/search/ask unless the caller explicitly opts the prefix back in. There is no signal at call time that an entire subtree was withheld from the result set. For a large brain this produces a class of "the page exists in the repo but the agent swears it doesn't exist" failures — the agent queries, gets nothing, and confidently concludes the content isn't there.

The guiding principle this violates: if it's committed to the brain repo, it should be embedded, graphed, and findable by default. Findability should be opt-out, not opt-in.

Root cause

Two independent gates, only the second of which is the problem:

  1. Ingest gate: isSyncable / pruneDir in src/core/sync.ts. Prune set is small (node_modules, .raw, ops). archive/ is not pruned here, so archive pages are eligible for embedding + graphing on sync. (Good.)

  2. Search gate: DEFAULT_HARD_EXCLUDES in src/core/search/source-boost.ts:

    export const DEFAULT_HARD_EXCLUDES: string[] = [
      'test/',
      'archive/',
      'attachments/',
      '.raw/',
    ];

    resolveHardExcludes() unions this into every search and subtracts matching slug prefixes from results unless a caller passes include_slug_prefixes: ['archive/']. So archive/ pages are embedded but invisible to the default search path, with no diagnostic emitted.
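To make the search-gate behavior concrete, here is a minimal sketch of the union-then-subtract logic described above. It is not the actual implementation: the function names `resolveHardExcludesSketch` and `dropExcluded`, the options shape, and the `exclude_slug_prefixes` field are assumptions for illustration. Only `DEFAULT_HARD_EXCLUDES` and `include_slug_prefixes` come from the source.

```typescript
// Hypothetical sketch; the real resolveHardExcludes() signature may differ.
const DEFAULT_HARD_EXCLUDES: string[] = ['test/', 'archive/', 'attachments/', '.raw/'];

interface SearchOptsSketch {
  exclude_slug_prefixes?: string[]; // assumed: extra caller-supplied excludes
  include_slug_prefixes?: string[]; // opt-in named in this issue
}

// Union the defaults with caller excludes, then drop anything the caller opted back in.
function resolveHardExcludesSketch(opts: SearchOptsSketch = {}): string[] {
  const optedIn = new Set(opts.include_slug_prefixes ?? []);
  const union = new Set([...DEFAULT_HARD_EXCLUDES, ...(opts.exclude_slug_prefixes ?? [])]);
  return Array.from(union).filter((prefix) => !optedIn.has(prefix));
}

// Silently subtract every result whose slug starts with an excluded prefix.
function dropExcluded<T extends { slug: string }>(results: T[], excludes: string[]): T[] {
  return results.filter((r) => !excludes.some((prefix) => r.slug.startsWith(prefix)));
}
```

Under these assumptions, a default call keeps `'archive/'` in the exclude list, and the caller gets back a shorter result array with nothing to indicate that rows were removed.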

test/, attachments/, and .raw/ are legitimately noise. archive/ is not categorically noise — it routinely holds high-signal historical content (imported conversation exports, prior-system logs, older notes) that users absolutely expect to retrieve.

Impact

  • Whole high-value subtrees are unreachable via the primary retrieval surface; only raw grep over the working tree finds them.
  • The exclusion is invisible: an empty result set is indistinguishable from "withheld by a hardcoded prefix," so callers (human or agent) wrongly conclude the content does not exist.
  • Scales with brain size — the bigger and older the brain, the more lands in archive/ and disappears.

Repro

  1. Have any page at archive/.../something.md that is synced/embedded.
  2. gbrain search "<an exact phrase from that page>" → no results.
  3. gbrain query "<topic of that page>" → no results.
  4. The page is present on disk and (verify this) also present in pages; only the search gate hides it.

Proposed fix

Prefer demotion over exclusion so the content stays findable but never dominates curated pages:

  • Remove 'archive/' from DEFAULT_HARD_EXCLUDES (keep test/, attachments/, .raw/).
  • Add a demotion boost instead, e.g. in DEFAULT_SOURCE_BOOSTS:
    'archive/': 0.5, // findable, ranked below curated content
  • Run a sync/backfill to confirm existing archive/ pages are embedded + graphed (not just disk-present).
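A rough sketch of what demotion could look like, assuming boosts are multiplicative factors keyed by slug prefix and that the longest matching prefix wins. Both of those are assumptions, as are the names `SOURCE_BOOSTS_SKETCH`, `boostFor`, and `rankWithBoosts`; the real `DEFAULT_SOURCE_BOOSTS` handling may work differently.

```typescript
// Hypothetical: assumes boosts multiply the raw relevance score.
const SOURCE_BOOSTS_SKETCH: Record<string, number> = {
  'archive/': 0.5, // value proposed in this issue
};

interface ScoredHit {
  slug: string;
  score: number;
}

// Pick the boost of the longest matching prefix; unmatched slugs keep factor 1.
function boostFor(slug: string, boosts: Record<string, number>): number {
  let best: string | undefined;
  for (const prefix of Object.keys(boosts)) {
    if (slug.startsWith(prefix) && (best === undefined || prefix.length > best.length)) {
      best = prefix;
    }
  }
  return best === undefined ? 1 : boosts[best];
}

// Apply boosts and re-sort descending, so archive hits stay in the list but rank lower.
function rankWithBoosts(hits: ScoredHit[], boosts: Record<string, number>): ScoredHit[] {
  return hits
    .map((hit) => ({ ...hit, score: hit.score * boostFor(hit.slug, boosts) }))
    .sort((a, b) => b.score - a.score);
}
```

With this shape, an archive page with a strong raw score still appears in results, but a curated page with a moderately lower raw score outranks it.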

Secondary hardening (independent of the above):

  • Make withholding observable. When resolveHardExcludes() drops results, return a count/flag in the response payload (e.g. excluded_by_prefix: { "archive/": N }) so callers can tell "no matches" apart from "matches hidden by policy." Silent subtraction is the deeper bug.
  • Optionally expose an effective-config command (gbrain config excludes) that prints the active hard-exclude + boost map, so the retrieval policy isn't buried in source.
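As an illustration of the observability idea, the filter could count what it drops instead of discarding it silently. This is a sketch under assumptions: `applyHardExcludesSketch` and the response shape are hypothetical, and only the `excluded_by_prefix` field name is taken from the suggestion above.

```typescript
interface Hit {
  slug: string;
  score: number;
}

interface SearchResponseSketch {
  results: Hit[];
  // Field name suggested in this issue; maps each exclude prefix to the number of hits withheld.
  excluded_by_prefix: Record<string, number>;
}

// Hypothetical filter that records how many hits each hard-exclude prefix removed.
function applyHardExcludesSketch(hits: Hit[], excludes: string[]): SearchResponseSketch {
  const results: Hit[] = [];
  const excluded_by_prefix: Record<string, number> = {};
  for (const hit of hits) {
    const prefix = excludes.find((p) => hit.slug.startsWith(p));
    if (prefix === undefined) {
      results.push(hit);
    } else {
      excluded_by_prefix[prefix] = (excluded_by_prefix[prefix] ?? 0) + 1;
    }
  }
  return { results, excluded_by_prefix };
}
```

A caller that sees `results: []` alongside `excluded_by_prefix: { "archive/": 2 }` can then retry with the prefix opted in rather than concluding the content does not exist.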

Acceptance criteria

  • A synced page under archive/ is returned by query/search/ask by default (ranked, possibly demoted), with no special caller flag.
  • test/, attachments/, and .raw/ remain excluded.
  • Default search responses surface a machine-readable signal whenever any results were withheld by a hard-exclude prefix.
  • gbrain doctor (or equivalent) can report how many embedded pages are currently hidden from default search by prefix policy.
