Content under archive/ is silently unfindable: ingested but hard-excluded from all search by default
Summary
A markdown page that lives in the brain repo under archive/ cannot be found by query/search/ask unless the caller explicitly opts the prefix back in. There is no signal at call time that an entire subtree was withheld from the result set. For a large brain this produces a class of "the page exists in the repo but the agent swears it doesn't exist" failures — the agent queries, gets nothing, and confidently concludes the content isn't there.
The guiding principle this violates: if it's committed to the brain repo, it should be embedded, graphed, and findable by default. Findability should be opt-out, not opt-in.
Root cause
Two independent gates, only the second of which is the problem:
- Ingest gate: isSyncable / pruneDir in src/core/sync.ts. Prune set is small (node_modules, .raw, ops). archive/ is not pruned here, so archive pages are eligible for embedding + graphing on sync. (Good.)
- Search gate: DEFAULT_HARD_EXCLUDES in src/core/search/source-boost.ts:
export const DEFAULT_HARD_EXCLUDES: string[] = [
  'test/',
  'archive/',
  'attachments/',
  '.raw/',
];
resolveHardExcludes() unions this list into every search and drops any result whose slug starts with one of these prefixes, unless the caller passes include_slug_prefixes: ['archive/']. So archive/ pages are embedded but invisible to the default search path, with no diagnostic emitted.
test/, attachments/, and .raw/ are legitimately noise. archive/ is not categorically noise — it routinely holds high-signal historical content (imported conversation exports, prior-system logs, older notes) that users absolutely expect to retrieve.
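To make the search gate concrete, here is a minimal sketch of the behavior described above. It is not the real resolveHardExcludes() implementation: the Hit and SearchOptions types, the function names effectiveExcludes and applyHardExcludes, and the sample slugs are all hypothetical. Only DEFAULT_HARD_EXCLUDES and include_slug_prefixes come from this report.

```typescript
// Hypothetical sketch of the default search gate. Only DEFAULT_HARD_EXCLUDES and
// include_slug_prefixes are taken from the report; everything else is illustrative.
const DEFAULT_HARD_EXCLUDES: string[] = ['test/', 'archive/', 'attachments/', '.raw/'];

interface SearchOptions {
  include_slug_prefixes?: string[]; // prefixes the caller explicitly opts back in
}

interface Hit {
  slug: string;
  score: number;
}

// Effective exclude list: the defaults minus anything the caller re-included.
function effectiveExcludes(opts: SearchOptions = {}): string[] {
  const reIncluded = new Set(opts.include_slug_prefixes ?? []);
  return DEFAULT_HARD_EXCLUDES.filter((prefix) => !reIncluded.has(prefix));
}

// Drop every hit whose slug starts with an excluded prefix. Nothing is reported
// about what was dropped, which is the silent part of the problem.
function applyHardExcludes(hits: Hit[], opts: SearchOptions = {}): Hit[] {
  const excludes = effectiveExcludes(opts);
  return hits.filter((hit) => !excludes.some((prefix) => hit.slug.startsWith(prefix)));
}

const hits: Hit[] = [
  { slug: 'archive/2019/old-notes', score: 0.92 }, // made-up slug
  { slug: 'test/fixture', score: 0.4 },            // made-up slug
];

// Default call: both hits are withheld and the caller sees an empty array.
const byDefault = applyHardExcludes(hits);

// Explicit opt-in: the archive hit comes back, test/ stays excluded.
const withArchive = applyHardExcludes(hits, { include_slug_prefixes: ['archive/'] });
```

In this sketch the default call returns an empty array even though the best-scoring hit is an archive page, which is exactly the "page exists but search says it doesn't" failure.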
Impact
- Whole high-value subtrees are unreachable via the primary retrieval surface; only raw grep over the working tree finds them.
- The exclusion is invisible: a result set emptied by a hardcoded prefix looks identical to a genuine "no matches," so callers (human or agent) wrongly conclude the content does not exist.
- Scales with brain size: the bigger and older the brain, the more lands in archive/ and disappears.
Repro
- Have any page at archive/.../something.md that is synced/embedded.
- gbrain search "<an exact phrase from that page>" → no results.
- gbrain query "<topic of that page>" → no results.
- The page is present on disk and (verify) present in pages; only the search gate hides it.
Proposed fix
Prefer demotion over exclusion so the content stays findable but never dominates curated pages:
- Remove 'archive/' from DEFAULT_HARD_EXCLUDES (keep test/, attachments/, .raw/).
- Add a demotion boost instead, e.g. in DEFAULT_SOURCE_BOOSTS:
  'archive/': 0.5, // findable, ranked below curated content
- Run a sync/backfill to confirm existing archive/ pages are embedded + graphed (not just disk-present).
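As a rough illustration of demotion versus exclusion, the sketch below multiplies each hit's score by the boost of its longest matching prefix and re-sorts. The shape of DEFAULT_SOURCE_BOOSTS (a prefix-to-multiplier map), the Hit type, the helper names boostFor and rankWithBoosts, and the sample slugs are assumptions made for the example; the real boost mechanism in source-boost.ts may work differently.

```typescript
// Hypothetical sketch: the map shape and the multiplicative scoring are assumptions,
// not the actual implementation in src/core/search/source-boost.ts.
const DEFAULT_SOURCE_BOOSTS: Record<string, number> = {
  'archive/': 0.5, // findable, ranked below curated content
};

interface Hit {
  slug: string;
  score: number;
}

// Multiplier of the longest matching prefix, or 1 when no prefix matches.
function boostFor(slug: string, boosts: Record<string, number>): number {
  let bestPrefix = '';
  let factor = 1;
  for (const prefix of Object.keys(boosts)) {
    if (slug.startsWith(prefix) && prefix.length > bestPrefix.length) {
      bestPrefix = prefix;
      factor = boosts[prefix];
    }
  }
  return factor;
}

// Scale each score by its boost, then sort by the adjusted score, highest first.
function rankWithBoosts(hits: Hit[], boosts: Record<string, number> = DEFAULT_SOURCE_BOOSTS): Hit[] {
  return hits
    .map((hit) => ({ slug: hit.slug, score: hit.score * boostFor(hit.slug, boosts) }))
    .sort((a, b) => b.score - a.score);
}

const ranked = rankWithBoosts([
  { slug: 'archive/imports/chat-export', score: 0.9 }, // made-up slug
  { slug: 'notes/current-plan', score: 0.6 },          // made-up slug
]);
// The curated page now ranks first, but the archive page is still in the results.
```

The point of the design is that a strong archive match loses ranking position rather than visibility, so an agent that finds nothing else still sees it.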
Secondary hardening (independent of the above):
- Make withholding observable. When resolveHardExcludes() drops results, return a count/flag in the response payload (e.g. excluded_by_prefix: { "archive/": N }) so callers can tell "no matches" apart from "matches hidden by policy." Silent subtraction is the deeper bug.
- Optionally expose an effective-config command (gbrain config excludes) that prints the active hard-exclude + boost map, so the retrieval policy isn't buried in source.
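One possible shape for the observable variant is sketched below. The excluded_by_prefix field name comes from the proposal above; the SearchResponse and Hit types, the filterWithReport function, and the sample slugs are hypothetical and only illustrate counting withheld hits per prefix while filtering.

```typescript
// Hypothetical sketch: only the excluded_by_prefix field name is from the proposal;
// the response shape and function name are assumptions for illustration.
interface Hit {
  slug: string;
  score: number;
}

interface SearchResponse {
  results: Hit[];
  excluded_by_prefix: Record<string, number>; // prefix -> number of hits withheld
}

// Filter hits by prefix, but keep a per-prefix tally of what was withheld.
function filterWithReport(hits: Hit[], excludes: string[]): SearchResponse {
  const results: Hit[] = [];
  const excluded_by_prefix: Record<string, number> = {};
  for (const hit of hits) {
    const prefix = excludes.find((p) => hit.slug.startsWith(p));
    if (prefix === undefined) {
      results.push(hit);
    } else {
      excluded_by_prefix[prefix] = (excluded_by_prefix[prefix] ?? 0) + 1;
    }
  }
  return { results, excluded_by_prefix };
}

const response = filterWithReport(
  [
    { slug: 'archive/a', score: 0.9 },    // made-up slugs
    { slug: 'archive/b', score: 0.8 },
    { slug: 'test/fixture', score: 0.7 },
  ],
  ['test/', 'archive/', 'attachments/', '.raw/'],
);
// response.results is empty, but response.excluded_by_prefix records that
// two archive/ hits and one test/ hit were withheld by policy.
```

With a payload like this, a caller that gets zero results can check excluded_by_prefix before concluding the content does not exist, and retry with the prefix opted back in.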
Acceptance criteria
- A synced page under archive/ is returned by query/search/ask by default (ranked, possibly demoted), with no special caller flag.
- test/, attachments/, and .raw/ remain excluded.
- Default search responses surface a machine-readable signal whenever any results were withheld by a hard-exclude prefix.
- gbrain doctor (or equivalent) can report how many embedded pages are currently hidden from default search by prefix policy.