fix: warn when env vars override ze-switch embedding config #1421

Closed
garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:fix/ze-switch-env-override-warning

@garrytan-agents

Contributor

Problem

ze-switch writes the new embedding model to DB config and ~/.gbrain/config.json, but loadConfig() gives highest precedence to process.env.GBRAIN_EMBEDDING_MODEL. If the user has this env var set (common in .env files loaded at gateway/shell startup), the switch silently does nothing at runtime.

The user sees Switch status: applied and thinks it worked. But every gbrain command still uses the old OpenAI model because the env var overrides DB + file config.

Real-world impact: 716K chunks had to be re-embedded because the env override meant the switch to ZeroEntropy silently didn't take effect. Doctor reported openai:text-embedding-3-large dimension mismatch for hours before the root cause was traced to stale env vars.
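
To make the precedence problem concrete, here is a minimal sketch of layered config resolution. It is not the real loadConfig(): the `resolveEmbeddingModel` helper, the `ConfigLayers` shape, and the relative order of file and DB config are assumptions made for illustration. The only behavior taken from the description above is that the env var wins over both stores.

```typescript
// Hypothetical sketch of layered config resolution; NOT gbrain's real loadConfig().
type Env = Record<string, string | undefined>;

interface ConfigLayers {
  env: Env; // stand-in for process.env
  file?: { embedding_model?: string }; // assumed shape of ~/.gbrain/config.json
  db?: { embedding_model?: string }; // assumed shape of the DB config
}

interface Resolved {
  value: string | undefined;
  source: "env" | "file" | "db" | "none";
}

function resolveEmbeddingModel(layers: ConfigLayers): Resolved {
  // The env var is consulted first, so it shadows anything ze-switch wrote.
  const fromEnv = layers.env["GBRAIN_EMBEDDING_MODEL"];
  if (fromEnv) return { value: fromEnv, source: "env" };
  // Assumption: file config is consulted before DB config.
  const fromFile = layers.file?.embedding_model;
  if (fromFile) return { value: fromFile, source: "file" };
  const fromDb = layers.db?.embedding_model;
  if (fromDb) return { value: fromDb, source: "db" };
  return { value: undefined, source: "none" };
}

// ze-switch has updated both stores, but a stale env var is still exported.
const afterSwitch = resolveEmbeddingModel({
  env: { GBRAIN_EMBEDDING_MODEL: "openai:text-embedding-3-large" },
  file: { embedding_model: "zeroentropyai:zembed-1" },
  db: { embedding_model: "zeroentropyai:zembed-1" },
});
console.log(afterSwitch.source, afterSwitch.value);
```

In this sketch the resolved source is `env`, so the old model is still selected even though both stores carry the new one.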

Fix

1. ze-switch env override warning

After applying the switch, checks if GBRAIN_EMBEDDING_MODEL or GBRAIN_EMBEDDING_DIMENSIONS are set in the process environment and differ from the target. Prints a prominent warning box:

╔══════════════════════════════════════════════════════════════╗
║  ⚠️  ENV OVERRIDE DETECTED — ACTION REQUIRED                ║
╠══════════════════════════════════════════════════════════════╣
║  GBRAIN_EMBEDDING_MODEL is set in your environment:         ║
║    Current env: openai:text-embedding-3-large               ║
║    Switch target: zeroentropyai:zembed-1                    ║
║                                                              ║
║  The env var takes HIGHEST PRECEDENCE and will override      ║
║  this switch. Update your .env file or shell environment.    ║
║                                                              ║
║  Without this change, the switch has NO EFFECT at runtime.   ║
╚══════════════════════════════════════════════════════════════╝
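
The detection step behind this warning could look roughly like the sketch below. The `detectEnvOverrides` function and the `SwitchTarget` and `EnvOverride` types are hypothetical names, not taken from retrieval-upgrade-planner.ts, and the environment is passed in as a plain object so the logic stays testable.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not the PR's actual code.
type Env = Record<string, string | undefined>;

interface SwitchTarget {
  model: string;
  dimensions: number;
}

interface EnvOverride {
  name: string; // env var that will shadow the switch
  current: string; // value currently exported
  target: string; // value the switch is trying to apply
}

function detectEnvOverrides(env: Env, target: SwitchTarget): EnvOverride[] {
  const overrides: EnvOverride[] = [];

  const model = env["GBRAIN_EMBEDDING_MODEL"];
  if (model && model !== target.model) {
    overrides.push({ name: "GBRAIN_EMBEDDING_MODEL", current: model, target: target.model });
  }

  const dims = env["GBRAIN_EMBEDDING_DIMENSIONS"];
  // A non-numeric value parses to NaN, which never equals the target and is flagged too.
  if (dims && Number(dims) !== target.dimensions) {
    overrides.push({
      name: "GBRAIN_EMBEDDING_DIMENSIONS",
      current: dims,
      target: String(target.dimensions),
    });
  }

  return overrides;
}

// Example values only; the env dimensions here are illustrative.
const found = detectEnvOverrides(
  { GBRAIN_EMBEDDING_MODEL: "openai:text-embedding-3-large", GBRAIN_EMBEDDING_DIMENSIONS: "1536" },
  { model: "zeroentropyai:zembed-1", dimensions: 1280 },
);
console.log(found.map((o) => o.name));
```

An empty result means neither variable is set to a conflicting value, so no warning box needs to be printed.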

2. doctor env/DB mismatch detector

New embedding_env_override check: if the env var disagrees with DB config, warns with the exact fix. Surfaces the mismatch on every doctor run instead of reporting a confusing dimension error.
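
A minimal sketch of such a check follows. The `Check` shape and the `checkEmbeddingEnvOverride` function are assumptions for illustration; the real doctor.ts types are not shown in this PR, and the wording of the fix message is invented.

```typescript
// Hypothetical sketch; the Check shape below is assumed, not copied from doctor.ts.
type Env = Record<string, string | undefined>;

interface Check {
  name: string;
  status: "ok" | "warn";
  message: string;
}

function checkEmbeddingEnvOverride(env: Env, dbModel: string | undefined): Check {
  const name = "embedding_env_override";
  const envModel = env["GBRAIN_EMBEDDING_MODEL"];

  // Nothing to report if the env var is unset, the DB has no model, or both agree.
  if (!envModel || !dbModel || envModel === dbModel) {
    return { name, status: "ok", message: "no env override of the DB embedding model" };
  }

  return {
    name,
    status: "warn",
    message:
      `GBRAIN_EMBEDDING_MODEL=${envModel} overrides DB config (${dbModel}). ` +
      `Unset it, or set it to ${dbModel} in your .env file or shell.`,
  };
}

const mismatch = checkEmbeddingEnvOverride(
  { GBRAIN_EMBEDDING_MODEL: "openai:text-embedding-3-large" },
  "zeroentropyai:zembed-1",
);
console.log(mismatch.status, mismatch.message);
```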

Testing

  • Set GBRAIN_EMBEDDING_MODEL=openai:text-embedding-3-large in env
  • Run gbrain ze-switch --non-interactive --force
  • Verify warning box appears after "Switch status: applied"
  • Run gbrain doctor --fast
  • Verify embedding_env_override warning appears

Two fixes for the silent env-override bug where ze-switch writes to DB +
file config but GBRAIN_EMBEDDING_MODEL env var takes highest precedence
in loadConfig(), causing the switch to have no runtime effect.

1. ze-switch (retrieval-upgrade-planner.ts):
   After applying the switch, check if GBRAIN_EMBEDDING_MODEL or
   GBRAIN_EMBEDDING_DIMENSIONS are set in the environment and differ
   from the target. Print a prominent warning box with the exact env
   vars to update. Without this, users think the switch worked but
   nothing changed at runtime.

2. doctor (doctor.ts):
   New 'embedding_env_override' check detects when the env var
   disagrees with the DB config and warns with the fix command.
   This surfaces the mismatch on every hourly doctor run instead
   of just reporting a confusing dimension mismatch.

Real-world impact: 716K chunks had to be re-embedded because the
env override meant the switch to ZeroEntropy silently didn't take
effect. Doctor reported 'openai:text-embedding-3-large' dimension
mismatch for hours before the root cause was found.
@garrytan

Owner

Closing in favor of fix-wave incorporation on garrytan/pr-1414-1416-1421 (shipping as v0.41.2.1). Rebuilt with /plan-eng-review + codex outside-voice structural improvements:

  • Pre-apply refusal (with --ignore-env-override escape hatch) rather than post-apply warning — the 716K-chunk damage incident in your PR description proves the schema transition is the one-way door, so the gate must fire BEFORE any mutation (snapshot write, schema). Test asserts ZERO setConfig calls fire on a refused apply.
  • Gate also fires in resumeRetrievalUpgrade (codex caught the resume bypass that would have re-opened the same failure surface).
  • Gate fires BEFORE setConfig(KEY_PREVIOUS_SNAPSHOT) at line ~294 (not just before runSchemaTransition at line ~304) — refused = zero side effects.
  • ApplyResult tagged union extends with {status:'refused', reason:'env_override', warning}; CLI ze-switch.ts renders the ASCII box from formatEnvOverrideWarning — planner stays data-pure.
  • embedding_env_override doctor check uses Check.details (typed shape per existing Check.issues mismatch codex caught) and is wired into BOTH buildChecks() AND doctorReportRemote() (cross-surface parity).
  • ASCII box convention (per repo D10) replaces the Unicode box-drawing UI.
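
The gate-before-mutation design described in the bullets above can be sketched as follows. This is an illustration under stated assumptions, not the shipped code: `applySwitch`, `ApplyOptions`, the `EnvOverrideWarning` payload, and the config key names are hypothetical. Only the `refused` / `env_override` variant, the `formatEnvOverrideWarning` name, and the zero-setConfig-on-refusal contract come from the comment.

```typescript
// Hypothetical sketch of a pre-apply env-override gate; shapes are assumed.
type Env = Record<string, string | undefined>;

interface EnvOverrideWarning {
  variable: string;
  current: string;
  target: string;
}

type ApplyResult =
  | { status: "applied" }
  | { status: "refused"; reason: "env_override"; warning: EnvOverrideWarning };

interface ApplyOptions {
  targetModel: string;
  env: Env;
  ignoreEnvOverride?: boolean; // models the --ignore-env-override escape hatch
  setConfig: (key: string, value: string) => void;
}

function applySwitch(opts: ApplyOptions): ApplyResult {
  const current = opts.env["GBRAIN_EMBEDDING_MODEL"];
  // The gate runs before any mutation, so a refusal has zero side effects.
  if (!opts.ignoreEnvOverride && current && current !== opts.targetModel) {
    return {
      status: "refused",
      reason: "env_override",
      warning: { variable: "GBRAIN_EMBEDDING_MODEL", current, target: opts.targetModel },
    };
  }
  opts.setConfig("previous_snapshot", "(snapshot placeholder)"); // key name assumed
  opts.setConfig("embedding_model", opts.targetModel); // key name assumed
  return { status: "applied" };
}

// The planner returns data only; the CLI layer turns it into an ASCII box.
function formatEnvOverrideWarning(w: EnvOverrideWarning): string {
  const lines = [
    "ENV OVERRIDE DETECTED - ACTION REQUIRED",
    `${w.variable} is set in your environment:`,
    `  Current env:   ${w.current}`,
    `  Switch target: ${w.target}`,
    "Update your .env file or shell environment, then retry.",
  ];
  const width = Math.max(...lines.map((l) => l.length));
  const border = "+" + "-".repeat(width + 2) + "+";
  return [border, ...lines.map((l) => `| ${l.padEnd(width)} |`), border].join("\n");
}

const writes: string[] = [];
const result = applySwitch({
  targetModel: "zeroentropyai:zembed-1",
  env: { GBRAIN_EMBEDDING_MODEL: "openai:text-embedding-3-large" },
  setConfig: (key) => { writes.push(key); },
});
if (result.status === "refused") console.log(formatEnvOverrideWarning(result.warning));
```

Keeping the warning as structured data on the result lets a test assert on the refusal and on the absence of writes without parsing terminal output.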

Credit on the merged commit.

@garrytan closed this on May 25, 2026
@garrytan added a commit that referenced this pull request on May 25, 2026
…mpotency + ze-switch env-gate (#1445)

* fix-wave: dream.* DB merge + batch retry + extract_atoms idempotency + ze-switch env-gate + doctor check

Closes PRs #1414, #1416, #1421 (rebuilt from designs by @garrytan-agents
with structural improvements from /plan-eng-review + codex outside-voice).

Three production reliability fixes in one wave:

1. dream.* DB-config merge (closes PR #1416 silent-config gap)
   - loadConfigWithEngine() sparse-merge extends with 7 dream.* keys
   - File > DB > defaults precedence (no GBRAIN_DREAM_* env vars)
   - extract-atoms switches to loadConfigWithEngine() so DB-plane keys reach it

2. Batch retry on transient connection drops (closes PR #1416 ~30%-loss bug)
   - withRetry() pure primitive exported from src/commands/extract.ts
   - 6 flush() sites snapshot-before-clear with onRetry callback
   - Reuses isRetryableConnError from src/core/retry-matcher.ts
   - retry-matcher extended with GBrainError{problem:'No database connection'}

3. extract_atoms source-hash idempotency + page-based discovery (closes #1414)
   - One raw SQL with NOT EXISTS subquery replaces 6 listPages + N atom checks
   - sourceId threaded through every putPage call (codex caught real bug)
   - NULL content_hash filter + dream_generated exclusion + transcript-side idempotency
   - cycle.ts passes union of syncPagesAffected + synthesizeWrittenSlugs

4. ze-switch pre-apply + pre-resume env-override gate (closes PR #1421)
   - Gate fires FIRST in apply AND resume; zero setConfig calls on refusal
   - ASCII warning box (no Unicode per repo D10)
   - --ignore-env-override escape hatch for power users
   - ApplyResult extended with refused variant

5. doctor embedding_env_override check (defense-in-depth for #1421)
   - Cross-surface parity: buildChecks() + doctorReportRemote()
   - Uses Check.details (not Check.issues per codex schema review)

Co-Authored-By: garrytan-agents <garrytan-agents@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
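
As an illustration of item 2, a retry primitive with snapshot-before-clear flushing might be shaped like the sketch below. The signature of `withRetry`, the `RetryOptions` type, the `flush` helper, and the `looksTransient` predicate (a stand-in for isRetryableConnError) are all assumptions; the real code in src/commands/extract.ts may differ, for example by adding backoff delays, which are omitted here.

```typescript
// Hypothetical sketch of a retry primitive; signatures are assumed, not the repo's.
interface RetryOptions {
  retries: number; // extra attempts allowed after the first failure
  isRetryable: (err: unknown) => boolean;
  onRetry?: (attempt: number, err: unknown) => void;
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let attempt = 0;
  for (;;) {
    try {
      return await fn();
    } catch (err) {
      // Give up on non-transient errors or once the retry budget is spent.
      if (attempt >= opts.retries || !opts.isRetryable(err)) throw err;
      attempt += 1;
      opts.onRetry?.(attempt, err);
    }
  }
}

// Stand-in for isRetryableConnError: treats "connection" errors as transient.
const looksTransient = (err: unknown): boolean =>
  err instanceof Error && /connection/i.test(err.message);

// Snapshot-before-clear: copy the batch, then empty the buffer, so every retry
// replays exactly the same rows even if new rows arrive in the meantime.
async function flush(
  buffer: string[],
  write: (rows: string[]) => Promise<void>,
): Promise<number> {
  const snapshot = buffer.slice();
  buffer.length = 0;
  let retried = 0;
  await withRetry(() => write(snapshot), {
    retries: 2,
    isRetryable: looksTransient,
    onRetry: () => { retried += 1; },
  });
  return retried; // number of retries that were needed
}
```

If the batch were not copied before the buffer is cleared, a failed write followed by a retry would send an empty or partially refilled batch, which is one plausible way rows get lost on a connection drop.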

* chore: bump version and changelog (v0.41.10.0)

Adds 61 new tests across 5 new files pinning the fix-wave contracts:
- test/extract-batch-retry.test.ts (16 cases) — withRetry primitive + snapshot contract
- test/extract-atoms-page-discovery.test.ts (17 cases) — discovery SQL + dual-source idempotency
- test/ze-switch-env-override.test.ts (17 cases) — env-gate apply + resume + ZERO-setConfig assertion
- test/doctor-embedding-env-override.test.ts (7 cases) — cross-surface parity
- test/e2e/extract-atoms-discovery-sql.test.ts (4 cases) — real-Postgres parity for raw SQL

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): pin gateway to 1536-dim in 2 PGLite tests that hardcode 1536-vector inserts

CI shards 1 + 4 failed persistently (not flake — confirmed via retry) after the
v0.41.6.0 merge with this error:

  error: expected 1280 dimensions, not 1536
  file: "vector.c", routine: "CheckExpectedDim"

Two test files insert 1536-dim Float32Array vectors into `content_chunks.embedding`
/ `facts.embedding`, but v0.41.5.0 flipped `DEFAULT_EMBEDDING_DIMENSIONS` from
1536 to 1280 (ZE Matryoshka default). On a fresh CI bun process where no prior
test pre-configured the gateway, `initSchema()` sizes the vector column at
vector(1280) and the inserts throw.

Locally this is hidden when an earlier test file in the shard happens to have
called `configureGateway({embedding_dimensions: 1536})` — that state leaks
forward through bun's shared process. The v0.41.6.0 LPT shard re-balancing
reordered files so these two ran cold, surfacing the latent bug.

Fix follows the canonical hermetic pattern from
test/consolidate-valid-until.test.ts:23-34: pin the gateway to 1536d in
beforeAll, reset in afterAll. Test is now isolated from shard ordering.

  test/search-types-filter.test.ts     — shard 1 fail
  test/operations-find-trajectory.test.ts — shard 4 (6 fails)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
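
The state leak and the pin-and-reset fix described in this commit can be modeled with a small self-contained sketch. It deliberately avoids the real gateway, schema, and test runner: `gatewayDimensions`, `resetGateway`, `insertVector`, and `runPinned` are hypothetical stand-ins, and only the `configureGateway({embedding_dimensions: 1536})` call shape and the 1280 default are taken from the message above.

```typescript
// Hypothetical model of shared gateway state; not the repo's gateway module.
const DEFAULT_EMBEDDING_DIMENSIONS = 1280;
let gatewayDimensions = DEFAULT_EMBEDDING_DIMENSIONS; // process-wide state shared by test files

function configureGateway(opts: { embedding_dimensions: number }): void {
  gatewayDimensions = opts.embedding_dimensions;
}

function resetGateway(): void {
  gatewayDimensions = DEFAULT_EMBEDDING_DIMENSIONS;
}

// Stand-in for inserting into a vector column sized from the gateway setting.
function insertVector(v: Float32Array): void {
  if (v.length !== gatewayDimensions) {
    throw new Error(`expected ${gatewayDimensions} dimensions, not ${v.length}`);
  }
}

// Mirrors the beforeAll / afterAll pairing: pin, run the test body, always reset.
function runPinned(dimensions: number, body: () => void): void {
  configureGateway({ embedding_dimensions: dimensions });
  try {
    body();
  } finally {
    resetGateway();
  }
}

// Cold run: nothing configured the gateway earlier, so a 1536-dim insert fails.
let coldError = "";
try {
  insertVector(new Float32Array(1536));
} catch (err) {
  coldError = err instanceof Error ? err.message : String(err);
}
console.log(coldError);

// Pinned run: the same insert succeeds regardless of which file ran before it.
runPinned(1536, () => insertVector(new Float32Array(1536)));
```

Resetting in a `finally` block (or in afterAll in a real test file) keeps the pinned value from leaking forward into later files in the same process.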

* chore: empty commit to trigger CI

* chore: trigger CI again

* chore: renumber v0.41.10.0 -> v0.41.10.1

Per request — version slot moved to .1 micro tier to leave .0 available
for unrelated wave landing on master.

---------

Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>