Summary
Claude Code will read and echo the contents of secret-bearing files (.env, .dev.vars, credential files) into the conversation transcript, even when the user's CLAUDE.md contains explicit prohibitions against doing so. The model's safety reflex fires on output it has already produced, not on commands it is about to run, so the violation is detected only after the secret has already been written to chat history — at which point rotation is the only remedy.
Reproduction
- User has a global CLAUDE.md containing rules like: "NEVER output secret values into chat. NEVER read .env files unless strictly necessary, and never reproduce any values in your response."
- User asks Claude Code to help populate a .dev.vars file with API tokens.
- User accidentally pastes a malformed value (e.g. a curl example command) instead of a bare token.
- Claude Code attempts to diagnose by running a grep -n search against the file. grep -n prints the full matching line, including the real API token, into the tool result, which is rendered into the conversation transcript and persisted in history.
- Claude Code then notices the violation, apologizes, and instructs the user to rotate the exposed credential.
This is reproducible with any inspection command that can echo file contents: Read, cat, head, tail, grep without -c, sed/awk patterns that print full lines, etc. The model knows the rules but does not pre-screen its own commands against them.
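The split between value-echoing and value-stripping inspection can be shown with a minimal sketch (pure-Python stand-ins for grep -n and grep -c; the file contents and token are fabricated placeholders, not real values):

```python
# Minimal illustration: a line-printing inspector leaks the secret value,
# while a count-only inspector reveals only that a match exists.
DEV_VARS = 'API_TOKEN="cf_fake_token_1234567890abcdef"\nDEBUG="true"\n'

def grep_n(pattern: str, text: str) -> str:
    """Stand-in for grep -n: emits full matching lines with line numbers."""
    return "\n".join(
        f"{i}:{line}"
        for i, line in enumerate(text.splitlines(), start=1)
        if pattern in line
    )

def grep_c(pattern: str, text: str) -> str:
    """Stand-in for grep -c: emits only the match count."""
    return str(sum(pattern in line for line in text.splitlines()))

leaky = grep_n("API_TOKEN", DEV_VARS)  # contains the token value
safe = grep_c("API_TOKEN", DEV_VARS)   # contains only "1"
```

Any inspection command in the first category puts the secret into the tool result; any in the second answers the same diagnostic question ("is the value present, and where?") without echoing it.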
Why the existing safeguards don't work
- CLAUDE.md instructions are advisory — they shape intent but don't block tool calls at execution time.
- The model's safety check runs on generated output it can see, not on the side effects of tool calls it's about to issue.
- "Operational reflexes" (e.g. defaulting to grep -n for line numbers) override the rule because the rule is stored as guidance, not as a command-level filter.
- The remediation flow ("apologize and ask the user to rotate") fires only after the credential is already in conversation history, server logs, and any downstream telemetry. By then the damage is done.
Proposed fix
A defense-in-depth approach at the harness level, not the prompt level:
- Path-based pre-execution filter. When a Bash/Read/Grep tool call targets a path matching a known secret pattern (.env*, .dev.vars*, *.pem, *.key, credentials*.json, *secret*, *token*), require the command to be on an allowlist of value-stripping operations (wc -l, grep -c, stat, length-only awk patterns, blind Edit with literal old_string). Block everything else by default.
- Output-side scrubber. Before tool results are appended to the conversation, run a regex pass over their stdout looking for high-entropy strings and known token prefixes (sk-, sk-ant-, cfut_, ghp_, xox[bp]-, AWS access keys, JWT shapes, etc.), and redact each match with a <REDACTED:reason> placeholder. The model should see the redaction marker, not the value.
- First-class "this file contains secrets" annotation. When the model has reason to believe a file contains secrets (filename, prior context, user statement), surface that as a structured flag in its working state, not just as a soft instruction in the prompt.
- Make the pre-execution filter overridable but loud. A user can explicitly opt-out for a one-shot command, but the override should be per-command, logged, and never the default.
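The path-based pre-execution filter above can be sketched as follows. All names (SECRET_GLOBS, allow_tool_call, the allowlist contents) are illustrative assumptions, not an existing Claude Code API:

```python
# Sketch of a path-based pre-execution filter: a tool call touching a
# secret-pattern path is allowed only if its command starts with an
# allowlisted value-stripping operation. Deny by default; an explicit
# per-command override is possible but should be logged and never implicit.
import fnmatch
import posixpath

SECRET_GLOBS = [".env*", ".dev.vars*", "*.pem", "*.key",
                "credentials*.json", "*secret*", "*token*"]
VALUE_STRIPPING = ["wc -l", "grep -c", "stat"]  # allowlist of safe operations

def touches_secret_path(path: str) -> bool:
    name = posixpath.basename(path)
    return any(fnmatch.fnmatch(name, g) for g in SECRET_GLOBS)

def allow_tool_call(command: str, path: str, override: bool = False) -> bool:
    """Return True if the harness should execute this tool call."""
    if override:  # explicit one-shot opt-out; log it upstream
        return True
    if not touches_secret_path(path):
        return True
    return any(command.strip().startswith(op) for op in VALUE_STRIPPING)
```

With this in place, `allow_tool_call("wc -l .dev.vars", ".dev.vars")` passes while `allow_tool_call("grep -n API_TOKEN .dev.vars", ".dev.vars")` is blocked before any output exists to leak.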
The key principle: the model should not be trusted to police its own tool calls against secrets. The harness should enforce it.
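The output-side scrubber described above could look like the following sketch. The regexes and redaction format are illustrative; a production version would need a maintained ruleset and an entropy heuristic for unprefixed secrets:

```python
# Sketch of an output-side scrubber: replace likely secret values in tool
# stdout with <REDACTED:reason> markers before the result reaches the
# conversation. Pattern list is illustrative, not exhaustive.
import re

TOKEN_PATTERNS = [
    (re.compile(r"\bsk-ant-[A-Za-z0-9_-]{10,}"), "anthropic-key"),
    (re.compile(r"\bsk-[A-Za-z0-9_-]{10,}"), "api-key"),
    (re.compile(r"\bghp_[A-Za-z0-9]{20,}"), "github-pat"),
    (re.compile(r"\bxox[bp]-[A-Za-z0-9-]{10,}"), "slack-token"),
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "aws-access-key"),
    (re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"), "jwt"),
]

def scrub(stdout: str) -> str:
    """Replace likely secret values with <REDACTED:reason> markers."""
    for pattern, reason in TOKEN_PATTERNS:
        stdout = pattern.sub(f"<REDACTED:{reason}>", stdout)
    return stdout
```

For example, `scrub('1:API_TOKEN="ghp_abcdefghij0123456789"')` yields `1:API_TOKEN="<REDACTED:github-pat>"`: the model still learns that line 1 holds a token of the expected shape, which is enough to diagnose a malformed value without ever seeing it.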
User impact
This is not a hypothetical. Real reproduction in a real session today:
- A Cloudflare API token was written to chat history while diagnosing a malformed .dev.vars file.
- The token had to be rotated immediately, requiring re-issuance, re-pasting into the file, and re-verification.
- The user had to interrupt their actual work (deploying a Cloudflare Worker) to deal with the cleanup.
- The exposed value exists in conversation history, transmission logs, and any retained telemetry between the user's machine and Anthropic's servers — beyond the user's direct control to scrub.
For developers using Claude Code on client work, the consequences scale fast: a leaked production credential during a paid engagement is a trust-and-reputation event, not just a chore. The implicit cost of this class of bug — minutes per incident, multiplied across the developer base, multiplied by the percentage of incidents that involve real production secrets — is substantial. Anthropic should consider whether affected users are owed credit for time lost to cleanup caused by the agent's own failure to honor its documented safety rules, in addition to fixing the underlying bug.
Environment
- Claude Code CLI on Windows 11 (MINGW64 / Git Bash)
- Model: Claude Opus 4.6 (1M context)
- Project has a global CLAUDE.md with explicit secret-handling rules
- User confirmed reproducibility immediately after the incident