fix: improve policy persistence error handling and diagnostics (#19919)#19921
fix: improve policy persistence error handling and diagnostics (#19919)#19921Sikandar1310291 wants to merge 3 commits into
Conversation
…e-gemini#19919) - Include actual error cause in user-visible feedback message - Clean up orphaned tmp files when write operations fail - Abort persistence on non-ENOENT read errors (e.g. EACCES) to prevent silently overwriting existing policy files - Add EXDEV cross-device rename fallback (copy + unlink) - Fix existing tests to use proper ENOENT errors - Add 4 new tests for error diagnostics, cleanup, EACCES abort, and EXDEV fallback
Summary of ChangesHello @Sikandar1310291, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness and user experience of the policy persistence mechanism. It addresses several edge cases and common failure scenarios by providing more detailed error messages, preventing data loss through silent overwrites, ensuring proper cleanup of temporary files, and adding compatibility for cross-device file operations. The changes make the policy persistence more resilient and easier to diagnose when issues arise. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request significantly improves the error handling and diagnostics for policy persistence. The changes are well-implemented and address several important issues, such as preventing data loss from non-ENOENT read errors and ensuring temporary files are cleaned up on failure. The fallback mechanism for cross-device rename operations is a robust addition. The new unit tests are comprehensive and correctly validate all the fixes. Overall, this is an excellent contribution that increases the reliability of policy management.
| // Fall back to copy + unlink which works across filesystems. | ||
| if (isNodeError(renameError) && renameError.code === 'EXDEV') { | ||
| await fs.copyFile(tmpFile, policyFile); | ||
| await fs.unlink(tmpFile); |
There was a problem hiding this comment.
Catch unlink errors: await fs.unlink(tmpFile).catch(() => {}); As the policyFile was successfully renamed in the previous line.
| persist: true, | ||
| }); | ||
|
|
||
| await new Promise((resolve) => setTimeout(resolve, 0)); |
There was a problem hiding this comment.
Prefer vi.waitFor as fixed wait can introduce flakiness. Also please change in other places
| (fs.mkdir as unknown as Mock).mockRejectedValue( | ||
| new Error('Permission denied'), | ||
| ); |
There was a problem hiding this comment.
Prefer vi.mocked(fs.mkdir).mockRejectedValue(new Error('Permission denied'));. Also please change in other places accordingly
Abhijit-2592
left a comment
There was a problem hiding this comment.
Thanks for your contribution. Overall LGTM. I have some minor comments that needs to be addressed before merging
| } catch (renameError) { | ||
| // Cross-device rename fails with EXDEV on some Linux mount configurations. | ||
| // Fall back to copy + unlink which works across filesystems. | ||
| if (isNodeError(renameError) && renameError.code === 'EXDEV') { |
There was a problem hiding this comment.
Here, it explicitly checks for code === 'EXDEV'.
Earlier on line 518, tmpFile is created inside the exact same directory
as policyFile (${policyFile}.${tmpSuffix}.tmp). Because these two files
are intrinsically on the same filesystem, a POSIX rename operation will
never logically throw an EXDEV (Cross-device link) error in this specific scenario.
When users run into this bug structurally (often via single-file bind mounts
in Docker where the target inode is locked), Linux actually throws
EBUSY, not EXDEV. Because the logic explicitly checks for
EXDEV and the else block on line 537 re-throws everything else, an
EBUSY error will bypass your great fallback code entirely and still fail to
persist the file.
Suggestion: We should update the check on line 533 to handle both codes:
if (isNodeError(renameError) && (renameError.code === 'EXDEV' || renameError.code === 'EBUSY'))
Let me know what you think.
| } catch (error) { | ||
| if (!isNodeError(error) || error.code !== 'ENOENT') { | ||
| debugLogger.warn( | ||
| `Failed to parse ${policyFile}, overwriting with new policy.`, | ||
| error, | ||
| ); | ||
| if (isNodeError(error) && error.code === 'ENOENT') { | ||
| // File doesn't exist yet, start fresh | ||
| } else { | ||
| // Non-ENOENT read errors (e.g. EACCES) should abort persistence | ||
| // to avoid silently overwriting the existing file with empty data | ||
| throw error; |
There was a problem hiding this comment.
If a user manually edits their policy file and accidentally introduces a typo
that breaks the TOML format (like forgetting a quote mark), toml.parse() on
line 458 throws a generic Error.
Because a generic parsing error doesn't have a .code property, the new
if (isNodeError(error)) check on line 467 returns false. The new logic
then falls through to the else block on line 472 and throws, completely
aborting the save process.
This traps the CLI in a permanently "bricked" state. Every time the user clicks
"Always Allow" in the future, the app crashes behind the scenes and says "Failed to persist policy." It will never auto-save again until they manually find and delete the corrupted file.
Suggestion (Safe Backup & Recover): To preserve data and keep the app working, we should catch TOML parsing errors specifically. Instead of aborting (the new way) or deleting their old rules (the old way), the gold standard is to rename the corrupted file to auto-saved.toml.bak so they don't lose data, start a fresh existingData object so the save can succeed, and use emitFeedback('warning', ...) to tell the user we safely backed up their broken file!
Here is one way to do It like this, inside the catch on line 466:
} catch (error) {
if (isNodeError(error) && error.code === 'ENOENT') {
// File doesn't exist yet, start fresh
} else if (!isNodeError(error)) {
// This is a parsing error (no Error.code)
coreEvents.emitFeedback(
'warning',
`Syntax error found in policy file. Backing up corrupted file to ${policyFile}.bak and starting fresh.`
);
await fs.copyFile(policyFile, `${policyFile}.bak`).catch(() => {});
existingData = {}; // Clear existingData so the new command overrides it
} else {
// This is a real filesystem error (e.g. EACCES). Throw to prevent silent failure.
throw error;
}
}Let me know what you think.
|
Hi @Abhijit-2592, thanks for the review! I've addressed all the feedback and pushed the updates:
All local tests and type checks are passing now. Let me know if there's anything else needed! |
|
Thanks for addressing the earlier review feedback, the changes look good. I also noticed the branch currently has merge conflicts with If it helps, I’d be happy to help resolve the conflicts and push an updated commit on top of this PR, along with the two remaining review suggestions from @mrpmohiburrahman (the Let me know if you'd like me to take care of that. |
|
Hi @Sikandar1310291, this PR is being closed due to merge conflicts and 34 days of inactivity. To keep the review queue manageable, we are closing stale blocked PRs. Please feel free to re-open this PR or open a new one once the issues are resolved! Thank you for your contribution. |
Summary
Fixes #19919 — Users on Linux (aarch64) receive "✕ Failed to persist policy for run_shell_command" when selecting "allow for all future sessions", despite the
~/.gemini/policiesdirectory being writable andenablePermanentToolApprovalbeing enabled.This PR improves the policy persistence logic with better error handling and diagnostics in three ways:
Atomic write diagnostics — Provides richer error messages when the write fails, helping users understand whether the issue is a permissions problem, a filesystem constraint, or something else.
EXDEV + EBUSY cross-device fallback — When
fs.renamethrowsEXDEV(cross-mount) orEBUSY(emulated/Docker volume lock), the code gracefully falls back tocopyFile+unlink. This fixes policy persistence on Docker volume mounts and other emulated environments.Corrupted TOML recovery — If
auto-saved.tomlcontains invalid TOML (e.g., from a partial/interrupted write), the CLI no longer bricks itself. The corrupted file is automatically backed up toauto-saved.toml.bakand a user-visible warning is emitted, while policy persistence resumes from a clean empty state.Details
The core change is in
packages/core/src/policy/config.tsinside thecreatePolicyUpdaterfunction:try/catchspecifically aroundtoml.parse(). On parse failure, the corrupted file is backed up with.baksuffix, a warning is surfaced viacoreEvents.emitFeedback, and persistence continues withexistingData = {}.catchonfs.renamenow additionally handlesEBUSYalongside the existingEXDEVhandler, triggering the samecopyFile+unlinkfallback.Both fixes are covered by new unit tests in
policy-updater.test.ts.Related Issues
Closes #19919
How to Validate
EBUSY scenario (Docker): Run Gemini CLI inside a Docker container with a bound-mount workspace. Select "Allow for all future sessions" on any shell command. Verify the policy is persisted in
auto-saved.tomlwithout error.Corrupted TOML scenario: Manually corrupt
~/.gemini/policies/auto-saved.tomlwith invalid content (e.g.,invalid;;). Launch Gemini CLI and select "Allow for all future sessions". Verify that:auto-saved.toml.bakis created with the old corrupt content.auto-saved.tomlnow contains the freshly written valid rule.Unit tests: Run
npm test -w @google/gemini-cli-core -- src/policy/policy-updater.test.ts— all 8 tests should pass including the two new ones.Pre-Merge Checklist