refactor(ssh): use agent instead of temporary files for private keys by water-sucks · Pull Request #196 · nix-community/nixos-cli

water-sucks · 2026-02-26T01:03:12Z

Description

After some testing, the temporary file mechanism introduced in #175 is a bit brittle sometimes on non-Linux systems, and has the potential to leave private key material on the system in failure scenarios. Even though the Linux implementation uses memfd to make sure this never happens, I still wanted to see if there was an alternative approach for storing keys. And there is! We can just spin up an agent ourselves or add keys to an existing agent.

This PR creates an SSH AgentManager to do just that: it either uses an existing SSH agent, or spins up its own internal SSH agent to add keys to, and subsequent processes (such as nix-copy-closure can inherit this value through SSH_AUTH_SOCK so that we don't need to pass temporary file paths through NIX_SSHOPTS either.

The agent server can be disabled using a new setting, ssh.enable_agent_server if so desired.

If an existing SSH agent is used, any added keys are removed from the agent at the end of the program.

Also, unrelated fix: it handles errors for knownhosts detection if $HOME is not defined, which should be rare.

Summary by CodeRabbit

Release Notes

New Features
- Enhanced SSH agent support with automatic management and connection pooling for improved SSH operation reliability.
- Improved SSH configuration handling with better reconnection and cloning capabilities.
Bug Fixes
- Refined SSH error handling and connection lifecycle management.
Chores
- Removed legacy temporary file implementations in favor of standard system approaches.

water-sucks · 2026-02-26T01:03:34Z

@coderabbitai review

coderabbitai · 2026-02-26T01:03:45Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2026-02-26T01:09:35Z

📝 Walkthrough

Walkthrough

This PR introduces a dedicated SSH agent management module, refactors SSH system architecture to use a configuration-based approach with separate SSHConfig and AgentManager types, integrates this new architecture into the apply command flow, and removes the TempFile abstraction from system utilities.

Changes

Cohort / File(s)	Summary
SSH Agent Management `internal/ssh/agent.go`	New module introducing AgentManager for thread-safe SSH agent handling with support for connecting to existing agents via SSH_AUTH_SOCK or creating in-process Unix-socket-based agents. Includes key management, lifecycle control, and optional in-memory agent server functionality.
SSH System Refactoring `internal/system/ssh.go`	Major architectural shift replacing monolithic SSHSystem with separate SSHConfig and SSHSystem types. Introduces SSHConfigOptions for centralized configuration, refactors authentication/host-key logic into new helpers (knownHostsCallback, dialClient, runPrivateKeyCmd), and adds Reconnect/Clone methods.
Apply Command Integration `cmd/apply/apply.go`	Integrates new SSH agent and configuration-based system by creating an AgentManager at startup, building SSHConfig with agent/host information, and deferring cleanup of SSH resources.
System Utilities `internal/system/system.go`	Removes accumulation of Nix SSH options from remote-copy branches, simplifying NIX_SSHOPTS construction logic.
Tempfile Removal `internal/system/tempfile_linux.go`, `internal/system/tempfile_other.go`	Deletes platform-specific TempFile implementations including constructors, path accessors, and cleanup methods.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

feat(apply): add support for remote building/activation #116: Modifies SSH-based remote build/deploy flow in cmd/apply and internal SSH system initialization pathways with overlapping code.
fix(ssh): parse ssh host string manually, clean up bugs #161: Adjusts SSH initialization logic in internal/system/ssh.go including host-string parsing and agent dialing/timeout handling.
feat(system): redo password handling for remote root, allow SSH password auth #124: Modifies SSHSystem authentication flow and remote-root password management via EnsureRemoteRootPassword.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 7.69% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main architectural change: replacing temporary files with SSH agent for private key handling.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

internal/system/ssh.go (1)

256-260: ⚠️ Potential issue | 🟠 Major

Don’t retain an unverified sudo password on auth failure.

The password is cached before verification. If verification fails once, future calls won’t reprompt because s.sudoPassword is already set.

✅ Suggested fix

 	s.sudoPassword = bytePassword

 	if err = s.testRemoteRoot(rootCmd); err != nil {
+		s.sudoPassword = nil
 		return fmt.Errorf("failed to verify %s password: %s", rootCmd, err)
 	}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/system/ssh.go` around lines 256 - 260, The code currently sets
s.sudoPassword = bytePassword before verification, causing an unverified
password to be cached; move the assignment so s.sudoPassword is only set after
testRemoteRoot succeeds, or if you prefer to keep the assignment where it is,
ensure you clear it on error (e.g., set s.sudoPassword = nil) when
testRemoteRoot(rootCmd) returns an error; update the block around
testRemoteRoot, referencing s.sudoPassword, testRemoteRoot, rootCmd, and
bytePassword accordingly.

🧹 Nitpick comments (1)

internal/system/ssh.go (1)

296-297: Clone should copy password bytes instead of sharing the same backing array.

Sharing the same []byte across clones is fragile for concurrent/session lifecycle handling.

✅ Suggested refactor

-		sudoPassword: s.sudoPassword,
+		sudoPassword: append([]byte(nil), s.sudoPassword...),

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@internal/system/ssh.go` around lines 296 - 297, The Clone implementation
currently assigns the sudoPassword slice by reference (sudoPassword:
s.sudoPassword), which shares the backing array across clones; change Clone (or
the method constructing the cloned ssh session) to deep-copy the password bytes:
if s.sudoPassword != nil create a new []byte of the same length, copy the
contents, and assign that new slice to sudoPassword (handle nil appropriately)
so each clone has its own backing array.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/apply/apply.go`:
- Around line 236-241: The deferred call to sshAgent.Stop() may panic when
sshUtils.NewAgentManager(log) returns an error and sshAgent is nil; update the
code to only defer sshAgent.Stop() when sshAgent is non-nil (e.g., check the
result of sshUtils.NewAgentManager and, after ensuring err == nil and sshAgent
!= nil, register defer sshAgent.Stop()), or assign sshAgent to a local variable
and use a nil-check inside the deferred closure before calling Stop(); reference
sshAgent, sshUtils.NewAgentManager, and Stop() when making the change.

In `@internal/ssh/agent.go`:
- Around line 312-319: The code calls os.Stat(tmpdir) and if it returns
os.ErrNotExist it creates the dir with os.MkdirAll(tmpdir, 0o777|os.ModeSticky)
but then still returns the original stat error unconditionally; update the
branch so that when errors.Is(err, os.ErrNotExist) and os.MkdirAll succeeds you
do NOT return the original err (only return on MkdirAll failure or other
non-NotExist stat errors). Locate the tmpdir check around os.Stat and
os.MkdirAll and remove or guard the final return "", err so success paths
continue normally.
- Around line 242-243: The goroutine passed to s.wg.Go (e.g., the callback that
currently defers s.wg.Done() around s.acceptConnections and the handler at the
shown callback at lines 296–299) is calling Done() manually even though
WaitGroup.Go automatically handles Done(); remove the manual defer s.wg.Done()
from any functions or anonymous goroutines supplied to s.wg.Go (leave the
callsites like s.wg.Go(s.acceptConnections) or the anonymous callback intact),
ensuring Done() is only called by WaitGroup.Go to avoid a double Done/underflow
panic.

In `@internal/system/ssh.go`:
- Around line 100-116: The current logic only uses PrivateKeyCmd output when
options.AgentManager is present, causing key-based auth to be dropped if
AgentManager is nil or manager.Add fails; change the flow in the block around
options.PrivateKeyCmd/runPrivateKeyCmd so that after obtaining privateKey (or
its bytes) you attempt to add it to the agent if manager != nil and manager.Add
succeeds, but regardless of agent availability or Add failure, parse the
returned private key into an ssh.Signer (use the same conversion that
runPrivateKeyCmd returns) and append a direct key auth method to auth (e.g.,
ssh.PublicKeys with the signer or equivalent), while keeping the existing
ssh.PublicKeysCallback(manager.Client().Signers) when manager is present and Add
succeeded; reference options.PrivateKeyCmd, runPrivateKeyCmd, manager.Add, auth,
and ssh.PublicKeysCallback to locate where to add the fallback signer append.
- Around line 164-166: The code appends a possibly relative path when
os.UserHomeDir() fails; change the block in internal/system/ssh.go so you check
the error and the returned homeDir before building knownHostsUserFile and
appending to defaultKnownHosts (i.e., only construct filepath.Join(homeDir,
".ssh", "known_hosts") and append it when err == nil and homeDir != "");
optionally fall back to HOME env or log/skipped addition when the home directory
can't be determined to avoid writing a known_hosts file into the current working
directory.

---

Outside diff comments:
In `@internal/system/ssh.go`:
- Around line 256-260: The code currently sets s.sudoPassword = bytePassword
before verification, causing an unverified password to be cached; move the
assignment so s.sudoPassword is only set after testRemoteRoot succeeds, or if
you prefer to keep the assignment where it is, ensure you clear it on error
(e.g., set s.sudoPassword = nil) when testRemoteRoot(rootCmd) returns an error;
update the block around testRemoteRoot, referencing s.sudoPassword,
testRemoteRoot, rootCmd, and bytePassword accordingly.

---

Nitpick comments:
In `@internal/system/ssh.go`:
- Around line 296-297: The Clone implementation currently assigns the
sudoPassword slice by reference (sudoPassword: s.sudoPassword), which shares the
backing array across clones; change Clone (or the method constructing the cloned
ssh session) to deep-copy the password bytes: if s.sudoPassword != nil create a
new []byte of the same length, copy the contents, and assign that new slice to
sudoPassword (handle nil appropriately) so each clone has its own backing array.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1097d85 and d643fef.

📒 Files selected for processing (6)

cmd/apply/apply.go
internal/ssh/agent.go
internal/system/ssh.go
internal/system/system.go
internal/system/tempfile_linux.go
internal/system/tempfile_other.go

💤 Files with no reviewable changes (3)

internal/system/tempfile_linux.go
internal/system/system.go
internal/system/tempfile_other.go

cmd/apply/apply.go

internal/ssh/agent.go

internal/system/ssh.go

water-sucks · 2026-02-26T02:58:23Z

@Sporif since I know you worked on the temp private key functionality, I'd like your input on this in particular.

Sporif

This is a nice solution for non-Linux systems. I wish ssh-agent supported abstract unix sockets though, that would make it as good as memfd for Linux. But overall I'm fine with this approach, and it works well in my limited testing so far.

internal/system/ssh.go

cmd/apply/apply.go

This spins up an internal SSH agent server fallback for when an SSH agent is not running. All private key material that is accessed using this command must never be persisted to disk, and should instead use this SSH agent.

coderabbitai bot reviewed Feb 26, 2026

View reviewed changes

cmd/apply/apply.go Show resolved Hide resolved

internal/ssh/agent.go Show resolved Hide resolved

internal/ssh/agent.go Show resolved Hide resolved

internal/system/ssh.go Outdated Show resolved Hide resolved

internal/system/ssh.go Outdated Show resolved Hide resolved

water-sucks force-pushed the ssh-refactors branch 2 times, most recently from 4b70a36 to 460b51e Compare February 26, 2026 02:56

water-sucks requested a review from Sporif February 26, 2026 02:56

Sporif requested changes Feb 26, 2026

View reviewed changes

internal/system/ssh.go Show resolved Hide resolved

cmd/apply/apply.go Show resolved Hide resolved

water-sucks added 5 commits February 28, 2026 14:46

refactor(ssh): split ssh config attributes out of system

e8e922d

feat(ssh): add in-memory ssh client with internal agent server

2a9b880

This spins up an internal SSH agent server fallback for when an SSH agent is not running. All private key material that is accessed using this command must never be persisted to disk, and should instead use this SSH agent.

refactor(ssh): use ssh agent for adding keys instead of temp files

4e7c675

refactor(ssh): only add to user knownhosts if user home dir is defined

bf444e6

refactor(ssh): add ctx for blocking password reads

f073b9e

water-sucks force-pushed the ssh-refactors branch from 460b51e to f073b9e Compare March 1, 2026 03:02

water-sucks requested a review from Sporif March 1, 2026 03:38

Sporif approved these changes Mar 1, 2026

View reviewed changes

water-sucks merged commit 0d699d6 into nix-community:main Mar 1, 2026
2 checks passed

Uh oh!

Conversation

water-sucks commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Summary by CodeRabbit

Release Notes

Uh oh!

water-sucks commented Feb 26, 2026

Uh oh!

coderabbitai bot commented Feb 26, 2026

Uh oh!

coderabbitai bot commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

water-sucks commented Feb 26, 2026

Uh oh!

Sporif left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

water-sucks commented Feb 26, 2026 •

edited

Loading

coderabbitai bot commented Feb 26, 2026 •

edited

Loading