Skip to content

Wire server discovery protocol into thv serve#4319

Merged
JAORMX merged 4 commits intomainfrom
wire-server-discovery-protocol
Mar 26, 2026
Merged

Wire server discovery protocol into thv serve#4319
JAORMX merged 4 commits intomainfrom
wire-server-discovery-protocol

Conversation

@JAORMX
Copy link
Copy Markdown
Collaborator

@JAORMX JAORMX commented Mar 23, 2026

Summary

RFC-0034 proposes a long-running local server architecture where thv serve advertises itself via a discovery file so clients (CLI, Studio) can find it without hardcoded ports. The discovery package (pkg/server/discovery/) was already fully implemented and tested but had zero imports anywhere in the codebase. This PR wires it in, giving us a working discovery protocol that immediately benefits Studio and the skills CLI client.

  • The server now writes a discovery file ($XDG_STATE_HOME/toolhive/server/server.json) on startup with the actual listen URL, PID, and a cryptographic nonce, and removes it on shutdown
  • The /health endpoint returns the nonce via X-Toolhive-Nonce so clients can verify server instance identity (prevents PID-reuse confusion)
  • The skills client now auto-discovers the server via the discovery file before falling back to TOOLHIVE_API_URL or localhost:8080, with loopback and socket-path validation on discovered URLs
  • Fixes: SIGTERM handling (was only SIGINT), 30-second shutdown timeout (was unbounded), symlink rejection on the discovery file read path, directory permission tightening, constant-time nonce comparison
  • A startup guard refuses to start if another healthy server is already running

Type of change

  • New feature

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Manual testing: run thv serve, verify server.json appears with correct URL/PID/nonce, curl -v /health shows X-Toolhive-Nonce header, stop server and verify file is removed, start a second server while first is running and verify it refuses.

Changes

File Change
pkg/server/discovery/*.go Discovery package (was implemented but unwired) with security hardening: symlink check on read, dir permission tightening, constant-time nonce comparison, exported URL validators
pkg/api/server.go Nonce generation, WithNonce() builder, ListenURL(), discovery write/remove in lifecycle, startup guard, shutdown timeout
pkg/api/v1/healthcheck.go Nonce parameter, X-Toolhive-Nonce header on 204 responses
cmd/thv/app/server.go Add SIGTERM to signal handling
pkg/skills/client/client.go Discovery-first URL resolution with loopback/socket-path validation
pkg/api/server_test.go Tests for generateNonce, ListenURL (TCP + Unix)
pkg/api/v1/healtcheck_test.go Tests for nonce header present/absent/unhealthy
pkg/server/discovery/*_test.go Tests for symlink read rejection, dir permission tightening, paralleltest fixes

Does this introduce a user-facing change?

thv serve now writes a discovery file that allows other tools (Studio, skills CLI) to automatically find the running server without requiring TOOLHIVE_API_URL or knowing the port. This is transparent to users -- existing behavior is preserved as a fallback.

Special notes for reviewers

  • The pkg/server/discovery/ package was previously implemented and tested in a separate effort but never wired in. This PR includes it as new files since it had no prior commits.
  • The startup guard prevents silent orphaning: if a healthy server is already running, the new server refuses to start. Stale discovery files (from crashes/SIGKILL) are automatically cleaned up.
  • The nonce is explicitly not a secret -- it is a server instance identifier returned in a plaintext HTTP header and stored in a 0600 file. It prevents PID-reuse confusion, not unauthorized access.
  • Security review found no high/critical issues. TOCTOU gaps on symlink checks are mitigated by the 0700 directory permissions (single-user trust model).
  • Future work: thv server start/stop/status subcommands, CLI auto-start, lock file singleton, SSE events, Studio integration.

Large PR Justification

  • This contains both the setup of the server discovery and the usage.

Generated with Claude Code

@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Mar 23, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from 36bc774 to 7fdf8fa Compare March 23, 2026 14:13
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 23, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 50.22624% with 110 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.46%. Comparing base (a9f92a0) to head (3de135a).
⚠️ Report is 14 commits behind head on main.

Files with missing lines Patch % Lines
pkg/api/server.go 13.20% 45 Missing and 1 partial ⚠️
pkg/server/discovery/discovery.go 52.38% 14 Missing and 6 partials ⚠️
pkg/server/discovery/discover.go 41.93% 17 Missing and 1 partial ⚠️
pkg/skills/client/client.go 33.33% 14 Missing and 2 partials ⚠️
pkg/server/discovery/health.go 88.05% 4 Missing and 4 partials ⚠️
pkg/api/v1/healthcheck.go 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4319      +/-   ##
==========================================
- Coverage   69.55%   69.46%   -0.10%     
==========================================
  Files         480      483       +3     
  Lines       48837    49048     +211     
==========================================
+ Hits        33971    34070      +99     
- Misses      12249    12345      +96     
- Partials     2617     2633      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from 7fdf8fa to c166dd9 Compare March 24, 2026 06:04
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 24, 2026
@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from c166dd9 to c6e6ece Compare March 24, 2026 06:10
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 24, 2026
@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from c6e6ece to 6948298 Compare March 24, 2026 06:19
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 24, 2026
Copy link
Copy Markdown
Member

@aponcedeleonch aponcedeleonch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work wiring in the discovery protocol. The migration path is seamless (no server.json existed before, so StateNotFound handles all upgrade scenarios cleanly), and the rollback edge case also works correctly (nonce mismatch triggers overwrite). A few non-blocking observations below, mostly around multi-process safety and context propagation.

@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from 6948298 to e3196be Compare March 24, 2026 11:05
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 24, 2026
@github-actions github-actions bot dismissed their stale review March 24, 2026 11:05

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Copy Markdown
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from e3196be to 4ca07b2 Compare March 24, 2026 11:11
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 24, 2026
Copy link
Copy Markdown
Member

@aponcedeleonch aponcedeleonch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code reviewed against PR description. The implementation looks solid — discovery protocol is well-structured with good security considerations (symlink rejection, loopback validation, constant-time nonce comparison, directory permission tightening, startup guard).

Mismatches between PR description and actual code:

  1. SIGTERM handling not in this PR: The Summary and Changes table claim cmd/thv/app/server.go was changed to add SIGTERM signal handling, but this file is not modified in the diff. This change was likely landed separately (possibly in #4327 based on the recent commit history: 5769a5d2 Gracefully handle SIGTERM in thv HTTP server).

  2. Shutdown timeout not added by this PR: The description claims "30-second shutdown timeout (was unbounded)" as a fix, but shutdownTimeout = 30 * time.Second appears to be pre-existing code, not introduced here.

  3. Discovery vs env var priority is described backwards: The Summary says the skills client "auto-discovers the server via the discovery file before falling back to TOOLHIVE_API_URL", but the actual code in pkg/skills/client/client.go checks TOOLHIVE_API_URL first (explicit override wins), then tries discovery, then falls back to localhost:8080. The NewDefaultClient godoc and the numbered resolution logic are correct — only the Summary bullet inverts the priority.

These are description-only issues; the code itself is correct and well-implemented.

@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from 4ca07b2 to fc8052c Compare March 24, 2026 12:38
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 24, 2026
blkt
blkt previously approved these changes Mar 24, 2026
Copy link
Copy Markdown
Contributor

@blkt blkt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple questions, but overall looks good to me.

@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from fc8052c to 23ec5ae Compare March 25, 2026 10:43
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 25, 2026
aponcedeleonch
aponcedeleonch previously approved these changes Mar 25, 2026
@JAORMX JAORMX dismissed stale reviews from aponcedeleonch and blkt via bbb5f65 March 25, 2026 13:03
@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from 23ec5ae to bbb5f65 Compare March 25, 2026 13:03
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 25, 2026
@JAORMX JAORMX requested review from aponcedeleonch and blkt March 25, 2026 14:20
JAORMX and others added 4 commits March 25, 2026 16:24
The discovery package (pkg/server/discovery/) was already implemented
and tested but had zero imports in the codebase. This wires it into
the serve command so clients (CLI, Studio) can auto-discover a running
server without hardcoded ports or environment variables.

On startup, thv serve now generates a cryptographic nonce, writes a
discovery file to $XDG_STATE_HOME/toolhive/server/server.json with
the actual listen URL (supporting port 0 and Unix sockets), and
returns the nonce via the X-Toolhive-Nonce health check header. On
shutdown the file is removed.

The skills client now tries discovery before falling back to the
TOOLHIVE_API_URL env var or the default localhost:8080, with loopback
and socket-path validation on discovered URLs.

Additional fixes: SIGTERM handling in the serve command, a 30-second
shutdown timeout (was unbounded), symlink rejection on the discovery
file read path, directory permission tightening after MkdirAll, and
constant-time nonce comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Juan Antonio Osorio <ozz@stacklok.com>
- Wrap writeDiscoveryFile check-then-write in WithFileLock to prevent
  TOCTOU race when two servers start simultaneously
- Log FindProcess errors at Debug level instead of silently discarding
- Consolidate ListenURL tests into a table-driven test
- Rename healtcheck_test.go to healthcheck_test.go (fix typo)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Juan Antonio Osorio <ozz@stacklok.com>
The discovery lock file is created in the same directory as
server.json, but the directory may not exist on a fresh system.
MkdirAll was called inside the lock callback (via WriteServerInfo),
but the lock acquisition itself needs the directory to already exist.
Create the directory before calling WithFileLock so the lock file
can be written.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Extract shared HTTPClientForURL in the discovery package to
  deduplicate transport setup between health.go and the skills client
- Propagate context.Context through NewDefaultClient and
  resolveViaDiscovery instead of using context.Background()
- Add comment explaining intentional opts-shadowing order so
  caller-supplied options can override discovery defaults
- Use url.JoinPath in buildHealthClient instead of string concatenation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JAORMX JAORMX force-pushed the wire-server-discovery-protocol branch from bbb5f65 to 3de135a Compare March 25, 2026 14:41
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 25, 2026
@JAORMX JAORMX merged commit 6b0ff32 into main Mar 26, 2026
66 of 68 checks passed
@JAORMX JAORMX deleted the wire-server-discovery-protocol branch March 26, 2026 10:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants