Skip to content

feat: support leader hint when NodeIsNotLeader#883

Merged
mattisonchao merged 9 commits intomainfrom
feat.leaderhint
Feb 22, 2026
Merged

feat: support leader hint when NodeIsNotLeader#883
mattisonchao merged 9 commits intomainfrom
feat.leaderhint

Conversation

@mattisonchao
Copy link
Copy Markdown
Member

@mattisonchao mattisonchao commented Feb 9, 2026

Motivation

When a client sends a request to a non-leader node, the retry relies on the shard manager's assignment mapping, which may be stale. This PR enriches the NodeIsNotLeader gRPC error with a leader hint so the client can retry directly to the correct leader, reducing retry latency during leadership changes.

Modification

  • Add LeaderHint protobuf message and embed it as gRPC status detail in NodeIsNotLeader errors.
  • Add shardAssignmentsIndex on the assignment dispatcher for O(log n) shard-to-leader lookup.
  • Extract leader hint on the client side and use it in read/write batch retry paths.
  • Invalidate cached write streams when a leader hint is present, with proper cleanup of old streams.
  • Remove unneeded goroutine for write stream context watching.
  • Add WithFailureInjection client option and DizzyShardManager for testing.
  • Add integration tests for leader hint with and without client.

mattisonchao and others added 3 commits February 22, 2026 22:40
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao
Copy link
Copy Markdown
Member Author

Code Review Summary

Regression Check

All modifications from main have been verified:

  1. Removed handleStreamClosed goroutine (write_stream.go) — Actually an improvement. The old code had two goroutines: handleResponses exited on error without failing pending futures, leaving that to handleStreamClosed (which used generic io.EOF). The new code consolidates both: handleResponses now fails all pending futures with the actual error on Recv() failure, which is better — CodeNodeIsNotLeader errors propagate correctly for hint extraction.

  2. writeStream invalidation on leader hint (executor.go) — The RLock→Lock upgrade race is the same pattern as main (for failed streams). Worst case: two goroutines create streams simultaneously, one gets overwritten. Old stream is now explicitly closed and marked failed, which is better than main's silent overwrite.

  3. shard_manager.go Leader fallback — Changed from panic("shard not found") to return s.serviceAddress. Safer fallback — the service address can redirect, whereas a panic crashes the client.

  4. public_rpc_server.go error handling — Logic properly inverted. CodeNodeIsNotLeader now gets enriched with hint; all other errors still get logged as warnings. Behavior preserved and enhanced.

  5. Batch retry paths (read_batch.go, write_batch.go) — hint variable is updated in RetryNotify callback (fires between retries), correctly set before the next execute call.

  6. assignment_dispatcher.go — Index rebuilt atomically under write lock, queried under read lock. Initialized() fixed to use RLock.

Verdict

No regressions, no blocking issues. The PR is ready to merge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao merged commit 4f1411e into main Feb 22, 2026
9 checks passed
@mattisonchao mattisonchao deleted the feat.leaderhint branch February 22, 2026 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant