Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
Walkthrough: Adds a semantic response-cache layer with embedder and vector-store abstractions, propagates a guardrails hash into the request context, refactors exact-match caching into helpers, moves caching into the inference request path, provides in-memory and SQLite vector stores, updates config/validation/dependencies, and adds tests and an ADR.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Handler
    participant ResponseCache
    participant SimpleCache
    participant SemanticCache
    participant Embedder
    participant VecStore
    participant Downstream
    Client->>Handler: POST /v1/chat/completions
    Handler->>ResponseCache: HandleRequest(body, next)
    ResponseCache->>SimpleCache: TryHit(body)
    alt Exact match
        SimpleCache-->>ResponseCache: HIT (exact)
        ResponseCache-->>Handler: return cached response (X-Cache: HIT (exact))
        Handler-->>Client: 200 OK
    else No exact match
        SimpleCache-->>ResponseCache: miss
        ResponseCache->>SemanticCache: Handle(body, next)
        SemanticCache->>Embedder: Embed(extracted text)
        Embedder-->>SemanticCache: vector
        SemanticCache->>VecStore: Search(vector, paramsHash)
        VecStore-->>SemanticCache: results
        alt Similarity >= threshold
            SemanticCache-->>ResponseCache: HIT (semantic)
            ResponseCache-->>Handler: return cached response (X-Cache: HIT (semantic))
            Handler-->>Client: 200 OK
        else miss
            SemanticCache->>Downstream: invoke next()
            Downstream-->>SemanticCache: response
            SemanticCache->>VecStore: Insert(vector, response, paramsHash, ttl) (async)
            SemanticCache-->>ResponseCache: return response (X-Cache: MISS)
            ResponseCache-->>Handler: response
            Handler-->>Client: response
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 18
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@.gitignore`:
- Line 32: Update the gitignore entry for the macOS Finder metadata so it
matches the actual filename by replacing the current DS_Store pattern with
.DS_Store (i.e., change "DS_Store" to ".DS_Store") so .DS_Store files are
correctly ignored by git.
In `@config/config.go`:
- Around line 378-392: The current validation enforces vector_store type and
URLs whenever semanticCacheActive detects vector fields, but it doesn't require
the Semantic.Enabled flag; update the check so these validations only run when
the semantic cache is actually enabled (i.e., require sem.Enabled == true) or
change semanticCacheActive to include sem.Enabled in its logic; specifically
modify the block that starts with sem := &c.Response.Semantic and/or the
semanticCacheActive function so that the switch on sem.VectorStore.Type and the
qdrant/pgvector URL checks only execute when sem.Enabled is true (or when
semanticCacheActive returns true after including Enabled).
- Around line 396-405: The validation currently treats any set semantic cache
field as intent to enable, but runtime only checks semanticEnabled; align them
by updating semanticCacheActive to require the Enabled flag as well (i.e.,
return sem != nil && sem.Enabled && (...existing field checks...)) so validation
fails unless Enabled is true, or alternatively change semanticEnabled (in
internal/responsecache/responsecache.go) to call semanticCacheActive(sem) to use
the same detection logic; pick one approach and apply consistently so both
semanticCacheActive and semanticEnabled use the same Enabled+field criteria.
In `@go.mod`:
- Around line 139-140: Add a short comment above the replace directive
documenting why we pin github.com/sugarme/tokenizer to the fork
github.com/clems4ever/tokenizer v0.0.0-20250926133620-9ddc80533c43 (e.g.,
specific bugfix, compatibility with Go toolchain, or missing feature), including
the issue/PR or commit ID that necessitated the fork and any guidance on when it
can be reverted; update the replace line comment to reference that rationale so
future maintainers understand why the forked tokenizer is required.
In `@internal/app/app.go`:
- Around line 579-596: computeGuardrailsHashFromConfig currently builds
responsecache.GuardrailRuleDescriptor with only Name, Type and
SystemPrompt.Content; extend it to include the rule Order and SystemPrompt.Mode
so changes to ordering or prompt mode invalidate the semantic cache, and update
the responsecache.GuardrailRuleDescriptor type and
responsecache.ComputeGuardrailsHash usage to accept and incorporate these new
fields accordingly.
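As a rough illustration of the requested change, a hash that folds in `Order` and `Mode` might look like the following; the exact field names, the SHA-256 choice, and the NUL-delimited encoding are assumptions for this sketch, not the project's actual implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// GuardrailRuleDescriptor mirrors the reviewed type, extended with the
// hypothetical Order and Mode fields so that reordering rules or
// changing the prompt mode produces a different hash.
type GuardrailRuleDescriptor struct {
	Name         string
	Type         string
	Order        int
	Mode         string
	SystemPrompt string
}

// ComputeGuardrailsHash folds every field of every rule into one digest;
// any change, including Order or Mode, invalidates the semantic cache.
func ComputeGuardrailsHash(rules []GuardrailRuleDescriptor) string {
	h := sha256.New()
	for _, r := range rules {
		fmt.Fprintf(h, "%s\x00%s\x00%d\x00%s\x00%s\x00",
			r.Name, r.Type, r.Order, r.Mode, r.SystemPrompt)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := ComputeGuardrailsHash([]GuardrailRuleDescriptor{{Name: "pii", Order: 1}})
	b := ComputeGuardrailsHash([]GuardrailRuleDescriptor{{Name: "pii", Order: 2}})
	fmt.Println(a != b)
}
```

The NUL delimiters keep field boundaries unambiguous, so `{"ab", "c"}` and `{"a", "bc"}` cannot collide.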
In `@internal/embedding/embedding_test.go`:
- Around line 10-18: The test TestNewEmbedder_LocalDefault should not check
err.Error() == "" because that branch is redundant; instead, keep the early skip
when NewEmbedder returns nil error (ONNX present) and otherwise assert that an
error was returned — remove the unreachable/unused empty-error-string assertion.
Locate TestNewEmbedder_LocalDefault and the NewEmbedder call and delete the
err.Error() == "" check (and its t.Fatal), leaving a clear skip for err == nil
and a single assertion that err != nil when ONNX is absent.
In `@internal/embedding/embedding.go`:
- Around line 77-83: The Embed method for miniLMEmbedder ignores the provided
context, so long-running ONNX Compute calls cannot be cancelled; update Embed
(miniLMEmbedder.Embed) to respect ctx by either calling a context-aware variant
of the model compute if available (e.g., model.ComputeContext(ctx, ...)) or by
running e.model.Compute(text, true) in a goroutine and select on ctx.Done() to
abort and return ctx.Err() if cancelled; ensure any resources are cleaned up and
errors are wrapped consistently (keep the existing error message format).
- Around line 48-53: The apiEmbedder is constructed with &http.Client{} which
has no timeout; update the constructor that creates apiEmbedder in embedding.go
to set a reasonable Timeout on the http.Client (e.g., Timeout: defaultTimeout)
or accept a configurable timeout parameter and use that; add a defaultTimeout
constant (or wire through an existing config value) and ensure the apiEmbedder
struct uses &http.Client{Timeout: timeout} so embedding API calls cannot hang
indefinitely.
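A minimal sketch of the fix; `defaultTimeout` and `newAPIEmbedder` are illustrative names, and the value should ideally come from configuration rather than a hard-coded constant:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// defaultTimeout is an assumed value; wire it through the existing
// config if a suitable setting already exists.
const defaultTimeout = 30 * time.Second

type apiEmbedder struct {
	httpClient *http.Client
}

// newAPIEmbedder constructs the client with an explicit Timeout so a
// stalled embedding API call fails instead of hanging forever.
func newAPIEmbedder() *apiEmbedder {
	return &apiEmbedder{httpClient: &http.Client{Timeout: defaultTimeout}}
}

func main() {
	fmt.Println(newAPIEmbedder().httpClient.Timeout)
}
```

`http.Client.Timeout` bounds the entire exchange (connection, request, and body read), which is usually what you want for a one-shot embedding call.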
- Around line 127-146: The code currently parses the body then inspects
parsed.Error, but it should first validate HTTP status codes from
e.httpClient.Do(req) to catch non-2xx responses that may not populate
parsed.Error; after receiving resp, check resp.StatusCode (e.g., if
resp.StatusCode < 200 || resp.StatusCode >= 300) and read the rawBody and return
a formatted error including the status code and body content before attempting
json.Unmarshal; update the error path near the embeddingResponse/parsed handling
so failures return early with context (status code and body) rather than only
relying on parsed.Error or parsed.Data.
In `@internal/responsecache/responsecache.go`:
- Around line 113-118: The semantic path currently returns directly from
m.semantic.Handle when skipSemantic is false, so on a semantic miss the exact
Redis cache never gets populated; change the control flow in responsecache.go so
that after calling m.semantic.Handle (or when it indicates a miss), you still
call m.simple.StoreAfter when m.simple != nil and skipExact is false (i.e.,
perform exact-cache backfill after semantic handling instead of returning
early), and add a HandleRequest regression test that enables both caches and
verifies a semantic miss results in the Redis exact cache being populated for
subsequent exact requests.
In `@internal/responsecache/semantic.go`:
- Around line 165-209: The current semantic cache key is unsafe because
extractEmbedText only returns the last user text (via lastUserText) and
extractTextFromContent drops non-text parts, letting multimodal or multi-turn
requests collide; update caching so computeParamsHash (or whichever builds the
cache key) either (A) only enables semantic caching for single-turn, text-only
requests by checking extractEmbedText’s nonSystemCount==1 and verifying
extractTextFromContent found only plain string/text parts, or (B) include the
full relevant context in the key by passing the entire messages slice (not just
lastUserText) plus deterministic digests/hashes of any non-text content parts
produced by extractTextFromContent (e.g., embed binary/image ids or checksums)
so multimodal/multi-turn differences are represented; modify
extractEmbedText/extractTextFromContent to surface a full messages
representation or content digests to support this change.
- Around line 382-385: The current ShouldSkipAllCache only checks
X-Cache-Control and therefore disagrees with the semantic path that honors
standard Cache-Control: no-store; update ShouldSkipAllCache (used by
ResponseCacheMiddleware.HandleRequest) to use the same predicate as the semantic
path — either call the existing predicate used by the semantic route (if one
exists) or change ShouldSkipAllCache to return true when X-Cache-Control equals
"no-store" OR when the standard Cache-Control header contains the directive
"no-store" (case-insensitive, and handling multiple comma/space-separated
directives).
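A sketch of directive parsing that treats both headers consistently; header names follow the review comment, and the surrounding middleware wiring is assumed:

```go
package main

import (
	"fmt"
	"strings"
)

// hasNoStore reports whether a Cache-Control value carries the no-store
// directive, case-insensitively, across comma-separated directives.
func hasNoStore(headerValue string) bool {
	for _, directive := range strings.Split(headerValue, ",") {
		if strings.EqualFold(strings.TrimSpace(directive), "no-store") {
			return true
		}
	}
	return false
}

// shouldSkipAllCache combines the custom X-Cache-Control header with the
// standard Cache-Control header so both cache paths agree.
func shouldSkipAllCache(xCacheControl, cacheControl string) bool {
	return strings.EqualFold(strings.TrimSpace(xCacheControl), "no-store") ||
		hasNoStore(cacheControl)
}

func main() {
	fmt.Println(shouldSkipAllCache("", "max-age=0, No-Store"))
	fmt.Println(shouldSkipAllCache("", "no-cache"))
}
```

Splitting on commas before comparing avoids false positives from substring matches and handles directive lists like `no-cache, no-store`.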
- Around line 27-30: Remove the now-unused context key type and constant
declarations introduced for response-cache guardrails by deleting the
contextKeyType type and the contextKeyGuardrailsHash constant (and their
comment) from semantic.go; these symbols (contextKeyType,
contextKeyGuardrailsHash) are no longer referenced after switching to
core.WithGuardrailsHash so removing them will fix the lint error—after removal,
run the linter/build to confirm no remaining references.
In `@internal/responsecache/simple.go`:
- Around line 72-97: The X-Cache value set in simpleCacheMiddleware.TryHit is
using "HIT" but ADR 0006 requires the exact-match format "HIT exact" (semantic
layer uses "HIT (semantic)"); update the header set in TryHit so
c.Response().Header().Set("X-Cache", ...) uses "HIT exact" for exact-cache hits
(locate TryHit in simpleCacheMiddleware and the point after cache read where
hashRequest is used to determine a hit) and keep the rest of the response logic
unchanged.
In `@internal/responsecache/vecstore_map.go`:
- Around line 121-130: The custom Newton-Raphson sqrt64 function should be
replaced with the standard library math.Sqrt: remove or delete the sqrt64
function and change any callers to use math.Sqrt(x) instead, and add/import
"math" where needed; ensure you remove the unused sqrt64 symbol to avoid dead
code or update references consistently (function name: sqrt64).
In `@internal/responsecache/vecstore_sqlite.go`:
- Around line 170-176: The current sqliteVecStore.deleteByKeys loops and issues
one ExecContext per key, causing O(n) round-trips; change it to execute a single
batched DELETE against the vec_items table using a single DELETE ... WHERE key
IN (?, ?, ...) with dynamically generated placeholders from the keys slice (or,
if placeholder generation is undesirable, wrap a prepared statement in a
transaction and execute batch deletes), ensuring you still use the ctx with
timeout and handle/return errors from s.db.ExecContext rather than ignoring
them; update the function sqliteVecStore.deleteByKeys to build the placeholders,
call s.db.ExecContext once, and properly handle the result/error.
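The placeholder-generation approach can be sketched like this; the `vec_items`/`key` names follow the review comment, and query building is separated from execution for clarity:

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchDelete generates a single DELETE with an IN (...) list so
// all keys are removed in one round-trip instead of one per key.
// An empty slice yields an empty query, which callers should skip.
func buildBatchDelete(keys []string) (string, []any) {
	if len(keys) == 0 {
		return "", nil
	}
	placeholders := make([]string, len(keys))
	args := make([]any, len(keys))
	for i, k := range keys {
		placeholders[i] = "?"
		args[i] = k
	}
	query := "DELETE FROM vec_items WHERE key IN (" + strings.Join(placeholders, ", ") + ")"
	return query, args
}

func main() {
	query, args := buildBatchDelete([]string{"a", "b", "c"})
	fmt.Println(query, len(args))
}
```

The caller would then run `s.db.ExecContext(ctx, query, args...)` once with the existing timeout context and return the error instead of ignoring it. Note SQLite caps the number of bound parameters, so very large key slices may still need chunking.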
- Around line 70-119: The Search method (sqliteVecStore.Search) can return fewer
than the requested limit because expired rows are filtered after the SQL LIMIT;
fix by moving expiry filtering into the SQL query: add "AND (expires_at = 0 OR
expires_at >= ?)" to the WHERE clause and pass the current unix timestamp as the
extra parameter so the DB only returns non-expired rows, then remove/ignore the
application-level expiry check that drops results (you can still collect expired
keys separately via a separate query or keep the goroutine deleteByKeys call if
needed). Use the existing serializeFloat32(vec) parameter ordering when adding
the timestamp argument.
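A sketch of the amended query; the column names mirror those mentioned in the comment, and any vector-distance ordering is omitted since it depends on the store's implementation:

```sql
-- Pass time.Now().Unix() as the extra bound parameter, keeping the
-- existing serializeFloat32(vec) parameter ordering in front of it.
SELECT key, embedding, response
FROM vec_items
WHERE params_hash = ?
  AND (expires_at = 0 OR expires_at >= ?)
LIMIT ?;
```

With expiry pushed into the WHERE clause, the LIMIT applies only to live rows, so the method can no longer return fewer results than requested because of post-filtering.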
In `@internal/server/translated_inference_service.go`:
- Around line 55-61: When marshalRequestBody(req) fails it's silently bypassing
the cache; add a debug log for the error before falling back to
dispatchChatCompletion. In the block that checks s.responseCache != nil &&
!req.Stream, capture marshalErr and if marshalErr != nil call the server's
logger (e.g., s.logger.Debugf or similar) to log a message like
"marshalRequestBody failed: %v" with marshalErr, then continue to call
s.dispatchChatCompletion(c, req, plan) as before; keep the existing
s.responseCache.HandleRequest path when marshal succeeds.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a6410690-7aac-454d-806d-6f5c9a9cf86b
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (19)
`.gitignore`, `config/config.go`, `docs/adr/0006-semantic-response-cache.md`, `go.mod`, `internal/app/app.go`, `internal/core/context.go`, `internal/embedding/embedding.go`, `internal/embedding/embedding_test.go`, `internal/responsecache/middleware_test.go`, `internal/responsecache/responsecache.go`, `internal/responsecache/semantic.go`, `internal/responsecache/semantic_test.go`, `internal/responsecache/simple.go`, `internal/responsecache/vecstore.go`, `internal/responsecache/vecstore_map.go`, `internal/responsecache/vecstore_sqlite.go`, `internal/server/handlers.go`, `internal/server/http.go`, `internal/server/translated_inference_service.go`
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@go.mod`:
- Line 3: go.mod currently pins "go 1.25.1" but CI and Docker use "1.25" (and
release uses "stable"), causing potential version drift; choose one approach and
make them consistent: either change go.mod to "go 1.25" (relax patch-level) or
update CI and Docker pins to the exact patch "1.25.1" — update the GO_VERSION in
.github/workflows/test.yml, the FROM image tag in Dockerfile (golang:1.25 ->
golang:1.25.1-alpine3.23 or equivalent), and the go-version in
.github/workflows/release.yml to match; ensure all three locations (go.mod,
GO_VERSION, Docker FROM / release go-version) use the same version string.
In `@internal/responsecache/simple.go`:
- Around line 81-84: When m.store.Get(ctx, key) returns an error the code
currently swallows it; update the handler around the m.store.Get call to log the
error (at debug or warn) including the key and error details before falling back
to a cache miss. Use the middleware's logger (e.g., m.logger or the existing
logger field) or the request context logger to emit a message like "cache read
error" with key and err, then keep returning false, nil to preserve behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d8c78c18-0a66-494f-8090-2de0c74f8a1a
📒 Files selected for processing (3)
`.gitignore`, `go.mod`, `internal/responsecache/simple.go`
guardrails_hash already implemented
on semantic-miss
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config/config.go`:
- Around line 379-392: The semantic cache validation currently only checks
vector_store.Type and related URLs; add guards inside the
SemanticCacheActive(sem) block to reject invalid tuning values by validating
sem.Tuning.SimilarityThreshold (or similarity_threshold) must be > 0 and <= 1
and return a descriptive error if not, and validate sem.Tuning.TTL (or ttl) must
be >= 0 (reject negative values) with a clear error message; place these checks
alongside the existing vector_store validations (referencing
c.Response.Semantic, SemanticCacheActive, and sem.VectorStore.Type) so invalid
tuning values are caught during config validation.
In `@internal/responsecache/responsecache.go`:
- Around line 98-99: The header comparison for X-Cache-Type is case-sensitive
while ShouldSkipExactCache is case-insensitive, so update the check in the
handler to compare case-insensitively (e.g., use strings.EqualFold or normalize
strings.ToLower) when evaluating c.Request().Header.Get("X-Cache-Type") against
CacheTypeExact; modify the expression skipSemantic := m.semantic == nil ||
strings.EqualFold(c.Request().Header.Get("X-Cache-Type"), CacheTypeExact) (and
add/import "strings" if missing) so both paths handle header casing consistently
with ShouldSkipExactCache.
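The comparison itself reduces to a one-liner with `strings.EqualFold`; the `CacheTypeExact` literal below is assumed:

```go
package main

import (
	"fmt"
	"strings"
)

// CacheTypeExact mirrors the constant referenced in the review; the
// literal value here is an assumption.
const CacheTypeExact = "exact"

// isExactCacheType compares the X-Cache-Type header value without
// allocating a lowered copy, matching ShouldSkipExactCache's
// case-insensitive behavior.
func isExactCacheType(headerValue string) bool {
	return strings.EqualFold(headerValue, CacheTypeExact)
}

func main() {
	fmt.Println(isExactCacheType("EXACT"), isExactCacheType("semantic"))
}
```

`strings.EqualFold` is preferable to `strings.ToLower` on both sides because it avoids allocation and handles Unicode case folding correctly.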
In `@internal/responsecache/semantic.go`:
- Around line 81-86: The current paramsHash includes conversationFingerprint
which contains the full last user turn causing paraphrases to fall into
different namespaces; change the namespace construction so it only hashes
invariant context (e.g., prior turns excluding the final user text, non-text
attachments/metadata, guardrails hash from core.GetGuardrailsHash(ctx), model
and embedder identity) instead of the full message JSON inside
conversationFingerprint used when computing paramsHash; alternatively, detect
non-single-turn or non-text-only requests in the code paths (the same places
around conversationFingerprint/computeParamsHash and the later 194-235 block)
and explicitly skip semantic caching for those cases so store.Search (which is
filtered by params_hash) only compares across semantically equivalent
namespaces.
In `@internal/responsecache/vecstore_sqlite.go`:
- Around line 49-62: Add an index and a uniqueness constraint to the vec_items
schema and switch inserts to an upsert to match the lookup pattern: alter the
CREATE TABLE for vec_items to enforce uniqueness on (key, params_hash) (or add a
UNIQUE INDEX on those columns), and add an index on params_hash or a composite
index on (params_hash, expires_at) so queries using WHERE params_hash = ? AND
expires_at > ? are indexed; then change the code paths that insert into
vec_items (the INSERT statements in vecstore_sqlite.go that write new
embeddings/responses) to use INSERT ... ON CONFLICT(key, params_hash) DO UPDATE
(or DO NOTHING if you prefer no overwrite) to dedupe concurrent inserts; update
both the initial table/index creation block and the other index creation block
referenced in the file.
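Putting the pieces together, the schema and insert might look like the following; the column set here is hypothetical and should be matched to the real `vec_items` definition:

```sql
-- Hypothetical column set; align with the actual vec_items schema.
CREATE TABLE IF NOT EXISTS vec_items (
    key         TEXT    NOT NULL,
    params_hash TEXT    NOT NULL,
    embedding   BLOB    NOT NULL,
    response    BLOB    NOT NULL,
    expires_at  INTEGER NOT NULL DEFAULT 0,
    UNIQUE (key, params_hash)
);

-- Composite index matching the lookup pattern
-- WHERE params_hash = ? AND expires_at > ?
CREATE INDEX IF NOT EXISTS idx_vec_items_params_expiry
    ON vec_items (params_hash, expires_at);

-- Upsert dedupes concurrent inserts of the same (key, params_hash);
-- use DO NOTHING instead if existing entries should never be overwritten.
INSERT INTO vec_items (key, params_hash, embedding, response, expires_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (key, params_hash) DO UPDATE SET
    embedding  = excluded.embedding,
    response   = excluded.response,
    expires_at = excluded.expires_at;
```

SQLite has supported `ON CONFLICT ... DO UPDATE` (UPSERT) since 3.24.0, so this works with any reasonably recent driver.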
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: b760722d-770a-4a10-a2a3-e7f1c5501b37
📒 Files selected for processing (15)
`config/cache_validation_test.go`, `config/config.go`, `docs/adr/0006-semantic-response-cache.md`, `go.mod`, `internal/app/app.go`, `internal/embedding/embedding.go`, `internal/embedding/embedding_test.go`, `internal/responsecache/handle_request_test.go`, `internal/responsecache/middleware_test.go`, `internal/responsecache/responsecache.go`, `internal/responsecache/semantic.go`, `internal/responsecache/semantic_test.go`, `internal/responsecache/vecstore_map.go`, `internal/responsecache/vecstore_sqlite.go`, `internal/server/translated_inference_service.go`
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/server/translated_inference_service.go (1)
192-216: 🧹 Nitpick | 🔵 Trivial: Embeddings handler not integrated with the new caching pipeline.
`Embeddings` bypasses the `handleTranslatedInference`/`handleWithCache` pattern used by `ChatCompletion` and `Responses`. If semantic caching should apply to embeddings in the future, this method will need to be refactored. If caching is intentionally not supported for embeddings, consider adding a comment documenting this decision.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/server/translated_inference_service.go` around lines 192 - 216, The Embeddings method currently bypasses the shared caching pipeline used by ChatCompletion/Responses (handleTranslatedInference and handleWithCache), so either integrate embeddings into that flow or explicitly document the decision not to cache; update the translatedInferenceService.Embeddings implementation to call the same helper (e.g., handleTranslatedInference or handleWithCache) with the embedding request and provider/model resolution (preserving requestID and usage logging via s.logUsage and usage.ExtractFromEmbeddingResponse), or add a clear comment above Embeddings stating caching is intentionally disabled for embeddings and why, including references to handleTranslatedInference, handleWithCache, ChatCompletion, and Responses so future maintainers know the intended behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/server/handlers.go`:
- Around line 68-81: The translatedInference() helper allocates a new
translatedInferenceService and calls initHandlers() on every request; change
Handler to hold a cached field (e.g. translatedSvc *translatedInferenceService)
and lazily initialize it inside translatedInference() using sync.Once or a mutex
to ensure thread-safety, construct the service reusing Handler's
provider/modelResolver/... fields, call initHandlers() only once during
initialization, and return the cached instance on subsequent calls so you avoid
per-request allocations.
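The lazy-initialization pattern the comment describes can be sketched with `sync.Once`; the service type below is a stand-in, since the real `translatedInferenceService` carries provider and resolver fields from `Handler`:

```go
package main

import (
	"fmt"
	"sync"
)

// translatedInferenceService is a stand-in; the real type lives in
// internal/server and reuses Handler's provider/modelResolver fields.
type translatedInferenceService struct{ inits int }

func (s *translatedInferenceService) initHandlers() { s.inits++ }

type Handler struct {
	once sync.Once
	svc  *translatedInferenceService
}

// translatedInference lazily builds the service exactly once; sync.Once
// makes the construction thread-safe, and subsequent calls return the
// cached instance with no per-request allocation.
func (h *Handler) translatedInference() *translatedInferenceService {
	h.once.Do(func() {
		svc := &translatedInferenceService{}
		svc.initHandlers()
		h.svc = svc
	})
	return h.svc
}

func main() {
	h := &Handler{}
	a, b := h.translatedInference(), h.translatedInference()
	fmt.Println(a == b, a.inits)
}
```

`sync.Once` also guarantees that concurrent first callers block until initialization completes, so no caller ever observes a half-initialized service.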
---
Outside diff comments:
In `@internal/server/translated_inference_service.go`:
- Around line 192-216: The Embeddings method currently bypasses the shared
caching pipeline used by ChatCompletion/Responses (handleTranslatedInference and
handleWithCache), so either integrate embeddings into that flow or explicitly
document the decision not to cache; update the
translatedInferenceService.Embeddings implementation to call the same helper
(e.g., handleTranslatedInference or handleWithCache) with the embedding request
and provider/model resolution (preserving requestID and usage logging via
s.logUsage and usage.ExtractFromEmbeddingResponse), or add a clear comment above
Embeddings stating caching is intentionally disabled for embeddings and why,
including references to handleTranslatedInference, handleWithCache,
ChatCompletion, and Responses so future maintainers know the intended behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 96ebe06b-5a54-4ba8-8075-0fcefe4ef364
📒 Files selected for processing (2)
`internal/server/handlers.go`, `internal/server/translated_inference_service.go`
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/responsecache/responsecache.go`:
- Around line 98-102: The import block containing "log/slog", "strings", "time"
was accidentally placed inside the HandleRequest function causing a syntax
error; remove that misplaced import block from inside HandleRequest and add
"strings" to the top-level import list at the top of the file (keeping existing
imports such as log/slog and time as needed), then run gofmt/goimports to ensure
imports are ordered and the file compiles.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 98061d51-5fc2-4e4f-9c7e-151cb8008470
📒 Files selected for processing (1)
internal/responsecache/responsecache.go
# Conflicts:
#	.gitignore
#	go.mod
#	go.sum
#	internal/responsecache/simple.go
Summary by CodeRabbit
- New Features
- Documentation
- Chores
- Tests