
test: migrate contract suite to replay harness #109

Merged
SantiagoDePolonia merged 5 commits into main from refactoring-tests
Feb 28, 2026

Conversation

@SantiagoDePolonia
Contributor

@SantiagoDePolonia SantiagoDePolonia commented Feb 28, 2026

Summary by CodeRabbit

  • New Features

    • Messages can record multiple tool invocations per message.
    • New Responses API surface: standard and streaming responses supported.
    • Gemini provider can be configured to use a custom models API URL.
  • Tests

    • Contract replay testing framework introduced with extensive golden fixtures for providers.
  • Documentation

    • Testing strategy and guides updated to cover contract replay workflows and fixture recording.

@coderabbitai
Contributor

coderabbitai bot commented Feb 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61ca107 and 412c9b8.

📒 Files selected for processing (4)
  • internal/providers/gemini/gemini.go
  • internal/providers/gemini/gemini_test.go
  • tests/contract/gemini_test.go
  • tests/contract/replay_harness_test.go

📝 Walkthrough

Adds contract replay testing infrastructure and CI job, introduces ToolCall/FunctionCall types and Anthropic tool extraction, extends provider model listing and responses handling (Gemini/models client, recordapi responses endpoints), and converts many provider contract tests to replay-driven, golden-file based suites.

Changes

  • CI / Workflow (.github/workflows/test.yml)
    Added test-contract job to run contract replay tests; made build depend on test-contract.
  • Core types (internal/core/types.go)
    Added ToolCall and FunctionCall types and Message.ToolCalls []ToolCall.
  • Record CLI (cmd/recordapi/main.go)
    Added support for /v1/responses and /v1/responses_stream routes and provider capability gating for responses endpoints.
  • Provider: Anthropic (internal/providers/anthropic/anthropic.go)
    Expanded content fields (ID, Name, Input); added parsing of "tool_use" blocks and conversion to core.ToolCall; propagates tool calls into chat responses (streaming and non-streaming).
  • Provider: Gemini (internal/providers/gemini/gemini.go)
    Introduced a dedicated models client config and SetModelsURL; ListModels uses native Gemini parsing with an OpenAI fallback, a header-based API key, and request tracing (X-Request-Id).
  • Docs / Testing strategy (docs/TESTING_STRATEGY.md, tests/contract/README.md)
    Renamed to "Contract Replay Tests"; standardized the recording workflow (make targets), added 5m timeout guidance, and updated commands and descriptions.
  • Test harness & helpers (tests/contract/replay_harness_test.go, tests/contract/golden_helpers_test.go, tests/contract/replay_provider_helpers_test.go)
    Added an in-test HTTP replay transport, SSE parsing, golden helpers (normalization, ID anonymization), and provider helpers to create replay-backed providers.
  • Contract tests, many providers (tests/contract/*_test.go)
    Replaced numerous ad-hoc contract assertions with replay-driven, table/subtest suites and golden-file comparisons for Anthropic, Gemini, Groq, OpenAI, and XAI. Added streaming and responses tests where applicable.
  • Provider tests (internal/providers/gemini/gemini_test.go)
    Updated TestListModels to expect the API key in the x-goog-api-key header instead of a query param.
  • Golden fixtures (tests/contract/testdata/golden/{anthropic,gemini,groq,openai,xai}/*, tests/contract/testdata/{openai,xai}/*)
    Added 50+ golden JSON and stream fixtures covering chat completions (basic/params/tools/multimodal), streaming outputs, models lists, and responses streams used by the replay tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Poem

🐇 Golden fixtures rustle in the test-hall glade,

I hop through replays the devs have made.
Tool calls tucked in each assistant note,
CI listens closely to every rote.
Hooray — our contracts hop, all tests parade!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 4.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: the pull request title accurately summarizes the main objective, migrating the contract test suite to use a replay harness, which is the primary focus of the changeset across multiple test files and supporting infrastructure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch refactoring-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/recordapi/main.go`:
- Around line 95-111: The routes "responses" and "responses_stream" currently
build requests unconditionally; add a provider-endpoint capability check before
constructing/sending these requests so unsupported providers return a clear
local error. In cmd/recordapi/main.go update the logic that registers/handles
"responses" and "responses_stream" (the map entries shown) to call the provider
capability check (e.g., Provider.SupportsEndpoint or
Provider.Capabilities/Has("responses")) and short-circuit with a 400/501 error
if the provider does not support "/v1/responses"; ensure both the non-stream and
stream handlers use the same validation and return a concise error message
mentioning the missing "responses" capability.

In `@internal/providers/gemini/gemini.go`:
- Around line 154-162: The models auth header function `headers` currently uses
q.Add and doesn't forward trace headers; make it idempotent by using
q.Set("key", p.apiKey) instead of q.Add so repeated calls replace the key rather
than appending, and copy the incoming "X-Request-Id" into the outgoing request
headers (matching the behavior in `setHeaders`) to preserve trace propagation;
update `headers` to read any existing X-Request-Id from the incoming context/req
and set it on req.Header before returning.
- Around line 152-153: ListModels currently constructs a fresh client via
llmclient.DefaultConfig("gemini", p.modelsURL) and only copies p.hooks, which
bypasses the provider's configured retry/circuit-breaker/resilience settings;
change the implementation to reuse the provider's already-configured llmclient
(or derive the models config from the provider's existing client/config) instead
of DefaultConfig so the retry and circuit-breaker options are preserved—e.g.,
obtain the config or client from p (referencing p.modelsURL, p.hooks, modelsCfg,
llmclient.DefaultConfig) and call ListModels through that existing client or
cloned config that carries over the resilience settings (apply same change for
the other occurrence at lines 164-169).

In `@tests/contract/anthropic_test.go`:
- Around line 27-141: Restore golden-file validation in the Anthropic contract
replay tests by replacing/augmenting the current shallow assertions in
TestAnthropicReplayChatCompletion, TestAnthropicReplayStreamChatCompletion,
TestAnthropicReplayResponses, and TestAnthropicReplayStreamResponses with full
normalized output comparisons against committed golden fixtures (use the
existing json/sse fixture routes as the golden artifacts), add the required
build tag comment (`//go:build contract` / `// +build contract`) and a 5-minute
test timeout comment at the top of the file, and ensure stream outputs are
serialized/normalized before comparing; record or update the golden files by
running make record-api after changes.

In `@tests/contract/gemini_test.go`:
- Around line 18-26: The helper newGeminiReplayProvider duplicates common setup;
extract its body into a shared test helper (e.g., newReplayProvider or
newGeminiReplayProvider) in the shared replay harness test file and have
existing provider contract tests call that shared helper. Move the creation
logic that calls newReplayHTTPClient(t, routes),
gemini.NewWithHTTPClient("test-api-key", client, llmclient.Hooks{}), and the
provider.SetBaseURL/SetModelsURL calls into the shared helper while preserving
the function signature (t *testing.T, routes map[string]replayRoute) and return
type core.Provider; update all places that currently define this helper to
import/use the shared helper and remove the duplicate local definitions. Ensure
the shared helper has the same package scope and required imports so tests
compile.
- Around line 28-155: The tests in TestGeminiReplay* only assert selected
fields; update each test (TestGeminiReplayChatCompletion,
TestGeminiReplayStreamChatCompletion, TestGeminiReplayListModels,
TestGeminiReplayResponses, TestGeminiReplayStreamResponses) to perform
normalized golden-file validation: serialize the full normalized response
output, compare against a committed golden file, and fail if different; provide
a way to re-record goldens via make record-api. Also add the required build tag
and 5-minute timeout to the top of the file (per tests/contract/**/*_test.go
guidelines) so the file uses -tags=contract and the test harness timeout. Ensure
stream tests write/normalize stream output before comparing to their respective
golden files.

In `@tests/contract/groq_test.go`:
- Around line 27-154: The tests (TestGroqReplayChatCompletion,
TestGroqReplayStreamChatCompletion, TestGroqReplayListModels,
TestGroqReplayResponses, TestGroqReplayStreamResponses) currently assert
individual fields; replace those field-level checks with full golden-file
validation: produce a normalized representation of the response/stream output
and compare it against a golden artifact (use the repo's golden helpers and
normalization used by other contract tests), add the required `// +build
contract` (or `//go:build contract`) build tag and set the test timeout to 5
minutes, and update/record the new golden files by running `make record-api` so
the tests validate complete outputs rather than only selective fields.

In `@tests/contract/openai_test.go`:
- Around line 114-116: The test currently skips when required golden fixtures
are missing (uses goldenFileExists(...) and t.Skip), which lets CI pass
silently; change the conditional to fail the test instead by replacing
t.Skip(...) with t.Fatalf(...) or t.Fatal(...) including a message that names
the missing fixture ("openai/responses.json") and instructs to record via "make
record-api"; apply the same replacement for the other occurrence that also
checks goldenFileExists in this file so contract tests fail CI when fixtures are
absent.
- Around line 27-112: Tests are only asserting selected fields; update
TestOpenAIReplayChatCompletion, TestOpenAIReplayStreamChatCompletion, and
TestOpenAIReplayListModels to perform full normalized output vs. golden-file
validation using the repository's golden helpers (e.g., produce a normalized
response and call the compare/validate golden helper) and add the required
contract build tag comment at the top of the file (//go:build contract) and
ensure the test run is documented to use a 5-minute timeout (tests are intended
to be run with -tags=contract and -timeout=5m); record any new golden artifacts
by running make record-api and commit those golden files.
- Around line 153-160: The test TestOpenAIReplayStreamResponses currently checks
for response events but not that the stream terminated; update the assertions
after parseResponsesStream to also verify the stream termination by asserting
the presence of the "[DONE]" termination marker (e.g. require.Contains(raw,
"[DONE]") or assert.True(t, hasResponsesEvent(events, "[DONE]"))), keeping the
existing checks (readAllStream, parseResponsesStream, hasResponsesEvent for
"response.created"/"response.output_text.delta"/"response.completed" and
extractResponsesStreamText) intact.

In `@tests/contract/README.md`:
- Around line 34-46: Update the tests/contract/README.md to align commands with
the contract-test policy: change the contract test run instruction to include
the build tag and five-minute timeout (use "-tags=contract" and "-timeout=5m")
and document the canonical golden-file recording command as "make record-api"
(or reference cmd/recordapi) for updating fixtures; also state that contract
tests under tests/contract/**/*_test.go must use golden file validation and be
run with those flags so local and CI workflows remain consistent.

In `@tests/contract/replay_harness_test.go`:
- Around line 115-125: The flush closure currently appends one sseEvent per
entry in dataLines, causing multi-line SSE `data:` blocks to become multiple
events; update flush (referencing flush, dataLines, events, sseEvent, and
currentName) to join all entries in dataLines with newline separators into a
single sseEvent.Data (e.g., strings.Join(dataLines, "\n")), append that single
event with Name set to currentName to events, then reset currentName and clear
dataLines so multi-line SSE payloads are emitted as one event.

In `@tests/contract/xai_test.go`:
- Around line 27-98: Add golden-file validation and contract build tag/timeout:
add the contract build tag header to this test file (//go:build contract) and
ensure tests run under the 5-minute contract timeout when executing (run with
-timeout=5m in CI or document running via make record-api). Replace the current
field-level assertions in TestXAIReplayChatCompletion,
TestXAIReplayStreamChatCompletion, and TestXAIReplayListModels with golden
comparisons that serialize/normalize the full responses (resp from
ChatCompletion/ListModels and the reconstructed stream chunks from
StreamChatCompletion) and compare them against dedicated golden files (e.g.,
testdata/golden/xai/*.golden.json); if a RECORD flag is set (or when running
make record-api) update/write those golden files. Use existing helpers like
newXAIReplayProvider, jsonFixtureRoute, sseFixtureRoute, readAllStream,
parseChatStream and extractChatStreamText to obtain the actual outputs to
serialize, and fail the test if the normalized output differs from the golden
artifact.
- Around line 101-103: Replace the t.Skip branch that hides missing golden
fixtures with a failing assertion: in tests/contract/xai_test.go (the test
function using goldenFileExists), remove the t.Skip("golden file not found -
record xai responses first") and instead call t.Fatalf with a clear message that
the golden file xai/responses.json is missing and instruct contributors to
regenerate fixtures (e.g., "missing golden file xai/responses.json; run `make
record-api` to create/update contract fixtures"). Apply the same change to the
other checks around the same test (the occurrences noted at lines ~126-128) so
contract tests fail-fast when fixtures are absent.
- Around line 140-148: The test reads the raw stream via readAllStream and
parses events but doesn't assert the stream termination marker; update
TestXAIReplayStreamResponses to assert the terminal "[DONE]" marker is present
in the raw stream (use the raw variable returned from readAllStream) — e.g., add
a require.Contains(raw, "[DONE]") or require.True(strings.HasSuffix(raw,
"[DONE]\n")) after events are parsed to ensure StreamResponses emits the final
"[DONE]" sentinel; keep existing checks using parseResponsesStream,
hasResponsesEvent, and extractResponsesStreamText.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 07900ae and be99a6d.

📒 Files selected for processing (15)
  • .github/workflows/test.yml
  • cmd/recordapi/main.go
  • docs/TESTING_STRATEGY.md
  • internal/providers/gemini/gemini.go
  • tests/contract/README.md
  • tests/contract/anthropic_test.go
  • tests/contract/gemini_test.go
  • tests/contract/groq_test.go
  • tests/contract/openai_test.go
  • tests/contract/replay_harness_test.go
  • tests/contract/testdata/openai/responses.json
  • tests/contract/testdata/openai/responses_stream.txt
  • tests/contract/testdata/xai/responses.json
  • tests/contract/testdata/xai/responses_stream.txt
  • tests/contract/xai_test.go

Comment on lines 152 to 153
modelsCfg := llmclient.DefaultConfig("gemini", p.modelsURL)
modelsCfg.Hooks = p.hooks
Contributor


⚠️ Potential issue | 🟠 Major

ListModels bypasses configured retry/circuit-breaker settings.

This code creates a fresh client from llmclient.DefaultConfig and only reuses hooks, so resilience options passed through provider construction are not applied to model-list requests.

Also applies to: 164-169

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/providers/gemini/gemini.go` around lines 152 - 153, ListModels
currently constructs a fresh client via llmclient.DefaultConfig("gemini",
p.modelsURL) and only copies p.hooks, which bypasses the provider's configured
retry/circuit-breaker/resilience settings; change the implementation to reuse
the provider's already-configured llmclient (or derive the models config from
the provider's existing client/config) instead of DefaultConfig so the retry and
circuit-breaker options are preserved—e.g., obtain the config or client from p
(referencing p.modelsURL, p.hooks, modelsCfg, llmclient.DefaultConfig) and call
ListModels through that existing client or cloned config that carries over the
resilience settings (apply same change for the other occurrence at lines
164-169).

Comment on lines +154 to +162
headers := func(req *http.Request) {
// Add API key as query parameter.
// NOTE: Passing the API key in the URL query parameter is required by Google's native Gemini API for the models endpoint.
// This may be a security concern, as the API key can be logged in server access logs, proxy logs, and browser history.
// See: https://cloud.google.com/vertex-ai/docs/generative-ai/model-parameters#api-key
q := req.URL.Query()
q.Add("key", p.apiKey)
req.URL.RawQuery = q.Encode()
}
Contributor


⚠️ Potential issue | 🟡 Minor

Make models auth setup idempotent and keep trace propagation.

Use q.Set("key", ...) instead of q.Add(...) and forward X-Request-Id here as well, matching setHeaders behavior.

Suggested fix
 headers := func(req *http.Request) {
-  q := req.URL.Query()
-  q.Add("key", p.apiKey)
-  req.URL.RawQuery = q.Encode()
+  q := req.URL.Query()
+  q.Set("key", p.apiKey)
+  req.URL.RawQuery = q.Encode()
+
+  if requestID := core.GetRequestID(req.Context()); requestID != "" {
+    req.Header.Set("X-Request-Id", requestID)
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/providers/gemini/gemini.go` around lines 154 - 162, The models auth
header function `headers` currently uses q.Add and doesn't forward trace
headers; make it idempotent by using q.Set("key", p.apiKey) instead of q.Add so
repeated calls replace the key rather than appending, and copy the incoming
"X-Request-Id" into the outgoing request headers (matching the behavior in
`setHeaders`) to preserve trace propagation; update `headers` to read any
existing X-Request-Id from the incoming context/req and set it on req.Header
before returning.

Comment on lines +27 to 141
func TestAnthropicReplayChatCompletion(t *testing.T) {
	testCases := []struct {
		name         string
		fixturePath  string
		finishReason string
	}{
		{name: "basic", fixturePath: "anthropic/messages.json"},
		{name: "with-params", fixturePath: "anthropic/messages_with_params.json"},
		{name: "with-tools", fixturePath: "anthropic/messages_with_tools.json", finishReason: "tool_use"},
		{name: "extended-thinking", fixturePath: "anthropic/messages_extended_thinking.json"},
		{name: "multi-turn", fixturePath: "anthropic/messages_multi_turn.json"},
		{name: "multimodal", fixturePath: "anthropic/messages_multimodal.json"},
	}

	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			provider := newAnthropicReplayProvider(t, map[string]replayRoute{
				replayKey(http.MethodPost, "/messages"): jsonFixtureRoute(t, tc.fixturePath),
			})

			resp, err := provider.ChatCompletion(context.Background(), &core.ChatRequest{
				Model: "claude-sonnet-4-20250514",
				Messages: []core.Message{{
					Role:    "user",
					Content: "hello",
				}},
			})
			require.NoError(t, err)
			require.NotNil(t, resp)

			assert.NotEmpty(t, resp.ID)
			assert.Equal(t, "chat.completion", resp.Object)
			require.NotEmpty(t, resp.Choices)
			assert.Equal(t, "assistant", resp.Choices[0].Message.Role)
			assert.NotEmpty(t, resp.Choices[0].FinishReason)
			if tc.finishReason != "" {
				assert.Equal(t, tc.finishReason, resp.Choices[0].FinishReason)
			}
			assert.NotEmpty(t, resp.Choices[0].Message.Content)
		})
	}
}

func TestAnthropicReplayStreamChatCompletion(t *testing.T) {
	provider := newAnthropicReplayProvider(t, map[string]replayRoute{
		replayKey(http.MethodPost, "/messages"): sseFixtureRoute(t, "anthropic/messages_stream.txt"),
	})

	stream, err := provider.StreamChatCompletion(context.Background(), &core.ChatRequest{
		Model: "claude-sonnet-4-20250514",
		Messages: []core.Message{{
			Role:    "user",
			Content: "stream",
		}},
	})
	require.NoError(t, err)

	raw := readAllStream(t, stream)
	chunks, done := parseChatStream(t, raw)

	require.True(t, done, "stream should terminate with [DONE]")
	require.NotEmpty(t, chunks)
	assert.NotEmpty(t, extractChatStreamText(chunks))
}

func TestAnthropicReplayResponses(t *testing.T) {
	provider := newAnthropicReplayProvider(t, map[string]replayRoute{
		replayKey(http.MethodPost, "/messages"): jsonFixtureRoute(t, "anthropic/messages.json"),
	})

	resp, err := provider.Responses(context.Background(), &core.ResponsesRequest{
		Model: "claude-sonnet-4-20250514",
		Input: "hello",
	})
	require.NoError(t, err)
	require.NotNil(t, resp)

	assert.Equal(t, "response", resp.Object)
	assert.Equal(t, "completed", resp.Status)
	require.NotEmpty(t, resp.Output)
	require.NotEmpty(t, resp.Output[0].Content)
	assert.NotEmpty(t, resp.Output[0].Content[0].Text)
	require.NotNil(t, resp.Usage)
	assert.GreaterOrEqual(t, resp.Usage.TotalTokens, 0)
}

func TestAnthropicReplayStreamResponses(t *testing.T) {
	provider := newAnthropicReplayProvider(t, map[string]replayRoute{
		replayKey(http.MethodPost, "/messages"): sseFixtureRoute(t, "anthropic/messages_stream.txt"),
	})

	stream, err := provider.StreamResponses(context.Background(), &core.ResponsesRequest{
		Model: "claude-sonnet-4-20250514",
		Input: "stream",
	})
	require.NoError(t, err)

	raw := readAllStream(t, stream)
	events := parseResponsesStream(t, raw)
	require.NotEmpty(t, events)

	assert.True(t, hasResponsesEvent(events, "response.created"))
	assert.True(t, hasResponsesEvent(events, "response.output_text.delta"))
	assert.True(t, hasResponsesEvent(events, "response.completed"))
	assert.NotEmpty(t, extractResponsesStreamText(events))

	hasDone := false
	for _, event := range events {
		if event.Done {
			hasDone = true
			break
		}
	}
	assert.True(t, hasDone, "responses stream should terminate with [DONE]")
}

(Removed in this hunk: the old field-level TestAnthropic_* tests, their goldenFileExists/t.Skip guards, and the AnthropicToolUseResponse, AnthropicThinkingResponse, and related helper types.)

⚠️ Potential issue | 🟠 Major

Restore golden-file validation in Anthropic contract replay tests.

These tests assert only selected fields and do not validate normalized outputs against committed golden artifacts, which weakens contract regression detection.

As per coding guidelines tests/contract/**/*_test.go: Contract tests must use golden file validation and include -tags=contract build tag with 5-minute timeout. Record new golden files using make record-api.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/anthropic_test.go` around lines 27 - 141, Restore golden-file
validation in the Anthropic contract replay tests by replacing/augmenting the
current shallow assertions in TestAnthropicReplayChatCompletion,
TestAnthropicReplayStreamChatCompletion, TestAnthropicReplayResponses, and
TestAnthropicReplayStreamResponses with full normalized output comparisons
against committed golden fixtures (use the existing json/sse fixture routes as
the golden artifacts), add the required build tag comment (`//go:build contract`
/ `// +build contract`) and a 5-minute test timeout comment at the top of the
file, and ensure stream outputs are serialized/normalized before comparing;
record or update the golden files by running make record-api after changes.

Comment on lines 18 to 26
+func newGeminiReplayProvider(t *testing.T, routes map[string]replayRoute) core.Provider {
+	t.Helper()
+
-	t.Run("ModelPrefix", func(t *testing.T) {
-		// Gemini models typically contain "gemini"
-		assert.Contains(t, resp.Model, "gemini", "Gemini model should contain 'gemini'")
-	})
+	client := newReplayHTTPClient(t, routes)
+	provider := gemini.NewWithHTTPClient("test-api-key", client, llmclient.Hooks{})
+	provider.SetBaseURL("https://replay.local")
+	provider.SetModelsURL("https://replay.local")
+	return provider
+}

🧹 Nitpick | 🔵 Trivial

Extract common replay-provider setup to shared test helper.

newGeminiReplayProvider duplicates the same setup pattern used across provider contract files; centralizing this in tests/contract/replay_harness_test.go would reduce drift and simplify future changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/gemini_test.go` around lines 18 - 26, The helper
newGeminiReplayProvider duplicates common setup; extract its body into a shared
test helper (e.g., newReplayProvider or newGeminiReplayProvider) in the shared
replay harness test file and have existing provider contract tests call that
shared helper. Move the creation logic that calls newReplayHTTPClient(t,
routes), gemini.NewWithHTTPClient("test-api-key", client, llmclient.Hooks{}),
and the provider.SetBaseURL/SetModelsURL calls into the shared helper while
preserving the function signature (t *testing.T, routes map[string]replayRoute)
and return type core.Provider; update all places that currently define this
helper to import/use the shared helper and remove the duplicate local
definitions. Ensure the shared helper has the same package scope and required
imports so tests compile.

Comment on lines +28 to 155
+func TestGeminiReplayChatCompletion(t *testing.T) {
+	testCases := []struct {
+		name          string
+		fixturePath   string
+		expectContent bool
+	}{
+		{name: "basic", fixturePath: "gemini/chat_completion.json", expectContent: true},
+		{name: "params", fixturePath: "gemini/chat_with_params.json", expectContent: true},
+		{name: "tools", fixturePath: "gemini/chat_with_tools.json", expectContent: false},
+	}
+
-	resp := loadGoldenFile[core.ModelsResponse](t, "gemini/models.json")
-
-	// Validate required fields
-	assert.Equal(t, "list", resp.Object, "object should be 'list'")
-	assert.NotEmpty(t, resp.Data, "models list should not be empty")
-
-	// Validate each model structure
-	for i, model := range resp.Data {
-		assert.NotEmpty(t, model.ID, "model %d: ID should not be empty", i)
-		assert.Equal(t, "model", model.Object, "model %d: object should be 'model'", i)
-	}
-
-	// Check for some expected models (Gemini variants)
-	modelIDs := make(map[string]bool)
-	for _, model := range resp.Data {
-		modelIDs[model.ID] = true
-	}
-
-	// At least one Gemini model should exist
-	hasGemini := false
-	for id := range modelIDs {
-		if hasPrefixIgnoreModels(id, "gemini") {
-			hasGemini = true
-			break
-		}
+	for _, tc := range testCases {
+		t.Run(tc.name, func(t *testing.T) {
+			provider := newGeminiReplayProvider(t, map[string]replayRoute{
+				replayKey(http.MethodPost, "/chat/completions"): jsonFixtureRoute(t, tc.fixturePath),
+			})
+
+			resp, err := provider.ChatCompletion(context.Background(), &core.ChatRequest{
+				Model: "gemini-2.5-flash",
+				Messages: []core.Message{{
+					Role:    "user",
+					Content: "hello",
+				}},
+			})
+			require.NoError(t, err)
+			require.NotNil(t, resp)
+
+			assert.NotEmpty(t, resp.ID)
+			assert.Equal(t, "chat.completion", resp.Object)
+			require.NotEmpty(t, resp.Choices)
+			assert.Equal(t, "assistant", resp.Choices[0].Message.Role)
+			assert.NotEmpty(t, resp.Choices[0].FinishReason)
+			if tc.expectContent {
+				assert.NotEmpty(t, resp.Choices[0].Message.Content)
+			}
+		})
+	}
-	assert.True(t, hasGemini, "expected at least one Gemini model in models list")
 }

-func TestGemini_StreamingFormat_Contract(t *testing.T) {
-	if !goldenFileExists(t, "gemini/chat_completion_stream.txt") {
-		t.Skip("golden file not found - run 'make record-api' to generate")
-	}
+func TestGeminiReplayStreamChatCompletion(t *testing.T) {
+	provider := newGeminiReplayProvider(t, map[string]replayRoute{
+		replayKey(http.MethodPost, "/chat/completions"): sseFixtureRoute(t, "gemini/chat_completion_stream.txt"),
+	})

-	data := loadGoldenFileRaw(t, "gemini/chat_completion_stream.txt")
+	stream, err := provider.StreamChatCompletion(context.Background(), &core.ChatRequest{
+		Model: "gemini-2.5-flash",
+		Messages: []core.Message{{
+			Role:    "user",
+			Content: "stream",
+		}},
+	})
+	require.NoError(t, err)

-	// Streaming responses should be in SSE format
-	assert.Contains(t, string(data), "data:", "streaming response should contain SSE data lines")
+	raw := readAllStream(t, stream)
+	chunks, done := parseChatStream(t, raw)

-	// Should end with [DONE]
-	assert.Contains(t, string(data), "[DONE]", "streaming response should end with [DONE]")
+	require.True(t, done, "stream should terminate with [DONE]")
+	require.NotEmpty(t, chunks)
+	assert.NotEmpty(t, extractChatStreamText(chunks))
 }

-func TestGemini_ChatWithParams(t *testing.T) {
-	if !goldenFileExists(t, "gemini/chat_with_params.json") {
-		t.Skip("golden file not found - run 'make record-api' to generate")
+func TestGeminiReplayListModels(t *testing.T) {
+	provider := newGeminiReplayProvider(t, map[string]replayRoute{
+		replayKey(http.MethodGet, "/models?key=test-api-key"): jsonFixtureRoute(t, "gemini/models.json"),
+	})

+	resp, err := provider.ListModels(context.Background())
+	require.NoError(t, err)
+	require.NotNil(t, resp)

+	assert.Equal(t, "list", resp.Object)
+	require.NotEmpty(t, resp.Data)
+	for _, model := range resp.Data {
+		assert.NotEmpty(t, model.ID)
+		assert.Equal(t, "model", model.Object)
+	}
 }

-	resp := loadGoldenFile[core.ChatResponse](t, "gemini/chat_with_params.json")
+func TestGeminiReplayResponses(t *testing.T) {
+	provider := newGeminiReplayProvider(t, map[string]replayRoute{
+		replayKey(http.MethodPost, "/chat/completions"): jsonFixtureRoute(t, "gemini/chat_completion.json"),
+	})

-	t.Run("Contract", func(t *testing.T) {
-		assert.NotEmpty(t, resp.ID, "response ID should not be empty")
-		assert.Equal(t, "chat.completion", resp.Object, "object should be chat.completion")
-		require.NotEmpty(t, resp.Choices, "choices should not be empty")
-		assert.Equal(t, "assistant", resp.Choices[0].Message.Role, "message role should be assistant")
+	resp, err := provider.Responses(context.Background(), &core.ResponsesRequest{
+		Model: "gemini-2.5-flash",
+		Input: "hello",
+	})
+	require.NoError(t, err)
+	require.NotNil(t, resp)

+	assert.Equal(t, "response", resp.Object)
+	assert.Equal(t, "completed", resp.Status)
+	require.NotEmpty(t, resp.Output)
+	require.NotEmpty(t, resp.Output[0].Content)
+	assert.NotEmpty(t, resp.Output[0].Content[0].Text)
+	require.NotNil(t, resp.Usage)
+	assert.GreaterOrEqual(t, resp.Usage.TotalTokens, 0)
 }

-func TestGemini_ChatWithTools(t *testing.T) {
-	if !goldenFileExists(t, "gemini/chat_with_tools.json") {
-		t.Skip("golden file not found - run 'make record-api' to generate")
-	}
+func TestGeminiReplayStreamResponses(t *testing.T) {
+	provider := newGeminiReplayProvider(t, map[string]replayRoute{
+		replayKey(http.MethodPost, "/chat/completions"): sseFixtureRoute(t, "gemini/chat_completion_stream.txt"),
+	})

-	resp := loadGoldenFile[ToolCallResponse](t, "gemini/chat_with_tools.json")
+	stream, err := provider.StreamResponses(context.Background(), &core.ResponsesRequest{
+		Model: "gemini-2.5-flash",
+		Input: "stream",
+	})
+	require.NoError(t, err)

-	t.Run("Contract", func(t *testing.T) {
-		assert.NotEmpty(t, resp.ID, "response ID should not be empty")
-		assert.Equal(t, "chat.completion", resp.Object, "object should be chat.completion")
-		require.NotEmpty(t, resp.Choices, "choices should not be empty")
+	raw := readAllStream(t, stream)
+	events := parseResponsesStream(t, raw)
+	require.NotEmpty(t, events)

-		choice := resp.Choices[0]
-		// Tool calls may be present if the model chose to use tools
-		if len(choice.Message.ToolCalls) > 0 {
-			for i, tc := range choice.Message.ToolCalls {
-				assert.NotEmpty(t, tc.ID, "tool call %d: ID should not be empty", i)
-				assert.Equal(t, "function", tc.Type, "tool call %d: type should be 'function'", i)
-				assert.NotEmpty(t, tc.Function.Name, "tool call %d: function name should not be empty", i)
-			}
-		}
-	})
-}
+	assert.True(t, hasResponsesEvent(events, "response.created"))
+	assert.True(t, hasResponsesEvent(events, "response.output_text.delta"))
+	assert.True(t, hasResponsesEvent(events, "response.completed"))
+	assert.NotEmpty(t, extractResponsesStreamText(events))

-// hasPrefixIgnoreModels checks if a string has a given prefix, after stripping "models/" prefix
-func hasPrefixIgnoreModels(s, prefix string) bool {
-	s = strings.TrimPrefix(s, "models/")
-	return strings.HasPrefix(s, prefix)
+	hasDone := false
+	for _, event := range events {
+		if event.Done {
+			hasDone = true
+			break
+		}
+	}
+	assert.True(t, hasDone, "responses stream should terminate with [DONE]")
 }

⚠️ Potential issue | 🟠 Major

Gemini contract replay tests should include golden-file validation.

The suite currently checks selected fields only and does not compare normalized outputs against committed golden artifacts.

As per coding guidelines tests/contract/**/*_test.go: Contract tests must use golden file validation and include -tags=contract build tag with 5-minute timeout. Record new golden files using make record-api.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/gemini_test.go` around lines 28 - 155, The tests in
TestGeminiReplay* only assert selected fields; update each test
(TestGeminiReplayChatCompletion, TestGeminiReplayStreamChatCompletion,
TestGeminiReplayListModels, TestGeminiReplayResponses,
TestGeminiReplayStreamResponses) to perform normalized golden-file validation:
serialize the full normalized response output, compare against a committed
golden file, and fail if different; provide a way to re-record goldens via make
record-api. Also add the required build tag and 5-minute timeout to the top of
the file (per tests/contract/**/*_test.go guidelines) so the file uses
-tags=contract and the test harness timeout. Ensure stream tests write/normalize
stream output before comparing to their respective golden files.

Comment on lines +34 to +46
 ## Running

 ```bash
-# Basic chat completion
-curl https://api.groq.com/openai/v1/chat/completions \
-  -H "Authorization: Bearer $GROQ_API_KEY" \
-  -H "Content-Type: application/json" \
-  -d '{"model":"llama-3.3-70b-versatile","messages":[{"role":"user","content":"Say Hello World"}],"max_tokens":50}'
-
-# Streaming
-curl https://api.groq.com/openai/v1/chat/completions \
-  -H "Authorization: Bearer $GROQ_API_KEY" \
-  -H "Content-Type: application/json" \
-  -d '{"model":"llama-3.3-70b-versatile","messages":[{"role":"user","content":"Say Hello World"}],"max_tokens":50,"stream":true}'
-
-# Models list
-curl https://api.groq.com/openai/v1/models \
-  -H "Authorization: Bearer $GROQ_API_KEY"
-
-# With parameters
-curl https://api.groq.com/openai/v1/chat/completions \
-  -H "Authorization: Bearer $GROQ_API_KEY" \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model":"llama-3.3-70b-versatile",
-    "messages":[
-      {"role":"system","content":"You are concise."},
-      {"role":"user","content":"Name 3 planets"}
-    ],
-    "temperature":0.5,
-    "max_tokens":50
-  }'
+# Run contract replay tests
+go test -v -tags=contract ./tests/contract/...

-# Function calling
-curl https://api.groq.com/openai/v1/chat/completions \
-  -H "Authorization: Bearer $GROQ_API_KEY" \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model":"llama-3.3-70b-versatile",
-    "messages":[{"role":"user","content":"What is the weather in London?"}],
-    "tools":[{"type":"function","function":{"name":"get_weather","parameters":{"type":"object","properties":{"location":{"type":"string"}}}}}],
-    "max_tokens":200
-  }'
+# Make target
+make test-contract
 ```

-## Models Used
-
-| Provider  | Model                            | Notes                                |
-| --------- | -------------------------------- | ------------------------------------ |
-| OpenAI    | gpt-4o-mini                      | Standard chat model                  |
-| OpenAI    | o3-mini                          | Reasoning model with thinking tokens |
-| Anthropic | claude-sonnet-4-20250514         | Latest Claude model                  |
-| Gemini    | gemini-2.5-flash-preview-09-2025 | Preview flash model                  |
-| xAI       | grok-3-mini                      | Latest Grok model                    |
-| Groq      | llama-3.3-70b-versatile          | Fast inference Llama model           |
-
-## Test Coverage
+## Updating fixtures

-| Feature        | OpenAI | Anthropic | Gemini | xAI | Groq |
-| -------------- | ------ | --------- | ------ | --- | ---- |
-| Basic chat     | ✅     | ✅        | ✅     | ✅  | ✅   |
-| Streaming      | ✅     | ✅        | ✅     | ✅  | ✅   |
-| Models list    | ✅     | -         | ✅     | ✅  | ✅   |
-| Tool calling   | ✅     | ✅        | ✅     | -   | ✅   |
-| System prompt  | ✅     | ✅        | ✅     | ✅  | ✅   |
-| Temperature    | ✅     | ✅        | ✅     | ✅  | ✅   |
-| Stop sequences | ✅     | ✅        | -      | -   | -    |
-| Multi-turn     | ✅     | ✅        | -      | -   | -    |
-| Multimodal     | ✅     | ✅        | -      | -   | -    |
-| JSON mode      | ✅     | -         | -      | -   | -    |
-| Reasoning      | ✅     | ✅        | -      | -   | -    |
+Use `cmd/recordapi` (or provider curl calls) to refresh payloads when provider contracts change, then re-run the contract suite.

⚠️ Potential issue | 🟡 Minor

Align README commands with the contract-test policy.

The run/update instructions should include the required timeout and the canonical fixture-recording command to keep local and CI workflows consistent.

Suggested doc patch
 # Run contract replay tests
-go test -v -tags=contract ./tests/contract/...
+go test -v -tags=contract -timeout=5m ./tests/contract/...
 
 # Make target
 make test-contract
@@
-Use `cmd/recordapi` (or provider curl calls) to refresh payloads when provider contracts change, then re-run the contract suite.
+Use `make record-api` (or `cmd/recordapi`) to refresh payloads when provider contracts change, then re-run the contract suite.

As per coding guidelines tests/contract/**/*_test.go: Contract tests must use golden file validation and include -tags=contract build tag with 5-minute timeout. Record new golden files using make record-api.
Based on learnings: Applies to tests/contract/README.md — update tests/contract/README.md when modifying contract testing procedures or golden file validation processes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/README.md` around lines 34 - 46, Update the
tests/contract/README.md to align commands with the contract-test policy: change
the contract test run instruction to include the build tag and five-minute
timeout (use "-tags=contract" and "-timeout=5m") and document the canonical
golden-file recording command as "make record-api" (or reference cmd/recordapi)
for updating fixtures; also state that contract tests under
tests/contract/**/*_test.go must use golden file validation and be run with
those flags so local and CI workflows remain consistent.

Comment on lines +115 to +125
flush := func() {
if len(dataLines) == 0 {
currentName = ""
return
}
for _, data := range dataLines {
events = append(events, sseEvent{
Name: currentName,
Data: data,
})
}

⚠️ Potential issue | 🟡 Minor

Assemble multi-line SSE data: blocks into one event payload.

Current flush logic emits one event per data: line; multi-line SSE events should be emitted as a single event with newline-joined data.

Suggested fix
 flush := func() {
   if len(dataLines) == 0 {
     currentName = ""
     return
   }
-  for _, data := range dataLines {
-    events = append(events, sseEvent{
-      Name: currentName,
-      Data: data,
-    })
-  }
+  events = append(events, sseEvent{
+    Name: currentName,
+    Data: strings.Join(dataLines, "\n"),
+  })
   currentName = ""
   dataLines = dataLines[:0]
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-flush := func() {
-	if len(dataLines) == 0 {
-		currentName = ""
-		return
-	}
-	for _, data := range dataLines {
-		events = append(events, sseEvent{
-			Name: currentName,
-			Data: data,
-		})
-	}
+flush := func() {
+	if len(dataLines) == 0 {
+		currentName = ""
+		return
+	}
+	events = append(events, sseEvent{
+		Name: currentName,
+		Data: strings.Join(dataLines, "\n"),
+	})
+	currentName = ""
+	dataLines = dataLines[:0]
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/replay_harness_test.go` around lines 115 - 125, The flush
closure currently appends one sseEvent per entry in dataLines, causing
multi-line SSE `data:` blocks to become multiple events; update flush
(referencing flush, dataLines, events, sseEvent, and currentName) to join all
entries in dataLines with newline separators into a single sseEvent.Data (e.g.,
strings.Join(dataLines, "\n")), append that single event with Name set to
currentName to events, then reset currentName and clear dataLines so multi-line
SSE payloads are emitted as one event.

Comment on lines +27 to 98
+func TestXAIReplayChatCompletion(t *testing.T) {
+	testCases := []struct {
+		name        string
+		fixturePath string
+	}{
+		{name: "basic", fixturePath: "xai/chat_completion.json"},
+		{name: "params", fixturePath: "xai/chat_with_params.json"},
+	}
+
+	for _, tc := range testCases {
+		t.Run(tc.name, func(t *testing.T) {
+			provider := newXAIReplayProvider(t, map[string]replayRoute{
+				replayKey(http.MethodPost, "/chat/completions"): jsonFixtureRoute(t, tc.fixturePath),
+			})
+
+			resp, err := provider.ChatCompletion(context.Background(), &core.ChatRequest{
+				Model: "grok-3-mini",
+				Messages: []core.Message{{
+					Role:    "user",
+					Content: "hello",
+				}},
+			})
+			require.NoError(t, err)
+			require.NotNil(t, resp)
+
+			assert.NotEmpty(t, resp.ID)
+			assert.Equal(t, "chat.completion", resp.Object)
+			require.NotEmpty(t, resp.Choices)
+			assert.Equal(t, "assistant", resp.Choices[0].Message.Role)
+			assert.NotEmpty(t, resp.Choices[0].Message.Content)
+		})
+	}
+}

-	resp := loadGoldenFile[core.ChatResponse](t, "xai/chat_completion.json")
-
-	t.Run("Contract", func(t *testing.T) {
-		// Validate required fields exist (structure validation)
-		assert.NotEmpty(t, resp.ID, "response ID should not be empty")
-		assert.NotEmpty(t, resp.Object, "response object should not be empty")
-		assert.Equal(t, "chat.completion", resp.Object, "object should be chat.completion")
-		assert.NotEmpty(t, resp.Model, "model should not be empty")
-		assert.NotZero(t, resp.Created, "created timestamp should not be zero")
-
-		// Validate choices structure
-		require.NotEmpty(t, resp.Choices, "choices should not be empty")
-		choice := resp.Choices[0]
-		assert.GreaterOrEqual(t, choice.Index, 0, "choice index should be >= 0")
-		assert.NotNil(t, choice.Message, "choice message should not be nil")
-		assert.NotEmpty(t, choice.Message.Role, "message role should not be empty")
-		assert.Equal(t, "assistant", choice.Message.Role, "message role should be assistant")
-		assert.NotEmpty(t, choice.FinishReason, "finish reason should not be empty")
-
-		// Validate usage structure
-		assert.GreaterOrEqual(t, resp.Usage.PromptTokens, 0, "prompt tokens should be >= 0")
-		assert.GreaterOrEqual(t, resp.Usage.CompletionTokens, 0, "completion tokens should be >= 0")
-		assert.GreaterOrEqual(t, resp.Usage.TotalTokens, 0, "total tokens should be >= 0")
+func TestXAIReplayStreamChatCompletion(t *testing.T) {
+	provider := newXAIReplayProvider(t, map[string]replayRoute{
+		replayKey(http.MethodPost, "/chat/completions"): sseFixtureRoute(t, "xai/chat_completion_stream.txt"),
+	})

-	t.Run("ModelPrefix", func(t *testing.T) {
-		// xAI models typically contain "grok"
-		assert.Contains(t, resp.Model, "grok", "xAI model should contain 'grok'")
+	stream, err := provider.StreamChatCompletion(context.Background(), &core.ChatRequest{
+		Model: "grok-3-mini",
+		Messages: []core.Message{{
+			Role:    "user",
+			Content: "stream",
+		}},
+	})
 	}
+	require.NoError(t, err)

-func TestXAI_ModelsResponse_Contract(t *testing.T) {
-	if !goldenFileExists(t, "xai/models.json") {
-		t.Skip("golden file not found - run 'make record-api' to generate")
-	}
+	raw := readAllStream(t, stream)
+	chunks, done := parseChatStream(t, raw)

-	resp := loadGoldenFile[core.ModelsResponse](t, "xai/models.json")
+	require.True(t, done, "stream should terminate with [DONE]")
+	require.NotEmpty(t, chunks)
+	assert.NotEmpty(t, extractChatStreamText(chunks))
 }

-	// Validate required fields
-	assert.Equal(t, "list", resp.Object, "object should be 'list'")
-	assert.NotEmpty(t, resp.Data, "models list should not be empty")
+func TestXAIReplayListModels(t *testing.T) {
+	provider := newXAIReplayProvider(t, map[string]replayRoute{
+		replayKey(http.MethodGet, "/models"): jsonFixtureRoute(t, "xai/models.json"),
+	})

-	// Validate each model structure
-	for i, model := range resp.Data {
-		assert.NotEmpty(t, model.ID, "model %d: ID should not be empty", i)
-		assert.Equal(t, "model", model.Object, "model %d: object should be 'model'", i)
-		assert.NotEmpty(t, model.OwnedBy, "model %d: owned_by should not be empty", i)
-	}
+	resp, err := provider.ListModels(context.Background())
+	require.NoError(t, err)
+	require.NotNil(t, resp)

-	// Check for some expected models (Grok variants)
-	modelIDs := make(map[string]bool)
+	assert.Equal(t, "list", resp.Object)
+	require.NotEmpty(t, resp.Data)
 	for _, model := range resp.Data {
-		modelIDs[model.ID] = true
-	}
-
-	// At least one Grok model should exist
-	hasGrok := false
-	for id := range modelIDs {
-		if strings.HasPrefix(id, "grok") {
-			hasGrok = true
-			break
-		}
+		assert.NotEmpty(t, model.ID)
+		assert.Equal(t, "model", model.Object)
 	}
-	assert.True(t, hasGrok, "expected at least one Grok model in models list")
 }

⚠️ Potential issue | 🟠 Major

Add golden-file assertions for XAI replay contract coverage.

The current checks are field-level only and do not validate full normalized outputs against golden artifacts.

As per coding guidelines tests/contract/**/*_test.go: Contract tests must use golden file validation and include -tags=contract build tag with 5-minute timeout. Record new golden files using make record-api.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/xai_test.go` around lines 27 - 98, Add golden-file validation
and contract build tag/timeout: add the contract build tag header to this test
file (//go:build contract) and ensure tests run under the 5-minute contract
timeout when executing (run with -timeout=5m in CI or document running via make
record-api). Replace the current field-level assertions in
TestXAIReplayChatCompletion, TestXAIReplayStreamChatCompletion, and
TestXAIReplayListModels with golden comparisons that serialize/normalize the
full responses (resp from ChatCompletion/ListModels and the reconstructed stream
chunks from StreamChatCompletion) and compare them against dedicated golden
files (e.g., testdata/golden/xai/*.golden.json); if a RECORD flag is set (or
when running make record-api) update/write those golden files. Use existing
helpers like newXAIReplayProvider, jsonFixtureRoute, sseFixtureRoute,
readAllStream, parseChatStream and extractChatStreamText to obtain the actual
outputs to serialize, and fail the test if the normalized output differs from
the golden artifact.

Comment on lines +101 to 103
if !goldenFileExists(t, "xai/responses.json") {
t.Skip("golden file not found - record xai responses first")
}

⚠️ Potential issue | 🟠 Major

Do not skip when required response fixtures are missing.

Skipping turns missing contract artifacts into green CI. Fail fast and direct contributors to regenerate fixtures.

Suggested fix
-	if !goldenFileExists(t, "xai/responses.json") {
-		t.Skip("golden file not found - record xai responses first")
-	}
+	require.True(t, goldenFileExists(t, "xai/responses.json"),
+		"missing xai/responses.json; run `make record-api`")
@@
-	if !goldenFileExists(t, "xai/responses_stream.txt") {
-		t.Skip("golden file not found - record xai responses stream first")
-	}
+	require.True(t, goldenFileExists(t, "xai/responses_stream.txt"),
+		"missing xai/responses_stream.txt; run `make record-api`")

As per coding guidelines tests/contract/**/*_test.go: Contract tests must use golden file validation and include -tags=contract build tag with 5-minute timeout. Record new golden files using make record-api.

Also applies to: 126-128

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/xai_test.go` around lines 101 - 103, Replace the t.Skip branch
that hides missing golden fixtures with a failing assertion: in
tests/contract/xai_test.go (the test function using goldenFileExists), remove
the t.Skip("golden file not found - record xai responses first") and instead
call t.Fatalf with a clear message that the golden file xai/responses.json is
missing and instruct contributors to regenerate fixtures (e.g., "missing golden
file xai/responses.json; run `make record-api` to create/update contract
fixtures"). Apply the same change to the other checks around the same test (the
occurrences noted at lines ~126-128) so contract tests fail-fast when fixtures
are absent.

Comment on lines +140 to 148
raw := readAllStream(t, stream)
events := parseResponsesStream(t, raw)
require.NotEmpty(t, events)

assert.True(t, hasResponsesEvent(events, "response.created"))
assert.True(t, hasResponsesEvent(events, "response.output_text.delta"))
assert.True(t, hasResponsesEvent(events, "response.completed"))
assert.NotEmpty(t, extractResponsesStreamText(events))
}

⚠️ Potential issue | 🟡 Minor

Assert stream termination with [DONE] in StreamResponses.

TestXAIReplayStreamResponses checks events/text but does not verify terminal completion marker.

Suggested fix
 	raw := readAllStream(t, stream)
 	events := parseResponsesStream(t, raw)
 	require.NotEmpty(t, events)

 	assert.True(t, hasResponsesEvent(events, "response.created"))
 	assert.True(t, hasResponsesEvent(events, "response.output_text.delta"))
 	assert.True(t, hasResponsesEvent(events, "response.completed"))
 	assert.NotEmpty(t, extractResponsesStreamText(events))
+
+	hasDone := false
+	for _, event := range events {
+		if event.Done {
+			hasDone = true
+			break
+		}
+	}
+	assert.True(t, hasDone, "responses stream should terminate with [DONE]")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-raw := readAllStream(t, stream)
-events := parseResponsesStream(t, raw)
-require.NotEmpty(t, events)
-
-assert.True(t, hasResponsesEvent(events, "response.created"))
-assert.True(t, hasResponsesEvent(events, "response.output_text.delta"))
-assert.True(t, hasResponsesEvent(events, "response.completed"))
-assert.NotEmpty(t, extractResponsesStreamText(events))
-}
+raw := readAllStream(t, stream)
+events := parseResponsesStream(t, raw)
+require.NotEmpty(t, events)
+
+assert.True(t, hasResponsesEvent(events, "response.created"))
+assert.True(t, hasResponsesEvent(events, "response.output_text.delta"))
+assert.True(t, hasResponsesEvent(events, "response.completed"))
+assert.NotEmpty(t, extractResponsesStreamText(events))
+
+hasDone := false
+for _, event := range events {
+	if event.Done {
+		hasDone = true
+		break
+	}
+}
+assert.True(t, hasDone, "responses stream should terminate with [DONE]")
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/xai_test.go` around lines 140 - 148, The test reads the raw
stream via readAllStream and parses events but doesn't assert the stream
termination marker; update TestXAIReplayStreamResponses to assert the terminal
"[DONE]" marker is present in the raw stream (use the raw variable returned from
readAllStream) — e.g., add a require.Contains(raw, "[DONE]") or
require.True(strings.HasSuffix(raw, "[DONE]\n")) after events are parsed to
ensure StreamResponses emits the final "[DONE]" sentinel; keep existing checks
using parseResponsesStream, hasResponsesEvent, and extractResponsesStreamText.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

♻️ Duplicate comments (2)
internal/providers/gemini/gemini.go (1)

79-81: 🧹 Nitpick | 🔵 Trivial

NewWithHTTPClient uses DefaultConfig for resilience settings.

Unlike New() which accepts ProviderOptions with explicit resilience config, NewWithHTTPClient uses llmclient.DefaultConfig() which applies default retry/circuit-breaker settings. This asymmetry could cause unexpected behavior if callers expect consistent resilience configuration.

Consider accepting ProviderOptions or resilience config in NewWithHTTPClient for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/providers/gemini/gemini.go` around lines 79 - 81, NewWithHTTPClient
currently builds modelsCfg via llmclient.DefaultConfig("gemini",
defaultModelsBaseURL) (and assigns it to p.modelsClientConf) which applies
default retry/circuit-breaker settings and creates an asymmetry with New that
accepts ProviderOptions; modify NewWithHTTPClient to accept the same
ProviderOptions (or a resilience config) parameter and use that to construct
modelsCfg instead of llmclient.DefaultConfig, or provide an overload that takes
a ResilienceConfig and merges it into modelsCfg before assigning to
p.modelsClientConf so resilience settings are consistent with New.
tests/contract/openai_test.go (1)

121-149: 🧹 Nitpick | 🔵 Trivial

Add [DONE] termination assertion for consistency.

TestOpenAIReplayStreamResponses validates event types but doesn't verify stream termination with a [DONE] marker, unlike the Anthropic and Groq equivalents which include this check. For contract consistency, add the termination assertion.

♻️ Proposed fix to add [DONE] termination check
 	require.True(t, hasResponsesEvent(events, "response.created"))
 	require.True(t, hasResponsesEvent(events, "response.output_text.delta"))
 	require.True(t, hasResponsesEvent(events, "response.completed"))
 	require.NotEmpty(t, extractResponsesStreamText(events))
+
+	hasDone := false
+	for _, event := range events {
+		if event.Done {
+			hasDone = true
+			break
+		}
+	}
+	require.True(t, hasDone, "responses stream should terminate with [DONE]")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/openai_test.go` around lines 121 - 149,
TestOpenAIReplayStreamResponses currently validates event types but doesn't
assert the stream termination; after reading the raw stream (raw :=
readAllStream(t, stream)) add a termination assertion to verify the provider
emitted the "[DONE]" marker (for example: require.Contains(t, string(raw),
"[DONE]") or require.True(t, hasResponsesEvent(events, "[DONE]"))), ensuring the
stream is properly terminated like the Anthropic/Groq tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/test.yml:
- Around line 88-102: Update the contract tests README to reflect the migration
to replay-harness behavior: edit the contract tests README
(tests/contract/README.md) to describe the new contributor-facing steps for
running contract replay tests (referencing the CI job name test-contract) and
the exact local command developers should run (go test -v -tags=contract
-timeout=5m ./tests/contract/...) plus updated guidance on golden file
validation and how to regenerate/verify golden outputs under the replay harness.

In `@internal/providers/gemini/gemini.go`:
- Line 269: Replace the plain fmt.Errorf("unexpected Gemini models response
format") return with a structured GatewayError (or existing gateway error type)
so downstream middleware can distinguish parsing failures; create/populate a
GatewayError (e.g., &GatewayError{Code: "parse_error", Message: "unexpected
Gemini models response format", Details: <optional raw payload>}) and return
that (or wrap it with fmt.Errorf("%w: ...", gwErr) if wrappers are required),
updating the site that checks for GatewayError via errors.As accordingly.

In `@tests/contract/gemini_test.go`:
- Around line 71-81: The test TestGeminiReplayListModels couples the API key
into the route key string (replayKey(http.MethodGet,
"/models?key=test-api-key")), making it fragile; update the test to decouple the
key by either using replayKey(http.MethodGet, "/models") (i.e., match the path
without query params) or construct the route key dynamically from the provider
created by newGeminiReplayProvider so it uses the same API key value; modify the
replayRoute entry accordingly (touching TestGeminiReplayListModels,
newGeminiReplayProvider usage, and the replayKey call) so the route matching no
longer hardcodes "?key=test-api-key".

In `@tests/contract/golden_helpers_test.go`:
- Around line 93-105: The normalizeGeneratedID function only treats resp_ and
msg_ suffixes as generated when the remainder parses as a UUID; extend its
normalization to also detect known provider-specific ID formats used in golden
files (e.g., OpenAI 56-char hex for resp_, Anthropic 24-char base62-like for
msg_, and Groq/ULID-like req_ patterns) so those are mapped to resp_<generated>,
msg_<generated>, or req_<generated> respectively. Modify normalizeGeneratedID to
check for these additional patterns (in addition to the existing uuid.Parse
path) using deterministic checks (length + allowed character sets or a ULID
regex) and return the normalized token when matched; keep the existing behavior
for strings.HasPrefix("resp_"), strings.HasPrefix("msg_"), and add support for
"req_" as well so golden-file comparisons are consistent.

In `@tests/contract/testdata/golden/anthropic/messages_with_tools.golden.json`:
- Around line 4-9: The fixture shows "finish_reason": "tool_use" but the message
object only has plain text (message.content) so tooling assertions won't run;
update this golden file to include structured tool-call data under the message
(e.g., add a "tool_calls" array/object alongside message.content) that contains
the invoked tool name and its arguments (matching the adapter's expected keys)
so tests validate the tool name and args mapping when finish_reason is
"tool_use".

In `@tests/contract/testdata/golden/gemini/chat_with_tools.golden.json`:
- Around line 4-9: The fixture's assistant message has finish_reason set to
"tool_calls" but the message object lacks the required tool_calls array; update
the JSON in tests/contract/testdata/golden/gemini/chat_with_tools.golden.json so
that the "message" object includes a "tool_calls" array containing the tool call
entries (with fields like id, type, and function) similar to the OpenAI/Groq
fixtures, ensuring the array reflects the same structure and content used by the
other golden files.

In `@tests/contract/testdata/golden/groq/chat_with_tools.golden.json`:
- Around line 4-9: The golden fixtures declare a tool-calling completion with
"finish_reason": "tool_calls" but lack the required message.tool_calls payload;
update each affected golden file so the assistant message includes a tool_calls
array containing objects with id, type, and function (with name and arguments),
then re-record the fixtures (e.g., run make record-api and/or RECORD=1 go test
-v -tags=contract ./tests/contract/...) to capture real tool-call payloads, and
add a short note in the contract tests README describing that when finish_reason
== "tool_calls" the message must include message.tool_calls with id, type, and
function{name, arguments}.

In `@tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json`:
- Around line 25-30: The prompt_tokens_details object is inconsistent with
prompt_tokens: update the detail counts so they sum to prompt_tokens (13) —
since this is a text-only prompt, set "text_tokens" to 13 and keep
"audio_tokens", "cached_tokens", and "image_tokens" at 0; ensure the
"prompt_tokens" field and the sum of prompt_tokens_details remain consistent
going forward (verify any generation logic that produces these fields uses the
same source for totals).

In `@tests/contract/testdata/golden/openai/chat_with_tools.golden.json`:
- Around line 4-9: Add a ToolCalls field to the core message schema and wire
through provider parsing: update the core.Message struct to include ToolCalls
[]ToolCall (create a ToolCall type if missing with fields matching the OpenAI
tool_calls payload), then update each provider adapter (the OpenAI adapter
parsing messages and any other providers that emit tool-invocation responses) to
detect finish_reason == "tool_calls" and populate Message.ToolCalls by
normalizing the provider-specific tool call objects into the new ToolCall type;
ensure all downstream code that constructs or marshals core.Message
(serializers, tests like chat_with_tools.golden.json, and any response builders)
handle the new field.

In `@tests/contract/testdata/golden/xai/responses_stream.golden.json`:
- Around line 2322-2449: The stream uses inconsistent message IDs across related
events: response.output_item.added/response.output_item.done use
"msg_<generated>" while response.content_part.added and response.output_text.*
use "msg_9f619ea5-7275-4a31-cb4c-35f8adb5c6b0"; update the golden so all events
for the same output item share one canonical ID (pick either "msg_<generated>"
or "msg_9f619ea5-7275-4a31-cb4c-35f8adb5c6b0") for the entries with Name values
response.output_item.added, response.content_part.added,
response.output_text.delta, response.output_text.done,
response.content_part.done, and response.output_item.done to preserve
correlation semantics.

In `@tests/contract/xai_test.go`:
- Around line 109-110: Replace the hardcoded golden file paths passed to
compareGoldenJSON with calls to goldenPathForFixture so XAI tests use the
centralized fixture naming; specifically, update the compareGoldenJSON(t,
"xai/responses.golden.json", resp) calls (and the similar occurrences around the
later block at lines 133-136) to compareGoldenJSON(t,
goldenPathForFixture("xai/responses.golden.json"), resp) so the test uses
goldenPathForFixture(...) consistently.

---

Duplicate comments:
In `@internal/providers/gemini/gemini.go`:
- Around line 79-81: NewWithHTTPClient currently builds modelsCfg via
llmclient.DefaultConfig("gemini", defaultModelsBaseURL) (and assigns it to
p.modelsClientConf) which applies default retry/circuit-breaker settings and
creates an asymmetry with New that accepts ProviderOptions; modify
NewWithHTTPClient to accept the same ProviderOptions (or a resilience config)
parameter and use that to construct modelsCfg instead of
llmclient.DefaultConfig, or provide an overload that takes a ResilienceConfig
and merges it into modelsCfg before assigning to p.modelsClientConf so
resilience settings are consistent with New.

In `@tests/contract/openai_test.go`:
- Around line 121-149: TestOpenAIReplayStreamResponses currently validates event
types but doesn't assert the stream termination; after reading the raw stream
(raw := readAllStream(t, stream)) add a termination assertion to verify the
provider emitted the "[DONE]" marker (for example: require.Contains(t,
string(raw), "[DONE]") or require.True(t, hasResponsesEvent(events, "[DONE]"))),
ensuring the stream is properly terminated like the Anthropic/Groq tests.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 07900ae and 3784193.

📒 Files selected for processing (57)
  • .github/workflows/test.yml
  • cmd/recordapi/main.go
  • docs/TESTING_STRATEGY.md
  • internal/providers/gemini/gemini.go
  • tests/contract/README.md
  • tests/contract/anthropic_test.go
  • tests/contract/gemini_test.go
  • tests/contract/golden_helpers_test.go
  • tests/contract/groq_test.go
  • tests/contract/openai_test.go
  • tests/contract/replay_harness_test.go
  • tests/contract/replay_provider_helpers_test.go
  • tests/contract/testdata/golden/anthropic/messages.golden.json
  • tests/contract/testdata/golden/anthropic/messages_extended_thinking.golden.json
  • tests/contract/testdata/golden/anthropic/messages_multi_turn.golden.json
  • tests/contract/testdata/golden/anthropic/messages_multimodal.golden.json
  • tests/contract/testdata/golden/anthropic/messages_stream.golden.json
  • tests/contract/testdata/golden/anthropic/messages_with_params.golden.json
  • tests/contract/testdata/golden/anthropic/messages_with_tools.golden.json
  • tests/contract/testdata/golden/anthropic/responses.golden.json
  • tests/contract/testdata/golden/anthropic/responses_stream.golden.json
  • tests/contract/testdata/golden/gemini/chat_completion.golden.json
  • tests/contract/testdata/golden/gemini/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/gemini/chat_with_params.golden.json
  • tests/contract/testdata/golden/gemini/chat_with_tools.golden.json
  • tests/contract/testdata/golden/gemini/models.golden.json
  • tests/contract/testdata/golden/gemini/responses.golden.json
  • tests/contract/testdata/golden/gemini/responses_stream.golden.json
  • tests/contract/testdata/golden/groq/chat_completion.golden.json
  • tests/contract/testdata/golden/groq/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/groq/chat_with_params.golden.json
  • tests/contract/testdata/golden/groq/chat_with_tools.golden.json
  • tests/contract/testdata/golden/groq/models.golden.json
  • tests/contract/testdata/golden/groq/responses.golden.json
  • tests/contract/testdata/golden/groq/responses_stream.golden.json
  • tests/contract/testdata/golden/openai/chat_completion.golden.json
  • tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json
  • tests/contract/testdata/golden/openai/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/openai/chat_json_mode.golden.json
  • tests/contract/testdata/golden/openai/chat_multi_turn.golden.json
  • tests/contract/testdata/golden/openai/chat_multimodal.golden.json
  • tests/contract/testdata/golden/openai/chat_with_params.golden.json
  • tests/contract/testdata/golden/openai/chat_with_tools.golden.json
  • tests/contract/testdata/golden/openai/models.golden.json
  • tests/contract/testdata/golden/openai/responses.golden.json
  • tests/contract/testdata/golden/openai/responses_stream.golden.json
  • tests/contract/testdata/golden/xai/chat_completion.golden.json
  • tests/contract/testdata/golden/xai/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/xai/chat_with_params.golden.json
  • tests/contract/testdata/golden/xai/models.golden.json
  • tests/contract/testdata/golden/xai/responses.golden.json
  • tests/contract/testdata/golden/xai/responses_stream.golden.json
  • tests/contract/testdata/openai/responses.json
  • tests/contract/testdata/openai/responses_stream.txt
  • tests/contract/testdata/xai/responses.json
  • tests/contract/testdata/xai/responses_stream.txt
  • tests/contract/xai_test.go

Comment on lines +88 to +102
test-contract:
  name: Contract Replay Tests
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v6

    - name: Set up Go
      uses: actions/setup-go@v6
      with:
        go-version: ${{ env.GO_VERSION }}
        cache: true

    - name: Run contract replay tests
      run: go test -v -tags=contract -timeout=5m ./tests/contract/...


⚠️ Potential issue | 🟡 Minor

Please include the contract-testing procedure update in tests/contract/README.md.

This PR migrates contract execution to replay harness behavior, so the contributor-facing contract testing steps should be updated in the same change set.

Based on learnings: Applies to tests/contract/README.md : Update tests/contract/README.md when modifying contract testing procedures or golden file validation processes

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test.yml around lines 88 - 102, Update the contract tests
README to reflect the migration to replay-harness behavior: edit the contract
tests README (tests/contract/README.md) to describe the new contributor-facing
steps for running contract replay tests (referencing the CI job name
test-contract) and the exact local command developers should run (go test -v
-tags=contract -timeout=5m ./tests/contract/...) plus updated guidance on golden
file validation and how to regenerate/verify golden outputs under the replay
harness.

Object: "list",
Data: models,
}, nil
return nil, fmt.Errorf("unexpected Gemini models response format")

🧹 Nitpick | 🔵 Trivial

Consider returning a structured error type.

Returning a plain fmt.Errorf for an unexpected response format may not integrate well with error handling middleware that expects GatewayError types. Consider wrapping this in a more specific error type if downstream handlers need to distinguish parsing failures from network errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/providers/gemini/gemini.go` at line 269, Replace the plain
fmt.Errorf("unexpected Gemini models response format") return with a structured
GatewayError (or existing gateway error type) so downstream middleware can
distinguish parsing failures; create/populate a GatewayError (e.g.,
&GatewayError{Code: "parse_error", Message: "unexpected Gemini models response
format", Details: <optional raw payload>}) and return that (or wrap it with
fmt.Errorf("%w: ...", gwErr) if wrappers are required), updating the site that
checks for GatewayError via errors.As accordingly.

Comment on lines +71 to 81
func TestGeminiReplayListModels(t *testing.T) {
	provider := newGeminiReplayProvider(t, map[string]replayRoute{
		replayKey(http.MethodGet, "/models?key=test-api-key"): jsonFixtureRoute(t, "gemini/models.json"),
	})

	resp, err := provider.ListModels(context.Background())
	require.NoError(t, err)
	require.NotNil(t, resp)

	// Check for some expected models (Gemini variants)
	modelIDs := make(map[string]bool)
	for _, model := range resp.Data {
		modelIDs[model.ID] = true
	}

	// At least one Gemini model should exist
	hasGemini := false
	for id := range modelIDs {
		if hasPrefixIgnoreModels(id, "gemini") {
			hasGemini = true
			break
		}
	}
	assert.True(t, hasGemini, "expected at least one Gemini model in models list")
	compareGoldenJSON(t, goldenPathForFixture("gemini/models.json"), resp)
}

🧹 Nitpick | 🔵 Trivial

Consider decoupling the API key from the route key.

The route key includes the query parameter ?key=test-api-key which couples the test to the specific API key value used in the provider. If the provider's API key handling changes, this test would break. Consider whether the models endpoint route could be matched without the query parameter, or ensure the key matches what newGeminiReplayProvider uses.
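One query-insensitive approach is to key routes on method plus path only, parsing the query string away before lookup. The `routeKey` helper below is a hypothetical stand-in sketching the idea, not the harness's actual `replayKey`:

```go
package main

import (
	"fmt"
	"net/url"
)

// routeKey builds a replay lookup key from the HTTP method and raw
// URL, dropping the query string so values like ?key=test-api-key
// never leak into route matching.
func routeKey(method, rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return method + " " + rawURL
	}
	return method + " " + u.Path
}

func main() {
	fmt.Println(routeKey("GET", "/models?key=test-api-key")) // GET /models
}
```

With this, the fixture route registers under `GET /models` regardless of which API key the provider appends.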

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/gemini_test.go` around lines 71 - 81, The test
TestGeminiReplayListModels couples the API key into the route key string
(replayKey(http.MethodGet, "/models?key=test-api-key")), making it fragile;
update the test to decouple the key by either using replayKey(http.MethodGet,
"/models") (i.e., match the path without query params) or construct the route
key dynamically from the provider created by newGeminiReplayProvider so it uses
the same API key value; modify the replayRoute entry accordingly (touching
TestGeminiReplayListModels, newGeminiReplayProvider usage, and the replayKey
call) so the route matching no longer hardcodes "?key=test-api-key".

Comment on lines +93 to +105
func normalizeGeneratedID(id string) string {
if strings.HasPrefix(id, "resp_") {
if _, err := uuid.Parse(strings.TrimPrefix(id, "resp_")); err == nil {
return "resp_<generated>"
}
}
if strings.HasPrefix(id, "msg_") {
if _, err := uuid.Parse(strings.TrimPrefix(id, "msg_")); err == nil {
return "msg_<generated>"
}
}
return id
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find tests/contract -name "*.json" -o -name "*.yaml" -o -name "*.yml" | head -20

Repository: ENTERPILOT/GOModel

Length of output: 1315


🏁 Script executed:

rg '"id":\s*"[^"]*"' tests/contract/testdata/golden/ --max-count=30

Repository: ENTERPILOT/GOModel

Length of output: 20293


🏁 Script executed:

# Also check the context around the normalizeGeneratedID function
cat -n tests/contract/golden_helpers_test.go | sed -n '80,115p'

Repository: ENTERPILOT/GOModel

Length of output: 764


🏁 Script executed:

# Check how normalizeGeneratedID is actually used
rg -B5 -A5 'normalizeGeneratedID' tests/contract/

Repository: ENTERPILOT/GOModel

Length of output: 1344


The normalizeGeneratedID function doesn't handle non-UUID ID formats used by actual providers.

The function only normalizes IDs where the suffix (after resp_ or msg_) is a valid UUID. However, golden files contain:

  • OpenAI: resp_0d8b81b7c805508d0069a31431b68c8194af45c94a6952706b (56-char hex string)
  • Anthropic: msg_013LBDy1nvGiRgxiRAWseYRS (base62-like, 24 chars)
  • Groq: req_01kfk38ha1ey6bw3b68mcaeddg (ULID-like format)

These will fail uuid.Parse() and remain unchanged in golden files, while other providers have already-normalized IDs (msg_<generated>). This inconsistency can cause comparison failures during golden file updates. Either extend the pattern matching to handle provider-specific formats, or document that only UUID-suffixed IDs are normalized.
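A sketch of the extended matching, using length-and-charset regexes alongside the existing UUID check (omitted here for brevity). The lengths and character sets are inferred from the IDs observed above and may need tuning:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Suffix patterns inferred from the observed golden-file IDs; the
// lengths and character sets are assumptions and may need tuning.
var idSuffixPatterns = []struct {
	prefix string
	re     *regexp.Regexp
}{
	{"resp_", regexp.MustCompile(`^[0-9a-f]{48,64}$`)},       // OpenAI-style hex suffix
	{"msg_", regexp.MustCompile(`^[0-9A-Za-z]{20,30}$`)},     // Anthropic-style base62 suffix
	{"req_", regexp.MustCompile(`^[0-9a-hjkmnp-tv-z]{26}$`)}, // Groq ULID-like suffix
}

// normalizeGeneratedID maps provider-generated IDs to a stable
// placeholder so golden-file comparisons stay deterministic.
// (The existing UUID-suffix check would run alongside this.)
func normalizeGeneratedID(id string) string {
	for _, p := range idSuffixPatterns {
		if suffix, ok := strings.CutPrefix(id, p.prefix); ok && p.re.MatchString(suffix) {
			return p.prefix + "<generated>"
		}
	}
	return id
}

func main() {
	fmt.Println(normalizeGeneratedID("msg_013LBDy1nvGiRgxiRAWseYRS")) // msg_<generated>
	fmt.Println(normalizeGeneratedID("chatcmpl-abc123"))              // passes through unchanged
}
```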

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/golden_helpers_test.go` around lines 93 - 105, The
normalizeGeneratedID function only treats resp_ and msg_ suffixes as generated
when the remainder parses as a UUID; extend its normalization to also detect
known provider-specific ID formats used in golden files (e.g., OpenAI 56-char
hex for resp_, Anthropic 24-char base62-like for msg_, and Groq/ULID-like req_
patterns) so those are mapped to resp_<generated>, msg_<generated>, or
req_<generated> respectively. Modify normalizeGeneratedID to check for these
additional patterns (in addition to the existing uuid.Parse path) using
deterministic checks (length + allowed character sets or a ULID regex) and
return the normalized token when matched; keep the existing behavior for
strings.HasPrefix("resp_"), strings.HasPrefix("msg_"), and add support for
"req_" as well so golden-file comparisons are consistent.

Comment on lines +4 to +9
"finish_reason": "tool_use",
"index": 0,
"message": {
"content": "I'll check the current weather in Paris for you.",
"role": "assistant"
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# First, check if the file exists and show its content
if [ -f "tests/contract/testdata/golden/anthropic/messages_with_tools.golden.json" ]; then
  echo "=== File content (full) ==="
  cat "tests/contract/testdata/golden/anthropic/messages_with_tools.golden.json"
  echo ""
else
  echo "File not found"
  exit 1
fi

Repository: ENTERPILOT/GOModel

Length of output: 545


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Inspect the fixture for tool-call fields
echo "=== Searching fixture for finish_reason, tool_calls, content ==="
rg -nP -C2 '"finish_reason"\s*:\s*"tool_use"|"tool_calls"|"content"' tests/contract/testdata/golden/anthropic/messages_with_tools.golden.json

echo ""
echo "=== Searching test assertions related to this fixture ==="
# 2) Locate replay/contract assertions related to this fixture and tool-call parsing
rg -nP --type=go -C3 'messages_with_tools|tool_use|tool_calls|ToolCalls|finish_reason' tests/contract 2>/dev/null || echo "(search completed)"

Repository: ENTERPILOT/GOModel

Length of output: 2610


Tool-use fixture is missing structured tool-call data.

Line 4 indicates "finish_reason": "tool_use", but the message object (lines 7–8) contains only plain text content with no tool_calls field. This prevents the test from validating adapter mapping for tool name and arguments, allowing tool-call regressions to pass undetected.

Proposed fixture adjustment
   "choices": [
     {
       "finish_reason": "tool_use",
       "index": 0,
       "message": {
-        "content": "I'll check the current weather in Paris for you.",
-        "role": "assistant"
+        "content": "",
+        "role": "assistant",
+        "tool_calls": [
+          {
+            "id": "call_1",
+            "type": "function",
+            "function": {
+              "name": "get_weather",
+              "arguments": "{\"location\":\"Paris\"}"
+            }
+          }
+        ]
       }
     }
   ],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/anthropic/messages_with_tools.golden.json`
around lines 4 - 9, The fixture shows "finish_reason": "tool_use" but the
message object only has plain text (message.content) so tooling assertions won't
run; update this golden file to include structured tool-call data under the
message (e.g., add a "tool_calls" array/object alongside message.content) that
contains the invoked tool name and its arguments (matching the adapter's
expected keys) so tests validate the tool name and args mapping when
finish_reason is "tool_use".

Comment on lines +4 to +9
"finish_reason": "tool_calls",
"index": 0,
"message": {
"content": "",
"role": "assistant"
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and verify the snippet
cat -n tests/contract/testdata/golden/groq/chat_with_tools.golden.json | head -20

Repository: ENTERPILOT/GOModel

Length of output: 620


🏁 Script executed:

# Check if tests/contract/README.md exists and what it documents
ls -la tests/contract/README.md 2>/dev/null || echo "README.md not found"

Repository: ENTERPILOT/GOModel

Length of output: 135


🏁 Script executed:

# Search for other golden files with finish_reason="tool_calls"
fd -e json . tests/contract/testdata/golden -type f | while read f; do
  if jq -e '.choices[]?.finish_reason' "$f" 2>/dev/null | grep -q "tool_calls"; then
    echo "File: $f"
    jq '.choices[]?.finish_reason' "$f" 2>/dev/null | head -3
  fi
done

Repository: ENTERPILOT/GOModel

Length of output: 232


🌐 Web query:

Groq API chat completions tool_calls response schema finish_reason

💡 Result:

Groq /openai/v1/chat/completions response shape (tool calls + finish_reason)

Groq’s Chat Completions endpoint is OpenAI-compatible and returns a top-level chat.completion object with a choices[] array; each choice includes a message and a finish_reason. When the model chooses to call a tool, the assistant message contains tool_calls, and finish_reason is "tool_calls". [1][2]

Typical tool-calling response (non-streaming)

{
  "id": "chatcmpl_...",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "openai/gpt-oss-120b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_...",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\":\"New York, NY\",\"unit\":\"fahrenheit\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": { "...": "..." }
}

Key tool_calls[] fields: id, type ("function"), function.name, function.arguments (a JSON string). [2]

finish_reason values you should handle

Groq follows the OpenAI-style set documented/expected for chat completions:

  • "stop" (normal completion) [1][3]
  • "length" (token limit hit) [4]
  • "tool_calls" (model requests one or more tool calls) [2][4]
  • "content_filter" (content filtered) [4]

Streaming note (important edge case)

A Groq community report (Nov 19, 2025) indicates that for some models, streaming chat-completion chunks may omit finish_reason in the final chunk (breaking OpenAI streaming expectations). If you depend on finish_reason in streaming, you may need a defensive fallback. [4]

Sources: Groq API Reference (chat completions) [1], Groq “Local Tool Calling” docs (tool_calls + finish_reason example) [2], Groq built-in tools example showing finish_reason: "stop" [3], Groq community thread on missing streaming finish_reason + expected values [4].


🏁 Script executed:

# Correct the fd command and search for other golden files with finish_reason="tool_calls"
fd -e json . tests/contract/testdata/golden | while read f; do
  if jq -e '.choices[]?.finish_reason' "$f" 2>/dev/null | grep -q "tool_calls"; then
    echo "=== $f ==="
    jq '.choices[] | select(.finish_reason=="tool_calls") | .message | keys' "$f" 2>/dev/null
  fi
done

Repository: ENTERPILOT/GOModel

Length of output: 359


🏁 Script executed:

# Check the README.md content to see what's documented
cat tests/contract/README.md

Repository: ENTERPILOT/GOModel

Length of output: 1764


finish_reason: "tool_calls" missing message.tool_calls across all provider golden files.

This pattern appears in three golden fixtures:

  • tests/contract/testdata/golden/openai/chat_with_tools.golden.json
  • tests/contract/testdata/golden/gemini/chat_with_tools.golden.json
  • tests/contract/testdata/golden/groq/chat_with_tools.golden.json

Each declares a tool-calling completion (finish_reason: "tool_calls") but omits the message.tool_calls array. Per OpenAI/Groq schema, when finish_reason is "tool_calls", the message must include a tool_calls array with id, type, and function (name, arguments). Missing this allows replay tests to pass while silently masking adapter tool-call parsing regressions.

Re-record fixtures using make record-api, then refresh golden outputs via RECORD=1 go test -v -tags=contract ./tests/contract/... to capture actual tool-call payloads.

Additionally, update tests/contract/README.md to document the required tool_calls payload structure when finish_reason is "tool_calls", so future golden fixtures comply with schema expectations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/groq/chat_with_tools.golden.json` around lines
4 - 9, The golden fixtures declare a tool-calling completion with
"finish_reason": "tool_calls" but lack the required message.tool_calls payload;
update each affected golden file so the assistant message includes a tool_calls
array containing objects with id, type, and function (with name and arguments),
then re-record the fixtures (e.g., run make record-api and/or RECORD=1 go test
-v -tags=contract ./tests/contract/...) to capture real tool-call payloads, and
add a short note in the contract tests README describing that when finish_reason
== "tool_calls" the message must include message.tool_calls with id, type, and
function{name, arguments}.

Comment on lines +25 to +30
"prompt_tokens": 13,
"prompt_tokens_details": {
"audio_tokens": 0,
"cached_tokens": 0,
"image_tokens": 0,
"text_tokens": 0

⚠️ Potential issue | 🟡 Minor

Inconsistent token accounting in prompt_tokens_details.

The prompt_tokens field shows 13 tokens, but all the detail fields (audio_tokens, cached_tokens, image_tokens, text_tokens) are 0. The detail fields should sum to match the parent total. For a text-only prompt with 13 tokens, text_tokens should be 13.

📊 Proposed fix to correct token details
     "prompt_tokens": 13,
     "prompt_tokens_details": {
       "audio_tokens": 0,
       "cached_tokens": 0,
       "image_tokens": 0,
-      "text_tokens": 0
+      "text_tokens": 13
     },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json`
around lines 25 - 30, The prompt_tokens_details object is inconsistent with
prompt_tokens: update the detail counts so they sum to prompt_tokens (13) —
since this is a text-only prompt, set "text_tokens" to 13 and keep
"audio_tokens", "cached_tokens", and "image_tokens" at 0; ensure the
"prompt_tokens" field and the sum of prompt_tokens_details remain consistent
going forward (verify any generation logic that produces these fields uses the
same source for totals).

Comment on lines +4 to +9
"finish_reason": "tool_calls",
"index": 0,
"message": {
"content": "",
"role": "assistant"
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find the file and understand its context
find . -type f -name "chat_with_tools.golden.json" 2>/dev/null

Repository: ENTERPILOT/GOModel

Length of output: 263


🏁 Script executed:

# List golden test fixtures to understand the structure
fd "golden.*json" tests/ | head -20

Repository: ENTERPILOT/GOModel

Length of output: 1418


🏁 Script executed:

# Check if there's a schema or documentation about golden files
find . -type f -name "*.md" | xargs grep -l "golden" 2>/dev/null | head -5

Repository: ENTERPILOT/GOModel

Length of output: 127


🏁 Script executed:

# Look for test files that reference chat_with_tools
rg "chat_with_tools" --type go --type json 2>/dev/null | head -20

Repository: ENTERPILOT/GOModel

Length of output: 364


🏁 Script executed:

# Read the full OpenAI chat_with_tools golden fixture
cat -n tests/contract/testdata/golden/openai/chat_with_tools.golden.json

Repository: ENTERPILOT/GOModel

Length of output: 1024


🏁 Script executed:

# Read the test file to understand expectations
cat -n tests/contract/openai_test.go | head -100

Repository: ENTERPILOT/GOModel

Length of output: 3840


🏁 Script executed:

# Search for schema definitions or type definitions related to responses
rg "type.*Response" --type go tests/contract/ -A 5 | head -50

Repository: ENTERPILOT/GOModel

Length of output: 442


🏁 Script executed:

# Check if there's any documentation in the contract test directory
cat tests/contract/README.md

Repository: ENTERPILOT/GOModel

Length of output: 1764


🏁 Script executed:

# Find OpenAI provider implementation
find . -type f -path "*/providers/openai*" -name "*.go" | head -10

Repository: ENTERPILOT/GOModel

Length of output: 142


🏁 Script executed:

# Search for tool_calls handling in codebase
rg "tool_calls" --type go -B 2 -A 2

Repository: ENTERPILOT/GOModel

Length of output: 715


🏁 Script executed:

# Check the core types to understand schema
find . -type f -path "*/internal/core*" -name "*.go" | head -10

Repository: ENTERPILOT/GOModel

Length of output: 259


🌐 Web query:

OpenAI Chat Completions API message tool_calls field schema finish_reason

💡 Result:

choices[i].message.tool_calls (Chat Completions API)

In a chat completion response, an assistant message may include:

{
  "role": "assistant",
  "content": null,
  "tool_calls": [
    {
      "id": "string",
      "type": "function",
      "function": {
        "name": "string",
        "arguments": "string (JSON text)"
      }
    }
  ]
}
  • tool_calls is an optional array of ChatCompletionMessageToolCall objects. [1]
  • Each tool call has an id and a type. For function tools, type is "function" and the payload is under function: { name, arguments } (with arguments provided as a string containing JSON). [1]
  • The schema also allows a custom tool call variant: { id, type: "custom", custom: { name, input } }. [2]

Tool result message schema (role: "tool")

To reply with the tool’s output, send a tool message that references the tool call:

{
  "role": "tool",
  "tool_call_id": "string",
  "content": "string (or text parts)"
}

tool_call_id links the tool result back to the corresponding tool_calls[].id. [3]

choices[i].finish_reason when tools are invoked

finish_reason is one of: "stop", "length", "tool_calls", "content_filter", or "function_call" (deprecated). [4]
When the model decides to call a tool, finish_reason will be "tool_calls". [4]

Sources: OpenAI API Reference (Chat Completions). [1][2][3][4]
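The schema in the query result maps naturally onto Go types; a minimal sketch of parsing a tool-calling choice (field names are illustrative, not the project's actual declarations):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FunctionCall and ToolCall mirror the function-tool variant quoted above.
type FunctionCall struct {
	Name      string `json:"name"`
	Arguments string `json:"arguments"` // JSON carried as a string
}

type ToolCall struct {
	ID       string       `json:"id"`
	Type     string       `json:"type"`
	Function FunctionCall `json:"function"`
}

// chatChoice is an illustrative slice of choices[i]; real responses carry more fields.
type chatChoice struct {
	FinishReason string `json:"finish_reason"`
	Message      struct {
		Role      string     `json:"role"`
		Content   *string    `json:"content"` // pointer: null when the model only calls tools
		ToolCalls []ToolCall `json:"tool_calls"`
	} `json:"message"`
}

func parseChoice(raw string) (chatChoice, error) {
	var c chatChoice
	err := json.Unmarshal([]byte(raw), &c)
	return c, err
}

func main() {
	raw := `{
	  "finish_reason": "tool_calls",
	  "message": {
	    "role": "assistant",
	    "content": null,
	    "tool_calls": [{"id": "call_1", "type": "function",
	      "function": {"name": "get_weather", "arguments": "{\"city\":\"Oslo\"}"}}]
	  }
	}`
	c, err := parseChoice(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(c.FinishReason, c.Message.ToolCalls[0].Function.Name) // tool_calls get_weather
}
```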


🏁 Script executed:

# Check OpenAI adapter for tool_calls parsing
cat -n ./internal/providers/openai/openai.go | head -150

Repository: ENTERPILOT/GOModel

Length of output: 6029


🏁 Script executed:

# Check core types definition
cat -n ./internal/core/types.go | grep -A 20 "type Message struct"

Repository: ENTERPILOT/GOModel

Length of output: 815


🏁 Script executed:

# Compare with other provider fixtures that handle tools
cat tests/contract/testdata/golden/gemini/chat_with_tools.golden.json

Repository: ENTERPILOT/GOModel

Length of output: 471


🏁 Script executed:

# Check if anthropic has tool_calls in its fixture
cat tests/contract/testdata/golden/anthropic/messages_with_tools.golden.json | head -50

Repository: ENTERPILOT/GOModel

Length of output: 516


🏁 Script executed:

# Search for ToolCall handling in OpenAI adapter
rg "ToolCall" --type go ./internal/providers/openai/

Repository: ENTERPILOT/GOModel

Length of output: 89


🏁 Script executed:

# Check if tool_calls are parsed anywhere
rg "tool_calls" --type go ./internal/providers/ -B 3 -A 3

Repository: ENTERPILOT/GOModel

Length of output: 89


🏁 Script executed:

# Look at the full Message type definition more carefully
cat -n ./internal/core/types.go | head -100

Repository: ENTERPILOT/GOModel

Length of output: 4443


🏁 Script executed:

# Check if there's a separate response type for tools
rg "type.*Response" ./internal/core/types.go -A 10

Repository: ENTERPILOT/GOModel

Length of output: 1203


🏁 Script executed:

# Check if tool_calls are being parsed but dropped in normalization
grep -r "tool_calls" ./internal/providers/openai/ --include="*.go" -B 5 -A 5

Repository: ENTERPILOT/GOModel

Length of output: 44


🏁 Script executed:

# Check if there are any comment or docs about tool_calls support
rg "tool.call|tool_call" . --type go -i | head -20

Repository: ENTERPILOT/GOModel

Length of output: 741


🏁 Script executed:

# Verify the test expectations - what exactly is being compared
cat -n tests/contract/main_test.go | grep -A 50 "compareGoldenJSON"

Repository: ENTERPILOT/GOModel

Length of output: 44


🏁 Script executed:

# Find the ToolCallResponse and related struct definitions in main_test.go
cat -n tests/contract/main_test.go | grep -A 20 "type ToolCallResponse"

Repository: ENTERPILOT/GOModel

Length of output: 928


🏁 Script executed:

# Look at how the test compares the golden JSON
cat -n tests/contract/main_test.go | grep -B 5 -A 20 "func compareGoldenJSON"

Repository: ENTERPILOT/GOModel

Length of output: 44


🏁 Script executed:

# Check what the actual test expectations are for tool_calls
cat -n tests/contract/main_test.go | head -200

Repository: ENTERPILOT/GOModel

Length of output: 3444


🏁 Script executed:

# Verify that OpenAI provider directly uses core.Message (confirm no transformation)
cat -n ./internal/providers/openai/openai.go | grep -A 30 "ChatCompletion"

Repository: ENTERPILOT/GOModel

Length of output: 1937


🏁 Script executed:

# Check if there's any transformation layer between provider response and core.Message
rg "json.Unmarshal|json.Decoder" ./internal/providers/openai/ -B 3 -A 3

Repository: ENTERPILOT/GOModel

Length of output: 3952


core.Message schema must add ToolCalls field to support tool-invocation responses.

The fixture correctly reflects the current core.Message struct (which lacks a ToolCalls field), but when finish_reason="tool_calls" (line 4), the OpenAI API provides a tool_calls array in the message that cannot be captured or exposed. This is a design gap: the core schema must be extended to include ToolCalls []ToolCall in the Message type, and all provider adapters must be updated to parse and normalize tool-call data.
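A minimal sketch of the proposed extension, assuming an illustrative subset of core.Message (the real struct carries more fields); omitempty keeps plain text replies free of a tool_calls key:

```go
package main

import (
	"encoding/json"
	"fmt"
)

type FunctionCall struct {
	Name      string `json:"name"`
	Arguments string `json:"arguments"`
}

type ToolCall struct {
	ID       string       `json:"id"`
	Type     string       `json:"type"`
	Function FunctionCall `json:"function"`
}

// Message is an illustrative subset of core.Message with the proposed field.
type Message struct {
	Role      string     `json:"role"`
	Content   string     `json:"content"`
	ToolCalls []ToolCall `json:"tool_calls,omitempty"`
}

func marshalMessage(m Message) string {
	b, _ := json.Marshal(m)
	return string(b)
}

func main() {
	withTools := Message{Role: "assistant", ToolCalls: []ToolCall{{
		ID: "call_1", Type: "function",
		Function: FunctionCall{Name: "lookup", Arguments: "{}"},
	}}}
	plain := Message{Role: "user", Content: "hi"}
	fmt.Println(marshalMessage(withTools)) // tool_calls round-trips
	fmt.Println(marshalMessage(plain))     // no tool_calls key emitted
}
```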

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/openai/chat_with_tools.golden.json` around
lines 4 - 9, Add a ToolCalls field to the core message schema and wire through
provider parsing: update the core.Message struct to include ToolCalls []ToolCall
(create a ToolCall type if missing with fields matching the OpenAI tool_calls
payload), then update each provider adapter (the OpenAI adapter parsing messages
and any other providers that emit tool-invocation responses) to detect
finish_reason == "tool_calls" and populate Message.ToolCalls by normalizing the
provider-specific tool call objects into the new ToolCall type; ensure all
downstream code that constructs or marshals core.Message (serializers, tests
like chat_with_tools.golden.json, and any response builders) handle the new
field.

Comment on lines +2322 to +2449
"Done": false,
"Name": "response.output_item.added",
"Payload": {
"item": {
"content": [],
"id": "msg_\u003cgenerated\u003e",
"role": "assistant",
"status": "in_progress",
"type": "message"
},
"output_index": 1,
"sequence_number": 186,
"type": "response.output_item.added"
}
},
{
"Done": false,
"Name": "response.content_part.added",
"Payload": {
"content_index": 0,
"item_id": "msg_9f619ea5-7275-4a31-cb4c-35f8adb5c6b0",
"output_index": 1,
"part": {
"annotations": [],
"logprobs": [],
"text": "",
"type": "output_text"
},
"sequence_number": 187,
"type": "response.content_part.added"
}
},
{
"Done": false,
"Name": "response.output_text.delta",
"Payload": {
"content_index": 0,
"delta": "Hello",
"item_id": "msg_9f619ea5-7275-4a31-cb4c-35f8adb5c6b0",
"logprobs": [],
"output_index": 1,
"sequence_number": 188,
"type": "response.output_text.delta"
}
},
{
"Done": false,
"Name": "response.output_text.delta",
"Payload": {
"content_index": 0,
"delta": ",",
"item_id": "msg_9f619ea5-7275-4a31-cb4c-35f8adb5c6b0",
"logprobs": [],
"output_index": 1,
"sequence_number": 189,
"type": "response.output_text.delta"
}
},
{
"Done": false,
"Name": "response.output_text.delta",
"Payload": {
"content_index": 0,
"delta": " World",
"item_id": "msg_9f619ea5-7275-4a31-cb4c-35f8adb5c6b0",
"logprobs": [],
"output_index": 1,
"sequence_number": 190,
"type": "response.output_text.delta"
}
},
{
"Done": false,
"Name": "response.output_text.delta",
"Payload": {
"content_index": 0,
"delta": "!",
"item_id": "msg_9f619ea5-7275-4a31-cb4c-35f8adb5c6b0",
"logprobs": [],
"output_index": 1,
"sequence_number": 191,
"type": "response.output_text.delta"
}
},
{
"Done": false,
"Name": "response.output_text.done",
"Payload": {
"content_index": 0,
"item_id": "msg_9f619ea5-7275-4a31-cb4c-35f8adb5c6b0",
"logprobs": [],
"output_index": 1,
"sequence_number": 192,
"text": "Hello, World!",
"type": "response.output_text.done"
}
},
{
"Done": false,
"Name": "response.content_part.done",
"Payload": {
"content_index": 0,
"item_id": "msg_9f619ea5-7275-4a31-cb4c-35f8adb5c6b0",
"output_index": 1,
"part": {
"annotations": [],
"logprobs": [],
"text": "Hello, World!",
"type": "output_text"
},
"sequence_number": 193,
"type": "response.content_part.done"
}
},
{
"Done": false,
"Name": "response.output_item.done",
"Payload": {
"item": {
"content": [
{
"annotations": [],
"logprobs": [],
"text": "Hello, World!",
"type": "output_text"
}
],
"id": "msg_\u003cgenerated\u003e",

⚠️ Potential issue | 🟠 Major

Keep stream message identifiers consistent across related events.

response.output_item.added/response.output_item.done use msg_<generated>, while the associated response.content_part.* and response.output_text.* events use msg_9f.... Use a single canonical ID across these events to preserve correlation semantics in the golden stream.
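The correlation invariant can be expressed as a small Go check over the stream (a simplified event shape; field names follow the fixture above):

```go
package main

import "fmt"

type streamEvent struct {
	Name   string
	ItemID string
}

// consistentItemIDs reports whether every event that carries an item ID
// uses the same one, the invariant the golden stream should preserve.
func consistentItemIDs(events []streamEvent) bool {
	var seen string
	for _, ev := range events {
		if ev.ItemID == "" {
			continue
		}
		if seen == "" {
			seen = ev.ItemID
		} else if ev.ItemID != seen {
			return false
		}
	}
	return true
}

func main() {
	mixed := []streamEvent{
		{Name: "response.output_item.added", ItemID: "msg_<generated>"},
		{Name: "response.output_text.delta", ItemID: "msg_9f619ea5-7275-4a31-cb4c-35f8adb5c6b0"},
	}
	fmt.Println(consistentItemIDs(mixed)) // false: the golden stream mixes two IDs
}
```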

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/xai/responses_stream.golden.json` around lines
2322 - 2449, The stream uses inconsistent message IDs across related events:
response.output_item.added/response.output_item.done use "msg_<generated>" while
response.content_part.added and response.output_text.* use
"msg_9f619ea5-7275-4a31-cb4c-35f8adb5c6b0"; update the golden so all events for
the same output item share one canonical ID (pick either "msg_<generated>" or
"msg_9f619ea5-7275-4a31-cb4c-35f8adb5c6b0") for the entries with Name values
response.output_item.added, response.content_part.added,
response.output_text.delta, response.output_text.done,
response.content_part.done, and response.output_item.done to preserve
correlation semantics.

Comment on lines 109 to 110
compareGoldenJSON(t, "xai/responses.golden.json", resp)
}

🧹 Nitpick | 🔵 Trivial

Use goldenPathForFixture(...) consistently for XAI responses tests.

These two hardcoded golden paths are the only outliers; deriving them from fixture paths keeps naming/lookup behavior centralized.

♻️ Proposed refactor
-	compareGoldenJSON(t, "xai/responses.golden.json", resp)
+	compareGoldenJSON(t, goldenPathForFixture("xai/responses.json"), resp)
@@
-	compareGoldenJSON(t, "xai/responses_stream.golden.json", map[string]any{
+	compareGoldenJSON(t, goldenPathForFixture("xai/responses_stream.txt"), map[string]any{
 		"events": events,
 		"text":   extractResponsesStreamText(events),
 	})

Also applies to: 133-136
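A hypothetical implementation of the mapping the refactor relies on — the real goldenPathForFixture helper's body is not shown here, only that it derives a golden path from a fixture path:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// goldenPathForFixture swaps a fixture's extension for the .golden.json
// suffix; an assumed behavior, consistent with the diff above.
func goldenPathForFixture(fixture string) string {
	ext := filepath.Ext(fixture)
	return strings.TrimSuffix(fixture, ext) + ".golden.json"
}

func main() {
	fmt.Println(goldenPathForFixture("xai/responses.json"))       // xai/responses.golden.json
	fmt.Println(goldenPathForFixture("xai/responses_stream.txt")) // xai/responses_stream.golden.json
}
```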

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/xai_test.go` around lines 109 - 110, Replace the hardcoded
golden file paths passed to compareGoldenJSON with calls to goldenPathForFixture
so XAI tests use the centralized fixture naming; specifically, update the
compareGoldenJSON(t, "xai/responses.golden.json", resp) calls (and the similar
occurrences around the later block at lines 133-136) to compareGoldenJSON(t,
goldenPathForFixture("xai/responses.json"), resp) so the test derives the golden
path from the fixture path consistently.

@SantiagoDePolonia SantiagoDePolonia marked this pull request as ready for review February 28, 2026 17:49
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/providers/gemini/gemini.go (1)

156-158: ⚠️ Potential issue | 🟠 Major

Native empty models payload is treated as an error.

A valid Gemini native response with {"models":[]} currently falls through to fallback parsing and can return a 502-style provider error. Empty native lists should return an empty core.ModelsResponse.

💡 Proposed fix
 type geminiModelsResponse struct {
-	Models []geminiModel `json:"models"`
+	Models *[]geminiModel `json:"models"`
 }
@@
 	// Preferred path: native Gemini models response.
 	var geminiResp geminiModelsResponse
-	if err := json.Unmarshal(rawResp.Body, &geminiResp); err == nil && len(geminiResp.Models) > 0 {
-		models := make([]core.Model, 0, len(geminiResp.Models))
+	if err := json.Unmarshal(rawResp.Body, &geminiResp); err == nil && geminiResp.Models != nil {
+		models := make([]core.Model, 0, len(*geminiResp.Models))
 
-		for _, gm := range geminiResp.Models {
+		for _, gm := range *geminiResp.Models {
 			// Extract model ID from name (format: "models/gemini-...")
 			modelID := strings.TrimPrefix(gm.Name, "models/")

Also applies to: 205-206
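The pointer-slice trick in the proposed fix can be shown in isolation: a *[]T field lets the decoder distinguish a missing "models" key from an explicit empty list (illustrative types, matching the diff's shape):

```go
package main

import (
	"encoding/json"
	"fmt"
)

type geminiModel struct {
	Name string `json:"name"`
}

type geminiModelsResponse struct {
	// A pointer slice: nil means the "models" key was absent,
	// a non-nil empty slice means {"models":[]} was present.
	Models *[]geminiModel `json:"models"`
}

func classify(body string) string {
	var resp geminiModelsResponse
	if err := json.Unmarshal([]byte(body), &resp); err != nil {
		return "parse error"
	}
	if resp.Models == nil {
		return "not a native models payload" // fall through to fallback parsing
	}
	return fmt.Sprintf("native payload with %d models", len(*resp.Models))
}

func main() {
	fmt.Println(classify(`{"models":[]}`))                        // native payload with 0 models
	fmt.Println(classify(`{"data":[{"id":"gemini-2.5-flash"}]}`)) // not a native models payload
}
```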

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/providers/gemini/gemini.go` around lines 156 - 158, The Gemini
parser treats a native response with an empty models slice as an error; update
the code paths that decode geminiModelsResponse (the struct type named
geminiModelsResponse and its Models []geminiModel field) to explicitly handle
len(resp.Models)==0 by returning an empty core.ModelsResponse (not falling back
to the error/fallback parser). Apply the same fix to the second parsing location
that currently checks the native payload (the other block handling
geminiModelsResponse around the alternate parse) so both code paths return an
empty core.ModelsResponse when Models is empty instead of producing a provider
error.
♻️ Duplicate comments (3)
tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json (1)

25-31: ⚠️ Potential issue | 🟡 Minor

prompt_tokens_details is still inconsistent with prompt_tokens.

This is still mismatched (prompt_tokens: 13 vs text_tokens: 0) and should be aligned for reliable token-accounting assertions.

Suggested fix
     "prompt_tokens": 13,
     "prompt_tokens_details": {
       "audio_tokens": 0,
       "cached_tokens": 0,
       "image_tokens": 0,
-      "text_tokens": 0
+      "text_tokens": 13
     },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json`
around lines 25 - 31, The JSON shows prompt_tokens (13) not matching the sum in
prompt_tokens_details (text_tokens: 0); update the prompt_tokens_details so its
fields reflect the actual token count—e.g., set
prompt_tokens_details.text_tokens to 13 (or adjust prompt_tokens to match the
sum of audio_tokens + cached_tokens + image_tokens + text_tokens) so that
"prompt_tokens" and "prompt_tokens_details" are consistent.
tests/contract/openai_test.go (1)

136-143: ⚠️ Potential issue | 🟡 Minor

Add explicit stream termination assertion in TestOpenAIReplayStreamResponses.

At Line 136, event checks are present, but the test does not verify the final [DONE] sentinel.

Suggested fix
 raw := readAllStream(t, stream)
 events := parseResponsesStream(t, raw)
 require.NotEmpty(t, events)

 require.True(t, hasResponsesEvent(events, "response.created"))
 require.True(t, hasResponsesEvent(events, "response.output_text.delta"))
 require.True(t, hasResponsesEvent(events, "response.completed"))
 require.NotEmpty(t, extractResponsesStreamText(events))
+require.Contains(t, raw, "[DONE]", "responses stream should terminate with [DONE]")

 compareGoldenJSON(t, "openai/responses_stream.golden.json", map[string]any{
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/openai_test.go` around lines 136 - 143, The test
TestOpenAIReplayStreamResponses currently checks parsed events but doesn't
assert that the stream ended with the final "[DONE]" sentinel; update the test
(after calling readAllStream and parseResponsesStream and before finishing
assertions) to explicitly verify stream termination by asserting that the raw
stream string ends with the "[DONE]" sentinel or that events include a final
sentinel event (use the existing raw variable from readAllStream(t, stream) or
events from parseResponsesStream(t, raw) and add a require.True or require.Equal
assertion to confirm the final "[DONE]" marker).
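The termination check the suggested require.Contains assertion enforces can be sketched as a standalone helper over the raw SSE text (an assumed stream format, matching the fixtures' data-line framing):

```go
package main

import (
	"fmt"
	"strings"
)

// endsWithDone reports whether the last non-empty SSE line is the
// terminal [DONE] sentinel, so truncated streams are rejected.
func endsWithDone(raw string) bool {
	lines := strings.Split(strings.TrimSpace(raw), "\n")
	for i := len(lines) - 1; i >= 0; i-- {
		line := strings.TrimSpace(lines[i])
		if line == "" {
			continue
		}
		return line == "data: [DONE]"
	}
	return false
}

func main() {
	truncated := "data: {\"type\":\"response.output_text.delta\"}\n\n"
	complete := truncated + "data: [DONE]\n\n"
	fmt.Println(endsWithDone(truncated), endsWithDone(complete)) // false true
}
```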
tests/contract/xai_test.go (1)

127-132: ⚠️ Potential issue | 🟡 Minor

Assert terminal [DONE] marker in stream responses test.

At Line 127, the test checks parsed events but not stream termination; truncated SSE output could still satisfy current assertions.

Suggested fix
 raw := readAllStream(t, stream)
 events := parseResponsesStream(t, raw)
 require.True(t, hasResponsesEvent(events, "response.created"))
 require.True(t, hasResponsesEvent(events, "response.output_text.delta"))
 require.True(t, hasResponsesEvent(events, "response.completed"))
+require.Contains(t, raw, "[DONE]", "responses stream should terminate with [DONE]")

 compareGoldenJSON(t, goldenPathForFixture("xai/responses_stream.txt"), map[string]any{
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/xai_test.go` around lines 127 - 132, Test currently verifies
response events but not that the SSE stream contained the terminal "[DONE]"
marker; update the test around readAllStream/parseResponsesStream (functions
readAllStream, parseResponsesStream, hasResponsesEvent and variables raw/events)
to assert the presence of the terminal marker by checking the raw stream for the
"[DONE]" sentinel (e.g., require that raw contains the "[DONE]" line or that the
final SSE message equals "[DONE]") so truncated output cannot falsely pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/contract/testdata/golden/gemini/chat_with_tools.golden.json`:
- Line 23: Replace the hard-coded ID value "4zpyaaPLGYXlxN8Pq7jUkQs" under the
"id" key with the normalized placeholder pattern used in other golden fixtures
(e.g., "msg_\u003cgenerated\u003e") so the fixture matches the standardized ID
format; locate the JSON object containing the "id" property and update its value
to the placeholder string to ensure consistency across golden files.

In `@tests/contract/testdata/golden/gemini/responses_stream.golden.json`:
- Around line 26-39: The Gemini streaming golden fixture's response.completed
event is missing the usage object; update the "response.completed"
Payload.response to include a "usage" field matching other providers and Gemini
non-streaming fixtures with keys input_tokens, output_tokens, and total_tokens
(e.g., usage: { "input_tokens": 0, "output_tokens": 0, "total_tokens": 0 }),
ensuring the "response.completed" event in responses_stream.golden.json contains
this usage object alongside created_at, id, model, object, provider, and status.

In `@tests/contract/testdata/golden/groq/models.golden.json`:
- Around line 2-123: The golden fixture models.golden.json is sensitive to
upstream ordering; to prevent non-semantic churn, sort the model list by the
"id" field before writing the fixture (i.e., perform a stable sort on each
entry's id property) wherever the model-list is serialized for tests (the code
that emits the model-list golden/testdata). Ensure the serializer or test
fixture writer sorts entries by entry["id"] (or model.id) deterministically
prior to marshaling so the resulting models.golden.json is consistently ordered.

In `@tests/contract/testdata/golden/openai/chat_completion.golden.json`:
- Around line 25-31: The golden test fixture has inconsistent token accounting:
"prompt_tokens" is 17 while all entries in "prompt_tokens_details" are 0; update
the fixture so the detail fields sum to the total (for a text-only prompt set
"text_tokens": 17 or otherwise distribute the 17 across
"audio_tokens","cached_tokens","image_tokens","text_tokens")—look for the keys
"prompt_tokens", "prompt_tokens_details" and "text_tokens" in the JSON and make
the values consistent.

In `@tests/contract/testdata/golden/openai/chat_multimodal.golden.json`:
- Line 13: Replace the concrete volatile completion id value in the golden
fixture: the "id" field currently set to
"chatcmpl-D0qUlBEghKIASKSVRE4fnjs5QUDGF" should be normalized to a stable
placeholder (e.g., "chatcmpl-000000000000000" or "chatcmpl-<SANITIZED_ID>") so
regenerated fixtures don't cause churn; update the "id" key in
chat_multimodal.golden.json accordingly.

---

Outside diff comments:
In `@internal/providers/gemini/gemini.go`:
- Around line 156-158: The Gemini parser treats a native response with an empty
models slice as an error; update the code paths that decode geminiModelsResponse
(the struct type named geminiModelsResponse and its Models []geminiModel field)
to explicitly handle len(resp.Models)==0 by returning an empty
core.ModelsResponse (not falling back to the error/fallback parser). Apply the
same fix to the second parsing location that currently checks the native payload
(the other block handling geminiModelsResponse around the alternate parse) so
both code paths return an empty core.ModelsResponse when Models is empty instead
of producing a provider error.

---

Duplicate comments:
In `@tests/contract/openai_test.go`:
- Around line 136-143: The test TestOpenAIReplayStreamResponses currently checks
parsed events but doesn't assert that the stream ended with the final "[DONE]"
sentinel; update the test (after calling readAllStream and parseResponsesStream
and before finishing assertions) to explicitly verify stream termination by
asserting that the raw stream string ends with the "[DONE]" sentinel or that
events include a final sentinel event (use the existing raw variable from
readAllStream(t, stream) or events from parseResponsesStream(t, raw) and add a
require.True or require.Equal assertion to confirm the final "[DONE]" marker).

In `@tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json`:
- Around line 25-31: The JSON shows prompt_tokens (13) not matching the sum in
prompt_tokens_details (text_tokens: 0); update the prompt_tokens_details so its
fields reflect the actual token count—e.g., set
prompt_tokens_details.text_tokens to 13 (or adjust prompt_tokens to match the
sum of audio_tokens + cached_tokens + image_tokens + text_tokens) so that
"prompt_tokens" and "prompt_tokens_details" are consistent.

In `@tests/contract/xai_test.go`:
- Around line 127-132: Test currently verifies response events but not that the
SSE stream contained the terminal "[DONE]" marker; update the test around
readAllStream/parseResponsesStream (functions readAllStream,
parseResponsesStream, hasResponsesEvent and variables raw/events) to assert the
presence of the terminal marker by checking the raw stream for the "[DONE]"
sentinel (e.g., require that raw contains the "[DONE]" line or that the final
SSE message equals "[DONE]") so truncated output cannot falsely pass.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 07900ae and d884ab0.

📒 Files selected for processing (59)
  • .github/workflows/test.yml
  • cmd/recordapi/main.go
  • docs/TESTING_STRATEGY.md
  • internal/core/types.go
  • internal/providers/anthropic/anthropic.go
  • internal/providers/gemini/gemini.go
  • tests/contract/README.md
  • tests/contract/anthropic_test.go
  • tests/contract/gemini_test.go
  • tests/contract/golden_helpers_test.go
  • tests/contract/groq_test.go
  • tests/contract/openai_test.go
  • tests/contract/replay_harness_test.go
  • tests/contract/replay_provider_helpers_test.go
  • tests/contract/testdata/golden/anthropic/messages.golden.json
  • tests/contract/testdata/golden/anthropic/messages_extended_thinking.golden.json
  • tests/contract/testdata/golden/anthropic/messages_multi_turn.golden.json
  • tests/contract/testdata/golden/anthropic/messages_multimodal.golden.json
  • tests/contract/testdata/golden/anthropic/messages_stream.golden.json
  • tests/contract/testdata/golden/anthropic/messages_with_params.golden.json
  • tests/contract/testdata/golden/anthropic/messages_with_tools.golden.json
  • tests/contract/testdata/golden/anthropic/responses.golden.json
  • tests/contract/testdata/golden/anthropic/responses_stream.golden.json
  • tests/contract/testdata/golden/gemini/chat_completion.golden.json
  • tests/contract/testdata/golden/gemini/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/gemini/chat_with_params.golden.json
  • tests/contract/testdata/golden/gemini/chat_with_tools.golden.json
  • tests/contract/testdata/golden/gemini/models.golden.json
  • tests/contract/testdata/golden/gemini/responses.golden.json
  • tests/contract/testdata/golden/gemini/responses_stream.golden.json
  • tests/contract/testdata/golden/groq/chat_completion.golden.json
  • tests/contract/testdata/golden/groq/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/groq/chat_with_params.golden.json
  • tests/contract/testdata/golden/groq/chat_with_tools.golden.json
  • tests/contract/testdata/golden/groq/models.golden.json
  • tests/contract/testdata/golden/groq/responses.golden.json
  • tests/contract/testdata/golden/groq/responses_stream.golden.json
  • tests/contract/testdata/golden/openai/chat_completion.golden.json
  • tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json
  • tests/contract/testdata/golden/openai/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/openai/chat_json_mode.golden.json
  • tests/contract/testdata/golden/openai/chat_multi_turn.golden.json
  • tests/contract/testdata/golden/openai/chat_multimodal.golden.json
  • tests/contract/testdata/golden/openai/chat_with_params.golden.json
  • tests/contract/testdata/golden/openai/chat_with_tools.golden.json
  • tests/contract/testdata/golden/openai/models.golden.json
  • tests/contract/testdata/golden/openai/responses.golden.json
  • tests/contract/testdata/golden/openai/responses_stream.golden.json
  • tests/contract/testdata/golden/xai/chat_completion.golden.json
  • tests/contract/testdata/golden/xai/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/xai/chat_with_params.golden.json
  • tests/contract/testdata/golden/xai/models.golden.json
  • tests/contract/testdata/golden/xai/responses.golden.json
  • tests/contract/testdata/golden/xai/responses_stream.golden.json
  • tests/contract/testdata/openai/responses.json
  • tests/contract/testdata/openai/responses_stream.txt
  • tests/contract/testdata/xai/responses.json
  • tests/contract/testdata/xai/responses_stream.txt
  • tests/contract/xai_test.go

}
],
"created": 0,
"id": "4zpyaaPLGYXlxN8Pq7jUkQs",

🧹 Nitpick | 🔵 Trivial

Consider normalizing the ID for consistency with other golden fixtures.

Other golden files use placeholder patterns like "msg_\u003cgenerated\u003e" for IDs, but this file uses a raw ID value. This inconsistency may cause test failures if the fixture is re-recorded with a different ID.

♻️ Suggested change
-  "id": "4zpyaaPLGYXlxN8Pq7jUkQs",
+  "id": "chatcmpl-\u003cgenerated\u003e",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/gemini/chat_with_tools.golden.json` at line
23, Replace the hard-coded ID value "4zpyaaPLGYXlxN8Pq7jUkQs" under the "id" key
with the normalized placeholder pattern used in other golden fixtures (e.g.,
"msg_\u003cgenerated\u003e") so the fixture matches the standardized ID format;
locate the JSON object containing the "id" property and update its value to the
placeholder string to ensure consistency across golden files.

Comment on lines +26 to +39
{
"Done": false,
"Name": "response.completed",
"Payload": {
"response": {
"created_at": 0,
"id": "resp_\u003cgenerated\u003e",
"model": "gemini-2.5-flash",
"object": "response",
"provider": "gemini",
"status": "completed"
},
"type": "response.completed"
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check for usage field in Gemini golden fixtures
rg -n "usage" tests/contract/testdata/golden/gemini/ --type json

Repository: ENTERPILOT/GOModel

Length of output: 384


🏁 Script executed:

# Also check the Anthropic streaming fixture to see the difference
rg -B5 -A5 "response.completed" tests/contract/testdata/golden/anthropic/ --type json

Repository: ENTERPILOT/GOModel

Length of output: 2102


🏁 Script executed:

# Get a broader view of the Gemini responses_stream.golden.json file
wc -l tests/contract/testdata/golden/gemini/responses_stream.golden.json

Repository: ENTERPILOT/GOModel

Length of output: 131


🏁 Script executed:

# Read the complete Gemini streaming fixture
cat tests/contract/testdata/golden/gemini/responses_stream.golden.json

Repository: ENTERPILOT/GOModel

Length of output: 1110


🏁 Script executed:

# Check all streaming golden fixtures for usage field
find tests/contract/testdata/golden -name "*stream*.json" -exec echo "=== {} ===" \; -exec rg "usage" {} \;

Repository: ENTERPILOT/GOModel

Length of output: 1067


🏁 Script executed:

# Search for test setup/expectations for Gemini streaming
rg -n "responses_stream" tests/contract/ --type go --type json -A 2 -B 2

Repository: ENTERPILOT/GOModel

Length of output: 3444


🏁 Script executed:

# Check if there's implementation code specific to Gemini streaming
rg -n "gemini.*stream" . --type go -i

Repository: ENTERPILOT/GOModel

Length of output: 774


Add missing usage field to response.completed event.

The Gemini streaming fixture omits usage (input_tokens, output_tokens, total_tokens) in the response.completed event, unlike all other providers (Anthropic, Groq, OpenAI, xAI) which include it even in streaming responses. Additionally, Gemini's own non-streaming fixtures (chat_completion.golden.json, responses.golden.json, etc.) include usage. The missing field should be added to maintain consistency with both Gemini non-streaming behavior and other providers' streaming patterns.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/gemini/responses_stream.golden.json` around
lines 26 - 39, The Gemini streaming golden fixture's response.completed event is
missing the usage object; update the "response.completed" Payload.response to
include a "usage" field matching other providers and Gemini non-streaming
fixtures with keys input_tokens, output_tokens, and total_tokens (e.g., usage: {
"input_tokens": 0, "output_tokens": 0, "total_tokens": 0 }), ensuring the
"response.completed" event in responses_stream.golden.json contains this usage
object alongside created_at, id, model, object, provider, and status.

Comment on lines +2 to +123
"data": [
{
"created": 0,
"id": "canopylabs/orpheus-v1-english",
"object": "model",
"owned_by": "Canopy Labs"
},
{
"created": 0,
"id": "whisper-large-v3",
"object": "model",
"owned_by": "OpenAI"
},
{
"created": 0,
"id": "moonshotai/kimi-k2-instruct",
"object": "model",
"owned_by": "Moonshot AI"
},
{
"created": 0,
"id": "canopylabs/orpheus-arabic-saudi",
"object": "model",
"owned_by": "Canopy Labs"
},
{
"created": 0,
"id": "moonshotai/kimi-k2-instruct-0905",
"object": "model",
"owned_by": "Moonshot AI"
},
{
"created": 0,
"id": "meta-llama/llama-4-maverick-17b-128e-instruct",
"object": "model",
"owned_by": "Meta"
},
{
"created": 0,
"id": "openai/gpt-oss-20b",
"object": "model",
"owned_by": "OpenAI"
},
{
"created": 0,
"id": "meta-llama/llama-guard-4-12b",
"object": "model",
"owned_by": "Meta"
},
{
"created": 0,
"id": "llama-3.1-8b-instant",
"object": "model",
"owned_by": "Meta"
},
{
"created": 0,
"id": "meta-llama/llama-4-scout-17b-16e-instruct",
"object": "model",
"owned_by": "Meta"
},
{
"created": 0,
"id": "allam-2-7b",
"object": "model",
"owned_by": "SDAIA"
},
{
"created": 0,
"id": "groq/compound-mini",
"object": "model",
"owned_by": "Groq"
},
{
"created": 0,
"id": "whisper-large-v3-turbo",
"object": "model",
"owned_by": "OpenAI"
},
{
"created": 0,
"id": "qwen/qwen3-32b",
"object": "model",
"owned_by": "Alibaba Cloud"
},
{
"created": 0,
"id": "meta-llama/llama-prompt-guard-2-22m",
"object": "model",
"owned_by": "Meta"
},
{
"created": 0,
"id": "openai/gpt-oss-120b",
"object": "model",
"owned_by": "OpenAI"
},
{
"created": 0,
"id": "groq/compound",
"object": "model",
"owned_by": "Groq"
},
{
"created": 0,
"id": "meta-llama/llama-prompt-guard-2-86m",
"object": "model",
"owned_by": "Meta"
},
{
"created": 0,
"id": "openai/gpt-oss-safeguard-20b",
"object": "model",
"owned_by": "OpenAI"
},
{
"created": 0,
"id": "llama-3.3-70b-versatile",
"object": "model",
"owned_by": "Meta"
}
],

🧹 Nitpick | 🔵 Trivial

Consider canonical ordering for model-list goldens.

If upstream model order is not guaranteed, sorting entries by id before writing this fixture will reduce non-semantic golden churn on re-record.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/groq/models.golden.json` around lines 2 - 123,
The golden fixture models.golden.json is sensitive to upstream ordering; to
prevent non-semantic churn, sort the model list by the "id" field before writing
the fixture (i.e., perform a stable sort on each entry's id property) wherever
the model-list is serialized for tests (the code that emits the model-list
golden/testdata). Ensure the serializer or test fixture writer sorts entries by
entry["id"] (or model.id) deterministically prior to marshaling so the resulting
models.golden.json is consistently ordered.
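The sorting step the comment asks for could be sketched as follows; the `model` struct and `sortModelsByID` helper are illustrative names for this sketch, not the repository's actual fixture-writer types:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// model mirrors the minimal shape of an entry in models.golden.json.
type model struct {
	Created int64  `json:"created"`
	ID      string `json:"id"`
	Object  string `json:"object"`
	OwnedBy string `json:"owned_by"`
}

// sortModelsByID gives the fixture a deterministic order regardless of
// the order the upstream API happened to return the models in.
func sortModelsByID(models []model) {
	sort.Slice(models, func(i, j int) bool { return models[i].ID < models[j].ID })
}

func main() {
	models := []model{
		{ID: "whisper-large-v3", Object: "model", OwnedBy: "OpenAI"},
		{ID: "allam-2-7b", Object: "model", OwnedBy: "SDAIA"},
		{ID: "groq/compound", Object: "model", OwnedBy: "Groq"},
	}
	sortModelsByID(models)
	out, _ := json.MarshalIndent(models, "", "  ")
	fmt.Println(string(out))
}
```

Running the sort just before marshaling means a re-record only changes the golden file when the model set itself changes.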

Comment on lines +25 to +31
"prompt_tokens": 17,
"prompt_tokens_details": {
"audio_tokens": 0,
"cached_tokens": 0,
"image_tokens": 0,
"text_tokens": 0
},

⚠️ Potential issue | 🟡 Minor

Fix inconsistent prompt_tokens_details accounting.

prompt_tokens is 17, but all detail fields are 0. For a text-only prompt, text_tokens should reflect the prompt total.

Suggested fix
     "prompt_tokens": 17,
     "prompt_tokens_details": {
       "audio_tokens": 0,
       "cached_tokens": 0,
       "image_tokens": 0,
-      "text_tokens": 0
+      "text_tokens": 17
     },
📝 Committable suggestion


Suggested change
"prompt_tokens": 17,
"prompt_tokens_details": {
"audio_tokens": 0,
"cached_tokens": 0,
"image_tokens": 0,
"text_tokens": 0
},
"prompt_tokens": 17,
"prompt_tokens_details": {
"audio_tokens": 0,
"cached_tokens": 0,
"image_tokens": 0,
"text_tokens": 17
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/openai/chat_completion.golden.json` around
lines 25 - 31, The golden test fixture has inconsistent token accounting:
"prompt_tokens" is 17 while all entries in "prompt_tokens_details" are 0; update
the fixture so the detail fields sum to the total (for a text-only prompt set
"text_tokens": 17 or otherwise distribute the 17 across
"audio_tokens","cached_tokens","image_tokens","text_tokens")—look for the keys
"prompt_tokens", "prompt_tokens_details" and "text_tokens" in the JSON and make
the values consistent.
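A reconciliation check over the fixture's usage block would catch this class of drift automatically; the `promptTokensDetails` type and `detailsReconcile` helper below are illustrative for this sketch, and note the assumption that `cached_tokens` is a subset of the prompt rather than an additive modality:

```go
package main

import "fmt"

// promptTokensDetails mirrors the detail breakdown in the golden fixtures.
type promptTokensDetails struct {
	AudioTokens  int `json:"audio_tokens"`
	CachedTokens int `json:"cached_tokens"`
	ImageTokens  int `json:"image_tokens"`
	TextTokens   int `json:"text_tokens"`
}

// detailsReconcile reports whether the modality fields account for the
// parent total. cached_tokens is deliberately excluded: in OpenAI's schema
// it counts tokens already included in the prompt, not an extra modality.
func detailsReconcile(promptTokens int, d promptTokensDetails) bool {
	return d.AudioTokens+d.ImageTokens+d.TextTokens == promptTokens
}

func main() {
	fmt.Println(detailsReconcile(17, promptTokensDetails{TextTokens: 17})) // true
	fmt.Println(detailsReconcile(17, promptTokensDetails{}))               // false: the fixture's current shape
}
```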

}
],
"created": 0,
"id": "chatcmpl-D0qUlBEghKIASKSVRE4fnjs5QUDGF",

⚠️ Potential issue | 🟡 Minor

Sanitize volatile completion IDs in the golden fixture.

Line 13 uses a concrete chatcmpl-* id, which can create avoidable golden churn when fixtures are regenerated.

Suggested normalization
-  "id": "chatcmpl-D0qUlBEghKIASKSVRE4fnjs5QUDGF",
+  "id": "chatcmpl_<generated>",
📝 Committable suggestion


Suggested change
"id": "chatcmpl-D0qUlBEghKIASKSVRE4fnjs5QUDGF",
"id": "chatcmpl_<generated>",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/openai/chat_multimodal.golden.json` at line
13, Replace the concrete volatile completion id value in the golden fixture: the
"id" field currently set to "chatcmpl-D0qUlBEghKIASKSVRE4fnjs5QUDGF" should be
normalized to a stable placeholder (e.g., "chatcmpl-000000000000000" or
"chatcmpl-<SANITIZED_ID>") so regenerated fixtures don't cause churn; update the
"id" key in chat_multimodal.golden.json accordingly.
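A normalization pass over recorded bodies could implement this; the `normalizeCompletionID` helper and its regex are a hypothetical sketch, not code from this repository:

```go
package main

import (
	"fmt"
	"regexp"
)

// chatcmplID matches concrete OpenAI-style completion IDs in recorded JSON.
var chatcmplID = regexp.MustCompile(`chatcmpl-[A-Za-z0-9]+`)

// normalizeCompletionID swaps volatile chatcmpl-* IDs for a stable
// placeholder so re-recorded fixtures do not churn on every run.
func normalizeCompletionID(s string) string {
	return chatcmplID.ReplaceAllString(s, "chatcmpl_<generated>")
}

func main() {
	fmt.Println(normalizeCompletionID(`"id": "chatcmpl-D0qUlBEghKIASKSVRE4fnjs5QUDGF"`))
	// prints: "id": "chatcmpl_<generated>"
}
```

Applying such a pass at record time would keep the fixture diff limited to semantic changes.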


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/TESTING_STRATEGY.md (2)

160-167: ⚠️ Potential issue | 🟠 Major

Use the standardized golden-recording command (make record-api) in this strategy doc.

This section currently promotes direct recordapi usage, but team guidance standardizes recording via make record-api.

🔧 Suggested doc fix
-Use the `recordapi` tool or manual curl commands:
+Use `make record-api` (preferred) or manual curl commands:

 ```bash
-# Using recordapi
-go run ./cmd/recordapi -provider=openai -endpoint=chat -output=tests/contract/testdata/openai/chat_completion.json
+# Preferred: standardized recording flow
+make record-api

Based on learnings: Contract tests must use golden file validation and include -tags=contract build tag with 5-minute timeout. Record new golden files using make record-api.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/TESTING_STRATEGY.md` around lines 160 - 167, Replace the direct
recordapi invocation under the "Recording New Golden Files" section with the
standardized make target; update the example that currently shows `go run
./cmd/recordapi -provider=openai -endpoint=chat -output=...` to instruct users
to run `make record-api` instead and add a short note reminding contract tests
to use golden-file validation with the `-tags=contract` build tag and a 5-minute
timeout; target the "Recording New Golden Files" header and the example command
lines in that section for this change.

104-114: ⚠️ Potential issue | 🟠 Major

Add the required 5-minute timeout to contract test commands.

The contract commands in this section omit the required timeout flag, which diverges from the project’s contract test standard.

🔧 Suggested doc fix
 # Run contract tests
-go test -v -tags=contract ./tests/contract/...
+go test -v -tags=contract -timeout=5m ./tests/contract/...

 # Run specific provider tests
-go test -v -tags=contract ./tests/contract/... -run TestOpenAI
+go test -v -tags=contract -timeout=5m ./tests/contract/... -run TestOpenAI

Based on learnings: Contract tests must use golden file validation and include -tags=contract build tag with 5-minute timeout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/TESTING_STRATEGY.md` around lines 104 - 114, Update the contract test
command examples in TESTING_STRATEGY.md to include the required 5-minute timeout
flag; specifically modify the two command lines that run contract tests (the
generic run and the provider-specific run) to add the Go test flag `-timeout 5m`
alongside the existing `-v -tags=contract` so they match the project’s contract
test standard.
♻️ Duplicate comments (3)
tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json (1)

25-30: ⚠️ Potential issue | 🟡 Minor

Fix inconsistent prompt token breakdown.

prompt_tokens is 13, but on line 30 text_tokens is 0. The detail fields should reconcile with the parent total for deterministic contract assertions.

🔧 Proposed fix
     "prompt_tokens": 13,
     "prompt_tokens_details": {
       "audio_tokens": 0,
       "cached_tokens": 0,
       "image_tokens": 0,
-      "text_tokens": 0
+      "text_tokens": 13
     },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json`
around lines 25 - 30, The prompt token breakdown is inconsistent: prompt_tokens
is 13 but prompt_tokens_details fields sum to 0; update prompt_tokens_details
(e.g., text_tokens, audio_tokens, image_tokens, cached_tokens) so their sum
equals the parent prompt_tokens (13). Locate the JSON object with keys
"prompt_tokens" and "prompt_tokens_details" in the golden file and set the
appropriate detail values (most likely set "text_tokens" to 13 or distribute
across the detail keys) so the totals reconcile.
.github/workflows/test.yml (1)

88-102: ⚠️ Potential issue | 🟡 Minor

Please ship the contract README update with this migration.

The new replay CI path changes contributor workflow, so tests/contract/README.md should be updated in the same PR with the replay run/validation instructions.

Based on learnings: Applies to tests/contract/README.md : Update tests/contract/README.md when modifying contract testing procedures or golden file validation processes

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test.yml around lines 88 - 102, The CI now adds a
"test-contract" job that runs "go test -v -tags=contract -timeout=5m
./tests/contract/..." (step name "Run contract replay tests"), so update
tests/contract/README.md in the same PR to include the new replay CI path and
explicit contributor instructions for running the replay locally and validating
golden files (commands, flags, expected outputs, and how to accept/update golden
files); ensure the README references the "test-contract" job and the exact go
test invocation used by CI so contributors can reproduce the CI run and
validation steps.
tests/contract/testdata/golden/gemini/responses_stream.golden.json (1)

30-37: ⚠️ Potential issue | 🟠 Major

Add usage to response.completed payload.

response.completed should include usage tokens to keep streaming golden shape consistent with other providers and non-stream responses.

💡 Suggested patch
       "Payload": {
         "response": {
           "created_at": 0,
           "id": "resp_\u003cgenerated\u003e",
           "model": "gemini-2.5-flash",
           "object": "response",
           "provider": "gemini",
-          "status": "completed"
+          "status": "completed",
+          "usage": {
+            "input_tokens": 0,
+            "output_tokens": 0,
+            "total_tokens": 0
+          }
         },
         "type": "response.completed"
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/gemini/responses_stream.golden.json` around
lines 30 - 37, The response.completed payload is missing the usage field; update
the "response" object (the completed case with "status": "completed" and id
"resp_<generated>") to include a "usage" key matching the other
providers/non-stream shapes (e.g., token counts like total_tokens and
per-category fields used in other golden files) so the streaming golden shape is
consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/recordapi/main.go`:
- Around line 95-111: Add documentation for the new responses and
responses_stream endpoints to tests/contract/README.md: describe the
/v1/responses endpoint, the two example request bodies shown (keys "model",
"input" and the "stream": true option for responses_stream), and list which
providers support this capability (openai and xai); update the "endpoint
options" section to include the streaming flag and model choice and note any
provider-specific behavior for recording so contract tests and golden-file
validation reference the new endpoints.

In `@internal/providers/anthropic/anthropic.go`:
- Around line 653-656: The code assigns raw b.Input (possibly with varying
whitespace) directly to the local variable arguments which later becomes
Function.Arguments; to avoid golden-file churn, parse b.Input as JSON into an
interface{} and re-marshal it with encoding/json to produce canonicalized JSON
(e.g., json.Unmarshal -> json.Marshal) before assigning to
arguments/Function.Arguments, and if parsing fails fall back to "{}" or the
original string while logging or handling the error appropriately; update the
block around the variables arguments and b.Input in anthropic.go to perform this
canonicalization.

In `@internal/providers/gemini/gemini.go`:
- Around line 205-244: The current logic in the Gemini provider rejects valid
responses like {"models":[]} because the if condition checks both json.Unmarshal
success and len(geminiResp.Models) > 0; update the branch handling
json.Unmarshal into geminiResp so it treats an empty geminiResp.Models slice as
a valid response: remove the len(geminiResp.Models) > 0 requirement (or
otherwise allow zero-length slices), keep the existing loop that builds models
(which will produce an empty models slice), and return a core.ModelsResponse
with Object:"list" and Data:models; adjust code around geminiResp,
ModelsResponse, and the supportsGenerate/supportsEmbed checks accordingly.

In `@tests/contract/openai_test.go`:
- Line 3: Update the contract tests README to reflect the new replay-harness
workflow and changed fixture/golden flow: in the contract tests README (the
README.md next to the contract tests) describe the new execution steps
(including running contract tests with the -tags=contract -timeout=5m flags as
used by openai_test.go), explain how fixtures and golden files are produced and
validated under the replay-harness (how to run replays, how to regenerate
goldens), and update CI/PR guidance (what to check, when to commit updated
golden files and how to run the local validation commands). Ensure the README
clearly lists commands, expected outputs, and the exact developer workflow for
replay vs live runs so contributors can reproduce and update goldens correctly.

In `@tests/contract/testdata/golden/openai/chat_with_params.golden.json`:
- Around line 1-34: Update the contract testing README to reflect the new
replay-harness workflow: add concise, ordered steps for recording new
interactions, replaying tests, and validating golden files (including the
command-line invocations, required env vars, and CI expectations), document how
to regenerate/update golden files such as chat_with_params.golden.json and where
to place them, and include troubleshooting notes about common failures and the
expected validation output so contributors follow the new process consistently.

---

Outside diff comments:
In `@docs/TESTING_STRATEGY.md`:
- Around line 160-167: Replace the direct recordapi invocation under the
"Recording New Golden Files" section with the standardized make target; update
the example that currently shows `go run ./cmd/recordapi -provider=openai
-endpoint=chat -output=...` to instruct users to run `make record-api` instead
and add a short note reminding contract tests to use golden-file validation with
the `-tags=contract` build tag and a 5-minute timeout; target the "Recording New
Golden Files" header and the example command lines in that section for this
change.
- Around line 104-114: Update the contract test command examples in
TESTING_STRATEGY.md to include the required 5-minute timeout flag; specifically
modify the two command lines that run contract tests (the generic run and the
provider-specific run) to add the Go test flag `-timeout 5m` alongside the
existing `-v -tags=contract` so they match the project’s contract test standard.

---

Duplicate comments:
In @.github/workflows/test.yml:
- Around line 88-102: The CI now adds a "test-contract" job that runs "go test
-v -tags=contract -timeout=5m ./tests/contract/..." (step name "Run contract
replay tests"), so update tests/contract/README.md in the same PR to include the
new replay CI path and explicit contributor instructions for running the replay
locally and validating golden files (commands, flags, expected outputs, and how
to accept/update golden files); ensure the README references the "test-contract"
job and the exact go test invocation used by CI so contributors can reproduce
the CI run and validation steps.

In `@tests/contract/testdata/golden/gemini/responses_stream.golden.json`:
- Around line 30-37: The response.completed payload is missing the usage field;
update the "response" object (the completed case with "status": "completed" and
id "resp_<generated>") to include a "usage" key matching the other
providers/non-stream shapes (e.g., token counts like total_tokens and
per-category fields used in other golden files) so the streaming golden shape is
consistent.

In `@tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json`:
- Around line 25-30: The prompt token breakdown is inconsistent: prompt_tokens
is 13 but prompt_tokens_details fields sum to 0; update prompt_tokens_details
(e.g., text_tokens, audio_tokens, image_tokens, cached_tokens) so their sum
equals the parent prompt_tokens (13). Locate the JSON object with keys
"prompt_tokens" and "prompt_tokens_details" in the golden file and set the
appropriate detail values (most likely set "text_tokens" to 13 or distribute
across the detail keys) so the totals reconcile.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 07900ae and d884ab0.

📒 Files selected for processing (59)
  • .github/workflows/test.yml
  • cmd/recordapi/main.go
  • docs/TESTING_STRATEGY.md
  • internal/core/types.go
  • internal/providers/anthropic/anthropic.go
  • internal/providers/gemini/gemini.go
  • tests/contract/README.md
  • tests/contract/anthropic_test.go
  • tests/contract/gemini_test.go
  • tests/contract/golden_helpers_test.go
  • tests/contract/groq_test.go
  • tests/contract/openai_test.go
  • tests/contract/replay_harness_test.go
  • tests/contract/replay_provider_helpers_test.go
  • tests/contract/testdata/golden/anthropic/messages.golden.json
  • tests/contract/testdata/golden/anthropic/messages_extended_thinking.golden.json
  • tests/contract/testdata/golden/anthropic/messages_multi_turn.golden.json
  • tests/contract/testdata/golden/anthropic/messages_multimodal.golden.json
  • tests/contract/testdata/golden/anthropic/messages_stream.golden.json
  • tests/contract/testdata/golden/anthropic/messages_with_params.golden.json
  • tests/contract/testdata/golden/anthropic/messages_with_tools.golden.json
  • tests/contract/testdata/golden/anthropic/responses.golden.json
  • tests/contract/testdata/golden/anthropic/responses_stream.golden.json
  • tests/contract/testdata/golden/gemini/chat_completion.golden.json
  • tests/contract/testdata/golden/gemini/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/gemini/chat_with_params.golden.json
  • tests/contract/testdata/golden/gemini/chat_with_tools.golden.json
  • tests/contract/testdata/golden/gemini/models.golden.json
  • tests/contract/testdata/golden/gemini/responses.golden.json
  • tests/contract/testdata/golden/gemini/responses_stream.golden.json
  • tests/contract/testdata/golden/groq/chat_completion.golden.json
  • tests/contract/testdata/golden/groq/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/groq/chat_with_params.golden.json
  • tests/contract/testdata/golden/groq/chat_with_tools.golden.json
  • tests/contract/testdata/golden/groq/models.golden.json
  • tests/contract/testdata/golden/groq/responses.golden.json
  • tests/contract/testdata/golden/groq/responses_stream.golden.json
  • tests/contract/testdata/golden/openai/chat_completion.golden.json
  • tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json
  • tests/contract/testdata/golden/openai/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/openai/chat_json_mode.golden.json
  • tests/contract/testdata/golden/openai/chat_multi_turn.golden.json
  • tests/contract/testdata/golden/openai/chat_multimodal.golden.json
  • tests/contract/testdata/golden/openai/chat_with_params.golden.json
  • tests/contract/testdata/golden/openai/chat_with_tools.golden.json
  • tests/contract/testdata/golden/openai/models.golden.json
  • tests/contract/testdata/golden/openai/responses.golden.json
  • tests/contract/testdata/golden/openai/responses_stream.golden.json
  • tests/contract/testdata/golden/xai/chat_completion.golden.json
  • tests/contract/testdata/golden/xai/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/xai/chat_with_params.golden.json
  • tests/contract/testdata/golden/xai/models.golden.json
  • tests/contract/testdata/golden/xai/responses.golden.json
  • tests/contract/testdata/golden/xai/responses_stream.golden.json
  • tests/contract/testdata/openai/responses.json
  • tests/contract/testdata/openai/responses_stream.txt
  • tests/contract/testdata/xai/responses.json
  • tests/contract/testdata/xai/responses_stream.txt
  • tests/contract/xai_test.go

Comment on lines +95 to +111
"responses": {
path: "/v1/responses",
method: http.MethodPost,
requestBody: map[string]interface{}{
"model": "gpt-4o-mini",
"input": "Say 'Hello, World!' and nothing else.",
},
},
"responses_stream": {
path: "/v1/responses",
method: http.MethodPost,
requestBody: map[string]interface{}{
"model": "gpt-4o-mini",
"input": "Say 'Hello, World!' and nothing else.",
"stream": true,
},
},

🧹 Nitpick | 🔵 Trivial

Consider updating tests/contract/README.md with new endpoint documentation.

The new responses and responses_stream endpoints add recording capabilities that users should be aware of. The README should document:

  • The new endpoint options
  • Which providers support the /v1/responses capability (openai, xai)

Based on learnings: "Update tests/contract/README.md when modifying contract testing procedures or golden file validation processes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/recordapi/main.go` around lines 95 - 111, Add documentation for the new
responses and responses_stream endpoints to tests/contract/README.md: describe
the /v1/responses endpoint, the two example request bodies shown (keys "model",
"input" and the "stream": true option for responses_stream), and list which
providers support this capability (openai and xai); update the "endpoint
options" section to include the streaming flag and model choice and note any
provider-specific behavior for recording so contract tests and golden-file
validation reference the new endpoints.

Comment on lines +653 to +656
arguments := "{}"
if len(b.Input) > 0 {
	arguments = string(b.Input)
}

🧹 Nitpick | 🔵 Trivial

Canonicalize tool-call argument JSON before assigning Function.Arguments.

Line 655 currently preserves raw whitespace from upstream input, which can cause golden churn for semantically identical argument objects.

♻️ Suggested refinement
-		arguments := "{}"
-		if len(b.Input) > 0 {
-			arguments = string(b.Input)
-		}
+		arguments := "{}"
+		if len(bytes.TrimSpace(b.Input)) > 0 {
+			var compact bytes.Buffer
+			if err := json.Compact(&compact, b.Input); err == nil {
+				arguments = compact.String()
+			} else {
+				arguments = string(b.Input)
+			}
+		}
📝 Committable suggestion


Suggested change
arguments := "{}"
if len(b.Input) > 0 {
	arguments = string(b.Input)
}
arguments := "{}"
if len(bytes.TrimSpace(b.Input)) > 0 {
	var compact bytes.Buffer
	if err := json.Compact(&compact, b.Input); err == nil {
		arguments = compact.String()
	} else {
		arguments = string(b.Input)
	}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/providers/anthropic/anthropic.go` around lines 653 - 656, The code
assigns raw b.Input (possibly with varying whitespace) directly to the local
variable arguments which later becomes Function.Arguments; to avoid golden-file
churn, parse b.Input as JSON into an interface{} and re-marshal it with
encoding/json to produce canonicalized JSON (e.g., json.Unmarshal ->
json.Marshal) before assigning to arguments/Function.Arguments, and if parsing
fails fall back to "{}" or the original string while logging or handling the
error appropriately; update the block around the variables arguments and b.Input
in anthropic.go to perform this canonicalization.

Comment on lines +205 to +244
if err := json.Unmarshal(rawResp.Body, &geminiResp); err == nil && len(geminiResp.Models) > 0 {
	models := make([]core.Model, 0, len(geminiResp.Models))

	for _, gm := range geminiResp.Models {
		// Extract model ID from name (format: "models/gemini-...")
		modelID := strings.TrimPrefix(gm.Name, "models/")

		// Only include models that support generateContent (chat/completion)
		supportsGenerate := false
		for _, method := range gm.SupportedMethods {
			if method == "generateContent" || method == "streamGenerateContent" {
				supportsGenerate = true
				break
			}
		}

		supportsEmbed := false
		for _, method := range gm.SupportedMethods {
			if method == "embedContent" {
				supportsEmbed = true
				break
			}
		}

		isOpenAICompatModel := strings.HasPrefix(modelID, "gemini-") || strings.HasPrefix(modelID, "text-embedding-")
		if (supportsGenerate || supportsEmbed) && isOpenAICompatModel {
			models = append(models, core.Model{
				ID:      modelID,
				Object:  "model",
				OwnedBy: "google",
				Created: now,
			})
		}
	}

	return &core.ModelsResponse{
		Object: "list",
		Data:   models,
	}, nil
}

⚠️ Potential issue | 🟠 Major

Handle empty native model lists as valid responses.

Line 205 currently requires len(geminiResp.Models) > 0, so a valid native response like {"models":[]} falls through and eventually returns an unexpected-format error instead of an empty list.

💡 Proposed fix
-	if err := json.Unmarshal(rawResp.Body, &geminiResp); err == nil && len(geminiResp.Models) > 0 {
+	if err := json.Unmarshal(rawResp.Body, &geminiResp); err == nil && geminiResp.Models != nil {
 		models := make([]core.Model, 0, len(geminiResp.Models))
📝 Committable suggestion


Suggested change
if err := json.Unmarshal(rawResp.Body, &geminiResp); err == nil && len(geminiResp.Models) > 0 {
	models := make([]core.Model, 0, len(geminiResp.Models))

	for _, gm := range geminiResp.Models {
		// Extract model ID from name (format: "models/gemini-...")
		modelID := strings.TrimPrefix(gm.Name, "models/")

		// Only include models that support generateContent (chat/completion)
		supportsGenerate := false
		for _, method := range gm.SupportedMethods {
			if method == "generateContent" || method == "streamGenerateContent" {
				supportsGenerate = true
				break
			}
		}

		supportsEmbed := false
		for _, method := range gm.SupportedMethods {
			if method == "embedContent" {
				supportsEmbed = true
				break
			}
		}

		isOpenAICompatModel := strings.HasPrefix(modelID, "gemini-") || strings.HasPrefix(modelID, "text-embedding-")
		if (supportsGenerate || supportsEmbed) && isOpenAICompatModel {
			models = append(models, core.Model{
				ID:      modelID,
				Object:  "model",
				OwnedBy: "google",
				Created: now,
			})
		}
	}

	return &core.ModelsResponse{
		Object: "list",
		Data:   models,
	}, nil
}
if err := json.Unmarshal(rawResp.Body, &geminiResp); err == nil && geminiResp.Models != nil {
	models := make([]core.Model, 0, len(geminiResp.Models))

	for _, gm := range geminiResp.Models {
		// Extract model ID from name (format: "models/gemini-...")
		modelID := strings.TrimPrefix(gm.Name, "models/")

		// Only include models that support generateContent (chat/completion)
		supportsGenerate := false
		for _, method := range gm.SupportedMethods {
			if method == "generateContent" || method == "streamGenerateContent" {
				supportsGenerate = true
				break
			}
		}

		supportsEmbed := false
		for _, method := range gm.SupportedMethods {
			if method == "embedContent" {
				supportsEmbed = true
				break
			}
		}

		isOpenAICompatModel := strings.HasPrefix(modelID, "gemini-") || strings.HasPrefix(modelID, "text-embedding-")
		if (supportsGenerate || supportsEmbed) && isOpenAICompatModel {
			models = append(models, core.Model{
				ID:      modelID,
				Object:  "model",
				OwnedBy: "google",
				Created: now,
			})
		}
	}

	return &core.ModelsResponse{
		Object: "list",
		Data:   models,
	}, nil
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/providers/gemini/gemini.go` around lines 205 - 244, The current
logic in the Gemini provider rejects valid responses like {"models":[]} because
the if condition checks both json.Unmarshal success and len(geminiResp.Models) >
0; update the branch handling json.Unmarshal into geminiResp so it treats an
empty geminiResp.Models slice as a valid response: remove the
len(geminiResp.Models) > 0 requirement (or otherwise allow zero-length slices),
keep the existing loop that builds models (which will produce an empty models
slice), and return a core.ModelsResponse with Object:"list" and Data:models;
adjust code around geminiResp, ModelsResponse, and the
supportsGenerate/supportsEmbed checks accordingly.

@@ -1,221 +1,149 @@
//go:build contract

// Contract tests in this file are intended to run with: -tags=contract -timeout=5m.

⚠️ Potential issue | 🟡 Minor

Update contract testing README for replay-harness workflow changes.

This migration changes contract test execution and fixture/golden flow; please update tests/contract/README.md accordingly so contributors follow the new process.

Based on learnings: Applies to tests/contract/README.md : Update tests/contract/README.md when modifying contract testing procedures or golden file validation processes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/openai_test.go` at line 3, Update the contract tests README to
reflect the new replay-harness workflow and changed fixture/golden flow: in the
contract tests README (the README.md next to the contract tests) describe the
new execution steps (including running contract tests with the -tags=contract
-timeout=5m flags as used by openai_test.go), explain how fixtures and golden
files are produced and validated under the replay-harness (how to run replays,
how to regenerate goldens), and update CI/PR guidance (what to check, when to
commit updated golden files and how to run the local validation commands).
Ensure the README clearly lists commands, expected outputs, and the exact
developer workflow for replay vs live runs so contributors can reproduce and
update goldens correctly.

Comment on lines +1 to +34
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "1, 2, 3, 4, ",
"role": "assistant"
}
}
],
"created": 0,
"id": "chatcmpl-D0qOmQKlnUcfmQ1UR9UNvcYikTjhe",
"model": "gpt-4o-mini-2024-07-18",
"object": "chat.completion",
"provider": "",
"usage": {
"completion_tokens": 13,
"completion_tokens_details": {
"accepted_prediction_tokens": 0,
"audio_tokens": 0,
"reasoning_tokens": 0,
"rejected_prediction_tokens": 0
},
"prompt_tokens": 27,
"prompt_tokens_details": {
"audio_tokens": 0,
"cached_tokens": 0,
"image_tokens": 0,
"text_tokens": 0
},
"total_tokens": 40
}
}

⚠️ Potential issue | 🟡 Minor

Update contract docs for the replay-harness migration.

This PR changes the contract testing workflow; please ensure tests/contract/README.md is updated with replay/record/validation steps so contributors follow the new process consistently.
Based on learnings: Applies to tests/contract/README.md : Update tests/contract/README.md when modifying contract testing procedures or golden file validation processes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/openai/chat_with_params.golden.json` around
lines 1 - 34, Update the contract testing README to reflect the new
replay-harness workflow: add concise, ordered steps for recording new
interactions, replaying tests, and validating golden files (including the
command-line invocations, required env vars, and CI expectations), document how
to regenerate/update golden files such as chat_with_params.golden.json and where
to place them, and include troubleshooting notes about common failures and the
expected validation output so contributors follow the new process consistently.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

♻️ Duplicate comments (6)
tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json (1)

25-30: ⚠️ Potential issue | 🟡 Minor

Fix inconsistent prompt token details.

prompt_tokens is 13, but prompt_tokens_details currently sums to 0.

Suggested fix
     "prompt_tokens": 13,
     "prompt_tokens_details": {
       "audio_tokens": 0,
       "cached_tokens": 0,
       "image_tokens": 0,
-      "text_tokens": 0
+      "text_tokens": 13
     },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json`
around lines 25 - 30, The JSON has inconsistent token counts: "prompt_tokens" is
13 but "prompt_tokens_details" (audio_tokens, cached_tokens, image_tokens,
text_tokens) sums to 0; update the "prompt_tokens_details" fields so their
summed value equals 13 (or if the detailed breakdown is unknown, set
"text_tokens": 13 and others 0) to keep the top-level "prompt_tokens" consistent
with the detailed keys.
.github/workflows/test.yml (1)

88-102: ⚠️ Potential issue | 🟡 Minor

Please include the contract-testing procedure update in tests/contract/README.md.

This PR changes contributor-facing contract execution flow in CI; the README should be updated in the same change set.

Based on learnings: Applies to tests/contract/README.md : Update tests/contract/README.md when modifying contract testing procedures or golden file validation processes

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/test.yml around lines 88 - 102, Update
tests/contract/README.md to reflect the changed CI contract-testing flow
introduced by the test-contract job: document the exact command and flags used
in CI (go test -v -tags=contract -timeout=5m ./tests/contract/...), any required
environment variables (e.g., GO_VERSION) or setup steps, and the updated
contributor workflow for running contract replay tests locally; also describe
the golden file validation process and how to regenerate or approve golden files
so contributors know how to reproduce CI results and fix failures locally.
tests/contract/testdata/golden/openai/chat_multimodal.golden.json (1)

13-13: ⚠️ Potential issue | 🟡 Minor

Normalize volatile completion IDs in this golden fixture.

The concrete chatcmpl-* value is run-specific and will create avoidable golden churn when fixtures are re-recorded.

Suggested stabilization
-  "id": "chatcmpl-D0qUlBEghKIASKSVRE4fnjs5QUDGF",
+  "id": "chatcmpl_<generated>",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/openai/chat_multimodal.golden.json` at line
13, Replace the run-specific "id" value that starts with "chatcmpl-" in the
golden fixture by normalizing it to a stable placeholder (e.g.,
"chatcmpl-<normalized>" or "<CHAT_COMPLETION_ID>") in
tests/contract/testdata/golden/openai/chat_multimodal.golden.json; update the
JSON "id" field instead of keeping the concrete run-specific value so re-records
won’t change the golden, and if the test harness performs strict compares ensure
it treats this placeholder as the canonical ID (or add normalization logic
around the comparison that strips or masks the "chatcmpl-*" pattern).
tests/contract/testdata/golden/gemini/responses_stream.golden.json (1)

26-39: ⚠️ Potential issue | 🟠 Major

Add missing usage field to response.completed event.

The response.completed event lacks usage metrics (input_tokens, output_tokens, total_tokens), which other providers include in streaming responses. This creates inconsistency with Anthropic/OpenAI streaming fixtures and Gemini's own non-streaming fixtures.

Proposed fix
       "Payload": {
         "response": {
           "created_at": 0,
           "id": "resp_\u003cgenerated\u003e",
           "model": "gemini-2.5-flash",
           "object": "response",
           "provider": "gemini",
-          "status": "completed"
+          "status": "completed",
+          "usage": {
+            "input_tokens": 0,
+            "output_tokens": 0,
+            "total_tokens": 0
+          }
         },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/gemini/responses_stream.golden.json` around
lines 26 - 39, The golden fixture for the "response.completed" streaming event
is missing the "usage" metrics; update the "response.completed" event inside
responses_stream.golden.json so the payload.response includes a "usage" object
with numeric fields "input_tokens", "output_tokens", and "total_tokens" (match
values/format used by other providers' streaming fixtures and Gemini
non-streaming fixtures) to keep streaming responses consistent.
tests/contract/xai_test.go (1)

127-136: ⚠️ Potential issue | 🟡 Minor

Add [DONE] termination assertion to TestXAIReplayStreamResponses.

Unlike the Groq and Anthropic streaming response tests, this test does not verify that the stream terminates with the [DONE] marker. This inconsistency could allow regressions in stream termination behavior to go undetected.

Proposed fix
 	raw := readAllStream(t, stream)
 	events := parseResponsesStream(t, raw)
 	require.True(t, hasResponsesEvent(events, "response.created"))
 	require.True(t, hasResponsesEvent(events, "response.output_text.delta"))
 	require.True(t, hasResponsesEvent(events, "response.completed"))
+
+	hasDone := false
+	for _, event := range events {
+		if event.Done {
+			hasDone = true
+			break
+		}
+	}
+	require.True(t, hasDone, "responses stream should terminate with [DONE]")

 	compareGoldenJSON(t, goldenPathForFixture("xai/responses_stream.txt"), map[string]any{
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/xai_test.go` around lines 127 - 136, The test reads the full
stream into raw but doesn't assert the stream termination marker; modify the
test (TestXAIReplayStreamResponses) to assert the stream ends with the expected
"[DONE]" marker by checking raw (or the final parsed event) for the "[DONE]"
terminator (e.g., use strings.HasSuffix(raw, "[DONE]") or verify the last
element of events equals the termination marker) and add a
require.True/assertion using raw or events to fail the test if the marker is
missing.
tests/contract/openai_test.go (1)

136-148: ⚠️ Potential issue | 🟡 Minor

Add [DONE] termination assertion for consistency with Gemini tests.

The Gemini streaming responses test validates that the stream terminates with a [DONE] marker (see TestGeminiReplayStreamResponses lines 115-122), but this test does not include the same validation. This could miss regressions in stream termination behavior.

🛡️ Suggested fix to add termination check
 	require.True(t, hasResponsesEvent(events, "response.created"))
 	require.True(t, hasResponsesEvent(events, "response.output_text.delta"))
 	require.True(t, hasResponsesEvent(events, "response.completed"))
 	require.NotEmpty(t, extractResponsesStreamText(events))

+	hasDone := false
+	for _, event := range events {
+		if event.Done {
+			hasDone = true
+			break
+		}
+	}
+	require.True(t, hasDone, "responses stream should terminate with [DONE]")
+
 	compareGoldenJSON(t, "openai/responses_stream.golden.json", map[string]any{
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/openai_test.go` around lines 136 - 148, After obtaining the
raw stream in tests/contract/openai_test.go (variable raw produced by
readAllStream) add an assertion that the stream termination marker is present
for consistency with Gemini tests: check that raw ends with or contains the
"[DONE]" marker (e.g., assert strings.HasSuffix(raw, "[DONE]\n") or
require.Contains(raw, "[DONE]")) before parsing and comparing events (so keep
the existing calls to parseResponsesStream, hasResponsesEvent,
extractResponsesStreamText, and compareGoldenJSON).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/providers/gemini/gemini.go`:
- Around line 167-174: The current headers closure adds the API key as a query
parameter (q.Set("key", p.apiKey)), which exposes credentials in logs; update
the headers function used in gemini.go to stop writing p.apiKey to
req.URL.RawQuery and instead set the API key on the request header using the
x-goog-api-key header (e.g., call req.Header.Set("x-goog-api-key", p.apiKey));
ensure you remove the query parameter logic so only header-based authentication
is used for the Gemini models endpoint.

In `@tests/contract/gemini_test.go`:
- Around line 83-96: TestGeminiReplayResponses lacks a pre-check that the golden
file exists; add a call to goldenFileExists(t, "gemini/responses.golden.json")
at the start of TestGeminiReplayResponses (before creating the provider and
calling provider.Responses) similar to TestOpenAIReplayResponses so the test
fails with a clear message if the golden file is missing; keep
compareGoldenJSON(t, "gemini/responses.golden.json", resp) as-is.

In `@tests/contract/replay_harness_test.go`:
- Around line 140-143: The test's SSE parsing is stripping significant
whitespace from `data:` payloads by using TrimSpace; change the `data:` branch
to preserve raw payload bytes by replacing dataLines = append(dataLines,
strings.TrimSpace(strings.TrimPrefix(line, "data:"))) with appending the result
of only removing the "data:" prefix (e.g. strings.TrimPrefix(line, "data:") or
line[len("data:"):]) so leading/trailing spaces are kept; leave the `event:`
handling (`currentName = strings.TrimSpace(strings.TrimPrefix(line, "event:"))`)
unchanged unless events need different treatment.
- Around line 38-42: The current fallback logic in the request matching (the
block that checks req.URL.RawQuery and then sets key = replayKey(req.Method,
req.URL.Path) and looks up rt.routes[key]) masks regressions by allowing a
path-only route to match when query parameters differ; remove this fallback so
matching uses the full RequestURI/RawQuery and fails when the query is missing
or different. Update the matching code around replayKey, req.URL.RawQuery,
req.URL.Path and rt.routes to require the original key (including query) and do
not attempt the path-only lookup; if desired, add an explicit test or flag to
opt into query-agnostic matching rather than silently falling back.

In `@tests/contract/testdata/golden/groq/responses_stream.golden.json`:
- Around line 54-60: Normalize the volatile timing fields in the usage payload
by replacing or rounding completion_time, prompt_time, queue_time, and
total_time to a stable value (e.g., 0.0 or a fixed sentinel) before
serializing/generating the golden file; leave token counts (completion_tokens,
prompt_tokens, total_tokens) unchanged. Locate the code path that builds or
serializes the usage payload (the function that emits the JSON containing keys
completion_time, prompt_time, queue_time, total_time) and apply the
normalization there so tests always produce deterministic timing values. Ensure
the change only affects those timing fields and not token metrics or other
payload data.

In `@tests/contract/testdata/golden/openai/chat_completion_stream.golden.json`:
- Around line 17-23: Replace concrete volatile metadata values in each chat
completion chunk by normalizing the "id", "obfuscation", and
"system_fingerprint" fields to stable placeholders (e.g. "<ID>",
"<OBFUSCATION>", "<SYSTEM_FINGERPRINT>") so golden snapshots do not churn;
update every chunk instance (the examples around the shown chunk and the ones at
the other ranges) to use these placeholders or a deterministic generator used by
your test harness that returns stable values for "id", "obfuscation", and
"system_fingerprint".

In `@tests/contract/testdata/golden/openai/chat_multi_turn.golden.json`:
- Line 13: The golden fixture contains a concrete completion ID ("id":
"chatcmpl-D0qOykS3WniK2xU97yFf96Aa4Me8d") which makes tests brittle; replace
that value with a sanitized/deterministic placeholder (e.g.,
"chatcmpl-<SANITIZED>" or "chatcmpl-0000000000") so the "id" field in
tests/contract/testdata/golden/openai/chat_multi_turn.golden.json is stable
across re-recordings and does not contain real/variable IDs.

In `@tests/contract/testdata/golden/openai/responses_stream.golden.json`:
- Line 146: Replace run-specific volatile fields in the stream fixture by
normalizing the "obfuscation" and "completed_at" JSON keys to stable sentinel
values (e.g., fixed string like "REDACTED" or a constant timestamp) wherever
they appear (including the instances at the noted locations), so golden
comparisons are stable; update the entries that currently contain dynamic tokens
(the "obfuscation" values and any "completed_at" timestamps) to the chosen
normalized value and ensure any test helpers that generate fixtures follow the
same normalization convention.

In `@tests/contract/testdata/golden/xai/chat_completion.golden.json`:
- Line 13: Replace the concrete UUID value assigned to the JSON "id" field in
the golden fixture with a stable placeholder to make the snapshot deterministic;
locate the "id" key in the chat completion golden fixture and substitute the
UUID with a fixed token such as 00000000-0000-0000-0000-000000000000 or a clear
placeholder like <ID_PLACEHOLDER>, and ensure any tests or snapshot update logic
expect this placeholder instead of a variable UUID.

---

Duplicate comments:
In @.github/workflows/test.yml:
- Around line 88-102: Update tests/contract/README.md to reflect the changed CI
contract-testing flow introduced by the test-contract job: document the exact
command and flags used in CI (go test -v -tags=contract -timeout=5m
./tests/contract/...), any required environment variables (e.g., GO_VERSION) or
setup steps, and the updated contributor workflow for running contract replay
tests locally; also describe the golden file validation process and how to
regenerate or approve golden files so contributors know how to reproduce CI
results and fix failures locally.

In `@tests/contract/openai_test.go`:
- Around line 136-148: After obtaining the raw stream in
tests/contract/openai_test.go (variable raw produced by readAllStream) add an
assertion that the stream termination marker is present for consistency with
Gemini tests: check that raw ends with or contains the "[DONE]" marker (e.g.,
assert strings.HasSuffix(raw, "[DONE]\n") or require.Contains(raw, "[DONE]"))
before parsing and comparing events (so keep the existing calls to
parseResponsesStream, hasResponsesEvent, extractResponsesStreamText, and
compareGoldenJSON).

In `@tests/contract/testdata/golden/gemini/responses_stream.golden.json`:
- Around line 26-39: The golden fixture for the "response.completed" streaming
event is missing the "usage" metrics; update the "response.completed" event
inside responses_stream.golden.json so the payload.response includes a "usage"
object with numeric fields "input_tokens", "output_tokens", and "total_tokens"
(match values/format used by other providers' streaming fixtures and Gemini
non-streaming fixtures) to keep streaming responses consistent.

In `@tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json`:
- Around line 25-30: The JSON has inconsistent token counts: "prompt_tokens" is
13 but "prompt_tokens_details" (audio_tokens, cached_tokens, image_tokens,
text_tokens) sums to 0; update the "prompt_tokens_details" fields so their
summed value equals 13 (or if the detailed breakdown is unknown, set
"text_tokens": 13 and others 0) to keep the top-level "prompt_tokens" consistent
with the detailed keys.

In `@tests/contract/testdata/golden/openai/chat_multimodal.golden.json`:
- Line 13: Replace the run-specific "id" value that starts with "chatcmpl-" in
the golden fixture by normalizing it to a stable placeholder (e.g.,
"chatcmpl-<normalized>" or "<CHAT_COMPLETION_ID>") in
tests/contract/testdata/golden/openai/chat_multimodal.golden.json; update the
JSON "id" field instead of keeping the concrete run-specific value so re-records
won’t change the golden, and if the test harness performs strict compares ensure
it treats this placeholder as the canonical ID (or add normalization logic
around the comparison that strips or masks the "chatcmpl-*" pattern).

In `@tests/contract/xai_test.go`:
- Around line 127-136: The test reads the full stream into raw but doesn't
assert the stream termination marker; modify the test
(TestXAIReplayStreamResponses) to assert the stream ends with the expected
"[DONE]" marker by checking raw (or the final parsed event) for the "[DONE]"
terminator (e.g., use strings.HasSuffix(raw, "[DONE]") or verify the last
element of events equals the termination marker) and add a
require.True/assertion using raw or events to fail the test if the marker is
missing.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 07900ae and 61ca107.

📒 Files selected for processing (59)
  • .github/workflows/test.yml
  • cmd/recordapi/main.go
  • docs/TESTING_STRATEGY.md
  • internal/core/types.go
  • internal/providers/anthropic/anthropic.go
  • internal/providers/gemini/gemini.go
  • tests/contract/README.md
  • tests/contract/anthropic_test.go
  • tests/contract/gemini_test.go
  • tests/contract/golden_helpers_test.go
  • tests/contract/groq_test.go
  • tests/contract/openai_test.go
  • tests/contract/replay_harness_test.go
  • tests/contract/replay_provider_helpers_test.go
  • tests/contract/testdata/golden/anthropic/messages.golden.json
  • tests/contract/testdata/golden/anthropic/messages_extended_thinking.golden.json
  • tests/contract/testdata/golden/anthropic/messages_multi_turn.golden.json
  • tests/contract/testdata/golden/anthropic/messages_multimodal.golden.json
  • tests/contract/testdata/golden/anthropic/messages_stream.golden.json
  • tests/contract/testdata/golden/anthropic/messages_with_params.golden.json
  • tests/contract/testdata/golden/anthropic/messages_with_tools.golden.json
  • tests/contract/testdata/golden/anthropic/responses.golden.json
  • tests/contract/testdata/golden/anthropic/responses_stream.golden.json
  • tests/contract/testdata/golden/gemini/chat_completion.golden.json
  • tests/contract/testdata/golden/gemini/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/gemini/chat_with_params.golden.json
  • tests/contract/testdata/golden/gemini/chat_with_tools.golden.json
  • tests/contract/testdata/golden/gemini/models.golden.json
  • tests/contract/testdata/golden/gemini/responses.golden.json
  • tests/contract/testdata/golden/gemini/responses_stream.golden.json
  • tests/contract/testdata/golden/groq/chat_completion.golden.json
  • tests/contract/testdata/golden/groq/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/groq/chat_with_params.golden.json
  • tests/contract/testdata/golden/groq/chat_with_tools.golden.json
  • tests/contract/testdata/golden/groq/models.golden.json
  • tests/contract/testdata/golden/groq/responses.golden.json
  • tests/contract/testdata/golden/groq/responses_stream.golden.json
  • tests/contract/testdata/golden/openai/chat_completion.golden.json
  • tests/contract/testdata/golden/openai/chat_completion_reasoning.golden.json
  • tests/contract/testdata/golden/openai/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/openai/chat_json_mode.golden.json
  • tests/contract/testdata/golden/openai/chat_multi_turn.golden.json
  • tests/contract/testdata/golden/openai/chat_multimodal.golden.json
  • tests/contract/testdata/golden/openai/chat_with_params.golden.json
  • tests/contract/testdata/golden/openai/chat_with_tools.golden.json
  • tests/contract/testdata/golden/openai/models.golden.json
  • tests/contract/testdata/golden/openai/responses.golden.json
  • tests/contract/testdata/golden/openai/responses_stream.golden.json
  • tests/contract/testdata/golden/xai/chat_completion.golden.json
  • tests/contract/testdata/golden/xai/chat_completion_stream.golden.json
  • tests/contract/testdata/golden/xai/chat_with_params.golden.json
  • tests/contract/testdata/golden/xai/models.golden.json
  • tests/contract/testdata/golden/xai/responses.golden.json
  • tests/contract/testdata/golden/xai/responses_stream.golden.json
  • tests/contract/testdata/openai/responses.json
  • tests/contract/testdata/openai/responses_stream.txt
  • tests/contract/testdata/xai/responses.json
  • tests/contract/testdata/xai/responses_stream.txt
  • tests/contract/xai_test.go

Comment on lines +54 to +60
"completion_time": 0.027554055,
"completion_tokens": 4,
"prompt_time": 0.003210785,
"prompt_tokens": 38,
"queue_time": 0.090321431,
"total_time": 0.03076484,
"total_tokens": 42

⚠️ Potential issue | 🟡 Minor

Normalize volatile latency metrics in usage payload.

Timing values are environment-dependent and can create unnecessary golden diffs.

Suggested stabilization
-            "completion_time": 0.027554055,
+            "completion_time": 0,
...
-            "prompt_time": 0.003210785,
+            "prompt_time": 0,
...
-            "queue_time": 0.090321431,
+            "queue_time": 0,
...
-            "total_time": 0.03076484,
+            "total_time": 0,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"completion_time": 0.027554055,
"completion_tokens": 4,
"prompt_time": 0.003210785,
"prompt_tokens": 38,
"queue_time": 0.090321431,
"total_time": 0.03076484,
"total_tokens": 42
"completion_time": 0,
"completion_tokens": 4,
"prompt_time": 0,
"prompt_tokens": 38,
"queue_time": 0,
"total_time": 0,
"total_tokens": 42
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/groq/responses_stream.golden.json` around
lines 54 - 60, Normalize the volatile timing fields in the usage payload by
replacing or rounding completion_time, prompt_time, queue_time, and total_time
to a stable value (e.g., 0.0 or a fixed sentinel) before serializing/generating
the golden file; leave token counts (completion_tokens, prompt_tokens,
total_tokens) unchanged. Locate the code path that builds or serializes the
usage payload (the function that emits the JSON containing keys completion_time,
prompt_time, queue_time, total_time) and apply the normalization there so tests
always produce deterministic timing values. Ensure the change only affects those
timing fields and not token metrics or other payload data.

Comment on lines +17 to +23
"id": "chatcmpl-D0q8CZXN1uEzngZGkU7FuV92lnXWq",
"model": "gpt-4o-mini-2024-07-18",
"obfuscation": "iPgRzm",
"object": "chat.completion.chunk",
"service_tier": "default",
"system_fingerprint": "fp_29330a9688"
},

⚠️ Potential issue | 🟡 Minor

Sanitize volatile stream metadata across all chunks.

Concrete id, obfuscation, and system_fingerprint fields are unstable and likely to churn on re-record.

Suggested normalization pattern (apply to each chunk)
-      "id": "chatcmpl-D0q8CZXN1uEzngZGkU7FuV92lnXWq",
+      "id": "chatcmpl_<generated>",
...
-      "obfuscation": "iPgRzm",
+      "obfuscation": "<generated>",
...
-      "system_fingerprint": "fp_29330a9688"
+      "system_fingerprint": "fp_<generated>"

Also applies to: 36-42, 55-61, 74-80, 93-99, 110-116

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/openai/chat_completion_stream.golden.json`
around lines 17 - 23, Replace concrete volatile metadata values in each chat
completion chunk by normalizing the "id", "obfuscation", and
"system_fingerprint" fields to stable placeholders (e.g. "<ID>",
"<OBFUSCATION>", "<SYSTEM_FINGERPRINT>") so golden snapshots do not churn;
update every chunk instance (the examples around the shown chunk and the ones at
the other ranges) to use these placeholders or a deterministic generator used by
your test harness that returns stable values for "id", "obfuscation", and
"system_fingerprint".

}
],
"created": 0,
"id": "chatcmpl-D0qOykS3WniK2xU97yFf96Aa4Me8d",

⚠️ Potential issue | 🟡 Minor

Use a sanitized completion ID in golden data.

A concrete completion ID makes this fixture unstable across re-recordings.

Suggested stabilization
-  "id": "chatcmpl-D0qOykS3WniK2xU97yFf96Aa4Me8d",
+  "id": "chatcmpl_<generated>",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"id": "chatcmpl-D0qOykS3WniK2xU97yFf96Aa4Me8d",
"id": "chatcmpl_<generated>",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/openai/chat_multi_turn.golden.json` at line
13, the golden fixture contains a concrete completion ID ("id":
"chatcmpl-D0qOykS3WniK2xU97yFf96Aa4Me8d"), which makes the test brittle.
Replace that value with a sanitized, deterministic placeholder (e.g.
"chatcmpl-<SANITIZED>" or "chatcmpl-0000000000") so the "id" field stays stable
across re-recordings and contains no real, variable IDs.
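If the harness stores fixtures as raw bytes rather than decoded JSON, a regexp rewrite over the recorded body is one way to do this. A sketch only — `stabilizeIDs` is an assumed helper name, not part of the suite:

```go
package main

import (
	"fmt"
	"regexp"
)

// chatcmplID matches OpenAI-style completion IDs such as
// "chatcmpl-D0qOykS3WniK2xU97yFf96Aa4Me8d".
var chatcmplID = regexp.MustCompile(`"id":\s*"chatcmpl-[A-Za-z0-9]+"`)

// stabilizeIDs rewrites concrete completion IDs in raw golden JSON
// to a placeholder, so re-recording does not churn the fixture.
func stabilizeIDs(golden []byte) []byte {
	return chatcmplID.ReplaceAll(golden, []byte(`"id": "chatcmpl_<generated>"`))
}

func main() {
	in := []byte(`{"created": 0, "id": "chatcmpl-D0qOykS3WniK2xU97yFf96Aa4Me8d"}`)
	fmt.Println(string(stabilizeIDs(in)))
	// → {"created": 0, "id": "chatcmpl_<generated>"}
}
```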

"delta": "Hello",
"item_id": "msg_\u003cgenerated\u003e",
"logprobs": [],
"obfuscation": "fIVBptC5V5x",

⚠️ Potential issue | 🟡 Minor

Sanitize volatile timestamp/obfuscation fields in stream fixtures.

obfuscation tokens and completed_at are run-specific; keep them normalized for stable golden comparisons.

Suggested normalization
-        "obfuscation": "fIVBptC5V5x",
+        "obfuscation": "<generated>",
...
-          "completed_at": 1772295220,
+          "completed_at": 0,

Also applies to: 160-160, 174-174, 188-188, 253-253

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/openai/responses_stream.golden.json` at line
146, replace the run-specific volatile fields by normalizing the "obfuscation"
and "completed_at" keys to stable sentinel values (e.g. a fixed string such as
"REDACTED" or a constant timestamp) wherever they appear, including the
instances at the other noted locations, so golden comparisons stay stable.
Update the entries that currently contain dynamic tokens, and make any test
helpers that generate fixtures follow the same normalization convention.
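A sketch of that normalization for decoded stream events — the `normalizeEvent` name and the exact sentinel values are assumptions, not part of the harness:

```go
package main

import "fmt"

// normalizeEvent zeroes run-specific fields in one decoded SSE event:
// string "obfuscation" tokens become a placeholder, and numeric
// "created_at"/"completed_at" timestamps become 0. Nested objects and
// arrays of objects are handled; other shapes are left untouched.
func normalizeEvent(event map[string]any) {
	for k, v := range event {
		switch k {
		case "obfuscation":
			event[k] = "<generated>"
		case "created_at", "completed_at":
			if _, ok := v.(float64); ok { // JSON numbers decode as float64
				event[k] = 0
			}
		default:
			switch child := v.(type) {
			case map[string]any:
				normalizeEvent(child)
			case []any:
				for _, item := range child {
					if m, ok := item.(map[string]any); ok {
						normalizeEvent(m)
					}
				}
			}
		}
	}
}

func main() {
	event := map[string]any{
		"delta":       "Hello",
		"obfuscation": "fIVBptC5V5x",
		"response":    map[string]any{"completed_at": float64(1772295220)},
	}
	normalizeEvent(event)
	fmt.Println(event["obfuscation"], event["response"].(map[string]any)["completed_at"])
	// → <generated> 0
}
```

Calling this on every event during recording keeps the stream fixtures identical across runs.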

}
],
"created": 0,
"id": "2ac1119b-8e3f-969e-1c85-9a6dd6d91720",

⚠️ Potential issue | 🟡 Minor

Normalize the fixture response ID.

The concrete UUID should be replaced with a stable placeholder for deterministic golden updates.

Suggested stabilization
-  "id": "2ac1119b-8e3f-969e-1c85-9a6dd6d91720",
+  "id": "<generated>",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/contract/testdata/golden/xai/chat_completion.golden.json` at line 13,
replace the concrete UUID assigned to the "id" field with a stable placeholder
so the snapshot is deterministic: locate the "id" key in the fixture and
substitute the UUID with a fixed token such as
00000000-0000-0000-0000-000000000000 or <ID_PLACEHOLDER>, and ensure any tests
or snapshot-update logic expect this placeholder instead of a variable UUID.
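One way to sanitize such IDs generically is a UUID-shaped regexp rewrite over the recorded body. A sketch; `normalizeUUIDs` is an illustrative name, not an existing helper:

```go
package main

import (
	"fmt"
	"regexp"
)

// uuidPattern matches RFC-4122-shaped lowercase hex IDs like
// "2ac1119b-8e3f-969e-1c85-9a6dd6d91720".
var uuidPattern = regexp.MustCompile(
	`[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`)

// normalizeUUIDs rewrites every UUID-shaped token in the golden
// bytes to a stable placeholder.
func normalizeUUIDs(golden []byte) []byte {
	return uuidPattern.ReplaceAll(golden, []byte("<generated>"))
}

func main() {
	in := []byte(`"id": "2ac1119b-8e3f-969e-1c85-9a6dd6d91720",`)
	fmt.Println(string(normalizeUUIDs(in)))
	// → "id": "<generated>",
}
```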

@SantiagoDePolonia SantiagoDePolonia merged commit 06f79eb into main Feb 28, 2026
13 checks passed
@SantiagoDePolonia SantiagoDePolonia deleted the refactoring-tests branch March 22, 2026 14:25