
Conversation

Contributor

@galkleinman galkleinman commented Nov 26, 2025

Important

Fixes trace context propagation in OpenAI async streaming responses to maintain trace continuity, adds tests, and updates documentation with examples.

  • Behavior:
    • Fixes trace context propagation in responses_get_or_create_wrapper and async_responses_get_or_create_wrapper in responses_wrappers.py to maintain trace continuity across async operations.
    • Adds trace_context to TracedData to store and restore trace context.
  • Tests:
    • Adds test_responses_streaming_with_parent_span and test_responses_streaming_async_with_parent_span in test_responses.py to validate trace context propagation.
    • Adds YAML cassettes test_responses_streaming_async_with_parent_span.yaml and test_responses_streaming_with_parent_span.yaml for VCR testing.
  • Documentation:
    • Adds openai_guardrails_example.py to demonstrate OpenAI Guardrails integration with tracing.
  • Chores:
    • Updates Python runtime to 3.11 in .python-version.

This description was created by Ellipsis for 9c7804a. You can customize this summary. It will automatically update as commits are pushed.


Summary by CodeRabbit

  • New Features

    • Improved trace-context propagation to maintain trace continuity across sync, async, and streaming OpenAI responses and cancellations.
  • Tests

    • Added extensive unit/integration tests and streaming cassettes to validate parent-span propagation, streaming content, and context-manager scenarios.
  • Documentation

    • Added a sample app script demonstrating Guardrails integration with tracing.
  • Chores

    • Updated sample app Python runtime to 3.11.

✏️ Tip: You can customize this high-level summary in your review settings.


coderabbitai bot commented Nov 26, 2025

Walkthrough

Capture and persist OpenTelemetry trace context in response wrappers by adding a trace_context field to TracedData and propagating/restoring that context across synchronous, asynchronous, streaming, and cancellation paths to maintain trace continuity.

Changes

  • Core trace context propagation (packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py): Add trace_context: Any = pydantic.Field(default=None) and Config(arbitrary_types_allowed=True) to TracedData; capture context_api.get_current() when starting spans; persist trace_context into TracedData; restore and pass the stored context when creating subsequent spans across the sync/async/streaming/cancellation wrappers; add comments documenting trace-context usage (a minimal sketch of this pattern follows the list).
  • Streaming cassette fixtures (packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_responses/test_responses_streaming_with_parent_span.yaml, packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_responses/test_responses_streaming_async_with_parent_span.yaml): Add SSE-style YAML cassettes recording streaming response sequences (response.created → in_progress → output_item/content_part/output_text deltas → completed) to exercise parent-span scenarios.
  • Tests for propagation and streaming (packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py): Add unit and integration-like tests validating capture, persistence, propagation, and restoration of trace context across TracedData and spans (sync/async/streaming/context-manager flows); include tests for tools=None handling and additional streaming scenarios.
  • Sample app & runtime pin (packages/sample-app/.python-version, packages/sample-app/sample_app/openai_guardrails_example.py): Update the sample app Python version to 3.11 and add an example script demonstrating async Guardrails usage with Traceloop tracing and streaming response handling.
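A minimal sketch of that pattern, assuming pydantic model syntax close to the description above (the field set is reduced to trace_context for brevity, and the span name is illustrative, not the instrumentation's actual code):

    from typing import Any

    import pydantic
    from opentelemetry import context as context_api
    from opentelemetry import trace

    class TracedData(pydantic.BaseModel):
        # Other fields of the real model are omitted in this sketch
        trace_context: Any = pydantic.Field(default=None)  # stored OpenTelemetry Context

        class Config:
            arbitrary_types_allowed = True  # Context objects are not standard pydantic types

    tracer = trace.get_tracer(__name__)

    # Capture the current context when the request starts...
    traced = TracedData(trace_context=context_api.get_current())

    # ...and restore it later (for example when a streaming continuation or a
    # cancellation handler runs), so the new span joins the original trace.
    span = tracer.start_span("openai.response", context=traced.trace_context)
    span.end()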

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Wrapper as Response Wrapper
    participant ContextAPI as OpenTelemetry Context API
    participant TracedData as TracedData Model
    participant Span as OpenTelemetry Span

    App->>Wrapper: invoke response wrapper (sync/async/stream)
    Wrapper->>ContextAPI: capture current context (context_api.get_current())
    ContextAPI-->>Wrapper: trace_context
    Wrapper->>TracedData: create TracedData(trace_context=...)
    TracedData-->>Wrapper: traced_data instance
    Wrapper->>Span: start span with provided/restored context
    Span-->>Wrapper: started span

    alt streaming continuation
        Wrapper->>TracedData: persist trace_context on traced_data
        Wrapper->>ContextAPI: restore trace_context when creating subsequent spans
        ContextAPI-->>Wrapper: restored context
        Wrapper->>Span: start child span with restored context
    end

    alt cancellation / error cleanup
        Wrapper->>ContextAPI: restore trace_context and start cleanup span
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay special attention to:
    • responses_wrappers.py: correct capture/restore points for sync/async/stream and cancellation branches.
    • Tests and cassette fixtures: ensure SSE ordering and assertions reflect runtime behavior.
    • Pydantic TracedData.Config.arbitrary_types_allowed: verify serialization/comparison implications in tests.

Possibly related PRs

Suggested reviewers

  • nirga

Poem

🐰 I tucked a trace_context in a cozy hop,
From start to stream, I never let it drop.
Spans hop in line, parent then child,
Across async streams I stayed mild.
Hooray — continuity in every hop and stop! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 66.67%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly and specifically describes the main fix: resolving broken trace context propagation in async streaming responses for OpenAI instrumentation.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 34a9f6f and 9c7804a.

📒 Files selected for processing (2)
  • packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py (1 hunks)
  • packages/sample-app/sample_app/openai_guardrails_example.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py
  • packages/sample-app/sample_app/openai_guardrails_example.py
🧠 Learnings (1)
📚 Learning: 2025-08-17T15:06:48.109Z
Learnt from: CR
Repo: traceloop/openllmetry PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T15:06:48.109Z
Learning: For debugging OpenTelemetry spans, use ConsoleSpanExporter with Traceloop to print spans to console

Applied to files:

  • packages/sample-app/sample_app/openai_guardrails_example.py
🧬 Code graph analysis (1)
packages/sample-app/sample_app/openai_guardrails_example.py (3)
packages/traceloop-sdk/traceloop/sdk/__init__.py (2)
  • Traceloop (36-267)
  • init (48-198)
packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/dispatcher_wrapper.py (1)
  • end (103-111)
packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py (1)
  • flush (218-223)
🪛 Ruff (0.14.6)
packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py

593-593: Unused function argument: instrument_legacy

(ARG001)


665-665: Unused function argument: instrument_legacy

(ARG001)

packages/sample-app/sample_app/openai_guardrails_example.py

142-142: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Lint
🔇 Additional comments (6)
packages/sample-app/sample_app/openai_guardrails_example.py (3)

1-32: LGTM - Well-structured example with proper initialization order.

The example correctly initializes Traceloop before importing guardrails (lines 19-20), which is critical for proper instrumentation. The conditional import with helpful error messaging (lines 22-32) provides good UX for developers. API key handling via os.getenv("OPENAI_API_KEY") follows security best practices per coding guidelines.


35-53: LGTM!

The function creates a minimal, temporary config file with proper cleanup handled later in the workflow.


56-78: Good trace context demonstration.

The workflow correctly captures and displays the parent trace_id (line 77-78), which is essential for verifying that all subsequent OpenAI calls maintain trace continuity. This aligns well with the PR's objective of fixing broken traces for async streaming.
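For reference, that capture step looks roughly like this (a sketch of the lines being described, not a verbatim excerpt from the example):

    from opentelemetry import trace

    current_span = trace.get_current_span()
    parent_trace_id = current_span.get_span_context().trace_id
    print(f"Parent trace_id: {hex(parent_trace_id)}")
    # With the fix, every subsequent OpenAI span should report this same trace_id.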

packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py (3)

510-589: Excellent unit test for trace context propagation.

This test effectively validates the core fix by:

  1. Creating an isolated tracing environment (lines 533-539)
  2. Capturing trace context within a parent span (lines 542-559)
  3. Creating a span outside the original context using stored context (lines 563-569)
  4. Verifying trace continuity with clear assertions (lines 580-588)

The test accurately simulates how guardrails and other wrappers make calls across different execution contexts.
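A condensed sketch of that test shape, assuming an isolated provider and in-memory exporter (names and the final assertion are illustrative, not the actual test code):

    from opentelemetry import context
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import SimpleSpanProcessor
    from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = provider.get_tracer(__name__)

    # 1. Capture the trace context while a parent span is active
    with tracer.start_as_current_span("parent"):
        stored_context = context.get_current()

    # 2. Later, outside the parent's block, start a span from the stored context
    child = tracer.start_span("child", context=stored_context)
    child.end()

    # 3. Both spans belong to the same trace
    spans = {s.name: s for s in exporter.get_finished_spans()}
    assert spans["child"].context.trace_id == spans["parent"].context.trace_id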


591-660: Well-structured integration test for sync streaming trace propagation.

The test correctly simulates the guardrails wrapper scenario and validates trace hierarchy. The instrument_legacy fixture (flagged by Ruff ARG001) is intentionally present to trigger instrumentation setup via pytest fixture mechanism—this is a standard pattern and not an issue.

Good assertions verify:

  • Same trace_id between parent and OpenAI span (lines 641-645)
  • Correct parent-child relationship (lines 648-653)
  • Streaming functionality works correctly (lines 656-659)

662-732: Good async counterpart to the sync streaming test.

This test validates the async streaming path which is specifically mentioned in the PR title ("broken traces for async streaming"). The test structure appropriately mirrors the sync version while using async patterns (async for chunk in stream).


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed everything up to aaf7fdf in 2 minutes and 17 seconds. Click for details.
  • Reviewed 920 lines of code in 7 files
  • Skipped 1 files when reviewing.
  • Skipped posting 9 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py:142
  • Draft comment:
    Good addition of the trace_context field with arbitrary_types_allowed. It ensures trace continuity across async calls.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py:520
  • Draft comment:
    Trace context retrieval appears repeatedly (e.g. using existing_data.get('trace_context', context_api.get_current())). Consider refactoring to a helper to reduce duplication and potential inconsistencies.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50% None
3. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py:602
  • Draft comment:
    The restoration of trace context for completed responses uses a ternary check on traced_data.trace_context. Verify that falsy values (e.g., empty objects) won’t override an intentionally absent context.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50% None
4. packages/sample-app/sample_app/openai_guardrails_example.py:20
  • Draft comment:
    Initialization of Traceloop before importing guardrails is well-done. Consider providing more descriptive error handling or configuration hints if guardrails not installed.
  • Reason this comment was not posted:
    Confidence changes required: 20% <= threshold 50% None
5. packages/sample-app/pyproject.toml:25
  • Draft comment:
    Python version update to 3.11 and dependency updates look appropriate; ensure compatibility with pydantic v2 if used.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
6. packages/test_responses.py:510
  • Draft comment:
    The new unit test for trace context propagation is comprehensive. It correctly simulates parent span behavior and verifies child span’s trace_id.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
7. packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_responses/test_responses_streaming_async_with_parent_span.yaml:135
  • Draft comment:
    Typo: The Set-Cookie header includes an expires date formatted as '26-Nov-25'. For consistency with standard HTTP-date formats (and with the Date header), it should likely be '26-Nov-2025'.
  • Reason this comment was not posted:
    Comment was on unchanged code.
8. packages/sample-app/sample_app/openai_guardrails_example.py:5
  • Draft comment:
    Typo: 'OpenLLMetry' should be 'OpenTelemetry' in the docstring.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% "OpenLLMetry" appears twice in the file (lines 5 and 64), suggesting it's intentional. The file is about using Traceloop SDK, and "OpenLLMetry" is likely the name of Traceloop's LLM observability product built on OpenTelemetry. The consistent usage across the file and the context of LLM tracing strongly suggests this is not a typo but an actual product name. The comment is likely incorrect - it's assuming this is a typo when it's actually the correct terminology for this specific library/product. I might be wrong if "OpenLLMetry" is indeed a typo and the author accidentally used it twice. However, given that this is a new file being added (not a modification), and the term is used consistently in both documentation and print statements, it seems very unlikely to be an accidental typo. Even if there's a small chance this could be a typo, the consistent usage across multiple contexts (docstring and print statement) and the fact that it makes sense as a product name (OpenTelemetry for LLMs = OpenLLMetry) strongly indicates this is intentional. The comment lacks strong evidence that this is actually wrong. This comment should be deleted. "OpenLLMetry" appears to be the correct product name (Traceloop's LLM observability product), not a typo. It's used consistently twice in the file and makes sense in context. There's no strong evidence this is actually incorrect.
9. packages/sample-app/sample_app/openai_guardrails_example.py:64
  • Draft comment:
    Typo: 'OpenLLMetry' should be 'OpenTelemetry' in the output message.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% The comment assumes "OpenLLMetry" is a typo, but this appears to be incorrect. The docstring on line 5 uses "OpenLLMetry tracing" consistently, and this seems to be the actual name of the Traceloop SDK's tracing functionality. The file is demonstrating integration with the Traceloop SDK (which provides OpenLLMetry), not directly with OpenTelemetry. While OpenTelemetry is used under the hood (as seen in the import), OpenLLMetry appears to be the branded name for the observability solution. The comment is making an incorrect assumption about what is a typo. Could "OpenLLMetry" actually be a typo that was repeated in multiple places? Maybe the author intended to use "OpenTelemetry" throughout but made a consistent mistake? While it's possible there's a consistent typo, the evidence strongly suggests "OpenLLMetry" is intentional: it's used in the docstring (line 5), the print statement (line 64), and aligns with the Traceloop SDK being used. The naming pattern "OpenLLMetry" (Open + LLM + Telemetry) makes sense as a branded name for LLM observability. This is not a typo. This comment should be deleted. "OpenLLMetry" is not a typo - it appears to be the correct name for the Traceloop SDK's tracing functionality. The comment incorrectly assumes this is a mistake when it's actually intentional branding/naming.

Workflow ID: wflow_XoaO6wM4EKXSfWOb

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py (2)

524-530: Remove unused imports.

Static analysis correctly identifies unused imports. Mock, Response, ResponseOutputItem, and ResponseUsage are imported but not used in this test function.

-    from unittest.mock import MagicMock, Mock
-    from opentelemetry import trace, context
-    from opentelemetry.sdk.trace import TracerProvider
-    from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
-    from opentelemetry.instrumentation.openai.v1.responses_wrappers import TracedData
-    from openai.types.responses import Response, ResponseOutputItem, ResponseUsage
-    import time
+    from opentelemetry import trace, context
+    from opentelemetry.sdk.trace import TracerProvider
+    from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
+    from opentelemetry.instrumentation.openai.v1.responses_wrappers import TracedData
+    import time

637-638: Remove unused variable assignment.

The parent variable is assigned but never used. This is flagged by both Ruff and Flake8.

-    parent = parent_spans[0]
-    openai_span = openai_spans[0]
+    openai_span = openai_spans[0]

Apply the same fix at line 710-711.

packages/sample-app/pyproject.toml (1)

25-62: Python and dependency updates look coherent; document & verify downstream impact.

The stricter Python constraint (>=3.11,<3.13), the openai-agents bump to ^0.3.3, and the new openai-guardrails dependency are all consistent with the new Guardrails tracing example and the .python-version change. To avoid surprises:

  • Call out the Python 3.11+ requirement in any sample‑app docs/README.
  • Ensure CI, containers, and sample‑app runners use Python 3.11.
  • Quickly sanity‑check that your OpenAI/Guardrails/OpenAI‑agents instrumentation tests pass against these specific versions.
packages/sample-app/sample_app/openai_guardrails_example.py (2)

27-37: Avoid exiting the process at import time for missing Guardrails.

Catching ImportError and printing install instructions is helpful, but calling exit(1) at module import time means any code that merely imports openai_guardrails_example (e.g., for tests or documentation builds) will terminate the whole process if Guardrails isn’t installed.

Consider refactoring so that:

  • The import is wrapped in a feature flag at module scope, e.g. _GUARDRAILS_AVAILABLE = True/False, and
  • The user‑facing message and sys.exit(1) happen in main() (or right before running reproduce_issue()), not during import.

This keeps imports side‑effect‑free while preserving the current CLI behavior.
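A minimal sketch of that refactor, assuming a module-level flag and a main() entry point (the guardrails import path shown here is a guess for illustration):

    import sys

    try:
        from guardrails import GuardrailsAsyncOpenAI  # import path is an assumption
        _GUARDRAILS_AVAILABLE = True
    except ImportError:
        GuardrailsAsyncOpenAI = None
        _GUARDRAILS_AVAILABLE = False

    def main() -> None:
        if not _GUARDRAILS_AVAILABLE:
            print("openai-guardrails is not installed; install it to run this example.")
            sys.exit(1)
        # ... call reproduce_issue() here ...

    if __name__ == "__main__":
        main()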


96-101: Tighten or explicitly justify the broad except Exception (Ruff BLE001).

The generic except Exception as e: block is fine as a last‑resort handler for a demo script, but Ruff flags it (BLE001). To keep linters and intent aligned, consider either:

  • Narrowing this to the specific error types you expect from Guardrails/OpenAI/network failures, or
  • Keeping the broad handler but marking it as intentional (e.g. except Exception as e: # noqa: BLE001) and optionally re‑raising after logging so unexpected failures still produce a non‑zero exit.

Also ensure flake8/Ruff are run over packages/sample-app so this file doesn’t introduce new lint failures.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e63ebe3 and aaf7fdf.

⛔ Files ignored due to path filters (1)
  • packages/sample-app/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py (9 hunks)
  • packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_responses/test_responses_streaming_async_with_parent_span.yaml (1 hunks)
  • packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_responses/test_responses_streaming_with_parent_span.yaml (1 hunks)
  • packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py (1 hunks)
  • packages/sample-app/.python-version (1 hunks)
  • packages/sample-app/pyproject.toml (2 hunks)
  • packages/sample-app/sample_app/openai_guardrails_example.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/cassettes/**/*.{yaml,yml,json}

📄 CodeRabbit inference engine (CLAUDE.md)

Never commit secrets or PII in VCR cassettes; scrub sensitive data

Files:

  • packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_responses/test_responses_streaming_async_with_parent_span.yaml
  • packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_responses/test_responses_streaming_with_parent_span.yaml
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py
  • packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py
  • packages/sample-app/sample_app/openai_guardrails_example.py
🧠 Learnings (1)
📚 Learning: 2025-08-22T14:41:26.962Z
Learnt from: prane-eth
Repo: traceloop/openllmetry PR: 3336
File: packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/utils.py:8-8
Timestamp: 2025-08-22T14:41:26.962Z
Learning: In the openllmetry project, the `packaging` library is available as a transitive dependency through other packages (visible in poetry.lock) and doesn't need to be explicitly declared in pyproject.toml dependencies.

Applied to files:

  • packages/sample-app/pyproject.toml
🧬 Code graph analysis (2)
packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py (2)
packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (3)
  • export (45-51)
  • InMemorySpanExporter (22-61)
  • get_finished_spans (40-43)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py (1)
  • TracedData (115-146)
packages/sample-app/sample_app/openai_guardrails_example.py (3)
packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1)
  • export (45-51)
packages/traceloop-sdk/traceloop/sdk/__init__.py (2)
  • Traceloop (36-267)
  • init (48-198)
packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py (1)
  • flush (218-223)
🪛 Flake8 (7.3.0)
packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py

[error] 524-524: 'unittest.mock.MagicMock' imported but unused

(F401)


[error] 524-524: 'unittest.mock.Mock' imported but unused

(F401)


[error] 529-529: 'openai.types.responses.Response' imported but unused

(F401)


[error] 529-529: 'openai.types.responses.ResponseOutputItem' imported but unused

(F401)


[error] 529-529: 'openai.types.responses.ResponseUsage' imported but unused

(F401)


[error] 637-637: local variable 'parent' is assigned to but never used

(F841)


[error] 710-710: local variable 'parent' is assigned to but never used

(F841)

🪛 Ruff (0.14.5)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py

524-524: Do not catch blind exception: Exception

(BLE001)


597-597: Do not catch blind exception: Exception

(BLE001)


683-683: Do not catch blind exception: Exception

(BLE001)


757-757: Do not catch blind exception: Exception

(BLE001)

packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py

592-592: Unused function argument: instrument_legacy

(ARG001)


637-637: Local variable parent is assigned to but never used

Remove assignment to unused variable parent

(F841)


665-665: Unused function argument: instrument_legacy

(ARG001)


710-710: Local variable parent is assigned to but never used

Remove assignment to unused variable parent

(F841)

packages/sample-app/sample_app/openai_guardrails_example.py

98-98: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (14)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py (7)

142-146: LGTM - trace context field and config addition.

The trace_context field with arbitrary_types_allowed = True in the Config is correctly implemented to store OpenTelemetry Context objects, which are not JSON-serializable by default. This enables trace continuity across async boundaries.


465-471: LGTM - context capture for streaming path.

The trace context is correctly captured at span creation time and passed to start_span, ensuring the span inherits the correct parent context.


521-536: LGTM - error path context propagation.

The error handling path correctly captures and restores trace context, ensuring spans created during error handling maintain proper parent-child relationships.


593-607: LGTM - completed response context propagation.

The pattern for capturing and restoring trace context in the success path is consistent with other paths and correctly ensures trace continuity.


628-634: LGTM - async streaming context propagation.

The async streaming path correctly captures and propagates trace context, mirroring the synchronous implementation.
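As an illustration of that capture/restore shape at an async-generator boundary (simplified stand-in code, not the actual wrapper):

    import asyncio

    from opentelemetry import context as context_api
    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    async def fake_openai_stream():
        # Stand-in for an OpenAI async streaming response
        for chunk in ("hel", "lo"):
            await asyncio.sleep(0)
            yield chunk

    async def wrapped_stream():
        # Captured while the caller's span (e.g., a guardrails wrapper) is still current
        captured_ctx = context_api.get_current()
        async for chunk in fake_openai_stream():
            yield chunk
        # Finalize with the captured context, not whatever context happens to be
        # current on this task when the stream completes.
        span = tracer.start_span("openai.response", context=captured_ctx)
        span.end()

    async def main():
        async for _ in wrapped_stream():
            pass

    asyncio.run(main())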


788-795: LGTM - cancel wrapper context propagation.

The cancel wrapper correctly restores trace context from existing data, ensuring cancellation spans are properly linked to their parent traces.


524-525: Broad exception catch is acceptable for instrumentation code.

The bare except Exception clause here is intentional defensive programming. Instrumentation code should never break the underlying API calls. If TracedData creation fails, setting traced_data = None and continuing is the correct behavior to prevent instrumentation issues from affecting the user's application.

The static analysis hint (BLE001) can be safely ignored in this context.
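Illustrated generically (a stand-in helper rather than the real TracedData construction):

    from opentelemetry import context as context_api

    def build_traced_data():
        # Stand-in for constructing the real TracedData instance
        return {"trace_context": context_api.get_current()}

    try:
        traced_data = build_traced_data()
    except Exception:  # deliberately broad: instrumentation must never break the wrapped API call
        traced_data = None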

packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_responses/test_responses_streaming_async_with_parent_span.yaml (1)

1-165: LGTM - cassette appears properly sanitized.

The cassette file contains no API keys or sensitive credentials. The organization and project identifiers are non-secret metadata. Cloudflare cookies present are session management tokens, not user credentials. This follows the coding guidelines for not committing secrets or PII in VCR cassettes.

packages/opentelemetry-instrumentation-openai/tests/traces/cassettes/test_responses/test_responses_streaming_with_parent_span.yaml (1)

1-186: LGTM - cassette appears properly sanitized.

The sync streaming cassette file follows the same sanitization pattern as the async variant. No API keys or sensitive credentials are present.

packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py (3)

590-660: Well-designed integration test for trace context propagation.

This test effectively validates the core fix by:

  1. Simulating the guardrails wrapper pattern with a parent span
  2. Making a streaming call within the parent context
  3. Asserting both trace_id matching and proper parent-child relationships
  4. Verifying streaming functionality still works

The test will catch regressions where trace context is lost during streaming operations.


591-596: Fixture parameter is used for side effects - static analysis false positive.

The instrument_legacy fixture is likely used for its side effect of setting up OpenAI instrumentation. This is a common pytest pattern where fixtures are included in the signature for their setup/teardown behavior, not for direct usage. The Ruff warning (ARG001) can be safely ignored.
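Generically, the pattern looks like this (a sketch, not the repo's actual fixtures):

    import pytest

    @pytest.fixture
    def instrument_legacy():
        # Setup side effect, e.g. installing OpenAI instrumentation for the test
        yield
        # Teardown, e.g. uninstrumenting

    def test_streaming_with_parent_span(instrument_legacy):  # noqa: ARG001 - fixture used for side effect
        ...

    # Alternatively, pytest.mark.usefixtures avoids the unused-argument warning entirely:
    @pytest.mark.usefixtures("instrument_legacy")
    def test_streaming_async_with_parent_span():
        ...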


510-538: Verification in progress. CodeRabbit began checking this comment by running shell scripts against the repository: measuring the size of packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py, reading the test function and its surrounding context, searching for pytest fixtures and conftest.py files, and grepping for TracerProvider/context-restoration patterns (set_tracer_provider, detach_context, Token, _TRACER_PROVIDER) to confirm test isolation around the global tracer provider. The output of those scripts was not included in the posted comment, so no conclusion was recorded for this item.

packages/sample-app/.python-version (1)

1-1: Python version bump is consistent; ensure environments are updated.

3.11 here matches the python = ">=3.11,<3.13" constraint in pyproject.toml, so tooling/runtime are aligned. Please double‑check CI images, devcontainers, and any Dockerfiles to ensure they also run Python 3.11 to avoid version skew.

packages/sample-app/sample_app/openai_guardrails_example.py (1)

19-25: Traceloop initialization order is correct for instrumentation.

Initializing Traceloop (with disable_batch=True and ConsoleSpanExporter) before importing Guardrails ensures the OpenAI client created inside GuardrailsAsyncOpenAI participates in tracing, which is what you want for this repro. No changes needed here; just confirm opentelemetry-sdk (or the package providing ConsoleSpanExporter) is present in the sample‑app environment so imports don’t fail at runtime.
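That ordering looks roughly like this (parameter names follow what the review describes; treat exact signatures as assumptions rather than a verbatim excerpt):

    from opentelemetry.sdk.trace.export import ConsoleSpanExporter
    from traceloop.sdk import Traceloop

    # Initialize tracing before importing Guardrails so the OpenAI client it creates
    # is instrumented; ConsoleSpanExporter prints spans for quick verification.
    Traceloop.init(
        app_name="openai-guardrails-example",
        disable_batch=True,
        exporter=ConsoleSpanExporter(),
    )

    # Only import Guardrails after tracing is set up (import path is an assumption):
    # from guardrails import GuardrailsAsyncOpenAI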

@galkleinman galkleinman changed the title from "fix(openai): responses instrumentation broken traces due to missing context propagation" to "fix(openai): responses instrumentation broken traces for async streaming" on Nov 26, 2025
Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed 0a42cff in 2 minutes and 45 seconds. Click for details.
  • Reviewed 183 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 7 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/pyproject.toml:24
  • Draft comment:
    The Python version requirement was changed from '>=3.11,<3.13' to '>=3.10,<3.13', which contradicts the PR summary that mentioned an update to 3.11+. Please confirm if lowering the minimum version is intentional.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
2. packages/sample-app/pyproject.toml:59
  • Draft comment:
    The openai-agents version was downgraded from '^0.3.3' to '^0.2.7', which does not match the PR summary stating an upgrade. Please verify if this version change is intentional.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
3. packages/sample-app/pyproject.toml:61
  • Draft comment:
    The dependency for openai-guardrails has been removed, but the sample app code imports and relies on it. Consider re-adding the dependency to avoid runtime import errors.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50% The comment is about a dependency removal, which is a change in the diff. However, the comment makes a specific claim that "the sample app code imports and relies on it" - this is information I cannot verify from the diff alone. According to the rules, I should only keep comments where I see STRONG EVIDENCE that the comment is correct. The rule also states "If you are not sure, or if understanding the comment requires more context (such as seeing other files or parts of the codebase), you should delete the comment." I cannot see the sample app code files to verify if openai-guardrails is actually imported or used. This requires cross-file context that I don't have. The reviewer provided evidence showing the context around the change, but this doesn't actually prove whether openai-guardrails is used in the code. Maybe the automated tool scanned the codebase and found actual imports, which would make this a valid comment. The removal of a dependency that's still in use would cause runtime errors, which is a serious issue. While it's true that removing a used dependency would be serious, I cannot verify this claim from the diff alone. The rules explicitly state to ignore cross-file issues and only think about the file being reviewed. Without seeing the actual import statements in the sample app code, I cannot confirm this is a real issue. The automated tool could be wrong or outdated. This comment requires cross-file context to verify - I would need to see the actual sample app code to confirm if openai-guardrails is imported. Since I cannot see strong evidence in this diff alone, and the rules state to delete comments that require more context, I should delete this comment.
4. packages/sample-app/sample_app/openai_guardrails_example.py:20
  • Draft comment:
    The Traceloop.init() call was modified by removing the 'api_endpoint' and 'exporter' parameters. Confirm that this change is intentional, as these settings may affect trace export behavior.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% The comment is asking the PR author to "confirm that this change is intentional" - this is explicitly against the rules which state "Do NOT ask the PR author to confirm their intention, to explain, to double-check things, to ensure the behavior is intended." The comment doesn't identify a clear bug or issue that needs fixing; it's just asking for confirmation. The fact that ConsoleSpanExporter is still imported but unused might be worth noting, but the comment doesn't focus on that - it focuses on asking for confirmation about the intentionality of the change. This is a sample/example app, and simplifying the initialization by removing unnecessary parameters seems like a reasonable change for demonstration purposes. Perhaps the removal of the exporter parameter could actually break functionality if the example was specifically designed to show console output of traces. The unused import of ConsoleSpanExporter suggests this might be an oversight rather than intentional simplification. While the unused import is suspicious, the comment doesn't actually point that out or suggest removing it. Instead, it asks the author to "confirm" the change is intentional, which is explicitly prohibited. If the comment had said "Remove the unused ConsoleSpanExporter import" or "The exporter parameter was removed but ConsoleSpanExporter is still imported", that would be actionable. As written, it's just asking for confirmation. This comment should be deleted because it asks the PR author to confirm their intention, which violates the rules. The comment doesn't provide an actionable code change suggestion - it just asks for verification that the change was intentional.
5. packages/sample-app/sample_app/openai_guardrails_example.py:170
  • Draft comment:
    Consider using a try/finally block to ensure that the guardrails configuration file is always cleaned up, even if an exception occurs.
  • Reason this comment was not posted:
    Comment was on unchanged code.
6. packages/sample-app/sample_app/openai_guardrails_example.py:63
  • Draft comment:
    Typo: The printed text "OpenAI Guardrails with OpenLLMetry Tracing - Multiple Calls Test" may be intended to read "OpenTelemetry" instead of "OpenLLMetry" to match the import from opentelemetry.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% This comment assumes "OpenLLMetry" is a typo for "OpenTelemetry", but looking at the file, "OpenLLMetry" appears multiple times (lines 5, 63, 158) and seems to be a deliberate branding choice. The Traceloop SDK appears to use "OpenLLMetry" as its product name for LLM observability built on top of OpenTelemetry. The comment is making an incorrect assumption that this is a typo when it's actually intentional terminology. This is not a useful comment because it would change correct branding to incorrect branding. Could "OpenLLMetry" actually be a typo that was repeated multiple times? Maybe the author did intend to write "OpenTelemetry" and made the same mistake in several places? While it's theoretically possible the typo was repeated, the consistency of "OpenLLMetry" across the docstring, print statements, and explanatory text strongly suggests it's intentional branding. Additionally, a quick search would reveal that OpenLLMetry is indeed the name of Traceloop's observability product. The comment lacks strong evidence that this is actually a typo. This comment should be deleted. "OpenLLMetry" appears to be intentional branding for the Traceloop SDK's observability product, not a typo for "OpenTelemetry". The term is used consistently throughout the file in multiple contexts, indicating deliberate usage.
7. packages/sample-app/sample_app/openai_guardrails_example.py:158
  • Draft comment:
    Typo: "OpenLLMetry" might be a mistake. Should this be "OpenTelemetry"?
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50% The comment assumes "OpenLLMetry" is a typo for "OpenTelemetry", but this appears to be incorrect. OpenLLMetry is a real product - it's a tracing library specifically designed for LLM applications that builds on OpenTelemetry. The file uses "OpenLLMetry" consistently in multiple places (docstring, print statements), and the imports show usage of Traceloop SDK which is part of the OpenLLMetry ecosystem. The comment is factually wrong - this is not a typo. The author intentionally used "OpenLLMetry" because that's the specific library being demonstrated in this example. Could "OpenLLMetry" be a typo that was consistently made throughout the file? Perhaps the author intended to refer to the underlying OpenTelemetry framework rather than the specific OpenLLMetry wrapper? While it's possible for a typo to be repeated consistently, the context strongly suggests this is intentional. The file is specifically about tracing with the Traceloop SDK (which is OpenLLMetry), and the docstring explicitly mentions "OpenLLMetry tracing" as a feature being demonstrated. The print statement on line 158 is explaining what was fixed in "OpenLLMetry (responses_wrappers.py)" - referring to a specific file in the OpenLLMetry codebase. This is clearly the correct, intended term. This comment should be deleted. "OpenLLMetry" is not a typo - it's the correct name of the tracing library being used in this code. The comment is factually incorrect and would mislead the PR author into making an incorrect change.

Workflow ID: wflow_pA2VLThE7IoLmplU

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
packages/sample-app/sample_app/openai_guardrails_example.py (2)

71-73: Handle missing OPENAI_API_KEY explicitly

If OPENAI_API_KEY is unset, api_key=os.getenv("OPENAI_API_KEY") passes None to GuardrailsAsyncOpenAI, which will likely fail later with a less clear error.

Consider adding an explicit check before constructing the client:

-    client = GuardrailsAsyncOpenAI(
-        api_key=os.getenv("OPENAI_API_KEY"), config=config_path
-    )
+    api_key = os.getenv("OPENAI_API_KEY")
+    if not api_key:
+        raise RuntimeError(
+            "OPENAI_API_KEY is not set in the environment; cannot create client"
+        )
+
+    client = GuardrailsAsyncOpenAI(api_key=api_key, config=config_path)

This keeps secrets in env vars (as required) while giving a clearer failure mode.


66-75: Make guardrails config cleanup robust with try/finally

guardrails_config.json is cleaned up at the end, but if an error occurs before the try block (e.g., during client construction), the file can be left behind.

You can wrap everything after create_sample_config() in a try/finally to guarantee cleanup:

-    # Create guardrails configuration
-    config_path = create_sample_config()
-    print(f"\n[Setup] Created guardrails config: {config_path}")
-
-    # Initialize GuardrailsAsyncOpenAI (wraps AsyncOpenAI)
-    client = GuardrailsAsyncOpenAI(
-        api_key=os.getenv("OPENAI_API_KEY"), config=config_path
-    )
-    print("[Setup] Created GuardrailsAsyncOpenAI client\n")
-
-    # Get current span context to display trace ID
-    current_span = trace.get_current_span()
-    parent_trace_id = current_span.get_span_context().trace_id
-    print(f"Parent trace_id: {hex(parent_trace_id)}\n")
-
-    try:
+    # Create guardrails configuration
+    config_path = create_sample_config()
+    print(f"\n[Setup] Created guardrails config: {config_path}")
+
+    try:
+        # Initialize GuardrailsAsyncOpenAI (wraps AsyncOpenAI)
+        client = GuardrailsAsyncOpenAI(
+            api_key=os.getenv("OPENAI_API_KEY"), config=config_path
+        )
+        print("[Setup] Created GuardrailsAsyncOpenAI client\n")
+
+        # Get current span context to display trace ID
+        current_span = trace.get_current_span()
+        parent_trace_id = current_span.get_span_context().trace_id
+        print(f"Parent trace_id: {hex(parent_trace_id)}\n")
+
+        # Main test calls
         ...
-    print("\n" + "=" * 80)
-    print("Trace Analysis")
-    ...
-
-    # Cleanup
-    if config_path.exists():
-        config_path.unlink()
-        print(f"\n[Cleanup] Removed {config_path}")
+    finally:
+        # Cleanup
+        if config_path.exists():
+            config_path.unlink()
+            print(f"\n[Cleanup] Removed {config_path}")

This keeps the sample tidy even when early failures occur.

Also applies to: 171-174

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between aaf7fdf and 0a42cff.

📒 Files selected for processing (1)
  • packages/sample-app/sample_app/openai_guardrails_example.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/sample-app/sample_app/openai_guardrails_example.py
🪛 Flake8 (7.3.0)
packages/sample-app/sample_app/openai_guardrails_example.py

[error] 16-16: 'opentelemetry.sdk.trace.export.ConsoleSpanExporter' imported but unused

(F401)

🪛 Ruff (0.14.5)
packages/sample-app/sample_app/openai_guardrails_example.py

143-143: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (1)
packages/sample-app/sample_app/openai_guardrails_example.py (1)

57-170: Nice end‑to‑end reproduction workflow

The reproduce_issue workflow clearly exercises non‑streaming, streaming, and chat APIs under the Traceloop @workflow span, which is exactly what you need to validate trace‑ID continuity. The conditional handling of llm_response vs raw SDK responses keeps the example usable across wrapper versions.
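For orientation, a workflow of that shape might look like the sketch below (model name, prompt, and function names are illustrative; the @workflow decorator and the async Responses streaming API are taken from the descriptions above, and the real example also exercises non-streaming and chat calls):

    import asyncio

    from openai import AsyncOpenAI
    from traceloop.sdk.decorators import workflow

    @workflow(name="guardrails_repro")
    async def reproduce_issue() -> None:
        client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        stream = await client.responses.create(
            model="gpt-4o-mini",
            input="Say hello",
            stream=True,
        )
        async for _event in stream:
            # With the fix, every streamed event's span shares this workflow's trace_id
            pass

    if __name__ == "__main__":
        asyncio.run(reproduce_issue())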

Comment on lines +141 to +147
except GuardrailTripwireTriggered as e:
    print(f"[Guardrails] Safety check triggered: {e}")
except Exception as e:
    print(f"[Error] {type(e).__name__}: {e}")
    import traceback

    traceback.print_exc()

⚠️ Potential issue | 🟡 Minor

Avoid blind except Exception (Ruff BLE001)

Catching a broad Exception here trips Ruff BLE001 and can mask unexpected failures.

Given this is a sample script, two options:

  • Narrow the exception(s) you actually expect (e.g., HTTP/client errors), or
  • If you intentionally want a catch‑all with traceback printing, annotate it to silence the lint rule:
-    except Exception as e:
-        print(f"[Error] {type(e).__name__}: {e}")
-        import traceback
-
-        traceback.print_exc()
+    except Exception as e:  # noqa: BLE001  # Intentional catch‑all for demo logging
+        print(f"[Error] {type(e).__name__}: {e}")
+        import traceback
+
+        traceback.print_exc()

This makes the intent explicit and keeps CI noise down.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-except GuardrailTripwireTriggered as e:
-    print(f"[Guardrails] Safety check triggered: {e}")
-except Exception as e:
-    print(f"[Error] {type(e).__name__}: {e}")
-    import traceback
-
-    traceback.print_exc()
+except GuardrailTripwireTriggered as e:
+    print(f"[Guardrails] Safety check triggered: {e}")
+except Exception as e:  # noqa: BLE001  # Intentional catch‑all for demo logging
+    print(f"[Error] {type(e).__name__}: {e}")
+    import traceback
+
+    traceback.print_exc()
🧰 Tools
🪛 Ruff (0.14.5)

143-143: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In packages/sample-app/sample_app/openai_guardrails_example.py around lines 141
to 147, the code uses a blind "except Exception" which triggers Ruff BLE001;
either narrow the catch to the specific exceptions you expect (e.g.,
HTTP/client-related exceptions or other known runtime errors) and handle them
explicitly, or if you truly want a catch-all for this sample script, add an
inline lint exemption comment to make the intent explicit (e.g., annotate the
except line to silence BLE001) and keep the traceback printing as-is.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed 34a9f6f in 2 minutes and 54 seconds. Click for details.
  • Reviewed 4461 lines of code in 84 files
  • Skipped 0 files when reviewing.
  • Skipped posting 31 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/version.py:1
  • Draft comment:
    Version update looks consistent. Ensure that the version bump aligns with the underlying fix for async streaming trace propagation.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is related to a version bump, which falls under the category of dependency changes. The comment is asking the author to ensure alignment with an underlying fix, which is similar to asking them to double-check or verify something. This violates the rules against asking for confirmation or verification of intentions.
2. packages/opentelemetry-instrumentation-openai/pyproject.toml:10
  • Draft comment:
    Version and metadata updated consistently; dependency bounds (e.g., openai "1.99.7") seem correct.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, as it only states that the version and metadata are updated consistently and that the dependency bounds seem correct. It does not provide any actionable feedback or suggestions for improvement.
3. packages/opentelemetry-instrumentation-pinecone/pyproject.toml:10
  • Draft comment:
    Pinecone instrumentation metadata is consistent; no issues detected.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that there are no issues detected, which is not useful for the PR author.
4. packages/opentelemetry-instrumentation-qdrant/pyproject.toml:10
  • Draft comment:
    Qdrant instrumentation configuration looks proper; version bump is consistent.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, stating that the configuration looks proper and the version bump is consistent. It doesn't provide any actionable feedback or suggestions for improvement.
5. packages/opentelemetry-instrumentation-replicate/pyproject.toml:10
  • Draft comment:
    Replicate instrumentation dependencies and version metadata updated appropriately.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment seems to be purely informative and does not provide any actionable feedback or suggestions. It does not ask for confirmation or suggest any changes. It simply states that something was updated, which is not useful for the PR author.
6. packages/opentelemetry-instrumentation-sagemaker/pyproject.toml:10
  • Draft comment:
    SageMaker instrumentation metadata and dependency configuration are consistent.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that the configuration is consistent, which does not help the PR author in any way.
7. packages/opentelemetry-instrumentation-together/pyproject.toml:10
  • Draft comment:
    Together AI instrumentation configuration is updated properly; note the inclusion of a new author.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions. It simply notes a change without offering any insight or guidance.
8. packages/opentelemetry-instrumentation-transformers/pyproject.toml:10
  • Draft comment:
    Transformers instrumentation metadata and dependency settings updated correctly.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that something was done correctly, which is not necessary in a code review context.
9. packages/opentelemetry-instrumentation-vertexai/pyproject.toml:10
  • Draft comment:
    Vertex AI instrumentation configuration is consistent; proper dependency and version metadata.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that the configuration is consistent, which does not align with the rules for useful comments.
10. packages/opentelemetry-instrumentation-watsonx/pyproject.toml:1
  • Draft comment:
    Watsonx instrumentation metadata has been updated correctly; dependency on ibm-watson-machine-learning looks valid.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, as it states that the metadata has been updated correctly and the dependency looks valid. It doesn't provide any actionable feedback or suggestions for improvement.
11. packages/opentelemetry-instrumentation-weaviate/pyproject.toml:10
  • Draft comment:
    Weaviate instrumentation configuration and dependency on weaviate-client appear correct.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that the configuration and dependency appear correct, which is not useful for the PR author.
12. packages/opentelemetry-instrumentation-writer/pyproject.toml:10
  • Draft comment:
    Writer instrumentation configuration is updated; note the Python requirement now is '>=3.10,<4'. Ensure that this is intentional.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% The comment is about a change in the Python version requirement. It asks to ensure that this change is intentional, which is not allowed by the rules. The comment does not provide a specific suggestion or point out a specific issue with the change.
13. packages/sample-app/sample_app/dataset_attachments_example.py:1
  • Draft comment:
    Sample app for dataset attachments demonstrates various attachment types. Code is clear and examples are well-documented.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply praises the code without offering any specific insights or recommendations.
14. packages/sample-app/sample_app/dataset_example.py:1
  • Draft comment:
Dataset example script looks good; usage of from_csv and from_dataframe is clear.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
15. packages/sample-app/sample_app/experiment/run_research_experiment.py:1
  • Draft comment:
    Experiment script for CI/CD is well-organized. Integration with GitHub context is properly handled.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply praises the organization of the script and the integration with GitHub context, which does not align with the rules provided.
16. packages/traceloop-sdk/pyproject.toml:10
  • Draft comment:
    Traceloop SDK package metadata is updated consistently. Dependencies and repository info are properly set.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, as it only states that the package metadata is updated consistently and that dependencies and repository info are properly set. It does not provide any actionable feedback or suggestions for improvement.
17. packages/traceloop-sdk/tests/datasets/test_create_dataset.py:16
  • Draft comment:
    Dataset creation tests from CSV and DataFrame are straightforward and cover expected behaviors.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
18. packages/traceloop-sdk/tests/datasets/test_datasets_operations.py:1
  • Draft comment:
    Tests for dataset operations (fetch, delete, get_version_csv) are comprehensive.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, as it only praises the comprehensiveness of the tests without providing any actionable feedback or suggestions for improvement. It does not align with the rules for useful comments.
19. packages/traceloop-sdk/traceloop/sdk/datasets/__init__.py:17
  • Draft comment:
The __all__ block in the datasets package cleanly exposes the public API. Good practice.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply praises the existing code without offering any constructive criticism or questions.
20. packages/traceloop-sdk/traceloop/sdk/datasets/attachment.py:24
  • Draft comment:
    Attachment classes are well-structured. The input validation is robust.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply praises the existing code structure and validation, which is not aligned with the guidelines for useful comments.
21. packages/traceloop-sdk/traceloop/sdk/datasets/model.py:1
  • Draft comment:
    Data models in the datasets module are clearly defined and use Pydantic effectively.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply praises the use of Pydantic in the data models, which does not align with the rules for useful comments.
22. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:1
  • Draft comment:
    Experiment class implementation is well modularized with clear separation between local and GitHub execution flows.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply praises the existing implementation without offering any constructive criticism or questions.
23. packages/traceloop-sdk/traceloop/sdk/experiment/model.py:1
  • Draft comment:
    Experiment models are detailed and capture essential metadata. Looks good.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
24. packages/traceloop-sdk/traceloop/sdk/version.py:1
  • Draft comment:
    Version in the SDK is updated consistently. No issues.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that there are no issues with the version update, which is not necessary for the PR author to know.
25. CHANGELOG.md:5
  • Draft comment:
    Typo: In '- exp: Add run in github experiment (#3459)', consider capitalizing 'GitHub' to maintain proper noun formatting.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% This is a comment about capitalizing "GitHub" as a proper noun. While technically correct that GitHub should be capitalized, I need to consider the rules: 1. This is about a CHANGELOG file, which is documentation 2. The rules say "Do NOT comment unless there is clearly a code change required" 3. This is a style/formatting issue, not a functional code issue 4. The rules emphasize not making "obvious or unimportant" comments 5. Changelog entries are typically written quickly and minor typos are generally acceptable 6. This doesn't affect functionality at all The comment is technically correct but falls into the category of being overly pedantic about a minor style issue in a changelog. It's not a code change, it's documentation, and it's a very minor issue that doesn't impact understanding or functionality. Could this be considered important for maintaining professional documentation standards? Brand names should typically be capitalized correctly, and maintaining consistency in a changelog could be valuable for the project's professional appearance. While professional documentation is important, the rules explicitly state not to make "obvious or unimportant" comments. A single lowercase letter in a changelog entry is extremely minor and doesn't warrant a review comment. The rules prioritize actionable code changes over style nitpicks in documentation. This comment should be deleted. It's a minor style issue in a changelog file that doesn't affect code functionality. The rules explicitly discourage obvious or unimportant comments, and this falls into that category.
26. packages/opentelemetry-instrumentation-groq/pyproject.toml:14
  • Draft comment:
    Typo in the repository URL: "openllmetry" should likely be spelled "opentelemetry".
  • Reason this comment was not posted:
    Comment was on unchanged code.
27. packages/opentelemetry-instrumentation-ollama/pyproject.toml:14
  • Draft comment:
    Typo: The repository URL contains "openllmetry". It looks like it should be "opentelemetry". Please verify and update if necessary.
  • Reason this comment was not posted:
    Comment was on unchanged code.
28. packages/opentelemetry-instrumentation-openai-agents/pyproject.toml:14
  • Draft comment:
    There's a typo in the repository URL: "openllmetry" should be "opentelemetry".
  • Reason this comment was not posted:
    Comment was on unchanged code.
29. packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py:734
  • Draft comment:
    Typo detected: The model string used here is "gpt-4o", which seems likely to be a typo for "gpt-4" as the tests below use "gpt-4". Please review and correct if necessary.
  • Reason this comment was not posted:
    Comment was on unchanged code.
30. packages/opentelemetry-instrumentation-sagemaker/pyproject.toml:14
  • Draft comment:
    Typo found in the repository URL. It currently reads "openllmetry" but given the context (the package name is "opentelemetry-instrumentation-sagemaker"), it seems it should be "opentelemetry". Please verify and correct if necessary.
  • Reason this comment was not posted:
    Comment was on unchanged code.
31. packages/traceloop-sdk/traceloop/sdk/experiment/model.py:79
  • Draft comment:
    Minor consistency issue: The class name 'GithubContext' uses a lowercase 'h' while its docstring and other references use 'GitHub' (with an uppercase H). Consider renaming it (and the related classes 'RunInGithubRequest' and 'RunInGithubResponse') to maintain consistent capitalization throughout the code.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 30% vs. threshold = 50% This comment is about a naming/style consistency issue. It's suggesting a refactor for consistency. The rules say "Comments that suggest code quality refactors are good! But only if they are actionable and clear." This comment is actionable and clear - it suggests renaming the classes. However, I need to consider: 1) Is this an "obvious or unimportant" issue? 2) Is this clearly a code change that's required? The capitalization of "GitHub" is a minor style issue. While technically inconsistent, it's not a bug or functional issue. The comment is more of a "nice to have" rather than something that "clearly requires a code change". It's a minor style preference. Given the rule "Do NOT comment unless there is clearly a code change required", this seems like it falls into the category of optional/minor improvements rather than required changes. This could be considered an actionable code quality refactor that improves consistency. The proper capitalization of "GitHub" is actually the official brand name, so this might be more important than just a minor style issue. The comment is clear and actionable. While the comment is actionable, the rules emphasize not commenting unless there is "clearly a code change required." This is a minor naming consistency issue that doesn't affect functionality. The code works fine as-is, and this is more of a stylistic preference. The bar for keeping comments should be high - "STRONG EVIDENCE that the comment is correct" and clearly required changes. This comment should be deleted. While it points out a valid consistency issue, it's a minor stylistic concern rather than a clearly required code change. The code functions correctly as-is, and the comment doesn't meet the threshold of being a necessary change.

Workflow ID: wflow_1xDS7okIDHCPEa1x

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0a42cff and 34a9f6f.

📒 Files selected for processing (2)
  • packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py (9 hunks)
  • packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py
  • packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py
🪛 Flake8 (7.3.0)
packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py

[error] 524-524: 'unittest.mock.MagicMock' imported but unused (F401)

[error] 524-524: 'unittest.mock.Mock' imported but unused (F401)

[error] 531-531: 'openai.types.responses.Response' imported but unused (F401)

[error] 531-531: 'openai.types.responses.ResponseOutputItem' imported but unused (F401)

[error] 531-531: 'openai.types.responses.ResponseUsage' imported but unused (F401)

[error] 640-640: local variable 'parent' is assigned to but never used (F841)

[error] 713-713: local variable 'parent' is assigned to but never used (F841)

🪛 Ruff (0.14.6)
packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py

595-595: Unused function argument: instrument_legacy (ARG001)

640-640: Local variable parent is assigned to but never used; remove assignment to unused variable parent (F841)

668-668: Unused function argument: instrument_legacy (ARG001)

713-713: Local variable parent is assigned to but never used; remove assignment to unused variable parent (F841)

packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py

567-567: Do not catch blind exception: Exception (BLE001)

640-640: Do not catch blind exception: Exception (BLE001)

729-729: Do not catch blind exception: Exception (BLE001)

803-803: Do not catch blind exception: Exception (BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Lint
🔇 Additional comments (9)
packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/responses_wrappers.py (6)

182-186: LGTM! Trace context field properly configured.

The addition of the trace_context field with arbitrary_types_allowed = True correctly handles OpenTelemetry context objects, which aren't Pydantic-compatible by default. This is the appropriate approach for storing trace context.
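
For reference, a minimal sketch of that pattern (only the trace_context field and the arbitrary_types_allowed flag are taken from this PR; the other field is illustrative):

```python
from typing import Any, Optional

import pydantic


class TracedData(pydantic.BaseModel):
    # Illustrative subset of the real model.
    response_id: Optional[str] = None

    # OpenTelemetry Context objects are not Pydantic-native types, so the
    # model opts into arbitrary types in order to hold one on the instance.
    trace_context: Any = pydantic.Field(default=None)

    class Config:
        arbitrary_types_allowed = True
```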


508-514: LGTM! Context capture for streaming responses.

The trace context is correctly captured before creating the span and passed to start_span, ensuring trace continuity for streaming responses.
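
As an illustration of that pattern (a sketch, not the exact patch), the current context is grabbed at the call site and handed to the tracer explicitly:

```python
from opentelemetry import context as context_api, trace

tracer = trace.get_tracer(__name__)


def start_response_span(name="openai.response"):  # span name is illustrative
    # Capture whatever context is active at the call site, e.g. a parent span
    # started by the application, before any streaming machinery takes over.
    captured_ctx = context_api.get_current()

    # Passing the captured context explicitly parents the new span correctly
    # even if the ambient context has changed by the time it is created.
    span = tracer.start_span(name, context=captured_ctx)
    return span, captured_ctx
```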


564-579: LGTM! Exception path correctly handles trace context.

The trace context is properly captured when creating TracedData (with fallback to current context) and correctly restored when creating the error span. The pattern ensures trace continuity even in error scenarios.
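
A rough sketch of that fallback (everything except the trace_context attribute name is illustrative):

```python
from opentelemetry import context as context_api


def resolve_trace_context(existing_data):
    # Prefer the context already recorded for this response; only fall back
    # to the current context when nothing has been stored yet.
    if existing_data is not None and existing_data.trace_context is not None:
        return existing_data.trace_context
    return context_api.get_current()
```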


636-650: LGTM! Success path trace context propagation.

The trace context is correctly captured in TracedData and restored when creating the completion span. This ensures that completed responses maintain trace continuity with their originating context.
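
On the restore side, the idea reduces to something like the following sketch: the completion span is started against the stored context rather than whatever context happens to be ambient at that point.

```python
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer(__name__)


def emit_completion_span(traced_data, attributes=None):
    # Start the span in the trace context captured when the request was made,
    # so it stays in the caller's trace even when this code runs from a
    # detached callback or a different task.
    span = tracer.start_span(
        "openai.response",  # illustrative span name
        context=traced_data.trace_context,
        kind=SpanKind.CLIENT,
    )
    try:
        for key, value in (attributes or {}).items():
            span.set_attribute(key, value)
    finally:
        span.end()
```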


674-680: LGTM! Async wrapper consistently implements trace context propagation.

The async wrapper functions follow the same trace context capture/restore pattern as the sync versions, ensuring trace continuity across asynchronous operations. The implementation is consistent and correct.

Also applies to: 726-741, 799-813
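
A minimal async illustration of the same shape (not the actual wrapper): capture at the call site, restore when the awaited result is recorded.

```python
from opentelemetry import context as context_api, trace

tracer = trace.get_tracer(__name__)


async def traced_async_request(make_request):
    # Capture the caller's context up front; later processing (stream
    # consumption, callbacks) may run under a different ambient context.
    captured_ctx = context_api.get_current()

    response = await make_request()

    # Record the result under the captured context, preserving the
    # parent-child relationship across the async boundary.
    span = tracer.start_span("openai.response", context=captured_ctx)
    with trace.use_span(span, end_on_exit=True):
        return response
```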


836-843: LGTM! Cancellation paths preserve trace context.

Both sync and async cancellation wrappers correctly restore the trace context from existing data when recording cancellation spans, maintaining trace continuity even when responses are cancelled.

Also applies to: 868-875

packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py (3)

510-591: Excellent unit test for trace context propagation!

This test comprehensively validates that trace context is correctly captured and restored, ensuring spans maintain the proper parent-child relationship. The detailed assertions and comments make the test's purpose clear.
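
For readers skimming the test, the core assertion pattern looks roughly like this (SDK in-memory exporter; names are illustrative rather than the test's actual fixtures):

```python
from opentelemetry import context as context_api
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

provider = TracerProvider()
exporter = InMemorySpanExporter()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer(__name__)


def test_capture_and_restore_trace_context():
    with tracer.start_as_current_span("parent"):
        # What the wrapper would store on TracedData at request time.
        captured = context_api.get_current()

    # Later, outside the parent's scope, restore the stored context.
    child = tracer.start_span("openai.response", context=captured)
    child.end()

    spans = {s.name: s for s in exporter.get_finished_spans()}
    assert spans["openai.response"].parent.span_id == spans["parent"].context.span_id
    assert spans["openai.response"].context.trace_id == spans["parent"].context.trace_id
```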


593-663: Excellent integration test for sync streaming!

This test thoroughly validates trace context propagation for synchronous streaming responses, with comprehensive assertions and clear documentation. It effectively prevents regressions of the trace continuity issue.
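
Conceptually the streaming test reduces to the following sketch; the fixture names, model, and prompt are placeholders, and the VCR cassette plumbing is elided.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


def test_responses_streaming_with_parent_span(openai_client, span_exporter):
    # Stream a response inside an application-level parent span.
    with tracer.start_as_current_span("app.parent"):
        stream = openai_client.responses.create(
            model="gpt-4o-mini",
            input="Say hello",
            stream=True,
        )
        # Consuming the stream is what drives the instrumentation to finish
        # and export the response span.
        for _event in stream:
            pass

    spans = {s.name: s for s in span_exporter.get_finished_spans()}
    parent = spans["app.parent"]
    response_span = next(s for s in spans.values() if s is not parent)

    # The fix keeps the response span inside the parent's trace.
    assert response_span.context.trace_id == parent.context.trace_id
    assert response_span.parent.span_id == parent.context.span_id
```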


665-736: Excellent integration test for async streaming!

This test provides comprehensive coverage for asynchronous streaming responses, ensuring trace context is properly propagated in async scenarios. The assertions and documentation are thorough and clear.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed 9c7804a in 56 seconds.
  • Reviewed 47 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py:521
  • Draft comment:
    Removed unused import of MagicMock and Mock. This cleanup improves clarity.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative as it just states that an unused import was removed. It doesn't provide any actionable feedback or suggestions for improvement.
2. packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py:637
  • Draft comment:
    Removed assignment to 'parent' in streaming test as it was unused. Good clean-up.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, stating that an assignment was removed because it was unused. It does not provide a suggestion, ask for confirmation, or highlight a potential issue. It violates the rule against making purely informative comments.
3. packages/opentelemetry-instrumentation-openai/tests/traces/test_responses.py:710
  • Draft comment:
    Removed redundant assignment to 'parent' in async streaming test; this variable was not needed.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, stating that a redundant assignment was removed. It does not provide a suggestion, ask for confirmation, or highlight a potential issue. According to the rules, purely informative comments should be removed.
4. packages/sample-app/sample_app/openai_guardrails_example.py:13
  • Draft comment:
    Removed unused import of ConsoleSpanExporter, which simplifies the code.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, stating that an unused import was removed. It doesn't provide any actionable feedback or suggestions for improvement. According to the rules, purely informative comments should be removed.

Workflow ID: wflow_qlGYexzyzhBWKi7M

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@galkleinman galkleinman merged commit a005fe0 into main Nov 27, 2025
12 checks passed
