Skip to content

Fix Claude Agent SDK tracing by capturing messages from receive_messages#20778

Merged
smoorjani merged 25 commits intomlflow:masterfrom
smoorjani:claude-agents-sdk-bug
Feb 18, 2026
Merged

Fix Claude Agent SDK tracing by capturing messages from receive_messages#20778
smoorjani merged 25 commits intomlflow:masterfrom
smoorjani:claude-agents-sdk-bug

Conversation

@smoorjani
Copy link
Collaborator

@smoorjani smoorjani commented Feb 12, 2026

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

  • Fix mlflow.anthropic.autolog() not creating traces when using the Claude Agent SDK. The previous hook-based approach read transcript files that suddenly started to only contain queue-operation metadata, so no trace was ever created.
  • Wrap query() and receive_response() on the SDK client to capture messages directly and build the trace when the response stream is exhausted.
  • Use native Anthropic message format for LLM span inputs/outputs, include cache tokens in input totals.

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests
import asyncio

import mlflow

mlflow.anthropic.autolog()
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment(experiment_id="98459650931566")


async def main():
    from claude_agent_sdk import ClaudeSDKClient

    async with ClaudeSDKClient() as client:
        await client.query(
            "Read through the MLflow MemAlign implementation in this codebase "
            "(check mlflow/metrics/ and related files). Briefly explain what it does, "
            "then suggest 2-3 concrete performance optimizations. "
            "Use the Read and Grep tools to explore the code."
        )
        async for msg in client.receive_response():
            print(f"  [{type(msg).__name__}] {msg}")

    print("\nDone! Check experiment 98459650931566 for the trace.")


asyncio.run(main())

Results:
image

image

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Does this PR require updating the MLflow Skills repository?

  • No. You can skip the rest of this section.
  • Yes. Please link the corresponding PR or explain how you plan to update it.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@github-actions github-actions bot added area/tracking Tracking service, tracking client APIs, autologging rn/bug-fix Mention under Bug Fixes in Changelogs. labels Feb 12, 2026
@github-actions
Copy link
Contributor

🛠 DevTools 🛠

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20778/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20778/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/20778/merge

@github-actions
Copy link
Contributor

github-actions bot commented Feb 12, 2026

Documentation preview for 36a732e is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

…query instead of transcript

The SDK transcript files only contain queue-operation metadata, not actual
conversation content, so process_transcript() could never find user messages.
This wraps query() and receive_messages() on the client instance to accumulate
messages into a buffer, then builds the trace from typed SDK message objects
via a new process_sdk_messages() function. Also extracts shared trace
finalization logic into _finalize_trace() to reduce duplication.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
@smoorjani smoorjani force-pushed the claude-agents-sdk-bug branch from ab341c9 to 0b8ff74 Compare February 13, 2026 18:00
smoorjani and others added 16 commits February 13, 2026 10:28
…improve docstrings

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
…sult_message

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
…esultMessage

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
- Wrap receive_response() to capture ResultMessage (which contains token
  usage and duration but is only yielded by receive_response, not
  receive_messages)
- Remove fake custom timestamps from SDK path — spans now use real
  wall-clock timing instead of computed timestamps that showed 1 second
- Include cache tokens (cache_creation + cache_read) in input token count
- Verify trace-level token usage aggregation in tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
The SDK fires the Stop hook BEFORE yielding ResultMessage (which carries
token usage and duration). This caused both execution_duration and
token_usage to be missing from traces.

Fix: build the trace when receive_response() is fully consumed instead
of in the Stop hook. A receiving_response flag prevents the stop hook
from building a partial trace mid-stream. The stop hook still serves
as a fallback for code paths that only use receive_messages().

Also sets token usage directly on trace_metadata as belt-and-suspenders,
and uses ResultMessage.duration_ms for custom span timestamps.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
- Replace 3 wrappers + stop hook + 2 flags with a single
  receive_response() wrapper that builds the trace on exhaustion
- Use native Anthropic message format instead of converting to OpenAI
- Only include messages since last LLM span (not full history)
- Set MESSAGE_FORMAT: "anthropic" on LLM spans for Chat UI rendering
- Remove _build_trace helper, query/receive_messages wrappers, and
  all hook/flag machinery (-267 lines net)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
- query() doesn't echo through receive_response(), so wrap it to
  capture the user prompt in the message buffer
- Remove implementation-detail docstrings from internal methods
- Rename _sdk_msg_to_dict to _serialize_sdk_message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
@smoorjani smoorjani requested a review from B-Step62 February 17, 2026 03:13
Copy link
Collaborator

@B-Step62 B-Step62 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good once #20778 (comment) is addressed

if options is None:
options = ClaudeAgentOptions()
# query() sends the user prompt but doesn't echo it through receive_response()
original_query = self.query
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the @safe_patch mechanism like other autologging itegration?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is already present here:

These are patches on instance methods so I think the existing patch is sufficient, but LMK if not. These also have some limitations (e.g., async methods, they are not stateless) which make them hard to use with safe_patch

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my bad, that was misread of the code.

Re:async, safe_patch should handle async now (we implement tracing for async LLM calls with it) so there may be some way to use it. But definitely not blocking.

return tool_result_map


def _serialize_sdk_message(msg) -> dict[str, Any] | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does asdict of dataclass work?

Copy link
Collaborator Author

@smoorjani smoorjani Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave it a shot. asdict on the full message doesn't help because it still requires significant postporcessing. We do use asdict for serialization in _serialize_content_block where it replaces manual field extraction.

smoorjani and others added 5 commits February 17, 2026 09:06
- Only include cache_creation_input_tokens in input count (not cache_read)
  since cache reads are significantly cheaper and would inflate cost estimates
- Skip response key in outputs when final_response is None
- Remove session ID fallback generation; omit if unavailable
- Simplify async flush: call flush_trace_async_logging() directly
  (it already handles errors internally)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
… fix docstring

- Move cache token detail from docstring to inline comment
- Extract _is_async_trace_logging_enabled() utility with unit tests
  (reviewer flagged that _async_queue field could change silently)
- Use dataclasses.asdict for SDK content block serialization
- Fix process_sdk_messages docstring (no longer generates session IDs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Encapsulates the async queue check + flush call so the fragile
_async_queue field name is tested directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
smoorjani and others added 3 commits February 17, 2026 15:49
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
query() can receive an async generator of message dicts (not just a
string).  Wrap the generator to capture user content for the trace
while passing items through to the SDK.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
…st imports to top level

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
@smoorjani smoorjani added this pull request to the merge queue Feb 18, 2026
Merged via the queue into mlflow:master with commit 74c2e60 Feb 18, 2026
54 checks passed
@smoorjani smoorjani deleted the claude-agents-sdk-bug branch February 18, 2026 06:17
@github-actions github-actions bot added the size/XL Extra-large PR (500+ LoC) label Feb 18, 2026
daniellok-db pushed a commit to daniellok-db/mlflow that referenced this pull request Feb 20, 2026
…ges (mlflow#20778)

Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Co-authored-by: Claude <noreply@anthropic.com>
daniellok-db pushed a commit that referenced this pull request Feb 20, 2026
…ges (#20778)

Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/tracking Tracking service, tracking client APIs, autologging rn/bug-fix Mention under Bug Fixes in Changelogs. size/XL Extra-large PR (500+ LoC) v3.10.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants