Fix RecursionError in strands, semantic_kernel, and haystack autologgers with shared tracer provider#20809
Conversation
Add thread-local recursion guard to `StrandsSpanProcessor.on_start()` and `on_end()` methods to prevent infinite recursion when using `MLFLOW_USE_DEFAULT_TRACER_PROVIDER=false` (shared provider mode). The recursion occurs because `on_start()` calls `tracer.span_processor.on_start()`, which with shared provider mode points back to the same composite processor containing `StrandsSpanProcessor` itself. This fix enables using `strands.autolog()` together with other autologgers (e.g., `bedrock.autolog()`) while sharing a single tracer provider, which is required for unified traces where spans from different components appear in the same trace.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Chris Grierson <christopher.grierson@smartsheet.com>
🛠 DevTools 🛠
Install mlflow from this PR. For Databricks, use the following command:
@cgrierson-smartsheet Thank you for the contribution! Could you fix the following issue(s)?

⚠ Invalid PR template: This PR does not appear to have been filed using the MLflow PR template. Please copy the PR template from here and fill it out.
Updated.
Pull request overview
This PR fixes a RecursionError that occurs when using mlflow.strands.autolog() with a shared OpenTelemetry tracer provider (MLFLOW_USE_DEFAULT_TRACER_PROVIDER=false). The root cause is that StrandsSpanProcessor.on_start() calls tracer.span_processor.on_start(), which in shared provider mode is the same composite processor containing StrandsSpanProcessor itself, creating infinite recursion.
Changes:
- Added thread-local recursion guards to `StrandsSpanProcessor.on_start()` and `on_end()` methods
- Wrapped processing logic in try-finally blocks to ensure guard flags are reset
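The failure mode described above can be reduced to a small, self-contained sketch. The class names below are illustrative, not MLflow's actual types: a processor forwards its callback to a composite processor that contains it, and without a guard the calls recurse until Python raises `RecursionError`.

```python
class CompositeProcessor:
    """Fans out lifecycle callbacks to a list of child processors."""

    def __init__(self):
        self.processors = []

    def on_start(self, span):
        for p in self.processors:
            p.on_start(span)


class ForwardingProcessor:
    """Forwards on_start to a downstream processor, as the autologgers do."""

    def __init__(self, downstream):
        self.downstream = downstream

    def on_start(self, span):
        # In shared-provider mode the downstream is the composite that
        # contains this very processor, so this call re-enters on_start.
        self.downstream.on_start(span)


composite = CompositeProcessor()
composite.processors.append(ForwardingProcessor(composite))

try:
    composite.on_start(object())
except RecursionError:
    print("RecursionError reproduced")  # the mutual calls exhaust the stack
```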
```python
if isinstance(tracer, NoOpTracer):
    return
```
The early return when the tracer is a NoOpTracer will leave the recursion guard flag permanently set to True, preventing all subsequent calls from executing. The guard flag is set at line 49, but this return statement bypasses the finally block that resets it at line 60. This will break functionality after the first NoOpTracer check succeeds.
The guard flag should be reset before this return, or this check should be moved before setting the guard flag.
The in_on_start flag is a recursion guard. When the shared tracer provider is active (MLFLOW_USE_DEFAULT_TRACER_PROVIDER=false), on_start calls tracer.span_processor.on_start(), which routes back through the same composite processor and re-enters this method. The flag detects that re-entrant call and returns early to break the loop. If the flag were not reset, all subsequent calls to on_start — not just recursive ones — would be silently skipped, effectively disabling the processor after its first invocation.
However, the flag is always reset. Copilot claims "this return statement bypasses the finally block," but in Python, finally blocks execute regardless of how the try block exits, including via return (language reference). The in_on_start flag set to True on line 49 will always be reset to False on line 60.
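That guarantee is easy to check with a short, self-contained sketch (the function and list names here are illustrative):

```python
# `finally` runs no matter how the `try` body exits, including via `return`;
# this is the mechanism that guarantees the recursion guard flag is reset.
events = []

def guarded():
    events.append("flag set")        # analogous to in_on_start = True
    try:
        return "early"               # analogous to the NoOpTracer early return
    finally:
        events.append("flag reset")  # analogous to in_on_start = False

result = guarded()
# result is "early" and events is ["flag set", "flag reset"]
```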
mlflow/strands/autolog.py
Outdated

```python
def on_start(self, span: OTelSpan, parent_context: Context | None = None):
    # Recursion guard - prevent infinite loop with shared provider
    if getattr(_processing_local, "in_on_start", False):
        return
    _processing_local.in_on_start = True
    try:
        tracer = _get_tracer(__name__)
        if isinstance(tracer, NoOpTracer):
            return
        tracer.span_processor.on_start(span, parent_context)
        trace_id = get_otel_attribute(span, SpanAttributeKey.REQUEST_ID)
        mlflow_span = create_mlflow_span(span, trace_id)
        InMemoryTraceManager.get_instance().register_span(mlflow_span)
    finally:
        _processing_local.in_on_start = False

def on_end(self, span: OTelReadableSpan) -> None:
    # Recursion guard - prevent infinite loop with shared provider
    if getattr(_processing_local, "in_on_end", False):
        return
    _processing_local.in_on_end = True
    try:
        mlflow_span = get_mlflow_span_for_otel_span(span)
        if mlflow_span is None:
            _logger.debug("Span not found in the map. Skipping end.")
            return
        with _bypass_attribute_guard(mlflow_span._span):
            _set_span_type(mlflow_span, span)
            _set_inputs_outputs(mlflow_span, span)
            if model := span.attributes.get("gen_ai.request.model"):
                mlflow_span.set_attribute(SpanAttributeKey.MODEL, model)
            _set_token_usage(mlflow_span, span)
        tracer = _get_tracer(__name__)
        tracer.span_processor.on_end(span)
    finally:
        _processing_local.in_on_end = False
```
The recursion guard implementation uses separate flags for on_start and on_end methods. However, this approach may not be sufficient if on_start is called while on_end is executing (or vice versa) in complex scenarios. Consider using a single shared guard or a reentrant lock pattern that blocks the entire processor, or document why separate guards are necessary for the correctness of this implementation.
Separate guards are intentional. on_start and on_end are called independently by the OTel SDK for different lifecycle events — on_start when a span begins, on_end when it finishes. A shared guard would incorrectly block on_end from executing while on_start is processing (or vice versa), breaking normal span lifecycle handling. Each method only needs to guard against re-entering itself.
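The point can be sketched with plain functions and illustrative names (not the MLflow code): with separate flags, an `on_end` invoked while `on_start` is still executing runs normally, whereas a single shared flag would silently skip it.

```python
import threading

_local = threading.local()
calls = []

def on_end(span):
    # Guarded by its own flag, so it is not blocked while on_start runs.
    if getattr(_local, "in_on_end", False):
        return
    _local.in_on_end = True
    try:
        calls.append("on_end")
    finally:
        _local.in_on_end = False

def on_start(span):
    if getattr(_local, "in_on_start", False):
        return
    _local.in_on_start = True
    try:
        calls.append("on_start")
        # A span can end while another span's on_start is still on the
        # stack; a single shared guard flag would skip this call entirely.
        on_end(span)
    finally:
        _local.in_on_start = False

on_start(None)
# calls is ["on_start", "on_end"]: both lifecycle hooks executed
```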
mlflow/strands/autolog.py
Outdated
No tests are included for the shared tracer provider scenario that this PR is meant to fix. Consider adding a test that enables shared tracer provider mode (MLFLOW_USE_DEFAULT_TRACER_PROVIDER=false) and verifies that strands.autolog() works without RecursionError, similar to the tests in tests/tracing/opentelemetry/test_integration.py that test other autologgers with shared provider mode.
Added shared-provider tests for all three affected integrations:

- `tests/strands/test_strands_tracing.py::test_strands_autolog_shared_provider_no_recursion`
- `tests/haystack/test_haystack_tracing.py::test_haystack_autolog_shared_provider_no_recursion`
- `tests/semantic_kernel/test_semantic_kernel_autolog.py::test_sk_shared_provider_no_recursion`
Each test enables shared provider mode via monkeypatch.setenv and verifies that the autologger produces a valid trace without RecursionError. With the fix reverted, all three tests fail with RecursionError, confirming they correctly reproduce the bug.
mlflow/strands/autolog.py
Outdated

```python
def on_start(self, span: OTelSpan, parent_context: Context | None = None):
    # Recursion guard - prevent infinite loop with shared provider
    ...
```
Can you include a bit more context that this occurs only when MLFLOW_USE_DEFAULT_TRACER_PROVIDER=False?
There was a problem hiding this comment.
Good call — updated the comment to be more explicit about the condition. Also applied the same fix and comment pattern to SemanticKernelSpanProcessor and HaystackSpanProcessor, which had the same vulnerability.
mlflow/strands/autolog.py
Outdated

```python
_logger = logging.getLogger(__name__)

# Thread-local storage for recursion guard
_processing_local = threading.local()
```
Can we make this lock an attribute of StrandsSpanProcessor?
Makes sense — moved the recursion guard to an instance attribute initialized in __init__. Applied the same pattern to SemanticKernelSpanProcessor and HaystackSpanProcessor.
Overall looks great, can you add a test too?
- Move recursion guard from module-level threading.local() to instance attribute on StrandsSpanProcessor (per TomeHirata review)
- Expand inline comments to reference MLFLOW_USE_DEFAULT_TRACER_PROVIDER=false and explain the recursion mechanism (per TomeHirata review)
- Add test_strands_autolog_shared_provider_no_recursion exercising shared provider mode (per Copilot review)
- Update conftest.py with OTel provider state reset for test isolation

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Chris Grierson <christopher.grierson@smartsheet.com>
```python
def test_strands_autolog_shared_provider_no_recursion(monkeypatch):
    """Verify strands.autolog() works with shared tracer provider (no RecursionError)."""
```
Let's remove this docstring
There was a problem hiding this comment.
Done — converted to a regular comment to satisfy the clint linter while preserving the context.
Documentation preview for 31bb767 is available at: More info
…ggers

Both SemanticKernelSpanProcessor and HaystackSpanProcessor have the same recursion vulnerability as StrandsSpanProcessor when using shared tracer provider mode (MLFLOW_USE_DEFAULT_TRACER_PROVIDER=false).

- Add threading.local() recursion guard to SemanticKernelSpanProcessor
- Add threading.local() recursion guard to HaystackSpanProcessor
- Add shared provider tests for all three integrations
- Add haystack conftest.py with OTel provider state reset
- Remove redundant test docstrings (clint lint fix)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Chris Grierson <christopher.grierson@smartsheet.com>
Added shared-provider tests for all three affected integrations; see the "How is this PR tested?" section for details.
…ers with shared tracer provider (#20809) Signed-off-by: Chris Grierson <christopher.grierson@smartsheet.com> Co-authored-by: Chris Grierson <christopher.grierson@smartsheet.com> Co-authored-by: Claude <noreply@anthropic.com>
Related Issues/PRs
Fixes #20808
What changes are proposed in this pull request?
This PR fixes a `RecursionError` that occurs when using `mlflow.strands.autolog()`, `mlflow.semantic_kernel.autolog()`, or `mlflow.haystack.autolog()` with a shared OpenTelemetry tracer provider (`MLFLOW_USE_DEFAULT_TRACER_PROVIDER=false`).

Root Cause:
`StrandsSpanProcessor`, `SemanticKernelSpanProcessor`, and `HaystackSpanProcessor` all share the same pattern: their `on_start()` methods call `tracer.span_processor.on_start(span, parent_context)`, and their `on_end()` methods call `tracer.span_processor.on_end(span)`. With shared provider mode, this tracer's span processor is the same composite processor that includes the calling processor itself, creating infinite recursion.

With isolated provider mode (the default), each autologger gets its own tracer provider, so `tracer.span_processor` points to MLflow's internal processor rather than the composite one, and no recursion occurs.

Solution: Add a `threading.local()` recursion guard as an instance attribute on each span processor. On first entry, the guard flag is set; if a re-entrant call is detected, it returns early. A `finally` block ensures the flag is always reset.

Motivation: Shared tracer provider mode is required to get unified traces where spans from different autologgers appear together. In agentic AI applications using Strands + Bedrock, Semantic Kernel + OpenAI, or Haystack pipelines, this enables end-to-end observability (latency tracking, causal relationships between agent and LLM spans, and token usage attribution) that is lost when each autologger produces separate traces.
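A minimal sketch of the fix as described, with the guard stored as an instance attribute. The class and helper names are illustrative, and the shared-provider cycle is simulated by a method that routes straight back into `on_start`:

```python
import threading

class GuardedSpanProcessor:
    """Sketch of the guard pattern; not the actual MLflow processor."""

    def __init__(self):
        # Per-instance, per-thread recursion guard (an attribute rather
        # than a module-level threading.local(), per review feedback).
        self._processing_local = threading.local()

    def on_start(self, span, parent_context=None):
        # Re-entrant call detected: break the cycle that occurs when
        # MLFLOW_USE_DEFAULT_TRACER_PROVIDER=false routes back here.
        if getattr(self._processing_local, "in_on_start", False):
            return
        self._processing_local.in_on_start = True
        try:
            self._forward(span, parent_context)
        finally:
            # Always reset so later spans are still processed.
            self._processing_local.in_on_start = False

    def _forward(self, span, parent_context):
        # Simulates the composite processor calling back into this one.
        self.on_start(span, parent_context)


processor = GuardedSpanProcessor()
processor.on_start(object())  # returns normally; no RecursionError
processor.on_start(object())  # guard was reset, so later calls still run
```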
Affected integrations:

- `StrandsSpanProcessor` in `mlflow/strands/autolog.py`
- `SemanticKernelSpanProcessor` in `mlflow/semantic_kernel/autolog.py`
- `HaystackSpanProcessor` in `mlflow/haystack/autolog.py`

How is this PR tested?
New automated tests, one per integration, each enabling shared provider mode via `monkeypatch.setenv` and verifying the autologger produces a valid trace without `RecursionError`:

- `tests/strands/test_strands_tracing.py::test_strands_autolog_shared_provider_no_recursion`
- `tests/haystack/test_haystack_tracing.py::test_haystack_autolog_shared_provider_no_recursion`
- `tests/semantic_kernel/test_semantic_kernel_autolog.py::test_sk_shared_provider_no_recursion`

Test verification: With the fix reverted (source files restored to master, tests kept), all three new tests fail with `RecursionError`, confirming they correctly reproduce the bug. All existing tests across the three integrations continue to pass.
MLFLOW_USE_DEFAULT_TRACER_PROVIDERtruefalsetruefalseDoes this PR require documentation update?
Does this PR require updating the MLflow Skills repository?
Release Notes
Is this a user-facing change?
Fix `RecursionError` in `mlflow.strands.autolog()`, `mlflow.semantic_kernel.autolog()`, and `mlflow.haystack.autolog()` when using shared tracer provider mode (`MLFLOW_USE_DEFAULT_TRACER_PROVIDER=false`), enabling unified traces across multiple autologgers.

What component(s), interfaces, languages, and integrations does this PR affect?
Components

- `area/tracking`: Tracking Service, tracking client APIs, autologging
- `area/tracing`: MLflow Tracing features, tracing APIs, and LLM tracing functionality

How should the PR be classified in the release notes? Choose one:
- `rn/bug-fix`: A user-facing bug fix worth mentioning in the release notes

Should this PR be included in the next patch release?
`Yes` should be selected for bug fixes, documentation updates, and other small changes. `No` should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.
🤖 Generated with Claude Code