
Agno V2 fixes #18345

Merged
B-Step62 merged 8 commits into mlflow:master from joelrobin18:fix_18335
Nov 21, 2025

Conversation

@joelrobin18
Collaborator

@joelrobin18 joelrobin18 commented Oct 16, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18345/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18345/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/18345/merge

Related Issues/PRs

Fix #18335

What changes are proposed in this pull request?

Agno v2 introduces several breaking changes, including full support for OpenTelemetry instrumentation. This PR updates the tracing implementation to be compatible with Agno v2 by using MLflow’s native integration with OTel-based tracing.

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@github-actions github-actions bot added the area/tracing (MLflow Tracing and its integrations) and rn/bug-fix (Mention under Bug Fixes in Changelogs) labels Oct 16, 2025
@github-actions
Contributor

github-actions bot commented Oct 16, 2025

Documentation preview for f7e2f52 is available.


@BenWilson2
Member

@joelrobin18 we might want to think about creating a v2 autologging module given the scope of the breaking changes, to simplify things here. We could add version validation handling (there are other autologging integrations where this has been done) to avoid complicating maintainability by embedding large amounts of try/catch or conditional logic within a single implementation.

@joelrobin18
Collaborator Author

Hi @BenWilson2, thank you for the feedback. I'm refactoring the code to use fewer try/catch blocks and to address the above comments.

@ashdam

ashdam commented Nov 11, 2025

Very much appreciated if we could fully integrate v2.
IMHO it's not worth trying to stay compatible with v1; Agno is updating very fast :)

@joelrobin18
Collaborator Author

Hi @ashdam, could you please add the code below at the top of your agent and check?

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from openinference.instrumentation.agno import AgnoInstrumentor

# Configure OTLP to export to MLflow
exporter = OTLPSpanExporter(
    endpoint="http://localhost:5000/v1/traces",
    headers={"x-mlflow-experiment-id": "0"}
)

tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(tracer_provider)

AgnoInstrumentor().instrument()

@ashdam

ashdam commented Nov 12, 2025

@joelrobin18 I have tested it.

Not really an expert on MLflow, but currently I'm using MLflow 3.6.0 OSS with a PostgreSQL backend and Agno v2.2.6.

Agno is capturing calls, but it raises the following error.

I got this error in agno-os:

{"TimeStamp": "2025-11-12T16:55:42.7841868+00:00", "Log": "error Not Implemented encountered while exporting span batch, retrying in 0.96s."}
{"TimeStamp": "2025-11-12T16:55:43.74478+00:00", "Log": "error Not Implemented encountered while exporting span batch, retrying in 1.63s."}
{"TimeStamp": "2025-11-12T16:55:45.3754366+00:00", "Log": "error Not Implemented encountered while exporting span batch, retrying in 3.62s."}
{"TimeStamp": "2025-11-12T16:55:49.0009375+00:00", "Log": "to export span batch code: 501, reason: {\"detail\":\"REST OTLP span logging is not supported by FileStore\"}"}

@joelrobin18
Collaborator Author

It looks like we are using FileStore as the backend here. Can you share simple repro code for this?

Signed-off-by: joelrobin18 <joelrobin1818@gmail.com>
_AUTOLOGGING_CLEANUP_CALLBACKS = {}


def register_cleanup_callback(autologging_integration, callback):
Collaborator Author

This is needed to clean up the OTel instrumentation after it is disabled via mlflow.autolog(disable=True). Let me know if there is a better way to do this.

Collaborator

Does the approach we use in some flavors like DSPy work? https://github.com/mlflow/mlflow/blob/master/mlflow/dspy/autolog.py#L54-L60

Basically:

  1. Add an empty _autolog function.
  2. Decorate it with @autologging_integration, instead of the main autolog function.
  3. Call that function inside the main autolog function.

This is hacky, but this way we can let the autolog() function be called even when disable=True is specified.

@ashdam

ashdam commented Nov 13, 2025

Hi @joelrobin18,

Thanks for looking into this! Yes, I'm definitely using PostgreSQL as the backend store, not FileStore.
I would like to add that I'm not a Python expert and I am using Sonnet 4.5 to help me out :P

Here's the configuration:

MLflow Server Setup

Deployment: Azure Container Apps running MLflow v3.6.0
Start Command:

mlflow server \
  --host 0.0.0.0 \
  --port 5000 \
  --backend-store-uri postgresql://agnoadmin:****@psql-agno-storage.postgres.database.azure.com:5432/mlflow_db?sslmode=require \
  --default-artifact-root wasbs://mlflow-artifacts@saagenticaifinancedemo.blob.core.windows.net/ \
  --serve-artifacts

Verified Backend Configuration:

$ az containerapp show --name mlflow-server --query "properties.template.containers[0].env"
[
  {
    "name": "MLFLOW_BACKEND_STORE_URI",
    "secretRef": "postgres-uri"  # Points to PostgreSQL connection string
  },
  ...
]

Agno v2.2.6 Integration Code

Following your recommendation, I'm using OpenInference AgnoInstrumentor:

import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from openinference.instrumentation.agno import AgnoInstrumentor

mlflow_tracking_uri = os.getenv("MLFLOW_TRACKING_URI")
mlflow_experiment_id = os.getenv("MLFLOW_EXPERIMENT_ID", "0")

exporter = OTLPSpanExporter(
    endpoint=f"{mlflow_tracking_uri}/v1/traces",
    headers={"x-mlflow-experiment-id": mlflow_experiment_id},
)

tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(tracer_provider)

AgnoInstrumentor().instrument()

Dependencies:

  • agno==2.2.6
  • mlflow==3.6.0
  • openinference-instrumentation-agno==0.1.22
  • opentelemetry-sdk==1.38.0
  • opentelemetry-exporter-otlp-proto-http==1.38.0

The Issue

What's Working:

  • OpenInference AgnoInstrumentor successfully hooks into Agno v2.2.6 Team/Model calls
  • Spans are being generated and batched
  • OTLP exporter attempts to send to MLflow

The Error:

error Not Implemented encountered while exporting span batch, retrying in 0.96s.
error Not Implemented encountered while exporting span batch, retrying in 1.63s.
error Not Implemented encountered while exporting span batch, retrying in 3.62s.
Failed to export span batch code: 501, reason: {"detail":"REST OTLP span logging is not supported by FileStore"}

From Agent Execution Logs:
The instrumentation is definitely working - I can see the OpenInference wrappers in the stack traces:

File "/app/.venv/lib/python3.12/site-packages/openinference/instrumentation/agno/_runs_wrapper.py", line 512, in arun_stream
    async for response in wrapped(*args, **kwargs):
File "/app/.venv/lib/python3.12/site-packages/agno/team/team.py", line 2452, in _arun_stream
    async for event in self._ahandle_model_response_stream(
...
File "/app/.venv/lib/python3.12/site-packages/openinference/instrumentation/agno/_model_wrapper.py", line 493, in arun_stream
    async for chunk in wrapped(*args, **kwargs):

The Paradox

MLflow is configured with PostgreSQL backend (--backend-store-uri postgresql://...), but the 501 error indicates traces are still using FileStore. This suggests MLflow 3.6.0 has a separate trace storage layer that defaults to FileStore even when the main backend is PostgreSQL.

Is this a known limitation, or is there a missing configuration flag for enabling database trace storage?

I'm happy to provide a minimal reproducible repo if that helps debug this further!

Signed-off-by: joelrobin18 <joelrobin1818@gmail.com>
@ashdam

ashdam commented Nov 14, 2025

Thank you very much for your work @joelrobin18 . this is highly anticipated in my company :)

@ashdam

ashdam commented Nov 17, 2025

@BenWilson2 @joelrobin18 any news? :)

Collaborator

@B-Step62 B-Step62 left a comment

Overall looks good!


_logger.info("OpenTelemetry instrumentation enabled for Agno V2")

except ImportError as exc:
_logger.warning(
Collaborator

Can we raise this as an exception (with the current message)? Enabling tracing is the single purpose of calling mlflow.agno.autolog(), so it does not make much sense to pass through silently if we fail to do that.

@B-Step62
Collaborator

B-Step62 commented Nov 18, 2025

Is this a known limitation, or is there a missing configuration flag for enabling database trace storage?

@ashdam I can see Agno traces logged successfully via OTel on my local machine; I could not reproduce the error. Could you double-check that the tracking URI points to the correct MLflow instance? The error message indicates the backend is actually a file store.

You can also test it locally to see whether this is related to your Azure Container Apps settings or not.

pip install mlflow==3.6.0
mlflow ui --backend-store-uri sqlite:///mlruns.db 

@ashdam

ashdam commented Nov 18, 2025

Yes, we only have one MLflow server and it has PostgreSQL configured. It's really hard (DevOps + security) for me to replicate everything, including Agno, locally, to be honest :(

Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
@ashdam

ashdam commented Nov 21, 2025

@B-Step62 @joelrobin18 Thank you guys for your work :)

@B-Step62
Collaborator

@ashdam I still believe that what happens inside the app container is that MLflow is started with a file store. The error message includes the class name of the store:

detail=f"REST OTLP span logging is not supported by {store_name}",

If the store is properly configured with a SQL backend, you should see server logs like this:

Registry store URI not provided. Using sqlite:///mlruns.db
2025/11/21 20:56:58 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2025/11/21 20:56:58 INFO mlflow.store.db.utils: Updating database tables

One common gotcha is that multi-line commands are not properly formatted in the YAML file, so only the first line (mlflow server) runs, which defaults to a file store. If that happens, you should see a server log like this instead:

Backend store URI not provided. Using ./mlruns
Registry store URI not provided. Using ./mlruns
.../server/handlers.py:258: FutureWarning: The filesystem tracking backend (e.g., './mlruns') will be deprecated in February 2026. Consider transitioning to a database backend (e.g., 'sqlite:///mlflow.db') to take advantage of the latest MLflow features. See https://github.com/mlflow/mlflow/issues/18534 for more details and migration guidance.
  return FileStore(store_uri, artifact_uri)

For example, this works

    command:
      - /bin/bash
      - -c
      - |
        mlflow server \
            --backend-store-uri postgresql://... \
            --port 5000

but this does not work (only mlflow server will be executed).

    command: >
      /bin/bash -c "
        mlflow server \
            --backend-store-uri postgresql://... \
            --port 5000
        "

@B-Step62 B-Step62 added this pull request to the merge queue Nov 21, 2025
Merged via the queue into mlflow:master with commit 43f06f9 Nov 21, 2025
50 checks passed
@ashdam

ashdam commented Nov 21, 2025

Thank you for the help :) I will test it next week :D thank you!

jimilp7 pushed a commit to backspace-org/mlflow that referenced this pull request Nov 21, 2025
Signed-off-by: joelrobin18 <joelrobin1818@gmail.com>
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Co-authored-by: B-Step62 <yuki.watanabe@databricks.com>
Tian-Sky-Lan pushed a commit to Tian-Sky-Lan/mlflow that referenced this pull request Nov 24, 2025
Signed-off-by: joelrobin18 <joelrobin1818@gmail.com>
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
Co-authored-by: B-Step62 <yuki.watanabe@databricks.com>
Signed-off-by: Tian Lan <sky.blue266000@gmail.com>


Development

Successfully merging this pull request may close these issues.

[BUG] Error “No module named ‘agno.storage’” when using tracing after upgrading to Agno v2
