Skip to content

[Python] Open Telemetry tracing (part 1)#40556

Open
Zgoda91 wants to merge 37 commits into
grpc:masterfrom
Zgoda91:A72_python_otel_tracing
Open

[Python] Open Telemetry tracing (part 1)#40556
Zgoda91 wants to merge 37 commits into
grpc:masterfrom
Zgoda91:A72_python_otel_tracing

Conversation

@Zgoda91

@Zgoda91 Zgoda91 commented Aug 26, 2025

Copy link
Copy Markdown
Contributor

Implementing gRFC A72 grpc/proposal/pull/474

ToDo's:

  1. Implement injection / extraction headers by utilizing TextMapPropagator object - handled in a separate PR due to the huge number of code changes

@sergiitk

sergiitk commented Sep 8, 2025

Copy link
Copy Markdown
Member

At the moment gRPC Python team still lacks in-house knowledge of the o11y project, requesting @XuanWang-Amos to do the first pass.

@sergiitk sergiitk added release notes: yes Indicates if PR needs to be in release notes kokoro:run labels Sep 8, 2025

@XuanWang-Amos XuanWang-Amos left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general looks good to me, thanks for the work!

Comment thread src/python/grpcio_observability/grpc_observability/_observability.py Outdated
Comment thread src/python/grpcio_observability/grpc_observability/_cyobservability.pyx Outdated
Comment thread src/python/grpcio_observability/grpc_observability/client_call_tracer.cc Outdated
Comment thread src/python/grpcio_observability/grpc_observability/client_call_tracer.cc Outdated
@Zgoda91

Zgoda91 commented Sep 11, 2025

Copy link
Copy Markdown
Contributor Author

@XuanWang-Amos PR ready for second round

Comment thread src/python/grpcio_observability/grpc_observability/client_call_tracer.cc Outdated
@Zgoda91 Zgoda91 force-pushed the A72_python_otel_tracing branch from e567895 to e2b7421 Compare March 22, 2026 13:18
asheshvidyut pushed a commit to asheshvidyut/grpc that referenced this pull request Mar 26, 2026
Closes grpc#39800 and grpc#39061

Work done:
1. Added metrics exporting mechanism for AsyncIO stack in Python
2. Added AsyncIO example with observability plugin enabled. Metrics are now correctly recorded (as per sync stack). Tested locally with Python 3.12.
3. Added AsyncIO observability test suite, to confirm metrics collection for all possible RPC's:
* unary unary
* unary stream
* stream unary
* stream stream

Caveats for the current solution:
1. All AsyncIO RPCs currently behave as unregistered methods from a metrics pespective (`grpc.method` field set to `other`)
2. xDS Observability support in for AsyncIO stack is out of scope.

ToDo's:
1. Once grpc#40556 is merged - tracing mechanism will be introduced for AsyncIO stack

Closes grpc#41573

PiperOrigin-RevId: 882655402
@Zgoda91

Zgoda91 commented Apr 9, 2026

Copy link
Copy Markdown
Contributor Author

@asheshvidyut, @sergiitk - PR ready for the next round. Please mind that the TextMapPropagator usage will be introduced in the following PR as it is quite a complex implementation. We need to orchestrate Python API (extract() / inject()) with Cython/C++ span ID and trace ID generation

@Zgoda91 Zgoda91 changed the title [Python] Open Telemetry tracing [Python] Open Telemetry tracing (part 1) Apr 20, 2026
@sergiitk

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements OpenTelemetry tracing support for gRPC Python by introducing span and event management, trace ID propagation, and integration with the OpenTelemetry SDK. The feedback highlights several important issues: timestamps are currently passed as strings which will cause TypeErrors in the OpenTelemetry SDK and should be converted to nanosecond integers; a race condition exists when monkey-patching the shared Tracer's ID generator due to instance-level locking; and a potential IndexError was identified when accessing the plugins list. It is also recommended to handle timestamps as int64_t in the C++ layer for better efficiency.

Comment on lines +282 to +298
span = self._tracer.start_span(
name=tracing_data.name,
context=local_ctx,
kind=trace.SpanKind.INTERNAL,
attributes=tracing_data.span_labels,
links=None,
start_time=tracing_data.start_time,
)

for event in tracing_data.span_events:
span.add_event(
name=event["name"],
attributes=event["attributes"],
timestamp=event["time_stamp"],
)
span.set_status(self._status_to_otel_status(tracing_data.status))
span.end(end_time=tracing_data.end_time)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The start_time, end_time, and event time_stamp are passed as strings (e.g., "2023-05-29 17:07:09.895"), but the OpenTelemetry SDK expects these values to be integers representing nanoseconds since the epoch. Passing strings will lead to a TypeError in the SDK when it attempts to calculate span duration or during export.

It is highly recommended to change the data type of these fields to int64_t in the C++ layer and pass them as integers through Cython to avoid expensive string formatting and parsing, and to ensure compatibility with the OpenTelemetry SDK.

Comment on lines +277 to +281
with self._tracer_lock:
# this step is needed to propagate span ID correctly
self._tracer.id_generator = _GrpcIdGenerator(
trace_id=tracing_data.trace_id, span_id=tracing_data.span_id
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The self._tracer_lock is an instance variable, but the Tracer instance returned by get_tracer is typically shared across the entire process for a given name and version (cached by the TracerProvider). If multiple _OpenTelemetryPlugin instances are active, they will share the same Tracer but use different locks, leading to a race condition when monkey-patching self._tracer.id_generator. Consider using a class-level lock or a lock associated with the tracer name to ensure thread safety when modifying the shared tracer state.

Comment on lines +566 to +568
current_span = trace.get_current_span(
context=self._plugins[0].get_trace_context()
).get_span_context()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Accessing self._plugins[0] without checking if self._plugins is non-empty can lead to an IndexError. This situation might occur if the OpenTelemetryPlugin is initialized without any active providers (e.g., both meter_provider and tracer_provider are None).

        if not self._plugins:
            return (b"TRACE_ID", b"")
        current_span = trace.get_current_span(
            context=self._plugins[0].get_trace_context()
        ).get_span_context()

Comment on lines +263 to 264
event.time_stamp =
absl::FormatTime("%Y-%m-%d %H:%M:%E3S", absl::Now(), absl::UTCTimeZone());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Formatting the timestamp as a human-readable string here and then parsing it back in Python is inefficient and prone to errors. Since OpenTelemetry requires nanoseconds since epoch for its Python SDK, it would be better to store and pass the timestamp as an int64_t using absl::ToUnixNanos(absl::Now()).

asheshvidyut pushed a commit to a-detiste/grpc that referenced this pull request Jun 10, 2026
Closes grpc#39800 and grpc#39061

Work done:
1. Added metrics exporting mechanism for AsyncIO stack in Python
2. Added AsyncIO example with observability plugin enabled. Metrics are now correctly recorded (as per sync stack). Tested locally with Python 3.12.
3. Added AsyncIO observability test suite, to confirm metrics collection for all possible RPC's:
* unary unary
* unary stream
* stream unary
* stream stream

Caveats for the current solution:
1. All AsyncIO RPCs currently behave as unregistered methods from a metrics pespective (`grpc.method` field set to `other`)
2. xDS Observability support in for AsyncIO stack is out of scope.

ToDo's:
1. Once grpc#40556 is merged - tracing mechanism will be introduced for AsyncIO stack

Closes grpc#41573

PiperOrigin-RevId: 882655402
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lang/Python release notes: yes Indicates if PR needs to be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants