[Python] Open Telemetry tracing (part 1)#40556
Conversation
|
At the moment gRPC Python team still lacks in-house knowledge of the o11y project, requesting @XuanWang-Amos to do the first pass. |
XuanWang-Amos
left a comment
There was a problem hiding this comment.
In general looks good to me, thanks for the work!
|
@XuanWang-Amos PR ready for second round |
e567895 to
e2b7421
Compare
Closes grpc#39800 and grpc#39061 Work done: 1. Added metrics exporting mechanism for AsyncIO stack in Python 2. Added AsyncIO example with observability plugin enabled. Metrics are now correctly recorded (as per sync stack). Tested locally with Python 3.12. 3. Added AsyncIO observability test suite, to confirm metrics collection for all possible RPC's: * unary unary * unary stream * stream unary * stream stream Caveats for the current solution: 1. All AsyncIO RPCs currently behave as unregistered methods from a metrics pespective (`grpc.method` field set to `other`) 2. xDS Observability support in for AsyncIO stack is out of scope. ToDo's: 1. Once grpc#40556 is merged - tracing mechanism will be introduced for AsyncIO stack Closes grpc#41573 PiperOrigin-RevId: 882655402
|
@asheshvidyut, @sergiitk - PR ready for the next round. Please mind that the |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements OpenTelemetry tracing support for gRPC Python by introducing span and event management, trace ID propagation, and integration with the OpenTelemetry SDK. The feedback highlights several important issues: timestamps are currently passed as strings which will cause TypeErrors in the OpenTelemetry SDK and should be converted to nanosecond integers; a race condition exists when monkey-patching the shared Tracer's ID generator due to instance-level locking; and a potential IndexError was identified when accessing the plugins list. It is also recommended to handle timestamps as int64_t in the C++ layer for better efficiency.
| span = self._tracer.start_span( | ||
| name=tracing_data.name, | ||
| context=local_ctx, | ||
| kind=trace.SpanKind.INTERNAL, | ||
| attributes=tracing_data.span_labels, | ||
| links=None, | ||
| start_time=tracing_data.start_time, | ||
| ) | ||
|
|
||
| for event in tracing_data.span_events: | ||
| span.add_event( | ||
| name=event["name"], | ||
| attributes=event["attributes"], | ||
| timestamp=event["time_stamp"], | ||
| ) | ||
| span.set_status(self._status_to_otel_status(tracing_data.status)) | ||
| span.end(end_time=tracing_data.end_time) |
There was a problem hiding this comment.
The start_time, end_time, and event time_stamp are passed as strings (e.g., "2023-05-29 17:07:09.895"), but the OpenTelemetry SDK expects these values to be integers representing nanoseconds since the epoch. Passing strings will lead to a TypeError in the SDK when it attempts to calculate span duration or during export.
It is highly recommended to change the data type of these fields to int64_t in the C++ layer and pass them as integers through Cython to avoid expensive string formatting and parsing, and to ensure compatibility with the OpenTelemetry SDK.
| with self._tracer_lock: | ||
| # this step is needed to propagate span ID correctly | ||
| self._tracer.id_generator = _GrpcIdGenerator( | ||
| trace_id=tracing_data.trace_id, span_id=tracing_data.span_id | ||
| ) |
There was a problem hiding this comment.
The self._tracer_lock is an instance variable, but the Tracer instance returned by get_tracer is typically shared across the entire process for a given name and version (cached by the TracerProvider). If multiple _OpenTelemetryPlugin instances are active, they will share the same Tracer but use different locks, leading to a race condition when monkey-patching self._tracer.id_generator. Consider using a class-level lock or a lock associated with the tracer name to ensure thread safety when modifying the shared tracer state.
| current_span = trace.get_current_span( | ||
| context=self._plugins[0].get_trace_context() | ||
| ).get_span_context() |
There was a problem hiding this comment.
Accessing self._plugins[0] without checking if self._plugins is non-empty can lead to an IndexError. This situation might occur if the OpenTelemetryPlugin is initialized without any active providers (e.g., both meter_provider and tracer_provider are None).
if not self._plugins:
return (b"TRACE_ID", b"")
current_span = trace.get_current_span(
context=self._plugins[0].get_trace_context()
).get_span_context()| event.time_stamp = | ||
| absl::FormatTime("%Y-%m-%d %H:%M:%E3S", absl::Now(), absl::UTCTimeZone()); |
There was a problem hiding this comment.
Formatting the timestamp as a human-readable string here and then parsing it back in Python is inefficient and prone to errors. Since OpenTelemetry requires nanoseconds since epoch for its Python SDK, it would be better to store and pass the timestamp as an int64_t using absl::ToUnixNanos(absl::Now()).
Closes grpc#39800 and grpc#39061 Work done: 1. Added metrics exporting mechanism for AsyncIO stack in Python 2. Added AsyncIO example with observability plugin enabled. Metrics are now correctly recorded (as per sync stack). Tested locally with Python 3.12. 3. Added AsyncIO observability test suite, to confirm metrics collection for all possible RPC's: * unary unary * unary stream * stream unary * stream stream Caveats for the current solution: 1. All AsyncIO RPCs currently behave as unregistered methods from a metrics pespective (`grpc.method` field set to `other`) 2. xDS Observability support in for AsyncIO stack is out of scope. ToDo's: 1. Once grpc#40556 is merged - tracing mechanism will be introduced for AsyncIO stack Closes grpc#41573 PiperOrigin-RevId: 882655402
Implementing gRFC A72 grpc/proposal/pull/474
ToDo's:
TextMapPropagatorobject - handled in a separate PR due to the huge number of code changes