Use threading.Events to communicate between shutdown and export #4511

DylanRussell wants to merge 7 commits into open-telemetry:main
Conversation
Use `threading.Event()` to communicate when an export call is occurring, so that shutdown waits for the export call to finish. Use `threading.Event()` to communicate when shutdown is occurring, so that the sleep in export is interrupted if a shutdown is occurring.
...pentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/exporter.py
```python
            metadata=self._headers,
            timeout=self._timeout,
        try:
            self._export_not_occuring.clear()
```
The usage of `_export_not_occuring` looks like a lock to me. Is there a benefit to using an event for it?
Using an event allows the export thread to communicate to shutdown that there is / is not a pending RPC. In Shutdown we call the wait() method that blocks until the flag is true.
The problem with the lock is that export gives it up, only to immediately re-acquire it. When 2 threads ask for a lock there's no guarantee on which gets it.
If the behavior that we want is for shutdown to block for any pending RPC and otherwise execute I think an event is best.
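A minimal runnable sketch of that event-as-gate behavior (the `export_not_occurring` name and the timings here are illustrative assumptions, not the PR's actual code): export clears the flag while an RPC is in flight, and a shutdown-style `wait()` blocks only until the flag is set again.

```python
import threading
import time

export_not_occurring = threading.Event()
export_not_occurring.set()  # flag true means no RPC is pending

def export_once():
    export_not_occurring.clear()  # RPC in progress
    try:
        time.sleep(0.2)  # stand-in for the actual RPC
    finally:
        export_not_occurring.set()  # RPC done; wakes any waiter

t = threading.Thread(target=export_once)
t.start()
time.sleep(0.05)  # give the export a moment to start

start = time.monotonic()
export_not_occurring.wait(timeout=5.0)  # what shutdown would do
waited = time.monotonic() - start
t.join()
```

Unlike handing a lock back and forth, the `wait()` here never competes with the export thread for re-acquisition; it simply observes the flag.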
Maybe I'm missing something, but if you're doing

```python
while True:
    if shutdown_occuring.is_set(): return
    event.clear()
    export()
    event.set()
```

there is no guarantee that the thing waiting for the event will have run and set shutdown_occuring before export() gets called again. I think even switching to a lock doesn't necessarily solve everything. Might need to rethink the approach a little.
There must be some delay for the shutdown thread to receive that notification and set shutdown_occurring, but that must be really small.
I'm sure it's less than the sleeps in the retry loop (I think my test covers this, but I'll double check). I can probably test and see exactly how small that delay is. Conceivably a new export call could occur in the milliseconds or nanoseconds it takes. If that happens shutdown will have proceeded and closed the channel which will interrupt this export call. I don't think it's that bad for this to happen, and it's unlikely. Still an improvement on the current behavior IMO.
```python
                delay,
            )
            self._shutdown_occuring.wait(delay)
            continue
```
IMO this would be a little clearer to just return here
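The `wait(delay)` call in the hunk above is what makes the retry sleep interruptible. A small sketch of the idea, under an assumed `shutdown_occurring` event mirroring the diff's name:

```python
import threading
import time

shutdown_occurring = threading.Event()  # assumed name, mirroring the diff

def backoff_wait(delay: float) -> bool:
    # Event.wait returns True as soon as the flag is set, so a shutdown
    # signalled from another thread cuts the retry delay short.
    return shutdown_occurring.wait(delay)

# Signal shutdown 0.1s into a nominal 5s backoff.
threading.Timer(0.1, shutdown_occurring.set).start()
start = time.monotonic()
interrupted = backoff_wait(5.0)
elapsed = time.monotonic() - start
```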
exporter/opentelemetry-exporter-otlp-proto-grpc/tests/test_otlp_exporter_mixin.py
```python
    def run(self):
        if self._target is not None:  # type: ignore
            self._return = self._target(*self._args, **self._kwargs)  # type: ignore
```
Should we include the cleanup from the original run function or is that not a concern here?
Suggested change:

```diff
 def run(self):
-    if self._target is not None:  # type: ignore
-        self._return = self._target(*self._args, **self._kwargs)  # type: ignore
+    try:
+        if self._target is not None:
+            self._return = self._target(*self._args, **self._kwargs)
+    finally:
+        # Avoid a refcycle if the thread is running a function with
+        # an argument that has a member that points to the thread.
+        del self._target, self._args, self._kwargs
```
```python
    def join(self, *args):  # type: ignore
        threading.Thread.join(self, *args)
```
nit: Could we avoid type ignore by explicitly passing the expected type?
Suggested change:

```diff
-def join(self, *args):  # type: ignore
-    threading.Thread.join(self, *args)
+def join(self, timeout: float | None = None) -> Any:
+    threading.Thread.join(self, timeout=timeout)
```
```diff
 # value will remain constant.
-for delay in _create_exp_backoff_generator(max_value=max_value):
-    if delay == max_value or self._shutdown:
+for delay in [1, 2, 4, 8, 16, 32]:
```
Should it include 64 as well like max_value before?
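For reference, a generator in the spirit of `_create_exp_backoff_generator` (the implementation below is a guess from the call site, not the library's actual code) keeps doubling until it reaches `max_value` — so with `max_value=64` the hardcoded list would indeed need a final 64 to match:

```python
from itertools import count

def exp_backoff(max_value: int):
    """Yield 1, 2, 4, ... capped at max_value, then repeat max_value."""
    for i in count():
        delay = 2 ** i
        if delay >= max_value:
            while True:
                # Callers can treat `delay == max_value` as the stop signal.
                yield max_value
        yield delay

gen = exp_backoff(64)
delays = [next(gen) for _ in range(7)]  # → [1, 2, 4, 8, 16, 32, 64]
```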
```diff
 def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
-    if self._shutdown:
+    if self._shutdown_occuring.is_set():
```
This already had the same problem, but shutdown() is not thread safe. I guess for this PR we can assume only one thread calls it.
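One way to make that check-and-set atomic would be a small lock around it — a hypothetical sketch (class and attribute names are assumptions, not the PR's code) of an idempotent, thread-safe shutdown:

```python
import threading

class ExporterSketch:
    def __init__(self):
        self._shutdown_lock = threading.Lock()
        self._shutdown_occurring = threading.Event()
        self.close_count = 0  # stand-in for "channel closed" side effects

    def shutdown(self, timeout_millis: float = 30_000, **kwargs) -> None:
        # The lock makes is_set + set atomic, so exactly one caller
        # proceeds past this block even under concurrent calls.
        with self._shutdown_lock:
            if self._shutdown_occurring.is_set():
                return
            self._shutdown_occurring.set()
        self.close_count += 1  # the real teardown would happen here

exp = ExporterSketch()
threads = [threading.Thread(target=exp.shutdown) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```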
This PR has been automatically marked as stale because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 days of this comment.
This PR has been closed due to inactivity. Please reopen if you would like to continue working on it. |
Description
It seems like the behavior we want is for `Shutdown()` to interrupt the `sleep` call in `export`, so we don't idle only to report failure. This PR accomplishes this via threading events:

- An event for `export` to communicate to `shutdown` that an RPC is in progress, and to wait until it's done or the shutdown timeout finishes.
- An event for `shutdown` to communicate to `export` that shutdown is happening, and it doesn't need to `sleep`.

We use these 2 events to communicate between the 2 threads. AFAIK there are only 2 threads we need to worry about: one thread where `export` is repeatedly called, and the main thread where `shutdown` is called.

Note that this PR also fixes a bug where we were needlessly sleeping for 32 seconds only to report failure, because we would simply break out of the loop in the next iteration. I also did some minor code cleanup in the exporters in this PR.
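An end-to-end sketch of the two-event scheme just described (names, delays, and the retry loop are illustrative assumptions, not the PR's code): one event lets shutdown interrupt export's backoff, the other lets shutdown wait out an in-flight RPC.

```python
import threading
import time

export_not_occurring = threading.Event()
export_not_occurring.set()
shutdown_occurring = threading.Event()
log = []

def export_with_retries():
    for delay in [0.1, 0.2, 0.4]:
        if shutdown_occurring.is_set():
            log.append("aborted")
            return
        export_not_occurring.clear()
        try:
            log.append("attempt")  # stand-in for a failing RPC
        finally:
            export_not_occurring.set()
        # Interruptible backoff: wakes early if shutdown is signalled.
        if shutdown_occurring.wait(delay):
            log.append("interrupted")
            return
    log.append("gave up")

def shutdown():
    shutdown_occurring.set()        # stop export from sleeping further
    export_not_occurring.wait(5.0)  # block on any in-flight RPC

t = threading.Thread(target=export_with_retries)
t.start()
time.sleep(0.05)
shutdown()
t.join()
```

Because the retry delay is an `Event.wait`, shutdown ends the loop in milliseconds instead of letting it sleep through the remaining backoff.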
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Still need to write tests. Putting this out there now to get early feedback.
Does This PR Require a Contrib Repo Change?
Checklist: