Skip to content

fix(config): resolve data race in config_watcher inotify fd access#398

Merged
kcenon merged 2 commits into
mainfrom
fix/issue-397-data-race-in-config-watcher-inotify-fd-access
Mar 6, 2026
Merged

fix(config): resolve data race in config_watcher inotify fd access#398
kcenon merged 2 commits into
mainfrom
fix/issue-397-data-race-in-config-watcher-inotify-fd-access

Conversation

@kcenon

@kcenon kcenon commented Mar 6, 2026

Copy link
Copy Markdown
Owner

Closes #397

Summary

  • Make inotify_fd_ and watch_fd_ std::atomic<int> on Linux to eliminate the data race between cleanup_inotify() (main thread) and watch_loop_linux() (watcher thread)
  • Apply the same atomic exchange/load pattern already used for kqueue_fd_ on macOS/BSD
  • Remove the TSan suppression entry for config_watcher as the root cause is now fixed

Changes

  • include/kcenon/common/config/config_watcher.h: Convert plain int fd members to std::atomic<int>, use exchange(-1) in cleanup and load() in watch loop
  • sanitizers/tsan_suppressions.txt: Remove config_watcher suppression entries

Test Plan

  • Full build passes (79/79 targets)
  • common_config_watcher_test passes
  • 121/122 tests pass (1 pre-existing failure unrelated to this change)
  • CI TSan job passes without suppression

Make inotify_fd_ and watch_fd_ atomic integers on Linux to eliminate
the data race between cleanup_inotify() and watch_loop_linux().

This follows the same pattern already used for kqueue_fd_ on macOS/BSD:
- Use std::atomic<int> for file descriptors shared across threads
- Use atomic exchange in cleanup to safely invalidate descriptors
- Use atomic load in watch loop to read current descriptor values

Remove the corresponding TSan suppression as the root cause is fixed.

Closes #397
@kcenon

kcenon commented Mar 6, 2026

Copy link
Copy Markdown
Owner Author

CI/CD Failure Analysis

Analysis Time: 2026-03-06 12:55 UTC
Attempt: #1

Failed Workflows

Workflow Job Step Status
CI Sanitizer / thread Run tests Failed

Root Cause Analysis

Primary Error:

WARNING: ThreadSanitizer: data race (pid=4020)
  Write of size 8 at 0x7ba000000040 by main thread:
    #0 close
    #1 config_watcher::cleanup_inotify() config_watcher.h:453
  Previous read of size 8 at 0x7ba000000040 by thread T5:
    #0 read
    #1 config_watcher::watch_loop_linux() config_watcher.h:478
  Location is file descriptor 4

Analysis:
The race is not on the atomic integer variable itself, but on the underlying file descriptor resource. Even with atomic loads, the watcher thread may copy the fd value into a local variable and begin a read() syscall while the main thread simultaneously calls close() on the same fd. TSan detects this as a race on the file descriptor.

The root cause is in stop() which calls cleanup_platform_watcher() (closing the fd) before join() — so the watcher thread may still be using the fd.

Identified Issues:

  1. stop() closes file descriptors before the watcher thread has exited
  2. Need to reorder: signal stop → join thread → close fds

Proposed Fix

Issue Proposed Solution Files Affected
FD close before thread exit Reorder stop(): set running_=false, join thread, then cleanup fds. Use a self-pipe (eventfd) to wake up poll() immediately instead of closing the inotify fd. config_watcher.h

Next Steps

  • Apply proposed fixes
  • Verify locally
  • Push and monitor CI

Automated failure analysis - Attempt #1

The previous atomic fix addressed the integer variable race but not the
underlying file descriptor race: close() on the main thread could race
with read() on the watcher thread operating on the same fd value.

Fix by restructuring the shutdown sequence:
- Add eventfd for clean shutdown signaling on Linux
- Poll both inotify fd and shutdown fd in watch loop
- Reorder stop(): signal -> join -> cleanup (close fds only after
  the watcher thread has exited)

This eliminates the TSan-detected race between close() and read() on
the inotify file descriptor.
@kcenon

kcenon commented Mar 6, 2026

Copy link
Copy Markdown
Owner Author

CI/CD Fix Verification - Attempt #1 Result

Status: All checks passing

Fix Applied

The eventfd-based shutdown signaling approach successfully resolved the TSan data race:

  1. Added eventfd for clean thread shutdown signaling on Linux
  2. Reordered stop() sequence: signal → join → cleanup (close fds only after thread exit)
  3. Poll both inotify fd and shutdown fd in the watch loop for immediate wakeup

CI Results

Workflow Status
CI (all jobs including Sanitizer/thread) Passed
Code Coverage Passed
CVE Security Scan Passed
Integration Tests Passed
Static Analysis Passed
Performance Benchmark Passed
SBOM Generation Passed

All 27 checks passing. Ready for review.


Automated fix verification - Attempt #1 of 3 (resolved)

@kcenon kcenon merged commit 12f8789 into main Mar 6, 2026
27 checks passed
@kcenon kcenon deleted the fix/issue-397-data-race-in-config-watcher-inotify-fd-access branch March 6, 2026 13:08
kcenon added a commit that referenced this pull request Apr 13, 2026
)

* fix(config): resolve data race in config_watcher inotify fd access

Make inotify_fd_ and watch_fd_ atomic integers on Linux to eliminate
the data race between cleanup_inotify() and watch_loop_linux().

This follows the same pattern already used for kqueue_fd_ on macOS/BSD:
- Use std::atomic<int> for file descriptors shared across threads
- Use atomic exchange in cleanup to safely invalidate descriptors
- Use atomic load in watch loop to read current descriptor values

Remove the corresponding TSan suppression as the root cause is fixed.

Closes #397

* fix(config): use eventfd for shutdown signaling to avoid fd race

The previous atomic fix addressed the integer variable race but not the
underlying file descriptor race: close() on the main thread could race
with read() on the watcher thread operating on the same fd value.

Fix by restructuring the shutdown sequence:
- Add eventfd for clean shutdown signaling on Linux
- Poll both inotify fd and shutdown fd in watch loop
- Reorder stop(): signal -> join -> cleanup (close fds only after
  the watcher thread has exited)

This eliminates the TSan-detected race between close() and read() on
the inotify file descriptor.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(config): Data race in config_watcher inotify fd access

1 participant