Skip to content

Fix MSan false positive: unpoison TryResult at fiber boundary#101566

Closed
groeneai wants to merge 2 commits intoClickHouse:masterfrom
groeneai:fix-msan-unpoison-tryresult-fiber-boundary
Closed

Fix MSan false positive: unpoison TryResult at fiber boundary#101566
groeneai wants to merge 2 commits intoClickHouse:masterfrom
groeneai:fix-msan-unpoison-tryresult-fiber-boundary

Conversation

@groeneai
Copy link
Copy Markdown
Contributor

@groeneai groeneai commented Apr 1, 2026

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

This is an internal MSan false positive fix, not user-visible.

Description

Problem: MemorySanitizer: use-of-uninitialized-value (STID 4179-5154) in PoolWithFailoverBase::getMany()std::erase_ifvector::erase, exclusively on Stress test (arm_msan). After PR #98677 expanded ARM64 CI coverage, the failure rate spiked from ~4 hits/month to 15+ hits/day (19 hits in 30 days across 15 distinct PRs + master).

Root cause: TryResult has struct padding bytes (between is_up_to_date/delay and after is_readonly) that are never explicitly initialized — neither by the defaulted constructor (in-class initializers only set named fields) nor by ConnectionEstablisher::run() (which writes individual fields through a reference). When erase_if in getMany() moves TryResult objects in the vector, the compiler-generated move can read these uninitialized padding bytes, and MSan flags them.

Note: result is a class member of ConnectionEstablisherAsync, not on the fiber stack. Writes to it in ConnectionEstablisher::run() go through an instrumented reference and ARE tracked by MSan. The previous blanket __msan_unpoison in FiberStack::allocate() was insufficient precisely because it targets the fiber stack, not the class member.

Fix: Restore __msan_unpoison(&result, sizeof(result)) in ConnectionEstablisherAsync::getResult() to unpoison the entire struct including padding before returning. This exact fix was originally introduced in commit 262fbda and later removed in 31e4980.

Evidence:

Stack trace:
```
#0 vector::erase (vector.h:1172)
#1 erase_if → getMany lambda (PoolWithFailoverBase.h:349)
#2 getMany (PoolWithFailoverBase.h:349)
#3 getManyImpl (ConnectionPoolWithFailover.cpp:237)
#4 getMany (ConnectionPoolWithFailover.cpp:135)
#5-#12 RemoteQueryExecutor → sendQueryUnlocked
#13-#19 AsyncTaskExecutor → Fiber → boost::context::fiber_ucontext
```

The `TryResult` in `ConnectionEstablisherAsync` is populated inside a
fiber whose stack is allocated via `aligned_alloc`. MSan treats such
memory as uninitialized and cannot track writes through
`boost::context`'s uninstrumented assembly for fiber switches.

This restores the `__msan_unpoison` call in `getResult()` that was
previously added in 262fbda and later removed in 31e4980 when
the blanket fiber stack unpoisoning was introduced. The blanket
approach (unpoisoning at allocation time) proved insufficient — STID
4179-5154 spiked from 4 hits to 19 hits in 30 days after PR ClickHouse#98677
expanded ARM64 CI coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@groeneai
Copy link
Copy Markdown
Contributor Author

groeneai commented Apr 1, 2026

cc @azat @alexey-milovidov — could you review this? It restores the __msan_unpoison call in ConnectionEstablisherAsync::getResult() that was originally added in 262fbda and removed in 31e4980. The blanket fiber stack unpoisoning at allocation time proved insufficient — STID 4179-5154 spiked to 15+ hits/day on arm_msan after PR #98677 expanded the ARM64 stress test matrix.

@groeneai
Copy link
Copy Markdown
Contributor Author

groeneai commented Apr 1, 2026

Pre-PR Validation Gate

Session: cron:clickhouse-ci-task-worker:20260401-221500

a) Deterministic repro?
Cannot reproduce locally — all 19 hits are exclusively on Stress test (arm_msan) (ARM64 MSAN), with zero hits on any x86_64 MSAN job. No ARM64 MSAN build infrastructure available. However, CIDB shows 15 hits on April 1 alone (2 master + 13 PRs), confirming high reproducibility in CI.

b) Root cause explained?
Yes. ConnectionEstablisherAsync::Task::run() populates the result member inside a fiber whose stack is managed by boost::context's uninstrumented swapcontext assembly. When getResult() returns the TryResult across the fiber boundary, MSan considers the data (or its struct padding) uninitialized because MSan's shadow memory tracking is broken by the uninstrumented context switch. The erase_if in PoolWithFailoverBase::getMany() then triggers the false positive when moving elements in the vector.

c) Fix matches root cause?
Yes. __msan_unpoison(&result, sizeof(result)) in getResult() directly addresses the fiber boundary crossing. This is the exact same fix that was proven in commit 262fbda and eliminated the issue for 10 days (March 22-31) before the expanded CI coverage exposed the remaining false positive rate from the blanket-only approach.

d) Test intent preserved?
N/A — no test changes. This suppresses a known MSan false positive at the fiber boundary without affecting any test behavior.

e) Demonstrated in both directions?
Cannot be demonstrated locally (ARM64-only). Historical CIDB evidence: the original fix (commit 262fbda) eliminated STID 4179-5154 from March 22-31. After removal (commit 31e4980), the STID returned with 15 hits on April 1 once ARM64 CI coverage expanded.

f) Fix is general?
The fix targets the specific fiber boundary (ConnectionEstablisherAsync::getResult()) where TryResult crosses from fiber context to caller. Other call sites (HedgedConnectionsFactory lines 189, 291, 400, 417, 421) all call the same getResult() method, so they all benefit from this single fix point.

@alexey-milovidov alexey-milovidov requested a review from azat April 1, 2026 23:29
Comment on lines +78 to +79
/// The result is populated inside a fiber whose stack is allocated via aligned_alloc.
/// MSan treats such memory as uninitialized and cannot track writes through
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why, how so?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — the comment is wrong. result is a member of the class object, not on the fiber stack. The writes through the reference in ConnectionEstablisher::run() ARE tracked by MSan. The boost::context assembly only handles the stack switch and doesn't interfere with MSan's shadow tracking of a heap/stack member.

The false positive most likely comes from struct padding in TryResult — the padding gaps (between is_up_to_date/delay and after is_readonly) are never explicitly initialized. Neither the defaulted constructor (in-class initializers only set named fields) nor run() writes to padding bytes. When erase_if in getMany() moves TryResult objects in the vector, the compiler-generated move can read padding, and MSan flags it. Why it manifests only on ARM64 — I'm not certain; possibly architecture-specific struct copy codegen.

Empirically: the original per-getResult() unpoison (commit 262fbda) eliminated the issue for 10 days; removing it in favor of the blanket fiber-stack unpoison reintroduced it — precisely because result is not on the fiber stack.

I'll fix the code comment. Would you prefer a different approach entirely (e.g., zero-initializing TryResult at the source to eliminate uninitialized padding)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about #101621 ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better approach. Proper MSan fiber annotations (BOOST_USE_MSAN__msan_start_switch_fiber/__msan_finish_switch_fiber) address all fiber-related MSan false positives at the right level, not just this specific STID.

Happy to close this PR in favor of #101621.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You gave up too early, let's reason about it

@azat azat self-assigned this Apr 2, 2026
The previous comment incorrectly claimed MSan couldn't track writes through
boost::context's uninstrumented assembly. In reality, `result` is a class
member (not on the fiber stack), and writes to it ARE instrumented. The false
positive is caused by struct padding in TryResult that is never explicitly
initialized.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@azat
Copy link
Copy Markdown
Member

azat commented Apr 2, 2026

@groeneai give me an example of CI report

@groeneai
Copy link
Copy Markdown
Contributor Author

groeneai commented Apr 2, 2026

Superseded by #101621 — proper MSan fiber annotations via BOOST_USE_MSAN in boost::context.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants