Fix MSan false positive: unpoison TryResult at fiber boundary#101566
Fix MSan false positive: unpoison TryResult at fiber boundary#101566groeneai wants to merge 2 commits intoClickHouse:masterfrom
Conversation
The `TryResult` in `ConnectionEstablisherAsync` is populated inside a fiber whose stack is allocated via `aligned_alloc`. MSan treats such memory as uninitialized and cannot track writes through `boost::context`'s uninstrumented assembly for fiber switches. This restores the `__msan_unpoison` call in `getResult()` that was previously added in 262fbda and later removed in 31e4980 when the blanket fiber stack unpoisoning was introduced. The blanket approach (unpoisoning at allocation time) proved insufficient — STID 4179-5154 spiked from 4 hits to 19 hits in 30 days after PR ClickHouse#98677 expanded ARM64 CI coverage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
cc @azat @alexey-milovidov — could you review this? It restores the |
Pre-PR Validation GateSession: a) Deterministic repro? b) Root cause explained? c) Fix matches root cause? d) Test intent preserved? e) Demonstrated in both directions? f) Fix is general? |
src/Client/ConnectionEstablisher.h
Outdated
| /// The result is populated inside a fiber whose stack is allocated via aligned_alloc. | ||
| /// MSan treats such memory as uninitialized and cannot track writes through |
There was a problem hiding this comment.
You're right — the comment is wrong. result is a member of the class object, not on the fiber stack. The writes through the reference in ConnectionEstablisher::run() ARE tracked by MSan. The boost::context assembly only handles the stack switch and doesn't interfere with MSan's shadow tracking of a heap/stack member.
The false positive most likely comes from struct padding in TryResult — the padding gaps (between is_up_to_date/delay and after is_readonly) are never explicitly initialized. Neither the defaulted constructor (in-class initializers only set named fields) nor run() writes to padding bytes. When erase_if in getMany() moves TryResult objects in the vector, the compiler-generated move can read padding, and MSan flags it. Why it manifests only on ARM64 — I'm not certain; possibly architecture-specific struct copy codegen.
Empirically: the original per-getResult() unpoison (commit 262fbda) eliminated the issue for 10 days; removing it in favor of the blanket fiber-stack unpoison reintroduced it — precisely because result is not on the fiber stack.
I'll fix the code comment. Would you prefer a different approach entirely (e.g., zero-initializing TryResult at the source to eliminate uninitialized padding)?
There was a problem hiding this comment.
Much better approach. Proper MSan fiber annotations (BOOST_USE_MSAN → __msan_start_switch_fiber/__msan_finish_switch_fiber) address all fiber-related MSan false positives at the right level, not just this specific STID.
Happy to close this PR in favor of #101621.
There was a problem hiding this comment.
You gave up too early, let's reason about it
The previous comment incorrectly claimed MSan couldn't track writes through boost::context's uninstrumented assembly. In reality, `result` is a class member (not on the fiber stack), and writes to it ARE instrumented. The false positive is caused by struct padding in TryResult that is never explicitly initialized. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@groeneai give me an example of CI report |
|
Superseded by #101621 — proper MSan fiber annotations via |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
This is an internal MSan false positive fix, not user-visible.
Description
Problem:
MemorySanitizer: use-of-uninitialized-value(STID 4179-5154) inPoolWithFailoverBase::getMany()→std::erase_if→vector::erase, exclusively onStress test (arm_msan). After PR #98677 expanded ARM64 CI coverage, the failure rate spiked from ~4 hits/month to 15+ hits/day (19 hits in 30 days across 15 distinct PRs + master).Root cause:
TryResulthas struct padding bytes (betweenis_up_to_date/delayand afteris_readonly) that are never explicitly initialized — neither by the defaulted constructor (in-class initializers only set named fields) nor byConnectionEstablisher::run()(which writes individual fields through a reference). Whenerase_ifingetMany()movesTryResultobjects in the vector, the compiler-generated move can read these uninitialized padding bytes, and MSan flags them.Note:
resultis a class member ofConnectionEstablisherAsync, not on the fiber stack. Writes to it inConnectionEstablisher::run()go through an instrumented reference and ARE tracked by MSan. The previous blanket__msan_unpoisoninFiberStack::allocate()was insufficient precisely because it targets the fiber stack, not the class member.Fix: Restore
__msan_unpoison(&result, sizeof(result))inConnectionEstablisherAsync::getResult()to unpoison the entire struct including padding before returning. This exact fix was originally introduced in commit 262fbda and later removed in 31e4980.Evidence:
Stress test (arm_msan)— zero x86_64 MSAN hitsStack trace:
```
#0 vector::erase (vector.h:1172)
#1 erase_if → getMany lambda (PoolWithFailoverBase.h:349)
#2 getMany (PoolWithFailoverBase.h:349)
#3 getManyImpl (ConnectionPoolWithFailover.cpp:237)
#4 getMany (ConnectionPoolWithFailover.cpp:135)
#5-#12 RemoteQueryExecutor → sendQueryUnlocked
#13-#19 AsyncTaskExecutor → Fiber → boost::context::fiber_ucontext
```