Skip to content

perf(async): remove unused sequence array and eliminate string copies#535

Merged
kcenon merged 4 commits into
mainfrom
perf/533-remove-unused-sequence
Mar 20, 2026
Merged

perf(async): remove unused sequence array and eliminate string copies#535
kcenon merged 4 commits into
mainfrom
perf/533-remove-unused-sequence

Conversation

@kcenon

@kcenon kcenon commented Mar 20, 2026

Copy link
Copy Markdown
Owner

What

Two performance improvements in the async logging pipeline:

  1. Remove unused sequence_ array from lockfree_spsc_queue (redundant with cells_[i].sequence)
  2. Eliminate unnecessary string copies in high_performance_async_writer hot path

Change Type

  • Feature (performance optimization)

Affected Components

  • src/impl/async/lockfree_queue.h — Remove redundant sequence_ array
  • src/impl/async/high_performance_async_writer.cpp — Construct batch_entry directly with rvalues

Why

  • The sequence_ array was a separate std::array<std::atomic<size_t>, Size> that duplicated the per-cell sequence field already present in the cell struct — wasting cache lines
  • The async writer was creating temporary std::string variables for file, function names before passing to batch_entry constructor — unnecessary copies in a hot path

Closes #533
Closes #532

How

Implementation

  1. lockfree_queue.h: Removed alignas(cache_line_size) std::array<std::atomic<size_t>, Size> sequence_ and updated initialization loop to use cells_[i].sequence instead
  2. high_performance_async_writer.cpp: Eliminated intermediate std::string variables; pass to_string() rvalues directly to batch_entry constructor for move semantics

Testing

  • Initialization loop still correctly sets sequence numbers for ABA prevention
  • batch_entry construction semantics unchanged (values moved, not copied)

kcenon added 3 commits March 20, 2026 10:36
The separate sequence_ array was redundant because each cell already
contains its own atomic sequence number (cell::sequence) which is used
by enqueue_impl() and dequeue(). The standalone array was initialized
in the constructor but never read afterward, wasting Size *
sizeof(std::atomic<size_t>) bytes per queue instance.

Update the constructor to initialize cells_[i].sequence directly
instead of the removed array.

Closes #533
Pass to_string() rvalues directly to batch_entry constructor instead of
storing them in intermediate local variables. The batch_entry constructor
already accepts std::string by value and moves internally, so passing
temporaries directly enables move semantics and avoids extra copies on
each write() call.

Closes #532
Add Performance entries to [Unreleased] section for unused sequence
array removal and string copy elimination in hot path.
@kcenon kcenon added performance Performance improvements priority:high High priority issue core Core functionality labels Mar 20, 2026
@kcenon

kcenon commented Mar 20, 2026

Copy link
Copy Markdown
Owner Author

CI/CD Failure Analysis

Analysis Time: 2026-03-20 11:00 KST
Attempt: #1

Failed Workflows

Workflow Job Root Cause Category
CI Core-Only Build (macos-14) std::format check fails — Homebrew LLVM Clang 15.0.7 resolved instead of Apple Clang Environment
CI macos-14 / clang Same as above Environment
Code Coverage Coverage Analysis Ubuntu PPA (ppa.launchpadcontent.net) network timeout Transient
Integration Tests FetchContent Pilot (Ubuntu/gcc/Debug) Same PPA timeout Transient
Performance Regression Performance Regression Check Same PPA timeout → gcc-13 not found Transient

Root Cause Analysis

macOS std::format failure (2 jobs):

The CI workflow uses clang/clang++ without explicit paths. On the previous main branch run (2026-03-19), this resolved to Apple Clang 15.0.0.15000309 which supports std::format. On this PR run (2026-03-20), it resolves to Homebrew LLVM Clang 15.0.7 which does NOT support std::format.

The macos-14 runner image appears to have been updated, changing the PATH resolution order for clang++.

Main (pass): AppleClang 15.0.0.15000309 → HAS_STD_FORMAT: Success
PR   (fail): Clang 15.0.7 (Homebrew)    → HAS_STD_FORMAT: Failed

Ubuntu PPA timeout (3 jobs):

Workflows on ubuntu-22.04 depend on ppa:ubuntu-toolchain-r/test to install gcc-13. The PPA server was unreachable during the CI run, causing apt-get update to fail with exit code 100.

Proposed Fix

Issue Proposed Solution Files Affected
macOS compiler path Use explicit /usr/bin/clang and /usr/bin/clang++ on macOS to ensure Apple Clang is used .github/workflows/ci.yml
Ubuntu PPA timeout Transient — will be resolved by re-run after push N/A

Next Steps

  • Apply macOS compiler path fix
  • Push and monitor CI
  • Verify Ubuntu PPA jobs pass on re-run

Automated failure analysis - Attempt #1

The macos-14 runner image updated and clang++ now resolves to
Homebrew LLVM Clang 15.0.7 instead of Apple Clang 15.0.0,
which lacks std::format support. Use /usr/bin/clang(++) to
ensure the system Apple Clang is used.
@github-actions

Copy link
Copy Markdown
Contributor

Performance Regression Check

Benchmark Base (ns) PR (ns) Delta
BM_AsyncDecorator 194.1 191.2 -1.5%
BM_BufferedAsyncDecorator 125.6 118.3 🚀 -5.8%
BM_BufferedDecorator 904.7 917.8 +1.4%
BM_ConsoleAsyncDecorator 286.3 383.0 ⚠️ +33.8%
BM_DirectConsoleWriter 5653.0 5880.6 +4.0%
BM_DirectFileWriter 774.5 783.0 +1.1%
BM_ManualNesting_Async 199.1 179.9 🚀 -9.6%
BM_ManualNesting_BufferedAsync 116.9 118.9 +1.7%
BM_ObjectPool_HighContention/real_time/threads:4 1014.7 1024.9 +1.0%
BM_ObjectPool_HighContention/real_time/threads:8 1168.5 1162.4 -0.5%
BM_ObjectPool_MultiThread/real_time/threads:1 18.6 18.6 +0.2%
BM_ObjectPool_MultiThread/real_time/threads:2 123.3 122.6 -0.6%
BM_ObjectPool_MultiThread/real_time/threads:4 103.3 103.5 +0.2%
BM_ObjectPool_MultiThread/real_time/threads:8 117.5 124.0 ⚠️ +5.5%
BM_ObjectPool_SingleThread 18.0 18.1 +0.3%
BM_ThreadLocalObjectPool_CacheEfficiency/16/real_time/threads:4 1.3 1.3 +0.1%
BM_ThreadLocalObjectPool_CacheEfficiency/32/real_time/threads:4 1.3 1.3 +0.0%
BM_ThreadLocalObjectPool_CacheEfficiency/4/real_time/threads:4 1.3 1.3 +0.1%
BM_ThreadLocalObjectPool_CacheEfficiency/64/real_time/threads:4 1.3 1.3 -0.0%
BM_ThreadLocalObjectPool_CacheEfficiency/8/real_time/threads:4 1.3 1.4 +0.8%
BM_ThreadLocalObjectPool_HighContention/real_time/threads:4 192.7 194.8 +1.1%
BM_ThreadLocalObjectPool_HighContention/real_time/threads:8 183.7 189.7 +3.3%
BM_ThreadLocalObjectPool_MultiThread/real_time/threads:1 3.7 3.7 +0.2%
BM_ThreadLocalObjectPool_MultiThread/real_time/threads:2 14.7 14.6 -0.7%
BM_ThreadLocalObjectPool_MultiThread/real_time/threads:4 18.7 18.7 -0.1%
BM_ThreadLocalObjectPool_MultiThread/real_time/threads:8 17.5 17.6 +0.6%
BM_ThreadLocalObjectPool_SingleThread 2.8 2.8 -0.0%
BM_Throughput_LargeMessages 188.2 169.5 🚀 -9.9%
BM_Throughput_SmallMessages 192.3 171.7 🚀 -10.7%

Threshold: >5% regression triggers warning
Result: ⚠️ 2 benchmark(s) show potential regression

@kcenon kcenon merged commit 7a06725 into main Mar 20, 2026
33 checks passed
@kcenon kcenon deleted the perf/533-remove-unused-sequence branch March 20, 2026 04:21
kcenon added a commit that referenced this pull request Apr 13, 2026
…#535)

* perf(lockfree): remove unused sequence_ array from lockfree_spsc_queue

The separate sequence_ array was redundant because each cell already
contains its own atomic sequence number (cell::sequence) which is used
by enqueue_impl() and dequeue(). The standalone array was initialized
in the constructor but never read afterward, wasting Size *
sizeof(std::atomic<size_t>) bytes per queue instance.

Update the constructor to initialize cells_[i].sequence directly
instead of the removed array.

Closes #533

* perf(async): eliminate string copies in async writer hot path

Pass to_string() rvalues directly to batch_entry constructor instead of
storing them in intermediate local variables. The batch_entry constructor
already accepts std::string by value and moves internally, so passing
temporaries directly enables move semantics and avoids extra copies on
each write() call.

Closes #532

* docs: update CHANGELOG for #532, #533

Add Performance entries to [Unreleased] section for unused sequence
array removal and string copy elimination in hot path.

* fix(ci): use explicit Apple Clang path on macOS runners

The macos-14 runner image updated and clang++ now resolves to
Homebrew LLVM Clang 15.0.7 instead of Apple Clang 15.0.0,
which lacks std::format support. Use /usr/bin/clang(++) to
ensure the system Apple Clang is used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core functionality performance Performance improvements priority:high High priority issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(lockfree): remove unused sequence_ array from lockfree_spsc_queue perf(async): eliminate string copies in high_performance_async_writer hot path

1 participant