
Conversation

@HowHsu

@HowHsu HowHsu commented Jun 21, 2025

TL;DR
This patchset proposes a new design for the scriptcheck worker pool, based
on atomic variables, to leverage more CPU resources on a machine than the
current mutex version can. It achieves roughly a 45% performance boost.

Main Idea:

The current version of the worker pool uses a mutex for synchronization
between worker threads, which causes performance to degrade once
worker_threads_num >= 14. One root cause is the high contention on the
queue's mutex among worker threads: workers repeatedly sleep and are woken
up within a short time, leading to heavy context switching.
The idea is to protect the queue with atomic variables instead: workers
claim work by atomically advancing an index by a batch_size (a small
number). This significantly reduces context switching.

Implementation

To implement it, the following changes are made (enabled in atomic mode):
- atomic<int> nNext indicates the next available index in the queue
- atomic<int> nTodo2 indicates the number of tasks left
- nBatchSize is the number of tasks a worker consumes at a time; it should
  be small to keep the workload balanced among workers
- m_mutex is still used, to protect m_result and the condition variables
- fAtomic indicates which mode is enabled
- about the queue:

  - the queue is used as a queue, not as a stack any more
  - the queue is 'add only', meaning tasks are only ever added to it; used
    tasks are not removed.

    The two points above are a must to make this idea achievable, and the
    second one sacrifices space.

  - the workers are started only after all the tasks have been added to the
    queue

    This is a trade-off to keep the implementation simple: a worker can
    compare nNext + nBatchSize against queue.size() to find out whether
    undone tasks remain in the queue; otherwise another atomic<int> tail
    would be needed, and a race condition may exist in that approach (needs
    more investigation). From my tests, it's not entirely accurate to call
    this a 'trade-off', since it also brings benefits. Pros and cons:
    - pros:
        1. no contention between the producer and consumers when adding
           tasks to the queue
        2. the adding period is very short compared to the verification
           period
    - cons:
        1. the workers start later

For details, please refer to the worker handler code.

Tests:

- Numbers
  - For blocks/s numbers
    `./build/bin/bench_bitcoin --filter=ConnectBlockMixedEcdsaSchnor`
  - For cpu usage numbers
    `NANOBENCH_ENDLESS=ConnectBlockAllEcdsa ./build/bin/bench_bitcoin --filter=ConnectBlockAllEcdsa`

  - Some data is missing because the machine was billed by time, and there wasn't
    enough time to complete all measurements.
  - Some cells contain a range instead of a single value due to high fluctuation.
  - The numbers represent the average of multiple test runs.

           |     mutex version      |   atomic version
nr_workers | blocks/s  | cpu usage  | blocks/s  | cpu usage
-----------|-----------|------------|-----------|-------------
0          | 4.22      | 100%       | 4.2       | 100%
1          | 8.23      | 198%       | 8         | 191.3%
2          | 12.08     | 293.7%     | 11.6      | 276%
3          | 16        | 386.7%     | 14.9      | 354.5%
4          | 19.8      | 476.3%     |           |
5          | 23.12     | 565.4%     |           |
6          | 27.3      | 654%       | 23        | 560%
7          | 28.5      | 738%       |           |
8          | 33.8      | 809%       |           |
9          | 33.5      | 868%       | 28.5      | 730%
10         | 37.4      | 910%       | 30.5      | 777%
11         | 33~40     | 880%~928%  | 32        | 825%
12         | 22~39     | 860%~925%  | 32.5      | 830%
13         | 27.8~37   | 700%~913%  | 34.8      | 928%
15         | 24.8~38.9 | 640%~719%  | 38        | 1000%
19         | 18~23     | 370%~510%  | 41        | 1161%
22         | 7~18      | 270%~400%  |           |
25         | 6.4       | 260%       | 46        | 1400%
30         | 5         | 270%       | 49        | 1560%
60         | 5         | 280%       | 50~58     | 2100%

- Env:
 A virtual machine in the cloud
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Address sizes:         46 bits physical, 48 bits virtual
   Byte Order:            Little Endian
   CPU(s):                64
   On-line CPU(s) list:   0-63
   Vendor ID:             GenuineIntel
   Model name:            Intel Xeon Processor (Cascadelake)
   CPU family:            6
   Model:                 85
   Thread(s) per core:    2
   Core(s) per socket:    16
   Socket(s):             2
   Stepping:              6
   BogoMIPS:              5985.93
 Virtualization features:
   Hypervisor vendor:     KVM
   Virtualization type:   full
 Caches (sum of all):
   L1d:                   2 MiB (64 instances)
   L1i:                   2 MiB (64 instances)
   L2:                    128 MiB (32 instances)
   L3:                    32 MiB (2 instances)
 NUMA:
   NUMA node(s):          2
   NUMA node0 CPU(s):     0-31
   NUMA node1 CPU(s):     32-63

Signed-off-by: Hao Xu <hao.xu@linux.dev>

@DrahtBot
Contributor

DrahtBot commented Jun 21, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32791.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #32699 (docs: adds correct updated documentation links by Zeegaths)
  • #32692 (checkqueue: set MAX_SCRIPTCHECK_THREADS to nCores - 1 by HowHsu)
  • #31132 (validation: fetch block inputs on parallel threads 10% faster IBD by andrewtoth)
  • #28690 (build: Introduce internal kernel library by TheCharlatan)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot
Contributor

🚧 At least one of the CI tasks failed.
Task no wallet, libbitcoinkernel: https://github.com/bitcoin/bitcoin/runs/44527955551
LLM reason (✨ experimental): The CI failed due to a linker error: undefined reference to GetNumCores(), indicating a missing implementation or linkage of this function.

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@HowHsu
Author

HowHsu commented Jun 22, 2025

hmmm... OK, I realized that this may only be suitable for the case when all the prevout Coins are in CCoinsViewCache; otherwise Add() is not fast, due to the I/O there.

@HowHsu HowHsu force-pushed the checkqueue_atomic branch 10 times, most recently from e9bb83c to 9501cb7, on June 26, 2025 14:15
@HowHsu
Author

HowHsu commented Jun 26, 2025

Hi folks, any hints on how to see the node logs of failed CI tests? It's not easy to recreate the whole environment, including the OS and libs, to run them locally:

[11:09:03.884] test 2025-06-26T15:09:03.311909Z TestFramework (ERROR): Test failed. Test logging available at /ci_container_base/ci/scratch/test_runner/test_runner_₿_🏃_20250626_150635/wallet_fundrawtransaction_252/test_framework.log
[11:09:03.884] test 2025-06-26T15:09:03.312354Z TestFramework (ERROR):
[11:09:03.884] test 2025-06-26T15:09:03.312746Z TestFramework (ERROR): Hint: Call /ci_container_base/test/functional/combine_logs.py '/ci_container_base/ci/scratch/test_runner/test_runner_₿_🏃_20250626_150635/wallet_fundrawtransaction_252' to consolidate all logs
[11:09:03.884] test 2025-06-26T15:09:03.312954Z TestFramework (ERROR):
[11:09:03.884] test 2025-06-26T15:09:03.313070Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
[11:09:03.884] test 2025-06-26T15:09:03.313299Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
[11:09:03.884] test 2025-06-26T15:09:03.313452Z TestFramework (ERROR):
[11:09:03.884]
[11:09:03.884] node1 stderr libc++abi: terminating
[11:09:03.884]
[11:09:03.884] TEST | STATUS | DURATION
[11:09:03.884]
[11:09:03.884] mempool_ephemeral_dust.py | ✓ Passed | 76 s
[11:09:03.884] mempool_persist.py | ✓ Passed | 25 s
[11:09:03.884] mining_getblocktemplate_longpoll.py | ✓ Passed | 69 s
[11:09:03.884] p2p_opportunistic_1p1c.py | ✓ Passed | 140 s
[11:09:03.884] p2p_segwit.py | ✓ Passed | 113 s
[11:09:03.884] wallet_conflicts.py | ✓ Passed | 63 s
[11:09:03.884] wallet_fundrawtransaction.py | ✖ Failed | 2 s
[11:09:03.884]
[11:09:03.884] ALL | ✖ Failed | 488 s (accumulated)
[11:09:03.884] Runtime: 143 s

@maflcko
Member

maflcko commented Jun 26, 2025

Hi folks, any hints on how to see node logs of failed CI tests, It's not easy to create all the environements including OS and libs and to run it locally:

You can follow the " View more details on Cirrus CI " link: https://cirrus-ci.com/task/6753181050339328

It says system error, so you are likely launching enough threads to run into a system limit.

@maflcko
Member

maflcko commented Jun 26, 2025

For the CI docs, see https://github.com/bitcoin/bitcoin/tree/master/ci#ci-scripts, but you'll likely have to run that on a 64 core system to recreate the system error

@HowHsu
Author

HowHsu commented Jun 26, 2025

For the CI docs, see https://github.com/bitcoin/bitcoin/tree/master/ci#ci-scripts, but you'll likely have to run that on a 64 core system to recreate the system error

Thanks, maflcko. The thing is, this patchset extends the number of scriptcheck threads to nCores-1, so I guess it is bound to fail the CI tests. What should I do now?

@HowHsu
Author

HowHsu commented Jun 26, 2025

Hint: Call /ci_container_base/test/functional/combine_logs.py '/ci_container_base/ci/scratch/test_runner/test_runner_₿_🏃_20250626_150635/wallet_fundrawtransaction_252'

In my local runs, I can do something like this to see the bitcoind log (the node log, as I said), but how can I see the bitcoind log on Cirrus CI? There seems to be no way to do that, since it is not exposed.

@maflcko
Member

maflcko commented Jun 26, 2025

is not exposed.

it is: See " View more details on Cirrus CI " on the checks page

@HowHsu HowHsu requested a review from Muaytie666 June 26, 2025 16:43
@HowHsu
Author

HowHsu commented Jun 26, 2025

is not exposed.

it is: See " View more details on Cirrus CI " on the checks page

I'm so sorry, there are node logs; I missed them. Thanks again.

@HowHsu
Author

HowHsu commented Jun 26, 2025

From the CI log, it seems too many threads cause the failure. I'll temporarily set a smaller limit on the number of scriptcheck workers to more or less confirm it.

@HowHsu HowHsu force-pushed the checkqueue_atomic branch 2 times, most recently from 707fc31 to 5e51e5f, on June 27, 2025 17:42
@HowHsu HowHsu force-pushed the checkqueue_atomic branch from 5e51e5f to c731b70 on June 28, 2025 06:15
@HowHsu HowHsu force-pushed the checkqueue_atomic branch 4 times, most recently from 37dd95f to 389ae2a, on June 28, 2025 15:36
romanz and others added 7 commits June 29, 2025 00:25
Today, it is possible to fetch a block's spent prevouts in order to
build an external index by using the `/rest/block/HASH.json` endpoint.
However, its performance is low due to JSON serialization overhead.

We can significantly optimize it by adding a new REST endpoint, using
a binary response format:

```
$ BLOCKHASH=00000000000000000002a7c4c1e48d76c5a37902165a270156b7a8d72728a054

$ ab -k -c 1 -n 100 http://localhost:8332/rest/block/$BLOCKHASH.json
Document Length:        13278152 bytes
Requests per second:    3.53 [#/sec] (mean)
Time per request:       283.569 [ms] (mean)

$ ab -k -c 1 -n 10000 http://localhost:8332/rest/spentoutputs/$BLOCKHASH.bin
Document Length:        195591 bytes
Requests per second:    254.47 [#/sec] (mean)
Time per request:       3.930 [ms] (mean)
```

Currently, this PR is being used and tested by Bindex:

 * https://github.com/romanz/bindex-rs

This PR would allow improving the performance of external indexers
such as electrs, ElectrumX, Fulcrum and Blockbook:

 * https://github.com/romanz/electrs (also https://github.com/Blockstream/electrs and https://github.com/mempool/electrs)
 * https://github.com/spesmilo/electrumx
 * https://github.com/cculianu/Fulcrum
 * https://github.com/trezor/blockbook
Use an atomic<bool> fResultDone to indicate that the result has already
turned out to be an error, so that all workers can quit immediately. This
boosts performance in the script-verification-error case.

Signed-off-by: Hao Xu <hao.xu@linux.dev>
The current maximum number of scriptcheck threads is MAX_SCRIPTCHECK_THREADS = 14
due to the limitations of the current mechanism; raise it to GetNumCores() now
that the scriptcheck worker pool has an atomic mode.

Signed-off-by: Hao Xu <hao.xu@linux.dev>
GetNumCores() is now used in checkqueue-related code, so it has to be
added to libbitcoinkernel to keep the build intact.

Signed-off-by: Hao Xu <hao.xu@linux.dev>
Signed-off-by: Hao Xu <hao.xu@linux.dev>
@HowHsu
Author

HowHsu commented Jun 29, 2025

hmmm...Ok, I realized that this may only be suitable for when all the preout Coin are in CCoinsViewCache, otherwise the Add() is not fast due to I/O there.

With #31132 , this one makes sense again, need to test them together in all-cache-miss case

@DrahtBot
Contributor

🐙 This pull request conflicts with the target branch and needs rebase.

@instagibbs
Member

instagibbs commented Jul 17, 2025

Are the first two commits related to the PR? Seem REST-related?

Benchmark results suggest a 12-thread crossover to become faster than master, any chance that I'm reading this wrong?

@HowHsu
Author

HowHsu commented Jul 17, 2025

Are the first two commits related to the PR? Seem REST-related?

No, they appeared here after I rebased against origin/master and then pushed with -f.

Benchmark results suggest a 12-thread crossover to become faster than master, any chance that I'm reading this wrong?

The master branch behaves a little better when the number of threads is small, as the table in the PR description shows. The crossover point is machine/hardware dependent, and it is usually around 12~15 threads. Beyond that, as the number of threads increases, the performance of the master branch decreases rapidly, while on this PR branch it keeps going up.

@maflcko
Member

maflcko commented Jul 17, 2025

With #31132 , this one makes sense again, need to test them together in all-cache-miss case

If you are waiting on that one, it could make sense to mark this one as draft for now? If not, you'll have to rebase.

@HowHsu HowHsu marked this pull request as draft July 17, 2025 16:49
@HowHsu
Author

HowHsu commented Jul 17, 2025

With #31132 , this one makes sense again, need to test them together in all-cache-miss case

If you are waiting on that one, it could make sense to mark this one as draft for now? If not, you'll have to rebase.

Done. Thanks, maflcko.

@HowHsu
Author

HowHsu commented Sep 18, 2025

hmmm...Ok, I realized that this may only be suitable for when all the preout Coin are in CCoinsViewCache, otherwise the Add() is not fast due to I/O there.

With #31132 , this one makes sense again, need to test them together in all-cache-miss case

Tested this one combined with #31132, on 100+ contiguous real mainnet blocks starting from height 840000, by the method I mentioned in .
But the result tells another story: this atomic version of the worker pool doesn't perform better than the master-branch one. That said, this is a result from a 15-physical-core machine; I'll test it on a machine with more cores when I manage to get one.

@DrahtBot
Contributor

⌛ There hasn't been much activity lately and the patch still needs rebase. What is the status here?

  • Is it still relevant? ➡️ Please solve the conflicts to make it ready for review and to ensure the CI passes.
  • Is it no longer relevant? ➡️ Please close.
  • Did the author lose interest or time to work on this? ➡️ Please close it and mark it with one of the labels 'Up for grabs' or 'Insufficient Review', so that it can be picked up in the future.
