checkqueue: implement a new scriptcheck worker pool with atomic variables #32791
base: master
Conversation
HowHsu commented Jun 21, 2025
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Code Coverage & Benchmarks: for details see https://corecheck.dev/bitcoin/bitcoin/pulls/32791.
Reviews: see the guideline for information on the review process.
Conflicts: Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
🚧 At least one of the CI tasks failed.
Hints: Try to run the tests locally, according to the documentation. However, a CI failure may still …
Leave a comment here, if you need help tracking down a confusing failure.
Force-pushed from ac947ff to 7d2b002
|
Hmm... OK, I realized that this may only be suitable for the case where all the prevout Coins are in CCoinsViewCache; otherwise the …
Force-pushed from e9bb83c to 9501cb7
Hi folks, any hints on how to see the node logs of the failed CI tests? It's not easy to recreate all the environments (OS and libs) and run them locally:
You can follow the "View more details on Cirrus CI" link: https://cirrus-ci.com/task/6753181050339328. It says system error, so you are likely launching enough threads to run into a system limit.
For the CI docs, see https://github.com/bitcoin/bitcoin/tree/master/ci#ci-scripts, but you'll likely have to run that on a 64-core system to recreate the system error.
Thanks, maflcko. The thing is, this patchset extends the number of scriptcheck threads to nCores - 1, so I guess it is bound to fail the CI tests. What should I do now?
In my local runs, I can do something like this to see the bitcoind log (the node log, as I said), but how can I see the bitcoind log on Cirrus CI? There seems to be no way to do that, since it is not exposed.
It is: see "View more details on Cirrus CI" on the checks page.
I'm so sorry, there are node logs; I missed them. Thanks again.
From the CI log, it seems too many threads cause the failure. I'll temporarily set a smaller limit on the number of scriptcheck workers to confirm it.
Force-pushed from 707fc31 to 5e51e5f
Force-pushed from 5e51e5f to c731b70
Force-pushed from 37dd95f to 389ae2a
Today, it is possible to fetch a block's spent prevouts in order to build an external index by using the `/rest/block/HASH.json` endpoint. However, its performance is low due to JSON serialization overhead. We can significantly optimize it by adding a new REST endpoint, using a binary response format:

```
$ BLOCKHASH=00000000000000000002a7c4c1e48d76c5a37902165a270156b7a8d72728a054

$ ab -k -c 1 -n 100 http://localhost:8332/rest/block/$BLOCKHASH.json
Document Length:        13278152 bytes
Requests per second:    3.53 [#/sec] (mean)
Time per request:       283.569 [ms] (mean)

$ ab -k -c 1 -n 10000 http://localhost:8332/rest/spentoutputs/$BLOCKHASH.bin
Document Length:        195591 bytes
Requests per second:    254.47 [#/sec] (mean)
Time per request:       3.930 [ms] (mean)
```

Currently, this PR is being used and tested by Bindex:
* https://github.com/romanz/bindex-rs

This PR would allow improving the performance of external indexers such as electrs, ElectrumX, Fulcrum and Blockbook:
* https://github.com/romanz/electrs (also https://github.com/Blockstream/electrs and https://github.com/mempool/electrs)
* https://github.com/spesmilo/electrumx
* https://github.com/cculianu/Fulcrum
* https://github.com/trezor/blockbook
checkqueue: implement a new scriptcheck worker pool with atomic variables
TL;DR
This patchset proposes a new design for the scriptcheck worker pool based on
atomic variables, to leverage more CPU resources on a machine than the
current mutex version can. It achieves about a 45% boost in performance.
Main Idea:
The current worker pool uses a mutex for synchronization between worker
threads, which makes performance degrade once worker_threads_num >= 14.
One root cause is the high contention on the queue's mutex among worker
threads: workers repeatedly sleep and are woken up within a short time,
leading to heavy context switching.
The idea is to use atomic variables to protect the queue: workers acquire
the queue index and advance it by a batch_size (a small number). This
significantly reduces context switching.
Implementation
To implement it, the following changes are made (enabled in atomic mode):
- atomic<int> nNext indicates the next available index in the queue
- atomic<int> nTodo2 indicates the number of tasks left
- nBatchSize is the number of tasks a worker consumes at a time;
  it should be small to keep the workload balanced among workers
- m_mutex is still used, to protect m_result and the condition variables
- fAtomic indicates which mode is enabled
- about the queue:
  - the queue is used as a queue, not as a stack any more
  - the queue is 'add only': tasks are only added to it, and consumed
    tasks are not removed.
  The above two are a must to make this idea achievable, and the
  second sacrifices space.
- the workers are started only after all the tasks have been added to the
  queue.
  This is a trade-off to keep the implementation simple: a worker can
  compare nNext + nBatchSize against queue.size() to find out whether
  there are undone tasks in the queue. Otherwise another atomic<int> tail
  would be needed, and a race condition may exist in that approach (needs
  more investigation). From my tests, it's not entirely accurate to call
  this a 'trade-off', since it also brings benefits. Pros and cons:
  - pros:
    1. no contention between the producer and consumers when adding tasks
       to the queue
    2. the adding period is very short compared to the verification
       period
  - cons:
    1. the workers start later
For details, please refer to the worker handler code.
Tests:
- Numbers
  - For blocks/s numbers:
    `./build/bin/bench_bitcoin --filter=ConnectBlockMixedEcdsaSchnor`
  - For CPU usage numbers:
    `NANOBENCH_ENDLESS=ConnectBlockAllEcdsa ./build/bin/bench_bitcoin --filter=ConnectBlockAllEcdsa`
  - Some data is missing because the machine was billed by time, and there wasn't
    enough time to complete all measurements.
  - Some cells contain a range instead of a single value due to high fluctuations.
  - The numbers represent the average of multiple test runs.
| nr_workers | mutex: blocks/s | mutex: CPU usage | atomic: blocks/s | atomic: CPU usage |
|-----------:|----------------:|-----------------:|-----------------:|------------------:|
| 0  | 4.22      | 100%      | 4.2   | 100%   |
| 1  | 8.23      | 198%      | 8     | 191.3% |
| 2  | 12.08     | 293.7%    | 11.6  | 276%   |
| 3  | 16        | 386.7%    | 14.9  | 354.5% |
| 4  | 19.8      | 476.3%    |       |        |
| 5  | 23.12     | 565.4%    |       |        |
| 6  | 27.3      | 654%      | 23    | 560%   |
| 7  | 28.5      | 738%      |       |        |
| 8  | 33.8      | 809%      |       |        |
| 9  | 33.5      | 868%      | 28.5  | 730%   |
| 10 | 37.4      | 910%      | 30.5  | 777%   |
| 11 | 33~40     | 880%~928% | 32    | 825%   |
| 12 | 22~39     | 860%~925% | 32.5  | 830%   |
| 13 | 27.8~37   | 700%~913% | 34.8  | 928%   |
| 15 | 24.8~38.9 | 640%~719% | 38    | 1000%  |
| 19 | 18~23     | 370%~510% | 41    | 1161%  |
| 22 | 7~18      | 270%~400% |       |        |
| 25 | 6.4       | 260%      | 46    | 1400%  |
| 30 | 5         | 270%      | 49    | 1560%  |
| 60 | 5         | 280%      | 50~58 | 2100%  |
- Env: a virtual machine in the cloud
```
Architecture:             x86_64
CPU op-mode(s):           32-bit, 64-bit
Address sizes:            46 bits physical, 48 bits virtual
Byte Order:               Little Endian
CPU(s):                   64
On-line CPU(s) list:      0-63
Vendor ID:                GenuineIntel
Model name:               Intel Xeon Processor (Cascadelake)
CPU family:               6
Model:                    85
Thread(s) per core:       2
Core(s) per socket:       16
Socket(s):                2
Stepping:                 6
BogoMIPS:                 5985.93
Virtualization features:
  Hypervisor vendor:      KVM
  Virtualization type:    full
Caches (sum of all):
  L1d:                    2 MiB (64 instances)
  L1i:                    2 MiB (64 instances)
  L2:                     128 MiB (32 instances)
  L3:                     32 MiB (2 instances)
NUMA:
  NUMA node(s):           2
  NUMA node0 CPU(s):      0-31
  NUMA node1 CPU(s):      32-63
```
Signed-off-by: Hao Xu <hao.xu@linux.dev>
Use an atomic<bool> fResultDone to indicate that the result has already turned out to be an error, so all workers can quit immediately. This boosts performance in the script-verification-error case. Signed-off-by: Hao Xu <hao.xu@linux.dev>
The current max number of scriptcheck threads is MAX_SCRIPTCHECK_THREADS = 14 due to the limitation of the current mechanism; raise it to GetNumCores(), as we now have the atomic mode of the scriptcheck worker pool. Signed-off-by: Hao Xu <hao.xu@linux.dev>
GetNumCores() is now used in checkqueue-related code, so it has to be added to libbitcoinkernel to keep everything working. Signed-off-by: Hao Xu <hao.xu@linux.dev>
Signed-off-by: Hao Xu <hao.xu@linux.dev>
Force-pushed from 389ae2a to 8ed4204
With #31132, this one makes sense again; need to test them together in the all-cache-miss case.
🐙 This pull request conflicts with the target branch and needs rebase.
Are the first two commits related to the PR? They seem REST-related. Benchmark results suggest a 12-thread crossover to become faster than master; any chance that I'm reading this wrong?
No, they appeared here after I rebased against origin/master and then pushed with -f.
Master behaves a little better when the number of threads is small, just like the table I put in the PR description. The crossover point is machine/hardware dependent, and it is usually around 12~15 threads. After that, as the number of threads increases, the performance of the master branch rapidly decreases, while on this PR branch the performance still goes up.
If you are waiting on that one, it could make sense to mark this one as draft for now? If not, you'll have to rebase. |
Done. Thanks, maflcko.
Tested this one, combined with , on about 100+ contiguous real mainnet blocks from height 840000, by the method I mentioned.
⌛ There hasn't been much activity lately and the patch still needs rebase. What is the status here?