
Conversation

@HowHsu commented Jun 6, 2025

A fixed MAX_SCRIPTCHECK_THREADS value does not let users fully leverage their CPU resources, and a value larger than nCores - 1 doesn't make sense since it only adds context-switch overhead. Set it to nCores - 1. Assumption: a user who sets the number of script verification workers is aware of how this affects system performance; otherwise they leave it at the default (which is 0).
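A minimal sketch of the intended change, assuming `GetNumCores()` (declared in src/common/system.h on master) as the source of the core count; the exact integration point is left to review:

```cpp
// Sketch only: derive the cap from the detected core count instead of a
// hard-coded constant. GetNumCores() reports logical concurrency; the -1
// leaves a core for the main thread, and the max() keeps single-core
// machines at zero workers.
#include <common/system.h> // GetNumCores() (assumed header location)

#include <algorithm>

static int MaxScriptCheckThreads()
{
    return std::max(GetNumCores() - 1, 0);
}
```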

@DrahtBot (Contributor) commented Jun 6, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32692.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

No conflicts as of last run.

@sipa (Member) commented Jun 6, 2025

I think up to nCores makes sense, and we need to be sure that that value is an actual reflection of hardware concurrency on all platforms. It may be worth benchmarking to see whether there is still a gain above certain values (because thread contention will become more and more noticeable, and may at some point overtake the gains from more parallelism).

@DrahtBot (Contributor) commented Jun 6, 2025

🚧 At least one of the CI tasks failed.
Task no wallet, libbitcoinkernel: https://github.com/bitcoin/bitcoin/runs/43620218125
LLM reason (✨ experimental): Linker error due to undefined reference to GetNumCores() causing build failure.

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@HowHsu (Author) commented Jun 6, 2025

> I think up to nCores makes sense, and we need to be sure that that value is an actual reflection of hardware concurrency on all platforms. It may be worth benchmarking to see whether there is still a gain above certain values (because thread contention will become more and more noticeable, and may at some point overtake the gains from more parallelism).

There is a master worker, plus nCores-1 normal workers, so setting it to nCores actually allows nCores+1 worker threads.
I now think maybe it should even be smaller than nCores-1, since there may be some important bitcoind threads running (as a newbie, I can't point them out clearly).
About thread contention, I assume users know how many and which cores are idle enough to run these workers, and they will bind workers to them (for example, using taskset on Linux). Maybe more workers bring more lock contention; that's an issue to be investigated.

> be sure that that value is an actual reflection of hardware concurrency on all platforms

I'll look into it.

@sipa (Member) commented Jun 6, 2025

> There is a master worker, plus nCores-1 normal workers, so setting it to nCores actually allows nCores+1 worker threads.

That is incorrect. With -par=N, there is 1 master plus N-1 workers, which both perform script validation work, but the master only joins the other threads when it is done with other work. If you have N perfectly parallel cores, and nothing else running on your machine, par=N is expected, so the software should allow up to that, if there is no reason to assume that this would result in degraded performance.

> About thread contention, I assume users know which cores are idle enough to run these workers and they will bind workers to them (for example, using taskset on Linux).

Thread contention is the overhead inside bitcoind caused by coordination between the threads, and is unrelated to what the hardware provides or how loaded it is. The reason for the existing limit of 15 is that benchmarks showed that even on an otherwise unloaded machine with more cores than that, adding more script validation threads was net slower after about that many. That benchmark dates from 2013, so there is no reason to assume it still holds, as both hardware and typical blockchain script usage have changed significantly, as well as other parts of the codebase. But it would still be worth investigating whether there isn't just some number above which it makes no sense.

Review comment from a Member on src/validation.h (outdated):
Don't do this. Upper-case variable names are for constants, and this is no longer a constant. If there is reason for it to be a variable, the constant and its name should be removed, and the code using it should be adapted to compute it on the fly.

@HowHsu (Author) commented Jun 6, 2025

>> There is a master worker, plus nCores-1 normal workers, so setting it to nCores actually allows nCores+1 worker threads.

> That is incorrect. With -par=N, there is 1 master plus N-1 workers, which both perform script validation work, but the master only joins the other threads when it is done with other work. If you have N perfectly parallel cores, and nothing else running on your machine, par=N is expected, so the software should allow up to that, if there is no reason to assume that this would result in degraded performance.

From the code I read in checkqueue.h ( https://raw.githubusercontent.com/bitcoin/bitcoin/refs/heads/master/src/checkqueue.h#:~:text=std%3A%3Aoptional%3CR%3E%20Complete()%20EXCLUSIVE_LOCKS_REQUIRED(!m_mutex) ), I think par=N means N workers, plus one master worker which is actually the main bitcoind thread itself. And I ran the benchmark in bench/connectblock.cpp to debug it, which shows the same thing.

>> About thread contention, I assume users know which cores are idle enough to run these workers and they will bind workers to them (for example, using taskset on Linux).

> Thread contention is the overhead inside bitcoind caused by coordination between the threads, and is unrelated to what the hardware provides or how loaded it is. The reason for the existing limit of 15 is that benchmarks showed that even on an otherwise unloaded machine with more cores than that, adding more script validation threads was net slower after about that many. That benchmark dates from 2013, so there is no reason to assume it still holds, as both hardware and typical blockchain script usage have changed significantly, as well as other parts of the codebase. But it would still be worth investigating whether there isn't just some number above which it makes no sense.

I'll try to find a machine with enough CPU resources to test it.

@sipa (Member) commented Jun 6, 2025

> I think par=N means N workers, plus one master worker which is actually the main bitcoind thread itself.

That's not correct. The number of worker threads is one less than the argument to -par: https://github.com/bitcoin/bitcoin/blob/v29.0/src/node/chainstatemanager_args.cpp#L62.
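
From memory, the linked logic amounts to roughly the following (a paraphrase, not the exact code; see the linked file for the authoritative version):

```cpp
#include <algorithm>

// Paraphrase of the -par handling: the calling thread counts toward -par,
// hence the "- 1". -par <= 0 means "use all cores minus that many".
int WorkerThreadsFromPar(int par, int num_cores, int max_scriptcheck_threads)
{
    if (par <= 0) par += num_cores;
    return std::clamp(par - 1, 0, max_scriptcheck_threads);
}
```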

@HowHsu (Author) commented Jun 6, 2025

>> I think par=N means N workers, plus one master worker which is actually the main bitcoind thread itself.

> That's not correct. The number of worker threads is one less than the argument to -par: https://github.com/bitcoin/bitcoin/blob/v29.0/src/node/chainstatemanager_args.cpp#L62.

I see, sorry for missing this part, but MAX_SCRIPTCHECK_THREADS limits opts.worker_threads_num, so it should still be nCores-1.

@sipa (Member) commented Jun 6, 2025

@HowHsu I am so sorry! You are right.

@HowHsu (Author) commented Jun 8, 2025

Hi @sipa,

> The reason for the existing limit of 15 is that benchmarks showed that even on an otherwise unloaded machine with more cores than that, adding more script validation threads was net slower after about that many.

Is there any doc about this test?

> That benchmark dates from 2013

Is the benchmark you mentioned bench/connectblock.cpp?

I've done some investigation and found that the bottleneck may be the ECC calculation itself, not lock contention or something else.
I did some testing on my Intel i5-13600KF (13th Gen) PC, which has 6 P-cores and 8 E-cores (6 * 2 + 8 = 20 logical cores), with ./bench/connectblock.cpp. The producer adds tasks to the queue quickly, and lock contention is low according to the perf report. But the TPS doesn't go higher after thread_num=14 or 15, which is equal to the number of physical cores.
From the internet, I learned that ECC computation heavily uses the ALU, FPU, etc., and the logical CPU threads on one core share these units, which means ECC cannot benefit from SMT. So I guess a user can set MAX_SCRIPTCHECK_THREADS to nPhysicalCores - 1, but this still needs more research. Unfortunately I don't have a test environment with a large number of CPUs (it would be super great if someone could run this test on machines with different numbers of physical cores).
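
As a side note, standard C++ only exposes the logical thread count, which is what GetNumCores() ultimately reports; getting the physical core count requires OS-specific queries. A quick way to see what the runtime reports:

```cpp
#include <iostream>
#include <thread>

int main()
{
    // Prints the number of *logical* hardware threads: 20 on the
    // i5-13600KF above (6 P-cores with SMT + 8 E-cores), not the 14
    // physical cores. Physical-core detection needs platform APIs
    // (e.g. /sys/devices/system/cpu on Linux).
    std::cout << std::thread::hardware_concurrency() << " logical threads\n";
}
```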


@HowHsu (Author) commented Jun 22, 2025

I'm going to close this one, since I found that under the current scriptcheck queue implementation, going beyond 14 or 15 threads may degrade performance.

@HowHsu closed this Jun 22, 2025
@luke-jr (Member) commented Jun 24, 2025

Maybe this is something best solved by runtime benchmarking? (Or perhaps just at startup, since early blocks may not give us the parallelization we need for it)

@HowHsu (Author) commented Jun 29, 2025

I posted a new scriptcheck pool implementation, #32791, which uses atomic variables rather than a mutex to reduce thread synchronization contention. It achieves a very good performance improvement. But the idea relies on the adding period being very short compared to the script verification period, so it has to go together with PR #31132. @sipa and others, welcome to take a look.
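
To illustrate the direction (a generic lock-free sketch, not the actual code in #32791): once all checks are added up front, workers can claim the next item with a single atomic fetch-add instead of locking a mutex per batch. This is also why the short adding phase from #31132 matters here: the scheme assumes the task list is complete before the workers race through it.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Generic sketch: each thread claims the next unprocessed index with one
// relaxed atomic RMW per check. Check is any callable returning bool.
template <typename Check>
bool RunChecks(const std::vector<Check>& checks, std::atomic<std::size_t>& next)
{
    bool ok{true};
    for (;;) {
        const std::size_t i{next.fetch_add(1, std::memory_order_relaxed)};
        if (i >= checks.size()) break;
        if (!checks[i]()) ok = false; // real code would propagate failure early
    }
    return ok;
}
```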

@HowHsu reopened this Jun 29, 2025
@HowHsu (Author) commented Jun 29, 2025

> Maybe this is something best solved by runtime benchmarking? (Or perhaps just at startup, since early blocks may not give us the parallelization we need for it)

I think we can do it that way: binary-search for the best worker thread count by benchmarking on some standard test data.
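
A rough shape of what that calibration could look like (hypothetical bench_verify helper; a galloping search rather than a strict binary search, since it only assumes throughput rises and then flattens or falls as threads are added, which fjahr's table further down also suggests):

```cpp
#include <algorithm>
#include <functional>

// Hypothetical: bench_verify(n) replays standard test data with n worker
// threads and returns blocks/sec. Doubling pass to bracket the knee, then
// a linear refinement around the best point found.
int PickWorkerThreads(int max_threads, const std::function<double(int)>& bench_verify)
{
    int best{1};
    double best_tps{bench_verify(1)};
    for (int n = 2; n <= max_threads; n *= 2) {
        const double tps{bench_verify(n)};
        if (tps <= best_tps) break;
        best = n;
        best_tps = tps;
    }
    for (int n = best + 1; n <= std::min(best * 2, max_threads); ++n) {
        const double tps{bench_verify(n)};
        if (tps <= best_tps) break;
        best = n;
        best_tps = tps;
    }
    return best;
}
```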

@maflcko (Member) commented Jul 3, 2025

Maybe turn this into a draft while the CI is red and it's still a WIP?

@fanquake marked this pull request as draft July 3, 2025 08:22
A fixed MAX_SCRIPTCHECK_THREADS value is not flexible for users to leverage their
CPU resources, and a value larger than nCores - 1 doesn't make sense since it only
adds context-switch overhead. Set it to nCores - 1.
Assumption: a user who sets the number of script verification workers is aware of
how this affects system performance; otherwise they leave it at the default
(which is 0).

Signed-off-by: Hao Xu <hao.xu@linux.dev>
@achow101 (Member) commented

Are you still working on this?

@achow101 requested a review from fjahr October 22, 2025 15:27
@HowHsu (Author) commented Oct 23, 2025

> Are you still working on this?

No

@HowHsu closed this Oct 23, 2025
@fjahr (Contributor) commented Oct 29, 2025

> The reason for the existing limit of 15 is that benchmarks showed that even on an otherwise unloaded machine with more cores than that, adding more script validation threads was net slower after about that many. That benchmark dates from 2013, so there is no reason to assume it still holds, as both hardware and typical blockchain script usage have changed significantly, as well as other parts of the codebase. But it would still be worth investigating whether there isn't just some number above which it makes no sense.

I also checked this using the ConnectBlock benchmark as well as #33740, plus modifying MAX_SCRIPTCHECK_THREADS, of course. The assumption still appears to hold based on these results. If anything, based on my results (which I reran several times), we could potentially reduce MAX_SCRIPTCHECK_THREADS to 13. EDIT: After some time I started to see the best results with 14 and 15, so maybe this was nothing.

$ build/bin/bench_bitcoin -filter=ConnectBlockAllSchnorr -min-time=1000 -scale-threads
Running benchmarks with worker threads from 1 to 32

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|       43,184,145.50 |               23.16 |    0.1% |      0.99 | `ConnectBlockAllSchnorr_1_threads`
|       29,091,010.25 |               34.37 |    0.4% |      1.10 | `ConnectBlockAllSchnorr_2_threads`
|       22,236,218.75 |               44.97 |    0.1% |      1.09 | `ConnectBlockAllSchnorr_3_threads`
|       17,901,305.67 |               55.86 |    0.3% |      1.09 | `ConnectBlockAllSchnorr_4_threads`
|       15,134,386.86 |               66.07 |    0.2% |      1.08 | `ConnectBlockAllSchnorr_5_threads`
|       13,131,000.00 |               76.16 |    0.1% |      1.08 | `ConnectBlockAllSchnorr_6_threads`
|       11,635,203.67 |               85.95 |    0.1% |      1.07 | `ConnectBlockAllSchnorr_7_threads`
|       10,472,699.00 |               95.49 |    0.1% |      1.07 | `ConnectBlockAllSchnorr_8_threads`
|        9,571,375.00 |              104.48 |    0.1% |      1.08 | `ConnectBlockAllSchnorr_9_threads`
|        9,643,621.18 |              103.70 |    0.7% |      1.09 | `ConnectBlockAllSchnorr_10_threads`
|        9,545,537.50 |              104.76 |    0.6% |      1.09 | `ConnectBlockAllSchnorr_11_threads`
|        9,536,983.30 |              104.85 |    0.8% |      1.07 | `ConnectBlockAllSchnorr_12_threads`
|        9,313,783.30 |              107.37 |    0.7% |      1.08 | `ConnectBlockAllSchnorr_13_threads`
|        9,630,087.50 |              103.84 |    1.8% |      1.09 | `ConnectBlockAllSchnorr_14_threads`
|        9,893,344.73 |              101.08 |    1.5% |      1.10 | `ConnectBlockAllSchnorr_15_threads`
|        9,840,655.36 |              101.62 |    1.3% |      1.12 | `ConnectBlockAllSchnorr_16_threads`
|        9,914,708.27 |              100.86 |    3.3% |      1.07 | `ConnectBlockAllSchnorr_17_threads`
|       10,038,008.40 |               99.62 |    1.0% |      1.09 | `ConnectBlockAllSchnorr_18_threads`
|       10,126,620.90 |               98.75 |    1.0% |      1.09 | `ConnectBlockAllSchnorr_19_threads`
|       10,279,966.60 |               97.28 |    2.2% |      1.12 | `ConnectBlockAllSchnorr_20_threads`
|       10,334,737.50 |               96.76 |    1.8% |      1.10 | `ConnectBlockAllSchnorr_21_threads`
|       10,016,070.80 |               99.84 |    3.2% |      1.12 | `ConnectBlockAllSchnorr_22_threads`
|       10,447,277.78 |               95.72 |    1.5% |      1.08 | `ConnectBlockAllSchnorr_23_threads`
|       10,244,954.55 |               97.61 |    1.4% |      1.12 | `ConnectBlockAllSchnorr_24_threads`
|        9,975,325.73 |              100.25 |    1.4% |      1.09 | `ConnectBlockAllSchnorr_25_threads`
|       10,181,458.30 |               98.22 |    1.1% |      1.11 | `ConnectBlockAllSchnorr_26_threads`
|       10,263,129.20 |               97.44 |    0.8% |      1.10 | `ConnectBlockAllSchnorr_27_threads`
|       10,168,634.33 |               98.34 |    1.0% |      1.02 | `ConnectBlockAllSchnorr_28_threads`
|        9,781,129.20 |              102.24 |    1.9% |      1.11 | `ConnectBlockAllSchnorr_29_threads`
|       10,461,203.12 |               95.59 |    4.9% |      1.00 | `ConnectBlockAllSchnorr_30_threads`
|       10,292,564.89 |               97.16 |    2.0% |      0.99 | `ConnectBlockAllSchnorr_31_threads`
|       10,371,725.00 |               96.42 |    0.6% |      1.11 | `ConnectBlockAllSchnorr_32_threads`
