
More precise std::hardware_concurrency #594

Closed

AlexGuteniev wants to merge 8 commits into microsoft:master from AlexGuteniev:hw_conc

Conversation


@AlexGuteniev AlexGuteniev commented Mar 8, 2020

Description

Mostly for debugging purposes, for code that pretends it is running on a single CPU.

Checklist

Be sure you've read README.md and understand the scope of this repo.

If you're unsure about a box, leave it unchecked. A maintainer will help you.

  • Identifiers in product code changes are properly _Ugly as per
    https://eel.is/c++draft/lex.name#3.1 or there are no product code changes.
  • The STL builds successfully and all tests have passed (must be manually
    verified by an STL maintainer before automated testing is enabled on GitHub,
    leave this unchecked for initial submission).
  • These changes introduce no known ABI breaks (adding members, renaming
    members, adding virtual functions, changing whether a type is an aggregate
    or trivially copyable, etc.).
  • These changes were written from scratch using only this repository,
    the C++ Working Draft (including any cited standards), other WG21 papers
    (excluding reference implementations outside of proposed standard wording),
    and LWG issues as reference material. If they were derived from a project
    that's already listed in NOTICE.txt, that's fine, but please mention it.
    If they were derived from any other project (including Boost and libc++,
    which are not yet listed in NOTICE.txt), you must mention it here,
    so we can determine whether the license is compatible and what else needs
    to be done.

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner March 8, 2020 10:28
@CaseyCarter CaseyCarter added the enhancement Something can be improved label Mar 8, 2020
return info.dwNumberOfProcessors;
DWORD_PTR process_affinity;
DWORD_PTR system_affinity;
if (GetProcessAffinityMask(GetCurrentProcess(), &process_affinity, &system_affinity))
Member

GetProcessAffinityMask documentation says:

On a system with more than 64 processors, if the threads of the calling process are in a single processor group, the function sets the variables pointed to by lpProcessAffinityMask and lpSystemAffinityMask to the process affinity mask and the processor mask of active logical processors for that group. If the calling process contains threads in multiple groups, the function returns zero for both affinity masks.

dwNumberOfProcessors documentation says:

The number of logical processors in the current group.

Is there a behavioral difference here? If so, is it desirable? (These are not rhetorical questions; I am far from a WinAPI expert.)

Contributor Author

GetProcessAffinityMask is documented to return zero affinity masks when the process belongs to multiple processor groups.

The dwNumberOfProcessors documentation does not say anything about cases where the process belongs to multiple processor groups. It could be assumed that it is still the number of logical processors in the current group. (However, this might just be a gap in the documentation.)

So there's no difference in the maximum return value unless the process belongs to multiple processor groups.
And there is a possible difference when the process belongs to multiple processor groups.

A process comes to belong to multiple processor groups after one of its threads is explicitly assigned to run in a different processor group (by a call to SetThreadGroupAffinity or as a creation parameter). This is done manually for each thread, and std::thread will not do it by default. So a zero result from std::hardware_concurrency() is legitimate in this case (the value is not well defined; it depends on the user's further calls to SetThreadGroupAffinity). The per-group value returned by dwNumberOfProcessors, which is supposedly 64 for multi-group cases (the system tries to pack 64 processors into one group), is also suitable if it really still returns nonzero in such cases, as std::hardware_concurrency() is a hint. Once the user has taken control with a SetThreadGroupAffinity call, they have taken responsibility for the effective hardware concurrency.

This change primarily targets the other case, when this or another process calls SetProcessAffinityMask for this process. This technique can be used to test concurrent algorithms for performance on fewer CPUs (removing some cores, SMT siblings, or both), or for fixing concurrency bugs (Application Compatibility can do this, for example).

For multi-group cases, I'm not sure. There is a possibility to enumerate all processor groups by obtaining GROUP_RELATIONSHIP from GetLogicalProcessorInformationEx. But I do not know if this is the right thing to do, as a bigger concurrency value obtained by such means is only available to those who will call SetThreadGroupAffinity.

Member

And there is a possible difference when the process belongs to multiple processor groups.

I missed that. Sadly to test this we need a machine with more than 64 threads...

(the value is not well defined; it depends on the user's further calls to SetThreadGroupAffinity).

I disagree; the value is defined for all portable uses of std::thread, since it uses the default group.

There is a possibility to enumerate all processor groups

Right, we should not do that because std::thread can't reach the other groups.

Member

@BillyONeal BillyONeal left a comment

We need to confirm that the behavior is unchanged for multi-group systems. (There might be nothing @AlexGuteniev can do about this yet.)

@StephanTLavavej
Member

We believe the compiler back-end team has an 80-core machine; we should look into borrowing it for testing.

@BillyONeal
Member

We believe the compiler back-end team has an 80-core machine; we should look into borrowing it for testing.

I asked around on win32prg and got a volunteer. Just need to prepare the test for them...

@AlexGuteniev
Contributor Author

AlexGuteniev commented Mar 11, 2020

I asked around on win32prg and got a volunteer. Just need to prepare the test for them...

Try this, primarily on x64, and on x86 for more data:

#define _WIN32_WINNT 0x0601
#define WINVER 0x0601


#include <iostream>
#include <vector>
#include <Windows.h>


int main() {
    auto print_ga = []() {
        DWORD length = 1000;

        std::vector<BYTE> buffer;

        // Grow the buffer until GetLogicalProcessorInformationEx succeeds.
        for (;;) {
            buffer.resize(0);
            buffer.resize(length);
            PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buffer.data());
            if (::GetLogicalProcessorInformationEx(RelationGroup, info, &length)) {
                break;
            }

            DWORD err = ::GetLastError();

            if (err != ERROR_INSUFFICIENT_BUFFER) {
                std::cerr << "Error: " << err << " calling GetLogicalProcessorInformationEx\n";
                return; // any other error would otherwise loop forever
            }
        }

        // Walk the variable-size records and report every processor group.
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info = nullptr;
        for (DWORD offset = 0; offset < length; offset += info->Size) {
            info = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buffer.data() + offset);
            if (info->Relationship == RelationGroup) {
                auto& g = info->Group;
                std::cout << "Groups " << g.ActiveGroupCount << " out of " << g.MaximumGroupCount << ".\n";
                for (WORD gi_index = 0; gi_index < g.ActiveGroupCount; gi_index++) {
                    auto& gi = g.GroupInfo[gi_index];
                    std::cout << "In group " << gi_index << ": " << (int)gi.ActiveProcessorCount << " / " << (int)gi.MaximumProcessorCount << " processors. "
                        << "Affinity mask: " << std::showbase << std::hex << gi.ActiveProcessorMask << std::noshowbase << std::dec << ".\n";
                }
            }
        }
        SYSTEM_INFO si = {};
        ::GetNativeSystemInfo(&si);
        DWORD_PTR process_affinity;
        DWORD_PTR system_affinity;
        ::GetProcessAffinityMask(GetCurrentProcess(), &process_affinity, &system_affinity);
        std::cout << "System info reports: " << si.dwNumberOfProcessors << " CPUs. Affinity mask: "
            << std::showbase << std::hex << process_affinity << " / " << system_affinity << std::noshowbase << std::dec << ".\n";
    };

    print_ga();

    // Pin one sleeping thread into each of the first two groups; the handles
    // are intentionally leaked in this throwaway test.
    for (WORD g = 0; g < 2; g++)
    {
        HANDLE h = ::CreateThread(nullptr, 0, [](PVOID)->DWORD { ::Sleep(INFINITE); return 0; }, nullptr, 0, nullptr);

        GROUP_AFFINITY ga{};
        ga.Group = g;
        ga.Mask = 0x0000003;
        ::SetThreadGroupAffinity(h, &ga, nullptr);
    }

    std::cout << "\n\nAfter having 2 threads in 2 groups:\n\n";

    print_ga();

    return 0;
}

@BillyONeal
Member

I think the right behavior is to do this change, then fall back to the old method if the result is 0. I'll do some experimentation tomorrow; Windows folks pointed me to this knob for testing: https://docs.microsoft.com/en-us/windows-hardware/drivers/devtest/boot-parameters-to-test-drivers-for-multiple-processor-group-support

(A 3970X that the system thinks has 4 sockets?)

[screenshot omitted]

@AlexGuteniev
Contributor Author

I am too curious to avoid trying myself.

So, my results show that even after I successfully called SetThreadGroupAffinity to put threads into different groups, GetProcessAffinityMask still returns nonzero affinity masks.

Maybe it is not covered by the simulation knob. Maybe the documentation for GetProcessAffinityMask is wrong or outdated.

Another source of concern is truncation of masks in a 32-bit process, and there are even two cases:

  • 32-bit process on 32-bit system
  • 32-bit process on 64-bit system

@AlexGuteniev
Contributor Author

If it were possible to propose Windows API functions, I would propose the following:

  1. A number-of-CPUs function, which can take process affinity into account and can avoid a kernel call (maybe both features via flags passed to it)
  2. A recommendation on spinning, like "don't", "avoid", "do", based on the number of CPUs and possibly other things like low-power state

As for this pull request, maybe it should be closed if full understanding is not reachable.

@BillyONeal
Member

If it is possible to propose Windows API functions,

I believe the right place for something like that is the Windows feedback tool. Of course even if they add such a thing it'll be some time before we can use it here :)

@BillyONeal
Member

OK, I did some testing and I think this breaks 32-bit programs, since the status quo is that systems with more than 32 CPUs will do the right thing for them:

C:\Users\billy\Desktop>.\before.exe
The hardware concurrency is: 64

C:\Users\billy\Desktop>.\before32.exe
The hardware concurrency is: 64

As a result I think we should leave this unchanged.

@AlexGuteniev AlexGuteniev deleted the hw_conc branch March 18, 2020 03:23
@AlexGuteniev
Contributor Author

On the good side, bcdedit.exe /set groupsize 1 will make the existing std::hardware_concurrency return 1, so there is already a way to test this case.

@BillyONeal
Member

On the good side, bcdedit.exe /set groupsize 1 will make the existing std::hardware_concurrency return 1, so there is already a way to test this case.

I'm referring to the opposite. On my 32c/64t system, hardware_concurrency currently returns 64 but returns 32 with your change.

@AlexGuteniev
Contributor Author

I understand that the change breaks this case.
I mean there is already a way to make std::hardware_concurrency return 1 without my change.
