
More precise std::hardware_concurrency #594

Closed

AlexGuteniev wants to merge 8 commits into microsoft:master from AlexGuteniev:hw_conc

Conversation


@AlexGuteniev AlexGuteniev commented Mar 8, 2020

Description

Mostly for debugging purposes, for code that pretends it is running on a single CPU.

Checklist

Be sure you've read README.md and understand the scope of this repo.

If you're unsure about a box, leave it unchecked. A maintainer will help you.

  • Identifiers in product code changes are properly _Ugly as per
    https://eel.is/c++draft/lex.name#3.1 or there are no product code changes.
  • The STL builds successfully and all tests have passed (must be manually
    verified by an STL maintainer before automated testing is enabled on GitHub,
    leave this unchecked for initial submission).
  • These changes introduce no known ABI breaks (adding members, renaming
    members, adding virtual functions, changing whether a type is an aggregate
    or trivially copyable, etc.).
  • These changes were written from scratch using only this repository,
    the C++ Working Draft (including any cited standards), other WG21 papers
    (excluding reference implementations outside of proposed standard wording),
    and LWG issues as reference material. If they were derived from a project
    that's already listed in NOTICE.txt, that's fine, but please mention it.
    If they were derived from any other project (including Boost and libc++,
    which are not yet listed in NOTICE.txt), you must mention it here,
    so we can determine whether the license is compatible and what else needs
    to be done.

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner March 8, 2020 10:28
@CaseyCarter CaseyCarter added the enhancement Something can be improved label Mar 8, 2020
return info.dwNumberOfProcessors;
DWORD_PTR process_affinity;
DWORD_PTR system_affinity;
if (GetProcessAffinityMask(GetCurrentProcess(), &process_affinity, &system_affinity))
Member

GetProcessAffinityMask documentation says:

On a system with more than 64 processors, if the threads of the calling process are in a single processor group, the function sets the variables pointed to by lpProcessAffinityMask and lpSystemAffinityMask to the process affinity mask and the processor mask of active logical processors for that group. If the calling process contains threads in multiple groups, the function returns zero for both affinity masks.

dwNumberOfProcessors documentation says:

The number of logical processors in the current group.

Is there a behavioral difference here? If so, is it desirable? (These are not rhetorical questions; I am far from a WinAPI expert.)

Contributor Author

GetProcessAffinityMask is documented to return zero affinity masks when the process belongs to multiple processor groups.

The dwNumberOfProcessors documentation does not say anything about cases where the process belongs to multiple processor groups. It could be assumed that it is still the number of logical processors in the current group. (However, this might just be a gap in the documentation.)

So there's no difference in the maximum return value unless the process belongs to multiple processor groups.
And there is a possible difference when the process belongs to multiple processor groups.

A process comes to belong to multiple processor groups after one of its threads is explicitly assigned to run in a different processor group (by a call to SetThreadGroupAffinity or as a creation parameter). This is done manually for each thread, and std::thread will not do it by default. So a zero result from std::hardware_concurrency() is legitimate in this case (the value is not well defined; it depends on the user's further calls to SetThreadGroupAffinity). The per-group value returned by dwNumberOfProcessors, which is supposedly 64 for multi-group cases (the system tries to pack 64 processors into one group), is also suitable if it really still returns nonzero in such cases, as std::hardware_concurrency() is a hint. Once the user has taken control with a SetThreadGroupAffinity call, they have taken responsibility for the effective hardware concurrency.

This change primarily targets the other case, when this or another process calls SetProcessAffinityMask for this process. This technique can be used to test concurrent algorithms for performance on fewer CPUs (removing some cores, SMT siblings, or both), or for fixing concurrency bugs (Application Compatibility can do this, for example).

For multi-group cases, I'm not sure. There is a possibility to enumerate all processor groups by obtaining GROUP_RELATIONSHIP from GetLogicalProcessorInformationEx. But I do not know if this is the right thing to do, as a bigger concurrency value obtained by such means is only available to those who will call SetThreadGroupAffinity.

Member

And there is a possible difference when the process belongs to multiple processor groups.

I missed that. Sadly to test this we need a machine with more than 64 threads...

(the value is not well defined; it depends on the user's further calls to SetThreadGroupAffinity).

I disagree; the value is defined for all portable uses of std::thread, since it uses the default group.

There is a possibility to enumerate all processor groups

Right, we should not do that because std::thread can't reach the other groups.

Member

@BillyONeal BillyONeal left a comment

We need to confirm that the behavior is unchanged for multi-group systems. (There might be nothing @AlexGuteniev can do about this yet.)

@StephanTLavavej
Member

We believe the compiler back-end team has an 80-core machine; we should look into borrowing it for testing.

@BillyONeal
Member

We believe the compiler back-end team has an 80-core machine; we should look into borrowing it for testing.

I asked around on win32prg and got a volunteer. Just need to prepare the test for them...

@AlexGuteniev
Contributor Author

AlexGuteniev commented Mar 11, 2020

I asked around on win32prg and got a volunteer. Just need to prepare the test for them...

Try this, primarily on x64, and on x86 for more data:

#define _WIN32_WINNT 0x0601
#define WINVER 0x0601


#include <iostream>
#include <vector>
#include <Windows.h>


int main() {
    auto print_ga = []() {
        DWORD length = 1000;

        std::vector<BYTE> buffer;

        // Grow the buffer until GetLogicalProcessorInformationEx succeeds.
        for (;;) {
            buffer.resize(0);
            buffer.resize(length);
            PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buffer.data());
            if (::GetLogicalProcessorInformationEx(RelationGroup, info, &length)) {
                break;
            }

            DWORD err = ::GetLastError();

            if (err != ERROR_INSUFFICIENT_BUFFER) {
                std::cerr << "Error: " << err << " calling GetLogicalProcessorInformationEx\n";
                return; // any other error would otherwise loop forever
            }
        }

        // Walk the variable-size records and report every processor group.
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info = nullptr;
        for (DWORD offset = 0; offset < length; offset += info->Size) {
            info = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buffer.data() + offset);
            if (info->Relationship == RelationGroup) {
                auto& g = info->Group;
                std::cout << "Groups " << g.ActiveGroupCount << " out of " << g.MaximumGroupCount << ".\n";
                for (WORD gi_index = 0; gi_index < g.ActiveGroupCount; gi_index++) {
                    auto& gi = g.GroupInfo[gi_index];
                    std::cout << "In group " << gi_index << ": " << (int)gi.ActiveProcessorCount << " / " << (int)gi.MaximumProcessorCount << " processors. "
                        << "Affinity mask: " << std::showbase << std::hex << gi.ActiveProcessorMask << std::noshowbase << std::dec << ".\n";
                }
            }
        }
        SYSTEM_INFO si = {};
        ::GetNativeSystemInfo(&si);
        DWORD_PTR process_affinity;
        DWORD_PTR system_affinity;
        ::GetProcessAffinityMask(GetCurrentProcess(), &process_affinity, &system_affinity);
        std::cout << "System info reports: " << si.dwNumberOfProcessors << " CPUs. Affinity mask: "
            << std::showbase << std::hex << process_affinity << " / " << system_affinity << std::noshowbase << std::dec << ".\n";
    };

    print_ga();

    // Pin one sleeping thread into each of the first two groups; the handles
    // are intentionally leaked in this throwaway test.
    for (WORD g = 0; g < 2; g++)
    {
        HANDLE h = ::CreateThread(nullptr, 0, [](PVOID)->DWORD { ::Sleep(INFINITE); return 0; }, nullptr, 0, nullptr);

        GROUP_AFFINITY ga{};
        ga.Group = g;
        ga.Mask = 0x0000003;
        ::SetThreadGroupAffinity(h, &ga, nullptr);
    }

    std::cout << "\n\nAfter having 2 threads in 2 groups:\n\n";

    print_ga();

    return 0;
}

@BillyONeal
Member

I think the right behavior is to do this change, then fall back to the old method if the result is 0. I'll do some experimentation tomorrow; Windows folks pointed me to this knob for testing: https://docs.microsoft.com/en-us/windows-hardware/drivers/devtest/boot-parameters-to-test-drivers-for-multiple-processor-group-support

(A 3970X that the system thinks has 4 sockets?)

[screenshot omitted]

@AlexGuteniev
Contributor Author

I am too curious to avoid trying myself.

So, my results show that even after I successfully called SetThreadGroupAffinity to put threads into different groups, GetProcessAffinityMask still returns nonzero affinity masks.

Maybe it is not covered by the simulation knob. Maybe the documentation for GetProcessAffinityMask is wrong or outdated.

Another source of concern is truncation of masks in a 32-bit process, and there are even two cases:

  • 32-bit process on 32-bit system
  • 32-bit process on 64-bit system

@AlexGuteniev
Contributor Author

If it were possible to propose Windows API functions, I would propose the following:

  1. A number-of-CPUs function, which can take process affinity into account and can avoid a kernel call (maybe both features via flags passed to it)
  2. A recommendation on spinning, like "don't", "avoid", "do", based on the number of CPUs and possibly other things like low-power state

As for this pull request, maybe it should be closed if full understanding is not reachable.

@BillyONeal
Member

If it is possible to propose Windows API functions,

I believe the right place for something like that is the Windows feedback tool. Of course even if they add such a thing it'll be some time before we can use it here :)

@BillyONeal
Member

OK, I did some testing and I think this breaks 32-bit programs, since the status quo is that systems with more than 32 CPUs will do the right thing for them:

C:\Users\billy\Desktop>.\before.exe
The hardware concurrency is: 64

C:\Users\billy\Desktop>.\before32.exe
The hardware concurrency is: 64

As a result I think we should leave this unchanged.

@AlexGuteniev AlexGuteniev deleted the hw_conc branch March 18, 2020 03:23
@AlexGuteniev
Contributor Author

On the good side, bcdedit.exe /set groupsize 1 will make the existing std::hardware_concurrency return 1, so there is already a way to test this case.

@BillyONeal
Member

On the good side, bcdedit.exe /set groupsize 1 will make the existing std::hardware_concurrency return 1, so there is already a way to test this case.

I'm referring to the opposite. On my 32c/64t system, hardware_concurrency currently returns 64 but returns 32 with your change.

@AlexGuteniev
Contributor Author

I understand that the change breaks this case.
I mean there is already a way to make std::hardware_concurrency return 1 without my change.
