<chrono>: steady_clock::now() is avoidably slow #2085

@randomascii

**Describe the bug**
The conversion from QPC ticks to nanoseconds in `steady_clock::now()`, together with the expensive method of caching the QPC frequency, means that `steady_clock::now()` is more than twice as expensive as raw QPC code (about 2.5x in the test run below). Special-casing the very common 10 MHz QPC frequency (all of my machines have it) makes the conversion cost almost disappear. Marking `_Freq` as `static` and relying on magic statics gives a further speedup.
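
For readers without a Windows machine, the two ideas above can be sketched in portable C++. This is only an illustration under stated assumptions: `ticks_to_ns`, `ticks_to_ns_general`, `query_frequency_stub`, and `cached_frequency` are hypothetical names, and the stub merely stands in for the real frequency query, whose caching inside the STL is not shown in this issue.

```cpp
#include <cstdint>

constexpr std::int64_t kNanosPerSecond = 1'000'000'000;
constexpr std::int64_t kTenMHz = 10'000'000;

// General path: convert whole seconds and the sub-second remainder
// separately, so ctr * kNanosPerSecond is never formed directly.
constexpr std::int64_t ticks_to_ns_general(std::int64_t ctr, std::int64_t freq) {
    const std::int64_t whole = (ctr / freq) * kNanosPerSecond;
    const std::int64_t part = (ctr % freq) * kNanosPerSecond / freq;
    return whole + part;
}

// Fast path: at 10 MHz one tick is exactly 100 ns, so a single multiply
// replaces the two divisions of the general path.
constexpr std::int64_t ticks_to_ns(std::int64_t ctr, std::int64_t freq) {
    return freq == kTenMHz ? ctr * 100 : ticks_to_ns_general(ctr, freq);
}

// Both paths agree at 10 MHz, and the general path handles other rates.
static_assert(ticks_to_ns(12'345'678'901, kTenMHz) ==
                  ticks_to_ns_general(12'345'678'901, kTenMHz),
              "fast path must match the general path");
static_assert(ticks_to_ns(24, 24'000'000) == 1000,
              "24 ticks at 24 MHz are 1000 ns");

// Hypothetical stand-in for the frequency query; it counts its calls so the
// caching behavior can be observed.
static int g_query_calls = 0;
std::int64_t query_frequency_stub() {
    ++g_query_calls;
    return kTenMHz;
}

// A function-local static ("magic static") is initialized once, on first
// use, and that initialization is thread-safe since C++11.
std::int64_t cached_frequency() {
    static const std::int64_t freq = query_frequency_stub();
    return freq;
}
```

Calling `cached_frequency()` any number of times runs the stub once. The `ctr * 100` shortcut overflows only when the result no longer fits in a signed 64-bit nanosecond count at all, which is roughly 292 years of ticks at 10 MHz.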

This was also filed as https://developercommunity.visualstudio.com/t/steady-clocknow-is-avoidably-slow/1490700

**Command-line test case**

C:\Temp>type ClockTests.cpp
#include <windows.h>

#include <chrono>
#include <cstdio>

struct steady_clock_fast { // wraps QueryPerformanceCounter
  using rep                       = long long;
  using period                    = std::nano;
  using duration                  = std::chrono::nanoseconds;
  using time_point                = _CHRONO time_point<steady_clock_fast>;
  static constexpr bool is_steady = true;

  _NODISCARD static time_point now() noexcept { // get current time
    static const long long _Freq = _Query_perf_frequency(); // doesn't change after system boot
    const long long _Ctr  = _Query_perf_counter();
    static_assert(period::num == 1, "This assumes period::num == 1.");
    // 10 MHz is a very common QPC frequency on modern PCs. Optimizing for
    // this specific frequency can double the performance of this function by
    // avoiding the expensive frequency conversion path.
    if (_Freq == 10000000) {
      // At 10 MHz each tick is exactly 100 ns (period::den / _Freq).
      return time_point(duration(_Ctr * 100));
    }
    else {
      // Instead of just having "(_Ctr * period::den) / _Freq",
      // the algorithm below prevents overflow when _Ctr is sufficiently large.
      // It assumes that _Freq * period::den does not overflow, which is currently true for nano period.
      // It is not realistic for _Ctr to accumulate to large values from zero with this assumption,
      // but the initial value of _Ctr could be large.
      const long long _Whole = (_Ctr / _Freq) * period::den;
      const long long _Part = (_Ctr % _Freq) * period::den / _Freq;
      return time_point(duration(_Whole + _Part));
    }
  }
};

constexpr int kIterations = 10000000;

class Timer {
public:
  Timer(const char* label) : label_(label) {
    QueryPerformanceCounter(&start_);
  }
  ~Timer() {
    LARGE_INTEGER end;
    QueryPerformanceCounter(&end);
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    printf("Took %5.1f ms for %s.\n", 1000.0 * (end.QuadPart - start_.QuadPart) / double(freq.QuadPart), label_);
  }
private:
  const char* label_;
  LARGE_INTEGER start_;
};

LARGE_INTEGER t1[2];

void __declspec(noinline) QPCSpeed() {
  Timer timer("QPCSpeed");
  for (int i = 0; i < kIterations; ++i) {
    LARGE_INTEGER count;
    QueryPerformanceCounter(&count);
    t1[i & 1] = count;
  }
}

std::chrono::steady_clock::time_point t2[2];

void __declspec(noinline) ChronoSpeed() {
  Timer timer("ChronoSpeed");
  for (int i = 0; i < kIterations; ++i) {
    t2[i & 1] = std::chrono::steady_clock::now();
  }
}

steady_clock_fast::time_point t3[2];

void __declspec(noinline) ChronoSpeedFast() {
  Timer timer("ChronoSpeedFast");
  for (int i = 0; i < kIterations; ++i) {
    t3[i & 1] = steady_clock_fast::now();
  }
}

int main()
{
  LARGE_INTEGER freq;
  QueryPerformanceFrequency(&freq);
  printf("QPC frequency is %1.1f MHz.\n", freq.QuadPart / 1e6);
  if (freq.QuadPart == 10000000)
    printf("Frequency is 10 MHz - optimized code path will be engaged.\n");
  else
    printf("Frequency is not 10 MHz - optimized code path will be skipped and steady_clock::now() will be slow.\n");

  for (int i = 0; i < 3; ++i) {
    QPCSpeed();
    ChronoSpeedFast();
    ChronoSpeed();
    printf("\n");
  }
}

C:\Temp>cl /O2 ClockTests.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.30.30401 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.

ClockTests.cpp
Microsoft (R) Incremental Linker Version 14.30.30401.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:ClockTests.exe
ClockTests.obj

C:\Temp>.\ClockTests.exe
QPC frequency is 10.0 MHz.
Frequency is 10 MHz - optimized code path will be engaged.
Took 383.8 ms for QPCSpeed.
Took 419.9 ms for ChronoSpeedFast.
Took 966.2 ms for ChronoSpeed.

Took 382.5 ms for QPCSpeed.
Took 423.2 ms for ChronoSpeedFast.
Took 964.9 ms for ChronoSpeed.

Took 383.7 ms for QPCSpeed.
Took 419.2 ms for ChronoSpeedFast.
Took 966.2 ms for ChronoSpeed.
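
To put the totals above in per-call terms, the arithmetic can be written out as a small sketch. It uses the first run's numbers and the 10,000,000 iterations from the test program; `ns_per_call` is a hypothetical helper name, and the derived figures are approximate.

```cpp
// Convert a total time in milliseconds into nanoseconds per call.
constexpr double ns_per_call(double total_ms, long long iterations) {
    return total_ms * 1e6 / static_cast<double>(iterations);
}

constexpr long long kCalls = 10'000'000;  // kIterations in the test program

constexpr double kQpcNs = ns_per_call(383.8, kCalls);     // about 38.4 ns
constexpr double kFastNs = ns_per_call(419.9, kCalls);    // about 42.0 ns
constexpr double kChronoNs = ns_per_call(966.2, kCalls);  // about 96.6 ns

// Relative cost of steady_clock::now() in this run.
constexpr double kChronoVsQpc = kChronoNs / kQpcNs;    // about 2.5x
constexpr double kChronoVsFast = kChronoNs / kFastNs;  // about 2.3x
```

On these numbers the special-cased clock adds under 4 ns per call over raw `QueryPerformanceCounter`, while the current `steady_clock::now()` adds close to 60 ns.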

**STL version**
```
Microsoft Visual Studio Professional 2022 Preview (64-bit)
Version 17.0.0 Preview 2.1
```

**Additional context**
`steady_clock::now()` is more than twice as expensive as raw QPC calls, which makes it harder to justify writing portable code. This slowness is blocking/complicating this issue: https://github.com/ninja-build/ninja/issues/2004#issuecomment-887888546
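
For context, the kind of portable code in question might look like the sketch below, which times a piece of work with `std::chrono::steady_clock` only and no Windows headers. `time_call_ms` is a hypothetical helper for illustration and is not taken from ninja.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: run `work` once and return the elapsed wall time in
// milliseconds, measured with the portable steady clock.
template <typename F>
double time_call_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Each measurement costs two `now()` calls, so code that takes many short measurements feels the per-call overhead directly.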

Labels: fixed, performance