bpo-34060: Report system load when running test suite for Windows #8357

ammaraskar · 2018-07-20T20:26:42Z

This is (mostly) a pure Python implementation of the other PR. It leverages the typeperf command, which monitors performance counters and outputs them at a given interval. So every 5 seconds, typeperf can output the processor queue length to stdout.
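As a rough illustration of the kind of output being consumed, the sketch below parses typeperf-style CSV lines. The counter path, the sample line layout (a quoted timestamp followed by a quoted value), and the helper name parse_queue_lengths are assumptions made for illustration, not code taken from this PR.

```python
# Assumed counter path for the processor queue length.
COUNTER_NAME = r"\System\Processor Queue Length"


def parse_queue_lengths(output):
    """Return the counter values found in typeperf-style CSV output.

    Assumes data lines look like: "07/19/2018 01:32:26.605","3.000000"
    """
    values = []
    for line in output.splitlines():
        fields = line.split(",")
        # Skip blank lines, the CSV header naming the counter, and anything
        # that does not consist of exactly a timestamp and a value.
        if not line.strip() or COUNTER_NAME in line or len(fields) != 2:
            continue
        try:
            values.append(float(fields[1].strip().strip('"')))
        except ValueError:
            # Ignore lines whose value field is not a number.
            continue
    return values


sample = (
    '"(PDH-CSV 4.0)","\\\\HOST\\System\\Processor Queue Length"\n'
    '"07/19/2018 01:32:26.605","3.000000"\n'
    '"07/19/2018 01:32:31.605","0.000000"\n'
)
print(parse_queue_lengths(sample))
```

The header and malformed lines are skipped rather than raising, since a monitoring helper of this kind should probably never break the test run itself.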

subprocess.stdout.readline is a blocking call. Using a thread seemed like an obvious solution, but we can't achieve this with multiprocessing or a thread, because like Victor speculated in the previous bug report on this, it conflicts with test_multiprocessing and test_threading. Hence, I opted to use the asynchronous/overlapped IO API, which lets us poll the pipe without blocking. Most of the diff pertains to using this rather low-level API.

This is almost a pure Python implementation, but there was one edge case where it would fail: when the Python interpreter running the test suite crashes, it leaves behind an orphaned typeperf process that refuses to die. So when the test suite is run with -j x and this situation happens:

python -m test -j2
├── python <test runner>
│   └── typeperf.exe
└── python *CRASHED*
    └── typeperf.exe

The big test-coordinating Python process will wait forever for the crashed Python, and consequently typeperf, to terminate, which just doesn't happen by default on Windows. After reading up on the APIs, the right way to fix this is to use a Job Object to ask the OS to kill the child when the parent dies. Hence, there is a change in _winapi to make this happen. Unlike the last PR, this API is actually reusable and fit to be exposed to the public. It could even make implementing things like bpo-5115 a lot easier.

https://bugs.python.org/issue34060

zooba

I like this! And I'll be happy to have support for job objects in there too :)

Looking at the test runs, the numbers seem to be consistent with other platforms, but since I don't have as good a feel for what to expect here, I'd like someone else who's been involved in this or the previous PR to sign off as well.

zooba · 2018-07-20T21:27:29Z

Modules/_winapi.c

We should surround the non-Python parts of this function with Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS to release the GIL. Similarly for the Assign function.

zooba · 2018-07-20T21:29:33Z

Lib/test/libregrtest/main.py

This can just be self.getloadavg = self.load_tracker.getloadavg

And since there are no later references to self.load_tracker, that doesn't need to be kept on the object (the references via the method will keep it from being deallocated)

Done, thanks :)
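The point about the bound method keeping the tracker alive can be sketched as follows. LoadTracker and Runner here are hypothetical stand-ins, not the classes from the PR; the weak reference exists only to observe that the object survives.

```python
import gc
import weakref


class LoadTracker:
    """Hypothetical stand-in for the tracker object discussed above."""

    def getloadavg(self):
        return 0.0


class Runner:
    def __init__(self):
        tracker = LoadTracker()
        # Weak reference used only to observe whether the tracker is alive.
        self.tracker_ref = weakref.ref(tracker)
        # Store just the bound method; it holds a strong reference to the
        # tracker through its __self__ attribute.
        self.getloadavg = tracker.getloadavg


runner = Runner()
gc.collect()
alive = runner.tracker_ref() is not None
print(alive, runner.getloadavg())
```

Because the bound method references its instance, no separate self.load_tracker attribute is needed just to keep the object from being deallocated.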

ammaraskar · 2018-07-20T21:44:35Z

I'd like someone else who's been involved in this or the previous PR to sign off as well.

Yeah I'd definitely like to get @vstinner's take on it since he is likely familiar with normal load values.

eryksun · 2018-07-20T23:37:05Z

Modules/_winapi.c

HANDLE_return_converter requires INVALID_HANDLE_VALUE for an error.

Good catch, thanks.

vstinner · 2018-07-21T00:14:36Z

subprocess.stdout.readline is a blocking call. Using a thread seemed like an obvious solution, but we can't achieve this with multiprocessing or a thread, because like Victor speculated in the previous bug report on this, it conflicts with test_multiprocessing and test_threading.

Did I say that? A thread is fine here. regrtest already uses threads to run tests in subprocesses when using -jN. faulthandler also uses a C thread to implement a hard timeout (dumping the Python traceback on timeout). regrtest is full of threads :-) Overlapped IO may be more complicated than a thread, no?

eryksun · 2018-07-21T04:29:53Z

Lib/test/libregrtest/utils.py

Python still supports Windows 7, which only allows a process to be assigned to one job at a time. If the OS version is prior to NT 6.2 (i.e. sys.getwindowsversion() < (6, 2)), use the creation flag CREATE_BREAKAWAY_FROM_JOB (0x01000000) to try to break the child process out of the current job.

Aah, thank you. I've been testing on Windows 10 and didn't read the documentation carefully enough. Is there any particular reason to scope it down to <NT 6.2? Will there be a problem with just using CREATE_BREAKAWAY_FROM_JOB as the creationflags all the time?

There's no reason to break the child out of the job hierarchy in Windows 8+, especially since jobs are used more frequently (e.g. to implement silos, such as for Windows containers).

Also, trying to break away from the parent job complicates creating the process. I forgot to mention that Popen will fail with a PermissionError if the child isn't allowed to break away. This can be handled by retrying without CREATE_BREAKAWAY_FROM_JOB. In this case you could skip calling AssignProcessToJobObject. For example (untested):

    CREATE_BREAKAWAY_FROM_JOB = 0x01000000
    cflags = 0 if sys.getwindowsversion() >= (6, 2) else CREATE_BREAKAWAY_FROM_JOB
    assign_to_job = True
    try:
        self.p = subprocess.Popen(command, stdout=command_stdout, creationflags=cflags)
    except PermissionError:
        if cflags == 0:
            raise
        self.p = subprocess.Popen(command, stdout=command_stdout)
        assign_to_job = False
    if assign_to_job:
        _winapi.AssignProcessToJobObject(job_group, self.p._handle)

eryksun · 2018-07-21T04:33:24Z

Lib/test/libregrtest/utils.py

Call this in a try/except that handles failure either by ignoring it or with a warning. AssignProcessToJobObject will fail in Windows 7 if the parent process (i.e. the current Python process) is in a job that doesn't allow child processes to break away.

Thanks, I'll issue a warning, since this is to handle the rather rare case of the interpreter actually crashing in -jN mode.

eryksun · 2018-07-21T04:38:23Z

Modules/_winapi.c

Maybe this should reflect the WinAPI structure, i.e. the BasicLimitInformation field. Also, it would be nice to have PyWin32 interoperability, which does the latter and also uses dicts instead of simple namespaces.

Hmm, so I based this on CreateProcess which also uses attributes in a passed in object for this information. What do you mean by reflect the structure? Do you mean support all the fields for BasicLimitInformation?

For reference, here is how subprocess uses it with CreateProcess:

cpython/Lib/subprocess.py

Lines 129 to 137 in 06ca3f0

    class STARTUPINFO:
        def __init__(self, *, dwFlags=0, hStdInput=None, hStdOutput=None,
                     hStdError=None, wShowWindow=0, lpAttributeList=None):
            self.dwFlags = dwFlags
            self.hStdInput = hStdInput
            self.hStdOutput = hStdOutput
            self.hStdError = hStdError
            self.wShowWindow = wShowWindow
            self.lpAttributeList = lpAttributeList or {"handle_list": []}

I just used simple namespace because I was being lazy

Sometimes a dict is used, e.g. STARTUPINFO uses a dict for the new lpAttributeList support in 3.7. win32job uses dicts, but it's fine to stick with namespaces instead. I still would rather LimitFlags be set in a BasicLimitInformation attribute instead of at the top level. _winapi is low-level and undocumented, so it's best if it mirrors the actual API, if it's not overly awkward. I didn't mean to support all fields, however. _winapi gets extended only as required.
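The difference between the two layouts can be sketched with plain namespaces. This assumes the flag in question is JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, which matches the "kill the child when the parent dies" goal but is not named in the thread. The helper limit_flags is hypothetical, and nothing here calls into _winapi.

```python
from types import SimpleNamespace

# Value of JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE from the Windows SDK headers.
JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE = 0x2000

# Flat layout: LimitFlags sits at the top level of the passed-in object.
flat = SimpleNamespace(LimitFlags=JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE)

# Nested layout: mirrors JOBOBJECT_EXTENDED_LIMIT_INFORMATION, whose
# BasicLimitInformation member carries LimitFlags.
nested = SimpleNamespace(
    BasicLimitInformation=SimpleNamespace(
        LimitFlags=JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE,
    ),
)


def limit_flags(info):
    """Read LimitFlags from the nested layout (hypothetical helper)."""
    return info.BasicLimitInformation.LimitFlags


print(hex(limit_flags(nested)))
```

The nested form costs one extra attribute access but leaves room for other members of the structure to be added later without renaming anything.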

ammaraskar · 2018-07-21T05:19:35Z

This is the failure that shows up when using a thread:

0:00:21 load avg: 0.07 [ 16/417] test__xxsubinterpreters
test test__xxsubinterpreters failed -- Traceback (most recent call last):
  File "C:\Users\ammar\workspace\cpython\lib\test\test__xxsubinterpreters.py", line 473, in test_main
    self.assertTrue(interpreters.is_running(main))
RuntimeError: interpreter has more than one thread

ammaraskar · 2018-07-21T12:37:32Z

So I think I jumped the gun early with the job grouping stuff. There was a much easier solution to dealing with interpreter crashes in -jN mode. Only run the load tracking subprocess in the main interpreter coordinating the children. It's the only one that needs the information since it prints out the progress reports.

This is now actually just pure python and consequently a lot simpler.
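The check this implies can be sketched as below. The function name should_track_load is hypothetical, and the slaveargs parameter stands for the parsed worker arguments mentioned later in the review (None in the coordinating process).

```python
import sys


def should_track_load(platform, slaveargs):
    """Decide whether this process should start the load tracker.

    Only the coordinating process prints progress reports, so worker
    processes spawned by -jN (which receive slaveargs) skip tracking.
    """
    return platform == "win32" and slaveargs is None


# In the coordinating process no worker arguments were passed.
print(should_track_load(sys.platform, None))
```

With this guard a crashed worker never owns a typeperf child, so there is nothing left to orphan.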

vstinner · 2018-07-23T14:00:10Z

Lib/test/libregrtest/main.py

I would prefer to import the class there.

vstinner · 2018-07-23T14:00:24Z

Lib/test/libregrtest/main.py

I dislike lambda. Would you mind defining a method here using "def"?

I prefer lambda for this sort of short stuff but done

vstinner · 2018-07-23T14:01:00Z

Lib/test/libregrtest/utils.py

Please put the new code in a new file, like libregrtest/winloadavg.py.

vstinner · 2018-07-23T14:01:10Z

Lib/test/libregrtest/utils.py

Please put the docstring inside the class.

vstinner · 2018-07-23T14:01:30Z

Lib/test/libregrtest/utils.py

Please don't use __del__() but add a close() method, for example.

Hmm, are you sure about this? It would require us to keep a reference to load_tracker and it needs an:

    if self.load_tracker:
        self.load_tracker.stop()

because we don't have a load tracker to stop on non windows systems.
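A minimal sketch of the explicit-close approach, including the guard for platforms without a tracker. LoadTracker and Runner are hypothetical stand-ins; only the idea of closing the tracker from a finalize step comes from the discussion.

```python
import sys


class LoadTracker:
    """Hypothetical tracker that owns a resource needing cleanup."""

    def __init__(self):
        self.closed = False

    def close(self):
        # Idempotent: safe to call more than once.
        self.closed = True


class Runner:
    def __init__(self, platform=sys.platform):
        # No tracker exists on non-Windows systems.
        self.load_tracker = LoadTracker() if platform == "win32" else None

    def finalize(self):
        # Explicit cleanup instead of relying on __del__ timing.
        if self.load_tracker is not None:
            self.load_tracker.close()
            self.load_tracker = None


runner = Runner("win32")
tracker = runner.load_tracker
runner.finalize()
print(tracker.closed)
```

The cost is the extra reference and the None check described above; the benefit is that cleanup happens at a known point rather than whenever the object happens to be collected.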

vstinner · 2018-07-23T14:02:48Z

Lib/test/libregrtest/utils.py

What if the last line is incomplete: doesn't end with a newline character? Maybe you should put it back into a buffer, and concatenate it to the output, next time. Maybe use .splitlines(True) to check if there is a newline character?

I don't think it's worth adding the extra complexity to handle this. Worst case is we miss a single point or two of data and the load number is slightly off.
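For reference, the buffering suggested in the review comment could look roughly like this. It is a sketch of the suggestion, not something the PR adopted; LineBuffer is a hypothetical name, and treating only "\n" as a line terminator is an assumption.

```python
class LineBuffer:
    """Accumulate text chunks and return only newline-terminated lines."""

    def __init__(self):
        self._pending = ""

    def feed(self, chunk):
        data = self._pending + chunk
        lines = data.splitlines(True)  # keepends=True
        self._pending = ""
        # If the final piece does not end with a newline, hold it back
        # until the rest of the line arrives in a later chunk.
        if lines and not lines[-1].endswith("\n"):
            self._pending = lines.pop()
        return [line.rstrip("\r\n") for line in lines]


buf = LineBuffer()
first = buf.feed('"t1","1.0"\n"t2","2')
second = buf.feed('.0"\n')
print(first, second)
```

The partial second record is withheld from the first call and delivered whole by the second, which is the behavior the review comment asks about.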

vstinner · 2018-07-23T14:03:40Z

Lib/test/libregrtest/utils.py

Please add a comment to document the unit (seconds, no?).

vstinner · 2018-07-23T14:05:54Z

Lib/test/libregrtest/utils.py

It's possible that this function is only called every 5 minutes. I expect a lot of output in this case.

Why not running this function inside a thread?

This failure shows up when running in a thread: #8357 (comment)

Even at 5 minutes, that's 60 points of data. Even if the typeperf command puts out 100 bytes of output per line, it's not enough to saturate the buffer. And even if it does, those data points will get picked up eventually by the next call.
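The arithmetic behind that estimate, written out. The 100 bytes per line is the generous figure from the reply, not a measured value, and the variable names are illustrative.

```python
SAMPLE_INTERVAL = 5     # seconds between typeperf samples
CALL_GAP = 5 * 60       # seconds between two reads in the worst case above
BYTES_PER_LINE = 100    # generous per-line estimate used in the reply

pending_lines = CALL_GAP // SAMPLE_INTERVAL
pending_bytes = pending_lines * BYTES_PER_LINE
print(pending_lines, pending_bytes)  # 60 6000
```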

vstinner · 2018-07-23T14:07:25Z

Lib/test/libregrtest/main.py

It seems like you spawn a subprocess even in test worker processes, whereas you wrote that it's useless. This code should be moved after handling slaveargs, no? Maybe move the slaveargs handling code earlier?

Not sure I follow, we already know if the runner is a slave based on the parsing of the arguments. The arguments have been parsed by this point, so this should be fine. I also tested and under task manager only one typeperf instance shows up under the main python process unlike before.

bedevere-bot · 2018-07-23T14:07:33Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

And if you don't make the requested changes, you will be poked with soft cushions!

ammaraskar · 2018-08-10T07:46:00Z

I have made the requested changes; please review again

bedevere-bot · 2018-08-10T07:46:03Z

Thanks for making the requested changes!

@zooba, @vstinner: please review the changes made to this pull request.

csabella · 2019-01-20T20:53:13Z

It looks like there was a lot of interest and activity on this PR a few months ago and @zooba had approved it. @ammaraskar, could you resolve the merge conflict? I think that might be all that is needed, along with @vstinner's approval for merging.

Thanks!

It seems like my comments have been addressed.

vstinner · 2019-01-23T17:12:27Z

I'm sorry, but I don't have the bandwidth right now to review this change (test it manually).

@ammaraskar: You have to update your PR, there is now a conflict.

@zware: If you are confident that the change is good, please go ahead and merge it (once the CI tests pass and the conflict is resolved).

zware · 2019-02-01T17:50:30Z

I haven't researched whether this is the best way to do this to any extent, but this looks fine to me.

While Windows exposes the system processor queue length, the raw value used for load calculations on Unix systems, it does not provide an API to access the averaged value. Hence to calculate the load we must track and average it ourselves. We can't use multiprocessing or a thread to read it in the background while the tests run since using those would conflict with test_multiprocessing and test_xxsubprocess. Thus, we use Windows' asynchronous IO API to run the tracker in the background with it sampling at the correct rate. When we wish to access the load we check to see if there's new data on the stream; if there is, we update our load values.
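One way to "track and average it ourselves" is an exponentially weighted moving average in the style of the Unix load calculation. The sketch below assumes that approach and a one-minute window; the decay formula and the names LOAD_FACTOR_1 and update_load are assumptions for illustration, not confirmed by this thread.

```python
import math

SAMPLE_INTERVAL = 5  # seconds between samples (from the PR description)
# Decay factor for a one-minute average; the exp(-interval/60) form follows
# the Unix-style load calculation and is an assumption here.
LOAD_FACTOR_1 = math.exp(-SAMPLE_INTERVAL / 60)


def update_load(load, queue_length):
    """Fold one processor-queue-length sample into the running average."""
    return load * LOAD_FACTOR_1 + queue_length * (1.0 - LOAD_FACTOR_1)


load = 0.0
for sample in [2.0] * 200:  # a steady queue length of 2
    load = update_load(load, sample)
print(round(load, 3))
```

Under a steady input the average converges to that input, and each new sample only nudges the value, which is what makes a backlog of samples cheap to process in one go.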

ammaraskar · 2019-02-10T02:32:49Z

Thanks for the reminder Cheryl, appreciate it! :)

@zware Just fixed the merge conflict, please take a look.
@zooba If you're available for a re-review, that would be appreciated as well.

csabella · 2019-04-09T12:29:57Z

Based on @zooba's approval and the other consensus on this, I've merged the PR. Thanks @ammaraskar for the PR and to @zware, @eryksun, @vstinner, and @zooba for the reviews! 🙂

vstinner · 2019-04-11T11:00:03Z

Thanks @ammaraskar!

* Clean up code which checked presence of os.{stat,lstat,chmod} (GH-11643)
  (cherry picked from commit 8377cd4)

* bpo-36725: regrtest: add TestResult type (GH-12960)
  * Add TestResult and MultiprocessResult types to ensure that results always have the same fields.
  * runtest() now handles KeyboardInterrupt
  * accumulate_result() and format_test_result() now takes a TestResult
  * cleanup_test_droppings() is now called by runtest() and mark the test as ENV_CHANGED if the test leaks support.TESTFN file.
  * runtest() now includes code "around" the test in the test timing
  * Add print_warning() in test.libregrtest.utils to standardize how libregrtest logs warnings to ease parsing the test output.
  * support.unload() is now called with abstest rather than test_name
  * Rename 'test' variable/parameter to 'test_name'
  * dash_R(): remove unused the_module parameter
  * Remove unused imports
  (cherry picked from commit 4d29983)

* bpo-36725: Refactor regrtest multiprocessing code (GH-12961)
  Rewrite run_tests_multiprocess() function as a new MultiprocessRunner class with multiple methods to better report errors and stop immediately when needed.
  Changes:
  * Worker processes are now killed immediately if tests are interrupted or if a test does crash (CHILD_ERROR): worker processes are killed.
  * Rewrite how errors in a worker thread are reported to the main thread. No longer ignore BaseException or parsing errors silently.
  * Remove 'finished' variable: use worker.is_alive() instead
  * Always compute omitted tests. Add Regrtest.get_executed() method.
  (cherry picked from commit 3cde440)

* bpo-36719: regrtest always detect uncollectable objects (GH-12951)
  regrtest now always detects uncollectable objects. Previously, the check was only enabled by --findleaks. The check now also works with -jN/--multiprocess N. --findleaks becomes a deprecated alias to --fail-env-changed.
  (cherry picked from commit 75120d2)

* bpo-34060: Report system load when running test suite for Windows (GH-8357)
  While Windows exposes the system processor queue length, the raw value used for load calculations on Unix systems, it does not provide an API to access the averaged value. Hence to calculate the load we must track and average it ourselves. We can't use multiprocessing or a thread to read it in the background while the tests run since using those would conflict with test_multiprocessing and test_xxsubprocess. Thus, we use Windows' asynchronous IO API to run the tracker in the background with it sampling at the correct rate. When we wish to access the load we check to see if there's new data on the stream; if there is, we update our load values.
  (cherry picked from commit e16467a)

* bpo-36719: Fix regrtest re-run (GH-12964)
  Properly handle a test which fail but then pass. Add test_rerun_success() unit test.
  (cherry picked from commit 837acc1)

* bpo-36719: regrtest closes explicitly WindowsLoadTracker (GH-12965)
  Regrtest.finalize() now closes explicitly the WindowsLoadTracker instance.
  (cherry picked from commit 00db7c7)

the-knights-who-say-ni added the CLA signed label Jul 20, 2018

bedevere-bot added the awaiting review label Jul 20, 2018

ammaraskar force-pushed the windows_load2 branch from ede0a5a to da62440 Compare July 20, 2018 20:40

zooba approved these changes Jul 20, 2018

bedevere-bot added awaiting merge and removed awaiting review labels Jul 20, 2018

zooba reviewed Jul 20, 2018

eryksun reviewed Jul 20, 2018

eryksun reviewed Jul 21, 2018

ammaraskar force-pushed the windows_load2 branch from fc8c05d to 63b57b2 Compare July 21, 2018 12:35

vstinner previously requested changes Jul 23, 2018

bedevere-bot removed the awaiting merge label Jul 23, 2018

bedevere-bot added the awaiting changes label Jul 23, 2018

ammaraskar force-pushed the windows_load2 branch 2 times, most recently from 6990d2c to a51f181 Compare July 25, 2018 02:38

bedevere-bot added awaiting change review and removed awaiting changes labels Aug 10, 2018

ammaraskar added 3 commits February 9, 2019 17:06

Move windows specific code to its own file

00a0895

Move imports to top of file

5c0e275

ammaraskar force-pushed the windows_load2 branch 3 times, most recently from fb0a14d to 31a71df Compare February 9, 2019 22:44

Add comment explaining check in libregrtest

a8df864

ammaraskar force-pushed the windows_load2 branch from 31a71df to a8df864 Compare February 10, 2019 02:13

csabella merged commit e16467a into python:master Apr 9, 2019

bedevere-bot removed the awaiting change review label Apr 9, 2019

ammaraskar mentioned this pull request Apr 10, 2019

bpo-34060: Report system load when running test suite for Windows #8287

Closed
