-
-
Notifications
You must be signed in to change notification settings - Fork 33.9k
bpo-34060: Report system load when running test suite for Windows #8357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ede0a5a to
da62440
Compare
zooba
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this! And I'll be happy to have support for job objects in there too :)
Looking at the test runs, the numbers seem to be consistent with other platforms, but since I don't have as good a feel for what to expect here, I'd like someone else who's been involved in this or the previous PR to sign off as well.
Modules/_winapi.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should surround the non-Python parts of this function with _Py_BEGIN_ALLOW_THREADS and _Py_END_ALLOW_THREADS to release the GIL. Similarly for the Assign function.
Lib/test/libregrtest/main.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can just be self.getloadavg = self.load_tracker.getloadavg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And since there are no later references to self.load_tracker, that doesn't need to be kept on the object (the references via the method will keep it from being deallocated)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks :)
Yeah I'd definitely like to get @vstinner's take on it since he is likely familiar with normal load values. |
Modules/_winapi.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HANDLE_return_converter requires INVALID_HANDLE_VALUE for an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks.
Did I say thatI? A thread is fine here. regrtest already uses threads to run tests in subprocesses when using -jN. faulthandler also uses a C thread to implement an hard timeout (dumping the Python traceback on timeout). regrtest is full of threads :-) Overlapped IO may be more complicated than a thread, no? |
Lib/test/libregrtest/utils.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python still supports Windows 7, which only allows a process to be assigned to one job at a time. If the OS version is prior to NT 6.2 (i.e. sys.getwindowsversion() < (6, 2)), use the creation flag
CREATE_BREAKAWAY_FROM_JOB (0x01000000) to try to break the child process out of the current job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah thank you, I've been testing on Windows 10 and didn't read the documentation carefully enough, is there any particular reason to scope it down to <NT 6.2? Will there be a problem with just using CREATE_BREAKAWAY_FROM_JOB as the creationflags all the time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no reason to break the child out of the job hierarchy in Windows 8+, especially since jobs are used more frequently (e.g. to implement silos, such as for Windows containers).
Also, trying to break away from the parent job complicates creating the process. I forgot to mention that Popen will fail with a PermissionError if the child isn't allowed to break away. This can be handled by retrying without CREATE_BREAKAWAY_FROM_JOB. In this case you could skip calling AssignProcessToJobObject. For example (untested):
CREATE_BREAKAWAY_FROM_JOB = 0x01000000
cflags = 0 if sys.getwindowsversion() >= (6, 2) else CREATE_BREAKAWAY_FROM_JOB
assign_to_job = True
try:
self.p = subprocess.Popen(command, stdout=command_stdout,
creationflags=cflags)
except PermissionError:
if cflags == 0:
raise
self.p = subprocess.Popen(command, stdout=command_stdout)
assign_to_job = False
if assign_to_job:
_winapi.AssignProcessToJobObject(job_group, self.p._handle)
Lib/test/libregrtest/utils.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this in a try/except that handles failure either by ignoring it or with a warning. AssignProcessToJobObject will fail in Windows 7 if the parent process (i.e. the current Python process) is in a job that doesn't allow child processes to break away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll issue a warning since this to handle a rather rare case of the interpreter actually crashing in -jN mode.
Modules/_winapi.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should reflect the WinAPI structure, i.e. the BasicLimitInformation field. Also, it would be nice to have PyWin32 interoperability, which does the latter and also uses dicts instead of simple namespaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so I based this on CreateProcess which also uses attributes in a passed in object for this information. What do you mean by reflect the structure? Do you mean support all the fields for BasicLimitInformation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, here is how subprocess passes uses it with CreateProcess
Lines 129 to 137 in 06ca3f0
| class STARTUPINFO: | |
| def __init__(self, *, dwFlags=0, hStdInput=None, hStdOutput=None, | |
| hStdError=None, wShowWindow=0, lpAttributeList=None): | |
| self.dwFlags = dwFlags | |
| self.hStdInput = hStdInput | |
| self.hStdOutput = hStdOutput | |
| self.hStdError = hStdError | |
| self.wShowWindow = wShowWindow | |
| self.lpAttributeList = lpAttributeList or {"handle_list": []} |
I just used simple namespace because I was being lazy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes a dict is used, e.g. STARTUPINFO uses a dict for the new lpAttributeList support in 3.7. win32job uses dicts, but it's fine to stick with namespaces instead. I still would rather LimitFlags be set in a BasicLimitInformation attribute instead of at the top level. _winapi is low-level and undocumented, so it's best if it mirrors the actual API, if it's not overly awkward. I didn't meant to support all fields, however. _winapi gets extended only as required.
|
This is the failure that shows up when using a thread: |
fc8c05d to
63b57b2
Compare
|
So I think I jumped the gun early with the job grouping stuff. There was a much easier solution to dealing with interpreter crashes in This is now actually just pure python and consequently a lot simpler. |
Lib/test/libregrtest/main.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to import the class there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Lib/test/libregrtest/main.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dislike lambda. Would you mind to define a method here using "def" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer lambda for this sort of short stuff but done
Lib/test/libregrtest/utils.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put the new code in a new file, like libregrtest/winloadavg.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Lib/test/libregrtest/utils.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put the docstring inside the class.
Lib/test/libregrtest/utils.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't use del() but add a close() method for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, are you sure about this? It would require us to keep a reference to load_tracker and it needs an:
if self.load_tracker:
self.load_tracker.stop()because we don't have a load tracker to stop on non windows systems.
Lib/test/libregrtest/utils.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the last line is incomplete: doesn't end with a newline character? Maybe you should put it back into a buffer, and concatenate it to the output, next time. Maybe use .splitlines(True) to check if there is a newline character?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think its worth adding the extra complexity to handle this. Worst case is we miss a single point or two of data and the load number is slightly off.
Lib/test/libregrtest/utils.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment to document the unit (seconds, no?).
Lib/test/libregrtest/utils.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible that this function is only called every 5 minutes. I expect a lot of output in this case.
Why not running this function inside a thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This failure shows up when running in a thread: #8357 (comment)
Even at 5 minutes, that's 60 points of data. Even if the typeperf command puts out 100 bytes of output per line its not enough to saturate the buffer. And even if it does, those data points will get picked up eventually by the next call.
Lib/test/libregrtest/main.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I seems like you spawn a subprocess even in test worker processes, whereas you wrote that it's useless. This code should be moved after handling slaveargs, no? Maybe move the slaveargs handling code earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow, we already know if the runner is a slave based on the parsing of the arguments. The arguments have been parsed by this point, so this should be fine. I also tested and under task manager only one typeperf instance shows up under the main python process unlike before.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase And if you don't make the requested changes, you will be poked with soft cushions! |
6990d2c to
a51f181
Compare
|
I have made the requested changes; please review again |
|
It looks like there was a lot of interest and activity on this PR a few months ago and @zooba had approved it. @ammaraskar, could you resolve the merge conflict? I think that might be all that is needed, along with @vstinner's approval for merging. Thanks! |
It seems like my comments have been addressed.
|
I'm sorry but I don't have the bandwidth right to review this change (test it manually). @ammaraskar: You have to update your PR, there is now a conflict. @zware: If you are confident that the change is good, please go ahead and merge it (once CI tests and the conflict is solved). |
|
I haven't researched whether this is the best way to do this to any extent, but this looks fine to me. |
While Windows exposes the system processor queue length: the raw value used for load calculations on Unix systems, it does not provide an API to access the averaged value. Hence to calculate the load we must track and average it ourselves. We can't use multiprocessing or a thread to read it in the background while the tests run since using those would conflict with test_multiprocessing and test_xxsubprocess. Thus, we use Window's asynchronous IO API to run the tracker in the background with it sampling at the correct rate. When we wish to access the load we check to see if there's new data on the stream, if there is, we update our load values.
fb0a14d to
31a71df
Compare
31a71df to
a8df864
Compare
|
Thanks @ammaraskar! |
* Clean up code which checked presence of os.{stat,lstat,chmod} (GH-11643)
(cherry picked from commit 8377cd4)
* bpo-36725: regrtest: add TestResult type (GH-12960)
* Add TestResult and MultiprocessResult types to ensure that results
always have the same fields.
* runtest() now handles KeyboardInterrupt
* accumulate_result() and format_test_result() now takes a TestResult
* cleanup_test_droppings() is now called by runtest() and mark the
test as ENV_CHANGED if the test leaks support.TESTFN file.
* runtest() now includes code "around" the test in the test timing
* Add print_warning() in test.libregrtest.utils to standardize how
libregrtest logs warnings to ease parsing the test output.
* support.unload() is now called with abstest rather than test_name
* Rename 'test' variable/parameter to 'test_name'
* dash_R(): remove unused the_module parameter
* Remove unused imports
(cherry picked from commit 4d29983)
* bpo-36725: Refactor regrtest multiprocessing code (GH-12961)
Rewrite run_tests_multiprocess() function as a new MultiprocessRunner
class with multiple methods to better report errors and stop
immediately when needed.
Changes:
* Worker processes are now killed immediately if tests are
interrupted or if a test does crash (CHILD_ERROR): worker
processes are killed.
* Rewrite how errors in a worker thread are reported to
the main thread. No longer ignore BaseException or parsing errors
silently.
* Remove 'finished' variable: use worker.is_alive() instead
* Always compute omitted tests. Add Regrtest.get_executed() method.
(cherry picked from commit 3cde440)
* bpo-36719: regrtest always detect uncollectable objects (GH-12951)
regrtest now always detects uncollectable objects. Previously, the
check was only enabled by --findleaks. The check now also works with
-jN/--multiprocess N.
--findleaks becomes a deprecated alias to --fail-env-changed.
(cherry picked from commit 75120d2)
* bpo-34060: Report system load when running test suite for Windows (GH-8357)
While Windows exposes the system processor queue length, the raw value
used for load calculations on Unix systems, it does not provide an API
to access the averaged value. Hence to calculate the load we must track
and average it ourselves. We can't use multiprocessing or a thread to
read it in the background while the tests run since using those would
conflict with test_multiprocessing and test_xxsubprocess.
Thus, we use Window's asynchronous IO API to run the tracker in the
background with it sampling at the correct rate. When we wish to access
the load we check to see if there's new data on the stream, if there is,
we update our load values.
(cherry picked from commit e16467a)
* bpo-36719: Fix regrtest re-run (GH-12964)
Properly handle a test which fail but then pass.
Add test_rerun_success() unit test.
(cherry picked from commit 837acc1)
* bpo-36719: regrtest closes explicitly WindowsLoadTracker (GH-12965)
Regrtest.finalize() now closes explicitly the WindowsLoadTracker
instance.
(cherry picked from commit 00db7c7)
This is (mostly) a pure Python implementation of the other PR. It leverages the
typeperfcommand which monitors performance counters and outputs them at a given interval. So every 5 seconds,typeperfcan output the processor queue length into stdout.subprocess.stdout.readlineis a blocking call. Using a thread seemed like an obvious solution, but we can't achieve this with multiprocessing or a thread, because like Victor speculated in the previous bug report on this, it conflicts withtest_multiprocessingandtest_threading. Hence, I opted to use the asynchronous/overlapped IO API which was designed forasync. Most of the diff actually just pertains to using this rather low level API.This is almost a pure python implementation but there was one edge case where this would fail. Namely, when the python interpreter running the test suite crashes, this leaves an orphaned
typeperfprocess running which refuses to die. This means that when the test suite is run with-j xand this situation happens:The big test coordinating python process will wait forever on the crashed python and consequently
typeperfto terminate, which just doesn't happen by default in Windows. After reading up on the APIs, the right way to fix this is by using a Job Object to ask the OS to kill the child when the parent dies. Hence, there is a change in_winapito make this happen. Unlike the last PR, this API is actually reusable and fit to be exposed to the public. It could even allow implementing things like bpo-5115 to be a lot easier.https://bugs.python.org/issue34060