Synchronized claiming of jobs for processing#1121
Conversation
The immediate use case is to unit testing.
* Add unit test for test progress * Add unit test for claiming of waiting tests * Add documentation * Various tiny refactors
* test_progress() * store_results()
Removes one of its failure modes.
matsduf
left a comment
There was a problem hiding this comment.
I think this looks good. As far as I can see it fully replaces #1115. I hope that @matsstralbergiis will be able to run a test with many domain names and at least two Test Agents soon to see that there in fact are no collisions or something else that seems to break.
|
I have tonight run a batch with 11004 zones and used three testagents with this DB.pm and it seems to work. |
@matsstralbergiis, was there any test that was started by more than one Test Agent? |
|
As the sum of tests completed in the log files matches the number I submitted to test I would say no. |
There was a problem hiding this comment.
Very nice!
A few nitpicking remarks:
- I left a comment on a documentation line that feels unclear to me
- I realize you don't use the new
test_state()method, but it's nice to have it.
edit: I updated the PR description to remove the TEST_CRASHED and TEST_LAPSED constants.
| If there are no waiting tests to claim, C<undef> is returned for both ids. | ||
|
|
||
| Only tests in the "waiting" state are considered. | ||
| When a test is claimed it is removed from the queue and it transitions to the |
There was a problem hiding this comment.
"it is removed from the queue": is the test removed from the queue because its state changed? If so, maybe we could update the sentence to make it clearer.
suggestion:
"When a test is claimed it transitions to the "running" state. The test is now unavailable in its queue."
There was a problem hiding this comment.
Both "removed from the queue" and "unavailable in its queue" is not very clear. Every test belongs to a queue through the queue field in the test_results table. Queue 0 is the default queue, and the test will remain in its queue forever. Rather "The test is now unavailable for other Test Agent processes."
|
I tested the combination of this PR with the experimental Clickhouse support (see #1094) and it doesn’t work well (see my comment for additional details). It seems this is due to the lack of proper transaction isolation when Clickhouse performs ALTER TABLE … UPDATE queries. So it’s important to point out that the idea in this PR only works if Zonemaster-Engine is configured to use a proper DBMS that does have some sort of atomic UPDATE. I’m going to try with PostgreSQL now… |
|
Release testing report – No technical issues, one documentation issue Rocky Linux 9.3 + PostgreSQL 13.11 The best way of running more than one instance of zm-testagent on the same machine is not documented yet. I opened an issue about it (see zonemaster/zonemaster#1234). I duplicated the systemd unit file as described here with satisfactory results. In the testing procedure, the meaning of “large batch” was left unspecified. I picked an arbitrary value of 1 000 because I felt it was a reasonably large batch and a reasonable compromise in terms of runtime; it runs in about 50 to 60 minutes on two test agents on my test VM. The batch consists of 1 000 random subdomains under With one test agent, I witnessed nothing out of the extraordinary in terms of behavior. I waited for the completion of about 500 tests before assuming this change does not break anything. With two test agents, one agent found, started and completed 493 tests and the second one did 507 tests. This correctly adds up to 1 000 and one can reasonably assume that the load was spread evenly. |
Purpose
Make it possible for multiple Test Agents to process the same queue without multiple workers running the same job.
In #1115 @matsduf said:
Meta
In this PR description I'm using the term "job" instead of the traditional term "test". Speaking of the testing of jobs is less confusing than speaking of the testing of tests.
However in the code changes I used the term "test" in anticipation of "the great renaming".
Context
Changes
Reviewing
The easiest way to review this is probably to do it commit by commit.
New constants
$TEST_{WAITING,RUNNING,COMPLETED,CRASHED,LAPSED}are introduced to represent the different states of a job.New methods
test_state()is introduced to query the state of a job.claim_test()is introduced to transition jobs from "waiting" to "running".Updated methods
get_test_request()synchronizes its claiming of jobs.get_test_request()finds a job that it then fails to claim, it immediately tries to find and claim another one.test_progress()is no longer able to update the state of a job.test_progress()is no longer able to update the progress value of a job that is not "running".test_progress()is no longer able to decrease the progress value of a job.test_progress(0)is now interpreted as a request to set the progress value to zero, rather than a request to read the current progress value.store_results()is no longer able to update a job that is not "running".New test files
t/queue.tis introduced to test the claiming of waiting jobs. Thet/lifecycle.talready has some code to exercise reuse of jobs which is quite related. However I wanted to make sure the test database was clean for the claiming tests and this is what I ended up with. Maybet/lifecycle.tshould be reserved for testing individual jobs andt/queue.tcould test the enqueuing and dequeuing of jobs.How to test this PR