Synchronized claiming of jobs for processing by mattias-p · Pull Request #1121 · zonemaster/zonemaster-backend

mattias-p · 2023-07-19T14:58:45Z

Purpose

Make it possible for multiple Test Agents to process the same queue without multiple workers running the same job.

When a large batch is to be tested, adding several Test Agents increases throughput if the server has enough capacity. Several servers could also be used. One limitation is that the current code only allows one Test Agent per queue, and all jobs in a batch always belong to the same queue.

Context

Replaces "Run several Test Agents against the same queue" (#1115).

Changes

Reviewing

The easiest way to review this is probably to do it commit by commit.

New constants

$TEST_{WAITING,RUNNING,COMPLETED} are introduced to represent the different states of a job. (CRASHED and LAPSED were struck from the original list.)

New methods

test_state() is introduced to query the state of a job.
claim_test() is introduced to transition jobs from "waiting" to "running".
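
The waiting-to-running transition can be made safe with a single conditional UPDATE. The following is a minimal Python/SQLite sketch of that idea, not the code in DB.pm: the table layout, the numeric state values, and the helper get_state() (a stand-in for test_state()) are assumptions made for illustration. Only the method names claim_test() and test_state() and the three state names come from this PR.

```python
import sqlite3

# Assumed numeric encodings; the PR only names the constants.
TEST_WAITING, TEST_RUNNING, TEST_COMPLETED = 0, 1, 2

# Hypothetical one-table schema standing in for the real job table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state INTEGER NOT NULL)")
db.execute("INSERT INTO jobs (id, state) VALUES (1, ?)", (TEST_WAITING,))


def get_state(job_id):
    """Return the state of a job, or None if the job does not exist."""
    row = db.execute("SELECT state FROM jobs WHERE id = ?", (job_id,)).fetchone()
    return row[0] if row else None


def claim_test(job_id):
    """Move a job from waiting to running; return True only if this call did it."""
    cur = db.execute(
        "UPDATE jobs SET state = ? WHERE id = ? AND state = ?",
        (TEST_RUNNING, job_id, TEST_WAITING),
    )
    db.commit()
    # rowcount is 1 if the row was still waiting, 0 if it had already been claimed.
    return cur.rowcount == 1


first = claim_test(1)   # the job was waiting, so this claim succeeds
second = claim_test(1)  # the job is already running, so this claim fails
```

Because the state check and the state change happen in one statement, a caller learns from the affected-row count whether it won the claim, without a separate read.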

Updated methods

get_test_request() synchronizes its claiming of jobs.
When get_test_request() finds a job that it then fails to claim, it immediately tries to find and claim another one.
test_progress() is no longer able to update the state of a job.
test_progress() is no longer able to update the progress value of a job that is not "running".
test_progress() is no longer able to decrease the progress value of a job.
test_progress(0) is now interpreted as a request to set the progress value to zero, rather than a request to read the current progress value.
store_results() is no longer able to update a job that is not "running".
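
The stricter rules for test_progress() and store_results() can be illustrated with guarded UPDATE statements. This is a sketch under assumptions, not the DB.pm implementation: the schema, the state encodings, the helper names set_progress() and get_progress(), and the choice to signal rejection with a boolean are all hypothetical. How the real methods report a rejected update is not stated in this description.

```python
import sqlite3

TEST_WAITING, TEST_RUNNING, TEST_COMPLETED = 0, 1, 2  # assumed encodings

# Hypothetical schema: job 1 is running, job 2 is still waiting.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, state INTEGER NOT NULL,"
    " progress INTEGER NOT NULL DEFAULT 0, results TEXT)"
)
db.executemany(
    "INSERT INTO jobs (id, state) VALUES (?, ?)",
    [(1, TEST_RUNNING), (2, TEST_WAITING)],
)


def set_progress(job_id, value):
    """Accept the value only for a running job and only if it does not decrease."""
    cur = db.execute(
        "UPDATE jobs SET progress = ? WHERE id = ? AND state = ? AND progress <= ?",
        (value, job_id, TEST_RUNNING, value),
    )
    return cur.rowcount == 1


def store_results(job_id, results):
    """Store results only for a running job (assumed here to also complete it)."""
    cur = db.execute(
        "UPDATE jobs SET state = ?, results = ? WHERE id = ? AND state = ?",
        (TEST_COMPLETED, results, job_id, TEST_RUNNING),
    )
    return cur.rowcount == 1


def get_progress(job_id):
    return db.execute("SELECT progress FROM jobs WHERE id = ?", (job_id,)).fetchone()[0]


outcomes = [
    set_progress(1, 0),      # zero is a value to set, not a request to read
    set_progress(1, 50),     # progress may increase
    set_progress(1, 30),     # a decrease is rejected
    set_progress(2, 10),     # a job that is not running is rejected
    store_results(1, "{}"),  # results are accepted for a running job
    store_results(1, "{}"),  # the job is no longer running, so this is rejected
]
```

Note that set_progress() never touches the state column, which mirrors the rule that test_progress() can no longer update the state of a job.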

New test files

t/queue.t is introduced to test the claiming of waiting jobs. t/lifecycle.t already has some code that exercises reuse of jobs, which is closely related. However, I wanted to make sure the test database was clean for the claiming tests, and this is what I ended up with. Maybe t/lifecycle.t should be reserved for testing individual jobs, and t/queue.t could test the enqueuing and dequeuing of jobs.

How to test this PR

Run a large batch with one Test Agent; nothing should have changed.
Run a large batch with two Test Agents on the same queue; no job should be run by more than one worker.

matsduf

I think this looks good. As far as I can see it fully replaces #1115. I hope that @matsstralbergiis will be able to run a test with many domain names and at least two Test Agents soon to see that there in fact are no collisions or something else that seems to break.

matsstralbergiis · 2023-07-23T18:16:15Z

I have tonight run a batch with 11004 zones and used three testagents with this DB.pm and it seems to work.
When examining the log files from all three testagents, I get the same number of finished tests as I submitted.
grep "Test completed" /var/log/zonemaster/zm-testagent*.log | wc
11004 77028 1243441
The three testagents seem to take an equal share of the load (3710, 3648, 3646).

matsduf · 2023-07-23T19:42:20Z

> I have tonight run a batch with 11004 zones and used three testagents with this DB.pm and it seems to work.

@matsstralbergiis, was there any test that was started by more than one Test Agent?

matsstralbergiis · 2023-07-23T19:55:21Z

Since the sum of tests completed in the log files matches the number I submitted for testing, I would say no.
When running the same zones without the updated DB.pm, I get 20-30% more "tests completed" in the log files.

ghost

Very nice!
A few nitpicking remarks:

I left a comment on a documentation line that feels unclear to me
I realize you don't use the new test_state() method, but it's nice to have it.

edit: I updated the PR description to remove the TEST_CRASHED and TEST_LAPSED constants.

ghost · 2023-09-25T13:52:42Z

+If there are no waiting tests to claim, C<undef> is returned for both ids.
+
+Only tests in the "waiting" state are considered.
+When a test is claimed it is removed from the queue and it transitions to the


"it is removed from the queue": is the test removed from the queue because its state changed? If so, maybe we could update the sentence to make it clearer.

suggestion:
"When a test is claimed it transitions to the "running" state. The test is now unavailable in its queue."

matsduf · 2023-09-25

Neither "removed from the queue" nor "unavailable in its queue" is very clear. Every test belongs to a queue through the queue field in the test_results table. Queue 0 is the default queue, and the test will remain in its queue forever. Better: "The test is now unavailable for other Test Agent processes."

marc-vanderwal · 2023-12-13T11:01:20Z

I tested the combination of this PR with the experimental Clickhouse support (see #1094) and it doesn’t work well (see my comment for additional details). It seems this is due to the lack of proper transaction isolation when Clickhouse performs ALTER TABLE … UPDATE queries.

So it’s important to point out that the idea in this PR only works if Zonemaster-Engine is configured to use a proper DBMS that does have some sort of atomic UPDATE. I’m going to try with PostgreSQL now…
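
Why an atomic UPDATE matters can be shown with a deterministic interleaving. This sketch says nothing about how Clickhouse behaves internally; it only contrasts a check-then-write pattern with a single conditional UPDATE, using SQLite and hypothetical table and function names.

```python
import sqlite3

WAITING, RUNNING = 0, 1  # assumed state encodings

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state INTEGER NOT NULL)")
db.execute("INSERT INTO jobs (id, state) VALUES (1, ?)", (WAITING,))


def sees_waiting(job_id):
    row = db.execute("SELECT state FROM jobs WHERE id = ?", (job_id,)).fetchone()
    return row[0] == WAITING


def set_running(job_id):
    # Unconditional write: it cannot tell whether someone else got there first.
    db.execute("UPDATE jobs SET state = ? WHERE id = ?", (RUNNING, job_id))


def atomic_claim(job_id):
    # Check and write in one statement; the row count reveals who won.
    cur = db.execute(
        "UPDATE jobs SET state = ? WHERE id = ? AND state = ?",
        (RUNNING, job_id, WAITING),
    )
    return cur.rowcount == 1


# Unsafe interleaving: agents A and B both check before either one writes.
a_ok = sees_waiting(1)
b_ok = sees_waiting(1)
if a_ok:
    set_running(1)
if b_ok:
    set_running(1)
unsafe_claims = int(a_ok) + int(b_ok)  # both agents believe they own the job

# Reset the job and repeat with the conditional UPDATE.
db.execute("UPDATE jobs SET state = ? WHERE id = 1", (WAITING,))
safe_claims = int(atomic_claim(1)) + int(atomic_claim(1))
```

With the check and the write separated, both agents end up running the job. With the conditional UPDATE, exactly one of them does, provided the database applies the statement atomically.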

marc-vanderwal · 2023-12-13T14:40:22Z

Release testing report – No technical issues, one documentation issue

Rocky Linux 9.3 + PostgreSQL 13.11

The best way of running more than one instance of zm-testagent on the same machine is not documented yet. I opened an issue about it (see zonemaster/zonemaster#1234). I duplicated the systemd unit file as described here with satisfactory results.

In the testing procedure, the meaning of “large batch” was left unspecified. I picked an arbitrary value of 1 000 because I felt it was a reasonably large batch and a reasonable compromise in terms of runtime; it runs in about 50 to 60 minutes on two test agents on my test VM. The batch consists of 1 000 random subdomains under .fr from an internal data source.

With one test agent, I witnessed nothing out of the ordinary in terms of behavior. I waited for the completion of about 500 tests before assuming this change does not break anything.

With two test agents, one agent found, started and completed 493 tests and the second one did 507 tests. This correctly adds up to 1 000 and one can reasonably assume that the load was spread evenly.

mattias-p added 8 commits July 19, 2023 14:40

Diagnostics: Better feedback when test crashes halfway through

b1a6d56

New: test_state()

64838da

The immediate use case is unit testing.

Maintenance: Touch up existing functionality

d62e4de

* Add unit test for test progress * Add unit test for claiming of waiting tests * Add documentation * Various tiny refactors

Refactor: Dedicated method for transition to "running"

534da0c

Fix: Make some functions stricter

512c829

* test_progress() * store_results()

Fix: Make test_progress(0) mean set progress to 0

40f2cdb

Fix: Make get_test_request() synchronize claims

0b27c4c

Fix: Make get_test_request() try again

472b030

Removes one of its failure modes.

mattias-p requested review from a user, hannaeko, marc-vanderwal, matsduf and tgreenx July 19, 2023 14:59

mattias-p added the "V-Minor" label (Versioning: The change gives an update of minor in version.) Jul 19, 2023

mattias-p added this to the v2023.2 milestone Jul 19, 2023

mattias-p mentioned this pull request Jul 19, 2023

New result entries table #1092

Merged

mattias-p added 2 commits July 20, 2023 10:31

Documentation for constants

03fa8e9

Remove ugly hack

e6c9378

matsduf reviewed Jul 20, 2023

Comment thread lib/Zonemaster/Backend/DB.pm Outdated

Comment thread lib/Zonemaster/Backend/DB.pm Outdated

Clean up

c5cc56a

matsduf reviewed Jul 21, 2023

Comment thread lib/Zonemaster/Backend/DB.pm Outdated

Comment thread lib/Zonemaster/Backend/DB.pm

matsduf reviewed Jul 21, 2023

matsduf approved these changes Jul 23, 2023

mattias-p mentioned this pull request Jul 24, 2023

Improved control over queue usage #1124

Open

ghost approved these changes Sep 25, 2023

matsduf mentioned this pull request Nov 2, 2023

Run several Test Agents against the same queue #1115

Closed

mattias-p merged commit 4d78a06 into zonemaster:develop Nov 2, 2023

matsduf mentioned this pull request Nov 23, 2023

Adds missing files in MANIFEST #1136

Merged

matsduf mentioned this pull request Dec 11, 2023

PRs for v2023.2 zonemaster/zonemaster#1232

Closed

This was referenced Dec 13, 2023

Running multiple test agents in Zonemaster::Backend needs documentation zonemaster/zonemaster#1234

Open

Clickhouse: new database engine (experimental) #1094

Closed

marc-vanderwal added the "S-ReleaseTested" label (Status: The PR has been successfully tested in release testing) Dec 13, 2023

mattias-p mentioned this pull request Feb 13, 2024

Make testagent allocate test atomically #857

Closed

mattias-p merged 11 commits into zonemaster:develop from mattias-p:sync-claim
