Skip to content

Synchronized claiming of jobs for processing#1121

Merged
mattias-p merged 11 commits into
zonemaster:developfrom
mattias-p:sync-claim
Nov 2, 2023
Merged

Synchronized claiming of jobs for processing#1121
mattias-p merged 11 commits into
zonemaster:developfrom
mattias-p:sync-claim

Conversation

@mattias-p

@mattias-p mattias-p commented Jul 19, 2023

Copy link
Copy Markdown
Member

Purpose

Make it possible for multiple Test Agents to process the same queue without multiple workers running the same job.

In #1115 @matsduf said:

When a large batch is to be tested, adding several Test Agents increases performance if the server has large capacity. Several servers could also be used. One limitation is that current code only allows one Test Agent per queue. All [jobs] in a batch always belong to the same queue.

Meta

In this PR description I'm using the term "job" instead of the traditional term "test". Speaking of the testing of jobs is less confusing than speaking of the testing of tests.

However in the code changes I used the term "test" in anticipation of "the great renaming".

Context

Changes

Reviewing

The easiest way to review this is probably to do it commit by commit.

New constants

  • $TEST_{WAITING,RUNNING,COMPLETED,CRASHED,LAPSED} are introduced to represent the different states of a job.

New methods

  • test_state() is introduced to query the state of a job.
  • claim_test() is introduced to transition jobs from "waiting" to "running".

Updated methods

  • get_test_request() synchronizes its claiming of jobs.
  • when get_test_request() finds a job that it then fails to claim, it immediately tries to find and claim another one.
  • test_progress() is no longer able to update the state of a job.
  • test_progress() is no longer able to update the progress value of a job that is not "running".
  • test_progress() is no longer able to decrease the progress value of a job.
  • test_progress(0) is now interpreted as a request to set the progress value to zero, rather than a request to read the current progress value.
  • store_results() is no longer able to update a job that is not "running".

New test files

  • t/queue.t is introduced to test the claiming of waiting jobs. The t/lifecycle.t already has some code to exercise reuse of jobs which is quite related. However I wanted to make sure the test database was clean for the claiming tests and this is what I ended up with. Maybe t/lifecycle.t should be reserved for testing individual jobs and t/queue.t could test the enqueuing and dequeuing of jobs.

How to test this PR

  • Run a large batch with one Test Agent and nothing should be changed.
  • Run a large batch with two Test Agents (the same queue) and there should be no cases when the same job is run by more than one worker.

@mattias-p mattias-p requested review from a user, hannaeko, marc-vanderwal, matsduf and tgreenx July 19, 2023 14:59
@mattias-p mattias-p added the V-Minor Versioning: The change gives an update of minor in version. label Jul 19, 2023
@mattias-p mattias-p added this to the v2023.2 milestone Jul 19, 2023
Comment thread lib/Zonemaster/Backend/DB.pm Outdated
Comment thread lib/Zonemaster/Backend/DB.pm Outdated
Comment thread lib/Zonemaster/Backend/DB.pm Outdated
Comment thread lib/Zonemaster/Backend/DB.pm

@matsduf matsduf left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. As far as I can see it fully replaces #1115. I hope that @matsstralbergiis will be able to run a test with many domain names and at least two Test Agents soon to see that there in fact are no collisions or something else that seems to break.

@matsstralbergiis

matsstralbergiis commented Jul 23, 2023

Copy link
Copy Markdown
Collaborator

I have tonight run a batch with 11004 zones and used three testagents with this DB.pm and it seems to work.
When examining the log files from all three testagents I get the same number of tests finished.
grep "Test completed" /var/log/zonemaster/zm-testagent*.log |wc 11004 77028 1243441
The three testagents seems to take an equal share of the load (3710, 3648, 3646).

@matsduf

matsduf commented Jul 23, 2023

Copy link
Copy Markdown
Contributor

I have tonight run a batch with 11004 zones and used three testagents with this DB.pm and it seems to work.

@matsstralbergiis, was there any test that was started by more than one Test Agent?

@matsstralbergiis

Copy link
Copy Markdown
Collaborator

As the sum of tests completed in the log files matches the number I submitted to test I would say no.
When running the same zones without the updated DB.pm I get 20-30% more "tests completed" in the log files.

@ghost ghost left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!
A few nitpicking remarks:

  • I left a comment on a documentation line that feels unclear to me
  • I realize you don't use the new test_state() method, but it's nice to have it.

edit: I updated the PR description to remove the TEST_CRASHED and TEST_LAPSED constants.

If there are no waiting tests to claim, C<undef> is returned for both ids.

Only tests in the "waiting" state are considered.
When a test is claimed it is removed from the queue and it transitions to the

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"it is removed from the queue": is the test removed from the queue because its state changed? If so, maybe we could update the sentence to make it clearer.

suggestion:
"When a test is claimed it transitions to the "running" state. The test is now unavailable in its queue."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both "removed from the queue" and "unavailable in its queue" is not very clear. Every test belongs to a queue through the queue field in the test_results table. Queue 0 is the default queue, and the test will remain in its queue forever. Rather "The test is now unavailable for other Test Agent processes."

@mattias-p mattias-p merged commit 4d78a06 into zonemaster:develop Nov 2, 2023
@marc-vanderwal

Copy link
Copy Markdown
Contributor

I tested the combination of this PR with the experimental Clickhouse support (see #1094) and it doesn’t work well (see my comment for additional details). It seems this is due to the lack of proper transaction isolation when Clickhouse performs ALTER TABLE … UPDATE queries.

So it’s important to point out that the idea in this PR only works if Zonemaster-Engine is configured to use a proper DBMS that does have some sort of atomic UPDATE. I’m going to try with PostgreSQL now…

@marc-vanderwal

Copy link
Copy Markdown
Contributor

Release testing report – No technical issues, one documentation issue

Rocky Linux 9.3 + PostgreSQL 13.11

The best way of running more than one instance of zm-testagent on the same machine is not documented yet. I opened an issue about it (see zonemaster/zonemaster#1234). I duplicated the systemd unit file as described here with satisfactory results.

In the testing procedure, the meaning of “large batch” was left unspecified. I picked an arbitrary value of 1 000 because I felt it was a reasonably large batch and a reasonable compromise in terms of runtime; it runs in about 50 to 60 minutes on two test agents on my test VM. The batch consists of 1 000 random subdomains under .fr from an internal data source.

With one test agent, I witnessed nothing out of the extraordinary in terms of behavior. I waited for the completion of about 500 tests before assuming this change does not break anything.

With two test agents, one agent found, started and completed 493 tests and the second one did 507 tests. This correctly adds up to 1 000 and one can reasonably assume that the load was spread evenly.

@marc-vanderwal marc-vanderwal added the S-ReleaseTested Status: The PR has been successfully tested in release testing label Dec 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-ReleaseTested Status: The PR has been successfully tested in release testing V-Minor Versioning: The change gives an update of minor in version.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants