Skip to content

jobs: fix test for batch jobs creation, marked as flaky#69014

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
sajjadrizvi:remove_extra_test_batch_jobs
Aug 17, 2021
Merged

jobs: fix test for batch jobs creation, marked as flaky#69014
craig[bot] merged 1 commit intocockroachdb:masterfrom
sajjadrizvi:remove_extra_test_batch_jobs

Conversation

@sajjadrizvi
Copy link
Copy Markdown

@sajjadrizvi sajjadrizvi commented Aug 16, 2021

Commit #67991 introduced a test that turned out to be flaky.
The test runs out of memory sometimes as it creates a very
large batch of jobs. This fix disables job adoptions to avoid
large memory use.

Release note: None

Fixes: #68962

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test runs out of memory sometimes as it creates a very
large batch of jobs.

While I could be fine removing this test, I think we need to unpack this analysis a little more deeply. 5k jobs is not a lot of memory. I see the log from the failure says could not mark as succeeded: log-job: root: memory budget exceeded: 10240 bytes requested, 134215680 currently allocated, 134217728 bytes in budget but I think we need to ask what this means as it's farfetched that 5k empty jobs would use a 132MiB.


please reference the issue that this PR corresponds to.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me like the problem here is that all these jobs are getting adopted and we really don't want them to. In the course of running the adoption, we then do spend a good bit of time actually adopting these jobs. What I think you should do is create a testing knob to disable job adoption and use it in this test. See

func (r *Registry) adoptionDisabled(ctx context.Context) bool {

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sajjadrizvi)


pkg/jobs/registry_test.go, line 351 at r1 (raw file):

Quoted 4 lines of code…
				tdb.CheckQueryResultsRetry(t, "SELECT count(*) FROM [SHOW JOBS]",
					[][]string{{fmt.Sprintf("%d", test.batchSize)}})
				for _, id := range jobIDs {
					tdb.CheckQueryResultsRetry(t, fmt.Sprintf("SELECT status FROM system.jobs WHERE id = '%d'", id),

In retrospect this should not have been CheckQueryResultsRetry as it should not need to retry.

@sajjadrizvi
Copy link
Copy Markdown
Author

In retrospect this should not have been CheckQueryResultsRetry as it should not need to retry.

Yes, the main culprit here is retries.

@ajwerner
Copy link
Copy Markdown
Contributor

Yes, the main culprit here is retries.

Really? What makes you say that?

@sajjadrizvi
Copy link
Copy Markdown
Author

Really? What makes you say that?

I am wrong. I saw the failure message adoption completed with error job ‹684636351025086465›: could not mark as succeeded, but it itself is because of running out of memory.

@ajwerner
Copy link
Copy Markdown
Contributor

It's hard to say whether you're wrong but at least I don't think what you said is very clear. Any of that memory allocated to the query which returned that error would be free by the time that error returns, or, at least, it should be. It does not accumulate forever. The memory monitoring framework is mostly just hooked up to the query execution layer and once that error comes back to the jobs layer, the corresponding memory account will have been closed.

Commit cockroachdb#67991 introduced a test that turned out to be flaky.
The test runs out of memory sometimes as it creates a very
large batch of jobs. This fix disables job adoptions to
avoid large memory use.

Release note: None

Fixes: cockroachdb#68962
@sajjadrizvi sajjadrizvi force-pushed the remove_extra_test_batch_jobs branch from efb7c3b to 913a1f4 Compare August 17, 2021 14:24
@sajjadrizvi sajjadrizvi changed the title jobs: remove unnecessary test for batch jobs creation jobs: fix test for batch jobs creation, marked as flaky Aug 17, 2021
Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sajjadrizvi)

@sajjadrizvi sajjadrizvi marked this pull request as ready for review August 17, 2021 18:57
@sajjadrizvi sajjadrizvi requested a review from a team August 17, 2021 18:57
@sajjadrizvi
Copy link
Copy Markdown
Author

TFTR!

bors r=ajwerner

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 17, 2021

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 17, 2021

Build succeeded:

@craig craig bot merged commit 5d2c91c into cockroachdb:master Aug 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

jobs: TestBatchJobsCreation failed

3 participants