jobs: fix test for batch jobs creation, marked as flaky#69014
craig[bot] merged 1 commit into cockroachdb:master from
Conversation
ajwerner
left a comment
The test runs out of memory sometimes as it creates a very
large batch of jobs.
While I could be fine with removing this test, I think we need to unpack this analysis more deeply. 5k jobs is not a lot of memory. I see the log from the failure says could not mark as succeeded: log-job: root: memory budget exceeded: 10240 bytes requested, 134215680 currently allocated, 134217728 bytes in budget, but I think we need to ask what this means, as it's far-fetched that 5k empty jobs would use 132 MiB.
Please reference the issue that this PR corresponds to.
Reviewable status:
complete! 0 of 0 LGTMs obtained
ajwerner
left a comment
It seems to me like the problem here is that all these jobs are getting adopted, and we really don't want them to be. In the course of running adoption, we then spend a good bit of time actually adopting these jobs. What I think you should do is create a testing knob to disable job adoption and use it in this test. See
cockroach/pkg/jobs/registry.go
Line 1315 in 852934c
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @sajjadrizvi)
pkg/jobs/registry_test.go, line 351 at r1 (raw file):
Quoted 4 lines of code…
tdb.CheckQueryResultsRetry(t, "SELECT count(*) FROM [SHOW JOBS]",
	[][]string{{fmt.Sprintf("%d", test.batchSize)}})
for _, id := range jobIDs {
	tdb.CheckQueryResultsRetry(t,
		fmt.Sprintf("SELECT status FROM system.jobs WHERE id = '%d'", id),
In retrospect this should not have been CheckQueryResultsRetry as it should not need to retry.
Yes, the main culprit here is retries.
Really? What makes you say that?
I am wrong. I saw the failure message
It's hard to say whether you're wrong, but at least I don't think what you said is very clear. Any memory allocated to the query which returned that error would be freed by the time that error returns, or at least it should be. It does not accumulate forever. The memory monitoring framework is mostly just hooked up to the query execution layer, and once that error comes back to the jobs layer, the corresponding memory account will have been closed.
Commit cockroachdb#67991 introduced a test that turned out to be flaky. The test runs out of memory sometimes as it creates a very large batch of jobs. This fix disables job adoptions to avoid large memory use.

Release note: None

Fixes: cockroachdb#68962
efb7c3b to
913a1f4
Compare
ajwerner
left a comment
Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @sajjadrizvi)
TFTR! bors r=ajwerner
Build failed (retrying...):
Build succeeded: |