jobs,sql,server: add structured job execution failure log to job info#75556
Conversation
|
@miretskiy this is in support of this design (internal). |
ea34362 to
280525a
Compare
miretskiy
left a comment
There was a problem hiding this comment.
Would be nice to include new output example in PR description. Probably would help doc writes too.
Reviewed 6 of 11 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @jocrl)
pkg/jobs/jobs.go, line 1068 at r1 (raw file):
} msg, err := protoreflect.MessageToJSON(&ev, protoreflect.FmtFlags{ EmitDefaults: false,
Do we want to also set EmitRedacted: true? (this would probably make Parse method not work ... but presumably we should try to redact?
pkg/server/serverpb/admin.proto, line 635 at r1 (raw file):
// ExecutionFailures is a log of execution failures of the job. It is not // guaranteed to contain all execution failures and some execution failures // may not contain an error or end.
are these truncated at some point?
pkg/sql/crdb_internal.go, line 849 at r1 (raw file):
if payload != nil { executionErrors = jobs.FormatRetriableExecutionErrorLogToStringArray( ctx, payload.RetriableExecutionFailureLog,
do we want to show all? or perhaps truncate?
ajwerner
left a comment
There was a problem hiding this comment.
I'll collect some example of the output and write a test in the jobs package.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jocrl and @miretskiy)
pkg/jobs/jobs.go, line 1068 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Do we want to also set
EmitRedacted: true? (this would probably make Parse method not work ... but presumably we should try to redact?
I don't think we want to redact it. We don't redact other jobs errors shown in the table or UI.
pkg/server/serverpb/admin.proto, line 635 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
are these truncated at some point?
Yes, they are truncated already before we store them.
pkg/sql/crdb_internal.go, line 849 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
do we want to show all? or perhaps truncate?
We already truncate them before storing them.
Lines 617 to 622 in b64fe37
miretskiy
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jocrl, and @miretskiy)
pkg/jobs/jobs.go, line 1068 at r1 (raw file):
Previously, ajwerner wrote…
I don't think we want to redact it. We don't redact other jobs errors shown in the table or UI.
Okay; I suspect this will have to change at some point.
280525a to
8358c60
Compare
ajwerner
left a comment
There was a problem hiding this comment.
I updated the message with some examples. There's definitely more we can do with this log, but I'm not doing more now.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jocrl and @miretskiy)
pkg/jobs/jobs.go, line 1068 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Okay; I suspect this will have to change at some point.
we do hide rows based on the user who is logged in, so maybe that helps?
cockroach/pkg/sql/crdb_internal.go
Lines 775 to 782 in fa93c68
jocrl
left a comment
There was a problem hiding this comment.
Thank you!! lgtm
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @miretskiy)
pkg/server/serverpb/admin.proto, line 625 at r2 (raw file):
message ExecutionFailure { // Status is the status of the job during the execution. string status = 1;
Just out of curiosity, will the per-error status always be "failed"?
ajwerner
left a comment
There was a problem hiding this comment.
TFTR
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jocrl and @miretskiy)
pkg/server/serverpb/admin.proto, line 625 at r2 (raw file):
Previously, jocrl (Josephine) wrote…
Just out of curiosity, will the per-error status always be "failed"?
No, it'll always be either running or reverting corresponding to
Lines 165 to 211 in af4b5db
|
Build failed: |
This commit adds the structured information about job execution to the crdb_internal table in addition to the flattened string array. It then leverages this information in the admin server API. Informs cockroachdb#69170 As an example of crdb_internal.jobs for a retrying reverting job, you'll see: ``` > select json_array_elements(execution_events) from crdb_internal.jobs where job_id = 731484911676784641; json_array_elements ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- {"executionEndMicros": "1643301928919757", "executionStartMicros": "1643301927541336", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133939200 currently allocated, 134217728 bytes in budget"} {"executionEndMicros": "1643302029260612", "executionStartMicros": "1643302028024034", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133867520 currently allocated, 134217728 bytes in budget"} {"executionEndMicros": "1643302255024877", "executionStartMicros": "1643302253353781", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133898240 currently allocated, 134217728 bytes in budget"} ``` The setup to make this example was to run a cluster with a small `--max-sql-memory=128MiB` and then create a table: ```sql create table foo (i int primary key, j int default (unique_rowid()) unique, b string); insert into foo(i, b) select i, repeat('a', 1<<12) from generate_series(1, 100000) as t(i); set sql_safe_updates=false; alter table foo add column bb string as (i::string || b) stored unique, drop column j; -- retries forever ``` Release note (sql change): The crdb_internal.jobs table now has a new column `execution_events` which is a structured json form of `execution_errors`.
8358c60 to
931957c
Compare
|
bors r+ |
|
Build failed (retrying...): |
|
Build failed (retrying...): |
|
Build failed (retrying...): |
|
Build succeeded: |
This commit adds the structured information about job execution to the
crdb_internal table in addition to the flattened string array. It then
leverages this information in the admin server API.
Informs #69170
As an example of crdb_internal.jobs for a retrying reverting job, you'll
see:
The setup to make this example was to run a cluster with a small
--max-sql-memory=128MiBand then create a table:Release note (sql change): The crdb_internal.jobs table now has a new column
execution_eventswhich is a structured json form ofexecution_errors.