Skip to content

jobs,sql,server: add structured job execution failure log to job info#75556

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
ajwerner:ajwerner/crdb-internal-jobs-refactor
Jan 29, 2022
Merged

jobs,sql,server: add structured job execution failure log to job info#75556
craig[bot] merged 1 commit intocockroachdb:masterfrom
ajwerner:ajwerner/crdb-internal-jobs-refactor

Conversation

@ajwerner
Copy link
Copy Markdown
Contributor

@ajwerner ajwerner commented Jan 26, 2022

This commit adds the structured information about job execution to the
crdb_internal table in addition to the flattened string array. It then
leverages this information in the admin server API.

Informs #69170

As an example of crdb_internal.jobs for a retrying reverting job, you'll
see:

> select json_array_elements(execution_events) from crdb_internal.jobs where job_id = 731484911676784641;
                                                                                                                                             json_array_elements
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  {"executionEndMicros": "1643301928919757", "executionStartMicros": "1643301927541336", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133939200 currently allocated, 134217728 bytes in budget"}
  {"executionEndMicros": "1643302029260612", "executionStartMicros": "1643302028024034", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133867520 currently allocated, 134217728 bytes in budget"}
  {"executionEndMicros": "1643302255024877", "executionStartMicros": "1643302253353781", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133898240 currently allocated, 134217728 bytes in budget"}

The setup to make this example was to run a cluster with a small
--max-sql-memory=128MiB and then create a table:

create table foo (i int primary key, j int default (unique_rowid()) unique, b string);
insert into foo(i, b) select i, repeat('a', 1<<12) from generate_series(1, 100000) as t(i);
set sql_safe_updates=false;
alter table foo add column bb string as (i::string || b) stored unique, drop column j; -- retries forever

Release note (sql change): The crdb_internal.jobs table now has a new column
execution_events which is a structured json form of execution_errors.

@ajwerner ajwerner requested review from jocrl and miretskiy January 26, 2022 17:56
@ajwerner ajwerner requested review from a team as code owners January 26, 2022 17:57
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@ajwerner
Copy link
Copy Markdown
Contributor Author

@miretskiy this is in support of this design (internal).

@ajwerner ajwerner force-pushed the ajwerner/crdb-internal-jobs-refactor branch 2 times, most recently from ea34362 to 280525a Compare January 27, 2022 00:00
@ajwerner ajwerner requested a review from a team as a code owner January 27, 2022 00:00
Copy link
Copy Markdown
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to include new output example in PR description. Probably would help doc writes too.

Reviewed 6 of 11 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @jocrl)


pkg/jobs/jobs.go, line 1068 at r1 (raw file):

		}
		msg, err := protoreflect.MessageToJSON(&ev, protoreflect.FmtFlags{
			EmitDefaults: false,

Do we want to also set EmitRedacted: true? (this would probably make Parse method not work ... but presumably we should try to redact?


pkg/server/serverpb/admin.proto, line 635 at r1 (raw file):

  // ExecutionFailures is a log of execution failures of the job. It is not
  // guaranteed to contain all execution failures and some execution failures
  // may not contain an error or end.

are these truncated at some point?


pkg/sql/crdb_internal.go, line 849 at r1 (raw file):

				if payload != nil {
					executionErrors = jobs.FormatRetriableExecutionErrorLogToStringArray(
						ctx, payload.RetriableExecutionFailureLog,

do we want to show all? or perhaps truncate?

Copy link
Copy Markdown
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll collect some example of the output and write a test in the jobs package.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jocrl and @miretskiy)


pkg/jobs/jobs.go, line 1068 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Do we want to also set EmitRedacted: true? (this would probably make Parse method not work ... but presumably we should try to redact?

I don't think we want to redact it. We don't redact other jobs errors shown in the table or UI.


pkg/server/serverpb/admin.proto, line 635 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

are these truncated at some point?

Yes, they are truncated already before we store them.


pkg/sql/crdb_internal.go, line 849 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

do we want to show all? or perhaps truncate?

We already truncate them before storing them.

cockroach/pkg/jobs/jobs.go

Lines 617 to 622 in b64fe37

// Truncate all errors to avoid large rows in the jobs
// table.
const (
jobErrMaxRuneCount = 1024
jobErrTruncatedMarker = " -- TRUNCATED"
)

Copy link
Copy Markdown
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jocrl, and @miretskiy)


pkg/jobs/jobs.go, line 1068 at r1 (raw file):

Previously, ajwerner wrote…

I don't think we want to redact it. We don't redact other jobs errors shown in the table or UI.

Okay; I suspect this will have to change at some point.

@miretskiy miretskiy self-requested a review January 27, 2022 12:39
@ajwerner ajwerner force-pushed the ajwerner/crdb-internal-jobs-refactor branch from 280525a to 8358c60 Compare January 27, 2022 16:55
Copy link
Copy Markdown
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the message with some examples. There's definitely more we can do with this log, but I'm not doing more now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jocrl and @miretskiy)


pkg/jobs/jobs.go, line 1068 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Okay; I suspect this will have to change at some point.

we do hide rows based on the user who is logged in, so maybe that helps?

// The user can access the row if the meet one of the conditions:
// 1. The user is an admin.
// 2. The job is owned by the user.
// 3. The user has CONTROLJOB privilege and the job is not owned by
// an admin.
if canAccess := isAdmin || !ownedByAdmin && hasControlJob || sameUser; !canAccess {
continue
}

Copy link
Copy Markdown
Contributor

@jocrl jocrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!! lgtm

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @miretskiy)


pkg/server/serverpb/admin.proto, line 625 at r2 (raw file):

  message ExecutionFailure {
    // Status is the status of the job during the execution.
    string status = 1;

Just out of curiosity, will the per-error status always be "failed"?

Copy link
Copy Markdown
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jocrl and @miretskiy)


pkg/server/serverpb/admin.proto, line 625 at r2 (raw file):

Previously, jocrl (Josephine) wrote…

Just out of curiosity, will the per-error status always be "failed"?

No, it'll always be either running or reverting corresponding to

cockroach/pkg/jobs/jobs.go

Lines 165 to 211 in af4b5db

// Status represents the status of a job in the system.jobs table.
type Status string
// SafeFormat implements redact.SafeFormatter.
func (s Status) SafeFormat(sp redact.SafePrinter, verb rune) {
sp.SafeString(redact.SafeString(s))
}
var _ redact.SafeFormatter = Status("")
// RunningStatus represents the more detailed status of a running job in
// the system.jobs table.
type RunningStatus string
const (
// StatusPending is `for jobs that have been created but on which work has
// not yet started.
StatusPending Status = "pending"
// StatusRunning is for jobs that are currently in progress.
StatusRunning Status = "running"
// StatusPaused is for jobs that are not currently performing work, but have
// saved their state and can be resumed by the user later.
StatusPaused Status = "paused"
// StatusFailed is for jobs that failed.
StatusFailed Status = "failed"
// StatusReverting is for jobs that failed or were canceled and their changes are being
// being reverted.
StatusReverting Status = "reverting"
// StatusSucceeded is for jobs that have successfully completed.
StatusSucceeded Status = "succeeded"
// StatusCanceled is for jobs that were explicitly canceled by the user and
// cannot be resumed.
StatusCanceled Status = "canceled"
// StatusCancelRequested is for jobs that were requested to be canceled by
// the user but may be still running Resume. The node that is running the job
// will change it to StatusReverting the next time it runs maybeAdoptJobs.
StatusCancelRequested Status = "cancel-requested"
// StatusPauseRequested is for jobs that were requested to be paused by the
// user but may be still resuming or reverting. The node that is running the
// job will change its state to StatusPaused the next time it runs
// maybeAdoptJobs and will stop running it.
StatusPauseRequested Status = "pause-requested"
// StatusRevertFailed is for jobs that encountered an non-retryable error when
// reverting their changes. Manual cleanup is required when a job ends up in
// this state.
StatusRevertFailed Status = "revert-failed"
)

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 27, 2022

Build failed:

This commit adds the structured information about job execution to the
crdb_internal table in addition to the flattened string array. It then
leverages this information in the admin server API.

Informs cockroachdb#69170

As an example of crdb_internal.jobs for a retrying reverting job, you'll
see:

```
> select json_array_elements(execution_events) from crdb_internal.jobs where job_id = 731484911676784641;
                                                                                                                                             json_array_elements
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  {"executionEndMicros": "1643301928919757", "executionStartMicros": "1643301927541336", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133939200 currently allocated, 134217728 bytes in budget"}
  {"executionEndMicros": "1643302029260612", "executionStartMicros": "1643302028024034", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133867520 currently allocated, 134217728 bytes in budget"}
  {"executionEndMicros": "1643302255024877", "executionStartMicros": "1643302253353781", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133898240 currently allocated, 134217728 bytes in budget"}
```

The setup to make this example was to run a cluster with a small
`--max-sql-memory=128MiB` and then create a table:

```sql
create table foo (i int primary key, j int default (unique_rowid()) unique, b string);
insert into foo(i, b) select i, repeat('a', 1<<12) from generate_series(1, 100000) as t(i);
set sql_safe_updates=false;
alter table foo add column bb string as (i::string || b) stored unique, drop column j; -- retries forever
```

Release note (sql change): The crdb_internal.jobs table now has a new column
`execution_events` which is a structured json form of `execution_errors`.
@ajwerner ajwerner force-pushed the ajwerner/crdb-internal-jobs-refactor branch from 8358c60 to 931957c Compare January 27, 2022 23:55
@ajwerner
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 28, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 29, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 29, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 29, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants