jobs,sql,server: add structured job execution failure log to job info by ajwerner · Pull Request #75556 · cockroachdb/cockroach

ajwerner · 2022-01-26T17:56:59Z

This commit adds the structured information about job execution to the
crdb_internal table in addition to the flattened string array. It then
leverages this information in the admin server API.

Informs #69170

As an example of crdb_internal.jobs for a retrying reverting job, you'll
see:

> select json_array_elements(execution_events) from crdb_internal.jobs where job_id = 731484911676784641;
                                                                                                                                             json_array_elements
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  {"executionEndMicros": "1643301928919757", "executionStartMicros": "1643301927541336", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133939200 currently allocated, 134217728 bytes in budget"}
  {"executionEndMicros": "1643302029260612", "executionStartMicros": "1643302028024034", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133867520 currently allocated, 134217728 bytes in budget"}
  {"executionEndMicros": "1643302255024877", "executionStartMicros": "1643302253353781", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133898240 currently allocated, 134217728 bytes in budget"}

The setup to make this example was to run a cluster with a small
--max-sql-memory=128MiB and then create a table:

create table foo (i int primary key, j int default (unique_rowid()) unique, b string);
insert into foo(i, b) select i, repeat('a', 1<<12) from generate_series(1, 100000) as t(i);
set sql_safe_updates=false;
alter table foo add column bb string as (i::string || b) stored unique, drop column j; -- retries forever

Release note (sql change): The crdb_internal.jobs table now has a new column
execution_events which is a structured json form of execution_errors.

cockroach-teamcity · 2022-01-26T17:57:07Z

This change is

ajwerner · 2022-01-26T17:57:55Z

@miretskiy this is in support of this design (internal).

miretskiy

Would be nice to include new output example in PR description. Probably would help doc writes too.

Reviewed 6 of 11 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @jocrl)

pkg/jobs/jobs.go, line 1068 at r1 (raw file):

		}
		msg, err := protoreflect.MessageToJSON(&ev, protoreflect.FmtFlags{
			EmitDefaults: false,

Do we want to also set EmitRedacted: true? (this would probably make Parse method not work ... but presumably we should try to redact?

pkg/server/serverpb/admin.proto, line 635 at r1 (raw file):

  // ExecutionFailures is a log of execution failures of the job. It is not
  // guaranteed to contain all execution failures and some execution failures
  // may not contain an error or end.

are these truncated at some point?

pkg/sql/crdb_internal.go, line 849 at r1 (raw file):

				if payload != nil {
					executionErrors = jobs.FormatRetriableExecutionErrorLogToStringArray(
						ctx, payload.RetriableExecutionFailureLog,

do we want to show all? or perhaps truncate?

ajwerner

I'll collect some example of the output and write a test in the jobs package.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jocrl and @miretskiy)

pkg/jobs/jobs.go, line 1068 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Do we want to also set EmitRedacted: true? (this would probably make Parse method not work ... but presumably we should try to redact?

I don't think we want to redact it. We don't redact other jobs errors shown in the table or UI.

pkg/server/serverpb/admin.proto, line 635 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

are these truncated at some point?

Yes, they are truncated already before we store them.

pkg/sql/crdb_internal.go, line 849 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

do we want to show all? or perhaps truncate?

We already truncate them before storing them.

cockroach/pkg/jobs/jobs.go

Lines 617 to 622 in b64fe37

    
           // Truncate all errors to avoid large rows in the jobs 
        
           // table. 
        
           const ( 
        
           	jobErrMaxRuneCount    = 1024 
        
           	jobErrTruncatedMarker = " -- TRUNCATED" 
        
           )

miretskiy

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jocrl, and @miretskiy)

pkg/jobs/jobs.go, line 1068 at r1 (raw file):

Previously, ajwerner wrote…

I don't think we want to redact it. We don't redact other jobs errors shown in the table or UI.

Okay; I suspect this will have to change at some point.

ajwerner

I updated the message with some examples. There's definitely more we can do with this log, but I'm not doing more now.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jocrl and @miretskiy)

pkg/jobs/jobs.go, line 1068 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Okay; I suspect this will have to change at some point.

we do hide rows based on the user who is logged in, so maybe that helps?

cockroach/pkg/sql/crdb_internal.go

Lines 775 to 782 in fa93c68

    
           // The user can access the row if the meet one of the conditions: 
        
           //  1. The user is an admin. 
        
           //  2. The job is owned by the user. 
        
           //  3. The user has CONTROLJOB privilege and the job is not owned by 
        
           //      an admin. 
        
           if canAccess := isAdmin || !ownedByAdmin && hasControlJob || sameUser; !canAccess { 
        
           	continue 
        
           }

jocrl

Thank you!! lgtm

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @miretskiy)

pkg/server/serverpb/admin.proto, line 625 at r2 (raw file):

  message ExecutionFailure {
    // Status is the status of the job during the execution.
    string status = 1;

Just out of curiosity, will the per-error status always be "failed"?

ajwerner

TFTR

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jocrl and @miretskiy)

pkg/server/serverpb/admin.proto, line 625 at r2 (raw file):

Previously, jocrl (Josephine) wrote…

Just out of curiosity, will the per-error status always be "failed"?

No, it'll always be either running or reverting corresponding to

cockroach/pkg/jobs/jobs.go

Lines 165 to 211 in af4b5db

    
           // Status represents the status of a job in the system.jobs table. 
        
           type Status string 
        
           // SafeFormat implements redact.SafeFormatter. 
        
           func (s Status) SafeFormat(sp redact.SafePrinter, verb rune) { 
        
           	sp.SafeString(redact.SafeString(s)) 
        
           } 
        
           var _ redact.SafeFormatter = Status("") 
        
           // RunningStatus represents the more detailed status of a running job in 
        
           // the system.jobs table. 
        
           type RunningStatus string 
        
           const ( 
        
           	// StatusPending is `for jobs that have been created but on which work has 
        
           	// not yet started. 
        
           	StatusPending Status = "pending" 
        
           	// StatusRunning is for jobs that are currently in progress. 
        
           	StatusRunning Status = "running" 
        
           	// StatusPaused is for jobs that are not currently performing work, but have 
        
           	// saved their state and can be resumed by the user later. 
        
           	StatusPaused Status = "paused" 
        
           	// StatusFailed is for jobs that failed. 
        
           	StatusFailed Status = "failed" 
        
           	// StatusReverting is for jobs that failed or were canceled and their changes are being 
        
           	// being reverted. 
        
           	StatusReverting Status = "reverting" 
        
           	// StatusSucceeded is for jobs that have successfully completed. 
        
           	StatusSucceeded Status = "succeeded" 
        
           	// StatusCanceled is for jobs that were explicitly canceled by the user and 
        
           	// cannot be resumed. 
        
           	StatusCanceled Status = "canceled" 
        
           	// StatusCancelRequested is for jobs that were requested to be canceled by 
        
           	// the user but may be still running Resume. The node that is running the job 
        
           	// will change it to StatusReverting the next time it runs maybeAdoptJobs. 
        
           	StatusCancelRequested Status = "cancel-requested" 
        
           	// StatusPauseRequested is for jobs that were requested to be paused by the 
        
           	// user but may be still resuming or reverting. The node that is running the 
        
           	// job will change its state to StatusPaused the next time it runs 
        
           	// maybeAdoptJobs and will stop running it. 
        
           	StatusPauseRequested Status = "pause-requested" 
        
           	// StatusRevertFailed is for jobs that encountered an non-retryable error when 
        
           	// reverting their changes. Manual cleanup is required when a job ends up in 
        
           	// this state. 
        
           	StatusRevertFailed Status = "revert-failed" 
        
           )

craig · 2022-01-27T18:33:22Z

Build failed:

GitHub CI (Cockroach)

This commit adds the structured information about job execution to the crdb_internal table in addition to the flattened string array. It then leverages this information in the admin server API. Informs cockroachdb#69170 As an example of crdb_internal.jobs for a retrying reverting job, you'll see: ``` > select json_array_elements(execution_events) from crdb_internal.jobs where job_id = 731484911676784641; json_array_elements ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- {"executionEndMicros": "1643301928919757", "executionStartMicros": "1643301927541336", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133939200 currently allocated, 134217728 bytes in budget"} {"executionEndMicros": "1643302029260612", "executionStartMicros": "1643302028024034", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133867520 currently allocated, 134217728 bytes in budget"} {"executionEndMicros": "1643302255024877", "executionStartMicros": "1643302253353781", "instanceId": 1, "status": "reverting", "truncatedError": "scan with start key /Table/54/1/13110: root: memory budget exceeded: 532480 bytes requested, 133898240 currently allocated, 134217728 bytes in budget"} ``` The setup to make this example was to run a cluster with a small `--max-sql-memory=128MiB` and then create a table: ```sql create table foo (i int primary key, j int default (unique_rowid()) unique, b string); insert into foo(i, b) select i, repeat('a', 1<<12) from generate_series(1, 100000) as t(i); set sql_safe_updates=false; alter table foo add column bb string as (i::string || b) stored unique, drop column j; -- retries forever ``` Release note (sql change): The crdb_internal.jobs table now has a new column `execution_events` which is a structured json form of `execution_errors`.

ajwerner · 2022-01-28T20:49:29Z

bors r+

craig · 2022-01-28T23:23:08Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2022-01-29T01:32:39Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2022-01-29T04:58:29Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2022-01-29T09:39:04Z

Build succeeded:

GitHub CI (Cockroach)

ajwerner requested review from jocrl and miretskiy January 26, 2022 17:56

ajwerner requested review from a team as code owners January 26, 2022 17:57

ajwerner force-pushed the ajwerner/crdb-internal-jobs-refactor branch 2 times, most recently from ea34362 to 280525a Compare January 27, 2022 00:00

ajwerner requested a review from a team as a code owner January 27, 2022 00:00

miretskiy suggested changes Jan 27, 2022

View reviewed changes

ajwerner commented Jan 27, 2022

View reviewed changes

miretskiy approved these changes Jan 27, 2022

View reviewed changes

miretskiy self-requested a review January 27, 2022 12:39

ajwerner force-pushed the ajwerner/crdb-internal-jobs-refactor branch from 280525a to 8358c60 Compare January 27, 2022 16:55

ajwerner commented Jan 27, 2022

View reviewed changes

jocrl approved these changes Jan 27, 2022

View reviewed changes

ajwerner commented Jan 27, 2022

View reviewed changes

ajwerner force-pushed the ajwerner/crdb-internal-jobs-refactor branch from 8358c60 to 931957c Compare January 27, 2022 23:55

craig bot merged commit 75ba483 into cockroachdb:master Jan 29, 2022

cockroach-teamcity mentioned this pull request Jan 29, 2022

jobs,sql,server: add structured job execution failure log to job info cockroachdb/docs#12879

Closed

jocrl mentioned this pull request May 10, 2022

ui: Add table to Jobs page showing errors #81180

Draft

	// Truncate all errors to avoid large rows in the jobs
	// table.
	const (
	jobErrMaxRuneCount = 1024
	jobErrTruncatedMarker = " -- TRUNCATED"
	)

	// The user can access the row if the meet one of the conditions:
	// 1. The user is an admin.
	// 2. The job is owned by the user.
	// 3. The user has CONTROLJOB privilege and the job is not owned by
	// an admin.
	if canAccess := isAdmin \|\| !ownedByAdmin && hasControlJob \|\| sameUser; !canAccess {
	continue
	}

	// Status represents the status of a job in the system.jobs table.
	type Status string

	// SafeFormat implements redact.SafeFormatter.
	func (s Status) SafeFormat(sp redact.SafePrinter, verb rune) {
	sp.SafeString(redact.SafeString(s))
	}

	var _ redact.SafeFormatter = Status("")

	// RunningStatus represents the more detailed status of a running job in
	// the system.jobs table.
	type RunningStatus string

	const (
	// StatusPending is `for jobs that have been created but on which work has
	// not yet started.
	StatusPending Status = "pending"
	// StatusRunning is for jobs that are currently in progress.
	StatusRunning Status = "running"
	// StatusPaused is for jobs that are not currently performing work, but have
	// saved their state and can be resumed by the user later.
	StatusPaused Status = "paused"
	// StatusFailed is for jobs that failed.
	StatusFailed Status = "failed"
	// StatusReverting is for jobs that failed or were canceled and their changes are being
	// being reverted.
	StatusReverting Status = "reverting"
	// StatusSucceeded is for jobs that have successfully completed.
	StatusSucceeded Status = "succeeded"
	// StatusCanceled is for jobs that were explicitly canceled by the user and
	// cannot be resumed.
	StatusCanceled Status = "canceled"
	// StatusCancelRequested is for jobs that were requested to be canceled by
	// the user but may be still running Resume. The node that is running the job
	// will change it to StatusReverting the next time it runs maybeAdoptJobs.
	StatusCancelRequested Status = "cancel-requested"
	// StatusPauseRequested is for jobs that were requested to be paused by the
	// user but may be still resuming or reverting. The node that is running the
	// job will change its state to StatusPaused the next time it runs
	// maybeAdoptJobs and will stop running it.
	StatusPauseRequested Status = "pause-requested"
	// StatusRevertFailed is for jobs that encountered an non-retryable error when
	// reverting their changes. Manual cleanup is required when a job ends up in
	// this state.
	StatusRevertFailed Status = "revert-failed"
	)

Conversation

ajwerner commented Jan 26, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Jan 26, 2022

Uh oh!

ajwerner commented Jan 26, 2022

Uh oh!

miretskiy left a comment

Choose a reason for hiding this comment

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

miretskiy left a comment

Choose a reason for hiding this comment

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

jocrl left a comment

Choose a reason for hiding this comment

Uh oh!

ajwerner left a comment

Choose a reason for hiding this comment

Uh oh!

craig bot commented Jan 27, 2022

Uh oh!

ajwerner commented Jan 28, 2022

Uh oh!

craig bot commented Jan 28, 2022

Uh oh!

craig bot commented Jan 29, 2022

Uh oh!

craig bot commented Jan 29, 2022

Uh oh!

craig bot commented Jan 29, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ajwerner commented Jan 26, 2022 •

edited

Loading