ui: Add table to Jobs page showing errors #81180

Draft
jocrl wants to merge 1 commit into cockroachdb:master from jocrl:jobs-error-details

@jocrl jocrl commented May 10, 2022

Fixes #69170

This commit adds a "Job errors" table to the Job Details page.

Currently this draft implementation shows both the message in the error
(string) field with the job start and end time as error start and end times,
and the errors in the execution_failures field.

Separately, this commit also

  • Improves job and table stories
  • Renames the active stmts/txn prop type to distinguish those types from their
    respective components

No errors:
image

Retriable errors: Image TBD. Blocked on #81474

Failed without retriable errors:
image

Release note (ui): This commit adds a "Job errors" table to the Job Details
page.

Todo:
- Blocked on understanding how the endpoint fields are populated in a real use
  case. See comment in thread.

jocrl commented May 10, 2022

Hello! 😄 Could I get help with understanding the endpoint + requirements?

Specifically, #75556 adds an execution_failures field to the API, which contains start and end times in addition to the error message and allows us to populate the three UI table columns. However, this execution_failures field only contains retriable execution errors (since it is populated by the log of retriable execution failures), and does not include errors that are not retriable.

For development purposes, I have managed to cause a job to fail by adding an error in the normal job flow. But it does not populate the execution_failures field, because the error I introduced is not retriable. I haven't figured out how to induce a retriable failure, and am thus not sure how the execution_failures field and the separate error field (string message only, no start or end times) would be populated in a real use case.

Lacking the example of a real use case, I have some starting questions:

  1. @ajwerner, I'm assuming this is intentional, to only populate execution_failures with retriable errors?
  2. @kevin-v-ngo / @Annebirzin: I assume that we want to show both retriable, and non-retriable errors in the UI table? Or am I mistaken, and we only want to show those that are retriable? Currently, if a job fails in a non-retriable way, it already shows the error message under the "FAILED" badge (this message existed before this PR). Should we repeat that in the table, or just show it once in the table/under the badge?

Assuming I'm correct on the above two assumptions, my further questions are:

  3. @ajwerner, do you have an idea how one could induce a retriable error? E.g., by temporarily compiling errors into a normal job flow, for UI dev purposes. If we can't simulate this, my question is then: when a retriable error occurs, is the latest error populated into the error (string) column of crdb_internal.jobs? Assuming we should show both retriable and non-retriable errors in the UI table, I'm trying to understand whether I should concatenate the error (string) field with the errors in the execution_failures field, or whether that would cause duplicates when execution_failures is populated.
  4. @kevin-v-ngo / @ajwerner, for non-retriable errors, what should the start and end time columns be? I could make them the job start and end time, though since the job is the first thing to start (before any retriable errors), the "final" error will be the bottommost one if we sort by most recent error start time by default. Or should they be empty?
  5. @kevin-v-ngo / @Annebirzin, on that note: should the table default to sorting by most recent error start time, or something else?
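To make the duplication concern in question 3 concrete, here is a minimal sketch of one way the two fields could be combined into table rows. The `JobLike`, `ExecutionFailure`, and `ErrorRow` types and the `buildErrorRows` helper are hypothetical and are not taken from this PR; the sketch also assumes that a duplicate can be detected by comparing message strings, which may not hold in a real cluster.

```typescript
// Hypothetical shapes: they mirror the `error` and `execution_failures`
// fields discussed above, but the exact types are assumptions.
interface ExecutionFailure {
  error: string;
  start: Date;
  end: Date;
}

interface JobLike {
  error: string; // final error message; empty when the job has not failed
  started: Date;
  finished: Date | null;
  executionFailures: ExecutionFailure[];
}

interface ErrorRow {
  message: string;
  start: Date | null;
  end: Date | null;
  retriable: boolean;
}

// Builds one row per retriable failure, then appends the final error only if
// no retriable failure already carries the same message (naive de-duplication).
function buildErrorRows(job: JobLike): ErrorRow[] {
  const rows: ErrorRow[] = job.executionFailures.map(f => ({
    message: f.error,
    start: f.start,
    end: f.end,
    retriable: true,
  }));
  const alreadyListed = rows.some(r => r.message === job.error);
  if (job.error !== "" && !alreadyListed) {
    // Assumption: fall back to the job's own start and end times, as the
    // draft implementation described in this PR does.
    rows.push({
      message: job.error,
      start: job.started,
      end: job.finished,
      retriable: false,
    });
  }
  return rows;
}
```

Whether string equality is a safe way to detect duplicates depends on the answer to question 3.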

Regardless of the retriable/non-retriable errors stuff:

  6. @kevin-v-ngo / @ajwerner, do we have to worry about the error messages being too long to reasonably contain in the table cell? E.g., the dump of some JSON payload.
  7. @kevin-v-ngo / @ajwerner, would we potentially expect more than, e.g., 20 errors in a real use case, and if so how many? If we're expecting many errors for a single job, @Annebirzin I'm wondering if we should paginate the table.
  8. @kevin-v-ngo / @ajwerner, should we link to any resources to help with the error messages?

Thanks, everyone!

@kevin-v-ngo

  1. I'd suggest we show all retriable errors that the job experienced in the UI table below. The non-retriable error that caused the job to fail would be part of the status section of the job.

  2. With 2, we'd have the non-retriable error as the status. Ideally, the status would also show the time at which the job failed with that error.

  3. Sorting by most recent error in the UI table sounds good to me.

  4. I'm not sure how long error messages can be. Each job type probably has its own set of error messages, but it would be great if we could search for the longest one in the code base. If that's not feasible, perhaps we should use a fixed width and truncate, as we do with other tables in the UI. Hovering would show the full error message. Thoughts @Annebirzin? I remember doing this for the new statements/transaction format.

  5. I'm not completely sure of the maximum number of retriable errors we'd see. I suspect there aren't enough to warrant pagination, but I'm adding @vy-ton (schema and statistics) and @amruss (CDC) for their feedback.

  6. Short answer: not yet. Each job type probably has its own set of error messages, which would need descriptions and guidance on how to mitigate or prevent them. We could probably have something similar to this doc. @amruss, is there a Jobs docs writer or board where we can file a docs improvement to track this?
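As a rough illustration of the sorting and fixed-width truncation suggested above, a sketch in TypeScript might look like the following. The `TimedError` type and both helper names are hypothetical, and the character-based cutoff is an assumption; the existing tables in the UI may truncate differently (for example, with CSS).

```typescript
// Hypothetical row shape for illustration only.
interface TimedError {
  message: string;
  start: Date;
}

// Returns a new array ordered by most recent error start time first.
function sortMostRecentFirst(rows: TimedError[]): TimedError[] {
  return [...rows].sort((a, b) => b.start.getTime() - a.start.getTime());
}

// Truncates a message to at most maxChars characters (maxChars >= 1),
// ending with an ellipsis when text was cut. The full message would still
// be available to show on hover.
function truncateMessage(message: string, maxChars: number): string {
  if (message.length <= maxChars) {
    return message;
  }
  return message.slice(0, maxChars - 1) + "…";
}
```

The cell would render `truncateMessage(row.message, n)` and keep `row.message` for the hover text.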

jocrl commented May 20, 2022

Thanks, @kevin-v-ngo!

I've gotten answers to 6 and 7 from Andrew:

  6. Error messages are capped at 16k (i.e., they could be very long).
  7. By default, only 3 errors are retained, controlled by this cluster setting.

Additionally, Andrew said that

From Kevin, regarding error message length:

If it's not feasible, perhaps we should have fixed width and truncate as we have with other tables in the UI. On hover would show the full error message. Thoughts @Annebirzin? I remember doing this for the new statements/transaction format.

@Annebirzin, just bringing up that another option could be to fix cell size, and scroll the contents.

Additionally, @Annebirzin, regarding Kevin's comment that:

I'd suggest we show all retriable errors that the job experienced in the UI table below. The non-retriable error that caused the jobs to fail would be part of the status section of the job.

We discussed off-thread, and it sounds like we want to convey the difference between the "final" non-retriable error in the status, and the retriable errors in the job table. A suggestion was to name the job errors table "Previous Job errors" - this table might be present while the job is running, reverting, or moved on to some other state e.g. succeeded/failed. I've updated the screenshot in the description with these text updates.

Annebirzin commented May 23, 2022

@jocrl

  • For long error messages, I wonder if we can introduce a pattern to truncate after 2 lines, and then include a 'Show More/Less' action to vertically expand or hide the additional text in the cell (designs here)
    Screen Shot 2022-05-23 at 12 16 19 PM
    Screen Shot 2022-05-23 at 12 16 25 PM

  • For the title and empty state text, I'm not sure adding the word 'Previous' really tells the user much. I would suggest leaving as 'Job Errors' and maybe add some description text below the title to convey that only retriable errors will show in the table. (designs here) I'm open to suggestions on this description copy as I'm not sure I captured the distinction between the types of job errors that will show up here cc @kevin-v-ngo

Screen Shot 2022-05-23 at 12 37 42 PM
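The "Show More/Less" pattern proposed above could be sketched roughly as follows. The `CellView` type and `renderErrorCell` function are hypothetical, and counting newline characters is a simplifying assumption: a real cell would more likely clamp visually wrapped lines, which this sketch does not model.

```typescript
// Hypothetical view model for one error-message cell.
interface CellView {
  text: string;
  action: "Show More" | "Show Less" | null;
}

// Decides what text and which toggle label a cell shows, given whether the
// user has expanded it. Messages that fit within maxLines get no toggle.
function renderErrorCell(
  message: string,
  expanded: boolean,
  maxLines: number = 2,
): CellView {
  const lines = message.split("\n");
  if (lines.length <= maxLines) {
    return { text: message, action: null };
  }
  if (expanded) {
    return { text: message, action: "Show Less" };
  }
  return { text: lines.slice(0, maxLines).join("\n"), action: "Show More" };
}
```

Keeping the expanded flag outside the function lets each table row track its own state.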

jocrl commented May 25, 2022

Hi @amruss! Could we get your help on how to convey that the errors table only contains retriable errors, and not non-retriable ones? E.g., whether that's a title for the table, or some description text.

@ericharmeling ericharmeling self-assigned this Jul 5, 2022
@ericharmeling

It looks like #82562 could unblock this. See #81474 (comment).

@ericharmeling ericharmeling removed their assignment Apr 24, 2023

Development

Successfully merging this pull request may close these issues.

jobs: surface job error details in the DB console
