Skip to content
This repository was archived by the owner on Jan 28, 2022. It is now read-only.

Sets Run to terminal state if it has been deleted from Databricks first#158

Merged
Azadehkhojandi merged 3 commits intoAzure:masterfrom
stuartleeks:acm/156-run-delete
Feb 19, 2020
Merged

Sets Run to terminal state if it has been deleted from Databricks first#158
Azadehkhojandi merged 3 commits intoAzure:masterfrom
stuartleeks:acm/156-run-delete

Conversation

@magencio
Copy link
Contributor

@magencio magencio commented Feb 3, 2020

Fix for issue #156: Reconciler error when refreshing a Run that has been deleted in Databricks.

return err
// Databricks API 2.0/jobs/runs/get-output returns error 500 if the Run has already been
// deleted from Databricks. Set k8s Run to a terminal state
if strings.Contains(err.Error(), "Response from server (500)") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this error could happen even if it's not deleted? If it does, wouldn't we be setting a non deleted run to be deleted on the cluster?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point, and unfortunately the status code and response are pretty unhelpful here!

I just tested https://northeurope.azuredatabricks.net/api/2.0/jobs/runs/get?run_id=123 and that gives a 400 response with "Run 123 does not exist" in the JSON message

Whereas https://northeurope.azuredatabricks.net/api/2.0/jobs/runs/get-output?run_id=123 gives a 500 respones with an empty message!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @EliiseS that we don't want to stop reconciling for runs that still exist when the API is having issues and returning status code 500 responses.

My initial thought was if we get a 500 repsonse to then call /jobs/runs/get and test the error message. However this adds an extra call in a reconciliation loop which could have a negative impact if the API is having issues and returning genuine 500 responses

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EliiseS this change sets the Run in k8s to a final state, but doesn't set if for deletion in k8s. You would still need to delete it afterwards (which will work fine even without this fix). Still, as @stuartleeks mentioned, we could use a call to /jobs/runs/get to determine the "type" of error 500 we are getting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I improved the handling of the error to tell if the 500 happens because the Run was deleted in Databricks or the API is failing for some other reason

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably run a load test on this before this is merged

@stuartleeks
Copy link
Collaborator

Update - I've now managed to run a load test with this change.

tldr - this looks ok. I'm going to remove the not-for-review label

Load pattern:

image

API calls (shows when mock api failure behaviour was configured: 20% status code 500 responses in first batch, 100% in second batch):

image

Mean value for poll_run_await_completion (which is a proxy for latency in detecting run completion) stays within reasonable value:

image

Controller reconciliation rate dips with status 500 responses, but recovers well
image

Work queue etc look busy, but doesn't keep growing (and poll_run_await_completion was ok above)

image

@stuartleeks
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@Azadehkhojandi
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@Azadehkhojandi Azadehkhojandi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

@Azadehkhojandi Azadehkhojandi merged commit f259cd8 into Azure:master Feb 19, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants