Skip to content

Issue-105420: Fix bug causing incorrect error on force deleting already deleted model#107188

Open
batcity wants to merge 9 commits intoelastic:mainfrom
batcity:issue-105420-bugfix
Open

Issue-105420: Fix bug causing incorrect error on force deleting already deleted model#107188
batcity wants to merge 9 commits intoelastic:mainfrom
batcity:issue-105420-bugfix

Conversation

@batcity
Copy link
Copy Markdown

@batcity batcity commented Apr 8, 2024

What I did:

Previously, attempting to force delete a referenced model in Elasticsearch that had already been deleted would result in an error indicating that the model is still being referenced by ingest processors. This behavior was misleading since the model no longer exists in the system.

Changes made:

  • Modified the deleteModel method to to properly check for the existence of the model before proceeding with the deletion process.

With this fix, attempting to force delete an already deleted model will now correctly indicate that the model isn't found, rather than misleadingly reporting it as still being referenced.

Related issue:

Fixes #105420

How did I test my fix:

I built and started up elasticsearch locally, then I downloaded a new model using the following command:

curl --insecure -u "elastic:<password>" -X PUT "https://localhost:9200/_ml/trained_models/.elser_model_2?pretty" -H 'Content-Type: application/json' -d' { "input": { "field_names": ["text_field"] } } '

Then I attempted to delete it and it said that it's being referenced as expected:

curl --insecure -u "elastic:<password>" -X DELETE "https://localhost:9200/_ml/trained_models/.elser_model_2?pretty" { "error" : { "root_cause" : [ { "type" : "status_exception", "reason" : "Cannot delete model [.elser_model_2] as it is still referenced by ingest processors; use force to delete the model" } ], "type" : "status_exception", "reason" : "Cannot delete model [.elser_model_2] as it is still referenced by ingest processors; use force to delete the model" }, "status" : 409 }

I force deleted it the first time:

curl --insecure -u "elastic:<password>" -X DELETE "https://localhost:9200/_ml/trained_models/.elser_model_2?force=true" {"acknowledged":true}%

It now returns the right error response when I attempt to delete the same model again:

curl --insecure -u "elastic:<password>" -X DELETE "https://localhost:9200/_ml/trained_models/.elser_model_2" {"error":{"root_cause":[{"type":"resource_not_found_exception","reason":"Could not find trained model [.elser_model_2]"}],"type":"resource_not_found_exception","reason":"Could not find trained model [.elser_model_2]"},"status":404}%

PR checklist:

  • Have you signed the contributor license agreement? ✔️Yes

  • Have you followed the contributor guidelines? ✔️Yes

  • If submitting code, have you built your formula locally prior to submission with gradle check?

    I ran tests in the package to make sure I didn't break anything, here's the command I ran:

    ./gradlew :x-pack:plugin:ml:test

    Here's a screenshot showing that the tests succeeded:

Screenshot 2024-04-10 at 4 44 21 AM
  • If submitting code, is your pull request against main? Unless there is a good reason otherwise, we prefer pull requests against main and will backport as needed. ✔️Yes
  • If submitting code, have you checked that your submission is for an OS and architecture that we support? ✔️Yes
  • If you are submitting this code for a class then read our policy for that. N/A

@elasticsearchmachine elasticsearchmachine added external-contributor Pull request authored by a developer outside the Elasticsearch team v8.14.0 labels Apr 8, 2024
@batcity batcity changed the title draft: Issue-105420: Fix bug causing incorrect error on force deleting already deleted model Issue-105420: Fix bug causing incorrect error on force deleting already deleted model Apr 9, 2024
@batcity batcity marked this pull request as ready for review April 9, 2024 23:45
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Apr 9, 2024
@davidkyle davidkyle added :ml Machine learning and removed needs:triage Requires assignment of a team area label labels Apr 10, 2024
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Apr 10, 2024
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/ml-core (Team:ML)

@davidkyle
Copy link
Copy Markdown
Member

@elasticmachine test this please

@davidkyle davidkyle added the >bug label Apr 10, 2024
@maxhniebergall maxhniebergall self-requested a review April 10, 2024 18:27
Copy link
Copy Markdown
Contributor

@maxhniebergall maxhniebergall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this change! I have requested a few changes, but I am open to doing it a different way. Let me know what you think!

edit: please run this command once your next commit is ready ./gradlew :x-pack:plugin:core:spotlessApply :x-pack:plugin:core:precommit :x-pack:plugin:ml:spotlessApply :x-pack:plugin:ml:precommit :x-pack:plugin:inference:spotlessApply :x-pack:plugin:inference:precommit :x-pack:plugin:ml:qa:native-multi-node-tests:precommit

IngestMetadata currentIngestMetadata = state.metadata().custom(IngestMetadata.TYPE);
Set<String> referencedModels = getReferencedModelKeys(currentIngestMetadata, ingestService);

if (modelExists(request.getId()) == false) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! this looks very clean and readable.

trainedModelProvider.getTrainedModel(modelId, GetTrainedModelsAction.Includes.empty(), null, trainedModelListener);

try {
boolean latchReached = latch.await(5, TimeUnit.SECONDS);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want a 5 second timeout here which will throw an exception. If this timeout occurs, I don't think it will be any clearer what happened for the end user.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
};

trainedModelProvider.getTrainedModel(modelId, GetTrainedModelsAction.Includes.empty(), null, trainedModelListener);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't pass a parent task to this request, it wont be cancelable. I think it would be better to pass in the task from the masterOperation here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +247 to +250
protected boolean modelExists(String modelId) {
CountDownLatch latch = new CountDownLatch(1);
final AtomicBoolean modelExists = new AtomicBoolean(false);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid using a latch and requiring a timeout, I think we could replace this function with an actionListener. What do you think?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


@Override
public void onFailure(Exception e) {
logger.error("Failed to retrieve model {}: {}", modelId, e.getMessage(), e);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.error("Failed to retrieve model {}: {}", modelId, e.getMessage(), e);
logger.error("Failed to retrieve model [" + modelId + "]: [" + e.getMessage() + "]", e);

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
}

protected boolean modelExists(String modelId) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid the countdown latch and hence blocking the calling thread by using a listener.

You don't have to timeout the call to trainedModelProvider.getTrainedModel() if it does timeout simply out let the error propagate from the call.

    private void modelExists(String modelId, ActionListener<Boolean> listener) {
        trainedModelProvider.getTrainedModel(modelId, GetTrainedModelsAction.Includes.empty(), null,
            ActionListener.wrap(
                model -> listener.onResponse(Boolean.TRUE),
                exception -> {
                    if (ExceptionsHelper.unwrapCause(exception) instanceof ResourceNotFoundException) {
                        listener.onResponse(Boolean.FALSE);
                    } else {
                        listener.onFailure(exception);
                    }
                }
            )
        );
    }

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After you make the modelExists() function async with a listener you will need to chain the various processing steps together. The best way to do this is use a SubscribableListener

Here's an example if it being used:
https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportGetTrainedModelsStatsAction.java#L128

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in commit: 04bed02#diff-9c8c7464beb6c8c177bd79140cdb557b630bdbeb06fb8ecac032eeb518487825R223

Apologies about the delayed response, I was busy with work 😅

Copy link
Copy Markdown
Author

@batcity batcity Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't spin up the service and test the endpoints this time around since the ml endpoints won't activate locally for whatever reason (I get a no handler found error for the ml endpoints)

so I've only verified that the unit tests succeed

image


@Override
public void onFailure(Exception e) {
logger.error("Failed to retrieve model {}: {}", modelId, e.getMessage(), e);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't an error as we are checking the model's existence here. If the model doesn't exist then that is fine and it should be reported back to the caller rather than logged.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw new ElasticsearchException("Timeout while waiting for trained model to be retrieved");
}
} catch (InterruptedException e) {
throw new ElasticsearchException("Unexpected exception", e);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is not necessary if you take my suggestion but in Java it's best practice in to reset the interrupt flag with Thread.currentThread().interrupt();

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxhniebergall maxhniebergall removed their assignment May 15, 2024
@batcity batcity requested a review from davidkyle October 8, 2025 14:25
@davidkyle davidkyle removed their request for review March 30, 2026 10:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug external-contributor Pull request authored by a developer outside the Elasticsearch team :ml Machine learning Team:ML Meta label for the ML team v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wrong error message returned on delete trained model when model already deleted

5 participants