[ML] _all requests can suffer "job not found" errors #37959
Description
(Migrated from #37545 (comment) to improve visibility.)
The failure of https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.5+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/166/ showed that a request to perform an ML operation on _all can return an error saying that it could not find an entity it expected to find.
For example, closing _all jobs might return an error that job foo does not exist. Or stopping _all datafeeds might return an error that datafeed bar does not exist.
This seems completely crazy, as it's obvious that _all should only include entities that exist.
The reason this can happen is that our actions involve multiple base level Elasticsearch actions chained together, and entities could be deleted in between these base level steps. For example:
- Alice requests force delete of job foo
- Bob requests close _all jobs
- Bob's request to close _all jobs expands _all to foo and bar
- Alice's request to force delete foo removes the config associated with job foo
- Bob's request to close _all jobs attempts to find the config for job foo
- Bob's request to close _all fails because the config for job foo does not exist
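The interleaving above can be reproduced deterministically in a few lines. This is only an illustrative Python sketch, not Elasticsearch code: the `configs` dict stands in for the job config store, and `expand_all`, `get_config` and `JobNotFound` are hypothetical names invented for the example.

```python
class JobNotFound(Exception):
    """Hypothetical stand-in for the 'job not found' error."""


# Hypothetical in-memory stand-in for the job config store.
configs = {"foo": {}, "bar": {}}


def expand_all():
    # First base-level step of Bob's close: resolve _all to the jobs that exist now.
    return sorted(configs)


def get_config(job_id):
    # Later base-level step of Bob's close: fetch each job's config individually.
    if job_id not in configs:
        raise JobNotFound(f"job {job_id} does not exist")
    return configs[job_id]


expanded = expand_all()   # Bob: _all expands to bar and foo
del configs["foo"]        # Alice: force delete of foo lands between Bob's two steps

error = None
try:
    for job_id in expanded:
        get_config(job_id)
except JobNotFound as e:
    error = str(e)

print(error)  # job foo does not exist
```

Neither request is wrong on its own; the error comes purely from the gap between expanding _all and using the result.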
Although the test failure that highlighted this problem was a 6.5 test run, I suspect the problem is worse in 6.6 and above because expanding _all requires a search for configs in an index rather than just looking in the (in-memory on all nodes) cluster state.
ML actions that operate on _all should silently ignore failures to find entities from the original expansion of _all, on the assumption that these entities have been deleted by a concurrent request.
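A minimal sketch of that behaviour, again in illustrative Python rather than the real ML action code. The names `expand`, `close_expanded` and `JobNotFound` are hypothetical, and keeping the error for explicitly named jobs is an assumption of this sketch; the proposal above only covers _all.

```python
class JobNotFound(Exception):
    """Hypothetical stand-in for the 'job not found' error."""


def expand(requested, configs):
    # Resolve the request to concrete job IDs and remember whether they
    # came from expanding _all or were named explicitly by the caller.
    if requested == "_all":
        return sorted(configs), True
    return list(requested), False


def close_expanded(job_ids, from_all, configs):
    closed = []
    for job_id in job_ids:
        if job_id not in configs:
            if from_all:
                # Assume a concurrent request deleted the job after expansion.
                continue
            # Assumption: an explicitly named job that is missing is still an error.
            raise JobNotFound(f"job {job_id} does not exist")
        closed.append(job_id)
    return closed


configs = {"foo": {}, "bar": {}}
job_ids, from_all = expand("_all", configs)
del configs["foo"]  # concurrent force delete between expansion and close
closed = close_expanded(job_ids, from_all, configs)
print(closed)  # ['bar']
```

The key design point is carrying the "came from _all" flag alongside the expanded IDs, so that later steps can tell a vanished entity from a caller's typo.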