[ML] Jindex: Rolling upgrade tests#35700
[ML] Jindex: Rolling upgrade tests#35700davidkyle merged 20 commits intoelastic:feature-jindex-6xfrom
Conversation
|
Pinging @elastic/ml-core |
|
I pushed another commit fixing the problems with When reviewing please pay close attention to |
There was a problem hiding this comment.
I would think that it is better to sort this once once all the tasks are added instead of for every task : tasks.
|
retest this please |
|
run gradle build tests |
2 similar comments
|
run gradle build tests |
|
run gradle build tests |
|
retest this please |
Use job update to set model snapshot and established memory in autodetect results processor. This simplifies the code.
… updates have completed before close
Accounts for jobs created by the other tests etc
Look for datafeed and job config in the cluster state if not in task parameters
58f5eff to
6e86639
Compare
bcce73b to
0a35dd6
Compare
0a35dd6 to
6532995
Compare
|
|
||
| private void finalizeIndexJobs(Collection<String> jobIds, ActionListener<AcknowledgedResponse> listener) { | ||
| String jobIdString = String.join(",", jobIds); | ||
| logger.debug("finalizing jobs [{}]", jobIdString); |
There was a problem hiding this comment.
nit: isn't it stringified to "[id1, id2, ...]" anyway? + less garbage for non-debug?
There was a problem hiding this comment.
Fair point. I'll remedy in a later commit as this is merged now.
| Logger logger) { | ||
| if (job == null) { | ||
| logger.debug("[{}] select node job is null", jobId); | ||
| } |
There was a problem hiding this comment.
job == null is valid for the rolling upgrade as that field isn't present in the persistent task parameters for jobs < v6.6.0. I added this to help me understand what was happening during the rolling upgrade tests it probably shouldn't be in the released code however.
| String reason = "Not opening job [" + jobId + "] on node [" + nodeNameOrId(node) | ||
| + "] version [" + node.getVersion() + "], because this node does not support " + | ||
| "jobs of version [" + job.getJobVersion() + "]"; | ||
| logger.trace(reason); |
There was a problem hiding this comment.
nit: debug seems more suited to me
There was a problem hiding this comment.
Remember that on a 100 node cluster each allocation will generate 100 messages similar to this one, which would be significant log spam. They get concatenated into the overall reason, which is stored in the cluster state if the persistent task exists (and returned in the error message in the case of this being called prior to opening).
All the other possible reasons for ruling out a node in this method also currently log at the trace level. I think they should all log at the same level, otherwise someone reading the logs could get a misleading picture of what is happening.
I would leave this as trace to match the others.
| jobUpdateSemaphore.acquire(); | ||
| } catch (InterruptedException e) { | ||
| Thread.currentThread().interrupt(); | ||
| LOGGER.info("[{}] Interrupted acquiring update established model memory semaphore", jobId); |
There was a problem hiding this comment.
Hmm, I don't think it's an error level message as the interrupt is not necessarily an error perhaps debug. ml code does not interrupt threads and the only usage I can find in es core outside of tests is in CancellableThreads
I doubt we will every see this message outside of testing. The test framework interrupts zombie threads after each test so that will probably be the only time we see it.
Adds rolling upgrade tests specific to the migration project and re-enables all the previously muted upgrade tests. A milestone
A number of minor issues had to be fixed to enable the tests:
.ml-configtemplate to be installed before running the tests