Skip to content

Cache compiled classes by vertex ID#11479

Closed
colinsurprenant wants to merge 4 commits intoelastic:masterfrom
colinsurprenant:class_cache_key
Closed

Cache compiled classes by vertex ID#11479
colinsurprenant wants to merge 4 commits intoelastic:masterfrom
colinsurprenant:class_cache_key

Conversation

@colinsurprenant
Copy link
Copy Markdown
Contributor

Fixes #11105
Relates to #11175

This fixes the compiled classes cache problem where the compiled classes were cached using the ComputeStepSyntaxElement instance object as the cache key. The problem is that new ComputeStepSyntaxElement instances are created per worker threads CompiledExecution thus the cached classes were never reused.

This solution proposes to use the DataSet associated vertex ID as the cache key. For this the ComputeStepSyntaxElement class now has a unique ID String field which is propagated from the CompiledPipeline where the vertex ID is known and used as the cache key.

Copy link
Copy Markdown
Member

@yaauie yaauie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I think the approach is sound, so long as a vertex's id is truly unique.

I haven't spent a lot of time in the JEE, so pardon my ignorance in the following questions.

  1. where does a vertex's id come from, and what is the effective risk of accidental collisions?
  2. would there be value in ever invalidating elements from the cache? (e.g., with frequent pipeline reloads do we orphan elements in the cache?)

@colinsurprenant
Copy link
Copy Markdown
Contributor Author

@yaauie

where does a vertex's id come from, and what is the effective risk of accidental collisions?

The Vertex id is either explicit

protected Vertex(SourceWithMetadata meta, String explicitId) {

by using the corresponding plugin id

protected Vertex(SourceWithMetadata meta, String explicitId) {

or uniquely generated by computing a hash from the source

public String getId() {
if (explicitId != null) return explicitId;
if (generatedId != null) return generatedId;
if (this.getGraph() == null) {
throw new RuntimeException("Attempted to get ID from PluginVertex before attaching it to a graph!");
}
// Generating unique hashes for vertices is very slow!
// We try to avoid this where possible, which means that generally only tests hit the path with hashes, since
// they have no source metadata. This might also be used in the future by alternate config languages which are
// willing to take the hit.
if (this.getSourceWithMetadata() != null) {
generatedId = Util.digest(this.graph.uniqueHash() + "|" + this.getSourceWithMetadata().uniqueHash());
} else {
generatedId = this.uniqueHash();
}
return generatedId;
}

for conditionals.

So a pipeline plugins and conditional Vertex ids will be unique within that pipeline.

This raises a good question though - although plugin ids are required to be unique within a pipeline I believe identical ids can be reused across multiple pipelines which might be a problem here because the cache is a static field thus global across the whole runtime and not a per-pipeline cache... 🤔

would there be value in ever invalidating elements from the cache?

That is a good question; there was no previous invalidation logic but I believe if a pipeline is reloaded the corresponding generated cached classes should be cleared if we assume that when reloading a pipeline we want to recompile everything and not try to, say, reuse unchanged parts of the config which seems an potential optimization we may tackle but certainly not now.

Maybe we could namespace the cache per pipeline id or maybe have that cache live in the CompiledPipeline object and move the caching logic higher up in that class (which is actually what I did in my first stab trying to fix that before I realized a cache existed in ComputeStepSyntaxElement).

@yaauie
Copy link
Copy Markdown
Member

yaauie commented Jan 9, 2020

🤔 If a vertex ID can be sourced from a plugin id, and the plugin id can be specified in the pipeline configuration, this change may introduce a bug in which reloads of any plugin that explicitly give their id in the pipeline configuration will be silently ignored. Is this okay? Are the directives that are a part of that plugin's invocation also cached (e.g., if they change which host an explicitly-id'd elasticsearch output points to, will that change get picked up when recompilation is skipped)?

I think you're right that there is a chance of collisions in the multiple pipeline scenario. If we move the cache up into CompiledPipeline, I believe we will lose cache persistence across pipeline reloads, which may negatively affect performance for some. Namespacing by pipeline id (or incorporating pipeline id into our vertex id) may be a good quick-fix.

I agree that cache eviction is beyond the scope for this PR, but we may want to file a separate issue to track it.

@colinsurprenant
Copy link
Copy Markdown
Contributor Author

@yaauie

If a vertex ID can be sourced from a plugin id, and the plugin id can be specified in the pipeline configuration, this change may introduce a bug in which reloads of any plugin that explicitly give their id in the pipeline configuration will be silently ignored.

Yes you are right. This will be a problem here. Good catch.

Are the directives that are a part of that plugin's invocation also cached

AFAICT they are not.

If we move the cache up into CompiledPipeline, I believe we will lose cache persistence across pipeline reloads, which may negatively affect performance for some.

True, but I think this might be a good thing actually:

  • First, as of now, caching does not work at all so solving the multiple recompile per worker thread is definitely a win in performance and staying at the same single pipeline compile cost for the initial creation and also across pipeline reloads seems reasonable for now.
  • Having a per CompiledPipeline cache solves cache invalidation (and potential leaks) since it will be wiped across reloads.
  • The only benefit for persisting a cache across pipeline reloads would be for the plugin statements and only if they have an explicit id since a) conditionals have a generated vertex id which will change across reloads and b) plugins generated ids will also change across reloads (not 100% sure for the latter).
  • Implementing a cross reload persisted cache seems to me like an optimization we could tackle in a second phase and is also tricky to make right (how do we invalidate removed config elements across reloads to avoid leaks? etc).

WDYT?

@colinsurprenant colinsurprenant changed the title cache compiled classes by vertex ID Cache compiled classes by vertex ID Jan 9, 2020
@colinsurprenant
Copy link
Copy Markdown
Contributor Author

I created #11482 as an alternate implementation as per above.

@colinsurprenant
Copy link
Copy Markdown
Contributor Author

closing in favour of #11482

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Start time much longer after upgrade from 6.x to 7.x

2 participants