Cache compiled classes by vertex ID by colinsurprenant · Pull Request #11479 · elastic/logstash

colinsurprenant · 2020-01-09T18:07:31Z

Fixes #11105
Relates to #11175

This fixes the compiled classes cache problem where the compiled classes were cached using the ComputeStepSyntaxElement instance object as the cache key. The problem is that new ComputeStepSyntaxElement instances are created per worker threads CompiledExecution thus the cached classes were never reused.

This solution proposes to use the DataSet associated vertex ID as the cache key. For this the ComputeStepSyntaxElement class now has a unique ID String field which is propagated from the CompiledPipeline where the vertex ID is known and used as the cache key.

yaauie

In general, I think the approach is sound, so long as a vertex's id is truly unique.

I haven't spent a lot of time in the JEE, so pardon my ignorance in the following questions.

where does a vertex's id come from, and what is the effective risk of accidental collisions?
would there be value in ever invalidating elements from the cache? (e.g., with frequent pipeline reloads do we orphan elements in the cache?)

logstash-core/src/main/java/org/logstash/config/ir/compiler/DatasetCompiler.java

logstash-core/src/main/java/org/logstash/config/ir/compiler/ComputeStepSyntaxElement.java

colinsurprenant · 2020-01-09T19:14:36Z

@yaauie

where does a vertex's id come from, and what is the effective risk of accidental collisions?

The Vertex id is either explicit

logstash/logstash-core/src/main/java/org/logstash/config/ir/graph/Vertex.java

Line 38 in 715295c

protected Vertex(SourceWithMetadata meta, String explicitId) {

by using the corresponding plugin id

logstash/logstash-core/src/main/java/org/logstash/config/ir/graph/Vertex.java

Line 38 in 715295c

protected Vertex(SourceWithMetadata meta, String explicitId) {

or uniquely generated by computing a hash from the source

logstash/logstash-core/src/main/java/org/logstash/config/ir/graph/Vertex.java

Lines 194 to 213 in 715295c

    
           public String getId() { 
        
               if (explicitId != null) return explicitId; 
        
               if (generatedId != null) return generatedId; 
        
               if (this.getGraph() == null) { 
        
                   throw new RuntimeException("Attempted to get ID from PluginVertex before attaching it to a graph!"); 
        
               } 
        
               // Generating unique hashes for vertices is very slow! 
        
               // We try to avoid this where possible, which means that generally only tests hit the path with hashes, since 
        
               // they have no source metadata. This might also be used in the future by alternate config languages which are 
        
               // willing to take the hit. 
        
               if (this.getSourceWithMetadata() != null) { 
        
                   generatedId = Util.digest(this.graph.uniqueHash() + "|" + this.getSourceWithMetadata().uniqueHash()); 
        
               } else { 
        
                   generatedId = this.uniqueHash(); 
        
               } 
        
               return generatedId; 
        
           }

for conditionals.

logstash/logstash-core/src/main/java/org/logstash/config/ir/graph/IfVertex.java

Line 20 in 715295c

super(meta);

So a pipeline plugins and conditional Vertex ids will be unique within that pipeline.

This raises a good question though - although plugin ids are required to be unique within a pipeline I believe identical ids can be reused across multiple pipelines which might be a problem here because the cache is a static field thus global across the whole runtime and not a per-pipeline cache... 🤔

would there be value in ever invalidating elements from the cache?

That is a good question; there was no previous invalidation logic but I believe if a pipeline is reloaded the corresponding generated cached classes should be cleared if we assume that when reloading a pipeline we want to recompile everything and not try to, say, reuse unchanged parts of the config which seems an potential optimization we may tackle but certainly not now.

Maybe we could namespace the cache per pipeline id or maybe have that cache live in the CompiledPipeline object and move the caching logic higher up in that class (which is actually what I did in my first stab trying to fix that before I realized a cache existed in ComputeStepSyntaxElement).

yaauie · 2020-01-09T19:43:58Z

🤔 If a vertex ID can be sourced from a plugin id, and the plugin id can be specified in the pipeline configuration, this change may introduce a bug in which reloads of any plugin that explicitly give their id in the pipeline configuration will be silently ignored. Is this okay? Are the directives that are a part of that plugin's invocation also cached (e.g., if they change which host an explicitly-id'd elasticsearch output points to, will that change get picked up when recompilation is skipped)?

I think you're right that there is a chance of collisions in the multiple pipeline scenario. If we move the cache up into CompiledPipeline, I believe we will lose cache persistence across pipeline reloads, which may negatively affect performance for some. Namespacing by pipeline id (or incorporating pipeline id into our vertex id) may be a good quick-fix.

I agree that cache eviction is beyond the scope for this PR, but we may want to file a separate issue to track it.

colinsurprenant · 2020-01-09T20:50:50Z

@yaauie

If a vertex ID can be sourced from a plugin id, and the plugin id can be specified in the pipeline configuration, this change may introduce a bug in which reloads of any plugin that explicitly give their id in the pipeline configuration will be silently ignored.

Yes you are right. This will be a problem here. Good catch.

Are the directives that are a part of that plugin's invocation also cached

AFAICT they are not.

If we move the cache up into CompiledPipeline, I believe we will lose cache persistence across pipeline reloads, which may negatively affect performance for some.

True, but I think this might be a good thing actually:

First, as of now, caching does not work at all so solving the multiple recompile per worker thread is definitely a win in performance and staying at the same single pipeline compile cost for the initial creation and also across pipeline reloads seems reasonable for now.
Having a per CompiledPipeline cache solves cache invalidation (and potential leaks) since it will be wiped across reloads.
The only benefit for persisting a cache across pipeline reloads would be for the plugin statements and only if they have an explicit id since a) conditionals have a generated vertex id which will change across reloads and b) plugins generated ids will also change across reloads (not 100% sure for the latter).
Implementing a cross reload persisted cache seems to me like an optimization we could tackle in a second phase and is also tricky to make right (how do we invalidate removed config elements across reloads to avoid leaks? etc).

WDYT?

colinsurprenant · 2020-01-09T22:45:26Z

I created #11482 as an alternate implementation as per above.

colinsurprenant · 2020-01-10T22:43:44Z

closing in favour of #11482

colinsurprenant added 3 commits January 9, 2020 12:43

use vertex ID as class cache key

ed4b6bb

refactor to add id field in ComputeStepSyntaxElement

0968e67

rename ket to vertexId

403d414

colinsurprenant mentioned this pull request Jan 9, 2020

Start time much longer after upgrade from 6.x to 7.x #11105

Closed

yaauie reviewed Jan 9, 2020

View reviewed changes

logstash-core/src/main/java/org/logstash/config/ir/compiler/DatasetCompiler.java Outdated Show resolved Hide resolved

logstash-core/src/main/java/org/logstash/config/ir/compiler/ComputeStepSyntaxElement.java Outdated Show resolved Hide resolved

fix formatting

29ebcd3

colinsurprenant changed the title ~~cache compiled classes by vertex ID~~ Cache compiled classes by vertex ID Jan 9, 2020

colinsurprenant mentioned this pull request Jan 9, 2020

Move class caching from ComputeStepSyntaxElement to CompiledPipeline #11482

Merged

colinsurprenant mentioned this pull request Jan 10, 2020

[meta] Pipeline Java Execution important issues #11175

Closed

8 tasks

colinsurprenant closed this Jan 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache compiled classes by vertex ID#11479

Cache compiled classes by vertex ID#11479
colinsurprenant wants to merge 4 commits intoelastic:masterfrom
colinsurprenant:class_cache_key

colinsurprenant commented Jan 9, 2020

Uh oh!

yaauie left a comment

Uh oh!

Uh oh!

Uh oh!

colinsurprenant commented Jan 9, 2020

Uh oh!

yaauie commented Jan 9, 2020

Uh oh!

colinsurprenant commented Jan 9, 2020

Uh oh!

colinsurprenant commented Jan 9, 2020

Uh oh!

colinsurprenant commented Jan 10, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

colinsurprenant commented Jan 9, 2020

Uh oh!

yaauie left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

colinsurprenant commented Jan 9, 2020

Uh oh!

yaauie commented Jan 9, 2020

Uh oh!

colinsurprenant commented Jan 9, 2020

Uh oh!

colinsurprenant commented Jan 9, 2020

Uh oh!

colinsurprenant commented Jan 10, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants