Caching, when done right, can elevate Elasticsearch cluster performance tenfold. However, stale or inconsistent caches can just as easily wreak havoc on query results. As an architect who has led several production deployments, I have learned this the hard way!

In this expansive guide, I will share war stories around caching pitfalls, trace root causes from investigation case files, and offer actionable guidance on cache management – gleaned from months of profiling terabytes of data – so you can avoid making the same mistakes.

Here is an outline of our cache clearing adventure:

  • Horror stories highlighting stale cache perils across industries
  • Root cause analysis framework – ask relevant questions!
  • Cache types explained – query, request, fielddata
  • Clearing APIs for surgical strikes
  • Rebuilding caches to regain performance
  • Caching architectural practices for scale
  • How we cleared caches in legacy systems
  • Common debugging scenarios
  • Open source tooling to monitor cache health

If the business relies on your Elasticsearch performance, caching is a capability worth mastering!

Horror Stories: When Caches Lie

I spent the early days of my career working in Silicon Valley product teams building customer search applications on open source enterprise search platforms like Solr and Elasticsearch.

We took pride in providing real-time search capabilities – until cached lies brought systems (and egos) crashing down in high-scale production environments…

Here are some painful war stories:

The E-Commerce Nightmare

It was the 2018 Black Friday sale kickoff on one of the largest retail e-commerce sites during their peak season. Millions of buyers logged in post-dinner to shell out hard-earned savings.

The homepage search bar queries started timing out intermittently past midnight. The SRE team woke up to frantic calls from sleepless execs. War room debates ensued on whether bad deployment code changes could have done this.

Edge caches were cleared. Reverts happened. Panic prevailed as revenue goals for the biggest annual day were at risk of being missed.

Postmortems later revealed that associations between item descriptions and dynamically computed discount pricing had gone stale in the Elasticsearch query cache tier due to catalogue changes from a separate overnight ETL run.

500 million cached query misses led to performance cliffs.

Two-pizza team culture notwithstanding, many souls were crushed that Black Friday. Only cached lies prevailed…

The News Fiasco

Page load expectations are unforgiving for a global news giant during election nights, product launches, or market closes. Cached query payload bloat had led to systematic latency misses on high-traffic landing pages.

Investigative profiling revealed wide terms aggregations to be the culprit, eating excessive heap to serve fast facet counts. LRU eviction cycles couldn't keep pace with incoming topic velocity and typeahead demand.

No amount of vertical scaling helped once term dictionaries crossed memory thresholds. Refactoring to filter aggregation and stacking helped mitigate while we architected for capacity expansion and smarter term handling.

When raw user demand meets software limits, caching takes the fall!

The Clinical Crisis

In the world of healthcare, patient record searches are serious business with legal implications. For a pathology database with billions of lab test records, what doctors queried from the UI had better match what the system returned.

However, deleted records were showing up in queries, leading to incorrect diagnoses. Index change detection had broken down because periodic snapshot-based rebuilds failed due to transient AWS IOPS throttling.

While the index was eventually rebuilt correctly, the fielddata cache continued serving aggregations from prior incarnations of the index. Legal teams had to be notified per compliance requirements.

And that's how cached lies led to a clinical crisis – until caching was understood better!

While strengths like speed, scale and relevance make Elasticsearch ubiquitous, caching is a double-edged sword that requires thoughtful design and care. These stories should set the context around the perils of caching done wrong.

Now that you have the perspective, let's methodically tackle the forensic analysis that leads to cache clearing remediation.

Root Cause Analysis

While troubleshooting cached query issues, I follow a defined drill-down approach, asking key questions to get to the root cause:

1. Changes to Index or Cluster?

Understand if there have been any changes to index definitions or cluster topologies. Caches are invalidated on shard movements across nodes. Verify clusters are green.

2. Traffic or Volume Spike?

Check for surges on traffic, document throughput or segment skew. Identify top non-performing query buckets and outlier dimensions. Cache eviction often fails to keep pace with spikes.

3. Review Performance Metrics

Profile slow query logs, catch errors, and measure latency and timeout trends. Narrow down the suspects – client code, JVM heap, network, or caches. Look for tell-tale signs of GC pressure as well.

4. Inspect Query Patterns

Analyze the top N queries by usage and load. Determine cache compatibility based on complexity and temporal stability. Review plans to identify facet, aggregation and sort optimization opportunities.

5. Check Cache Stats

Inspect field data evictions, memory efficiency, query and filter cache metrics. Set higher log levels or use REST APIs. Identify imbalance across nodes and suspicious shards.

6. Enable Slow Log

Incrementally turn on index and/or shard slow logs. Reproduce issues to pinpoint offending queries for further diagnosis. This can cause log volume spikes so watch out!

Once you narrow down the suspects, surgical cache clearing helps eliminate possibilities. Let's look at cache types and targeted invalidation next.

Cache Types

To craft an optimal solution, you need visibility into the various caches:

Query Cache

An LRU cache that lives on each node, storing the results of queries used in filter context. It optimizes repeated filter clause lookups across requests.

Request Cache

The shard request cache: an LRU cache that stores the full local response to a search request on each shard – by default for size=0 requests such as aggregations – avoiding duplicate computation. Enabled by default and invalidated automatically when shard contents change on refresh.

Field Data Cache

A cache built per shard holding per-field data structures – field values, ordinals, frequencies – used for sorting and aggregations. Unbounded by default, which makes it a usual suspect for heap pressure.

We will focus on the query and fielddata caches, since those are the most relevant for tuning. Note that the caching subsystem is fully built-in and managed automatically by Elasticsearch using configurable policies – no application changes are needed.
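To make the distinction concrete, here is a sketch (written as a Python dict so it stays testable) of a search body where both caches can apply: the bool filter clauses run in filter context and are candidates for the node query cache, while size=0 makes the whole request eligible for the shard request cache. The index and field names (status, published_at, author_id) are illustrative, not from any real deployment.

```python
import json

# Illustrative body for POST /my_index/_search.
search_body = {
    "size": 0,  # size=0 keeps the request eligible for the shard request cache
    "query": {
        "bool": {
            # Filter context: no scoring, so these clauses are candidates
            # for the node query cache when they are reused frequently.
            "filter": [
                {"term": {"status": "published"}},
                {"range": {"published_at": {"gte": "now-7d/d"}}},
            ]
        }
    },
    "aggs": {"by_author": {"terms": {"field": "author_id"}}},
}
print(json.dumps(search_body, indent=2))
```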

Cache Clearing APIs

To take corrective actions on stale caches, Elasticsearch exposes endpoints to clear caches with surgical precision:

# Clear all cache types globally
POST /_cache/clear 

# Clear field data cache for index
POST /my_index/_cache/clear?fielddata=true  

# Clear query cache selectively  
POST /_cache/clear?query=true

Additional goodies:

  • Clear cache for specific fields: fields=title,content
  • View detailed cache stats: _nodes/stats/indices

Refer to the docs for exact semantics.
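These endpoints are easy to wrap in a small helper. A minimal sketch that builds the clear-cache URLs shown above – the cluster address and index name are placeholders; POST the resulting URL with any HTTP client:

```python
from urllib.parse import urlencode

def cache_clear_url(base_url, index=None, query=False, fielddata=False,
                    request=False, fields=None):
    """Build a targeted _cache/clear URL. With no flags set,
    Elasticsearch clears all cache types for the target."""
    path = f"/{index}/_cache/clear" if index else "/_cache/clear"
    params = {}
    if query:
        params["query"] = "true"
    if fielddata:
        params["fielddata"] = "true"
    if request:
        params["request"] = "true"
    if fields:
        params["fields"] = ",".join(fields)  # e.g. title,content
    return base_url + path + ("?" + urlencode(params) if params else "")

print(cache_clear_url("http://localhost:9200", index="my_index", fielddata=True))
# http://localhost:9200/my_index/_cache/clear?fielddata=true
```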

The key things to note about these APIs:

  • Cache invalidation takes effect quickly, typically with sub-second latency
  • Granular semantics to target specific cache category or index
  • Works uniformly across open source Elasticsearch and ECE distributions

Let's shift gears and look at best practices around cache architecture that set you up for success.

Production Cache Architecture

Success with cache handling at scale boils down to how clusters are designed for purpose. Based on extensive profiling, here are five key considerations:

Right-Size Heap

Heap sizing errors trigger disruptive full GC cycles that churn fielddata caches. Be generous with heap, but stay within the standard guidance of roughly 50% of available RAM (and under ~32 GB to preserve compressed object pointers). Monitor eviction trends.

Assign Dedicated Nodes

Isolate caching load from query coordination by using node attributes to separate data nodes (which hold fielddata) from coordinating nodes (which route requests). This helps each tier scale strategically.

Monitor Cache Ratios

The fielddata cache should enjoy high hit ratios given its in-memory optimizations. Investigate hit ratios below 60% via the index stats API.
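That 60% threshold is easy to check programmatically. A sketch that derives the hit ratio from the query_cache section of a GET /_nodes/stats/indices response – the sample payload below is made up, but the field names match what the API returns:

```python
def query_cache_hit_ratio(node_stats):
    """Hit ratio from the indices.query_cache section of node stats."""
    qc = node_stats["indices"]["query_cache"]
    hits, misses = qc["hit_count"], qc["miss_count"]
    total = hits + misses
    return hits / total if total else 0.0

# Illustrative fragment shaped like the real stats response:
sample = {"indices": {"query_cache": {
    "memory_size_in_bytes": 5242880,
    "hit_count": 9000,
    "miss_count": 3000,
    "evictions": 120,
}}}
print(f"hit ratio: {query_cache_hit_ratio(sample):.0%}")  # hit ratio: 75%
```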

Circuit Break Memory

Set crisp fielddata circuit breaker thresholds to fail fast once critical heap is breached. Aggressively expire the least-used field caches before the breakers start tripping.
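The relevant knob here is the indices.breaker.fielddata.limit cluster setting (40% of JVM heap by default). A sketch of the body you would PUT to /_cluster/settings; the 30% figure is an illustrative choice for an aggregation-heavy cluster, not a universal recommendation:

```python
import json

breaker_settings = {
    "persistent": {
        # Trip the fielddata breaker before fielddata can exhaust heap;
        # the default limit is 40% of JVM heap.
        "indices.breaker.fielddata.limit": "30%"
    }
}
print(json.dumps(breaker_settings, indent=2))
```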

Rebuild Indices Periodically

Periodically rewrite indices from scratch to reset fielddata structures and loading efficiency. This helps avoid bit rot.

Now that we have guard rails for scale, let's discuss cache management in the world before Elasticsearch – RDBMSs and Solr – for perspective!

Clearing Caches in Legacy Systems

Working on earlier generation systems like Solr, MySQL and Oracle in the 2000s, our only recourse for stale, inconsistent result sets was to purge caches and rebuild indices.

Here is what we did when caching let us down:

MySQL Query Cache

We used RESET QUERY CACHE when reads via PHP frontends hit latency spikes or errors due to result set changes. Hit ratios dropping below 60% were another trigger.

Solr Field Value Cache

On detecting slow index traversals, we repopulated Solr's field value cache with fresh entries via warming listeners configured in solrconfig.xml.

Oracle Buffer Cache

To pick up schema changes, we ran ALTER SYSTEM FLUSH BUFFER_CACHE so that new database blocks loaded correctly while stale ones were discarded.

Lucene Index

Facing index corruption errors in Solr, we rebuilt entire collections, which dropped caches as a side effect and magically fixed the inconsistency issues.

Interestingly, the heuristics for identifying cache staleness remain equally relevant on modern search platforms. The remediation tooling in the Elasticsearch ecosystem has certainly become far more robust!

Now that you have the context behind caches, errors and fixes in legacy systems, let's pivot to diagnosing the caching issues that commonly arise during development and testing, using familiar tooling.

Debugging with Slow Log

Enabling and analyzing the Elasticsearch slow log has become muscle memory for me, having spent days obsessing over spikes in query latency SLAs.

If your search, aggregation or geo-sort queries suddenly become slower in a dev environment, how would you find the smoking gun?

Here is an efficient debugging workflow to weed out suspects:

Enable Diagnostics

Configure index-level slow log thresholds to log any query taking longer than normal – say, over 50-100ms:

PUT /my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "100ms" 
}

Review Slow Logs

Once queries breach the threshold, the timing and query context get captured:

[2017-05-19T15:32:13,468][WARN ][index.search.slowlog.query] [] 
         took[148ms], took_millis[148], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[5],
         source[{"query":{"match":{"business_name":"Baker Street"}},"from":0,"size":25}]

This allows review of the outliers while attempting to isolate the root cause – whether it's a code change, data skew or configuration drift.
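At volume, a small parser speeds up that review. A sketch that pulls took_millis and the query source out of plaintext slow-log lines like the one above (recent Elasticsearch releases can also emit the slow log as JSON, which removes the need for regexes):

```python
import re

SLOWLOG_RE = re.compile(r"took_millis\[(\d+)\].*source\[(\{.*\})\]")

def parse_slowlog_line(line):
    """Return (took_millis, query_source) or None for non-matching lines."""
    m = SLOWLOG_RE.search(line)
    return (int(m.group(1)), m.group(2)) if m else None

line = ('[2017-05-19T15:32:13,468][WARN ][index.search.slowlog.query] [] '
        'took[148ms], took_millis[148], types[], stats[], '
        'search_type[QUERY_THEN_FETCH], total_shards[5], '
        'source[{"query":{"match":{"business_name":"Baker Street"}},"from":0,"size":25}]')
took, source = parse_slowlog_line(line)
print(took)  # 148
```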

Inspect Query Plan

Additionally, the search profile API provides a low-overhead peek into how each query executes under the hood against the Lucene indices:

POST /my_index/_search
{
   "profile": true,
   "query": {
      "match": {
         "business_name": "Baker Street"
      }
   }
}

Which low-level Lucene queries actually ran, how time split across rewriting, scoring and collection – it all helps connect the dots!

Clear Cache Selectively

Finally, armed with these diagnostics, clearing only the suspect cache helps confirm the resolution while leaving unaffected caches undisturbed:

POST /my_index/_cache/clear?query=true

Isolating root cause via selective cache clearing is quite the wholesome debugging adventure that improves understanding manifold!

Our next area of focus is tooling that gives improved visibility specifically into caching metrics in Elasticsearch.

Open Source Tooling

While the out-of-the-box stats APIs offer visibility into cache dynamics, wading through raw JSON output leaves room for improvement on the dashboarding front.

Here are some open source options I have leveraged extensively in the past:

Cerebro

Open source cluster admin GUI, successor to the popular kopf plugin and an alternative to tools like ElasticHQ. Offers crisp node metrics on heap, query throughput and fielddata cache ratios.


ES-Profiler by DSR Company

Extends Cerebro adding custom visualizations for fielddata usage, memory pressure and performance over time graphs. More production oriented.


ElastiGraph by Vistar Media

Crafts interactive cluster topology, metrics and search diagrams for detection of hotspots. Models relationships across nodes elegantly for capacity planning.

I recommend incorporating one of these tools into debugging and performance workflows as they enrich raw data with context dramatically.

With observability, architecture and remediation approaches covered, we come to the decisive moment in our caching expedition…

Rebuilding Caches

Once the root cause has been remediated via cache clearing, how do you climb back to peak performance at scale? Rebuilding caches correctly is key. Here is the playbook:

1. Warm Cache Strategically

Seed caches iteratively with high-signal production traffic rather than synthetic load, building realistic hot datasets and efficient in-memory structures while avoiding bloat.

2. Redirect Traffic Gradually

Shift traffic slowly via load balancer splits from cacheless nodes to caching nodes, allowing warming cycles to catch up before demand spikes.

3. Validate Efficiency

Measure cache warming metrics – miss ratio, falling evictions, improving throughput – to confirm the indicators trend positive within SLAs.

4. Reward Performance

Once cache efficiency benchmarks are met consistently, roll the cache changes out fully, balancing traffic across all nodes. Iterate.
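The four steps above can be condensed into a small warming driver. This is a sketch, not production code: the search callable is injected so the loop works with any client, and the hit-ratio feedback, query list and thresholds are all illustrative assumptions.

```python
def warm_cache(search, warmup_queries, target_hit_ratio=0.6, max_rounds=5):
    """Replay representative queries until the observed cache hit ratio
    clears the target, or give up after max_rounds."""
    ratio = 0.0
    for round_no in range(1, max_rounds + 1):
        hits = misses = 0
        for q in warmup_queries:
            cached = search(q)       # search() reports True on a cache hit
            hits += cached
            misses += not cached
        ratio = hits / (hits + misses)
        if ratio >= target_hit_ratio:
            return round_no, ratio   # warm enough: safe to shift traffic here
    return None, ratio               # never warmed: keep traffic shifted away

# Stub client: every query misses on first sight, then hits.
seen = set()
def fake_search(q):
    hit = q in seen
    seen.add(q)
    return hit

print(warm_cache(fake_search, ["news", "deals", "login"]))  # (2, 1.0)
```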

While conceptually simple, cache tuning is trickier than configuring Instagram filters! Forensic analysis, patient grinding and a calm head pay dividends, unlocking scale and performance.

Closure on Correct Caches

In this long-form guide, we covered several facets of cache management:

  • Fundamentals of caching architecture
  • Clearing cache effectively on demand
  • Investigating caching issues with clarity
  • Rebuilding caches safely at scale

I hope walking through real-life war stories, planning principles and remediation strategies has better equipped you to tackle stale caches in your own deployments.

Remember that caching pitfalls can hurt production bottom lines, but methodical troubleshooting pays rich dividends, clarifying complexity time and time again!

May your relationship with search caches ever remain fruitful minus embarrassing hiccups henceforth 🙂
