[Monitoring] Fetch shard data more efficiently #54028
chrisronline merged 11 commits into elastic:master
Pinging @elastic/stack-monitoring (Team:Monitoring)
"jUT5KdxfRbORSCWkb5zjmA": {
  "shardCount": 38,
  "indexCount": 20,
  "shardCount": 5,
I'm pretty sure these are different because the new query now limits the shard data to the specific index instead of fetching it across all indices.
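As a rough illustration of why the counts shrink, here is a minimal sketch of a shard filter that can be scoped to a single index. The helper `buildShardQuery` and the field names (`type`, `cluster_uuid`, `shard.index`) are assumptions made for this example, not the query this PR actually ships.

```typescript
// Hypothetical shapes and field names, for illustration only.
interface TermClause {
  term: Record<string, string>;
}

interface ShardQuery {
  bool: { filter: TermClause[] };
}

// Build a filter for shard documents, optionally scoped to one index.
function buildShardQuery(clusterUuid: string, indexName?: string): ShardQuery {
  const filter: TermClause[] = [
    { term: { type: 'shards' } },
    { term: { cluster_uuid: clusterUuid } },
  ];
  if (indexName !== undefined) {
    // Scoping to one index means only that index's shards are counted.
    filter.push({ term: { 'shard.index': indexName } });
  }
  return { bool: { filter } };
}

const allIndices = buildShardQuery('YCxj-RAgSZCP6GuOQ8M1EQ');
const oneIndex = buildShardQuery('YCxj-RAgSZCP6GuOQ8M1EQ', 'my-index');
```

With the extra term clause, a per-index request only matches shard documents for that index, which would explain a lower `shardCount` in the fixture.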
  const { body } = await supertest
    .post(
-     '/api/monitoring/v1/clusters/YCxj-RAgSZCP6GuOQ8M1EQ/elasticsearch/nodes/jxcP6ue7eRCieNNitFTT0EA'
+     '/api/monitoring/v1/clusters/YCxj-RAgSZCP6GuOQ8M1EQ/elasticsearch/nodes/jUT5KdxfRbORSCWkb5zjmA'
I have no idea why this was changed. The original node id doesn't actually exist in the archived data! See #23715
const esIndexPattern = '*';
const cluster = {};
const stats = await getIndicesUnassignedShardStats(req, esIndexPattern, cluster);
expect(stats.indices).toEqual(indices);
Is it possible to also test status here? It looks like you already have the right replica/primary counts to test for all three colors 💚 💛 ❤️
It should test for it now. There is a status field here -> https://github.com/elastic/kibana/pull/54028/files/ffdb7d79aa65d7694eb3f2d88d45c16bedfcfc27#diff-3429702abd39406ddf3dc4c1ad63f5a6R12
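For readers following the thread, here is a minimal sketch of what covering all three colors could look like. It assumes status is derived from unassigned primary and replica counts in the same way Elasticsearch health colors work in general (red when a primary is unassigned, yellow when only replicas are, green otherwise). `deriveStatus` and `UnassignedCounts` are hypothetical names and are not taken from this PR.

```typescript
type ShardStatus = 'green' | 'yellow' | 'red';

// Hypothetical shape: unassigned shard counts for one index.
interface UnassignedCounts {
  primary: number;
  replica: number;
}

// Assumed rule: any unassigned primary is red, otherwise any
// unassigned replica is yellow, otherwise green.
function deriveStatus(unassigned: UnassignedCounts): ShardStatus {
  if (unassigned.primary > 0) return 'red';
  if (unassigned.replica > 0) return 'yellow';
  return 'green';
}

// One fixture per color, in the spirit of the review comment.
const cases: Array<[UnassignedCounts, ShardStatus]> = [
  [{ primary: 0, replica: 0 }, 'green'],
  [{ primary: 0, replica: 2 }, 'yellow'],
  [{ primary: 1, replica: 2 }, 'red'],
];

const results = cases.map(([counts]) => deriveStatus(counts));
```

A test in the style of the snippet above could then compare `results` against the expected color for each fixture.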
igoristic left a comment
This is awesome stuff @chrisronline! 🏆
My benchmarks were a little faster overall, but were still within similar margins. Maybe because I ran it from docker (or my computer is faster than yours).
@elasticmachine merge upstream
* For the nodes listing page, do not fetch shard data for indices
* Optimize our shard queries for the index and node listing pages
* This change isn't necessary
* Rename file and function
* Use optimized query for ml jobs and es overview
* Apply to node/index detail page, and more renaming
* Unnecessary change
* Fix tests
* Add basic tests

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Backport: 7.x: d811e4d
While debugging some performance issues on an ESMS cluster with @pickypg, we discovered that our query to fetch shard data (in an oversharded environment) performed very poorly. It turns out that there are a few major issues with our existing query:
This PR fixes all of these issues and drastically improves the loading time of various ES monitoring pages that slow down for large clusters.
Performance
On a sample ESMS cluster (which is severely oversharded), using a constant, absolute time period, I measured how long it takes to fetch shard stats data.
Current
Indices listing: ~23s
Nodes listing: ~23s
ML jobs listing: ~23s
ES cluster overview: ~23s
Index detail page: ~23s
Node detail page: ~23s
PR
Indices listing: ~1.7s
Nodes listing: ~1.7s
ML jobs listing: ~1.7s
ES cluster overview: ~1.7s
Index detail page: ~215ms
Node detail page: ~1.2s
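To put the numbers above in proportion, the short sketch below computes the speedup factor for each distinct timing. The timings are the approximate values listed above; the `speedup` helper and the variable names are introduced here for illustration only.

```typescript
// Approximate timings from the lists above, in milliseconds.
const beforeMs = 23000;
const afterMs: Record<string, number> = {
  listingAndOverviewPages: 1700, // indices, nodes, ML jobs, ES overview
  indexDetailPage: 215,
  nodeDetailPage: 1200,
};

// Speedup factor: how many times faster the new fetch is.
function speedup(before: number, after: number): number {
  return before / after;
}

const factors: Record<string, number> = {};
for (const [page, ms] of Object.entries(afterMs)) {
  factors[page] = speedup(beforeMs, ms);
  console.log(`${page}: ~${factors[page].toFixed(1)}x faster`);
}
```

Based on these approximate figures, the listing and overview pages come out roughly 13.5x faster, the node detail page roughly 19x, and the index detail page roughly 107x.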
Testing
This is a bit tricky. The UI should be unaffected: the API should return the same data the UI needs, so we're just looking to ensure we didn't miss something.
Notes