
Add vectors search track #217

Merged
jtibshirani merged 13 commits into elastic:master from mayya-sharipova:vectors_deep1b
Dec 15, 2021

Conversation

@mayya-sharipova
Contributor

Elasticsearch 8.0 introduces indexed vectors and the _knn_search operation
on them. This track is for benchmarking indexing, force merging,
and _knn_search operations on indexed vectors.

TODO:

- add deep1b dataset to rally-tracks.elastic.co

- add _knn_search operation to the challenge, once
elastic/rally#1380 is implemented.
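For context, the kind of _knn_search request this track will eventually benchmark can be sketched as follows. This is a hedged illustration, not taken from the track source: the field name "vector" and the truncated query vector are assumptions, and the k / num_candidates values are inferred from the operation names that appear later in this PR (e.g. knn-search-10-500).

```python
import json

# Illustrative sketch of an 8.0 _knn_search request body. The field name
# "vector" and the 3-element query vector are assumptions; real Deep1B
# vectors have 96 dimensions.
knn_body = {
    "knn": {
        "field": "vector",                    # dense_vector field under test
        "query_vector": [0.12, -0.45, 0.33],  # truncated example vector
        "k": 10,                              # nearest neighbors to return
        "num_candidates": 500,                # candidates considered per shard
    }
}
print(json.dumps(knn_body, indent=2))
```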
Contributor

@dliappis dliappis left a comment


Thanks for raising this!

I took a first look, starting at generating the dataset and left a suggestion for making the convert script a bit more user friendly.

Next, I'll test the track itself.

@mayya-sharipova
Contributor Author

@dliappis Thanks for the feedback on the script, great suggestions! I've pushed your suggestions in 48e9487.

Also want to report results from my laptop while running:

./rally race --track-path=.../rally-tracks/vectors_deep1b/ --car="4gheap" --pipeline=from-sources --revision=current --user-tag="intention:vectors1"  --track-params="bulk_indexing_clients:1,ingest_percentage:10"
Results
|                                                         Metric |         Task |       Value |   Unit |
|---------------------------------------------------------------:|-------------:|------------:|-------:|
|                     Cumulative indexing time of primary shards |              |     1.11228 |    min |
|             Min cumulative indexing time across primary shards |              |   0.0116833 |    min |
|          Median cumulative indexing time across primary shards |              |    0.556142 |    min |
|             Max cumulative indexing time across primary shards |              |      1.1006 |    min |
|            Cumulative indexing throttle time of primary shards |              |           0 |    min |
|    Min cumulative indexing throttle time across primary shards |              |           0 |    min |
| Median cumulative indexing throttle time across primary shards |              |           0 |    min |
|    Max cumulative indexing throttle time across primary shards |              |           0 |    min |
|                        Cumulative merge time of primary shards |              |     9.07122 |    min |
|                       Cumulative merge count of primary shards |              |           1 |        |
|                Min cumulative merge time across primary shards |              |           0 |    min |
|             Median cumulative merge time across primary shards |              |     4.53561 |    min |
|                Max cumulative merge time across primary shards |              |     9.07122 |    min |
|               Cumulative merge throttle time of primary shards |              |           0 |    min |
|       Min cumulative merge throttle time across primary shards |              |           0 |    min |
|    Median cumulative merge throttle time across primary shards |              |           0 |    min |
|       Max cumulative merge throttle time across primary shards |              |           0 |    min |
|                      Cumulative refresh time of primary shards |              |     4.45042 |    min |
|                     Cumulative refresh count of primary shards |              |          25 |        |
|              Min cumulative refresh time across primary shards |              |   0.0253167 |    min |
|           Median cumulative refresh time across primary shards |              |     2.22521 |    min |
|              Max cumulative refresh time across primary shards |              |      4.4251 |    min |
|                        Cumulative flush time of primary shards |              |     5.44108 |    min |
|                       Cumulative flush count of primary shards |              |           7 |        |
|                Min cumulative flush time across primary shards |              |   0.0421167 |    min |
|             Median cumulative flush time across primary shards |              |     2.72054 |    min |
|                Max cumulative flush time across primary shards |              |     5.39897 |    min |
|                                        Total Young Gen GC time |              |       2.202 |      s |
|                                       Total Young Gen GC count |              |         118 |        |
|                                          Total Old Gen GC time |              |           0 |      s |
|                                         Total Old Gen GC count |              |           0 |        |
|                                                     Store size |              |     1.56353 |     GB |
|                                                  Translog size |              | 1.02445e-07 |     GB |
|                                         Heap used for segments |              |           0 |     MB |
|                                       Heap used for doc values |              |           0 |     MB |
|                                            Heap used for terms |              |           0 |     MB |
|                                            Heap used for norms |              |           0 |     MB |
|                                           Heap used for points |              |           0 |     MB |
|                                    Heap used for stored fields |              |           0 |     MB |
|                                                  Segment count |              |           8 |        |
|                                                 Min Throughput | index-append |     6753.77 | docs/s |
|                                                Mean Throughput | index-append |     10894.2 | docs/s |
|                                              Median Throughput | index-append |     11498.8 | docs/s |
|                                                 Max Throughput | index-append |     11829.6 | docs/s |
|                                        50th percentile latency | index-append |     391.876 |     ms |
|                                        90th percentile latency | index-append |     451.808 |     ms |
|                                        99th percentile latency | index-append |      624.93 |     ms |
|                                       100th percentile latency | index-append |     709.739 |     ms |
|                                   50th percentile service time | index-append |     391.876 |     ms |
|                                   90th percentile service time | index-append |     451.808 |     ms |
|                                   99th percentile service time | index-append |      624.93 |     ms |
|                                  100th percentile service time | index-append |     709.739 |     ms |
|                                                     error rate | index-append |           0 |      % |


---------------------------------
[INFO] SUCCESS (took 952 seconds)
---------------------------------

Contributor

@dliappis dliappis left a comment


Thanks for iterating, @mayya-sharipova! I've left a few questions; the main one is whether we should have a warmup period for indexing, as per benchmarking best practices.

@mayya-sharipova
Contributor Author

mayya-sharipova commented Nov 22, 2021

@dliappis Thank you for another round of review. I've added some comments to clarify and more questions to ask.

@mayya-sharipova
Contributor Author

@dliappis Thanks for another round of review. I am going on vacation and will be away for some time. @jtibshirani will be the best point of contact.

@jtibshirani If you have time, feel free to add modifications to this PR.

@jtibshirani
Contributor

I pushed some changes:

  • Update the README to use a different link (still for the same dataset). This link is explicit about the license (CC-BY-4.0). I parsed the dataset and uploaded it to the rally-tracks GCS bucket.
  • Add _knn_search operations and a script_score query that uses vector functions.
  • Update the index settings to use 2 shards. This lets us exercise the result merging and fetching logic.

Notes:

  • On a local test, indexing and force-merging one shard (5 million out of 10 million total vectors) took ~2 hours. So the benchmark will take quite a while to complete!
  • For now I just randomly chose 3 vectors from the dataset to use as query vectors. Ideally we would use a small "test set" containing many different vectors. This would be a good follow-up, but I wanted to keep the initial PR simple.
  • Since kNN search uses an approximate algorithm, it's important to measure result accuracy in addition to latency/throughput. This helps ensure we're using realistic parameters in benchmarks. We don't measure or report accuracy right now, but we should look into that in a follow-up.
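To sketch what that follow-up accuracy measurement could look like: a common metric is recall@k, the fraction of the true top-k neighbors that the approximate search returns. The function below is a hypothetical helper, not part of this track; exact_ids would come from an exhaustive search (e.g. the script_score query) and approx_ids from _knn_search.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors the approximate search found.

    approx_ids: document ids returned by the approximate kNN search.
    exact_ids: ids from an exhaustive (brute-force) search over the same data.
    """
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k
```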

"operation": "knn-search-10-500",
"warmup-iterations": 100,
"iterations": 100,
"clients": 4
Contributor


I wasn't sure what values to select here, feedback is very welcome.

Contributor


To answer this, let's first clarify whether you deliberately want to run this as fast as possible; since no target-throughput is specified, that is what will happen. If this is the intended behavior, the value to choose here is a matter of what we want to benchmark, and it makes sense to keep clients at 1 to avoid saturating the target.

On the other hand, if you'd rather use a target-throughput, we should run some tests on the target hardware to calculate the optimal target throughput, and probably also specify a different schedule. In that case we could also use more than one client; each client will then run requests at a rate of target-throughput/clients.
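To make the schedule arithmetic above concrete, here is a trivial sketch (not Rally code) of how a target-throughput setting is divided across clients:

```python
def per_client_rate(target_throughput, clients):
    # Rally splits the overall target throughput evenly across clients,
    # so each client issues requests at target_throughput / clients ops/s.
    return target_throughput / clients

# e.g. a target of 100 ops/s spread over 4 clients gives each client 25 ops/s.
```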

Contributor


We discussed offline and agreed it would make sense to set clients: 1 with no specific target-throughput. This lets us run the track as fast as possible, avoiding unnecessary complexity; we don't have a specific reason to test request parallelism either. The Rally team is thinking of moving most "standard" tracks to this simple setup.

Contributor

@dliappis dliappis left a comment


@jtibshirani

Thank you for iterating and improving the track so much!

On a local test, indexing and force-merging one shard (5 million out of 10 million total vectors) took ~2 hours. So the benchmark will take quite a while to complete!

2 hours is quite a lot of time in nightly benchmarks. Would it make sense, for the nightlies, to reduce the amount of data ingested via the track parameter ingest_percentage?

If we go down that path, it may make sense to tweak the warmup and query iterations accordingly.

"name": "knn-search-10-500",
"operation": "knn-search-10-500",
"warmup-iterations": 100,
"iterations": 100,
Contributor


Both iterations and warmup-iterations could be a function of the ingest percentage. We'd need to do some measurements, of course. For an example, see

"warmup-iterations": {{ scale_iterations(50) }},

and

{% macro scale_iterations(iterations) -%}
{{ (iterations * (query_percentage | default(100)) / 100) | int }}
{%- endmacro %}
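In plain Python, the Jinja macro above is equivalent to the following sketch, with query_percentage defaulting to 100 just as the Jinja default filter does:

```python
def scale_iterations(iterations, query_percentage=100):
    # Mirrors the Jinja macro: scale the iteration count by the
    # query_percentage track parameter and truncate to an integer.
    return int(iterations * query_percentage / 100)

# scale_iterations(50) keeps the full 50 iterations;
# scale_iterations(50, query_percentage=20) scales down to 10.
```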

Contributor


This may be an obvious question, but could you explain why the iterations would scale with ingest percentage? It's not clear to me that there'd be a relationship.

Contributor


This may be an obvious question, but could you explain why the iterations would scale with ingest percentage? It's not clear to me that there'd be a relationship.

I don't think it's an obvious question, so it's good we are discussing it. As I mentioned earlier it could (but doesn't have to). As you mentioned in your comment:

I would suggest ingest_percentage: 20, corresponding to 2 million total docs, or roughly 1 million per shard. I will test this locally and update the number of iterations.

So if we update the amount of data we ingest, it might make sense to update the number of iterations (and warmup iterations?), and maybe there is a simple formula we could use.
However, I'd caution that evaluating the right number of warmup and measurement iterations depends on the hardware. Since we intend to run these on the nightly hardware, I suggest we make some measurements on similar hardware. Let's sync offline on how to do that.

@jtibshirani
Contributor

jtibshirani commented Dec 8, 2021

Would it make sense, for the nightlies, to reduce the amount of data ingested via the track parameter ingest_percentage?

We can definitely do this. I would suggest ingest_percentage: 20, corresponding to 2 million total docs, or roughly 1 million per shard. I will test this locally and update the number of iterations.

Update: I tested with ingest_percentage: 20, and the current number of iterations provides pretty tight results.

|                                   50th percentile service time |   knn-search-10-500 |     5.26591 |     ms |
|                                   90th percentile service time |   knn-search-10-500 |     6.01791 |     ms |
|                                   99th percentile service time |   knn-search-10-500 |     6.31891 |     ms |
|                                  100th percentile service time |   knn-search-10-500 |     6.36262 |     ms |

|                                   50th percentile service time |  knn-search-10-1000 |      6.0672 |     ms |
|                                   90th percentile service time |  knn-search-10-1000 |     7.26231 |     ms |
|                                   99th percentile service time |  knn-search-10-1000 |     8.08956 |     ms |
|                                  100th percentile service time |  knn-search-10-1000 |     40.5459 |     ms |

|                                   50th percentile service time | knn-search-100-1000 |     9.00785 |     ms |
|                                   90th percentile service time | knn-search-100-1000 |     10.1576 |     ms |
|                                   99th percentile service time | knn-search-100-1000 |     10.7948 |     ms |
|                                  100th percentile service time | knn-search-100-1000 |     10.8415 |     ms |

|                                   50th percentile service time |  script-score-query |     217.926 |     ms |
|                                   90th percentile service time |  script-score-query |      222.94 |     ms |
|                                   99th percentile service time |  script-score-query |      225.98 |     ms |
|                                  100th percentile service time |  script-score-query |     226.436 |     ms |

@rjurney

rjurney commented Dec 13, 2021

It looks like this is the final issue holding up ANN support in Elastic? elastic/elasticsearch#78473

@jtibshirani
Contributor

Hello @rjurney, we already merged support for ANN in Elasticsearch. You can actually try it out as part of an 8.0 early access build (https://www.elastic.co/downloads/past-releases/elasticsearch-8-0-0-beta1). This PR adds a set of benchmarks (to a separate repo) to help us identify performance issues/improvements. The results look pretty good so far, and we are on track to ship what we've merged.

@jtibshirani
Contributor

I ran some experiments on the nightly benchmark hardware to select better warm-up periods. Here are some screenshots showing the warm-up cutoffs for index-append and knn-search-10-500 (the other search operations are very similar):

[Two screenshots: warm-up cutoff charts for index-append and knn-search-10-500]

I ran into some difficulties with the index-append operation. Setting a generous client timeout like 240s and using --on-error=abort showed that the bulk index operations sometimes stall and time out. I managed to find a very stable configuration, using bulk_indexing_clients: 1 and a 4GB heap, that does not trigger timeouts.

These timeouts really need to be investigated, in case there is a problem with dense vector indexing. I would prefer to merge the track with this working configuration, then follow up with an investigation and fix in Elasticsearch.

@dliappis dliappis self-requested a review December 15, 2021 08:04
Contributor

@dliappis dliappis left a comment


Thank you for iterating here and tuning warmup iterations and indexing clients.

We are basically ready to merge. I left a comment with a question about the need to explicitly disable refreshes. I also wanted to clarify: we intend to run this on the nightly setup with a 20% bulk ingest percentage, right? This is defined elsewhere, so no need to do anything here; I just wanted to understand what we want to do later on for nightly runs.

@dliappis dliappis added the new workload label (Any work related to adding a new track or functionality within a track) and removed the enhancement label on Dec 15, 2021
@dliappis dliappis self-requested a review December 15, 2021 16:21
Contributor

@dliappis dliappis left a comment


LGTM! Thank you both @jtibshirani and @mayya-sharipova for the work here.

@jtibshirani
Contributor

I also wanted to clarify: we intent to run this on the nightly setup with 20 bulk ingest percentage right?

That's correct. Here are the notable parameters I used in test runs: --car=4gheap --client-options="timeout:240" --track-params="ingest_percentage:20".

@dliappis
Contributor

That's correct. Here are the notable parameters I used in test runs: --car=4gheap --client-options="timeout:240" --track-params="ingest_percentage:20".

That's perfect. We'll deal with passing the ingest_percentage parameter and car in another repo as it's specific to nightlies.

For now, feel free to merge this (please squash)!

I don't know if you've considered backporting this to any branch so that it can be run against released versions of ES.
There is branch logic in Rally (described in depth here: https://esrally.readthedocs.io/en/latest/track.html?highlight=cherry%20pick#custom-track-repositories) that determines which branch of rally-tracks will be used depending on the ES version being benchmarked.

In short, the most recent branch in the 7 series in this repo is 7.14, so if you backport to 7.14, that branch would be picked when the target ES version is >=7.14.0 and <8.

@jtibshirani jtibshirani merged commit ad80ba9 into elastic:master Dec 15, 2021