
Improve brute force vector search speed by using Lucene functions#96617

Merged
elasticsearchmachine merged 9 commits into elastic:main from benwtrent:feature/knn-brute-force-improvements
Jun 8, 2023
Conversation

@benwtrent
Member

Lucene has integrated hardware-accelerated vector calculations, meaning calculations like dot_product can be much faster when they go through the Lucene-defined functions.

When a dense_vector is indexed, we already take advantage of this. However, with index: false we store float vectors as binary fields in Lucene and decode them ourselves, so we don't use the underlying Lucene structures or functions.

To take advantage of this large performance boost, this PR refactors the binary vector values in the following ways:

  • Eagerly decode the binary blobs when iterated
  • Call the Lucene-defined VectorUtil functions when possible
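As a rough illustration of the two bullets above (a standalone sketch, not the actual PR code — class and method names here are made up): decode the whole binary blob into a float[] once, then score with a single dot-product call. In the real change the scoring call would be Lucene's org.apache.lucene.util.VectorUtil.dotProduct, which can dispatch to the hardware-accelerated (Panama vector API) implementation; a scalar stand-in is used below to keep the sketch self-contained, and the big-endian byte order is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class BruteForceSketch {
    // Eagerly decode the stored blob into a float[] instead of
    // re-reading one float per dimension on every access.
    static float[] decode(byte[] blob, int offset, int dims) {
        ByteBuffer buf = ByteBuffer.wrap(blob, offset, dims * Float.BYTES)
            .order(ByteOrder.BIG_ENDIAN); // byte order assumed for this sketch
        float[] vector = new float[dims];
        for (int dim = 0; dim < dims; dim++) {
            vector[dim] = buf.getFloat(); // relative read, starts at `offset`
        }
        return vector;
    }

    // Scalar stand-in for VectorUtil.dotProduct(a, b); the Lucene
    // version is what unlocks the SIMD speedup described above.
    static float dotProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Decoding once per iterated document also avoids the repeated per-dimension bounds checks that made the old hand-rolled decoding slow.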

related to: #96370

@benwtrent benwtrent added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.9.0 labels Jun 6, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 6, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Collaborator

Hi @benwtrent, I've created a changelog YAML for you.

@benwtrent
Member Author

Performance characteristics

I ran a custom rally track (a modification of dense_vector) to test this change, ensuring that the values were not indexed into the HNSW structure and were only scored via script_score. Here are some results.

These were run on my M1 Max MacBook, preferredBitSize=128 (WHICH IS LAME!!! GIMME AT LEAST 256 APPLE!!!!!)

Default results vs. no Vector Incubator API

This run compares the baseline (main) against my change with the Vector Incubator API improvements disabled. Even without them, my change is faster!

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                        Metric |                 Task |        Baseline |       Contender |        Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|---------------------:|----------------:|----------------:|------------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |                      |     1.49652     |     1.58852     |     0.092   |    min |   +6.15% |
|             Min cumulative indexing time across primary shard |                      |     0.747817    |     0.791517    |     0.0437  |    min |   +5.84% |
|          Median cumulative indexing time across primary shard |                      |     0.748258    |     0.794258    |     0.046   |    min |   +6.15% |
|             Max cumulative indexing time across primary shard |                      |     0.7487      |     0.797       |     0.0483  |    min |   +6.45% |
|           Cumulative indexing throttle time of primary shards |                      |     0           |     0           |     0       |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |                      |     0           |     0           |     0       |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |                      |     0           |     0           |     0       |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |                      |     0           |     0           |     0       |    min |    0.00% |
|                       Cumulative merge time of primary shards |                      |     0.8653      |     0.809883    |    -0.05542 |    min |   -6.40% |
|                      Cumulative merge count of primary shards |                      |     4           |     4           |     0       |        |    0.00% |
|                Min cumulative merge time across primary shard |                      |     0.431583    |     0.3972      |    -0.03438 |    min |   -7.97% |
|             Median cumulative merge time across primary shard |                      |     0.43265     |     0.404942    |    -0.02771 |    min |   -6.40% |
|                Max cumulative merge time across primary shard |                      |     0.433717    |     0.412683    |    -0.02103 |    min |   -4.85% |
|              Cumulative merge throttle time of primary shards |                      |     0.506683    |     0.496267    |    -0.01042 |    min |   -2.06% |
|       Min cumulative merge throttle time across primary shard |                      |     0.25275     |     0.24225     |    -0.0105  |    min |   -4.15% |
|    Median cumulative merge throttle time across primary shard |                      |     0.253342    |     0.248133    |    -0.00521 |    min |   -2.06% |
|       Max cumulative merge throttle time across primary shard |                      |     0.253933    |     0.254017    |     8e-05   |    min |   +0.03% |
|                     Cumulative refresh time of primary shards |                      |     0.109317    |     0.113283    |     0.00397 |    min |   +3.63% |
|                    Cumulative refresh count of primary shards |                      |    46           |    46           |     0       |        |    0.00% |
|              Min cumulative refresh time across primary shard |                      |     0.0539833   |     0.05465     |     0.00067 |    min |   +1.23% |
|           Median cumulative refresh time across primary shard |                      |     0.0546583   |     0.0566417   |     0.00198 |    min |   +3.63% |
|              Max cumulative refresh time across primary shard |                      |     0.0553333   |     0.0586333   |     0.0033  |    min |   +5.96% |
|                       Cumulative flush time of primary shards |                      |     0.0907833   |     0.1014      |     0.01062 |    min |  +11.69% |
|                      Cumulative flush count of primary shards |                      |     2           |     2           |     0       |        |    0.00% |
|                Min cumulative flush time across primary shard |                      |     0.045       |     0.0496333   |     0.00463 |    min |  +10.30% |
|             Median cumulative flush time across primary shard |                      |     0.0453917   |     0.0507      |     0.00531 |    min |  +11.69% |
|                Max cumulative flush time across primary shard |                      |     0.0457833   |     0.0517667   |     0.00598 |    min |  +13.07% |
|                                       Total Young Gen GC time |                      |     1.09        |     0.783       |    -0.307   |      s |  -28.17% |
|                                      Total Young Gen GC count |                      |    91           |    41           |   -50       |        |  -54.95% |
|                                         Total Old Gen GC time |                      |     0           |     0           |     0       |      s |    0.00% |
|                                        Total Old Gen GC count |                      |     0           |     0           |     0       |        |    0.00% |
|                                                    Store size |                      |     1.99315     |     2.04484     |     0.0517  |     GB |   +2.59% |
|                                                 Translog size |                      |     1.02445e-07 |     1.02445e-07 |     0       |     GB |    0.00% |
|                                        Heap used for segments |                      |     0           |     0           |     0       |     MB |    0.00% |
|                                      Heap used for doc values |                      |     0           |     0           |     0       |     MB |    0.00% |
|                                           Heap used for terms |                      |     0           |     0           |     0       |     MB |    0.00% |
|                                           Heap used for norms |                      |     0           |     0           |     0       |     MB |    0.00% |
|                                          Heap used for points |                      |     0           |     0           |     0       |     MB |    0.00% |
|                                   Heap used for stored fields |                      |     0           |     0           |     0       |     MB |    0.00% |
|                                                 Segment count |                      |     2           |     2           |     0       |        |    0.00% |
|                                   Total Ingest Pipeline count |                      |     0           |     0           |     0       |        |    0.00% |
|                                    Total Ingest Pipeline time |                      |     0           |     0           |     0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |                      |     0           |     0           |     0       |        |    0.00% |
|                                                Min Throughput |         index-append | 23392.3         | 22829.6         |  -562.622   | docs/s |   -2.41% |
|                                               Mean Throughput |         index-append | 23866.3         | 23267.7         |  -598.594   | docs/s |   -2.51% |
|                                             Median Throughput |         index-append | 23824.8         | 23208.6         |  -616.263   | docs/s |   -2.59% |
|                                                Max Throughput |         index-append | 24283.7         | 23695.1         |  -588.604   | docs/s |   -2.42% |
|                                       50th percentile latency |         index-append |   170.492       |   177.02        |     6.52729 |     ms |   +3.83% |
|                                       90th percentile latency |         index-append |   196.944       |   204.761       |     7.81731 |     ms |   +3.97% |
|                                       99th percentile latency |         index-append |   216.506       |   233.9         |    17.3939  |     ms |   +8.03% |
|                                      100th percentile latency |         index-append |   537.944       |   366.611       |  -171.333   |     ms |  -31.85% |
|                                  50th percentile service time |         index-append |   170.492       |   177.02        |     6.52729 |     ms |   +3.83% |
|                                  90th percentile service time |         index-append |   196.944       |   204.761       |     7.81731 |     ms |   +3.97% |
|                                  99th percentile service time |         index-append |   216.506       |   233.9         |    17.3939  |     ms |   +8.03% |
|                                 100th percentile service time |         index-append |   537.944       |   366.611       |  -171.333   |     ms |  -31.85% |
|                                                    error rate |         index-append |     0           |     0           |     0       |      % |    0.00% |
|                                                Min Throughput |  refresh-after-index |     0.549916    |     0.505059    |    -0.04486 |  ops/s |   -8.16% |
|                                               Mean Throughput |  refresh-after-index |     0.549916    |     0.505059    |    -0.04486 |  ops/s |   -8.16% |
|                                             Median Throughput |  refresh-after-index |     0.549916    |     0.505059    |    -0.04486 |  ops/s |   -8.16% |
|                                                Max Throughput |  refresh-after-index |     0.549916    |     0.505059    |    -0.04486 |  ops/s |   -8.16% |
|                                      100th percentile latency |  refresh-after-index |  1815.22        |  1977.07        |   161.856   |     ms |   +8.92% |
|                                 100th percentile service time |  refresh-after-index |  1815.22        |  1977.07        |   161.856   |     ms |   +8.92% |
|                                                    error rate |  refresh-after-index |     0           |     0           |     0       |      % |    0.00% |
|                                                Min Throughput | refresh-after-update |    80.8276      |    79.9921      |    -0.8355  |  ops/s |   -1.03% |
|                                               Mean Throughput | refresh-after-update |    80.8276      |    79.9921      |    -0.8355  |  ops/s |   -1.03% |
|                                             Median Throughput | refresh-after-update |    80.8276      |    79.9921      |    -0.8355  |  ops/s |   -1.03% |
|                                                Max Throughput | refresh-after-update |    80.8276      |    79.9921      |    -0.8355  |  ops/s |   -1.03% |
|                                      100th percentile latency | refresh-after-update |     9.11138     |     9.25912     |     0.14775 |     ms |   +1.62% |
|                                 100th percentile service time | refresh-after-update |     9.11138     |     9.25912     |     0.14775 |     ms |   +1.62% |
|                                                    error rate | refresh-after-update |     0           |     0           |     0       |      % |    0.00% |
|                                                Min Throughput |          force-merge |     0.0595047   |     0.0653879   |     0.00588 |  ops/s |   +9.89% |
|                                               Mean Throughput |          force-merge |     0.0595047   |     0.0653879   |     0.00588 |  ops/s |   +9.89% |
|                                             Median Throughput |          force-merge |     0.0595047   |     0.0653879   |     0.00588 |  ops/s |   +9.89% |
|                                                Max Throughput |          force-merge |     0.0595047   |     0.0653879   |     0.00588 |  ops/s |   +9.89% |
|                                      100th percentile latency |          force-merge | 16802.2         | 15290.3         | -1511.91    |     ms |   -9.00% |
|                                 100th percentile service time |          force-merge | 16802.2         | 15290.3         | -1511.91    |     ms |   -9.00% |
|                                                    error rate |          force-merge |     0           |     0           |     0       |      % |    0.00% |
|                                                Min Throughput |   script-score-query |     7.0288      |     8.55991     |     1.53111 |  ops/s |  +21.78% |
|                                               Mean Throughput |   script-score-query |     7.55875     |     9.68523     |     2.12648 |  ops/s |  +28.13% |
|                                             Median Throughput |   script-score-query |     7.61418     |     9.8201      |     2.20593 |  ops/s |  +28.97% |
|                                                Max Throughput |   script-score-query |     7.68795     |    10.0394      |     2.35142 |  ops/s |  +30.59% |
|                                       50th percentile latency |   script-score-query |   128.268       |    96.7239      |   -31.5439  |     ms |  -24.59% |
|                                       90th percentile latency |   script-score-query |   129.359       |   100.52        |   -28.8388  |     ms |  -22.29% |
|                                       99th percentile latency |   script-score-query |   133.435       |   102.186       |   -31.2491  |     ms |  -23.42% |
|                                     99.9th percentile latency |   script-score-query |   146.3         |   114.949       |   -31.3514  |     ms |  -21.43% |
|                                      100th percentile latency |   script-score-query |   149.823       |   116.18        |   -33.6427  |     ms |  -22.46% |
|                                  50th percentile service time |   script-score-query |   128.268       |    96.7239      |   -31.5439  |     ms |  -24.59% |
|                                  90th percentile service time |   script-score-query |   129.359       |   100.52        |   -28.8388  |     ms |  -22.29% |
|                                  99th percentile service time |   script-score-query |   133.435       |   102.186       |   -31.2491  |     ms |  -23.42% |
|                                99.9th percentile service time |   script-score-query |   146.3         |   114.949       |   -31.3514  |     ms |  -21.43% |
|                                 100th percentile service time |   script-score-query |   149.823       |   116.18        |   -33.6427  |     ms |  -22.46% |
|                                                    error rate |   script-score-query |     0           |     0           |     0       |      % |    0.00% |

Default results vs. my change + Vector Incubator API

The numbers speak for themselves ;)

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                        Metric |                 Task |        Baseline |       Contender |        Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|---------------------:|----------------:|----------------:|------------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |                      |     1.49652     |     1.65228     |     0.15577 |    min |  +10.41% |
|             Min cumulative indexing time across primary shard |                      |     0.747817    |     0.826067    |     0.07825 |    min |  +10.46% |
|          Median cumulative indexing time across primary shard |                      |     0.748258    |     0.826142    |     0.07788 |    min |  +10.41% |
|             Max cumulative indexing time across primary shard |                      |     0.7487      |     0.826217    |     0.07752 |    min |  +10.35% |
|           Cumulative indexing throttle time of primary shards |                      |     0           |     0           |     0       |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |                      |     0           |     0           |     0       |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |                      |     0           |     0           |     0       |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |                      |     0           |     0           |     0       |    min |    0.00% |
|                       Cumulative merge time of primary shards |                      |     0.8653      |     0.251483    |    -0.61382 |    min |  -70.94% |
|                      Cumulative merge count of primary shards |                      |     4           |     2           |    -2       |        |  -50.00% |
|                Min cumulative merge time across primary shard |                      |     0.431583    |     0.12565     |    -0.30593 |    min |  -70.89% |
|             Median cumulative merge time across primary shard |                      |     0.43265     |     0.125742    |    -0.30691 |    min |  -70.94% |
|                Max cumulative merge time across primary shard |                      |     0.433717    |     0.125833    |    -0.30788 |    min |  -70.99% |
|              Cumulative merge throttle time of primary shards |                      |     0.506683    |     0           |    -0.50668 |    min | -100.00% |
|       Min cumulative merge throttle time across primary shard |                      |     0.25275     |     0           |    -0.25275 |    min | -100.00% |
|    Median cumulative merge throttle time across primary shard |                      |     0.253342    |     0           |    -0.25334 |    min | -100.00% |
|       Max cumulative merge throttle time across primary shard |                      |     0.253933    |     0           |    -0.25393 |    min | -100.00% |
|                     Cumulative refresh time of primary shards |                      |     0.109317    |     0.106517    |    -0.0028  |    min |   -2.56% |
|                    Cumulative refresh count of primary shards |                      |    46           |    48           |     2       |        |   +4.35% |
|              Min cumulative refresh time across primary shard |                      |     0.0539833   |     0.0516167   |    -0.00237 |    min |   -4.38% |
|           Median cumulative refresh time across primary shard |                      |     0.0546583   |     0.0532583   |    -0.0014  |    min |   -2.56% |
|              Max cumulative refresh time across primary shard |                      |     0.0553333   |     0.0549      |    -0.00043 |    min |   -0.78% |
|                       Cumulative flush time of primary shards |                      |     0.0907833   |     0.1073      |     0.01652 |    min |  +18.19% |
|                      Cumulative flush count of primary shards |                      |     2           |     2           |     0       |        |    0.00% |
|                Min cumulative flush time across primary shard |                      |     0.045       |     0.0526167   |     0.00762 |    min |  +16.93% |
|             Median cumulative flush time across primary shard |                      |     0.0453917   |     0.05365     |     0.00826 |    min |  +18.19% |
|                Max cumulative flush time across primary shard |                      |     0.0457833   |     0.0546833   |     0.0089  |    min |  +19.44% |
|                                       Total Young Gen GC time |                      |     1.09        |     0.775       |    -0.315   |      s |  -28.90% |
|                                      Total Young Gen GC count |                      |    91           |    44           |   -47       |        |  -51.65% |
|                                         Total Old Gen GC time |                      |     0           |     0           |     0       |      s |    0.00% |
|                                        Total Old Gen GC count |                      |     0           |     0           |     0       |        |    0.00% |
|                                                    Store size |                      |     1.99315     |     2.01533     |     0.02218 |     GB |   +1.11% |
|                                                 Translog size |                      |     1.02445e-07 |     1.02445e-07 |     0       |     GB |    0.00% |
|                                        Heap used for segments |                      |     0           |     0           |     0       |     MB |    0.00% |
|                                      Heap used for doc values |                      |     0           |     0           |     0       |     MB |    0.00% |
|                                           Heap used for terms |                      |     0           |     0           |     0       |     MB |    0.00% |
|                                           Heap used for norms |                      |     0           |     0           |     0       |     MB |    0.00% |
|                                          Heap used for points |                      |     0           |     0           |     0       |     MB |    0.00% |
|                                   Heap used for stored fields |                      |     0           |     0           |     0       |     MB |    0.00% |
|                                                 Segment count |                      |     2           |     2           |     0       |        |    0.00% |
|                                   Total Ingest Pipeline count |                      |     0           |     0           |     0       |        |    0.00% |
|                                    Total Ingest Pipeline time |                      |     0           |     0           |     0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |                      |     0           |     0           |     0       |        |    0.00% |
|                                                Min Throughput |         index-append | 23392.3         | 23195.8         |  -196.462   | docs/s |   -0.84% |
|                                               Mean Throughput |         index-append | 23866.3         | 23333.3         |  -532.962   | docs/s |   -2.23% |
|                                             Median Throughput |         index-append | 23824.8         | 23253.6         |  -571.197   | docs/s |   -2.40% |
|                                                Max Throughput |         index-append | 24283.7         | 23698.7         |  -585.011   | docs/s |   -2.41% |
|                                       50th percentile latency |         index-append |   170.492       |   188.767       |    18.2744  |     ms |  +10.72% |
|                                       90th percentile latency |         index-append |   196.944       |   218.767       |    21.8235  |     ms |  +11.08% |
|                                       99th percentile latency |         index-append |   216.506       |   266.375       |    49.8685  |     ms |  +23.03% |
|                                      100th percentile latency |         index-append |   537.944       |   551.11        |    13.166   |     ms |   +2.45% |
|                                  50th percentile service time |         index-append |   170.492       |   188.767       |    18.2744  |     ms |  +10.72% |
|                                  90th percentile service time |         index-append |   196.944       |   218.767       |    21.8235  |     ms |  +11.08% |
|                                  99th percentile service time |         index-append |   216.506       |   266.375       |    49.8685  |     ms |  +23.03% |
|                                 100th percentile service time |         index-append |   537.944       |   551.11        |    13.166   |     ms |   +2.45% |
|                                                    error rate |         index-append |     0           |     0           |     0       |      % |    0.00% |
|                                                Min Throughput |  refresh-after-index |     0.549916    |     0.528989    |    -0.02093 |  ops/s |   -3.81% |
|                                               Mean Throughput |  refresh-after-index |     0.549916    |     0.528989    |    -0.02093 |  ops/s |   -3.81% |
|                                             Median Throughput |  refresh-after-index |     0.549916    |     0.528989    |    -0.02093 |  ops/s |   -3.81% |
|                                                Max Throughput |  refresh-after-index |     0.549916    |     0.528989    |    -0.02093 |  ops/s |   -3.81% |
|                                      100th percentile latency |  refresh-after-index |  1815.22        |  1889.02        |    73.8069  |     ms |   +4.07% |
|                                 100th percentile service time |  refresh-after-index |  1815.22        |  1889.02        |    73.8069  |     ms |   +4.07% |
|                                                    error rate |  refresh-after-index |     0           |     0           |     0       |      % |    0.00% |
|                                                Min Throughput | refresh-after-update |    80.8276      |   101.564       |    20.7368  |  ops/s |  +25.66% |
|                                               Mean Throughput | refresh-after-update |    80.8276      |   101.564       |    20.7368  |  ops/s |  +25.66% |
|                                             Median Throughput | refresh-after-update |    80.8276      |   101.564       |    20.7368  |  ops/s |  +25.66% |
|                                                Max Throughput | refresh-after-update |    80.8276      |   101.564       |    20.7368  |  ops/s |  +25.66% |
|                                      100th percentile latency | refresh-after-update |     9.11138     |     7.33983     |    -1.77154 |     ms |  -19.44% |
|                                 100th percentile service time | refresh-after-update |     9.11138     |     7.33983     |    -1.77154 |     ms |  -19.44% |
|                                                    error rate | refresh-after-update |     0           |     0           |     0       |      % |    0.00% |
|                                                Min Throughput |          force-merge |     0.0595047   |     0.0643411   |     0.00484 |  ops/s |   +8.13% |
|                                               Mean Throughput |          force-merge |     0.0595047   |     0.0643411   |     0.00484 |  ops/s |   +8.13% |
|                                             Median Throughput |          force-merge |     0.0595047   |     0.0643411   |     0.00484 |  ops/s |   +8.13% |
|                                                Max Throughput |          force-merge |     0.0595047   |     0.0643411   |     0.00484 |  ops/s |   +8.13% |
|                                      100th percentile latency |          force-merge | 16802.2         | 15539           | -1263.26    |     ms |   -7.52% |
|                                 100th percentile service time |          force-merge | 16802.2         | 15539           | -1263.26    |     ms |   -7.52% |
|                                                    error rate |          force-merge |     0           |     0           |     0       |      % |    0.00% |
|                                                Min Throughput |   script-score-query |     7.0288      |    11.0674      |     4.03855 |  ops/s |  +57.46% |
|                                               Mean Throughput |   script-score-query |     7.55875     |    12.6622      |     5.10345 |  ops/s |  +67.52% |
|                                             Median Throughput |   script-score-query |     7.61418     |    12.8619      |     5.24775 |  ops/s |  +68.92% |
|                                                Max Throughput |   script-score-query |     7.68795     |    13.2074      |     5.51943 |  ops/s |  +71.79% |
|                                       50th percentile latency |   script-score-query |   128.268       |    73.065       |   -55.2028  |     ms |  -43.04% |
|                                       90th percentile latency |   script-score-query |   129.359       |    76.7706      |   -52.5883  |     ms |  -40.65% |
|                                       99th percentile latency |   script-score-query |   133.435       |    77.9628      |   -55.4722  |     ms |  -41.57% |
|                                     99.9th percentile latency |   script-score-query |   146.3         |    79.9961      |   -66.3044  |     ms |  -45.32% |
|                                      100th percentile latency |   script-score-query |   149.823       |    87.4413      |   -62.3815  |     ms |  -41.64% |
|                                  50th percentile service time |   script-score-query |   128.268       |    73.065       |   -55.2028  |     ms |  -43.04% |
|                                  90th percentile service time |   script-score-query |   129.359       |    76.7706      |   -52.5883  |     ms |  -40.65% |
|                                  99th percentile service time |   script-score-query |   133.435       |    77.9628      |   -55.4722  |     ms |  -41.57% |
|                                99.9th percentile service time |   script-score-query |   146.3         |    79.9961      |   -66.3044  |     ms |  -45.32% |
|                                 100th percentile service time |   script-score-query |   149.823       |    87.4413      |   -62.3815  |     ms |  -41.64% |
|                                                    error rate |   script-score-query |     0           |     0           |     0       |      % |    0.00% |

ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);
for (int dim = 0; dim < vector.length; dim++) {
-    vector[dim] = byteBuffer.getFloat();
+    vector[dim] = byteBuffer.getFloat((dim + vectorBR.offset) * Float.BYTES);
@ChrisHegarty (Contributor) — Jun 6, 2023:

I would expect that vectorBR.offset is not needed here. In fact it would be a problem if it is anything other than 0?

@benwtrent (Author):

@ChrisHegarty I guess I misunderstand. I thought getFloat was the absolute position within the underlying bytes, not relative to the starting position (dictated by the wrapping of bytes with offset).

@benwtrent (Author):

My equation is wrong for sure, I think it should be (dim * Float.BYTES) + vectorBR.offset?

@benwtrent (Author):

This test passes with my fixed equation:

public void testDecoding() {
    float[] inputFloats = new float[] { 1f, 2f, 3f, 4f };
    ByteBuffer byteBuffer = ByteBuffer.allocate(16);
    for (float f : inputFloats) {
        byteBuffer.putFloat(f);
    }
    BytesRef floatBytes = new BytesRef(byteBuffer.array());
    floatBytes.length = 12;
    floatBytes.offset = 4;
    float[] outputFloats = new float[3];
    VectorEncoderDecoder.decodeDenseVector(floatBytes, outputFloats);
    assertArrayEquals(outputFloats, new float[] { 2f, 3f, 4f }, 0f);
}

@benwtrent (Author):

@ChrisHegarty what do you think? I fixed my math. Is this still an issue?

@ChrisHegarty (Contributor):

@benwtrent Your changes are good. The wrap with an offset and length is not relevant any more - it could be a plain wrap(byte[]) - since we are now using absolute addressing of the buffer - the complete byte[] is available to us even with wrap(byte[],int,int), which just sets up an initial position and limit.
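To make those addressing semantics concrete, here is a minimal JDK-only sketch (not code from this PR) showing that absolute getFloat(int) indexes into the full backing array, while wrap(byte[], int, int) only sets an initial position and limit for relative reads:

```java
import java.nio.ByteBuffer;

public class AbsoluteVsRelative {
    public static void main(String[] args) {
        // Four floats packed into a 16-byte backing array.
        ByteBuffer bb = ByteBuffer.allocate(16);
        for (float f : new float[] { 1f, 2f, 3f, 4f }) {
            bb.putFloat(f);
        }
        byte[] backing = bb.array();

        // Wrap with offset 4 and length 12: position starts at 4, limit is 16.
        ByteBuffer wrapped = ByteBuffer.wrap(backing, 4, 12);

        // Relative reads honor the position set by wrap(...): first read is 2.0.
        System.out.println(wrapped.getFloat());   // 2.0

        // Absolute reads ignore the position and index the full backing array.
        System.out.println(wrapped.getFloat(0));  // 1.0

        // So absolute addressing must add the offset back in itself.
        System.out.println(wrapped.getFloat(4));  // 2.0
    }
}
```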

@benwtrent (Author):

Microbenchmark for Bytes

Lucene also added SIMD improvements for bytes. The function requires byte[] and assumes full length and offset: 0. When index: true, the benefit is obvious. However, index: false poses an interesting conundrum.

BytesRef is returned from the value iterator, meaning there is some offset and length associated with the underlying bytes reference. I am not 100% sure these are always 0 and byte[]#length; copying the underlying array guarantees it.

The following benchmark was run on an M1 Max (128-bit vector lanes), over 100 random byte vectors. One set is "short" (96 dims), the other "medium" (768 dims).

If the vector API is NOT available and we still copy the BytesRef, the calculation is slower.

However, if the vector API is present, a speedup is achieved.

Here are some numbers (lower is better).

ByteBufferDecodingFunctionBenchmark.benchmarkByteCopied        768  avgt    5  13452.402 ± 21.980  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkByteCopied         96  avgt    5   2350.160 ±  5.545  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkByteNotCopied     768  avgt    5  11766.417 ± 20.773  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkByteNotCopied      96  avgt    5   1936.556 ±  3.638  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkDefault           768  avgt    5  24347.973 ± 24.249  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkDefault            96  avgt    5   3416.354 ± 65.553  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkDefaultCopied     768  avgt    5  26117.556 ± 50.439  ns/op
ByteBufferDecodingFunctionBenchmark.benchmarkDefaultCopied      96  avgt    5   3603.092 ± 42.338  ns/op

If we can bypass this internal byte[] copy, this change is really nice.

@jpountz && @jdconrad what do you think? Should we attempt to bypass the copy if offset: 0 in the BytesRef?
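For illustration, here is a JDK-only sketch of the guard being discussed. The toFullArray helper is hypothetical (not the PR's code); it copies only when the slice does not already span the whole backing array:

```java
import java.util.Arrays;

public class MaybeCopy {
    // Returns a full-length, zero-offset byte[] for the given slice, copying
    // only when necessary. (BytesRef is modeled here as bytes + offset + length.)
    static byte[] toFullArray(byte[] bytes, int offset, int length) {
        if (offset == 0 && length == bytes.length) {
            return bytes;  // already spans the whole array: no copy, no allocation
        }
        return Arrays.copyOfRange(bytes, offset, offset + length);
    }

    public static void main(String[] args) {
        byte[] backing = { 9, 1, 2, 3, 9 };
        byte[] whole = { 1, 2, 3 };

        // Slice in the middle of the array: must copy.
        System.out.println(Arrays.toString(toFullArray(backing, 1, 3)));  // [1, 2, 3]

        // Already full-length: returned as-is, same reference.
        System.out.println(toFullArray(whole, 0, 3) == whole);            // true
    }
}
```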

@benwtrent (Author):

@jpountz && @jdconrad I cannot bypass the copy when we store in binary blobs, as we append the danged magnitude at the end of the byte array.

@jdconrad (Contributor) left a review:

LGTM! One minor comment. The benchmarks look awesome :)

return defaultValue;
}
return new BinaryDenseVector(value, dims, indexVersion);
VectorEncoderDecoder.decodeDenseVector(value, vectorValue);
@jdconrad (Contributor):

It's possible get is used w/ the same value twice. Wonder if it makes sense to cache it?

@benwtrent (Author):

I can cache it for sure.

@benwtrent (Author):

added code to cache

@benwtrent benwtrent requested a review from jdconrad June 7, 2023 12:56
@benwtrent (Author):

@jdconrad I went ahead and added binary byte support for Lucene accelerated calculations. The byte array copy adds less than a millisecond of overhead when reading 25,000 values from the same field, but it cuts runtime almost in half when the Vector API is used. This seems justifiable to me.

Let me know if you significantly disagree.

@benwtrent (Author):

run elasticsearch-ci/bwc

@benwtrent (Author):

@elasticmachine update branch

@jdconrad (Contributor) left a review:

LGTM with the new changes. I agree with you that the performance trade-off for decoding is worth it.

@jpountz (Contributor) left a review:

Nice to find more places where these new vectorized methods can be used!

ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);
for (int dim = 0; dim < vector.length; dim++) {
-    vector[dim] = byteBuffer.getFloat();
+    vector[dim] = byteBuffer.getFloat((dim * Float.BYTES) + vectorBR.offset);
@jpountz (Contributor):

The vectorBR.offset is already taken into account to create the ByteBuffer, so we shouldn't add it there again?

Also unrelated to your PR, I would expect ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length).asFloatBuffer().get(vector) to run faster than reading a float at a time.
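A minimal JDK-only sketch (assumed dims and values, not the PR's code) showing that both decoding styles recover the same floats from an offset slice:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class DecodeStyles {
    public static void main(String[] args) {
        // Encode three floats (2, 3, 4) starting at byte offset 4.
        ByteBuffer enc = ByteBuffer.allocate(16);
        enc.putFloat(0f).putFloat(2f).putFloat(3f).putFloat(4f);
        byte[] bytes = enc.array();
        int offset = 4, length = 12;

        // Style 1: bulk transfer through a FloatBuffer view of the slice.
        float[] viaFloatBuffer = new float[3];
        ByteBuffer.wrap(bytes, offset, length).asFloatBuffer().get(viaFloatBuffer);

        // Style 2: absolute getFloat(int), adding the offset back in manually.
        float[] viaAbsolute = new float[3];
        ByteBuffer bb = ByteBuffer.wrap(bytes, offset, length);
        for (int dim = 0; dim < viaAbsolute.length; dim++) {
            viaAbsolute[dim] = bb.getFloat((dim * Float.BYTES) + offset);
        }

        System.out.println(Arrays.toString(viaFloatBuffer));  // [2.0, 3.0, 4.0]
        System.out.println(Arrays.toString(viaAbsolute));     // [2.0, 3.0, 4.0]
    }
}
```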

@jdconrad (Contributor):

@jpountz I just asked @mayya-sharipova about this same thing. Using getFloat(int) ignores the current position of the buffer. It may make sense to wrap without the offset and length parameters.

@benwtrent (Author):

> Also unrelated to your PR, I would expect ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length).asFloatBuffer().get(vector) to run faster than reading a float at a time.

@jpountz it doesn't. This is 30% faster than making it a float buffer. For some reason, getting the absolute is faster. @ChrisHegarty might have an intuition as to why.

@benwtrent (Author):

> The vectorBR.offset is already taken into account to create the ByteBuffer, so we shouldn't add it there again?

I honestly don't know. I added a test to verify (see: https://github.com/elastic/elasticsearch/pull/96617/files#diff-2d953a23603f9d7ef2f18d9f7bff3960307afc1a763e28fa2c7eca0ee3a65599). That test creates a ByteBuffer with a custom length and offset, and we get the expected results with my change.

@jpountz (Contributor):

Thanks @jdconrad, you are right indeed! Thanks @benwtrent for adding a test.

@benwtrent Thanks for checking performance. I remember that MikeS found that wrapping as a float buffer was faster when adding DataInput#readFloats, but that was with a direct byte buffer in little-endian byte order; those differences might matter.

@ChrisHegarty (Contributor):

Yes, it's the endianness of the bytebuffers we're using here - big in this case. It's most efficient to use the native endianness, otherwise byte swapping will occur. Since these are already in the index, is it possible to switch them to little?
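A small JDK-only sketch (not the PR's code) of why the byte order matters when decoding: reading bytes with the wrong order byte-swaps the value, and matching the native order avoids any swapping at all:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Endianness {
    public static void main(String[] args) {
        // Encode 1.0f in both byte orders; the on-disk byte sequences differ.
        byte[] big    = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putFloat(1f).array();
        byte[] little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putFloat(1f).array();

        // Decoding with the matching order recovers the value.
        System.out.println(ByteBuffer.wrap(big).order(ByteOrder.BIG_ENDIAN).getFloat());       // 1.0
        System.out.println(ByteBuffer.wrap(little).order(ByteOrder.LITTLE_ENDIAN).getFloat()); // 1.0

        // Reading little-endian bytes with the default (big-endian) order byte-swaps the value.
        System.out.println(ByteBuffer.wrap(little).getFloat());  // not 1.0

        // Only the platform's native order avoids byte swapping entirely.
        System.out.println("native order: " + ByteOrder.nativeOrder());
    }
}
```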

@benwtrent (Author):

@jpountz here are some benchmark results:

So, the average latency for decoding a float[] from ByteBuffers. Here is the benchmark code (if you can spot any wonkiness): https://gist.github.com/benwtrent/29cc0338cd851c345cace5c486095507

Direct read (via an index into the buffer) is always faster. FloatBuffer vs. ByteBuffer results are almost identical for both decoding styles (iteration vs. direct read).

Benchmark                                                          (floatArraySizes)  Mode  Cnt         Score         Error  Units
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferDirectRead                  96  avgt    5   3348858.476 ±   13314.154  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferDirectRead                 768  avgt    5  19960837.116 ±  137071.324  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferDirectRead                2048  avgt    5  49142580.882 ±  216614.762  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferIteration                   96  avgt    5   4992161.167 ±   47263.300  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferIteration                  768  avgt    5  33762525.212 ±  110609.781  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeByteBufferIteration                 2048  avgt    5  89409177.001 ± 1122561.441  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferDirectRead                 96  avgt    5   3529213.563 ±   24635.601  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferDirectRead                768  avgt    5  18152123.115 ±   49712.077  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferDirectRead               2048  avgt    5  46610729.883 ± 2462771.340  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferIteration                  96  avgt    5   5085876.957 ±   75157.360  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferIteration                 768  avgt    5  32359519.955 ±  399297.947  ns/op
ByteBufferFloatDecodeLatencyBenchmark.decodeFloatBufferIteration                2048  avgt    5  86083215.393 ± 1120197.606  ns/op

@benwtrent (Author):

@elasticmachine update branch

@benwtrent benwtrent added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 8, 2023
@benwtrent (Author):

run elasticsearch-ci/part-2

@mayya-sharipova (Contributor) left a review:

@benwtrent Thanks Ben, great speedups!

@elasticsearchmachine elasticsearchmachine merged commit 763cd14 into elastic:main Jun 8, 2023
@benwtrent benwtrent deleted the feature/knn-brute-force-improvements branch June 8, 2023 14:41
stu-elastic added a commit that referenced this pull request Jun 8, 2023
…esOpenChannels (#96635)

Also reduce all grace periods below 100ms
Fixes: #96617