Background
Per this RFC: #230, we’re going to implement a new feature in the neural-search plugin to provide sparse embedding ingestion and query capability to customers.
There are several approaches to implementing this feature, and several possible user interface designs. We created this separate RFC to list all the alternatives we've considered and to summarize their pros and cons.
New Components to implement
The new feature includes several main components:
- Mapper: defines a new field type to represent the sparse embedding data in the index.
- Ingestion processor: ingests data into the index.
- Query builder: builds the corresponding query and fetches data from the index.
Proposals
Mapper
Option 1 (preferred version)
Create a new mapper field type for sparse vectors (e.g. sparse_vector). An example is shown below:
{
  "mappings": {
    "properties": {
      "FIELD_NAME": {
        "type": "sparse_vector"
      }
    }
  }
}
There’ll be a new Mapper class to process this field type. The fundamental logic is to leverage Lucene's FeatureField for storage; reference: apache/lucene#11799 (comment).
Sample request to index a doc:
POST sample-index/_doc
{
  "sparse_field_name": {
    "hello": 1.1,
    "world": 2.3,
    "token": 5.8
  }
}
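For illustration, here is a minimal sketch of how such a mapper could index the token-weight map with Lucene's FeatureField (the field name, method name, and values are ours, not the final plugin API):

import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;

// One FeatureField per token: the token becomes the feature name and the weight
// the feature value, so Lucene indexes the pairs in its impact-aware postings
// rather than as a stored field.
static Document toLuceneDocument(String fieldName, Map<String, Float> tokenWeights) {
    Document doc = new Document();
    for (Map.Entry<String, Float> entry : tokenWeights.entrySet()) {
        doc.add(new FeatureField(fieldName, entry.getKey(), entry.getValue()));
    }
    return doc;
}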
pros
- The existing FeatureField meets our need to index and search sparse vectors efficiently. Lucene has already implemented the whole search pipeline, e.g. Query, Weight, and Scorer.
- A new field type is clear and straightforward for users.
cons
- Requires more learning effort for users to understand why and when to use this field type.
- Users need to install the neural-search plugin to use this new field type.
- The related code in Lucene is not very extensible (final classes, package-private visibility). If we want to modify it, we’ll need to submit a PR to Lucene or rewrite it in our own repo.
Important Note: We will introduce a new configurable mapping setting, “max_term_score_for_sparse_query”, for the “sparse_vector” field mapper. Both the match query and our sparse query use a Lucene Boolean query to connect all term-level subqueries, and Lucene's BooleanQuery leverages the WAND (Weak AND) algorithm to prune the search process, using each term's score upper bound to skip unnecessary traversals. For normal text match queries using BM25, the score upper bound is estimated as $IDF \cdot (k_1+1)$, where $k_1$ is the parameter controlling the weight of term frequency.
However, the default Lucene FeatureQuery uses Float.MAX_VALUE as the term upper bound, which effectively disables WAND and leads to an exhaustive search over the sparse terms. During the training of sparse models, components such as log activation and the FLOPS regularizer keep the model away from large outputs, so in practice the output score distribution has an upper bound. Since different sparse models have different score distributions over terms, we are going to expose this parameter to users (and we will also provide a recommended upper bound for our in-house pretrained sparse model). By default “max_term_score_for_sparse_query” takes its value from the index settings, but users can also overwrite the upper bound in the query. Our experiments show that a proper “max_term_score_for_sparse_query” value can reduce search latency by 4x while losing less than 0.1% precision.
As an alternative that yields the highest precision, if users don’t configure this setting we degrade to the default behavior of FeatureQuery and use Float.MAX_VALUE as the upper bound. A short sketch of the pruning arithmetic follows.
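To make the effect concrete, here is an illustrative sketch of the arithmetic WAND relies on; the helper names are ours and do not correspond to any Lucene API:

// Per-term BM25 score upper bound: as term frequency grows, the BM25 score
// approaches idf * (k1 + 1), so WAND can use this as a safe per-term bound.
static float bm25TermUpperBound(float idf, float k1) {
    return idf * (k1 + 1);
}

// WAND skips a candidate document only when the sum of the remaining terms'
// upper bounds cannot reach the current top-k threshold. If every bound is
// Float.MAX_VALUE (the FeatureQuery default), the sum always exceeds the
// threshold and nothing can ever be skipped, i.e. the search becomes exhaustive.
static boolean canSkip(float[] termUpperBounds, float minCompetitiveScore) {
    float sum = 0f;
    for (float bound : termUpperBounds) {
        sum += bound;
    }
    return sum < minCompetitiveScore;
}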
Option 2 (backup version)
In this option, we reuse existing mapper field types, e.g. a list of objects with text/keyword fields to store the tokens and scores in dedicated sub-fields. When querying this field, the user needs to pass flags or use a painless script to overwrite the default calculated scores. An example is shown below:
[
  {
    "token": "hello",
    "weight": 0.17347954
  },
  {
    "token": "world",
    "weight": 0.57349547
  }
]
pros
- No new learning needed for users.
cons
- Consumes more storage because of the repeated keys.
- Querying this field is not efficient, since post-processing (score calculation) is needed and the entire document has to be loaded for the score calculation.
- The implementation is complex, since the user needs to pass an extra flag or painless script to override the default calculated scores.
Option 3 (backup version)
In this option, we utilize the payload field of the OpenSearch “text” field. Since this field cannot be accessed directly, a typical implementation is to first transform the token-weight map into a flat string; a dedicated tokenizer then decodes the string and saves the weights into the Lucene payload. For the query phase, we decode the payload as a float score and sum the scores of all terms.
# User perspective
{
  "origin_field": "hello world"
}
# Inside the neural-search ingest pipeline, a new text field is created
{
  "sparse_field": "hello|1.2|world|1.4"
}
# This string is parsed to {hello: 1.2, world: 1.4} by our tokenizer.
# Tokens are stored like normal tokens, while weights are stored in the payload.
# For the query, we use PayloadScoreQuery with boost to compute the weight of each token.
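As an illustration of the query side (field name and token values are placeholders; package locations assume Lucene 9.x), the per-token clauses could look roughly like this:

import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.queries.payloads.PayloadDecoder;
import org.apache.lucene.queries.payloads.PayloadScoreQuery;
import org.apache.lucene.queries.payloads.SumPayloadFunction;
import org.apache.lucene.queries.spans.SpanTermQuery;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// One PayloadScoreQuery per query token: the stored payload is decoded as a
// float weight and summed, instead of scoring the term with BM25.
static Query payloadQuery(String field, List<String> queryTokens) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (String token : queryTokens) {
        builder.add(
            new PayloadScoreQuery(
                new SpanTermQuery(new Term(field, token)),
                new SumPayloadFunction(),
                PayloadDecoder.FLOAT_DECODER,
                false /* do not include the span score itself */),
            BooleanClause.Occur.SHOULD);
    }
    return builder.build();
}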
pros
- No need to implement a new field type.
cons
- Storage is not efficient, since the text field also stores offsets and positions, which have no meaning in our scenario.
- Ingestion and query depend on our ingest pipeline, and therefore on the model settings inside OpenSearch. To fit the format of this method, users would need to host the model inside OpenSearch.
Option 4
Reuse the existing OpenSearch field mapper “rank_features”.
pros
- Does not introduce a new component to OpenSearch.
cons
- May confuse users.
- Cannot introduce new parameters into the index mappings, e.g. “max_term_score_for_weighted_term_query”. We will also introduce index pruning algorithms, which require some hyperparameters.
- Alternative: set the hyperparameters in the query and in the ingestion processor.
Ingestion processor
Option 1 (preferred version)
Create a new ingestion processor type called sparse_embedding. Since the ingestion process is almost the same as the current neural-search ingestion processor, we prefer to reuse the existing code by making the current processor an abstract class and overriding some logic for sparse-format validation. There will be no duplicated code, and the new processor type is decoupled from the existing text_embedding. A rough sketch of the class layout follows.
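A minimal sketch of that refactoring, with hypothetical class and method names (not the plugin's actual classes) and a placeholder model call:

import java.util.Map;

// Hypothetical names for illustration only.
abstract class AbstractEmbeddingProcessor {
    // Shared logic: resolve the field_map, invoke the model through ml-commons,
    // then write the returned embedding back into the document.
    final void execute(Map<String, Object> document) {
        Object modelOutput = invokeModel(document);
        validateEmbedding(modelOutput);
        document.put(targetField(), modelOutput);
    }

    abstract String targetField();

    abstract Object invokeModel(Map<String, Object> document);

    abstract void validateEmbedding(Object modelOutput);
}

final class SparseEmbeddingProcessor extends AbstractEmbeddingProcessor {
    static final String TYPE = "sparse_embedding";

    @Override
    String targetField() {
        return "sparse_field_name";
    }

    @Override
    Object invokeModel(Map<String, Object> document) {
        // Placeholder: call the sparse encoding model registered in ml-commons.
        return Map.of("hello", 1.1f, "world", 2.3f);
    }

    @Override
    void validateEmbedding(Object modelOutput) {
        // Sparse-specific check: the output must be a token -> weight map.
        if (!(modelOutput instanceof Map)) {
            throw new IllegalArgumentException("expected a token-to-weight map");
        }
    }
}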
pros
- The new type is very clear for users to understand, and there won’t be any confusion.
- Decoupled from the existing processor type at the code level, so it’s also clear to maintainers.
cons
- A minor naming issue: text_embedding is a generic name that could mean either sparse or dense vector embedding, so it is not an accurate name for the current dense vector embedding case.
Option 2 (backup version)
Reuse the existing ingestion processor type text_embedding and add new flags in the pipeline configuration, e.g. embedding_type, which can be sparse or dense, like below:
{
  "description": "Embedding pipeline",
  "processors": [
    {
      "text_embedding": {
        "embedding_type": "sparse",
        "model_id": "TijfAYoBQ5nTrKUX7iTe",
        "field_map": {
          "text": "text_knn"
        },
        "ignore_failure": false
      }
    }
  ]
}
pros
- No need to create a new processor type.
cons
- A new field is needed to identify which type of embedding should be generated, which is configuration overhead for the user.
- In users’ understanding, text embedding is for dense vector embedding; adding a new field can break that impression and cause confusion for users who are not familiar with sparse vector embedding.
- The implementation is more complex, because we need to avoid breaking the existing logic.
Query Builder
There are two types of query that should be supported:
- Querying by raw vector inputs
The user can choose to compute the sparse vector outside of neural search and pass the sparse tokens & weights into the query; the query is then performed against the sparse field in the index (see the sketch after the example below).
{
  "query": {
    "sparse": {
      "sample-field-name": {
        "query_tokens": {
          "hello": 0.12314234324,
          "world": 0.37439243453,
          "#ello": 0.10854584966
        }
      }
    }
  }
}
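For illustration, such a query could translate into a Boolean combination of Lucene feature queries roughly as follows (method and field names are ours, not the final query builder):

import java.util.Map;

import org.apache.lucene.document.FeatureField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// One linear feature clause per query token: a document's score is the sum
// over matching tokens of (query weight * indexed feature value).
static Query fromQueryTokens(String field, Map<String, Float> queryTokens) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (Map.Entry<String, Float> entry : queryTokens.entrySet()) {
        builder.add(
            FeatureField.newLinearQuery(field, entry.getKey(), entry.getValue()),
            BooleanClause.Occur.SHOULD);
    }
    return builder.build();
}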
- Query with text through sparse embedding inference
The user can also query with text. The query text can be parsed into tokens via ml-commons with each token given a default weight of 1, which saves the model inference effort and is more efficient; when a model_id is provided, sparse encoding inference is invoked to produce the token weights (a small sketch of the default-weight case follows the example below).
{
  "query": {
    "sparse": {
      "sample-field-name": {
        "query_text": "hello world",
        "model_id": "mock_sparse_model_id"  // optional; when present, sparse encoding inference is invoked
      }
    }
  }
}
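A minimal sketch of the default-weight path, assuming a simple whitespace tokenization for illustration (the actual design parses the text via ml-commons):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// Every token from the query text becomes a feature clause with weight 1.0,
// so no model inference is required at query time.
static Query fromQueryText(String field, String queryText) throws IOException {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    try (Analyzer analyzer = new WhitespaceAnalyzer();
         TokenStream stream = analyzer.tokenStream(field, new StringReader(queryText))) {
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            builder.add(FeatureField.newLinearQuery(field, term.toString(), 1.0f),
                BooleanClause.Occur.SHOULD);
        }
        stream.end();
    }
    return builder.build();
}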
Option 1 (preferred)
Create two new query types to support the different queries shown above.
pros
- The new query types are clear and straightforward: they serve only sparse queries.
- They are implemented with entirely new code, decoupled from the existing code.
cons
- Users need to learn new types and understand why and when to use them.
Option 2 (backup version)
Reuse the neural query with new flags, like below:
{
  "query": {
    "neural": {
      "query_type": "sparse",  // or "knn"
      "query_text": "hello world",
      "model_id": "mock_model_id",
      "k": 10,
      "name": "test",
      "boost": 1.0,
      "filter": ""
    }
  }
}
pros
- No need to add new query types.
cons
- Tightly coupled with k-NN, and error prone to implement since it might accidentally affect the k-NN logic.
- The parameters are not compatible between these query types, e.g. k is specific to k-NN, not to sparse queries, which is not a clean approach in terms of user experience.
- Users understand neural search as performing k-NN queries; adding a new query type to support sparse queries could confuse them.