Skip to content

.Net MEVD: LINQ-based filtering#10273

Merged
roji merged 8 commits intomicrosoft:feature-vector-data-preb1from
roji:LinqFiltering
Feb 11, 2025
Merged

.Net MEVD: LINQ-based filtering#10273
roji merged 8 commits intomicrosoft:feature-vector-data-preb1from
roji:LinqFiltering

Conversation

@roji
Copy link
Member

@roji roji commented Jan 23, 2025

LINQ-based filtering

This implements LINQ-based filtering for all current implementations of IVectorStore (with the exception of Pinecone, blocking on #10415). This implements support for:

  • Equality and inequality
  • Comparison (GreaterThan, GreaterThanOrEqual, LessThan, LessThanOrEqual)
  • Logical operators (and, or, not)
  • Contains over collection field (r => r.Tags.Contains("foo"))
  • Contains over inline collection (r => new[] { "foo", "bar" }.Contains(r.String))

Notably missing at the moment is support for VectorStoreGenericDataModel in LINQ querying; I've opened #10468 and will do this a bit later, with additional planned work on the generic data model.

API naming

Vector databases consistency refer to this feature as filter/filtering; in some cases (e.g. Chroma), where is used, following SQL nomenclature:

It seems the choice of exposing this via a VectorSearchOptions.Filter is indeed the correct one. Since we want to maintain backwards compatibility for existing users and preserve the current legacy filtering mechanism, this PR must introduce a separate, temporary NewFilter for the new LINQ-based filteirng. When releasing the stable version, we'll need to remove the old Filter property rename NewFilter to Filter (breaking users which switched to NewFilter).

Custom mapping

The samples currently contain a MappingVectorStoreRecordCollection, which is a decorator transforming a public, user-facing record type to another internal one - the user provides two delegates to handle the conversion from the public record type to the internal, and vice versa. Such conversions become considerably more complex, since the user provides an expression tree referencing public record properties.

To make this work, the user would need to provide some logic to translate property accesses on the public record type to the internal one. For flat records, this could be a simple API where the user simply implements some interface translating public to internal; but for hierarchical records this gets a bit more complicated. I'd suggest deferring such advanced, arbitrary mapping scenarios to post-release. In the meantime, I've commented out MappingVectorStoreRecordCollection (and also Step4_NonStringKey_VectorStore) from the samples.

Note that the problem here doesn't come from LINQ per se, but from the fact that the user specifies model property names (the legacy search API had them specify storage property names instead, side-stepping the problem but causing an inconsistent API where sometimes model properties are used, and sometimes storage properties).

Integration test suite

In addition, this implements new integration test projects for all IVectorStore connectors, implementing most of #10194 (comment):

  • A shared FiltersTestBase class defines all LINQ filter tests, and is extended and used by all providers (link). This can serve as the basic "spec" for what filters are supported across connector (subject to database support, of course).
  • Each connector has its own separate project
  • Testcontainers are used for Mongo, PostgreSQL, Qdrant, Redis, Sqlite, and Weaviate. Azure AI Search, CosmosNoSQL and CosmosMongoDB require configuration (environment variables, settings.json) to provide endpoint information, and tests are skipped if that's not provided.

/cc @westey-m @dmytrostruk

Closes #10156
Does most of #10194

@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel memory labels Jan 23, 2025
@github-actions github-actions bot changed the title WIP on LINQ-based criteria filtering .Net: WIP on LINQ-based criteria filtering Jan 23, 2025
Task<VectorSearchResults<TRecord>> VectorizableTextSearchAsync(
string searchText,
VectorSearchOptions? options = default,
VectorSearchOptions<TRecord>? options = default,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making VectorSearchOptions generic over TRecord isn't trivial... Modern C# users can use target-typed new, and in that case this doesn't interfere in the method calling. But people not doing target-typed new would have to explicitly specify the full type every time (even if they don't use filtering at all), which maybe isn't ideal. The alternative would be to move the LINQ filter to an optional parameter directly on the method (no strong feelings from my side here).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are thinking of also making property selection an expression, that may result in a few more optional parameters on the method signature, to avoid making options generic. For hybrid search it would be 3 additional, which isn't great, especially considering that 2 of those would very rarely be used, i.e. only if there is more than one dense vector or more than one fulltextsearch property on the record.

@westey-m
Copy link
Contributor

@roji, a good scenario to also include in early exploration would be tag based operations, e.g. where there is an array of strings (tags) in a field in the database, and we want to check if one or more out of tags we provide match those.
We have received consistent feedback that this is a popular scenario for vector search filtering.

@roji
Copy link
Member Author

roji commented Jan 23, 2025

a good scenario to also include in early exploration would be tag based operations, e.g. where there is an array of strings (tags) in a field in the database, and we want to check if one or more out of tags we provide match those.
We have received consistent feedback that this is a popular scenario for vector search filtering.

@westey-m thanks - will do that, it should be easily representible in LINQ by doing b => b.Tags.Contains("foo"). Maybe let's talk tomorrow about how that's represented e.g. in Qdrant (just because I'm currently focusing on it).

@roji roji changed the base branch from main to feature-vector-data-preb1 January 24, 2025 18:18
Copy link
Member

@adamsitnik adamsitnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great work @roji ! I am very happy to see the Testcontainers in action. Also, they way you have implemented the base class for tests is very clever!

@@ -0,0 +1,15 @@
<Project>
<Import Project="../../Directory.Build.props" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great that we have a props file, so we can move all the things that would be common for all test projects here 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. There's some opportunity to refactor common stuff, but I didn't want to go too far in this already-huge PR.

Comment on lines +12 to +19
<ItemGroup>
<PackageReference Include="xunit"/>
<PackageReference Include="xunit.runner.visualstudio">
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
<PrivateAssets>all</PrivateAssets>
</PackageReference>
<PackageReference Include="Microsoft.NET.Test.Sdk"/>
</ItemGroup>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In one of the future refactorings, we could move this part to the Directory.Build.props file


[AttributeUsage(AttributeTargets.Method)]
[XunitTestCaseDiscoverer("VectorDataSpecificationTests.Xunit.ConditionalFactDiscoverer", "VectorDataIntegrationTests")]
public sealed class ConditionalFactAttribute : FactAttribute;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered importing the package that dotnet repo use to get ConditionalFact and other xUnit extensions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember there's something at least for internal .NET consumption, but I wasn't aware that there was a full public package meant for public consumption by anyone - if you point me to it we can consider using that (FWIW even EF has a copy of these internally). Though at the end of the day it's just a few classes which will basically never change, it's probably not super important.

@dmytrostruk
Copy link
Member

@roji There is small formatting warning in WeaviateTestStore.cs file:
image

If I remove it though, it says it can't find HttpClient for net472:
image

For simplicity of this PR, I think we can remove net472 from target frameworks and handle it in scope of separate task. We also need net472 for other projects and not only for vector stores.

@roji
Copy link
Member Author

roji commented Feb 11, 2025

For simplicity of this PR, I think we can remove net472 from target frameworks and handle it in scope of separate task. We also need net472 for other projects and not only for vector stores.

Yeah... Though I already did most of the work around this - I'll get rid of the last formatting stuff and merge, at the point where I am...

Copy link
Member

@dmytrostruk dmytrostruk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small improvements, but in general looks awesome!

return (collection as IVectorStoreRecordCollection<TKey, TRecord>)!;
}

#if DISABLED_FOR_NOW // TODO: See note on MappingVectorStoreRecordCollection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it not be #if !DISABLED_FOR_NOW?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well this basically is just a glorified comment (I did it with #if instead of /* */ mostly because the latter doesn't nest). If I negate, we'd have to define DISABLED_FOR_NOW in the csproj for this to be commented out...

// TODO: The user provides an expression tree accepting a TPublicRecord, but we require an expression tree accepting a TInternalRecord.
// TODO: This is something that the user must provide, and is quite advanced.

#if DISABLED
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be #if !DISABLED

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@dluc
Copy link
Contributor

dluc commented Feb 11, 2025

Notably missing at the moment is support for VectorStoreGenericDataModel in LINQ querying; I've opened #10468 and will do this a bit later, with additional planned work on the generic data model.

I could not understand the implications, where/when is VectorStoreGenericDataModel used? perhaps an old abstraction that could be deprecated?

@dluc
Copy link
Contributor

dluc commented Feb 11, 2025

I noticed a comment and discussion about this in the PR: if I understand correctly LINQ filters are expressed using internal storage field names, could you confirm?

If needed, is there a possibility for mapping external names to internal names? For example, let's assume I build a solution that works with TAGS, and they are stored in different ways, e.g.

  • a list of strings in Postgres, field name "tags"
  • a list of string fields in Redis, field names "tag1" "tag2" "tag3" ...
  • a JSON property in MongoDb, field name "doc.props.tags"

In this scenario would I be able to transform the incoming LINQ filter on "tags" to another LINQ filter targeting my storage of choice?

/// <param name="vectorValue">The vector to match.</param>
/// <param name="filter">The filter conditions for the query.</param>
/// <param name="legacyFilter">The filter conditions for the query.</param>
/// <param name="newFilter">The filter conditions for the query.</param>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming parameters is a notable breaking change. Would it make sense to tackle all the breaking changes at once, including removing the old filter and updating the type, to streamline the transition?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we discussed quite extensively recently... The decision was made to obsolete but preserve the current search mechanism, to prevent breakage for current users. We'd remove the legacy search API before releasing, and rename NewFilter to Filter at that point (this creates more churn for users keeping up to date and using the new API).

#pragma warning disable CS0618 // VectorSearchFilter is obsolete
var (where, parameters) = (oldFilter: legacyFilter, newFilter) switch
{
(not null, not null) => throw new ArgumentException("Either Filter or NewFilter can be specified, but not both"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use nameof(Filter) and nameof(NewFilter) in case params are renamed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion - I'll pass in this specific case only because Filter will be removed soon (before release) anyway and this whole piece of code will go away.


private readonly StringBuilder _sql = new();

internal (string Clause, List<object> Parameters) Translate(
Copy link
Contributor

@dluc dluc Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think there's room for SQL injection, considering that the resulting string is merged into a SQL statement?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely (and that's very important). There's a test exercising an input with a single quote to check exactly for this. I have a note to do a general comprehensive pass to make sure everything is airtight - stuff like property names isn't currently safe (those generally aren't user input, but I still need to make them airtight).

@dluc
Copy link
Contributor

dluc commented Feb 11, 2025

Could you confirm that NOT works also with Contains? example

  • NOT Contains over collection field (r => !r.Tags.Contains("foo"))

thanks!

await this._container.StartAsync();
var redis = await ConnectionMultiplexer.ConnectAsync($"{this._container.Hostname}:{this._container.GetMappedPublicPort(6379)},connectTimeout=60000,connectRetry=5");
this._database = redis.GetDatabase();
this._defaultVectorStore = new(this._database);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure to test both instances of redis: Json and Hashset. It's possible to choose the flavour by setting the StorageType option on RedisVectorStoreOptions. They do differ in some non-obvious ways, especially with regards to how you specify property names.

Copy link
Member Author

@roji roji Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally missed that, thanks... I pushed tests for both. Depending on the exact differences, it might make sense to just treat these as two databases test-wise and have two projects (but for now fixtures and tests are duplicated in the same projects).

Query-wise there are some unsupported scenarios in HashSet compared to JSON (array fields, null values), but otherwise the tests pass.

@roji
Copy link
Member Author

roji commented Feb 11, 2025

@dluc

Notably missing at the moment is support for VectorStoreGenericDataModel in LINQ querying; I've opened #10468 and will do this a bit later, with additional planned work on the generic data model.

I could not understand the implications, where/when is VectorStoreGenericDataModel used? perhaps an old abstraction that could be deprecated?

VectorStoreGenericDataModel is basically a way to use this abstraction without a strongly-typed .NET type (e.g. for dynamic scenarios). This is already supported in MEVD, but this PR doesn't add support for it in the new filtering mechanism. This is absolutely something I intend to do.

I noticed a comment and discussion about this in the PR: if I understand correctly LINQ filters are expressed using internal storage field names, could you confirm?

No - the LINQ filter expression references properties on the user's .NET type, and so references regular .NET properties.

For example, in collection.VectorizedSearchAsync(vector, new() { NewFilter = r => r.Age > 8 }), Age is a .NET property on a real .NET type (represented here by the lambda parameter r). MEVD allows mapping .NET properties to arbitrary storage names (the names that you actually see in the database), and that's fully compatible with the new LINQ filtering (the translator goes through an internal translator table from user-facing property names to storage properties).

If needed, is there a possibility for mapping external names to internal names? For example, let's assume I build a solution that works with TAGS, and they are stored in different ways, e.g.

a list of strings in Postgres, field name "tags"
a list of string fields in Redis, field names "tag1" "tag2" "tag3" ...
a JSON property in MongoDb, field name "doc.props.tags"
In this scenario would I be able to transform the incoming LINQ filter on "tags" to another LINQ filter targeting my storage of choice?

For all of these scenarios, your .NET type would simply have a regular List<string> property called Tags; referencing it in the LINQ filter and e.g. calling Contains on that property should work simply work on all these databases. One note though: AFAIK in Redis, this would not be mapped to multiple string fields, but rather with a single Tag field; this already fully works with the Redis connector. This scenario is exercised by this test.

Could you confirm that NOT works also with Contains? example

Yes - it does in general. Specific database support varies sometimes (e.g. MongoDB has some general issues with NOT in vector search pre-filters), but it works fine on most databases.

@roji
Copy link
Member Author

roji commented Feb 11, 2025

BTW @dluc we're definitely interested in feedback about any other specific filtering features you might be interested, don't hesitate to let us know.

@dluc
Copy link
Contributor

dluc commented Mar 13, 2025

@roji this is a scenario I'm about to test.

I'm curious to see this in action since I don't have classes modeling data in storage, every definition is set at runtime. I guess internal field names and external field names will match in every scenario, thus it should just work™️

I noticed a comment and discussion about this in the PR: if I understand correctly LINQ filters are expressed using internal storage field names, could you confirm?

No - the LINQ filter expression references properties on the user's .NET type, and so references regular .NET properties.

@roji
Copy link
Member Author

roji commented Mar 13, 2025

I'm curious to see this in action since I don't have classes modeling data in storage, every definition is set at runtime. I guess internal field names and external field names will match in every scenario, thus it should just work

@dluc if I understand correctly, does that mean you using VectorStoreGenericDataModel? If so, note that the LINQ-based filteirng hasn't yet been implemented for that - #10468 tracks that. Full support for dynamic scenarios is absolutely high our list, but some fixes probably need to happen first (see #10802).

So long story short, for dynamic scenarios you'll likely need to wait a bit longer until everything is fully in place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kernel.core kernel Issues or pull requests impacting the core kernel memory connector memory msft.ext.vectordata Related to Microsoft.Extensions.VectorData .NET Issue or Pull requests regarding .NET code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants