.Net MEVD: LINQ-based filtering#10273
.Net MEVD: LINQ-based filtering#10273roji merged 8 commits intomicrosoft:feature-vector-data-preb1from
Conversation
dotnet/src/Connectors/Connectors.Memory.Qdrant/QdrantFilterTranslator.cs
Outdated
Show resolved
Hide resolved
| Task<VectorSearchResults<TRecord>> VectorizableTextSearchAsync( | ||
| string searchText, | ||
| VectorSearchOptions? options = default, | ||
| VectorSearchOptions<TRecord>? options = default, |
There was a problem hiding this comment.
Making VectorSearchOptions generic over TRecord isn't trivial... Modern C# users can use target-typed new, and in that case this doesn't interfere in the method calling. But people not doing target-typed new would have to explicitly specify the full type every time (even if they don't use filtering at all), which maybe isn't ideal. The alternative would be to move the LINQ filter to an optional parameter directly on the method (no strong feelings from my side here).
There was a problem hiding this comment.
Since we are thinking of also making property selection an expression, that may result in a few more optional parameters on the method signature, to avoid making options generic. For hybrid search it would be 3 additional, which isn't great, especially considering that 2 of those would very rarely be used, i.e. only if there is more than one dense vector or more than one fulltextsearch property on the record.
dotnet/src/Connectors/VectorData.Abstractions/VectorSearch/VectorSearchOptions.cs
Show resolved
Hide resolved
dotnet/src/VectorDataTests/VectorDataSpecificationTests/Filter/BasicFilterTestsBase.cs
Outdated
Show resolved
Hide resolved
dotnet/src/VectorDataTests/QdrantTests/Support/TestContainer/QdrantBuilder.cs
Show resolved
Hide resolved
dotnet/src/VectorDataTests/QdrantTests/Support/TestContainer/QdrantConfiguration.cs
Show resolved
Hide resolved
dotnet/src/VectorDataTests/QdrantTests/Support/TestContainer/QdrantContainer.cs
Show resolved
Hide resolved
|
@roji, a good scenario to also include in early exploration would be tag based operations, e.g. where there is an array of strings (tags) in a field in the database, and we want to check if one or more out of tags we provide match those. |
dotnet/src/VectorDataTests/VectorDataSpecificationTests/Filter/BasicFilterTestsBase.cs
Outdated
Show resolved
Hide resolved
@westey-m thanks - will do that, it should be easily representible in LINQ by doing |
adamsitnik
left a comment
There was a problem hiding this comment.
LGTM, great work @roji ! I am very happy to see the Testcontainers in action. Also, they way you have implemented the base class for tests is very clever!
...c/VectorDataIntegrationTests/CosmosMongoDBIntegrationTests/Support/CosmosMongoDBTestStore.cs
Outdated
Show resolved
Hide resolved
| @@ -0,0 +1,15 @@ | |||
| <Project> | |||
| <Import Project="../../Directory.Build.props" /> | |||
There was a problem hiding this comment.
It's great that we have a props file, so we can move all the things that would be common for all test projects here 👍
There was a problem hiding this comment.
Yep. There's some opportunity to refactor common stuff, but I didn't want to go too far in this already-huge PR.
| <ItemGroup> | ||
| <PackageReference Include="xunit"/> | ||
| <PackageReference Include="xunit.runner.visualstudio"> | ||
| <IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets> | ||
| <PrivateAssets>all</PrivateAssets> | ||
| </PackageReference> | ||
| <PackageReference Include="Microsoft.NET.Test.Sdk"/> | ||
| </ItemGroup> |
There was a problem hiding this comment.
In one of the future refactorings, we could move this part to the Directory.Build.props file
dotnet/src/VectorDataIntegrationTests/MongoDBIntegrationTests/Support/MongoDBTestStore.cs
Show resolved
Hide resolved
dotnet/src/VectorDataIntegrationTests/VectorDataIntegrationTests/Filter/BasicFilterTestsBase.cs
Show resolved
Hide resolved
dotnet/src/VectorDataIntegrationTests/VectorDataIntegrationTests/Filter/FilterFixtureBase.cs
Show resolved
Hide resolved
|
|
||
| [AttributeUsage(AttributeTargets.Method)] | ||
| [XunitTestCaseDiscoverer("VectorDataSpecificationTests.Xunit.ConditionalFactDiscoverer", "VectorDataIntegrationTests")] | ||
| public sealed class ConditionalFactAttribute : FactAttribute; |
There was a problem hiding this comment.
Have you considered importing the package that dotnet repo use to get ConditionalFact and other xUnit extensions?
There was a problem hiding this comment.
I remember there's something at least for internal .NET consumption, but I wasn't aware that there was a full public package meant for public consumption by anyone - if you point me to it we can consider using that (FWIW even EF has a copy of these internally). Though at the end of the day it's just a few classes which will basically never change, it's probably not super important.
|
@roji There is small formatting warning in If I remove it though, it says it can't find For simplicity of this PR, I think we can remove |
Yeah... Though I already did most of the work around this - I'll get rid of the last formatting stuff and merge, at the point where I am... |
.../src/VectorDataIntegrationTests/VectorDataIntegrationTests/VectorDataIntegrationTests.csproj
Outdated
Show resolved
Hide resolved
dmytrostruk
left a comment
There was a problem hiding this comment.
Small improvements, but in general looks awesome!
...ors.AzureCosmosDBNoSQL.UnitTests/AzureCosmosDBNoSQLVectorStoreCollectionQueryBuilderTests.cs
Outdated
Show resolved
Hide resolved
| return (collection as IVectorStoreRecordCollection<TKey, TRecord>)!; | ||
| } | ||
|
|
||
| #if DISABLED_FOR_NOW // TODO: See note on MappingVectorStoreRecordCollection |
There was a problem hiding this comment.
Should it not be #if !DISABLED_FOR_NOW?
There was a problem hiding this comment.
Well this basically is just a glorified comment (I did it with #if instead of /* */ mostly because the latter doesn't nest). If I negate, we'd have to define DISABLED_FOR_NOW in the csproj for this to be commented out...
| // TODO: The user provides an expression tree accepting a TPublicRecord, but we require an expression tree accepting a TInternalRecord. | ||
| // TODO: This is something that the user must provide, and is quite advanced. | ||
|
|
||
| #if DISABLED |
There was a problem hiding this comment.
Should it be #if !DISABLED
I could not understand the implications, where/when is |
|
I noticed a comment and discussion about this in the PR: if I understand correctly LINQ filters are expressed using internal storage field names, could you confirm? If needed, is there a possibility for mapping external names to internal names? For example, let's assume I build a solution that works with TAGS, and they are stored in different ways, e.g.
In this scenario would I be able to transform the incoming LINQ filter on "tags" to another LINQ filter targeting my storage of choice? |
| /// <param name="vectorValue">The vector to match.</param> | ||
| /// <param name="filter">The filter conditions for the query.</param> | ||
| /// <param name="legacyFilter">The filter conditions for the query.</param> | ||
| /// <param name="newFilter">The filter conditions for the query.</param> |
There was a problem hiding this comment.
Renaming parameters is a notable breaking change. Would it make sense to tackle all the breaking changes at once, including removing the old filter and updating the type, to streamline the transition?
There was a problem hiding this comment.
This is something we discussed quite extensively recently... The decision was made to obsolete but preserve the current search mechanism, to prevent breakage for current users. We'd remove the legacy search API before releasing, and rename NewFilter to Filter at that point (this creates more churn for users keeping up to date and using the new API).
| #pragma warning disable CS0618 // VectorSearchFilter is obsolete | ||
| var (where, parameters) = (oldFilter: legacyFilter, newFilter) switch | ||
| { | ||
| (not null, not null) => throw new ArgumentException("Either Filter or NewFilter can be specified, but not both"), |
There was a problem hiding this comment.
nit: use nameof(Filter) and nameof(NewFilter) in case params are renamed
There was a problem hiding this comment.
Good suggestion - I'll pass in this specific case only because Filter will be removed soon (before release) anyway and this whole piece of code will go away.
|
|
||
| private readonly StringBuilder _sql = new(); | ||
|
|
||
| internal (string Clause, List<object> Parameters) Translate( |
There was a problem hiding this comment.
do you think there's room for SQL injection, considering that the resulting string is merged into a SQL statement?
There was a problem hiding this comment.
Absolutely (and that's very important). There's a test exercising an input with a single quote to check exactly for this. I have a note to do a general comprehensive pass to make sure everything is airtight - stuff like property names isn't currently safe (those generally aren't user input, but I still need to make them airtight).
|
Could you confirm that NOT works also with Contains? example
thanks! |
| await this._container.StartAsync(); | ||
| var redis = await ConnectionMultiplexer.ConnectAsync($"{this._container.Hostname}:{this._container.GetMappedPublicPort(6379)},connectTimeout=60000,connectRetry=5"); | ||
| this._database = redis.GetDatabase(); | ||
| this._defaultVectorStore = new(this._database); |
There was a problem hiding this comment.
We should make sure to test both instances of redis: Json and Hashset. It's possible to choose the flavour by setting the StorageType option on RedisVectorStoreOptions. They do differ in some non-obvious ways, especially with regards to how you specify property names.
There was a problem hiding this comment.
Totally missed that, thanks... I pushed tests for both. Depending on the exact differences, it might make sense to just treat these as two databases test-wise and have two projects (but for now fixtures and tests are duplicated in the same projects).
Query-wise there are some unsupported scenarios in HashSet compared to JSON (array fields, null values), but otherwise the tests pass.
VectorStoreGenericDataModel is basically a way to use this abstraction without a strongly-typed .NET type (e.g. for dynamic scenarios). This is already supported in MEVD, but this PR doesn't add support for it in the new filtering mechanism. This is absolutely something I intend to do.
No - the LINQ filter expression references properties on the user's .NET type, and so references regular .NET properties. For example, in
For all of these scenarios, your .NET type would simply have a regular
Yes - it does in general. Specific database support varies sometimes (e.g. MongoDB has some general issues with NOT in vector search pre-filters), but it works fine on most databases. |
|
BTW @dluc we're definitely interested in feedback about any other specific filtering features you might be interested, don't hesitate to let us know. |
|
@roji this is a scenario I'm about to test. I'm curious to see this in action since I don't have classes modeling data in storage, every definition is set at runtime. I guess internal field names and external field names will match in every scenario, thus it should just work™️
|
@dluc if I understand correctly, does that mean you using VectorStoreGenericDataModel? If so, note that the LINQ-based filteirng hasn't yet been implemented for that - #10468 tracks that. Full support for dynamic scenarios is absolutely high our list, but some fixes probably need to happen first (see #10802). So long story short, for dynamic scenarios you'll likely need to wait a bit longer until everything is fully in place. |


LINQ-based filtering
This implements LINQ-based filtering for all current implementations of IVectorStore (with the exception of Pinecone, blocking on #10415). This implements support for:
r => r.Tags.Contains("foo"))r => new[] { "foo", "bar" }.Contains(r.String))Notably missing at the moment is support for VectorStoreGenericDataModel in LINQ querying; I've opened #10468 and will do this a bit later, with additional planned work on the generic data model.
API naming
Vector databases consistency refer to this feature as filter/filtering; in some cases (e.g. Chroma),
whereis used, following SQL nomenclature:whereis used in the API, but filtering is also used in the docs)It seems the choice of exposing this via a
VectorSearchOptions.Filteris indeed the correct one. Since we want to maintain backwards compatibility for existing users and preserve the current legacy filtering mechanism, this PR must introduce a separate, temporaryNewFilterfor the new LINQ-based filteirng. When releasing the stable version, we'll need to remove the oldFilterproperty renameNewFiltertoFilter(breaking users which switched toNewFilter).Custom mapping
The samples currently contain a MappingVectorStoreRecordCollection, which is a decorator transforming a public, user-facing record type to another internal one - the user provides two delegates to handle the conversion from the public record type to the internal, and vice versa. Such conversions become considerably more complex, since the user provides an expression tree referencing public record properties.
To make this work, the user would need to provide some logic to translate property accesses on the public record type to the internal one. For flat records, this could be a simple API where the user simply implements some interface translating public to internal; but for hierarchical records this gets a bit more complicated. I'd suggest deferring such advanced, arbitrary mapping scenarios to post-release. In the meantime, I've commented out MappingVectorStoreRecordCollection (and also Step4_NonStringKey_VectorStore) from the samples.
Note that the problem here doesn't come from LINQ per se, but from the fact that the user specifies model property names (the legacy search API had them specify storage property names instead, side-stepping the problem but causing an inconsistent API where sometimes model properties are used, and sometimes storage properties).
Integration test suite
In addition, this implements new integration test projects for all IVectorStore connectors, implementing most of #10194 (comment):
/cc @westey-m @dmytrostruk
Closes #10156
Does most of #10194