Skip to content

.Net MEVD: Escaping and quoting#11676

Closed
adamsitnik wants to merge 5 commits intomicrosoft:feature-vector-data-preb2from
adamsitnik:escaping
Closed

.Net MEVD: Escaping and quoting#11676
adamsitnik wants to merge 5 commits intomicrosoft:feature-vector-data-preb2from
adamsitnik:escaping

Conversation

@adamsitnik
Copy link
Member

fixes #11154

@adamsitnik adamsitnik requested a review from a team as a code owner April 22, 2025 13:51
@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel memory labels Apr 22, 2025

protected virtual void GenerateColumn(string column, bool isSearchCondition = false)
=> this._sql.Append('"').Append(column.Replace("\"", "\"\"")).Append('"');
protected virtual void GenerateColumn(VectorStoreRecordPropertyModel column, bool isSearchCondition = false)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the argument from string to VectorStoreRecordPropertyModel just to make sure all the derived types deal with StorageName in explicit way (they know this value is always escaped, while some random string may or may not be escaped).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.


protected abstract void TranslateCapturedVariable(string name, object? capturedValue);
// Don't use the variable name to avoid issues with escaping and quoting.
// The derived types are supposed to create the parameter name from a prefix and _parameterIndex.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided that instead of passing the captured variable name and creating a connector-specific parameter name from it, it will be easier to just always create the names based on parameter index. It's not as pretty as having the names, but it's very simple and foremost safe.

- paramName = $"@{variableName}{ _parameterIndex++}"
+ paramName = $"@{_parameterIndex++}"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed simpler, but the generated SQL would be significantly less readable... The SQL shows up in logs/telemetry/db perf tools, and we'd ideally do our best to keep it as readable as possible.

BTW it shouldn't be extremely hard to do this in a way that's widely compatible - the logic we use for this in EF is simply taking the variable name, and then adding a counter only if needed. Here's what I'm proposing in #11750 (link):

var origName = memberInfo.Name;
var name = origName;
for (var i = 0; this._parameterNames!.Contains(name); i++)
{
    name = $"{origName}_{i}";
}
this._parameterNames.Add(name);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simply taking the variable name, and then adding a counter only if needed

But a C# (not to mention F# or IL that can contain almost any character) variable name can contain a character that is not valid in SQL parameter name. Example: any character like ą (polish) or ä (german) is not valid for Sqlite.

I know it's unlikely and I may be a little paranoid (side effect of working on BinaryFormatter), but I wanted to it to be always safe and correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I could so the same thing I did for SQL Server: stop on first non-ascii character

// In SQL Server, parameter names cannot be just a number like "@1".
// Parameter names must start with an alphabetic character or an underscore
// and can be followed by alphanumeric characters or underscores.
// Since we can't guarantee that the value returned by StoragePropertyName and DataModelPropertyName
// is valid parameter name (it can contain whitespaces, or start with a number),
// we just append the ASCII letters, stop on the first non-ASCII letter
// and append the index.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are good points indeed... I can only say that in all the time handling EF issues, I don't recall ever seeing a problem here - probably nobody has tried to use weird characters in C# variable names etc.

But your proposal to stop on the first non-ascii character sounds like a good compromise - in all normal cases it will do nothing at all (since all characters will be ascii), but in case somebody does do it, it will kick in.

So up to you... I think it's OK for us to GA without this very high level of safety - but OK to add it!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to apply this suggestion, but it turned out that for both Postgres and SqlServer some other parts of the code rely on this simple naming @number:

this._parameterValues.Add(value);
this._sql.Append('$').Append(this._parameterIndex++);

// SQL Server parameters can't start with a digit (but underscore is OK).
this._sql.Append("@_").Append(this._parameterIndex++);
}

I decided to keep it as is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't think we should try to preserve the name where we can? Debugging/tracking down SQL performance issues can be quite hard if you can't read the SQL easily...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not saying we should not at all. It's just right now I don't have the time to unify all the 3 connectors in that regard.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. If we need to deliver an imperfect thing (which is totally fine), I'd personally go the other way, i.e. produce more readable SQL for the 99% case while failing for the tiny minority case which have captured variable names with weird characters... But I'm OK to go like this as well.


for (var recordIndex = 0; recordIndex < records.Count; recordIndex++)
{
var rowIdentifierParameterName = GetParameterName(rowIdentifier, recordIndex);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was not being used

Properties =
[
// The Id and Floats properties are stored in the vec_ table and this extension currently does not support
// quoting column names, so don't customize their storage property names.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vec extension simply does not support quoting. Example:

CREATE VIRTUAL TABLE "vec_6a64a46d-6f95-4c8b-a8ad-4c4b009f0a68" USING vec0(
"id" TEXT PRIMARY KEY,
"embedding" FLOAT[3] distance_metric=cosine
);
SqliteException: SQLite Error 1: 'vec0 constructor error: Could not parse '"id" TEXT PRIMARY KEY''.

Copy link
Member

@roji roji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies again @adamsitnik for not reviewing this earlier, and for causing conflicts with my own PRs. I'm happy to take over this to resolve any conflicts etc. - it's my fault so I don't want you to lose time with merge conflicts etc.

But please take a look at the comments below first and let's make sure we agree on stuff first...

BTW do we also need to go over other non-SQL connectors to make sure they're safe etc.?


protected virtual void GenerateColumn(string column, bool isSearchCondition = false)
=> this._sql.Append('"').Append(column.Replace("\"", "\"\"")).Append('"');
protected virtual void GenerateColumn(VectorStoreRecordPropertyModel column, bool isSearchCondition = false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

public static object? GetPropertyValue(NpgsqlDataReader reader, string propertyName, Type propertyType)
{
int propertyIndex = reader.GetOrdinal(propertyName);
int propertyIndex = reader.GetOrdinal(PostgresConstants.Unescape(propertyName));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure unescaping is actually needed here (and in fact, is correct)? Escaping should only be needed when the column name is integrated in SQL (i.e. in a larger string containing a query), but this is an API specifically accepting the column name (not a SQL string). I don't think this is right (in fact, if the coiumn actually contains a double-column, this would cause a failure).

Regardless, for future improvement, a better approach here would simply to read back columns from NpgsqlDataReader based on their position, instead of based on their name - we're the ones generating the SQL that selects the columns, so we should know the order in which they were sent and can read it back... This is a bit more efficient and cleaner, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is needed, when I revert the change, the tests start to fail:

    Microsoft.Extensions.VectorData.VectorStoreOperationException : Call to vector store failed.
    ---- System.IndexOutOfRangeException : Field not found in row: embed""ding

In general, I've extended the SimpleModelFixture storage names with all kind of weird characters. Because of that, we have coverage for collection, single record and batch CRUD. Initially I've quoted only the obvious things, but after studying the failing tests I was able to find all the weird and unexpected places.

a better approach here would simply to read back columns from NpgsqlDataReader based on their position

I agree, but it would require more non-trivial changes (build a dictionary name-position for every query), as the order may vary (we don't always include vector fields etc). If that is not a problem, I would prefer not to do it in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is needed, when I revert the change, the tests start to fail:

I see, that's because you've changed model building so that PropertyModel.StorageName is escaped - so now it has to be un-escaped (code):

property.StorageName = this.Options.EscapeIdentifier is not null
    ? this.Options.EscapeIdentifier(storageName)
    : storageName;

I'm not sure I'm a fan of the approach... I think the model storage name itself shouldn't be escaped, and then we apply escaping to it when needed, and don't apply when not needed (like here, when interacting with DbDataReader). I guess you're trying to defensively ensure that identifiers are escaped by default (in case we forget to do it somewhere); but remember that this isn't a security problem - both property names and the names of captured variables are part of the application code, and so are trusted (any security concerns are around the contents of captured variables, since that's what typically comes from untrusted user inputs).

So if we forget to escape an identifier somewhere, that's just a bug we should fix, and since we have some contexts where we want unescaped names (DbDataReader.GetOrdinal(), possibly logging....), I think the model should probably just contain the raw, unescaped name.

In any case, I agree that reading back columns based on their position is something that belong out of this PR and can be done later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thought on this...

IIUC the property's StorageName is escaped, but it isn't quoted... That means that wherever the name needs to be integrated in SQL, we still need to remember to quote it - which I think removes the "safety-by-default" that having the escaped name in the model was supposed to provide in the first place...

To go all the way with that approach, we'd have the model storage name be both escaped and quoted; we'd then need to unquote and unescape for certain contexts (DbDataReader.GetOrdinal(), combining table/column names into an index name).

Or we go the other way, and just keep the raw, unquoted and unescaped name in the model, and simply call DelimitIdentifier wherever needed (again, we currently need to do this anyway for quoting). Building the index name would just be Delimit($"{tableName}_{columnName}") or whatever.


protected abstract void TranslateCapturedVariable(string name, object? capturedValue);
// Don't use the variable name to avoid issues with escaping and quoting.
// The derived types are supposed to create the parameter name from a prefix and _parameterIndex.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed simpler, but the generated SQL would be significantly less readable... The SQL shows up in logs/telemetry/db perf tools, and we'd ideally do our best to keep it as readable as possible.

BTW it shouldn't be extremely hard to do this in a way that's widely compatible - the logic we use for this in EF is simply taking the variable name, and then adding a counter only if needed. Here's what I'm proposing in #11750 (link):

var origName = memberInfo.Name;
var name = origName;
for (var i = 0; this._parameterNames!.Contains(name); i++)
{
    name = $"{origName}_{i}";
}
this._parameterNames.Add(name);

private static string GetParameterName(string propertyName, int index)
{
return $"@{propertyName}{index}";
StringBuilder builder = new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a bug here that this fixes? Or is this about slightly better perf?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The input value (propertyName) can be column name with a character that is not valid for parameter names:

string propertyName = property.StorageName;
if (include && record.TryGetValue(propertyName, out var value))
{
columns.Add($"\"{propertyName}\"");
parameterNames.Add(GetParameterName(propertyName, index));

The tests I've provided cover it (when I revert this change, they fail)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the same problem discussed above for parameter names inside the filter translator (i.e. TranslateQueryParameter).

For one thing, whatever logic we choose, we should probably not have two different implementations for handling parameter names (once in the filter translator, one here).

BTW if I'm very pedantic, just appending an index at the end isn't a 100% solution, since two parameters could still conflict (imagine having one parameter x that becomes x12 (because it's the 12th parameter), and x1 that also becomes x12 (because it's the second parameter). To fully ensure uniquenes, we need to simply record the parameters that have already been added in some set, and check if the new parameter is already in the set. This would also allow us only append the numbers when needed, rather than always appending them as the code currently does.

But I don't want us to get too bogged down in this. I'm also OK with merging for now and then revisiting later.

// Escape both table names before exposing them to anything that may build SQL commands.
this._dataTableName = name.EscapeIdentifier();
this._vectorTableName = GetVectorTableName(name, this._options).EscapeIdentifier();
this._dataTableName = SqliteConstants.Escape(name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can use using static to not have to do SqliteConstants every time, e.g.:

Suggested change
this._dataTableName = SqliteConstants.Escape(name);
this._dataTableName = EscapeIdentifier(name);

BTW it's slightly odd for me that we're caching the escaped names (this code isn't that high-perf). But if we do want to do that, may as well also add the quotes around it?

In general, rather than treating escaping and quoting as two different concerns, we can have just a single API that accepts a table or column name and outputs the escaped, quoted string representation (in EF we call this DelimitIdentifier).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW it's slightly odd for me that we're caching the escaped names (this code isn't that high-perf).

The goal I had here was to ensure nobody is using unescaped values anywhere (doing it once was just perf side effect).

In general, rather than treating escaping and quoting as two different concerns, we can have just a single API that accepts a table or column name and outputs the escaped

This is a great idea, however I am afraid we need both quoted and not:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for creating index names, we use table name and column name, if they were only quoted it would not work:

For this, we can interpolate the completely raw (unquoted and unescaped) table and column names, and then run Delimit on the result. This way, we always start from raw identifiers and perform both escaping and quoting at the very end, after concatenation etc.

the SQLite vector extension does not support quoting for column names

Ah, that is very shitty indeed... But it's a very special exception (that will hopefully be fixed), so I wouldn't design our code around this one flaw (we can specifically do escaping without quoting at that one place for now).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal I had here was to ensure nobody is using unescaped values anywhere (doing it once was just perf side effect).

OK I think I missed this before.

See https://github.com/microsoft/semantic-kernel/pull/11676/files#r2064348530 for some thoughts on this.

…o escaping

# Conflicts:
#	dotnet/src/Connectors/Connectors.Memory.Common/SqlFilterTranslator.cs
#	dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresConstants.cs
#	dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresFilterTranslator.cs
#	dotnet/src/Connectors/Connectors.Memory.SqlServer/SqlServerConstants.cs
#	dotnet/src/Connectors/Connectors.Memory.SqlServer/SqlServerFilterTranslator.cs
#	dotnet/src/Connectors/Connectors.Memory.Sqlite/SqliteConstants.cs
#	dotnet/src/Connectors/Connectors.Memory.Sqlite/SqliteFilterTranslator.cs
…o escaping

# Conflicts:
#	dotnet/src/IntegrationTests/Connectors/Memory/Weaviate/WeaviateVectorStoreRecordCollectionTests.cs
PayloadSchemaType schemaType = PayloadSchemaType.Keyword,
CancellationToken cancellationToken = default)
=> this._qdrantClient.CreatePayloadIndexAsync(collectionName, fieldName, schemaType, cancellationToken: cancellationToken);
=> this._qdrantClient.CreatePayloadIndexAsync(collectionName, $"\"{fieldName}\"", schemaType, cancellationToken: cancellationToken);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW do we also need to go over other non-SQL connectors to make sure they're safe etc.?

@roji I went through the non-SQL connectors (except of cosmos) and this was the only line of code I needed to change to get the tests passing. We are usually using strongly typed .NET APIs for these and/or JSON serializer that does all the escaping for us.

@adamsitnik adamsitnik requested a review from roji April 28, 2025 20:41
@adamsitnik
Copy link
Member Author

@roji I've solved the merge conflicts, PTAL

@westey-m westey-m deleted the branch microsoft:feature-vector-data-preb2 April 29, 2025 15:43
@westey-m westey-m closed this Apr 29, 2025
@adamsitnik adamsitnik reopened this Apr 30, 2025
@adamsitnik adamsitnik changed the base branch from feature-vector-data-preb2 to feature-vector-data-preb3 April 30, 2025 10:09
@adamsitnik adamsitnik changed the base branch from feature-vector-data-preb3 to feature-vector-data-preb2 April 30, 2025 10:09
Copy link
Member

@roji roji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @adamsitnik. I reviewed the updated PR; I think we think a bit differently about how to go about this - let's try to sync on this offline. I'm also OK with merging this and then revisiting later, to go faster.

public static object? GetPropertyValue(NpgsqlDataReader reader, string propertyName, Type propertyType)
{
int propertyIndex = reader.GetOrdinal(propertyName);
int propertyIndex = reader.GetOrdinal(PostgresConstants.Unescape(propertyName));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is needed, when I revert the change, the tests start to fail:

I see, that's because you've changed model building so that PropertyModel.StorageName is escaped - so now it has to be un-escaped (code):

property.StorageName = this.Options.EscapeIdentifier is not null
    ? this.Options.EscapeIdentifier(storageName)
    : storageName;

I'm not sure I'm a fan of the approach... I think the model storage name itself shouldn't be escaped, and then we apply escaping to it when needed, and don't apply when not needed (like here, when interacting with DbDataReader). I guess you're trying to defensively ensure that identifiers are escaped by default (in case we forget to do it somewhere); but remember that this isn't a security problem - both property names and the names of captured variables are part of the application code, and so are trusted (any security concerns are around the contents of captured variables, since that's what typically comes from untrusted user inputs).

So if we forget to escape an identifier somewhere, that's just a bug we should fix, and since we have some contexts where we want unescaped names (DbDataReader.GetOrdinal(), possibly logging....), I think the model should probably just contain the raw, unescaped name.

In any case, I agree that reading back columns based on their position is something that belong out of this PR and can be done later.


this._parameters.Add(name, value);
this._sql.Append('@').Append(name);
string paramName = $"@{this._parameterIndex++}";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure why we're always removing the name... It's really useful for reading the SQL, 99% of the cases aren't going to be problematic, and the 1% (where there are weird chars) can be handled separately, no?

private static string GetParameterName(string propertyName, int index)
{
return $"@{propertyName}{index}";
StringBuilder builder = new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the same problem discussed above for parameter names inside the filter translator (i.e. TranslateQueryParameter).

For one thing, whatever logic we choose, we should probably not have two different implementations for handling parameter names (once in the filter translator, one here).

BTW if I'm very pedantic, just appending an index at the end isn't a 100% solution, since two parameters could still conflict (imagine having one parameter x that becomes x12 (because it's the 12th parameter), and x1 that also becomes x12 (because it's the second parameter). To fully ensure uniquenes, we need to simply record the parameters that have already been added in some set, and check if the new parameter is already in the set. This would also allow us only append the numbers when needed, rather than always appending them as the code currently does.

But I don't want us to get too bogged down in this. I'm also OK with merging for now and then revisiting later.

// Escape both table names before exposing them to anything that may build SQL commands.
this._dataTableName = name.EscapeIdentifier();
this._vectorTableName = GetVectorTableName(name, this._options).EscapeIdentifier();
this._dataTableName = SqliteConstants.Escape(name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal I had here was to ensure nobody is using unescaped values anywhere (doing it once was just perf side effect).

OK I think I missed this before.

See https://github.com/microsoft/semantic-kernel/pull/11676/files#r2064348530 for some thoughts on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kernel Issues or pull requests impacting the core kernel memory .NET Issue or Pull requests regarding .NET code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants