.Net MEVD: Escaping and quoting#11676
.Net MEVD: Escaping and quoting#11676adamsitnik wants to merge 5 commits intomicrosoft:feature-vector-data-preb2from
Conversation
|
|
||
| protected virtual void GenerateColumn(string column, bool isSearchCondition = false) | ||
| => this._sql.Append('"').Append(column.Replace("\"", "\"\"")).Append('"'); | ||
| protected virtual void GenerateColumn(VectorStoreRecordPropertyModel column, bool isSearchCondition = false) |
There was a problem hiding this comment.
I've changed the argument from string to VectorStoreRecordPropertyModel just to make sure all the derived types deal with StorageName in explicit way (they know this value is always escaped, while some random string may or may not be escaped).
|
|
||
| protected abstract void TranslateCapturedVariable(string name, object? capturedValue); | ||
| // Don't use the variable name to avoid issues with escaping and quoting. | ||
| // The derived types are supposed to create the parameter name from a prefix and _parameterIndex. |
There was a problem hiding this comment.
I decided that instead of passing the captured variable name and creating a connector-specific parameter name from it, it will be easier to just always create the names based on parameter index. It's not as pretty as having the names, but it's very simple and foremost safe.
- paramName = $"@{variableName}{ _parameterIndex++}"
+ paramName = $"@{_parameterIndex++}"There was a problem hiding this comment.
This is indeed simpler, but the generated SQL would be significantly less readable... The SQL shows up in logs/telemetry/db perf tools, and we'd ideally do our best to keep it as readable as possible.
BTW it shouldn't be extremely hard to do this in a way that's widely compatible - the logic we use for this in EF is simply taking the variable name, and then adding a counter only if needed. Here's what I'm proposing in #11750 (link):
var origName = memberInfo.Name;
var name = origName;
for (var i = 0; this._parameterNames!.Contains(name); i++)
{
name = $"{origName}_{i}";
}
this._parameterNames.Add(name);There was a problem hiding this comment.
simply taking the variable name, and then adding a counter only if needed
But a C# (not to mention F# or IL that can contain almost any character) variable name can contain a character that is not valid in SQL parameter name. Example: any character like ą (polish) or ä (german) is not valid for Sqlite.
I know it's unlikely and I may be a little paranoid (side effect of working on BinaryFormatter), but I wanted to it to be always safe and correct.
There was a problem hiding this comment.
However, I could so the same thing I did for SQL Server: stop on first non-ascii character
There was a problem hiding this comment.
These are good points indeed... I can only say that in all the time handling EF issues, I don't recall ever seeing a problem here - probably nobody has tried to use weird characters in C# variable names etc.
But your proposal to stop on the first non-ascii character sounds like a good compromise - in all normal cases it will do nothing at all (since all characters will be ascii), but in case somebody does do it, it will kick in.
So up to you... I think it's OK for us to GA without this very high level of safety - but OK to add it!
There was a problem hiding this comment.
I've tried to apply this suggestion, but it turned out that for both Postgres and SqlServer some other parts of the code rely on this simple naming @number:
I decided to keep it as is.
There was a problem hiding this comment.
You don't think we should try to preserve the name where we can? Debugging/tracking down SQL performance issues can be quite hard if you can't read the SQL easily...
There was a problem hiding this comment.
I am not saying we should not at all. It's just right now I don't have the time to unify all the 3 connectors in that regard.
There was a problem hiding this comment.
OK. If we need to deliver an imperfect thing (which is totally fine), I'd personally go the other way, i.e. produce more readable SQL for the 99% case while failing for the tiny minority case which have captured variable names with weird characters... But I'm OK to go like this as well.
|
|
||
| for (var recordIndex = 0; recordIndex < records.Count; recordIndex++) | ||
| { | ||
| var rowIdentifierParameterName = GetParameterName(rowIdentifier, recordIndex); |
There was a problem hiding this comment.
it was not being used
| Properties = | ||
| [ | ||
| // The Id and Floats properties are stored in the vec_ table and this extension currently does not support | ||
| // quoting column names, so don't customize their storage property names. |
There was a problem hiding this comment.
The vec extension simply does not support quoting. Example:
CREATE VIRTUAL TABLE "vec_6a64a46d-6f95-4c8b-a8ad-4c4b009f0a68" USING vec0(
"id" TEXT PRIMARY KEY,
"embedding" FLOAT[3] distance_metric=cosine
);SqliteException: SQLite Error 1: 'vec0 constructor error: Could not parse '"id" TEXT PRIMARY KEY''.
roji
left a comment
There was a problem hiding this comment.
Apologies again @adamsitnik for not reviewing this earlier, and for causing conflicts with my own PRs. I'm happy to take over this to resolve any conflicts etc. - it's my fault so I don't want you to lose time with merge conflicts etc.
But please take a look at the comments below first and let's make sure we agree on stuff first...
BTW do we also need to go over other non-SQL connectors to make sure they're safe etc.?
|
|
||
| protected virtual void GenerateColumn(string column, bool isSearchCondition = false) | ||
| => this._sql.Append('"').Append(column.Replace("\"", "\"\"")).Append('"'); | ||
| protected virtual void GenerateColumn(VectorStoreRecordPropertyModel column, bool isSearchCondition = false) |
| public static object? GetPropertyValue(NpgsqlDataReader reader, string propertyName, Type propertyType) | ||
| { | ||
| int propertyIndex = reader.GetOrdinal(propertyName); | ||
| int propertyIndex = reader.GetOrdinal(PostgresConstants.Unescape(propertyName)); |
There was a problem hiding this comment.
Are you sure unescaping is actually needed here (and in fact, is correct)? Escaping should only be needed when the column name is integrated in SQL (i.e. in a larger string containing a query), but this is an API specifically accepting the column name (not a SQL string). I don't think this is right (in fact, if the coiumn actually contains a double-column, this would cause a failure).
Regardless, for future improvement, a better approach here would simply to read back columns from NpgsqlDataReader based on their position, instead of based on their name - we're the ones generating the SQL that selects the columns, so we should know the order in which they were sent and can read it back... This is a bit more efficient and cleaner, etc.
There was a problem hiding this comment.
It is needed, when I revert the change, the tests start to fail:
Microsoft.Extensions.VectorData.VectorStoreOperationException : Call to vector store failed.
---- System.IndexOutOfRangeException : Field not found in row: embed""ding
In general, I've extended the SimpleModelFixture storage names with all kind of weird characters. Because of that, we have coverage for collection, single record and batch CRUD. Initially I've quoted only the obvious things, but after studying the failing tests I was able to find all the weird and unexpected places.
a better approach here would simply to read back columns from NpgsqlDataReader based on their position
I agree, but it would require more non-trivial changes (build a dictionary name-position for every query), as the order may vary (we don't always include vector fields etc). If that is not a problem, I would prefer not to do it in this PR.
There was a problem hiding this comment.
It is needed, when I revert the change, the tests start to fail:
I see, that's because you've changed model building so that PropertyModel.StorageName is escaped - so now it has to be un-escaped (code):
property.StorageName = this.Options.EscapeIdentifier is not null
? this.Options.EscapeIdentifier(storageName)
: storageName;I'm not sure I'm a fan of the approach... I think the model storage name itself shouldn't be escaped, and then we apply escaping to it when needed, and don't apply when not needed (like here, when interacting with DbDataReader). I guess you're trying to defensively ensure that identifiers are escaped by default (in case we forget to do it somewhere); but remember that this isn't a security problem - both property names and the names of captured variables are part of the application code, and so are trusted (any security concerns are around the contents of captured variables, since that's what typically comes from untrusted user inputs).
So if we forget to escape an identifier somewhere, that's just a bug we should fix, and since we have some contexts where we want unescaped names (DbDataReader.GetOrdinal(), possibly logging....), I think the model should probably just contain the raw, unescaped name.
In any case, I agree that reading back columns based on their position is something that belong out of this PR and can be done later.
There was a problem hiding this comment.
One more thought on this...
IIUC the property's StorageName is escaped, but it isn't quoted... That means that wherever the name needs to be integrated in SQL, we still need to remember to quote it - which I think removes the "safety-by-default" that having the escaped name in the model was supposed to provide in the first place...
To go all the way with that approach, we'd have the model storage name be both escaped and quoted; we'd then need to unquote and unescape for certain contexts (DbDataReader.GetOrdinal(), combining table/column names into an index name).
Or we go the other way, and just keep the raw, unquoted and unescaped name in the model, and simply call DelimitIdentifier wherever needed (again, we currently need to do this anyway for quoting). Building the index name would just be Delimit($"{tableName}_{columnName}") or whatever.
|
|
||
| protected abstract void TranslateCapturedVariable(string name, object? capturedValue); | ||
| // Don't use the variable name to avoid issues with escaping and quoting. | ||
| // The derived types are supposed to create the parameter name from a prefix and _parameterIndex. |
There was a problem hiding this comment.
This is indeed simpler, but the generated SQL would be significantly less readable... The SQL shows up in logs/telemetry/db perf tools, and we'd ideally do our best to keep it as readable as possible.
BTW it shouldn't be extremely hard to do this in a way that's widely compatible - the logic we use for this in EF is simply taking the variable name, and then adding a counter only if needed. Here's what I'm proposing in #11750 (link):
var origName = memberInfo.Name;
var name = origName;
for (var i = 0; this._parameterNames!.Contains(name); i++)
{
name = $"{origName}_{i}";
}
this._parameterNames.Add(name);| private static string GetParameterName(string propertyName, int index) | ||
| { | ||
| return $"@{propertyName}{index}"; | ||
| StringBuilder builder = new(); |
There was a problem hiding this comment.
Was there a bug here that this fixes? Or is this about slightly better perf?
There was a problem hiding this comment.
The input value (propertyName) can be column name with a character that is not valid for parameter names:
The tests I've provided cover it (when I revert this change, they fail)
There was a problem hiding this comment.
So this is the same problem discussed above for parameter names inside the filter translator (i.e. TranslateQueryParameter).
For one thing, whatever logic we choose, we should probably not have two different implementations for handling parameter names (once in the filter translator, one here).
BTW if I'm very pedantic, just appending an index at the end isn't a 100% solution, since two parameters could still conflict (imagine having one parameter x that becomes x12 (because it's the 12th parameter), and x1 that also becomes x12 (because it's the second parameter). To fully ensure uniquenes, we need to simply record the parameters that have already been added in some set, and check if the new parameter is already in the set. This would also allow us only append the numbers when needed, rather than always appending them as the code currently does.
But I don't want us to get too bogged down in this. I'm also OK with merging for now and then revisiting later.
| // Escape both table names before exposing them to anything that may build SQL commands. | ||
| this._dataTableName = name.EscapeIdentifier(); | ||
| this._vectorTableName = GetVectorTableName(name, this._options).EscapeIdentifier(); | ||
| this._dataTableName = SqliteConstants.Escape(name); |
There was a problem hiding this comment.
nit: we can use using static to not have to do SqliteConstants every time, e.g.:
| this._dataTableName = SqliteConstants.Escape(name); | |
| this._dataTableName = EscapeIdentifier(name); |
BTW it's slightly odd for me that we're caching the escaped names (this code isn't that high-perf). But if we do want to do that, may as well also add the quotes around it?
In general, rather than treating escaping and quoting as two different concerns, we can have just a single API that accepts a table or column name and outputs the escaped, quoted string representation (in EF we call this DelimitIdentifier).
There was a problem hiding this comment.
BTW it's slightly odd for me that we're caching the escaped names (this code isn't that high-perf).
The goal I had here was to ensure nobody is using unescaped values anywhere (doing it once was just perf side effect).
In general, rather than treating escaping and quoting as two different concerns, we can have just a single API that accepts a table or column name and outputs the escaped
This is a great idea, however I am afraid we need both quoted and not:
- for creating index names, we use table name and column name, if they were only quoted it would not work:
- the SQLite vector extension does not support quoting for column names ;(
There was a problem hiding this comment.
for creating index names, we use table name and column name, if they were only quoted it would not work:
For this, we can interpolate the completely raw (unquoted and unescaped) table and column names, and then run Delimit on the result. This way, we always start from raw identifiers and perform both escaping and quoting at the very end, after concatenation etc.
the SQLite vector extension does not support quoting for column names
Ah, that is very shitty indeed... But it's a very special exception (that will hopefully be fixed), so I wouldn't design our code around this one flaw (we can specifically do escaping without quoting at that one place for now).
There was a problem hiding this comment.
The goal I had here was to ensure nobody is using unescaped values anywhere (doing it once was just perf side effect).
OK I think I missed this before.
See https://github.com/microsoft/semantic-kernel/pull/11676/files#r2064348530 for some thoughts on this.
…o escaping # Conflicts: # dotnet/src/Connectors/Connectors.Memory.Common/SqlFilterTranslator.cs # dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresConstants.cs # dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresFilterTranslator.cs # dotnet/src/Connectors/Connectors.Memory.SqlServer/SqlServerConstants.cs # dotnet/src/Connectors/Connectors.Memory.SqlServer/SqlServerFilterTranslator.cs # dotnet/src/Connectors/Connectors.Memory.Sqlite/SqliteConstants.cs # dotnet/src/Connectors/Connectors.Memory.Sqlite/SqliteFilterTranslator.cs
…o escaping # Conflicts: # dotnet/src/IntegrationTests/Connectors/Memory/Weaviate/WeaviateVectorStoreRecordCollectionTests.cs
| PayloadSchemaType schemaType = PayloadSchemaType.Keyword, | ||
| CancellationToken cancellationToken = default) | ||
| => this._qdrantClient.CreatePayloadIndexAsync(collectionName, fieldName, schemaType, cancellationToken: cancellationToken); | ||
| => this._qdrantClient.CreatePayloadIndexAsync(collectionName, $"\"{fieldName}\"", schemaType, cancellationToken: cancellationToken); |
There was a problem hiding this comment.
BTW do we also need to go over other non-SQL connectors to make sure they're safe etc.?
@roji I went through the non-SQL connectors (except of cosmos) and this was the only line of code I needed to change to get the tests passing. We are usually using strongly typed .NET APIs for these and/or JSON serializer that does all the escaping for us.
dotnet/src/Connectors/Connectors.Memory.SqlServer/SqlServerConstants.cs
Outdated
Show resolved
Hide resolved
|
@roji I've solved the merge conflicts, PTAL |
roji
left a comment
There was a problem hiding this comment.
Hey @adamsitnik. I reviewed the updated PR; I think we think a bit differently about how to go about this - let's try to sync on this offline. I'm also OK with merging this and then revisiting later, to go faster.
| public static object? GetPropertyValue(NpgsqlDataReader reader, string propertyName, Type propertyType) | ||
| { | ||
| int propertyIndex = reader.GetOrdinal(propertyName); | ||
| int propertyIndex = reader.GetOrdinal(PostgresConstants.Unescape(propertyName)); |
There was a problem hiding this comment.
It is needed, when I revert the change, the tests start to fail:
I see, that's because you've changed model building so that PropertyModel.StorageName is escaped - so now it has to be un-escaped (code):
property.StorageName = this.Options.EscapeIdentifier is not null
? this.Options.EscapeIdentifier(storageName)
: storageName;I'm not sure I'm a fan of the approach... I think the model storage name itself shouldn't be escaped, and then we apply escaping to it when needed, and don't apply when not needed (like here, when interacting with DbDataReader). I guess you're trying to defensively ensure that identifiers are escaped by default (in case we forget to do it somewhere); but remember that this isn't a security problem - both property names and the names of captured variables are part of the application code, and so are trusted (any security concerns are around the contents of captured variables, since that's what typically comes from untrusted user inputs).
So if we forget to escape an identifier somewhere, that's just a bug we should fix, and since we have some contexts where we want unescaped names (DbDataReader.GetOrdinal(), possibly logging....), I think the model should probably just contain the raw, unescaped name.
In any case, I agree that reading back columns based on their position is something that belong out of this PR and can be done later.
|
|
||
| this._parameters.Add(name, value); | ||
| this._sql.Append('@').Append(name); | ||
| string paramName = $"@{this._parameterIndex++}"; |
There was a problem hiding this comment.
I'm still not sure why we're always removing the name... It's really useful for reading the SQL, 99% of the cases aren't going to be problematic, and the 1% (where there are weird chars) can be handled separately, no?
| private static string GetParameterName(string propertyName, int index) | ||
| { | ||
| return $"@{propertyName}{index}"; | ||
| StringBuilder builder = new(); |
There was a problem hiding this comment.
So this is the same problem discussed above for parameter names inside the filter translator (i.e. TranslateQueryParameter).
For one thing, whatever logic we choose, we should probably not have two different implementations for handling parameter names (once in the filter translator, one here).
BTW if I'm very pedantic, just appending an index at the end isn't a 100% solution, since two parameters could still conflict (imagine having one parameter x that becomes x12 (because it's the 12th parameter), and x1 that also becomes x12 (because it's the second parameter). To fully ensure uniquenes, we need to simply record the parameters that have already been added in some set, and check if the new parameter is already in the set. This would also allow us only append the numbers when needed, rather than always appending them as the code currently does.
But I don't want us to get too bogged down in this. I'm also OK with merging for now and then revisiting later.
| // Escape both table names before exposing them to anything that may build SQL commands. | ||
| this._dataTableName = name.EscapeIdentifier(); | ||
| this._vectorTableName = GetVectorTableName(name, this._options).EscapeIdentifier(); | ||
| this._dataTableName = SqliteConstants.Escape(name); |
There was a problem hiding this comment.
The goal I had here was to ensure nobody is using unescaped values anywhere (doing it once was just perf side effect).
OK I think I missed this before.
See https://github.com/microsoft/semantic-kernel/pull/11676/files#r2064348530 for some thoughts on this.
fixes #11154