sql: fix substring(byte[]) to treat input as raw bytes without escaping#58265
sql: fix substring(byte[]) to treat input as raw bytes without escaping#58265craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
Release note (bug fix): The substring function on byte arrays would treat its input as unicode code points, which would cause the wrong bytes to be returned. Now it only operates on the raw bytes.
Release note (bug fix): The substring(byte[]) functions were not able to interpret bytes that had the `\` character since it was treating it as the beginning of an escape sequence. This is now fixed.
knz
left a comment
There was a problem hiding this comment.
Second commit LGTM. First commit unsure - I take it that you checked that pg does the same - but I'll let another reviewer approve.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)
|
Yeah, PG does this here: text_substring versus bytea_substring |
solongordon
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @rafiss, and @solongordon)
pkg/sql/sem/builtins/builtins.go, line 4930 at r1 (raw file):
end := start + length // Check for integer overflow.
These are kind of funky cases. If it's easy enough, please add tests and verify we match Postgres behavior.
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @solongordon)
pkg/sql/sem/builtins/builtins.go, line 4930 at r1 (raw file):
Sure. I have been adding these tests, but while comparing to Postgres, I think I discovered a bug/mistake in PG.
The PG source code https://github.com/postgres/postgres/blob/472e518a44eacd9caac7d618f1b6451672ca4481/src/backend/utils/adt/varlena.c#L884-L891
int E = S + length;
/*
* A negative value for L is the only way for the end position to
* be before the start. SQL99 says to throw an error.
*/
if (E < S)
ereport(ERROR,
(errcode(ERRCODE_SUBSTRING_ERROR),
errmsg("negative substring length not allowed")));
But that comment is wrong -- the end position could be before the start if L is not negative and the addition overflows.
I'm gonna go ahead and intentionally diverge from PG in that case to avoid this confusing behavior.
> select substr('string', 2147483646, 2147483646);
2021-01-04 12:23:00.298 EST [85734] ERROR: negative substring length not allowed
I submitted a bug report to PG as well.
|
Nice catch. Do you think we should consider just erroring out in the overflow case? But with a more accurate error than Postgres? |
|
The PG devs seem to be pretty responsive on the bug tracker, so I'll see how they want to handle it and decide if we should do the same. I would lean towards keeping our current behavior of just going to the end of the string. e.g. these should be the same. |
|
It's being discussed in the PG list here: https://www.postgresql.org/message-id/16804-f4eeeb6c11ba71d4%40postgresql.org |
|
PG devs landed on doing the same behavior we already implement, so merging this https://www.postgresql.org/message-id/3219376.1609795681%40sss.pgh.pa.us bors r=solongordon |
|
Build succeeded: |
fixes #57367
Release note (bug fix): The substring function on byte arrays would
treat its input as unicode code points, which would cause the wrong
bytes to be returned. Now it only operates on the raw bytes.
Release note (bug fix): The substring(byte[]) functions were not able to
interpret bytes that had the
\character since it was treating it asthe beginning of an escape sequence. This is now fixed.