Skip to content

sql: fix substring(byte[]) to treat input as raw bytes without escaping#58265

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
rafiss:fix-byte-substring
Jan 19, 2021
Merged

sql: fix substring(byte[]) to treat input as raw bytes without escaping#58265
craig[bot] merged 2 commits intocockroachdb:masterfrom
rafiss:fix-byte-substring

Conversation

@rafiss
Copy link
Copy Markdown
Collaborator

@rafiss rafiss commented Dec 24, 2020

fixes #57367

Release note (bug fix): The substring function on byte arrays would
treat its input as unicode code points, which would cause the wrong
bytes to be returned. Now it only operates on the raw bytes.

Release note (bug fix): The substring(byte[]) functions were not able to
interpret bytes that had the \ character since it was treating it as
the beginning of an escape sequence. This is now fixed.

Release note (bug fix): The substring function on byte arrays would
treat its input as unicode code points, which would cause the wrong
bytes to be returned. Now it only operates on the raw bytes.
Release note (bug fix): The substring(byte[]) functions were not able to
interpret bytes that had the `\` character since it was treating it as
the beginning of an escape sequence. This is now fixed.
@rafiss rafiss requested review from a team and jordanlewis December 24, 2020 00:51
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second commit LGTM. First commit unsure - I take it that you checked that pg does the same - but I'll let another reviewer approve.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis)

@rafiss
Copy link
Copy Markdown
Collaborator Author

rafiss commented Dec 28, 2020

Yeah, PG does this here: text_substring versus bytea_substring

Copy link
Copy Markdown
Contributor

@solongordon solongordon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @rafiss, and @solongordon)


pkg/sql/sem/builtins/builtins.go, line 4930 at r1 (raw file):

	end := start + length
	// Check for integer overflow.

These are kind of funky cases. If it's easy enough, please add tests and verify we match Postgres behavior.

Copy link
Copy Markdown
Collaborator Author

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @solongordon)


pkg/sql/sem/builtins/builtins.go, line 4930 at r1 (raw file):
Sure. I have been adding these tests, but while comparing to Postgres, I think I discovered a bug/mistake in PG.

The PG source code https://github.com/postgres/postgres/blob/472e518a44eacd9caac7d618f1b6451672ca4481/src/backend/utils/adt/varlena.c#L884-L891

                        int			E = S + length;
			/*
			 * A negative value for L is the only way for the end position to
			 * be before the start. SQL99 says to throw an error.
			 */
			if (E < S)
				ereport(ERROR,
						(errcode(ERRCODE_SUBSTRING_ERROR),
						 errmsg("negative substring length not allowed")));

But that comment is wrong -- the end position could be before the start if L is not negative and the addition overflows.

I'm gonna go ahead and intentionally diverge from PG in that case to avoid this confusing behavior.

> select substr('string', 2147483646, 2147483646);
2021-01-04 12:23:00.298 EST [85734] ERROR:  negative substring length not allowed

I submitted a bug report to PG as well.

@solongordon
Copy link
Copy Markdown
Contributor

Nice catch. Do you think we should consider just erroring out in the overflow case? But with a more accurate error than Postgres?

@rafiss
Copy link
Copy Markdown
Collaborator Author

rafiss commented Jan 4, 2021

The PG devs seem to be pretty responsive on the bug tracker, so I'll see how they want to handle it and decide if we should do the same. I would lean towards keeping our current behavior of just going to the end of the string.

e.g. these should be the same.

> select substring('string', 2, 100);
tring

> select substring('string', 2, 9223372036854775807);
tring

@rafiss
Copy link
Copy Markdown
Collaborator Author

rafiss commented Jan 4, 2021

It's being discussed in the PG list here: https://www.postgresql.org/message-id/16804-f4eeeb6c11ba71d4%40postgresql.org

@rafiss
Copy link
Copy Markdown
Collaborator Author

rafiss commented Jan 19, 2021

PG devs landed on doing the same behavior we already implement, so merging this https://www.postgresql.org/message-id/3219376.1609795681%40sss.pgh.pa.us

bors r=solongordon

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 19, 2021

Build succeeded:

@craig craig bot merged commit 47eb9f3 into cockroachdb:master Jan 19, 2021
@rafiss rafiss deleted the fix-byte-substring branch February 5, 2021 03:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

could not parse "\\000x\\0ffyz" as type bytes: invalid bytea escape sequence

4 participants