Skip to content
This repository was archived by the owner on Mar 31, 2026. It is now read-only.

refactor(db_api): remove parsing of insert statements#630

Closed
larkee wants to merge 13 commits intogoogleapis:mainfrom
larkee:dbapi-bug-fix
Closed

refactor(db_api): remove parsing of insert statements#630
larkee wants to merge 13 commits intogoogleapis:mainfrom
larkee:dbapi-bug-fix

Conversation

@larkee
Copy link
Copy Markdown
Contributor

@larkee larkee commented Oct 21, 2021

No description provided.

@product-auto-label product-auto-label bot added the api: spanner Issues related to the googleapis/python-spanner API. label Oct 21, 2021
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Oct 21, 2021
@larkee larkee changed the title perf(db_api): generate one statement for homogeneous inserts perf(db_api): improve performance of insert statements Oct 25, 2021
@larkee larkee marked this pull request as ready for review November 3, 2021 03:15
@larkee larkee requested a review from a team November 3, 2021 03:15
@larkee larkee requested a review from a team as a code owner November 3, 2021 03:15
@larkee larkee requested review from IlyaFaer and olavloite November 3, 2021 03:39
Copy link
Copy Markdown
Contributor

@olavloite olavloite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this correctly, this change will change the behavior of dbapi execute_many is called with an INSERT statement. Previously such a call would result in a BatchUpdate RPC invocation with the SQL statement repeated multiple times, but with different parameter values. This will cause dbapi to alter the SQL statement that is executed into a single INSERT statement that inserts multiple rows (e.g. INSERT INTO Foo (c1, c2) VALUES (@p1, @p2), (@p3, @p4)). That is (probably) more efficient on Spanner than a BatchUpdate RPC with multiple INSERT statements.

In general I would say that:

  1. A driver should normally execute the SQL statement(s) it receives without altering them. It is the responsibility of the application or framework that uses it to generate the SQL statement that is needed.
  2. If the driver supports re-writing certain SQL statements it should be an op-in (or at least have an opt-out possibility) so it is clear to the user that it might happen. If it is an op-out, it should also be lenient enough to fall back to just executing the supplied SQL without returning an error (unless of course the SQL statement is invalid, in which case the backend is responsible for returning an error).

So if in any way possible, I would rather say that this change should be in SQLAlchemy and/or Django. If that is not feasible, then at least this should be optional and more lenient so it doesn't cause surprises for anyone that might be using dbapi directly.

about the resulting table in case c).
:returns: A list of (sql, param) tuples
""" # noqa
match = RE_INSERT.search(insert_sql)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this is not part of this change, but I think this part should be more lenient. As far as I can tell, the function will return an error if it cannot match the insert regexp, and then refuse to execute the statement(s). I would rather expect something like:

  1. Try to parse the insert statement, and if it does not work, fall back to a 'simple' case where the sql statement and parameter list is converted to a normal batch update RPC invocation. That prevents failures when an application tries to execute an insert statement that does not match the regexp (for example if it contains comments, hints, ...)
  2. If the parse succeeds, then try the different scenarios in this function.
  3. If the parse fails because it encounters some other non-expected scenario, then also fall back to the default of just trying to execute the statements as a batch update.

The above would ensure that the common cases of (generated) sql are parsed into a more efficient format, while other sql is still executed without any unexpected errors.

It might also be good to consider making entire parsing optional. It might be surprising for users that dbapi actually changes the SQL that is executed. Normally a driver like dbapi should execute the SQL as is. The generation of efficient SQL statements should as far as possible be a responsibility of any framework or application that uses the driver.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking more closely, this parsing and rewriting of insert statements doesn't seem to add any benefits. It's only used for execute which means at best it was returning the single SQL statement and at worst it was splitting it into multiple statements.

I have removed parse_insert in favor of sending the SQL as is.

@larkee larkee changed the title perf(db_api): improve performance of insert statements refactor(db_api): remove parsing of insert statements Nov 15, 2021
@larkee
Copy link
Copy Markdown
Contributor Author

larkee commented Nov 15, 2021

If I understand this correctly, this change will change the behavior of dbapi execute_many is called with an INSERT statement. Previously such a call would result in a BatchUpdate RPC invocation with the SQL statement repeated multiple times, but with different parameter values. This will cause dbapi to alter the SQL statement that is executed into a single INSERT statement that inserts multiple rows (e.g. INSERT INTO Foo (c1, c2) VALUES (@p1, @p2), (@p3, @P4)). That is (probably) more efficient on Spanner than a BatchUpdate RPC with multiple INSERT statements.

This was indeed my original intent. However, I agree with your reasoning that this type of change does not belong here. As such, I have updated this PR to instead:

  • Remove unneeded mutations code
  • Remove insert statement parsing
  • Remove adding superfluous WHERE clauses
  • Strip out comments before classifying SQL

PTAL :)

@@ -706,9 +771,6 @@ def test_setoutputsize(self):
#
# def test_do_execute_insert_heterogenous(self):
# pass
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you can erase other commented cases as well now

@IlyaFaer
Copy link
Copy Markdown

It'd be marvelous if we could erase all of this code, but I'm afraid something will fail somewhere.

First of all, I don't think we can erase the function which adds a dummy WHERE clause to UPDATE operations. We tried it several times and always hit problems in Django and SQLAlchemy level. So, I guess, these changes should be tried with SQLAlchemy and Django tests first.

@@ -391,22 +389,13 @@ def run_statement(self, statement, retried=False):
self._statements.append(statement)

if statement.is_insert:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this if statement at all any more? Or can it be merged with the default case for any other update statement?

if classification == parse_utils.STMT_UPDATING:
sql = parse_utils.ensure_where_clause(sql)

if classification != parse_utils.STMT_INSERT:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this? As in; do we handle an insert in any different way than for example an UPDATE DML statement?

:rtype: str
:returns: The query type name.
"""
query = sqlparse.format(query, strip_comments=True).strip()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is an external library, it might be good to add a comment here that it has been verified that this library supports exactly the types of comments that Cloud Spanner supports, and no other comment types. That is, the library will strip the following types of comments (from what I can see in the source code of the library this holds true):

  • # single line comment starting with a dash
  • -- single line comment starting with two hyphens
  • `/* multi line comment */

And it does not support for example the PostgreSQL style dollar quoted string constants (https://www.postgresql.org/docs/current/sql-syntax-lexical.html) (or if it does, document that the parsing would fail if a SQL string contains a comment inside what would be interpreted as a dollar quoted string)

@IlyaFaer
Copy link
Copy Markdown

IlyaFaer commented Feb 7, 2022

@vi3k6i5, we have this a bit stale PR here, I'm going to take a look at it and finish it, 'cause I like these changes. They'll reduce the DB API code complexity.

@IlyaFaer
Copy link
Copy Markdown

Closing this PR in favor of: #679, which contains almost the same changes, but without conflicts and with some errors fixed.

@IlyaFaer IlyaFaer closed this Feb 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

api: spanner Issues related to the googleapis/python-spanner API. cla: yes This human has signed the Contributor License Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants