
Introduce UPSERT in set_trial_user_attr#5703

Merged
not522 merged 7 commits into optuna:master from porink0424:fix/introduce-upsert-in-set_trial_user_attr
Oct 28, 2024

Conversation

porink0424 (Member) commented Oct 11, 2024

Motivation

When adding or updating an attribute, set_trial_user_attr first checks whether the attribute already exists in the trial: if it does, it updates the value; if not, it inserts a new row. This results in two SQL queries for every single attribute addition or update.
This PR addresses this by combining the two queries into a single UPSERT statement.

Description of the changes

  • Since the implementation of UPSERT differs across databases, I implemented conditional handling within _set_trial_user_attr_without_commit.
    • Note that PostgreSQL is not yet supported; this should be implemented in a follow-up PR.
  • SQLAlchemy versions below v1.4.0 do not provide the insert function in sqlalchemy.dialects.sqlite, which causes an error. To fix this, I have updated the SQLAlchemy requirement to v1.4.2.
    • Since SQLAlchemy v1.4.0 supports Python 3.7 and later, which matches the Python versions Optuna supports, I believe this change will not cause any issues for Optuna.
    • Note that I am specifying v1.4.2 instead of v1.4.0 because some tests hit a bug in SQLAlchemy v1.4.0 that was resolved in v1.4.2.
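The dialect-conditional dispatch described above can be sketched with plain SQL strings. This is a stdlib-only illustration, not the actual Optuna code: the real implementation builds the statements via SQLAlchemy's dialect-specific insert() constructs inside _set_trial_user_attr_without_commit, and build_upsert_sql is a hypothetical helper name.

```python
def build_upsert_sql(dialect: str) -> str:
    """Hypothetical sketch of the per-dialect UPSERT dispatch."""
    if dialect == "mysql":
        # MySQL: INSERT ... ON DUPLICATE KEY UPDATE, using the row-alias
        # ("AS new") form that appears in the logs below.
        return (
            "INSERT INTO trial_user_attributes (trial_id, `key`, value_json) "
            "VALUES (%(trial_id)s, %(key)s, %(value_json)s) "
            "AS new ON DUPLICATE KEY UPDATE value_json = new.value_json"
        )
    if dialect == "sqlite":
        # SQLite: INSERT ... ON CONFLICT targeting the unique
        # (trial_id, "key") constraint.
        return (
            'INSERT INTO trial_user_attributes (trial_id, "key", value_json) '
            "VALUES (?, ?, ?) "
            'ON CONFLICT (trial_id, "key") DO UPDATE '
            "SET value_json = excluded.value_json"
        )
    # PostgreSQL (and other dialects) are not handled in this PR; support
    # is left to a follow-up, as noted above.
    raise NotImplementedError(f"UPSERT not implemented for dialect: {dialect}")
```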

Verification of SQL Changes

mysql

import optuna
import logging

optuna.logging.set_verbosity(optuna.logging.WARNING)

study = optuna.create_study(storage="mysql+pymysql://optuna:password@127.0.0.1:3306/optuna")
storage = study._storage


def objective(trial: optuna.Trial) -> float:
    trial.set_user_attr("user_attr_1", "value_1")

    logging.basicConfig()
    logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)
    trial.set_user_attr("user_attr_1", "new_value_1")
    trial.set_user_attr("user_attr_2", "value_2")
    logging.getLogger("sqlalchemy.engine").setLevel(logging.WARNING)

    return trial.suggest_float("x", 0.0, 1.0) ** 2


study.optimize(objective, n_trials=1)

I confirmed that the SELECT statement issued with each call to set_trial_user_attr has been eliminated and integrated into the INSERT INTO ... ON DUPLICATE KEY UPDATE ... statement.
Only the differences are shown below:

Before

...
INFO:sqlalchemy.engine.Engine:SELECT trial_user_attributes.trial_user_attribute_id AS trial_user_attributes_trial_user_attribute_id, trial_user_attributes.trial_id AS trial_user_attributes_trial_id, trial_user_attributes.`key` AS trial_user_attributes_key, trial_user_attributes.value_json AS trial_user_attributes_value_json 
FROM trial_user_attributes 
WHERE trial_user_attributes.trial_id = %(trial_id_1)s AND trial_user_attributes.`key` = %(key_1)s
INFO:sqlalchemy.engine.Engine:[cached since 0.005871s ago] {'trial_id_1': 163, 'key_1': 'user_attr_1'}
INFO:sqlalchemy.engine.Engine:UPDATE trial_user_attributes SET value_json=%(value_json)s WHERE trial_user_attributes.trial_user_attribute_id = %(trial_user_attributes_trial_user_attribute_id)s
INFO:sqlalchemy.engine.Engine:[generated in 0.00011s] {'value_json': '"new_value_1"', 'trial_user_attributes_trial_user_attribute_id': 4}
...
INFO:sqlalchemy.engine.Engine:SELECT trial_user_attributes.trial_user_attribute_id AS trial_user_attributes_trial_user_attribute_id, trial_user_attributes.trial_id AS trial_user_attributes_trial_id, trial_user_attributes.`key` AS trial_user_attributes_key, trial_user_attributes.value_json AS trial_user_attributes_value_json 
FROM trial_user_attributes 
WHERE trial_user_attributes.trial_id = %(trial_id_1)s AND trial_user_attributes.`key` = %(key_1)s
INFO:sqlalchemy.engine.Engine:[cached since 0.01159s ago] {'trial_id_1': 163, 'key_1': 'user_attr_2'}
INFO:sqlalchemy.engine.Engine:INSERT INTO trial_user_attributes (trial_id, `key`, value_json) VALUES (%(trial_id)s, %(key)s, %(value_json)s)
INFO:sqlalchemy.engine.Engine:[cached since 0.01137s ago] {'trial_id': 163, 'key': 'user_attr_2', 'value_json': '"value_2"'}
...

After

...
INFO:sqlalchemy.engine.Engine:INSERT INTO trial_user_attributes (trial_id, `key`, value_json) VALUES (%(trial_id)s, %(key)s, %(value_json)s) AS new ON DUPLICATE KEY UPDATE value_json = new.value_json
INFO:sqlalchemy.engine.Engine:[no key 0.00009s] {'trial_id': 162, 'key': 'user_attr_1', 'value_json': '"new_value_1"'}
...
INFO:sqlalchemy.engine.Engine:INSERT INTO trial_user_attributes (trial_id, `key`, value_json) VALUES (%(trial_id)s, %(key)s, %(value_json)s) AS new ON DUPLICATE KEY UPDATE value_json = new.value_json
INFO:sqlalchemy.engine.Engine:[no key 0.00009s] {'trial_id': 162, 'key': 'user_attr_2', 'value_json': '"value_2"'}
...

sqlite

...
study = optuna.create_study(storage="sqlite:///example.db")
...

Before

...
INFO:sqlalchemy.engine.Engine:SELECT trial_user_attributes.trial_user_attribute_id AS trial_user_attributes_trial_user_attribute_id, trial_user_attributes.trial_id AS trial_user_attributes_trial_id, trial_user_attributes."key" AS trial_user_attributes_key, trial_user_attributes.value_json AS trial_user_attributes_value_json 
FROM trial_user_attributes 
WHERE trial_user_attributes.trial_id = ? AND trial_user_attributes."key" = ?
INFO:sqlalchemy.engine.Engine:[cached since 0.003007s ago] (1, 'user_attr_1')
INFO:sqlalchemy.engine.Engine:UPDATE trial_user_attributes SET value_json=? WHERE trial_user_attributes.trial_user_attribute_id = ?
INFO:sqlalchemy.engine.Engine:[generated in 0.00010s] ('"new_value_1"', 1)
...
INFO:sqlalchemy.engine.Engine:SELECT trial_user_attributes.trial_user_attribute_id AS trial_user_attributes_trial_user_attribute_id, trial_user_attributes.trial_id AS trial_user_attributes_trial_id, trial_user_attributes."key" AS trial_user_attributes_key, trial_user_attributes.value_json AS trial_user_attributes_value_json 
FROM trial_user_attributes 
WHERE trial_user_attributes.trial_id = ? AND trial_user_attributes."key" = ?
INFO:sqlalchemy.engine.Engine:[cached since 0.005931s ago] (1, 'user_attr_2')
INFO:sqlalchemy.engine.Engine:INSERT INTO trial_user_attributes (trial_id, "key", value_json) VALUES (?, ?, ?)
INFO:sqlalchemy.engine.Engine:[cached since 0.005662s ago] (1, 'user_attr_2', '"value_2"')
...

After

...
INFO:sqlalchemy.engine.Engine:INSERT INTO trial_user_attributes (trial_id, "key", value_json) VALUES (?, ?, ?) ON CONFLICT (trial_id, "key") DO UPDATE SET value_json = excluded.value_json
INFO:sqlalchemy.engine.Engine:[no key 0.00009s] (1, 'user_attr_1', '"new_value_1"')
...
INFO:sqlalchemy.engine.Engine:INSERT INTO trial_user_attributes (trial_id, "key", value_json) VALUES (?, ?, ?) ON CONFLICT (trial_id, "key") DO UPDATE SET value_json = excluded.value_json
INFO:sqlalchemy.engine.Engine:[no key 0.00009s] (1, 'user_attr_2', '"value_2"')
...
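The SQLite statement above can be checked in isolation with the standard-library sqlite3 module. This is a standalone reproduction against a minimal stand-in table, not Optuna code; it assumes SQLite ≥ 3.24, which added the ON CONFLICT ... DO UPDATE clause.

```python
import sqlite3

# Minimal stand-in for the trial_user_attributes table; the ON CONFLICT
# target relies on the UNIQUE (trial_id, "key") constraint.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trial_user_attributes ("
    " trial_user_attribute_id INTEGER PRIMARY KEY,"
    ' trial_id INTEGER, "key" TEXT, value_json TEXT,'
    ' UNIQUE (trial_id, "key"))'
)

upsert = (
    'INSERT INTO trial_user_attributes (trial_id, "key", value_json) '
    "VALUES (?, ?, ?) "
    'ON CONFLICT (trial_id, "key") DO UPDATE SET value_json = excluded.value_json'
)
conn.execute(upsert, (1, "user_attr_1", '"value_1"'))      # first call inserts
conn.execute(upsert, (1, "user_attr_1", '"new_value_1"'))  # second call updates
rows = conn.execute(
    'SELECT trial_id, "key", value_json FROM trial_user_attributes'
).fetchall()
print(rows)  # a single row holding the updated value
```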

@porink0424 porink0424 changed the title Introduce upsert of mysql in set_trial_user_attr Introduce UPSERT in set_trial_user_attr Oct 11, 2024
@porink0424 porink0424 marked this pull request as ready for review October 16, 2024 08:05
@c-bata c-bata added the enhancement Change that does not break compatibility and not affect public interfaces, but improves performance. label Oct 16, 2024
y0z (Member) commented Oct 17, 2024

@not522 @nabenabe0928
Could you review this PR?

nabenabe0928 (Contributor) left a comment:
I checked the code changes and LGTM!

import os
import time

import optuna


mysql_url = os.environ["OPTUNA_STORAGE"]
sqlite_url = "sqlite:///upsert-test.db"
study_name = "test-upsert"

sqlite_study = optuna.create_study(storage=sqlite_url, study_name=study_name)
mysql_study = optuna.create_study(storage=mysql_url, study_name=study_name)

sqlite_trial = sqlite_study.ask()
mysql_trial = mysql_study.ask()

start = time.time()
for i in range(1000):
    sqlite_trial.set_user_attr("foo", i)
print(f"{sqlite_trial.user_attrs=}")
print(f"SQLite: {(time.time() - start) * 1000:.2f}[ms]")

start = time.time()
for i in range(1000):
    mysql_trial.set_user_attr("foo", i)
print(f"MySQL: {(time.time() - start) * 1000:.2f}[ms]")
print(f"{mysql_trial.user_attrs=}")

The measured runtimes:

# This PR
[I 2024-10-22 15:05:07,173] A new study created in RDB with name: test-upsert
[I 2024-10-22 15:05:07,259] A new study created in RDB with name: test-upsert
sqlite_trial.user_attrs={'foo': 999}
SQLite: 5680.27[ms]
MySQL: 4420.31[ms]
mysql_trial.user_attrs={'foo': 999}

# The master branch
[I 2024-10-22 15:05:25,677] A new study created in RDB with name: test-upsert
[I 2024-10-22 15:05:25,783] A new study created in RDB with name: test-upsert
sqlite_trial.user_attrs={'foo': 999}
SQLite: 6599.58[ms]
MySQL: 4955.39[ms]
mysql_trial.user_attrs={'foo': 999}

nabenabe0928 also left inline comments on the diff. The relevant hunks of _set_trial_user_attr_without_commit, as excerpted in the review:

    if attribute is None:
        attribute = models.TrialUserAttributeModel(
            trial_id=trial_id, key=key, value_json=json.dumps(value)
    ...
    if self.engine.name == "mysql":
    ...
    session.add(attribute)
    session.execute(mysql_upsert_stmt)
    elif self.engine.name == "sqlite":
        sqlite_insert_stmt = sqlalchemy_dialects_sqlite.insert(
    ...
            models.TrialUserAttributeModel.trial_id,
            models.TrialUserAttributeModel.key,
        ],
        set_=dict(value_json=sqlite_insert_stmt.excluded.value_json),

On the MySQL branch, nabenabe0928 (Contributor) commented:

I checked the doc here to verify the code:
https://dev.mysql.com/doc/refman/8.4/en/insert-on-duplicate.html
nabenabe0928 (Contributor) commented:
@porink0424

Note

As in the last comment, this PR does not speed up for MySQL.

c-bata (Member) commented Oct 22, 2024

@porink0424

Note

As in the last comment, this PR does not speed up for MySQL.

@nabenabe0928 It looks like your benchmark results show that this PR speeds up trial.set_user_attr() calls (master: 4955.39 ms → this branch: 4420.31 ms). Is that correct?

nabenabe0928 (Contributor) commented:

sqlite_pr = [5725.75, 5891.18, 5837.73, 5904.36, 5686.25, 5548.71, 5537.34, 5380.70, 5567.75, 5666.13]
mysql_pr = [4411.56, 4386.89, 4507.84, 4388.80, 4434.83, 4484.78, 4477.91, 4637.17, 4672.88, 4323.81]

sqlite_master = [5801.52, 5888.66, 7761.65, 6197.31, 6175.02, 6288.58, 6482.06, 5978.97, 5893.87, 5895.68]
mysql_master = [4588.37, 4585.37, 4594.20, 5047.10, 4829.42, 4913.89, 4271.38, 4795.67, 4378.23, 4953.27]

@c-bata @porink0424
I measured the runtime again over 10 runs, and this time I could see the speedup; sorry for the confusion 🙏
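Averaging those ten runs (plain arithmetic over the numbers above, not a figure stated in the thread) makes the speedup visible:

```python
from statistics import mean

# Runtimes in ms, copied from the comment above.
sqlite_pr = [5725.75, 5891.18, 5837.73, 5904.36, 5686.25, 5548.71, 5537.34, 5380.70, 5567.75, 5666.13]
mysql_pr = [4411.56, 4386.89, 4507.84, 4388.80, 4434.83, 4484.78, 4477.91, 4637.17, 4672.88, 4323.81]

sqlite_master = [5801.52, 5888.66, 7761.65, 6197.31, 6175.02, 6288.58, 6482.06, 5978.97, 5893.87, 5895.68]
mysql_master = [4588.37, 4585.37, 4594.20, 5047.10, 4829.42, 4913.89, 4271.38, 4795.67, 4378.23, 4953.27]

for name, master, pr in [
    ("SQLite", sqlite_master, sqlite_pr),
    ("MySQL", mysql_master, mysql_pr),
]:
    change = mean(pr) / mean(master) - 1.0
    print(f"{name}: {mean(master):.2f} ms -> {mean(pr):.2f} ms ({change:+.1%})")
```

On these numbers the PR branch averages roughly 9% faster on SQLite and roughly 5% faster on MySQL.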

@nabenabe0928 nabenabe0928 removed their assignment Oct 23, 2024
not522 (Member) commented Oct 23, 2024

In my local environment (Mac), the speed has not improved and is equivalent to master. However, this PR still has the benefit of reducing the number of queries, so I think it is good to merge.
I am checking the range of environments (SQLAlchemy / MySQL / SQLite / ...) in which this PR works.

porink0424 (Member, Author) commented:
@not522
Thank you for your comment. I have fixed the code, PTAL.

not522 (Member) left a comment:
LGTM!

@not522 not522 added this to the v4.1.0 milestone Oct 28, 2024
@not522 not522 merged commit c014b9d into optuna:master Oct 28, 2024
@porink0424 porink0424 deleted the fix/introduce-upsert-in-set_trial_user_attr branch October 30, 2024 04:38
