Skip to content

Prevent Lock Blocking by Adding Timeout to JournalStorage#5971

Merged
not522 merged 14 commits intooptuna:masterfrom
sawa3030:fix/add-grace-period
Mar 7, 2025
Merged

Prevent Lock Blocking by Adding Timeout to JournalStorage#5971
not522 merged 14 commits intooptuna:masterfrom
sawa3030:fix/add-grace-period

Conversation

@sawa3030
Copy link
Copy Markdown
Collaborator

Motivation

Description of the changes

  • Introduced a grace period for acquiring locks in JournalFileSymlinkLock and JournalFileOpenLock.
  • If the lock cannot be acquired within the grace period, it is considered stale and released to prevent blocking new processes.

@y0z
Copy link
Copy Markdown
Member

y0z commented Feb 13, 2025

@not522 Could you review this PR?

Copy link
Copy Markdown
Member

@not522 not522 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your PR! Could you check the following points?

  • How about using time.monotonic()? This differs from datetime.now() in that it is not affected by system clock updates.
  • How about checking the last modified time of the lock? As the number of processes increases, it may continue to fail to acquire a lock.

@sawa3030
Copy link
Copy Markdown
Collaborator Author

@not522 Thank you for your suggestions. There are some changes I made.

  • I updated the logic to check the last modified time. The previous approach had issues when the grace period was too short, but with your suggested logic, it is now more robust regardless of the grace period length.
  • I have confirmed that this PR successfully resolves the issue: JournalStorage with JournalFileSymlinkLock may fail to resume #5921. After an unexpected process termination, the optimization process can now successfully resume even if the lock file still exists. It successfully resumes the optimization process whether a new process starts before or after the grace period has elapsed.
  • I added tests to ensure that a customized grace period functions correctly without issues.

And here are some concerns

  • As far as I know, time.monotonic() cannot be used for checking the modified time of a file. Do you think it would be better to keep the previous logic, as shown in the code below? Or would that make the implementation unnecessarily complex?
if min(datetime.datetime.now() - mtime, time.monotonic() - start_time) > datetime.timedelta(seconds=self.grace_period):
    self.release()
  • The documentation for st_mtime states that its resolution depends on the file system, and some file systems have a low resolution (e.g., 2-seconds). If the grace period is set too short, this could lead to unexpected behavior. I’ve added a warning if the grace period is set to less than 3 seconds, but I’m unsure whether this warning is necessary or if it just adds unnecessary complexity.

@not522
Copy link
Copy Markdown
Member

not522 commented Feb 13, 2025

Ah, sorry, my previous comment is misleading. I suggest using both checks, the last modified time of the lock and the waiting time (time.monotonic()).

@y0z y0z added the bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself. label Feb 13, 2025
@codecov
Copy link
Copy Markdown

codecov bot commented Feb 13, 2025

Codecov Report

Attention: Patch coverage is 63.26531% with 18 lines in your changes missing coverage. Please review.

Project coverage is 88.46%. Comparing base (0a649e3) to head (3ba42a7).
Report is 88 commits behind head on master.

Files with missing lines Patch % Lines
optuna/storages/journal/_file.py 63.26% 18 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5971   +/-   ##
=======================================
  Coverage   88.46%   88.46%           
=======================================
  Files         205      206    +1     
  Lines       13768    13857   +89     
=======================================
+ Hits        12180    12259   +79     
- Misses       1588     1598   +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@c-bata
Copy link
Copy Markdown
Member

c-bata commented Feb 18, 2025

@y0z Could you review this PR?

@not522
Copy link
Copy Markdown
Member

not522 commented Feb 18, 2025

In my understanding, comparing st_mtime and the local time of the machine is unreliable (for example, NFS records the time of the server, not the machine where the process is running). But, we can trust whether it has been modified or not. So, how about the following logic?

start_time = time.monotonic()
mtime = datetime.datetime.fromtimestamp(os.stat(self._lock_file).st_mtime)
while True:
    ...
    current_mtime = datetime.datetime.fromtimestamp(os.stat(self._lock_file).st_mtime)
    if current_mtime != mtime:
        start_time = time.monotonic()
        mtime = current_mtime
        continue
    if time.monotonic() - start_time > self.grace_period:
        self.release()

@sawa3030 sawa3030 force-pushed the fix/add-grace-period branch from d57bd65 to 97203eb Compare February 19, 2025 02:13
@sawa3030
Copy link
Copy Markdown
Collaborator Author

Thank you for your suggestion! I've updated the logic accordingly.

@sawa3030
Copy link
Copy Markdown
Collaborator Author

sawa3030 commented Feb 19, 2025

By running the following code, I have confirmed that the logic successfully resolves the issue: JournalStorage with JournalFileSymlinkLock may fail to resume #5921.

import optuna

def objective(trial): 
    x = trial.suggest_uniform('x', -10, 10)
    return (x - 2) ** 2

storage = optuna.storages.JournalStorage(
    optuna.storages.journal.JournalFileBackend("./optuna_journal_storage.log")
)

study = optuna.create_study(study_name = "example",storage=storage, load_if_exists=True)
study.optimize(objective)

Before (master):
If the process is unexpectedly terminated with SIGKILL and the lock file is not removed, it blocks subsequent processes, preventing the optimization from resuming indefinitely.

After (PR):
Even if the lock file remains, the next process will remove it, allowing the optimization to resume after the grace period.

@sawa3030
Copy link
Copy Markdown
Collaborator Author

@not522 @y0z I’ve made some updates. Could you review the PR when you have time? Thank you!

Copy link
Copy Markdown
Member

@not522 not522 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your update! Could you check my suggestions?

@sawa3030 sawa3030 force-pushed the fix/add-grace-period branch from bcfa1c4 to ae4fe93 Compare February 20, 2025 10:24
@sawa3030
Copy link
Copy Markdown
Collaborator Author

@not522 Thank you for your advice! I've made some updates based on your feedback.

@not522 not522 changed the title Prevent Lock Blocking by Adding Timeout to JournalStorage Prevent Lock Blocking by Adding Timeout to JournalStorage Feb 21, 2025
@not522 not522 linked an issue Feb 25, 2025 that may be closed by this pull request
Comment on lines +240 to +246
try:
current_mtime = os.stat(self._lock_file).st_mtime
except OSError:
continue
if current_mtime != mtime:
mtime = current_mtime
last_update_time = time.monotonic()
Copy link
Copy Markdown
Member

@y0z y0z Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused about why we need the three variables: last_update_time, mtime, and current_mtime. At first glance, it seems that last_update_time equals to current_mtime since current_mtime != mtime means the lock file has been modified. Could you please provide additional explanation as to the difference between their roles?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last_update_time is based on monotonic clock time, whereas current_mtime and m_time are Epoch timestamps. I chose not to compare mtime directly with grace_period to avoid inconsistencies.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the name last_update_time is a little confusing, but I couldn't come up with a better variable name.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I change the variable name from last_update_time to last_update_monotonic_time!

Comment on lines +160 to +166
try:
current_mtime = os.stat(self._lock_file).st_mtime
except OSError:
continue
if current_mtime != mtime:
mtime = current_mtime
last_update_time = time.monotonic()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"""

def __init__(self, filepath: str) -> None:
def __init__(self, filepath: str, grace_period: int = 30) -> None:
Copy link
Copy Markdown
Member

@y0z y0z Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about allowing grace_period = None (i.e., unlimited)? Forcing the release of long-held locks seems to work in many cases, but I'm not sure it's always OK.

Suggested change: grace_period: int | None = 30

"for an extended period. Forcibly releasing the lock file."
)
try:
self.release()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be a good idea to reset sleep_secs to 0.001 after calling release()? Another thread may acquire the lock between your release() and os.symlink().

"""

def __init__(self, filepath: str) -> None:
def __init__(self, filepath: str, grace_period: int = 30) -> None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sawa3030
Copy link
Copy Markdown
Collaborator Author

@y0z I made the update. Thank you for your feedback!

@sawa3030
Copy link
Copy Markdown
Collaborator Author

sawa3030 commented Mar 5, 2025

@y0z @not522 I think all updates have been applied. Thank you for your suggestions! Please let me know if there's anything else that needs to be addressed.

Copy link
Copy Markdown
Member

@not522 not522 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. It's almost LGTM. Could you checking the following points?

  • The error type
  • Adding test for the invalid argument (i.e., zero and negative integer)

I think the warning message can be improved, but I think we can address in a separate PR.

try:
self.release()
sleep_secs = 0.001
except OSError:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it should be RuntimeError.

Suggested change
except OSError:
except RuntimeError:

Copy link
Copy Markdown
Member

@y0z y0z left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@y0z y0z removed their assignment Mar 5, 2025
sawa3030 and others added 2 commits March 7, 2025 10:31
Co-authored-by: Naoto Mizuno <naotomizuno@preferred.jp>
Co-authored-by: Naoto Mizuno <naotomizuno@preferred.jp>
@sawa3030
Copy link
Copy Markdown
Collaborator Author

sawa3030 commented Mar 7, 2025

@not522 Thank you for letting me know that. I updated the code as you suggested.

Copy link
Copy Markdown
Member

@not522 not522 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@not522 not522 added this to the v4.3.0 milestone Mar 7, 2025
@not522 not522 merged commit eb2367b into optuna:master Mar 7, 2025
14 checks passed
@sawa3030 sawa3030 deleted the fix/add-grace-period branch November 14, 2025 07:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Issue/PR about behavior that is broken. Not for typos/examples/CI/test but for Optuna itself.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

JournalStorage with JournalFileSymlinkLock may fail to resume

4 participants