Prevent Lock Blocking by Adding Timeout to JournalStorage#5971
Prevent Lock Blocking by Adding Timeout to JournalStorage#5971not522 merged 14 commits intooptuna:masterfrom
JournalStorage#5971Conversation
|
@not522 Could you review this PR? |
not522
left a comment
There was a problem hiding this comment.
Thank you for your PR! Could you check the following points?
- How about using
time.monotonic()? This differs fromdatetime.now()in that it is not affected by system clock updates. - How about checking the last modified time of the lock? As the number of processes increases, it may continue to fail to acquire a lock.
|
@not522 Thank you for your suggestions. There are some changes I made.
And here are some concerns
|
|
Ah, sorry, my previous comment is misleading. I suggest using both checks, the last modified time of the lock and the waiting time ( |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5971 +/- ##
=======================================
Coverage 88.46% 88.46%
=======================================
Files 205 206 +1
Lines 13768 13857 +89
=======================================
+ Hits 12180 12259 +79
- Misses 1588 1598 +10 ☔ View full report in Codecov by Sentry. |
|
@y0z Could you review this PR? |
|
In my understanding, comparing start_time = time.monotonic()
mtime = datetime.datetime.fromtimestamp(os.stat(self._lock_file).st_mtime)
while True:
...
current_mtime = datetime.datetime.fromtimestamp(os.stat(self._lock_file).st_mtime)
if current_mtime != mtime:
start_time = time.monotonic()
mtime = current_mtime
continue
if time.monotonic() - start_time > self.grace_period:
self.release() |
d57bd65 to
97203eb
Compare
|
Thank you for your suggestion! I've updated the logic accordingly. |
|
By running the following code, I have confirmed that the logic successfully resolves the issue: JournalStorage with JournalFileSymlinkLock may fail to resume #5921. Before (master): After (PR): |
not522
left a comment
There was a problem hiding this comment.
Thank you for your update! Could you check my suggestions?
bcfa1c4 to
ae4fe93
Compare
|
@not522 Thank you for your advice! I've made some updates based on your feedback. |
JournalStorage
optuna/storages/journal/_file.py
Outdated
| try: | ||
| current_mtime = os.stat(self._lock_file).st_mtime | ||
| except OSError: | ||
| continue | ||
| if current_mtime != mtime: | ||
| mtime = current_mtime | ||
| last_update_time = time.monotonic() |
There was a problem hiding this comment.
I am confused about why we need the three variables: last_update_time, mtime, and current_mtime. At first glance, it seems that last_update_time equals to current_mtime since current_mtime != mtime means the lock file has been modified. Could you please provide additional explanation as to the difference between their roles?
There was a problem hiding this comment.
last_update_time is based on monotonic clock time, whereas current_mtime and m_time are Epoch timestamps. I chose not to compare mtime directly with grace_period to avoid inconsistencies.
There was a problem hiding this comment.
I feel like the name last_update_time is a little confusing, but I couldn't come up with a better variable name.
There was a problem hiding this comment.
I change the variable name from last_update_time to last_update_monotonic_time!
optuna/storages/journal/_file.py
Outdated
| try: | ||
| current_mtime = os.stat(self._lock_file).st_mtime | ||
| except OSError: | ||
| continue | ||
| if current_mtime != mtime: | ||
| mtime = current_mtime | ||
| last_update_time = time.monotonic() |
There was a problem hiding this comment.
optuna/storages/journal/_file.py
Outdated
| """ | ||
|
|
||
| def __init__(self, filepath: str) -> None: | ||
| def __init__(self, filepath: str, grace_period: int = 30) -> None: |
There was a problem hiding this comment.
What about allowing grace_period = None (i.e., unlimited)? Forcing the release of long-held locks seems to work in many cases, but I'm not sure it's always OK.
Suggested change: grace_period: int | None = 30
optuna/storages/journal/_file.py
Outdated
| "for an extended period. Forcibly releasing the lock file." | ||
| ) | ||
| try: | ||
| self.release() |
There was a problem hiding this comment.
Would it be a good idea to reset sleep_secs to 0.001 after calling release()? Another thread may acquire the lock between your release() and os.symlink().
optuna/storages/journal/_file.py
Outdated
| """ | ||
|
|
||
| def __init__(self, filepath: str) -> None: | ||
| def __init__(self, filepath: str, grace_period: int = 30) -> None: |
There was a problem hiding this comment.
|
@y0z I made the update. Thank you for your feedback! |
not522
left a comment
There was a problem hiding this comment.
Thank you. It's almost LGTM. Could you checking the following points?
- The error type
- Adding test for the invalid argument (i.e., zero and negative integer)
I think the warning message can be improved, but I think we can address in a separate PR.
optuna/storages/journal/_file.py
Outdated
| try: | ||
| self.release() | ||
| sleep_secs = 0.001 | ||
| except OSError: |
There was a problem hiding this comment.
Maybe it should be RuntimeError.
| except OSError: | |
| except RuntimeError: |
Co-authored-by: Naoto Mizuno <naotomizuno@preferred.jp>
Co-authored-by: Naoto Mizuno <naotomizuno@preferred.jp>
|
@not522 Thank you for letting me know that. I updated the code as you suggested. |
Motivation
Description of the changes
JournalFileSymlinkLockandJournalFileOpenLock.