Conversation
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request significantly refactors the asynchronous job execution logic in runjobs.lua, moving to a much cleaner and more robust consumer-based model using semaphores. The changes also improve the semaphore implementation itself by switching from a hashset to a queue for waiting coroutines. While the overall design is a major improvement, I've identified a critical issue in the new semaphore post logic that could lead to a deadlock under error conditions. Additionally, there's a medium-severity issue in runjobs.lua regarding how job completion is counted, which leads to inaccurate progress reporting.
| function _semaphore:post(value) | ||
| if self._POSTING then | ||
| return self._VALUE | ||
| end | ||
| self._POSTING = true | ||
| local new_value = self._VALUE + value | ||
| self._VALUE = new_value | ||
| if new_value > 0 then | ||
| local pending = {} | ||
| local waiting = self._WAITING | ||
| for item in waiting:items() do | ||
| if #pending < new_value then | ||
| table.insert(pending, item) | ||
| else | ||
| local post_count = 0 | ||
| while not waiting:empty() do | ||
| local item = waiting:pop() | ||
| scheduler:co_resume(item) | ||
| post_count = post_count + 1 | ||
| if post_count >= new_value then | ||
| break | ||
| end | ||
| end | ||
| for _, item in ipairs(pending) do | ||
| scheduler:co_resume(item) | ||
| end | ||
| end | ||
| self._POSTING = false | ||
| return new_value | ||
| end |
There was a problem hiding this comment.
The _POSTING flag is not guaranteed to be reset to false if an error occurs within this function (e.g., if value is not a number, or if scheduler:co_resume is modified to throw an error). If an error is thrown, the function will exit without resetting self._POSTING, causing a permanent lock on the semaphore. All subsequent calls to :post() will then return early, effectively deadlocking any waiting coroutines.
To ensure the lock is always released, you should wrap the critical section in a construct that provides finally semantics, which can be emulated in Lua using pcall.
function _semaphore:post(value)
if self._POSTING then
return self._VALUE
end
self._POSTING = true
local ok, res_or_err = pcall(function()
local new_value = self._VALUE + (value or 1)
self._VALUE = new_value
if new_value > 0 then
local waiting = self._WAITING
local post_count = 0
while not waiting:empty() do
local item = waiting:pop()
local resume_ok, resume_err = scheduler:co_resume(item)
if not resume_ok then
error(resume_err, 0)
end
post_count = post_count + 1
if post_count >= new_value then
break
end
end
end
return new_value
end)
self._POSTING = false
if not ok then
error(res_or_err, 0)
end
return res_or_err
end
| state.finished_count = state.finished_count + 1 | ||
| job_func(job_index, total, {progress = state.progress_wrapper}) |
There was a problem hiding this comment.
Incrementing state.finished_count before the job function is called is problematic. It means the count represents jobs that have started, not finished. This has two main consequences:
- The variable name
finished_countis misleading and harms maintainability. It should probably bedispatched_countorstarted_count. - The progress reporting becomes inaccurate, as it will show progress based on job dispatch rather than completion. A long-running or failing job will appear as "finished" in the progress calculation.
A better approach is to increment a counter only after job_func completes successfully. This would require using a separate counter for dispatching versus completion and likely using a lock to safely increment the completion counter, as job_func can yield.
build.distcc.remote_onlypolicy to enable build on only remote machinesBefore optimization
After optimization