Make update_execution() atomic#5358
Conversation
…1/st2 into bkhushboo/race_condition
cognifloyd
left a comment
There was a problem hiding this comment.
So basically, all you did was add one line to create the with ...get_lock(...): line, and then indent then block of code to protect it within the lock. And then add some more to the tests. Nice.
Why did adding 1 lock in 1 method increase the number of lock side effects so much in the tests?
| with Timer(key="action.executions.calculate_result_size"): | ||
| result_size = len( | ||
| ActionExecutionDB.result._serialize_field_value(liveaction_db.result) | ||
| with coordination.get_coordinator().get_lock(liveaction_db.id): |
There was a problem hiding this comment.
Could we add a prefix to this to scope the change just to this block? Or are there other places already using this the live action id as the lock name that you also want to block?
There was a problem hiding this comment.
@cognifloyd No, get_lock() is not being used anywhere else.
And to answer why tests are failing, https://github.com/StackStorm/st2/blob/master/st2actions/tests/unit/test_workflow_engine.py#L155-L159 mocks get_lock() to return ToozConnectionError (to test
st2/st2common/st2common/services/workflows.py
Line 941 in 007beed
Because of the above mocking, action execution update calls will fail with my changes and cause the test assertions to fail. I'm working on the fix for this.
|
@cognifloyd All the tests are passing now. Could you please review and merge the request? Thanks |
cognifloyd
left a comment
There was a problem hiding this comment.
LGTM I'm not sure if this will get merged for 3.6.0 or if we'll wait till 3.7.0 to include it.
|
I remember there is another PR that is more involved related to changing the behavior in st2workflowengine #5367. @khushboobhatia01 Could you please include the Changelog record as well for this PR to make it complete? |
|
I added a changelog entry |
|
Thanks, looks we're all good ✔️ on this one! |
Making update_execution() atomic to avoid concurrent updates causing inconsistency.
In the below screenshot two concurrent updates to the execution object resulted in inconsistent data where overall execution is set to running, but the log field depicts the execution had succeeded.

How did this happen?
Two interleaved update_execution() can cause this.
succeededlog is pushedAfter the above interleaved execution, P1 will overwrite overall execution status (succeeded set by P2) to running.