asim: ensure consistent sequencing of changes applied to ranges#107696
asim: ensure consistent sequencing of changes applied to ranges#107696craig[bot] merged 4 commits intocockroachdb:masterfrom
Conversation
|
Let me know your thoughts on the added non-determinism tests since they just reproduce the data-driven tests but with more runs to check for determinism. Happy to remove if this is just extending CI time. |
|
hm looks like Feel free to put this PR on hold. I think there will also be merge conflicts since this was branched off before my rand framework PR got merged. |
kvoli
left a comment
There was a problem hiding this comment.
Do you think we are getting anything out of having both these tests for non-determinism? Would it be possible to augment the existing determinsm test?
Reviewed 1 of 1 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
-- commits line 22 at r2:
nit: this is a bit confusing to read.
Code quote:
2, 3.
pkg/kv/kvserver/asim/tests/testdata/example_zone_config line 1 at r2 (raw file):
# This test applies a configuration that prioritizes zone constraints, favoring
Nice test!
pkg/kv/kvserver/asim/asim_test.go line 181 at r2 (raw file):
}) num := 10
Is there a reason for 10 runs here instead of 3? I
b7984c1 to
1d07153
Compare
Upon investigating, we identified the non-deterministic behavior was due to the sequence of changes applied on the ranges within one tick. Changes within a tick are applied in one go. But the changes are stored in a map, and map iteration order is non-deterministic. Thus, the sequence of events applied and published via gossip could vary and could result in different event states. This patch resolves the issue by changing the storage of changes to be applied from a map to an array, ensuring consistent iteration and sequence of events. Note the shift to an array doesn't pose any issues since all changes need to be applied anyway. With this fix, the test `TestAsimDeterministic` fails only under race. This is less concerning since the simulator should run on a single goroutine without concurrency. Although the cause of the test failure in race conditions in unclear, fixing this is not a priority. Part Of: cockroachdb#105904 Release Note: None
394600a to
06366a3
Compare
|
Previously, kvoli (Austen) wrote…
Done. |
|
I'm removing these two tests that were added to catch non-deterministic failures. Initially, I added them as they were more effective than TestAsimDeterministic in reproducing certain failures. After overloading TestAsimDeterministic with more stores and replicas in the setup, they now only fail when TestAsimDeterministic fails, making them unnecessary. |
This patch adds an asim datadriven test case which sets up zone constraints and ensures its proper application and validation of these constraints. Part Of: cockroachdb#105904 Release Note: None
kvoli
left a comment
There was a problem hiding this comment.
Should be good to merge after addressing the storepool locking nit.
Reviewed 3 of 3 files at r3, 8 of 8 files at r6, 7 of 7 files at r7, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/asim/state/impl.go line 1031 at r7 (raw file):
storeID StoreID, storeDescriptors map[roachpb.StoreID]*storepool.StoreDetail, ) { s.stores[storeID].storepool.DetailsMu.Lock()
nit: I would avoid locking in the simulator unless we understand why it is necessary. This doesn't appear necessary, since we still think there shouldn't be any concurrency.
|
Previously, kvoli (Austen) wrote…
Done. |
While investigating cockroachdb#105904, we found that the bug's cause is due to the simulator's iteration over an unordered map, leading to non-determinism. There are other occurrences in the simulator's code where unordered map iteration takes place. To make the code less error-prone, this patch checks the simulator's code and sorts unordered maps by key when iterating. While it's uncertain whether this change will solve any underlying non-determinism or pinpoint exactly where in the code non-determinism might occur, there is no harm in making the system less error-prone. Part Of:cockroachdb#105904 Release Note: None
|
TFTR! bors r=kvoli |
|
Build succeeded: |
asim: ensure consistent sequencing of changes applied to a range
Upon investigating, we identified the non-deterministic behavior was due to the
sequence of changes applied on the ranges within one tick. Changes within a tick
are applied in one go. But the changes are stored in a map, and map iteration
order is non-deterministic. Thus, the sequence of events applied and published
via gossip could vary and could result in different event states. This patch
resolves the issue by changing the storage of changes to be applied from a map
to an array, ensuring consistent iteration and sequence of events. Note the
shift to an array doesn't pose any issues since all changes need to be applied
anyway.
With this fix, the test
TestAsimDeterministicfails only under race. This isless concerning since the simulator should run on a single goroutine without
concurrency. Although the cause of the test failure in race conditions in
unclear, fixing this is not a priority.
Part Of: #105904
Release Note: None
asim: add tests for constraints validation, asim determinism
This patch adds an asim datadriven test case which sets up zone constraints and
ensures its proper application and validation of these constraints.
Part Of: #105904
Release Note: None
asim: sort map by key when iterating
While investigating #105904, we found that the bug's cause is due to the
simulator's iteration over an unordered map, leading to non-determinism. There
are other occurrences in the simulator's code where unordered map iteration
takes place. To make the code less error-prone, this patch checks the
simulator's code and sorts unordered maps by key when iterating. While it's
uncertain whether this change will solve any underlying non-determinism or
pinpoint exactly where in the code non-determinism might occur, there is no harm
in making the system less error-prone.
Part Of:#105904
Release Note: None
Each test were stressed with runs set to 10 and with --count = 100.
TestAsimDeterministic (with --count = 5 due to test wait time):

TestAsimDeterministicZoneConf:

TestAsimDeterministicDiskFull:

example_zone_config:

example_fulldisk:
