Skip to content

asim: ensure consistent sequencing of changes applied to ranges#107696

Merged
craig[bot] merged 4 commits intocockroachdb:masterfrom
wenyihu6:fix-non-deter
Aug 2, 2023
Merged

asim: ensure consistent sequencing of changes applied to ranges#107696
craig[bot] merged 4 commits intocockroachdb:masterfrom
wenyihu6:fix-non-deter

Conversation

@wenyihu6
Copy link
Copy Markdown
Contributor

@wenyihu6 wenyihu6 commented Jul 27, 2023

asim: ensure consistent sequencing of changes applied to a range

Upon investigating, we identified the non-deterministic behavior was due to the
sequence of changes applied on the ranges within one tick. Changes within a tick
are applied in one go. But the changes are stored in a map, and map iteration
order is non-deterministic. Thus, the sequence of events applied and published
via gossip could vary and could result in different event states. This patch
resolves the issue by changing the storage of changes to be applied from a map
to an array, ensuring consistent iteration and sequence of events. Note the
shift to an array doesn't pose any issues since all changes need to be applied
anyway.

With this fix, the test TestAsimDeterministic fails only under race. This is
less concerning since the simulator should run on a single goroutine without
concurrency. Although the cause of the test failure in race conditions in
unclear, fixing this is not a priority.

Part Of: #105904
Release Note: None


asim: add tests for constraints validation, asim determinism

This patch adds an asim datadriven test case which sets up zone constraints and
ensures its proper application and validation of these constraints.

Part Of: #105904
Release Note: None


asim: sort map by key when iterating

While investigating #105904, we found that the bug's cause is due to the
simulator's iteration over an unordered map, leading to non-determinism. There
are other occurrences in the simulator's code where unordered map iteration
takes place. To make the code less error-prone, this patch checks the
simulator's code and sorts unordered maps by key when iterating. While it's
uncertain whether this change will solve any underlying non-determinism or
pinpoint exactly where in the code non-determinism might occur, there is no harm
in making the system less error-prone.

Part Of:#105904
Release Note: None


Each test were stressed with runs set to 10 and with --count = 100.

TestAsimDeterministic (with --count = 5 due to test wait time):
Screenshot 2023-07-27 at 8 09 50 AM

TestAsimDeterministicZoneConf:
Screenshot 2023-07-27 at 8 08 48 AM

TestAsimDeterministicDiskFull:
Screenshot 2023-07-27 at 8 08 19 AM

example_zone_config:
Screenshot 2023-07-27 at 8 13 19 AM

example_fulldisk:
Screenshot 2023-07-27 at 8 18 36 AM

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@wenyihu6 wenyihu6 changed the title Fix non deter asim: ensure consistent sequencing of changes applied to a range Jul 27, 2023
@wenyihu6
Copy link
Copy Markdown
Contributor Author

Let me know your thoughts on the added non-determinism tests since they just reproduce the data-driven tests but with more runs to check for determinism. Happy to remove if this is just extending CI time.

@wenyihu6 wenyihu6 marked this pull request as ready for review July 27, 2023 12:18
@wenyihu6 wenyihu6 requested a review from a team as a code owner July 27, 2023 12:18
@wenyihu6 wenyihu6 requested a review from kvoli July 27, 2023 12:19
@wenyihu6 wenyihu6 changed the title asim: ensure consistent sequencing of changes applied to a range asim: ensure consistent sequencing of changes applied to ranges Jul 27, 2023
@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Jul 27, 2023

hm looks like TestAsimDeterministic and TestAsimDeterministicDiskFull are still failing under stress (Extended CI). I'm going to take a shot in the dark with some pieces of code that I suspect may have issues while investigating and hand off if I can't figure out within an hour.

Feel free to put this PR on hold. I think there will also be merge conflicts since this was branched off before my rand framework PR got merged.

Copy link
Copy Markdown
Contributor

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we are getting anything out of having both these tests for non-determinism? Would it be possible to augment the existing determinsm test?

Reviewed 1 of 1 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)


-- commits line 22 at r2:
nit: this is a bit confusing to read.

Code quote:

2, 3.

pkg/kv/kvserver/asim/tests/testdata/example_zone_config line 1 at r2 (raw file):

# This test applies a configuration that prioritizes zone constraints, favoring

Nice test!


pkg/kv/kvserver/asim/asim_test.go line 181 at r2 (raw file):

	})

	num := 10

Is there a reason for 10 runs here instead of 3? I

@wenyihu6 wenyihu6 force-pushed the fix-non-deter branch 6 times, most recently from b7984c1 to 1d07153 Compare July 31, 2023 21:32
Upon investigating, we identified the non-deterministic behavior was due to the
sequence of changes applied on the ranges within one tick. Changes within a tick
are applied in one go. But the changes are stored in a map, and map iteration
order is non-deterministic. Thus, the sequence of events applied and published
via gossip could vary and could result in different event states. This patch
resolves the issue by changing the storage of changes to be applied from a map
to an array, ensuring consistent iteration and sequence of events. Note the
shift to an array doesn't pose any issues since all changes need to be applied
anyway.

With this fix, the test `TestAsimDeterministic` fails only under race. This is
less concerning since the simulator should run on a single goroutine without
concurrency. Although the cause of the test failure in race conditions in
unclear, fixing this is not a priority.

Part Of: cockroachdb#105904
Release Note: None
@wenyihu6 wenyihu6 self-assigned this Jul 31, 2023
@wenyihu6 wenyihu6 force-pushed the fix-non-deter branch 2 times, most recently from 394600a to 06366a3 Compare July 31, 2023 22:13
@wenyihu6 wenyihu6 requested a review from kvoli August 1, 2023 00:53
@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Aug 1, 2023

pkg/kv/kvserver/asim/asim_test.go line 181 at r2 (raw file):

Previously, kvoli (Austen) wrote…

Is there a reason for 10 runs here instead of 3? I

Done.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Aug 1, 2023

I'm removing these two tests that were added to catch non-deterministic failures. Initially, I added them as they were more effective than TestAsimDeterministic in reproducing certain failures. After overloading TestAsimDeterministic with more stores and replicas in the setup, they now only fail when TestAsimDeterministic fails, making them unnecessary.

This patch adds an asim datadriven test case which sets up zone constraints and
ensures its proper application and validation of these constraints.

Part Of: cockroachdb#105904
Release Note: None
Copy link
Copy Markdown
Contributor

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Should be good to merge after addressing the storepool locking nit.

Reviewed 3 of 3 files at r3, 8 of 8 files at r6, 7 of 7 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/kv/kvserver/asim/state/impl.go line 1031 at r7 (raw file):

	storeID StoreID, storeDescriptors map[roachpb.StoreID]*storepool.StoreDetail,
) {
	s.stores[storeID].storepool.DetailsMu.Lock()

nit: I would avoid locking in the simulator unless we understand why it is necessary. This doesn't appear necessary, since we still think there shouldn't be any concurrency.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Aug 1, 2023

pkg/kv/kvserver/asim/state/impl.go line 1031 at r7 (raw file):

Previously, kvoli (Austen) wrote…

nit: I would avoid locking in the simulator unless we understand why it is necessary. This doesn't appear necessary, since we still think there shouldn't be any concurrency.

Done.

While investigating cockroachdb#105904, we found that the bug's cause is due to the
simulator's iteration over an unordered map, leading to non-determinism. There
are other occurrences in the simulator's code where unordered map iteration
takes place. To make the code less error-prone, this patch checks the
simulator's code and sorts unordered maps by key when iterating. While it's
uncertain whether this change will solve any underlying non-determinism or
pinpoint exactly where in the code non-determinism might occur, there is no harm
in making the system less error-prone.

Part Of:cockroachdb#105904
Release Note: None
@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Aug 2, 2023

TFTR!

bors r=kvoli

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 2, 2023

Build succeeded:

@wenyihu6 wenyihu6 deleted the fix-non-deter branch October 30, 2023 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants