Skip to content

asim: fix simulator non-deterministic behavior #108768

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
wenyihu6:new-livenessmap
Aug 16, 2023
Merged

asim: fix simulator non-deterministic behavior #108768
craig[bot] merged 1 commit intocockroachdb:masterfrom
wenyihu6:new-livenessmap

Conversation

@wenyihu6
Copy link
Copy Markdown
Contributor

@wenyihu6 wenyihu6 commented Aug 15, 2023

Previously, the allocator simulator exhibits non-deterministic behavior
because it set the liveness expiration record using the real system time
plus one minute. In addition, the liveness record was not updated between
ticks. While this was typically acceptable, it could lead to dead nodes in
deadlock test scenarios where tests took longer.

To address this, the following changes were made in this patch:

  1. Remove the usage of TestNodeVitality as the simulator uses a simulated clock
    rather than a real-time clock.
  2. Replace the existing TestNodeVitality mapping with an enum map for Liveness
    Status. The majority of the code changes just revert to the state before this
    commit.
  3. To hook the liveness component with the liveness status mapping, this patch
    also introduces a conversion method which constructs node vitality in a manner
    that respects all node liveness status properties.

Fixes: #105904, #108092
Release Note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@wenyihu6 wenyihu6 force-pushed the new-livenessmap branch 13 times, most recently from 4cd8572 to f4c6976 Compare August 15, 2023 15:54
@wenyihu6 wenyihu6 marked this pull request as ready for review August 15, 2023 16:56
@wenyihu6 wenyihu6 requested a review from a team as a code owner August 15, 2023 16:56
@wenyihu6 wenyihu6 requested a review from kvoli August 15, 2023 16:56
@wenyihu6 wenyihu6 force-pushed the new-livenessmap branch 2 times, most recently from 3042428 to f1bb738 Compare August 15, 2023 17:29
@wenyihu6
Copy link
Copy Markdown
Contributor Author

pkg/kv/kvserver/asim/state/impl.go line 1080 at r1 (raw file):

		count := 0
		for _, status := range s.nodeLiveness.statusMap {
			if status != livenesspb.NodeLivenessStatus_DECOMMISSIONED {

Do we want to include all nodes with an active membership? Should we use livenesspb.MembershipStatus_ACTIVE instead? It also excludes NodeLivenessStatus_DECOMMISSIONING.

Copy link
Copy Markdown
Contributor

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)


-- commits line 5 at r1:
nit: set


-- commits line 5 at r1:
nit: consider replacing "current time" with "system time"


-- commits line 15 at r1:
nit: You can drop the markdown link, and instead just reference the SHA e.g. 4a52da5c0d90e40c6e1d6e2b26f4a72dd04527ef


pkg/kv/kvserver/asim/state/impl.go line 1080 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Do we want to include all nodes with an active membership? Should we use livenesspb.MembershipStatus_ACTIVE instead? It also excludes NodeLivenessStatus_DECOMMISSIONING.

I don't think there is an equivalent mapping in the simulator yet, which separates liveness and membership.

You could add this in another commit, or file an issue to add the concept of membership, and separate it out from liveness status.


pkg/kv/kvserver/asim/state/liveness.go line 51 at r1 (raw file):

// Following the timeline of the states that a liveness record goes through:
//
//		|---LIVE---|---UNAVAILABLE---|---DEAD---> time

Is this diagram also somewhere else in the code? I think referencing the existing comments/code for this will age better.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

pkg/kv/kvserver/asim/state/impl.go line 1077 at r2 (raw file):

// TODO(kvoli): Find a better home for this method, required by the storepool.
// TODO(wenyihu6): introduce the concept of membership separated from the
// liveness map.

Added as a todo here.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

pkg/kv/kvserver/asim/state/liveness.go line 51 at r1 (raw file):

Previously, kvoli (Austen) wrote…

Is this diagram also somewhere else in the code? I think referencing the existing comments/code for this will age better.

I'm using a permanent link here. Are there better ways to link to the code snippet?

Copy link
Copy Markdown
Contributor

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: with removing the link, and referencing a fn name.

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/kv/kvserver/asim/state/impl.go line 1077 at r2 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Added as a todo here.

Thanks


pkg/kv/kvserver/asim/state/liveness.go line 51 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

I'm using a permanent link here. Are there better ways to link to the code snippet?

You can reference the function naming, if it is unlikely to change. For this one, something like "see liveness.LivenessStatus" would be fine.

Using a perma link is a bit clunky, and doesn't map so well within an editor.


pkg/kv/kvserver/asim/state/liveness.go line 100 at r1 (raw file):

		true,
		timeUntilNodeDead,
		0,

nit: consider naming the function arguments if they're not obvious e.g. 0, /* timeAfterNodeSuspect */.

Previously, the allocator simulator exhibits non-deterministic behavior
because it set the liveness expiration record using the real system time
plus one minute. In addition, the liveness record was not updated between
ticks. While this was typically acceptable, it could lead to dead nodes in
deadlock test scenarios where tests took longer.

To address this, the following changes were made in this patch:
1. Remove the usage of TestNodeVitality as the simulator uses a simulated clock
rather than a real-time clock.
2. Replace the existing TestNodeVitality mapping with an enum map for Liveness
Status. The majority of the code changes just revert to the state before this
[commit](4a52da5).
3. To hook the liveness component with the liveness status mapping, this patch
also introduces a conversion method which constructs node vitality in a manner
that respects all node liveness status properties.

Fixes: cockroachdb#105904, cockroachdb#108092
Release Note: None
@wenyihu6
Copy link
Copy Markdown
Contributor Author

pkg/kv/kvserver/asim/state/liveness.go line 51 at r1 (raw file):

Previously, kvoli (Austen) wrote…

You can reference the function naming, if it is unlikely to change. For this one, something like "see liveness.LivenessStatus" would be fine.

Using a perma link is a bit clunky, and doesn't map so well within an editor.

Done.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

TFTR!

bors r=kvoli

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 15, 2023

Build failed:

@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Aug 16, 2023

Failure Free disk space requirements could not be met looks irrelevant.

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 16, 2023

Build succeeded:

@craig craig bot merged commit c167946 into cockroachdb:master Aug 16, 2023
@wenyihu6 wenyihu6 deleted the new-livenessmap branch August 16, 2023 02:46
@wenyihu6 wenyihu6 changed the title asim: deflake TestAsimDeterministic asim: make simulator determinstic Feb 19, 2024
@wenyihu6 wenyihu6 changed the title asim: make simulator determinstic asim: fix simulator non-deterministic behavior Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

asim: non-deterministic result when overloaded with stores and replicas

3 participants