Skip to content

storage: ensure non-expired context before each liveness update attempt#25631

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/nlCtx
May 17, 2018
Merged

storage: ensure non-expired context before each liveness update attempt#25631
craig[bot] merged 1 commit intocockroachdb:masterfrom
nvb:nvanbenschoten/nlCtx

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented May 17, 2018

Fixes #25430.

Before this change, a liveness update could get stuck in an infinite
loop if its context expired. This is because it would continue to retry
and continue to get errRetryLiveness errors due to
AmbiguousResultErrors created by DistSender.

Release note: None

Fixes cockroachdb#25430.

Before this change, a liveness update could get stuck in an infinite
loop if its context expired. This is because it would continue to retry
and continue to get `errRetryLiveness` errors due to
`AmbiguousResultErrors` created by `DistSender`.

Release note: None
@nvb nvb requested review from a team, tbg and windchan7 May 17, 2018 20:03
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@windchan7
Copy link
Copy Markdown
Contributor

Do you think it makes more sense to move the check down to updateLivenessAttempt or not? Right now updateLiveness is the only caller of that function so it doesn't really matter much.


Review status: 0 of 1 files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented May 17, 2018

That's where I had it before, but I moved it up because updateLiveness is the function that's "aware" of the retry loop, so I think it makes more sense there. My reasoning was that we don't add these context cancellation checks at the beginning of every function, only in cases like this that need it.

@windchan7
Copy link
Copy Markdown
Contributor

Great! Thanks for the explanation. LGTM!

@windchan7
Copy link
Copy Markdown
Contributor

windchan7 commented May 17, 2018

I'm also wondering why we are not replacing this endless for loop with a retry.XXX. That will save the hassle of checking if context has been cancelled or not at every place like this.

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented May 17, 2018

There's always a debate between whether bounded loops help avoid problems or just mask issues (like this one) and make them harder to find. I think for now we should keep this as-is. The context cancellation should at-least bound the loop to some maximum duration.

@bdarnell
Copy link
Copy Markdown
Contributor

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@bdarnell
Copy link
Copy Markdown
Contributor

Should we backport this to 2.0? Is this just a test flake or could it explain some user issues like the import failures?

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented May 17, 2018

This could be more serious than a test flake, so I think we should backport it. If the user issues you're referring to included failed node liveness heartbeat: context deadline exceeded loops then it could explain them.

@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented May 17, 2018

bors r+

craig bot pushed a commit that referenced this pull request May 17, 2018
25542: changefeedccl: uri sinks, fill in record key, WITH envelope r=tschottdorf a=danhhz

See commit messages for details

25631: storage: ensure non-expired context before each liveness update attempt r=nvanbenschoten a=nvanbenschoten

Fixes #25430.

Before this change, a liveness update could get stuck in an infinite
loop if its context expired. This is because it would continue to retry
and continue to get `errRetryLiveness` errors due to
`AmbiguousResultErrors` created by `DistSender`.

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@bdarnell
Copy link
Copy Markdown
Contributor

If the user issues you're referring to included failed node liveness heartbeat: context deadline exceeded loops then it could explain them.

I don't have a specific issue in mind but I know we've seen some reports of liveness problems (especially in conjunction with IMPORT).

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 17, 2018

Build succeeded

@craig craig bot merged commit 7d7f31d into cockroachdb:master May 17, 2018
@nvb nvb deleted the nvanbenschoten/nlCtx branch May 17, 2018 20:56
@tbg
Copy link
Copy Markdown
Member

tbg commented May 18, 2018 via email

craig bot pushed a commit that referenced this pull request May 22, 2018
25634: backport-2.0: storage: ensure non-expired context before each liveness update attempt r=nvanbenschoten a=nvanbenschoten

Backport 1/1 commits from #25631.

/cc @cockroachdb/release

---

Fixes #25430.

Before this change, a liveness update could get stuck in an infinite
loop if its context expired. This is because it would continue to retry
and continue to get `errRetryLiveness` errors due to
`AmbiguousResultErrors` created by `DistSender`.

Release note: None


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants