Skip to content

kv: introduce a stopgap for lack of ReplicaState synchronization#59194

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
irfansharif:210120.replica-version-nil
Jan 20, 2021
Merged

kv: introduce a stopgap for lack of ReplicaState synchronization#59194
craig[bot] merged 1 commit intocockroachdb:masterfrom
irfansharif:210120.replica-version-nil

Conversation

@irfansharif
Copy link
Copy Markdown
Contributor

There's a scary lack of synchronization around how we set the
ReplicaState for a given replica, and how we mark a replica as
"initialized". What this means is that very temporarily, it's possible
for the entry in Store.mu.replicas to be both "initialized" and have an
empty ReplicaState. This was an existing problem, but is now more likely
to bite us given the migrations infrastructure attempts to purge
outdated replicas at start up time (when replicas are being initialized,
and we're iterating through extan replicas in the Store.mu.replicas
map).

This issue has caused a bit of recent instability: #59180, #58489,
#58523, and #58378. While we work on a more considered fix to the
problem (tracked in #58489), we can introduce stop the bleeding in the
interim (and unskip some tests).

Release note: None

@irfansharif irfansharif requested review from ajwerner and tbg January 20, 2021 15:22
@irfansharif irfansharif requested a review from a team as a code owner January 20, 2021 15:22
@irfansharif irfansharif removed the request for review from a team January 20, 2021 15:22
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

@irfansharif
Copy link
Copy Markdown
Contributor Author

curl: (22) The requested URL returned error: 503 Service Unavailable: Back-end server is at capacity

Blocked on CI like everyone else. TFTR! Will merge after things settle down.

There's a scary lack of synchronization around how we set the
ReplicaState for a given replica, and how we mark a replica as
"initialized". What this means is that very temporarily, it's possible
for the entry in Store.mu.replicas to be both "initialized" and have an
empty ReplicaState. This was an existing problem, but is now more likely
to bite us given the migrations infrastructure attempts to purge
outdated replicas at start up time (when replicas are being initialized,
and we're iterating through extan replicas in the Store.mu.replicas
map).

This issue has caused a bit of recent instability: cockroachdb#59180, cockroachdb#58489,
\cockroachdb#58523, and cockroachdb#58378. While we work on a more considered fix to the
problem (tracked in cockroachdb#58489), we can introduce stop the bleeding in the
interim (and unskip some tests).

Release note: None
@irfansharif irfansharif force-pushed the 210120.replica-version-nil branch from fc314fb to f47e1f3 Compare January 20, 2021 18:46
@irfansharif
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 20, 2021

Build succeeded:

@craig craig bot merged commit e9d2cc1 into cockroachdb:master Jan 20, 2021
@irfansharif irfansharif deleted the 210120.replica-version-nil branch January 20, 2021 22:00
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Feb 10, 2021
See cockroachdb#59194 and cockroachdb#58489 for more details.

In cockroachdb#58489 we observed a scary lack of synchronization around how we set
the ReplicaState for a given replica, and how we mark a replica as
"initialized". What this meant is that it was possible for the entry in
Store.mu.replicas to be both "initialized" and have an empty
ReplicaState. This is now more likely to bite us given the migrations
infrastructure attempts to purge outdated replicas at start up time
(when replicas are being initialized, and we're iterating through extan
replicas in the Store.mu.replicas map).

We believed this was addressed as part of cockroachdb#58378, but that appears not
to be the case. Lets re-introduce this stop-gap while we investigate.

Release note: None
craig bot pushed a commit that referenced this pull request Feb 10, 2021
60429: kv: (re-)introduce a stopgap for lack of ReplicaState synchronization r=irfansharif a=irfansharif

See #59194 and #58489 for more details.

In #58489 we observed a scary lack of synchronization around how we set
the ReplicaState for a given replica, and how we mark a replica as
"initialized". What this meant is that it was possible for the entry in
Store.mu.replicas to be both "initialized" and have an empty
ReplicaState. This is now more likely to bite us given the migrations
infrastructure attempts to purge outdated replicas at start up time
(when replicas are being initialized, and we're iterating through extan
replicas in the Store.mu.replicas map).

We believed this was addressed as part of #58378, but that appears not
to be the case. Lets re-introduce this stop-gap while we investigate.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
craig bot pushed a commit that referenced this pull request Feb 10, 2021
60281: sql/pgwire: send placeholder BackendKeyData r=asubiotto,spaskob a=rafiss

fixes #13191 

Some tools expect this message to be returned at connection time and
will not connect without it. CockroachDB does not support pgwire
cancellation, but we can still send a placeholder value here, and
continue ignoring cancellation requests like we already do.

Added a small test to make sure nothing broke.

Release note (sql change): When a connection is established, CockroachDB
will now return a placeholder BackendKeyData message in the response.
This is for compatibility with some tools, but using the BackendKeyData
to cancel a query will still have no effect, just the same as before.



60429: kv: (re-)introduce a stopgap for lack of ReplicaState synchronization r=irfansharif a=irfansharif

See #59194 and #58489 for more details.

In #58489 we observed a scary lack of synchronization around how we set
the ReplicaState for a given replica, and how we mark a replica as
"initialized". What this meant is that it was possible for the entry in
Store.mu.replicas to be both "initialized" and have an empty
ReplicaState. This is now more likely to bite us given the migrations
infrastructure attempts to purge outdated replicas at start up time
(when replicas are being initialized, and we're iterating through extan
replicas in the Store.mu.replicas map).

We believed this was addressed as part of #58378, but that appears not
to be the case. Lets re-introduce this stop-gap while we investigate.

Release note: None

60441: bazel: quash unnecessary dependency on `pkg/util/uuid` from protos r=rickystewart a=rickystewart

This dependency can be replaced with a few `# keep` deps in a few choice
proto targets, which is what we should have done the whole time anyway.
This fixes build failures elsewhere in tree -- for example,
`pkg/util/uuid:uuid_test`, which doesn't play nicely with `rules_go` in
the presence of this dependency.

Fixes #59778.

Release note: None

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants