Report more details of unobtainable ShardLock #61255

Merged
DaveCTurner merged 3 commits into elastic:master from DaveCTurner:2020-08-18-log-shard-lock-age
Aug 19, 2020

@DaveCTurner
Member

Today a common reason for a ShardLockObtainFailedException is when a
shard is removed from a node and then assigned straight back to it again
before the node has had a chance to shut the previous shard instance
down. For instance, this can happen if a node briefly leaves the cluster
holding a primary with no in-sync replicas.

The message in this case is typically as follows:

obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]

This is pretty hard to interpret, and doesn't raise the important
question: "why didn't the shard shut down sooner?"

With this change we reword the message a bit, report the age of the
shard lock, and adjust the details to report that the lock is held by a
closing shard:

obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [12345ms]
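
A minimal sketch of the idea, not the actual Elasticsearch code: stamp the lock details with `System.nanoTime()` whenever the lock is taken, so that a timed-out acquirer can report both who holds the lock and for how long. The class `AgedLock`, the `Holder` type, and the use of `IllegalStateException` in place of `ShardLockObtainFailedException` are all assumptions made for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a per-shard lock; all names here are illustrative.
final class AgedLock {
    // Immutable snapshot of who took the lock and when (monotonic clock).
    private static final class Holder {
        final long acquiredAtNanos;
        final String details;

        Holder(long acquiredAtNanos, String details) {
            this.acquiredAtNanos = acquiredAtNanos;
            this.details = details;
        }
    }

    private final Semaphore mutex = new Semaphore(1);
    // Placeholder until the first acquisition, so readers never see null.
    private volatile Holder holder = new Holder(System.nanoTime(), "unlocked");

    // Tries to take the lock; on timeout, reports the current holder and its age.
    void acquire(long timeoutMillis, String details) {
        boolean acquired;
        try {
            acquired = mutex.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(
                "interrupted while obtaining shard lock for [" + details + "]", e);
        }
        if (acquired) {
            // Record the reason and the acquisition time together.
            holder = new Holder(System.nanoTime(), details);
            return;
        }
        // Read the volatile field once so details and age belong to the same holder.
        Holder current = holder;
        long ageMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - current.acquiredAtNanos);
        throw new IllegalStateException(
            describeFailure(details, timeoutMillis, current.details, ageMillis));
    }

    void release() {
        mutex.release();
    }

    // Builds the reworded message in the format shown above.
    static String describeFailure(String wanted, long timeoutMillis, String heldFor, long ageMillis) {
        return "obtaining shard lock for [" + wanted + "] timed out after [" + timeoutMillis
            + "ms], lock already held for [" + heldFor + "] with age [" + ageMillis + "ms]";
    }
}
```

Keeping the timestamp and the details in one immutable object means a reader never pairs the details of one holder with the age of another. `System.nanoTime()` is used rather than wall-clock time because it is monotonic, so the reported age cannot go negative if the system clock is adjusted.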

Relates #38807

@DaveCTurner added the >enhancement, :Distributed/Store, v8.0.0, and v7.10.0 labels on Aug 18, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Store)

@elasticmachine added the Team:Distributed label on Aug 18, 2020
@original-brownbear (Contributor) left a comment

LGTM

     try {
         if (mutex.tryAcquire(timeoutInMillis, TimeUnit.MILLISECONDS)) {
-            lockDetails = details;
+            lockDetails = Tuple.tuple(System.nanoTime(), details);

NIT: setDetails(details);

@dakrone (Member) left a comment

LGTM

@DaveCTurner merged commit 98213df into elastic:master on Aug 19, 2020
@DaveCTurner deleted the 2020-08-18-log-shard-lock-age branch on August 19, 2020 05:36
@DaveCTurner
Member Author

Thanks both

DaveCTurner added a commit that referenced this pull request Aug 19, 2020

Labels

:Distributed/Store, >enhancement, Team:Distributed, v7.10.0, v8.0.0-alpha1

5 participants