Skip to content

Make CBE message creation more robust#87687

Merged
elasticsearchmachine merged 8 commits intoelastic:masterfrom
DaveCTurner:2022-06-15-robust-CBE-message-creation
Jun 21, 2022
Merged

Make CBE message creation more robust#87687
elasticsearchmachine merged 8 commits intoelastic:masterfrom
DaveCTurner:2022-06-15-robust-CBE-message-creation

Conversation

@DaveCTurner
Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner commented Jun 15, 2022

Child circuit breakers rely on proper matching of acquire/release pairs.
This can be tricky to get right. If we get it wrong and accidentally
double-release a CB then it may end up with a negative used value.
This is definitely a bad situation in which to find ourselves, but today
in production it's made a whole lot worse because it causes exceptions
on every attempt to report a CircuitBreakerStats or to construct a
parent CircuitBreakingException.

This commit makes the message construction and stats serialization a
little more robust so that it's clearer what is going on in production.

Relates #86059.

@DaveCTurner DaveCTurner added >bug :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload v7.17.5 v8.4.0 v8.3.1 labels Jun 15, 2022
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Jun 15, 2022
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

Child circuit breakers rely on proper matching of acquire/release pairs.
This can be tricky to get right. If we get it wrong and accidentally
double-release a CB then it may end up with a negative `used` value.
This is definitely a bad situation in which to find ourselves, but today
in production it's made a whole lot worse because it causes exceptions
on every attempt to report a `CircuitBreakerStats` or to construct a
parent `CircuitBreakingException`.

This commit makes the message construction and stats serialization a
little more robust so that it's clearer what is going on in production.
@DaveCTurner DaveCTurner force-pushed the 2022-06-15-robust-CBE-message-creation branch from c15655e to c87f62f Compare June 15, 2022 12:54
@DaveCTurner DaveCTurner marked this pull request as draft June 15, 2022 12:56
@DaveCTurner DaveCTurner marked this pull request as ready for review June 16, 2022 08:19
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

Copy link
Copy Markdown
Contributor

@grcevski grcevski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Does it make sense to log a warning of some kind when we encounter a negative value with a stack trace? I'm hoping here that with enough stack traces we could at least pinpoint the general area where we have this double decrement, or perhaps that won't help us much.

@DaveCTurner
Copy link
Copy Markdown
Member Author

Sure, I added such logging in 5ff3a42. This should at least tell us which CB is affected in a way that's easy to search across all clusters' logs.

@DaveCTurner DaveCTurner added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport-and-merge labels Jun 21, 2022
@elasticsearchmachine elasticsearchmachine merged commit 9ff9026 into elastic:master Jun 21, 2022
@DaveCTurner DaveCTurner deleted the 2022-06-15-robust-CBE-message-creation branch June 21, 2022 09:40
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jun 21, 2022
Child circuit breakers rely on proper matching of acquire/release pairs.
This can be tricky to get right. If we get it wrong and accidentally
double-release a CB then it may end up with a negative `used` value.
This is definitely a bad situation in which to find ourselves, but today
in production it's made a whole lot worse because it causes exceptions
on every attempt to report a `CircuitBreakerStats` or to construct a
parent `CircuitBreakingException`.

This commit makes the message construction and stats serialization a
little more robust so that it's clearer what is going on in production.

Relates elastic#86059.
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

💔 Backport failed

Status Branch Result
7.17 Commit could not be cherrypicked due to conflicts
8.3

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 87687

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jun 21, 2022
Child circuit breakers rely on proper matching of acquire/release pairs.
This can be tricky to get right. If we get it wrong and accidentally
double-release a CB then it may end up with a negative `used` value.
This is definitely a bad situation in which to find ourselves, but today
in production it's made a whole lot worse because it causes exceptions
on every attempt to report a `CircuitBreakerStats` or to construct a
parent `CircuitBreakingException`.

This commit makes the message construction and stats serialization a
little more robust so that it's clearer what is going on in production.

Relates elastic#86059.
Backport of elastic#87687.
elasticsearchmachine pushed a commit that referenced this pull request Jun 21, 2022
Child circuit breakers rely on proper matching of acquire/release pairs.
This can be tricky to get right. If we get it wrong and accidentally
double-release a CB then it may end up with a negative `used` value.
This is definitely a bad situation in which to find ourselves, but today
in production it's made a whole lot worse because it causes exceptions
on every attempt to report a `CircuitBreakerStats` or to construct a
parent `CircuitBreakingException`.

This commit makes the message construction and stats serialization a
little more robust so that it's clearer what is going on in production.

Relates #86059.
elasticsearchmachine pushed a commit that referenced this pull request Jun 21, 2022
Child circuit breakers rely on proper matching of acquire/release pairs.
This can be tricky to get right. If we get it wrong and accidentally
double-release a CB then it may end up with a negative `used` value.
This is definitely a bad situation in which to find ourselves, but today
in production it's made a whole lot worse because it causes exceptions
on every attempt to report a `CircuitBreakerStats` or to construct a
parent `CircuitBreakingException`.

This commit makes the message construction and stats serialization a
little more robust so that it's clearer what is going on in production.

Relates #86059.
Backport of #87687.
DaveCTurner added a commit that referenced this pull request Jun 21, 2022
DaveCTurner added a commit that referenced this pull request Jun 21, 2022
@DaveCTurner
Copy link
Copy Markdown
Member Author

DaveCTurner commented Jun 21, 2022

Reverted due to test failures - will be reinstated by #87881.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload Team:Core/Infra Meta label for core/infra team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants