Skip to content

backupccl: avoid div-by-zero crash on failed node count#55560

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
dt:div-by-zero
Oct 27, 2020
Merged

backupccl: avoid div-by-zero crash on failed node count#55560
craig[bot] merged 1 commit intocockroachdb:masterfrom
dt:div-by-zero

Conversation

@dt
Copy link
Copy Markdown
Contributor

@dt dt commented Oct 14, 2020

We've seen a report of a node that crashed due to a divide-by-zero
hit during metrics collection, specifically when computing the
throughput-per-node by dividing the backup size by node count.

Since this is only now used for that metric, make a failure to count
nodes a warning only for release builds (and fallback to 1), and make
any error while counting, or not counting to more than 0, a returned
error in non-release builds.

Release note (bug fix): avoid crashing when BACKUP is unable to count the total nodes in the cluster.

@dt dt requested review from a team and pbardea October 14, 2020 19:02
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@@ -1169,7 +1170,11 @@ func (r *restoreResumer) Resume(

numClusterNodes, err := clusterNodeCount(p.ExecCfg().Gossip)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also move this call to closer to the telemetry since that's the only thing that it's used for?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sorta considered that but then we'd need to pass Gossip into restore too. I think the real answer is that we should be using number of nodes we actually plan the flow on, so I think this should eventually go away rather than move. In the interest of keeping this backport-friendly I think just leave it as is until then?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 yep, that sounds good

We've seen a report of a node that crashed due to a divide-by-zero
hit during metrics collection, specifically when computing the
throughput-per-node by dividing the backup size by node count.

Since this is only now used for that metric, make a failure to count
nodes a warning only for release builds (and fallback to 1), and make
any error while counting, or not counting to more than 0, a returned
error in non-release builds.

Release note (bug fix): avoid crashing when BACKUP is unable to count the total nodes in the cluster.
@dt
Copy link
Copy Markdown
Contributor Author

dt commented Oct 27, 2020

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Oct 27, 2020

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Oct 27, 2020

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants