Skip to content

Fix out-of-bounds panic in progress printer#378

Merged
mrnugget merged 2 commits into
mainfrom
mrn/fix-out-of-bounds
Nov 16, 2020
Merged

Fix out-of-bounds panic in progress printer#378
mrnugget merged 2 commits into
mainfrom
mrn/fix-out-of-bounds

Conversation

@mrnugget

Copy link
Copy Markdown
Contributor

We ran into an out-of-bounds panic when we had more tasks reporting as
"currently running" than the number of status bars we had.

That was due to a lack of defensive checks in the face of a possible
race condition.

The race condition leads the PrintStatuses method seeing more
"currently running" tasks than there can technically be.

The race condition is between the PrintStatus method that prints the
slices of pointers to *TaskStatus and the tasks themselves updating
the status non-atomatically.

At least that's what it seems like. I'm not 100% sure about the cause
yet, but I could reproduce the issue by simply injecting more statuses
at the top of PrintStatuses. The code here fixes it.


I'm still trying to find out exactly why it happens, but I do think this code should be in there in any case, that's why I'm opening up this PR.

We ran into an out-of-bounds panic when we had more tasks reporting as
"currently running" than the number of status bars we had.

That was due to a lack of defensive checks in the face of a possible
race condition.

The race condition leads the `PrintStatuses` method seeing more
"currently running" tasks than there can technically be.

The race condition is between the `PrintStatus` method that prints the
slices of pointers to `*TaskStatus` and the tasks themselves updating
the status non-atomatically.

At least that's what it seems like. I'm not 100% sure about the cause
yet, but I could reproduce the issue by simply injecting more statuses
at the top of `PrintStatuses`. The code here fixes it.
@mrnugget mrnugget requested a review from a team November 13, 2020 16:10

@eseliger eseliger left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a little hacky but I don't know another simple way to get it in on a Friday evening, so LGTM

@chrispine chrispine left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@LawnGnome LawnGnome left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I also don't understand exactly how this happened, but as you say, let's get the defensive check in regardless.

@mrnugget

mrnugget commented Nov 16, 2020

Copy link
Copy Markdown
Contributor Author

HA! Yesterday, in the shower, I figured out how it happens!

It's a race condition between the PrintStatuses method and the (*executor).Start method enqueuing tasks one-by-one. While the executor uses a semaphore to ensure that a maximum of NumParallelism jobs are running at the same time, the status of each task can be read directly from a pointer, without any locking mechanism in place.

So what happens is this.

Say you have tasks A, B, C, D. A and B are currently running.

PrintStatus iterates over the tasks. PrintStatuses sees that A and B are running. Then it looks at C. Not running. And here's the race condition: before PrintStatus looks at job D, task A is finished (which can happen really fast, if results are cached, as was the case for the customer that ran into this) and task D is started. So now PrintStatus thinks D is running. End result: 3 tasks marked as running, even though only 2 should be.

A solution would be a lock around the []*campaigns.TaskStatus slice. I'll try to see if I can make that work. But I do think that this out-of-bounds check here is still something we should keep.

@mrnugget mrnugget merged commit 88371d6 into main Nov 16, 2020
@mrnugget mrnugget deleted the mrn/fix-out-of-bounds branch November 16, 2020 08:04
scjohns pushed a commit that referenced this pull request Apr 24, 2023
* Fix out-of-bounds panic in progress printer

We ran into an out-of-bounds panic when we had more tasks reporting as
"currently running" than the number of status bars we had.

That was due to a lack of defensive checks in the face of a possible
race condition.

The race condition leads the `PrintStatuses` method seeing more
"currently running" tasks than there can technically be.

The race condition is between the `PrintStatus` method that prints the
slices of pointers to `*TaskStatus` and the tasks themselves updating
the status non-atomatically.

At least that's what it seems like. I'm not 100% sure about the cause
yet, but I could reproduce the issue by simply injecting more statuses
at the top of `PrintStatuses`. The code here fixes it.

* Add changelog entry
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants