Skip to content

Don't combine distribution if count == 0#25

Merged
23Skidoo merged 1 commit intohaskell-github-trust:masterfrom
mitchellwrosen:master
Nov 7, 2018
Merged

Don't combine distribution if count == 0#25
23Skidoo merged 1 commit intohaskell-github-trust:masterfrom
mitchellwrosen:master

Conversation

@mitchellwrosen
Copy link
Copy Markdown
Contributor

This is one possible solution to #24.

@23Skidoo
Copy link
Copy Markdown
Collaborator

23Skidoo commented Apr 27, 2018

I'm not very familiar with this part of ekg, would be nice if @tibbe could take a look.

@mitchellwrosen
Copy link
Copy Markdown
Contributor Author

Bump, thoughts @tibbe? :)

@23Skidoo
Copy link
Copy Markdown
Collaborator

23Skidoo commented Nov 7, 2018

@tibbe thinks it's good to go, merging.

@23Skidoo 23Skidoo merged commit db39450 into haskell-github-trust:master Nov 7, 2018
@23Skidoo
Copy link
Copy Markdown
Collaborator

23Skidoo commented Nov 7, 2018

Merged, thanks!

@pepeiborra
Copy link
Copy Markdown

I don't know how, but this change led to a 100% reproducible lock-up in one of our EKG enabled services at work (Strats SCB). This is using 8.4.4. We forked ekg-core while preparing for an upgrade to 8.6.2.
Unfortunately I cannot provide a repro for obvious reasons.

23Skidoo added a commit that referenced this pull request Nov 19, 2018
This reverts commit db39450, reversing
changes made to 9a8e748.
@mitchellwrosen
Copy link
Copy Markdown
Contributor Author

Ouch. I'm sorry about that @pepeiborra. Can you provide any more information?

@23Skidoo
Copy link
Copy Markdown
Collaborator

23Skidoo commented Nov 19, 2018

Reverted, released a new bugfix version.

@pepeiborra
Copy link
Copy Markdown

I would love to @mitchellwrosen but I can't share any concrete artefacts from work.
The only details I have:

  • It was a hard lock-up, the Haskell scheduler stopped generating events (I was running with -N1). This was observable using the debug runtime and using +RTS -De -Ds iirc. This suggested it was an unsafe foreign call stealing the thread away from the Haskell scheduler.
  • I found the culprit only by bisecting the packages in the diff between nightly-2018-11-xx and lts-12.18

@adamse
Copy link
Copy Markdown

adamse commented Nov 20, 2018

Possible diagnosis:

The quick exit path in hs_distrib_add_n added here doesn't release the lock so any subsequent calls to hs_distrib_add_n with the same b param will wait forever in hs_lock(&b->lock).

@nh2
Copy link
Copy Markdown

nh2 commented Nov 29, 2018

  hs_lock(&b->lock);

+  if (!b->count) {
+    return;
+  }

Returning after taking a lock, this cannot work.

@mitchellwrosen
Copy link
Copy Markdown
Contributor Author

@nh2 Right, that's the obvious bug. I apologize :(

@nh2
Copy link
Copy Markdown

nh2 commented Nov 29, 2018

@mitchellwrosen Can happen! It seems ekg needs better tests if a simple change like this can take down Standard Chartered's upgrade at runtime.

EDIT: I stand corrected. ekg has no tests at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants