Fix deadlock when flushing discard stats. #976
Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
jarifibrahim
left a comment
@Sch00lb0y We don't need this patch. Look at
Lines 1392 to 1397 in 9d7b751
After flushing, we reset the update count
But over here
Lines 361 to 369 in 9d7b751
We don't reset the counter. That's why compaction triggered another flush and the db crashed.
Fixing the reset counter should fix the issue. Also, try to write a test for it.
Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @ashish-goswami, @jarifibrahim, and @manishrjain)
I've reviewed the ticket and @Sch00lb0y's feedback. This appears to be the right approach.
Reviewed with ❤️ by PullRequest
Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
jarifibrahim
left a comment
Reviewable status: 0 of 2 files reviewed, 2 unresolved discussions (waiting on @ashish-goswami, @balajijinnah, and @manishrjain)
value.go, line 1418 at r2 (raw file):
// So ignoring it.
// https://github.com/dgraph-io/badger/issues/970
if err != ErrBlockedWrites {
We don't need this. ErrBlockedWrites was being returned because updatesSinceFlush was not being reset.
Once we've reset it, the extra flush will never be triggered.
Here's what was happening in the existing implementation:
- DB was closed
- Discard stats were flushed (but the update counter was not reset)
- L0 compaction was triggered
- The compaction tried to update the discard stats, but since the update counter was still over the threshold, the update triggered another flush.
- This flush failed because the write channel was closed.
The write was triggered because the update counter was not being reset. Since you're resetting the counter now, we won't be flushing the discard stats when L0 is compacted. Hence, there will be no pushes to the write channel.
value_test.go, line 1089 at r2 (raw file):
}

func TestBlockedDiscardStats(t *testing.T) {
This test can be simplified
// Regression test for https://github.com/dgraph-io/badger/issues/970
func TestBlockedDiscardStats(t *testing.T) {
dir, err := ioutil.TempDir("", "badger-test")
require.NoError(t, err)
defer os.RemoveAll(dir)
db, err := Open(getTestOptions(dir))
require.NoError(t, err)
// Set discard stats.
db.vlog.lfDiscardStats = &lfDiscardStats{
m: map[uint32]int64{0: 0},
}
// This is important. Set updatesSinceFlush past discardStatsFlushThreshold so
// that the next update call flushes the discard stats.
db.vlog.lfDiscardStats.updatesSinceFlush = discardStatsFlushThreshold + 1
require.NoError(t, db.Close())
}
Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
jarifibrahim
left a comment
Reviewed 1 of 2 files at r2, 1 of 1 files at r3.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @ashish-goswami, @balajijinnah, and @manishrjain)
value.go, line 1393 at r3 (raw file):
if vlog.lfDiscardStats.updatesSinceFlush > discardStatsFlushThreshold {
	vlog.lfDiscardStats.Unlock() // flushDiscardStats also aquires lock. So, we need to unlock here.
Typo: aquires => acquires
value.go, line 1429 at r3 (raw file):
// encodedDiscardStats returns []byte representation of lfDiscardStats
// This will be called while storing stats in BadgerDB
// caller should aquire lock before encoding the stats.
Typo: aquire => acquire
Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
manishrjain
left a comment
Reviewed 1 of 1 files at r3, 1 of 1 files at r4.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ashish-goswami and @balajijinnah)
value.go, line 1430 at r4 (raw file):
// This will be called while storing stats in BadgerDB
// caller should acquire lock before encoding the stats.
func (vlog *valueLog) encodedDiscardStats() []byte {
Possibly use safemutex. So, you can assert that we have at least a read lock.
manishrjain
left a comment
Reviewed 1 of 1 files at r5.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ashish-goswami and @balajijinnah)
value.go, line 1415 at r5 (raw file):
Value: vlog.encodedDiscardStats(),
}}
req, err := vlog.db.sendToWriteCh(entries)
if err == ErrBlockedWrites { ... }
else if err != nil { ... }
Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
poonai
left a comment
Reviewable status: 1 of 2 files reviewed, 5 unresolved discussions (waiting on @ashish-goswami, @jarifibrahim, and @manishrjain)
value.go, line 1418 at r2 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
We don't need this. The BlockedWrites was being returned because the
updateSinceFlush was not being reset.
Once we've reset it, the writes will never be triggered. Here's what was happening in the existing implementation
- DB was closed
- Discard stats were flushed (but the update counter was not reset)
- L0 compaction was triggered
- The compaction tried to update the discard stats but since the update counter was on the edge of the threshold, the compaction caused a flush.
- This flush failed because the write channel was closed.
The write was triggered because the update counter was not being reset. Since you're resetting the counter now, we won't be flushing the discard stats when L0 is compacted. Hence, there will be no pushes to the write channel.
Done.
value.go, line 1393 at r3 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
Typo: aquires => acquires
Done.
value.go, line 1429 at r3 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
Typo: aquire => acquire
Done.
value.go, line 1415 at r5 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
if err == ErrBlockedWrites { ... } else if err != nil { ... }
Done.
value_test.go, line 1089 at r2 (raw file):
Previously, jarifibrahim (Ibrahim Jarif) wrote…
This test can be simplified
// Regression test for https://github.com/dgraph-io/badger/issues/970
func TestBlockedDiscardStats(t *testing.T) {
	dir, err := ioutil.TempDir("", "badger-test")
	require.NoError(t, err)
	defer os.Remove(dir)
	db, err := Open(getTestOptions(dir))
	require.NoError(t, err)
	// Set discard stats.
	db.vlog.lfDiscardStats = &lfDiscardStats{
		m: map[uint32]int64{0: 0},
	}
	// This is important. Set updatesSinceFlush past discardStatsFlushThreshold so
	// that the next update call flushes the discard stats.
	db.vlog.lfDiscardStats.updatesSinceFlush = discardStatsFlushThreshold + 1
	require.NoError(t, db.Close())
}
Done.
manishrjain
left a comment
Reviewed 1 of 1 files at r6.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @ashish-goswami, @jarifibrahim, and @manishrjain)
@campoy also had a look at this PR and said it looks like the right change to make. Given the LGTMs all around, I'll merge this.
* In updateDiscardStats, the vlog.lfDiscardStats lock was acquired twice: once in updateDiscardStats and then a second time when calling flushDiscardStats. Now, we first release the lock before calling flushDiscardStats.
* Don't return an error if writes are blocked for discard stats.
* Add tests and regression tests.

Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
(cherry picked from commit 398445a)
Signed-off-by: பாலாஜி ஜின்னா <balaji@dgraph.io>
Fixes #970 and #976