Modify unchecked diffs query by elizabethengelman · Pull Request #84 · sky-ecosystem/vulcanizedb

elizabethengelman · 2020-04-28T22:19:59Z

Proof of concept that this approach will work for not returning diffs from GetNewDiffs if they're from blocks that are "old". Randomly chose the threshold of "old" to be from blocks that are greater than 500 back from the most recent header we have in the system, but totally happy to rethink this/do some more digging into a more appropriate threshold.

need to update the execute dockerfile to use this flag: Add DIFF_BLOCK_FROM_HEAD_OF_CHAIN to execute docker startup script vdb-mcd-transformers#184


rmulhol · 2020-04-29T00:13:12Z

libraries/shared/storage/diff_repository.go

+func (repository diffRepository) GetFirstDiffForBlockHeight(blockHeight int64) (types.PersistedDiff, error) {
+	var diff types.PersistedDiff
+	err := repository.db.Get(&diff,
+		`SELECT * FROM public.storage_diff WHERE checked IS false AND block_height = $1 LIMIT 1`, blockHeight)


wondering if it might make sense to say block_height >= $1 ORDER BY block_height ASC LIMIT 1 in case we don't have a diff at exactly that block height

ah, good catch!
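To make the suggestion above concrete, here is a minimal in-memory sketch of what `block_height >= $1 ORDER BY block_height ASC LIMIT 1` would return when no diff sits at exactly the requested height. The `diff` struct, `errNoDiff`, and `firstDiffAtOrAfter` are hypothetical names for illustration only; they are not part of the repository code in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// diff is a hypothetical, pared-down stand-in for a storage_diff row.
type diff struct {
	ID          int64
	BlockHeight int64
}

// errNoDiff plays the role that sql.ErrNoRows would play against a real database.
var errNoDiff = errors.New("no diff at or after block height")

// firstDiffAtOrAfter mimics
//   WHERE block_height >= $1 ORDER BY block_height ASC LIMIT 1
// over an in-memory slice instead of a database table.
func firstDiffAtOrAfter(diffs []diff, blockHeight int64) (diff, error) {
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]diff(nil), diffs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].BlockHeight < sorted[j].BlockHeight
	})
	for _, d := range sorted {
		if d.BlockHeight >= blockHeight {
			return d, nil
		}
	}
	return diff{}, errNoDiff
}

func main() {
	diffs := []diff{{ID: 1, BlockHeight: 100}, {ID: 2, BlockHeight: 105}}
	// There is no diff at exactly block 102, so the lookup falls through
	// to the diff at block 105 rather than failing.
	d, err := firstDiffAtOrAfter(diffs, 102)
	fmt.Println(d.ID, err)
}
```

With an exact-match query the same lookup would have returned no rows, which is the case the review comment is guarding against.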

rmulhol

Awesome work, and impressive turnaround! 😎

My main area of interest is minimizing the likelihood of GetFirstDiffForBlockHeight (or, less likely, GetMostRecentHeader) returning an error (esp something like sql.ErrNoRows) so that we don't kill/restart the container unless things are really out of whack.

Other than that I'm curious how we can minimize the effort needed to tune the BlocksBackFromHead number up or down depending on observed results.

Overall, LGTM! 👍

rmulhol · 2020-04-29T00:15:27Z

libraries/shared/watcher/storage_watcher.go

+			return getHeaderErr
+		}
+		blockNumber := mostRecentHeader.BlockNumber - BlocksBackFromHead
+		diff, getDiffErr := watcher.StorageDiffRepository.GetFirstDiffForBlockHeight(blockNumber)


Don't know if this makes any difference but maybe this function could just return the ID if that's all we want from the diff

Good call - I had initially thought that this method could be useful in the future if it returned the whole diff, but we'll let future us deal with that, just in case it never is necessary.

rmulhol · 2020-04-29T00:20:24Z

libraries/shared/watcher/storage_watcher.go

-var ResultsLimit = 500
+var (
+	ResultsLimit       = 500
+	BlocksBackFromHead = int64(500)


Thinking that we might want this number to be a good bit higher - at least the first time we run this, since it's possible we could be at a really bad point in terms of cycling through the unchecked (and un-transformable) diffs. I wonder if maybe instead of skipOldDiffs being a bool, it could be an int64 configurable via the CLI? Then we could easily bump the number up or down depending on the results we observe

I really like the idea of passing in an int instead!
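A rough sketch of what an integer setting in place of the bool could look like, using only Go's standard `flag` package. The flag name `diff-blocks-from-head`, the function `parseBlocksBack`, and the default of -1 (meaning "feature off", as floated later in this thread) are assumptions for illustration; the actual CLI wiring in vulcanizedb may differ.

```go
package main

import (
	"flag"
	"fmt"
)

// parseBlocksBack reads a hypothetical -diff-blocks-from-head flag from args.
// A value of -1 is assumed here to mean "do not skip old diffs".
func parseBlocksBack(args []string) (int64, error) {
	fs := flag.NewFlagSet("execute", flag.ContinueOnError)
	blocksBack := fs.Int64("diff-blocks-from-head", -1,
		"number of blocks back from the most recent header to start checking diffs; -1 disables")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *blocksBack, nil
}

func main() {
	// Bumping the threshold up or down becomes a matter of changing one argument.
	n, err := parseBlocksBack([]string{"-diff-blocks-from-head=5000"})
	fmt.Println(n, err)
}
```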

rmulhol · 2020-04-29T00:22:50Z

libraries/shared/watcher/storage_watcher.go

-	minID := 0
+	var minID int
+	if watcher.SkipOldDiffs {
+		mostRecentHeader, getHeaderErr := watcher.HeaderRepository.GetMostRecentHeader()


Maybe this could just return the block number?


Previously we were passing the id of the first diff for a given block to GetNewDiffs, but that method only looks at diffs with ids greater than the given minID, so the diff with the passed minID isn't included. That is fine under normal circumstances, since minID starts at 0.

elizabethengelman · 2020-04-29T22:32:52Z

libraries/shared/watcher/storage_watcher.go


-		minID = int(diffID)
+		minID = int(diffID - 1)
 	}


Just pushed this change up for now... wondering if a cleaner approach would be to change the GetNewDiffs query to look at diffs where id >= instead of just >, but think I need to look at that again with fresh eyes tomorrow.

Makes sense to prefer >= to > 👍

Not sure I follow why we need to set minID to diffID - 1 though?

Cool, i'll make update to >=.
I needed to change the minID to diffID - 1 because when passing in a block to start looking at diffs, we would use the id from the first diff for that block. Then if we pass that id to GetNewDiffs it would look up diffs with an id greater than the given id, so that diff would never be processed. So by making the minID one less than that diff's id, it would ensure that it would get processed. Does that make sense?

All that being said, changing GetNewDiffs to be inclusive for the minID seems a lot easier.

Actually now that I look at it a bit closer, if we change GetNewDiffs to look at ids >= minID then with the current watcher implementation we'd end up rechecking the last diff on the collection each time... or we'd need to increment the minID to avoid that. 🤔 Really not sure which approach is clearer.

just writing out the 2 options again to help myself understand the two approaches:

if we keep GetNewDiffs as is with id > minId then we will need to pass in an id less than the diffId that we’re getting from GetFirstDiffIDForBlockHeight otherwise that id won’t ever be included in GetNewDiffs, so that is why I was passing in firstDiffId -1

on the flip side, if we change GetNewDiffs to do id >= minId then whenever we pass in an updated minID which is the last id from the set of diffs returned, then we’d end up returning that last diff from GetNewDiffs again. So it seems like we’d need to pass in lastDiffId +1 as the new diff id
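The first option can be sketched in memory as follows. This is an illustration under stated assumptions, not the watcher's actual code: `newDiffIDsExclusive` stands in for a `GetNewDiffs` query using `id > minID`, ids are assumed to arrive in ascending order, and `processAll` is a hypothetical paging loop.

```go
package main

import "fmt"

// newDiffIDsExclusive mimics a query with `id > minID` and a row limit.
// ids is assumed to be sorted in ascending order.
func newDiffIDsExclusive(ids []int64, minID int64, limit int) []int64 {
	var out []int64
	for _, id := range ids {
		if id > minID && len(out) < limit {
			out = append(out, id)
		}
	}
	return out
}

// processAll pages through ids starting at firstDiffID using the exclusive bound.
func processAll(ids []int64, firstDiffID int64, limit int) []int64 {
	var processed []int64
	// Start one below the first diff so that firstDiffID itself satisfies id > minID.
	minID := firstDiffID - 1
	for {
		page := newDiffIDsExclusive(ids, minID, limit)
		if len(page) == 0 {
			return processed
		}
		processed = append(processed, page...)
		// With an exclusive bound, the last id seen can be reused directly:
		// the next page starts after it, so nothing is re-checked.
		minID = page[len(page)-1]
	}
}

func main() {
	ids := []int64{7, 8, 9, 10, 11}
	fmt.Println(processAll(ids, 8, 2))
}
```

With an inclusive bound (`id >= minID`) the trade-off flips: the starting id could be passed unchanged, but the loop would need `minID = page[len(page)-1] + 1` to avoid returning the last diff of each page again.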

elizabethengelman · 2020-04-29T22:46:32Z

libraries/shared/storage/diff_repository.go

 	err := repository.db.Get(&diffID,
-		`SELECT id FROM public.storage_diff WHERE checked IS false AND block_height >= $1 LIMIT 1`, blockHeight)
+		`SELECT id FROM public.storage_diff WHERE block_height >= $1 LIMIT 1`, blockHeight)



I figured that this method doesn't necessarily need to care if a diff is checked or not, since we're doing that anyway with GetNewDiffs. Also, I was seeing some failures locally: when execute caught up with all unchecked diffs, this method would fail because the query returned no rows.

This also got me thinking that we'll probably need to tweak this method and MostRecentHeaderBlockNumber for fresh deployed systems - again, will plan to look at that again tomorrow with fresh eyes.

Nice catch! Makes me think maybe we want to handle sql.ErrNoRows upstream, since I imagine we could probably end up in this situation in other weird scenarios like if the interval is low and there aren't a lot of diffs being generated.

Also 👍 to thinking about freshly deployed systems. I wish it were as simple as updating docs to suggest not using this feature in a fresh deploy, but suppose that might not be feasible if we're defining the interval in the image. Maybe we could require the interval to be set via an env var that's passed to the image?

Was just checking out the PR updating the image and that looks legit - if we default to -1 but enable config, then fresh deployed systems can just not use this feature?

yeah, that's a good idea. also, i added a guard in the watcher so that if either the headers or the storage_diff table returns a sql.ErrNoRows we just set the minID to 0, so at least the process won't fall over
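A guard of the kind described could look roughly like this. It is a sketch, not the code from the PR: `minIDOrZero` and the `lookup` callback are hypothetical, and `emptyTable`, `brokenConn`, and `foundDiff` exist only to exercise the three paths. The only library behaviour relied on is that `database/sql` exposes the sentinel `sql.ErrNoRows` and that `errors.Is` matches it.

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// minIDOrZero falls back to a minID of 0 when the lookup finds no rows,
// and only surfaces errors that indicate something is actually wrong.
// lookup is a hypothetical stand-in for the repository call.
func minIDOrZero(lookup func() (int64, error)) (int64, error) {
	diffID, err := lookup()
	if errors.Is(err, sql.ErrNoRows) {
		// Fresh deploy or nothing at that height yet: start from the beginning.
		return 0, nil
	}
	if err != nil {
		return 0, err
	}
	// GetNewDiffs is described in this thread as using id > minID,
	// so step back by one to include the diff that was found.
	return diffID - 1, nil
}

// The three helpers below simulate repository outcomes for the example.
func emptyTable() (int64, error) { return 0, sql.ErrNoRows }
func brokenConn() (int64, error) { return 0, errors.New("connection refused") }
func foundDiff() (int64, error)  { return 42, nil }

func main() {
	fmt.Println(minIDOrZero(emptyTable))
	fmt.Println(minIDOrZero(foundDiff))
	fmt.Println(minIDOrZero(brokenConn))
}
```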

rmulhol · 2020-04-30T14:59:01Z

libraries/shared/storage/diff_repository_test.go

 				RawDiff: fakeRawDiff,
 				ID:      rand.Int63(),
-				Checked: true,
+				Checked: false,


rmulhol · 2020-04-30T15:00:28Z

libraries/shared/storage/diff_repository_test.go

+			Expect(insertErr).NotTo(HaveOccurred())
+
+			blockBeforeDiffBlockHeight := int64(fakeRawDiff.BlockHeight - 1)
+			_, diffErr := repo.GetFirstDiffIDForBlockHeight(blockBeforeDiffBlockHeight)


Maybe worth indicating here what is returned? Assuming it's the same as when the diff is unchecked?

rmulhol · 2020-04-30T15:00:51Z

libraries/shared/watcher/storage_watcher.go

-var ResultsLimit = 500
+var (
+	ResultsLimit       = 500
+	BlocksBackFromHead = int64(500)


Can probably remove this now?

rmulhol · 2020-04-30T15:04:26Z

pkg/datastore/postgres/repositories/header_repository.go

+func (repository HeaderRepository) GetMostRecentHeaderBlockNumber() (int64, error) {
+	var blockNumber int64
+	err := repository.database.Get(&blockNumber,
+		`SELECT block_number FROM headers ORDER BY block_number DESC LIMIT 1`)


Any idea how this compares performance-wise with select max(block_number) from headers?

good question! looks like SELECT block_number FROM headers ORDER BY block_number DESC LIMIT 1 may be slightly more performant?

explain SELECT block_number FROM headers ORDER BY block_number DESC LIMIT 1;
Limit (cost=0.43..0.47 rows=1 width=8)
  -> Index Only Scan Backward using headers_block_number_eth_node_id_key on headers (cost=0.43..113428.77 rows=2925850 width=8)
=======================
explain select max(block_number) from headers;
Result (cost=0.47..0.48 rows=1 width=8)
  InitPlan 1 (returns $0)
    -> Limit (cost=0.43..0.47 rows=1 width=8)
      -> Index Only Scan Backward using headers_block_number_eth_node_id_key on headers (cost=0.43..120743.40 rows=2925850 width=8)
           Index Cond: (block_number IS NOT NULL)

rmulhol · 2020-04-30T15:06:45Z

pkg/datastore/postgres/repositories/header_repository_test.go

+		It("returns an error if it fails to get the most recent header", func() {
+			_, err := repo.GetMostRecentHeaderBlockNumber()
+			Expect(err).To(HaveOccurred())
+			Expect(err).To(MatchError(sql.ErrNoRows))


This seems like potentially an issue for freshly deployed systems as well, if execute starts before header sync. Wondering if we want to handle sql.ErrNoRows in the storage watcher from both of these repository calls - seems like maybe we could delay + retry, set minID to 0, or something

paytonrules · 2020-04-30T17:26:17Z

libraries/shared/watcher/storage_watcher.go

+		minID = int(diffID - 1)
+	}
+
+	return minID, nil


So if DiffBlocksFromHeadOfChain is -1 you'll return 0 for minID. Is that deliberate?

Yes, the idea is that if DiffBlocksFromHeadOfChain is -1 then we'll want the minID to be 0. Golang sets the starting value of an int to 0 if you don't define it (line 98), but perhaps it would be clearer to be explicit and set var minID = 0 instead.

I think it's the "-1" that's bothering me - it's a magic number that the execute.go file has to know about to start from the beginning.

Honestly it's not super important, just mildly confusing.

That's super fair w/r/t the magic number! I've added a comment explaining a bit more, and may also pull the -1 out into a variable/constant to give a bit more context.
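One way the sentinel could be given a name is sketched below. The constant `startFromBeginning` and the helper `firstBlockToCheck` are hypothetical and are not taken from the PR; the sketch works in block numbers rather than diff ids to keep it small, and the clamp to 0 for short chains is an added assumption.

```go
package main

import "fmt"

// startFromBeginning is a hypothetical name for the -1 sentinel:
// it signals that no old diffs should be skipped.
const startFromBeginning int64 = -1

// firstBlockToCheck returns the block height from which diffs are considered.
func firstBlockToCheck(headBlock, blocksBack int64) int64 {
	if blocksBack == startFromBeginning {
		return 0
	}
	// Clamp at 0 so a short chain never produces a negative block height.
	if start := headBlock - blocksBack; start > 0 {
		return start
	}
	return 0
}

func main() {
	fmt.Println(firstBlockToCheck(1000, startFromBeginning))
	fmt.Println(firstBlockToCheck(1000, 500))
}
```

Callers such as the execute command would then refer to the constant by name instead of needing to know that -1 is special.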

paytonrules

There's a lot of context I don't understand here yet, but LGTM.

elizabethengelman added 5 commits April 28, 2020 15:48

Add GetMostRecentHeader method on header repo

017b31a

Add a method to get the first storage diff for a given block

2e4509a

Skip getting unchecked diffs that are older than 500 back from the head of the chain

a2056b3

Apply go fmt changes

9fbe0c4

Allow for skipping old diffs via a flag

d497e64

elizabethengelman requested review from paytonrules, rmulhol and yaoandrew April 28, 2020 22:22

rmulhol reviewed Apr 29, 2020

rmulhol approved these changes Apr 29, 2020

elizabethengelman added 4 commits April 29, 2020 15:26

Return diffID from GetFirstDiffIDForBlockHeight instead of full diff

65b72c9

Pass in number of blocks back from head to start getting unchecked diffs

0da58f5

Have MostRecentHeaderBlockNumber return block number instead of full header

6474134

elizabethengelman commented Apr 29, 2020

Handle edge case of all diffs already checked

5ed10bc

elizabethengelman commented Apr 29, 2020

elizabethengelman requested a review from rmulhol April 29, 2020 22:46

elizabethengelman mentioned this pull request Apr 30, 2020

Add DIFF_BLOCK_FROM_HEAD_OF_CHAIN to execute docker startup script sky-ecosystem/vdb-mcd-transformers#184

Merged

elizabethengelman force-pushed the modify-unchecked-diffs-query branch from b03a3be to 5ed10bc Compare April 30, 2020 14:51

rmulhol reviewed Apr 30, 2020

Handle fresh deploy where there are no diffs and/or no headers yet

b9c32d5

elizabethengelman force-pushed the modify-unchecked-diffs-query branch from 6720333 to d9f7c78 Compare April 30, 2020 16:43

paytonrules reviewed Apr 30, 2020

elizabethengelman force-pushed the modify-unchecked-diffs-query branch from d9f7c78 to 88802b7 Compare April 30, 2020 17:43

paytonrules approved these changes Apr 30, 2020

elizabethengelman force-pushed the modify-unchecked-diffs-query branch from 88802b7 to ecfa286 Compare April 30, 2020 18:44

Add PR feedback and some cleanup

307df45

elizabethengelman force-pushed the modify-unchecked-diffs-query branch 2 times, most recently from 6db0f51 to 307df45 Compare April 30, 2020 22:23

elizabethengelman merged commit a6b3c0b into staging Apr 30, 2020

elizabethengelman deleted the modify-unchecked-diffs-query branch April 30, 2020 22:41
