Persist meta records to Cassandra index by replay · Pull Request #1471 · grafana/metrictank

replay · 2019-09-23T09:19:12Z

This implements persisting meta records into the persistent Cassandra index. It does not implement the same functionality for the persistent BigTable index yet.

woodsaj · 2019-09-23T13:40:26Z

idx/cassandra/cassandra.go

 	statSaveSkipped = stats.NewCounter32("idx.cassandra.save.skipped")
 	errmetrics      = cassandra.NewErrMetrics("idx.cassandra")
+
+	metaRecordRetryPolicy = gocql.ExponentialBackoffRetryPolicy{


This retry policy seems a bit aggressive. It will retry 10 times in less than 4 seconds.

What do you think would make more sense?

What failure modes are the retries there to defend against?

Network issues, and the duration of those is not really predictable. If a Cassandra pod restarts it can take over a minute for it to come back, so I guess the max backoff should be around 2 minutes then?

If a Cassandra pod restarts, the request will just be retried on another pod.

I guess the real question is: how long can you afford to wait?
If you are protecting against network issues, then attempts will fail pretty quickly if the network is down. So you can work out the min and max backoff that will allow retries up until your max desired execution time. E.g., if these requests are originating from HTTP requests, then you could wait up to 60 seconds (but it is probably best to target 30s or 45s).

Good point, it should definitely be under the HTTP request timeout. These requests are originating from an HTTP request, so in case of a final error we want to return an error to the HTTP client.

I explained the calculations in the comment:
88c309b

woodsaj · 2019-09-23T13:53:19Z

idx/cassandra/cassandra.go

+// schemaFile:  file containing table definition
+// entryName:   identifier of the schema within the file
+// tableName:   name of the table in cassandra
+func (c *CasIdx) EnsureTableExists(session *gocql.Session, schemaFile, entryName, tableName string) error {


Instead of passing the schemaFile name, you should just pass an io.Reader that returns the schema file content. This will make testing easier and lets other callers use the method without first having to write a schemaFile to disk.

Good idea, will do that.

Since we're using go-toml to read those schema files (https://github.com/grafana/metrictank/blob/master/util/template.go), I'd first need to read the correct entry of the schema file into a string, which is what util.ReadEntry() returns, then wrap it in a reader, pass that reader into EnsureTableExists, and finally read it back into a string via a buffer.
That seems like an unnecessary complication. Wouldn't it be better to just pass a string with the table schema into EnsureTableExists, instead of a reader? Then we still get the benefit of being able to unit test it, but we don't need to construct a reader and a buffer to read the reader into.

What do you think about doing something like this:
0f0303e

I have pushed that into another branch, because I feel like it's getting out of scope of this PR. But with this, initializing the cassandra index and stores becomes much more flexible, because we can either provide a custom io.Reader to the <config>.ParseSchemas() functions, or directly set the schema we want in the corresponding config struct.

fkaleo · 2019-09-24T16:59:05Z

Looks like we will have to copy the meta records during migrations.

replay · 2019-09-25T05:24:05Z

@fkaleo good point, I hadn't thought about that. It should be pretty simple though: the amount of data in the meta_records table should be relatively small, so a simple dump, transfer and insert should be good enough.

woodsaj · 2019-09-25T12:29:27Z

idx/cassandra/cassandra.go

+}
+
+func (c *CasIdx) MetaTagRecordSwap(orgId uint32, records []tagquery.MetaTagRecord, persist bool) (uint32, uint32, error) {
+	added, deleted, err := c.MemoryIndex.MetaTagRecordSwap(orgId, records, persist)


Doesn't adding to the memoryIdx before adding to cassandra lead to things getting out of sync? E.g., the memoryIdx gets updated, but the batch update to cassandra fails.

The plan is to reload from cassandra at some interval anyway, at which point the state of the memory index would get reset to what it previously was.
The reason for updating the memory index first is mainly that it lets us detect whether anything has changed, or whether the posted set of meta records is just the same as the old one. This is implemented in this subsequent PR:
https://github.com/grafana/metrictank/pull/1480/files#diff-c6ce4629577f4e064a1d2f636acc0568R496

I guess it would also be reasonable to just always flush to cassandra first, even if there was no change. And if the query to cassandra failed we don't update the memory index. I'm going to do that.

I forgot, there is one more reason to do the memory index update first. On upsert we want to be able to know whether a record has been created, or if it only updates an existing one. When one gets created then we set the createdat timestamp in cassandra, otherwise we only update the lastupdate timestamp.
We currently decide this based on the returned status from the Memory Index.
I don't think we really need the createdat column though, we could also just remove that one.

done: 6a73386

replay · 2019-10-15T12:58:22Z

Going to push some more modifications, according to the plan defined here:

https://docs.google.com/document/d/1arRpSuoecqOV8rA0Bus0EphX15dZliQSlzWkRbNWDpo/edit?usp=sharing

replay · 2019-10-22T15:24:10Z

FYI I deployed this branch in my QA instance to test it. I've done some upserts and some swaps; while doing them I verified that cassandra gets updated as expected, and I also queried multiple read pods to verify that they load the rules correctly.
AFAICT it looks fine.

robert-milan

LGTM

replay requested review from Dieterbe, fkaleo and woodsaj September 23, 2019 11:08

woodsaj reviewed Sep 23, 2019

replay force-pushed the persist_meta_records branch 2 times, most recently from 88c309b to b75e1c5 Compare September 24, 2019 14:28

replay force-pushed the persist_meta_records branch from 8211c8f to 5d90577 Compare September 25, 2019 05:18

replay mentioned this pull request Sep 25, 2019

[WIP] Regularly reload meta records #1480

Closed

woodsaj reviewed Sep 25, 2019

replay force-pushed the persist_meta_records branch from 6a73386 to 3e5c321 Compare October 14, 2019 12:04

replay force-pushed the persist_meta_records branch 16 times, most recently from 5d8eee9 to ffae89c Compare October 17, 2019 18:14

replay force-pushed the persist_meta_records branch 9 times, most recently from 168c91b to 86d6dd7 Compare October 21, 2019 11:56

replay added 3 commits October 28, 2019 15:30

remove cluster fanout of rule changes via http

63cb7d8

persist meta records into cassandra

495f6f7

regularly poll store for meta record changes

8108da5

replay force-pushed the persist_meta_records branch from 366fd96 to 0d19127 Compare October 28, 2019 18:32

replay added 6 commits October 28, 2019 15:32

compare meta tag records before swapping by hashing

e85c039

implement meta record pruning

28113d5

add meta record schema to scylladb

401179e

update breaking changes to inform about new tables

58c909e

update docs

574dd4e

backwards compatible to older go versions

7be06c4

replay force-pushed the persist_meta_records branch from 0d19127 to 7be06c4 Compare October 28, 2019 18:47

robert-milan self-requested a review October 29, 2019 17:11

robert-milan approved these changes Oct 29, 2019

replay merged commit 557eea9 into master Oct 29, 2019

replay deleted the persist_meta_records branch October 29, 2019 21:25

4 participants

replay Sep 24, 2019 •

edited

Loading

replay Sep 25, 2019 •

edited

Loading