Skip to content
This repository was archived by the owner on Aug 23, 2023. It is now read-only.

Tag index#729

Merged
Dieterbe merged 72 commits intomasterfrom
tag_index
Oct 13, 2017
Merged

Tag index#729
Dieterbe merged 72 commits intomasterfrom
tag_index

Conversation

@replay
Copy link
Copy Markdown
Contributor

@replay replay commented Sep 14, 2017

Adds a tag index, together with the functions to query it.
It does not expose any API endpoints yet, those will follow in a separate PR (from branch https://github.com/grafana/metrictank/compare/tag_api_endpoints).

Comment thread idx/memory/memory.go Outdated
tags[tagSplits[0]] = make(map[string][]string)
}

tags[tagSplits[0]][tagSplits[1]] = append(tags[tagSplits[0]][tagSplits[1]], def.Name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to init tags[tagSplits[0]][tagSplits[1]] first? this code will be much more readable if you say tagName := tagSplits[0] tagValue := tagSplits[1]

Copy link
Copy Markdown
Contributor Author

@replay replay Sep 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be necessary to init slices i think, only maps. Ok, i'll name them

Comment thread idx/memory/memory.go Outdated
return m.idsByTagQuery(tree, query)
}

func (m *MemoryIdx) idsByTagQuery(tags map[string]map[string][]string, query tag.TagQuery) []string {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function supposed to be returning the ids that match ANY expression or ALL expressions? Seems like it's ANY right now. I suppose this function is designed to be composable to support ALLs?

Copy link
Copy Markdown
Contributor

@DanCech DanCech Sep 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@replay was confused, it should be ALL. He's sorting that out now.

From an end-user perspective, ANY matches can be done via a combination of using regular expression matches and/or using multiple calls to seriesByTag and combining the results with group

Copy link
Copy Markdown
Contributor Author

@replay replay Sep 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@replay replay force-pushed the tag_index branch 3 times, most recently from d7dcf96 to d375fb9 Compare September 15, 2017 10:14
Comment thread idx/memory/tag_query.go Outdated
value := exprSplits[0][3]

// always anchor all regular expressions at the beginning
if operator[len(operator)-1] == byte('~') && len(value) > 0 && value[len(value)-1] != byte('^') {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value[0] != byte('^')

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread idx/memory/tag_query.go

// special case of empty value
if len(value) == 0 {
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a no-op

Copy link
Copy Markdown
Contributor Author

@replay replay Sep 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, still going to add that. the idea is that i'll just reverse the operation.

so key!=~ becomes key=~.* and key=~ becomes key!=~.*

Copy link
Copy Markdown
Contributor Author

@replay replay Sep 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then, further down where the matching is done, i could also shortcut it to not even do the matching if the pattern is .* or .+. (for .+ this works assuming that no value that's in the index can be "")

Comment thread idx/memory/tag_query.go Outdated
return ids, nil
}
return make(map[string]struct{}), nil
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this just be return idx[key][value], nil?

Copy link
Copy Markdown
Contributor Author

@replay replay Sep 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread idx/memory/tag_query.go Outdated
return nil, err
}

for v, ids := range values {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question, it seems like this could just be for v, ids := range idx[key] {

Comment thread idx/memory/tag_query.go Outdated
return nil, errInvalidQuery
}

exprSplits := expressionRe.FindAllStringSubmatch(expr, -1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread idx/memory/tag_query.go
value = ".+"
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do better in handling terms that match the empty string, since we have a requirement that there must be a tag expression that doesn't match the empty string. Because of that requirement, we should be able to make a list of candidate series from the expression that requires a match, then use those as the list of "all" series that should be considered to match if they don't have the tags that we're matching the empty string for. I didn't implement that concept in the graphite tagdbs, but I'm going to go back and see whether I can use it.

Copy link
Copy Markdown
Contributor Author

@replay replay Sep 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so i'm not completely sure i understand you right. but actually i think this already does what you describe:

  • further down queries are divided into "selects" and "filters". any query that's excluding something is a "filter", any query that's selecting by a match is a "select".
  • a query like key= is translated into key!=.+, this is a filter
  • the "select" queries are always executed first, building that list of candidates you mention
  • after the list of candidates is built, the "filters" are applied

we can further speed this up by not even matching the regex .+ because we can't have empty tag values in the index. so every series that has the tag key will match the value .+, so we can shortcut it and skip the regex match.

@replay replay force-pushed the tag_index branch 2 times, most recently from ebc48cf to e30aee4 Compare September 29, 2017 15:11
@Dieterbe
Copy link
Copy Markdown
Contributor

FTR: prior discussion: #532
feel free to link to any other tickets or discussions

@replay replay changed the title [WIP] (do not merge yet) Tag index Tag index Oct 1, 2017
@replay replay requested review from Dieterbe and woodsaj October 1, 2017 10:38
@replay replay force-pushed the tag_index branch 3 times, most recently from 7402db6 to a4ee0e2 Compare October 1, 2017 10:45
Comment thread idx/memory/memory_test.go Outdated
{"key1=~value[0-9]", "key2=~", "key3!=value3"},
{"key2=", "key1=value1"},
}
expecting := []int{3, 1, 4, 1, 0, 1, 2, 0, 1, 2}
Copy link
Copy Markdown
Contributor

@Dieterbe Dieterbe Oct 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make 1 slice of cases, where each case shows its expression and which metrics are expected to match.
just checking the number of matches can hide bugs, is hard for readers, and having 2 structures is a bit confusing

Comment thread idx/cassandra/cassandra.go Outdated

func (c *CasIdx) IdsByTagExpressions(orgId int, expressions []string) ([]string, error) {
return c.MemoryIdx.IdsByTagExpressions(orgId, expressions)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these additions should not be necessary. go compiler should automatically wire these through since MemoryIdx is embedded into the CasIdx struct (that's why Find, List, etc work also on this index)

Comment thread idx/memory/memory.go Outdated

Enabled bool
Enabled bool
TagSupport bool
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need to export this i think

Comment thread idx/memory/memory.go Outdated
func (m *MemoryIdx) addTags(def *schema.MetricDefinition) {
tags, ok := m.Tags[def.OrgId]
if !ok {
tags = make(map[string]map[string]map[string]struct{})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can make(TagIndex) here

Comment thread idx/memory/memory.go Outdated
tagValue := tagSplits[1]

if _, ok = tags[tagName]; !ok {
tags[tagName] = make(map[string]map[string]struct{})
Copy link
Copy Markdown
Contributor

@Dieterbe Dieterbe Oct 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should give these substructures a type, so that it's easier/clearer to create them

Copy link
Copy Markdown
Contributor Author

@replay replay Oct 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be good, but I can't come up with a good name :)
The top level structure is currently named TagIndex. I suggest the next one could be TagKey and it's nested values would be called TagValue. So each TagValue would then contain a set of IDs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about:

TagIndex -> TagValues -> MetricDefinitions

Comment thread idx/memory/memory.go Outdated
m.Unlock()
}

func (m *MemoryIdx) addTags(def *schema.MetricDefinition) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func name is a bit confusing. we're indexing tags (or "adding tags to the index") for a metricdefinition that already has tags. so maybe call this indexTags or at least at a function comment explaining what this does. thanks :)

Comment thread idx/memory/memory.go Outdated
}
}

func (m *MemoryIdx) delTags(def *schema.MetricDefinition) {
Copy link
Copy Markdown
Contributor

@Dieterbe Dieterbe Oct 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here as above. clarify a bit via function comment, thanks (could be confused for "delete tags from a metric")

@Dieterbe
Copy link
Copy Markdown
Contributor

Dieterbe commented Oct 1, 2017

my main concern here is the amount of strings used. every string has a pointer internally, and GC workload is directly proportional to amount of live pointers on the heap. and we already have a too-high number of strings and pointers due to legacy reasons (eg all the AggMetric stuff, and other stuff in the current index), which results in a high GC workload, which combined with golang/go#14812 results occassionally in poor latencies.

so, I think we need a way to verify if this concern holds any ground, with minimal time investment.

should be pretty trivial to spin up a docker stack, use fakemetrics for ingest workload, and vegeta for query workload, and compare both cases. I can help with the specifics, i've done this a bunch of times before. see also #440 and https://github.com/grafana/metrictank/blob/7a508e43e9b48af9ecdf1723c91dd75e1b4ca212/docker/docker-cluster/load.sh

Comment thread idx/idx.go Outdated
Prune(int, time.Time) ([]Archive, error)
TagList(int, uint32) []string
Tag(int, string) map[string]uint32
IdsByTagExpressions(int, []string) ([]string, error)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 3 functions are not called by anything, it's not really clear what they are for

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding comments

Comment thread idx/memory/memory.go Outdated
memoryIdx := flag.NewFlagSet("memory-idx", flag.ExitOnError)
memoryIdx.BoolVar(&Enabled, "enabled", false, "")
memoryIdx.BoolVar(&tagSupport, "tag-support", false, "")
memoryIdx.IntVar(&matchCacheSize, "match-cache-size", 1000, "")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fill in help message, and add these settings, along with the same help message and default value to metrictank-sample.ini, then run the sync-configs script to apply it to the other files as well. and config-to-doc.sh

Dieterbe and others added 7 commits October 12, 2017 19:23
no need to loop twice
no need to do map lookups if key doesn't even match
only need to track values, not entire k=v pair
* add a corruption case metric+log
* standardize the format of all recoverable internal errors
  makes a simpler, cleaner namespace, on which we can easily alert.
* standardize terms:
  - corrupt for internal datastructures
  - invalid for user data
* no need to export vars
Comment thread idx/memory/tag_query.go
@@ -328,11 +318,19 @@ func (q *TagQuery) filterByMatch(resultSet TagIDs, byId map[string]*idx.Archive,
continue
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just break? There shouldn't be another tag value for this tag key, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm github is not showing anything besides the continue line. are you talking about the case where we found an entry in notMatchingTags ?

Copy link
Copy Markdown
Contributor Author

@replay replay Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shanson7 you're right. each def should only have one value per key

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i gave the inner loop a name for consistency with the outer one: 6fb73bf

@Dieterbe
Copy link
Copy Markdown
Contributor

@shanson7 this will be merged very soon now, unless you have any more feedback :)
main thing missing is we need to pull in the tag input validation raintank/schema#11

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants