Skip to content

Implemented classifier transform#67

Merged
ErikBjare merged 16 commits intomasterfrom
dev/classify
Oct 18, 2019
Merged

Implemented classifier transform#67
ErikBjare merged 16 commits intomasterfrom
dev/classify

Conversation

@ErikBjare
Copy link
Copy Markdown
Member

No description provided.

@codecov
Copy link
Copy Markdown

codecov bot commented Sep 20, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@589115f). Click here to learn what that means.
The diff coverage is 72.46%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master      #67   +/-   ##
=========================================
  Coverage          ?   80.74%           
=========================================
  Files             ?       30           
  Lines             ?     3132           
  Branches          ?        0           
=========================================
  Hits              ?     2529           
  Misses            ?      603           
  Partials          ?        0
Impacted Files Coverage Δ
src/transform/mod.rs 85.71% <ø> (ø)
src/models/event.rs 53.84% <0%> (ø)
src/transform/classify.rs 68.18% <68.18%> (ø)
src/query/functions.rs 92.13% <91.66%> (ø)
tests/query.rs 88.46% <92.85%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 589115f...f65d6ec. Read the comment docs.

}

#[test]
fn test_classify() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we have all tests in the tests folder.

The Rust way to do things is to have unittests inside the source and integration tests inside the tests directory. Not sure which way is the best way for us.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quite like having small unittests close to the function definition. Although it doesn't seem to show up in the coverage properly?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, coverage is kind of broken currently anyway though, so whatever I guess

@ErikBjare ErikBjare changed the title Started working on classifier transform Implemented classifier transform Sep 21, 2019

fn choose_category(tags: HashSet<String>) -> String {
tags.iter().fold(&"Uncategorized".to_string(), |acc: &String, item: &String| {
if item.matches("->").count() >= acc.matches("->").count() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bad practice, please don't use -> anywhere in the server code and let the web-ui pretty-print it with -> later.

Copy link
Copy Markdown
Member Author

@ErikBjare ErikBjare Sep 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having tags/categories be arrays doesn't make sense imo. (having classes be a Vec<(Vec<String>, Regex)> would be annoying)

The "->" is not primarily intended to be pretty, it's intended to:

  1. Convey the hierarchy order
  2. Act as an intuitive separator
  3. Differentiate a category from a tag (a category is a special case of a tag)

I'm really unconvinced about changing this into an array, I've been writing all the classifiers like this and it's really not that bad. I'm actually very happy with using "->" in this way, even though I get why you might not like it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, dot uses the "->" separator (which is where I got the idea from, I think).

A nice/fun consequence of this is that if you basically dumped all your categories into a .dot file then you'd get a nice visualization of the category structure.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having classes be a Vec<(Vec, Regex)> would be annoying

I disagree, it clearly defines what the data is without hiding anything

The "->" is not primarily intended to be pretty, it's intended to:

  • Convey the hierarchy order
  • Act as an intuitive separator
  • Differentiate a category from a tag (a category is a special case of a tag)

To me all of those reasons say "pretty" but in a different way.

  • Arrays have a order
  • It's a separator, that's for sure. I don't know what intuitive means in this case as it's just a separator, but I assume that you mean more readable which I agree but that also comes at a cost of being a exotic construction which otherwise doesn't fit how the other transforms work which might actually make it less intuitive.
  • A comment could just as well serve as a differentiation

Also, dot uses the "->" separator (which is where I got the idea from, I think).

A nice/fun consequence of this is that if you basically dumped all your categories into a .dot file then you'd get a nice visualization of the category structure.

Doing ["a", "b"].join("->") would do the same thing, it's not something complex we're talking about here.

Other reasons to avoid it:

  • Dot is a whole markup language, this is an entry in an already existing language which means that it's an format within a format which is messy
  • Using -> means extra parsing and simply matching them with the currently weak system by simply splitting them would mean possible errors. There's a huge amount of edge cases in both the aw-server-python and aw-server-rust code which needs extra testing because of this.
  • If someone automates something which creates tags and it contains "->" it will get messy unless you escape the -> somehow.
  • Arrays already have a standard hierarchy order which is left to right (same as your "->" solution)

Just because you are used to something and that it currently works doesn't mean that it's a good idea.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree, it clearly defines what the data is without hiding anything

It doesn't. It says a tiny bit about how tags are for whatever reason Vec's, which helps the Rust compiler a little bit but doesn't define anything about the underlying behavior of tags/categories. This is unlike using # for tags and -> for categories, which makes it apparent to the reader that there is a distinction.

A comment could just as well serve as a differentiation

No. Tags begin with #, categories have a hierarchy order. These need to be checked at runtime, and I'm not going to both check if an array is 1-length and starts with a # to determine if it's a tag or not.

Using -> means extra parsing and simply matching them with the currently weak system by simply splitting them would mean possible errors. There's a huge amount of edge cases in both the aw-server-python and aw-server-rust code which needs extra testing because of this.

Not really, the parsing is literally just a check for # or -> and then a split. I'm not sure what edge cases you are referring to and I can't recall any open issues about it.

If someone automates something which creates tags and it contains "->" it will get messy unless you escape the -> somehow.

This seems extremely unlikely, and would be easy for them to just avoid doing.


Regardless, I've gone over solutions like these before in the aw-research code, and what I've come up with was the easiest to work with in the end. So unless you want to get into a long debate over all the other options I've explored that solve the same problem in a better way than the Vec you're proposing (such as letting tags have metadata, or letting the class spec be a dict) I'm going to go ahead and merge this.

Copy link
Copy Markdown
Member Author

@ErikBjare ErikBjare Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I obviously can't disagree with that, but it doesn't help me to merge this PR and isn't an argument for or against the options I mentioned.

I've experimented long enough to have settled on what I believe will be a good API. Unless you think I'm being totally unreasonable here I'd suggest you take my word for the fact that I've explored plenty of options and that I didn't find any of them particularly appealing.

Another option would be to have a category and/or tag type in the query language.

What does this even mean? How does that affect the underlying transform?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the short answer before, did not have time to write a proper answer (and probably don't have now either).

I obviously can't disagree with that, but it doesn't help me to merge this PR

Reviews aren't supposed to help getting something merged, it's supposed to increase the quality of the code getting merged.
It's also already been as a PR for 9 days and nothing else has been merged since so there shouldn't be a hurry.
You should also expect that a bigger PR should get a longer review time, I have only taken a quick glance at the code as of now.

A third option is to split up the behavior for tags and categories entirely, in which case I'd be fine with using your Vec<(Vec, Regex)> for the category part and keep using Vec<(String, Regex)> for plain tags. But I don't think that makes sense since it's more or less identical code apart from selecting the highest-ranking category.

To me that sounds reasonable. If they have no reason to be coupled and they do different things they shouldn't be coupled.

What does this even mean? How does that affect the underlying transform?

It was just kind of a stupid idea to maybe have a construct in the language to make things more clear. Either just defining a function called category("work", "project") which returns a actual category type, or possibly even have a custom syntax such as category["work" -> "project"]. I don't think it's a good idea, just brainstorming to see if it's possible to get the best of both worlds.

Copy link
Copy Markdown
Member Author

@ErikBjare ErikBjare Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviews aren't supposed to help getting something merged, it's supposed to increase the quality of the code getting merged.

It's the same thing in the end: We only merge code that's in good shape, and reviews help get code in shape, therefore reviews help get things merged. All I was saying was that your short comment didn't help me fix the issues you mentioned.

It's also already been as a PR for 9 days and nothing else has been merged since so there shouldn't be a hurry.

It's been stale for 9 days in part because I myself have been bikeshedding over the very thing you requested changes on since you were seemingly adamant about the status quo not being satisfactory.

You should also expect that a bigger PR should get a longer review time, I have only taken a quick glance at the code as of now.

But it's not big? It's +177 -11. (You might be thinking about the aw-webui PR)

To me that sounds reasonable. If they have no reason to be coupled and they do different things they shouldn't be coupled.

Really? This would lead to a lot of almost duplicated code/APIs in several places (two query2 functions, two transforms, two settings in the web UI). They don't really do different things, they do 90% exactly the same thing, apart from the category selection at the end. All that work just to get rid of a few string splits with -> seems ridiculous tbh.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All I was saying was that your short comment didn't help me fix the issues you mentioned.

Fair enough, misunderstood you

But it's not big? It's +177 -11. (You might be thinking about the aw-webui PR)

The API part is a big thing to me. This will be with us for a long time.

Really? This would lead to a lot of almost duplicated code/APIs in several places (two query2 functions, two transforms, two settings in the web UI). They don't really do different things, they do 90% exactly the same thing, apart from the category selection at the end. All that work just to get rid of a few string splits with -> seems ridiculous tbh.

Two query2 functions is fine, that boilerplate code is minimal and stable.
Two transforms is acceptable. It's worth it for having to avoid a custom string format IMO.
Two settings in the web UI is probably a good thing if they don't belong with each other, no point in mixing them really.

Also, currently there does not seem to be any examples of using tagging in the visualization yet?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to try to be more generic and merge the functionality of categories and tags and just have them as one implementation and instead add the ability to save different sets of classes and choose which one you want to use.

Maybe make it optional to have a single event have multiple tags/categories or not and always allow the depth to be more than 1. If you don't want the depth to be more than 1, simply don't specify any classes which has more than 1 entry. That's essentially the same way we do with filter_events_by_keys, it allows supplying multiple keys but most often we only supply one.

Kind of hard to explain, but I hope I get my point across.

}

// TODO: Classes should be &Vec<(String, Rule)>
pub fn autotag(mut events: Vec<Event>, rules: &Vec<(String, Regex)>) -> Vec<Event> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

autotag makes it sound like it comes up with the tags itself (which would be really cool, but unfortunately that's not what it does).
Just "match_tags" would better fit as a name.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it to just tag because otherwise, I feel like we should also rename the categorize function to match_category and that just seems unnecessarily wordy.

Copy link
Copy Markdown
Member

@johan-bjareholt johan-bjareholt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API looks good and from a quick glance code looks good as well!

}
}

impl TryFrom<&DataType> for Vec<DataType> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome with all these TryFrom!

Also nice that you moved types to a seperate file.

@ErikBjare ErikBjare merged commit 48b78f4 into master Oct 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants