Skip to content

Introduce classification analysis runner.#701

Merged
przemekwitek merged 5 commits intoelastic:masterfrom
przemekwitek:classification
Oct 4, 2019
Merged

Introduce classification analysis runner.#701
przemekwitek merged 5 commits intoelastic:masterfrom
przemekwitek:classification

Conversation

@przemekwitek
Copy link
Copy Markdown

@przemekwitek przemekwitek commented Sep 27, 2019

Make CDataFrameBoostedTreeRunner class abstract and derive two subclasses from it:

  1. CDataFrameRegressionRunner
  2. CDataFrameClassificationRunner

Relates elastic/elasticsearch#46735

@droberts195
Copy link
Copy Markdown

I have labelled this >non-issue on the assumption that it will be release-noted via the Java side PR.

Copy link
Copy Markdown
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good. I left some minor comments.

One thing I think is it would be nice to avoid the string comparisons you do reading every row. Basically, you can do this once if you look up the position of the dependent variable in the feature name strings vector (see comments). It would also be good to add a unit test to CDataFrameAnalyzerTest for this behaviour. You can access the data frame CDataFrameAnalyzer creates to do this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My inclination would be to pass in the actual field names and return a vector bool which is an indicator function for which columns to treat empty as missing. This will make the check much more efficient, i.e.

Suggested change
void columnsForWhichEmptyIsMissing(TStrVec& fieldNames) const;
TBoolVec columnsForWhichEmptyIsMissing(const TStrVec& fieldNames) const;

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we move before private typedefs as per coding standards

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move these to local namespace at the start of this file, i.e. these should have local linkage.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move unsafe access in to else in case handle fatal doesn't abort.

Suggested change
}
} else {

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coding standards

Suggested change
for (std::size_t i = 0; i < maps.size(); i++) {
for (std::size_t i = 0; i < maps.size(); ++i) {

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

BTW, I thought this kind of optimization is done automatically by the compiler these days.
But of course I'll comply with the standards :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For size_t it makes no difference. However, for a user defined type the compiler is forced to call a different function which would usually create a temporary. Since it is just as easy to type ++i as i++ we just mandate that we always use ++i if the return type isn't used to save having to think about it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As before

Suggested change
const auto& tree{boostedTree()};
const auto& tree = this->boostedTree();

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::set<std::string> m_EmptyAsMissing;
TBoolVec m_EmptyIsMissing;

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 60 to 63
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my suggestion I'd move this into CDataFrameAnalyzer::captureFieldNames

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
m_EmptyAsMissing.count(m_FieldNames[i]) > 0,
m_EmptyIsMissing[i],

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we assert here let's return by constant reference, i.e. const maths::CBoostedTree&.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Author

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good. I left some minor comments.

One thing I think is it would be nice to avoid the string comparisons you do reading every row. Basically, you can do this once if you look up the position of the dependent variable in the feature name strings vector (see comments).

Done.

It would also be good to add a unit test to CDataFrameAnalyzerTest for this behaviour. You can access the data frame CDataFrameAnalyzer creates to do this.

WIP

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 60 to 63
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

BTW, I thought this kind of optimization is done automatically by the compiler these days.
But of course I'll comply with the standards :)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void CDataFrameAnalyzerTest::testCategoricalFields_EmptyAsMissing() {
void CDataFrameAnalyzerTest::testCategoricalFieldsEmptyAsMissing() {

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@przemekwitek
Copy link
Copy Markdown
Author

WIP

I've added unit tests. @tveasey, please take another look.

Copy link
Copy Markdown
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Left a couple more minor change requests, but happy for this to go in without any further reviews so I'll go ahead and approve.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void CDataFrameAnalyzerTest::testCategoricalFields_EmptyAsMissing() {
void CDataFrameAnalyzerTest::testCategoricalFieldsEmptyAsMissing() {

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this maths::CDataFrameUtils::isMissing? This should match whatever value we use to indicate missing.

Suggested change
auto nan = []() { return [](double actual) { return isnan(actual); }; };
auto missing = []() { return [](double actual) { return maths::CDataFrameUtils::isMissing(actual); }; };

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 good test

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's stick to camel case we use this in 99.9% of all code (we should fix the other non-standard names in this suite as well, but you don't have to make that change)

Suggested change
void testColumnsForWhichEmptyIsMissing_Classification();
void testColumnsForWhichEmptyIsMissingClassification();

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void testColumnsForWhichEmptyIsMissing_Regression();
void testColumnsForWhichEmptyIsMissingRegression();

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissing_Classification() {
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissingClassification() {

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissing_Regression() {
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissingRegression() {

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@przemekwitek
Copy link
Copy Markdown
Author

run elasticsearch-ci

Przemyslaw Witek added 4 commits October 3, 2019 17:08
@przemekwitek przemekwitek merged commit 0a95ace into elastic:master Oct 4, 2019
@przemekwitek przemekwitek deleted the classification branch October 4, 2019 05:05
przemekwitek pushed a commit to przemekwitek/ml-cpp that referenced this pull request Oct 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants