Introduce classification analysis runner.#701
Conversation
885a693 to
cff6d69
Compare
13381dd to
d6181ba
Compare
|
I have labelled this |
d6181ba to
88e02b4
Compare
tveasey
left a comment
There was a problem hiding this comment.
This looks pretty good. I left some minor comments.
One thing I think is it would be nice to avoid the string comparisons you do reading every row. Basically, you can do this once if you look up the position of the dependent variable in the feature name strings vector (see comments). It would also be good to add a unit test to CDataFrameAnalyzerTest for this behaviour. You can access the data frame CDataFrameAnalyzer creates to do this.
There was a problem hiding this comment.
My inclination would be to pass in the actual field names and return a vector bool which is an indicator function for which columns to treat empty as missing. This will make the check much more efficient, i.e.
| void columnsForWhichEmptyIsMissing(TStrVec& fieldNames) const; | |
| TBoolVec columnsForWhichEmptyIsMissing(const TStrVec& fieldNames) const; |
There was a problem hiding this comment.
nit: can we move before private typedefs as per coding standards
lib/api/CDataFrameAnalyzer.cc
Outdated
There was a problem hiding this comment.
Can we move these to local namespace at the start of this file, i.e. these should have local linkage.
lib/api/CDataFrameAnalyzer.cc
Outdated
There was a problem hiding this comment.
move unsafe access in to else in case handle fatal doesn't abort.
| } | |
| } else { |
lib/api/CDataFrameAnalyzer.cc
Outdated
There was a problem hiding this comment.
coding standards
| for (std::size_t i = 0; i < maps.size(); i++) { | |
| for (std::size_t i = 0; i < maps.size(); ++i) { |
There was a problem hiding this comment.
Done.
BTW, I thought this kind of optimization is done automatically by the compiler these days.
But of course I'll comply with the standards :)
There was a problem hiding this comment.
For size_t it makes no difference. However, for a user defined type the compiler is forced to call a different function which would usually create a temporary. Since it is just as easy to type ++i as i++ we just mandate that we always use ++i if the return type isn't used to save having to think about it.
There was a problem hiding this comment.
As before
| const auto& tree{boostedTree()}; | |
| const auto& tree = this->boostedTree(); |
include/api/CDataFrameAnalyzer.h
Outdated
There was a problem hiding this comment.
| std::set<std::string> m_EmptyAsMissing; | |
| TBoolVec m_EmptyIsMissing; |
lib/api/CDataFrameAnalyzer.cc
Outdated
There was a problem hiding this comment.
As per my suggestion I'd move this into CDataFrameAnalyzer::captureFieldNames
lib/api/CDataFrameAnalyzer.cc
Outdated
There was a problem hiding this comment.
| m_EmptyAsMissing.count(m_FieldNames[i]) > 0, | |
| m_EmptyIsMissing[i], |
There was a problem hiding this comment.
If we assert here let's return by constant reference, i.e. const maths::CBoostedTree&.
There was a problem hiding this comment.
This looks pretty good. I left some minor comments.
One thing I think is it would be nice to avoid the string comparisons you do reading every row. Basically, you can do this once if you look up the position of the dependent variable in the feature name strings vector (see comments).
Done.
It would also be good to add a unit test to
CDataFrameAnalyzerTestfor this behaviour. You can access the data frameCDataFrameAnalyzercreates to do this.
WIP
include/api/CDataFrameAnalyzer.h
Outdated
lib/api/CDataFrameAnalyzer.cc
Outdated
lib/api/CDataFrameAnalyzer.cc
Outdated
There was a problem hiding this comment.
Done.
BTW, I thought this kind of optimization is done automatically by the compiler these days.
But of course I'll comply with the standards :)
lib/api/CDataFrameAnalyzer.cc
Outdated
lib/api/CDataFrameAnalyzer.cc
Outdated
lib/api/CDataFrameAnalyzer.cc
Outdated
2f0265e to
dd86f4e
Compare
There was a problem hiding this comment.
| void CDataFrameAnalyzerTest::testCategoricalFields_EmptyAsMissing() { | |
| void CDataFrameAnalyzerTest::testCategoricalFieldsEmptyAsMissing() { |
I've added unit tests. @tveasey, please take another look. |
tveasey
left a comment
There was a problem hiding this comment.
Great! Left a couple more minor change requests, but happy for this to go in without any further reviews so I'll go ahead and approve.
There was a problem hiding this comment.
| void CDataFrameAnalyzerTest::testCategoricalFields_EmptyAsMissing() { | |
| void CDataFrameAnalyzerTest::testCategoricalFieldsEmptyAsMissing() { |
There was a problem hiding this comment.
Can we make this maths::CDataFrameUtils::isMissing? This should match whatever value we use to indicate missing.
| auto nan = []() { return [](double actual) { return isnan(actual); }; }; | |
| auto missing = []() { return [](double actual) { return maths::CDataFrameUtils::isMissing(actual); }; }; |
There was a problem hiding this comment.
Let's stick to camel case we use this in 99.9% of all code (we should fix the other non-standard names in this suite as well, but you don't have to make that change)
| void testColumnsForWhichEmptyIsMissing_Classification(); | |
| void testColumnsForWhichEmptyIsMissingClassification(); |
There was a problem hiding this comment.
| void testColumnsForWhichEmptyIsMissing_Regression(); | |
| void testColumnsForWhichEmptyIsMissingRegression(); |
There was a problem hiding this comment.
| void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissing_Classification() { | |
| void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissingClassification() { |
There was a problem hiding this comment.
| void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissing_Regression() { | |
| void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissingRegression() { |
77d998e to
7e30490
Compare
|
run elasticsearch-ci |
…sses from it: CDataFrameRegressionRunner and CDataFrameClassificationRunner.
7e30490 to
97ae2ae
Compare
Make CDataFrameBoostedTreeRunner class abstract and derive two subclasses from it:
Relates elastic/elasticsearch#46735