Deliver a list of Featurizers #2958
Add StandardScaleWrapperTransformer kernel and tests. Add Numericalize featurizer. Bring in kernel implementations for MeanImputer, MedianImputer, MinMaxImputer, and ModeImputer. Add MinMaxImputerTransformer. Add mean_imputer_transformer_test. Add MedianImputerTest. Add ModeImputerTest.
Add FromStringTransformer. Add NormalizeFeaturizer and tests. First version of the PCA and Truncated SVD implementations. TODO: CountVectorizer and TfidfVectorizer with their tests.
…izer tests due to linkage errors.
```cpp
namespace onnxruntime {
namespace featurizers {

template <typename T>
```
Do these structs need to be repeated in every source file? #Resolved
This is a part of a generated code. They are per featurizer so they may be different.
```cpp
  }
};

class ModeImputerTransformer final : public OpKernel {
```
Do you want to allow copy/move operations for this class? #Resolved
The class is stateless and kernels are never copied. Not sure I understood the question.
```cpp
Eigen::Map<MatrixT> output_matrix(output_data, dim_0, dim_1);

std::function<void(MatrixT val)> callback;
callback = [&output_matrix](MatrixT val) {
```
Is there any disadvantage in assigning this lambda in the previous line itself? #Resolved
The previous line would not be an assignment. It would require a constructor, which std::function does not have.
Oh, maybe the constructor doesn't exist in the compiler version we're using? For reference: https://en.cppreference.com/w/cpp/utility/functional/function/function #Resolved
```cpp
namespace {
template <typename T>
std::vector<uint8_t> GetStream(const std::vector<typename NS::Traits<T>::nullable_type>& trainingBatches, size_t colIndex) {
```
The naming convention of `colIndex` is off. #Pending
```cpp
}
);
void RegisterFromStringFeaturizerVer1() {
  static const char* doc = R"DOC(
```
I would try to avoid such static strings for op documentation, as they occupy space in the binary and are not required when running the model. Maybe we can disable them behind a macro, like ONNX does with `__ONNX_NO_DOC_STRINGS`. #Resolved
Unfortunately SetDoc only reduces the memory footprint of the OpSchema object; the static string is still stuck in the binary occupying space. The ONNX schema is filled with many such examples, and so are our contrib ops. #Resolved
cmake/onnxruntime_providers.cmake
```cmake
if (onnxruntime_USE_FEATURIZERS)
  if(NOT MSVC)
    set_source_files_properties(${onnxruntime_cpu_featurizers_cc_srcs} PROPERTIES COMPILE_FLAGS "-Wno-unknown-warning")
```
You may add the following line to https://github.com/microsoft/onnxruntime/blob/master/cmake/CMakeLists.txt#L502:

```cmake
check_cxx_compiler_flag(-Wno-unknown-warning HAS_NO_UNKNOWN_WARNING)
```

Then here you can use:

```cmake
if(HAS_NO_UNKNOWN_WARNING)
  set_source_files_properties(${onnxruntime_cpu_featurizers_cc_srcs} PROPERTIES COMPILE_FLAGS "-Wno-unknown-warning")
endif()
```
Better to exclude this file from this PR
Maybe it is a bit off topic, but what is L1NormalizeFeaturizer? Based on what I know, L1/L2 norm is only used in training, not inference.
Merge up to commit 4f4f4bc. There were several very large pull requests in public master: #2956 #2958 #2961

**BERT-Large, FP16, seq=128:** Batch = 66, Throughput = 189.049 ex/sec
**BERT-Large, FP16, seq=512:** Batch = 10, Throughput = 36.6335 ex/sec
**BERT-Large, FP32, seq=128:** Batch = 33, Throughput = 42.2642 ex/sec
**BERT-Large, FP32, seq=512:** Batch = 5, Throughput = 9.32792 ex/sec

**BERT-Large LAMB convergence:** 

`$ python watch_experiment.py --subscription='4aaa645c-5ae2-4ae9-a17a-84b9023bc56a' --resource_group='onnxtraining' --workspace='onnxtraining' --remote_dir='logs/tensorboard/' --local_dir='D:/tensorboard/bert-large/fp16/lamb/seq128/lr3e-3/wr0.2843/master/' --run='BERT-ONNX_1581120364_71872cef'`

**E2E**: PASSED https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=117300&view=results
Description: Provide ORT kernels for the following featurizers. CountVectorizer and TfidfVectorizer are left out pending a decision.
• FromStringFeaturizer
• L1NormalizeFeaturizer
• L2NormalizeFeaturizer
• MaxNormalizeFeaturizer
• MeanImputerFeaturizer
• MedianImputerFeaturizer
• MinMaxImputerFeaturizer
• ModeImputerFeaturizer
• NumericalizeFeaturizer
• PCAFeaturizer
• StandardScaleWrapperFeaturizer
• TruncatedSVDFeaturizer