Skip to content

Conversation

@yuslepukhin
Copy link
Member

@yuslepukhin yuslepukhin commented Jan 31, 2020

Description: Provide ORT kernels for the following:
CountVectorizer, Tfidfvectorizer left out pending decision.

• FromStringFeaturizer
• L1NormalizeFeaturizer
• L2NormalizeFeaturizer
• MaxNormalizeFeaturizer
• MeanImputerFeaturizer
• MedianImputerFeaturizer
• MinMaxImputerFeaturizer
• ModeImputerFeaturizer
• NumericalizeFeaturizer
• PCAFeaturizer
• StandardScaleWrapperFeaturizer
• TruncatedSVDFeaturizer

  Add StandardScaleWrapperTransformer kernel and tests.
  Add Numericalize Featurizer.
  Bring kernel implementation for MeanImputer, MedianImputer, MinMaxInputer and
  ModeImputer.
  Add MinMaxImputerTransformer
  Add mean_imputer_transformer_test
  Add MedianImputerTest.
  Add ModeImputerTest.
  Add FromStringTransformer
  Add NormalizeFeaturizer and tests.
  First verson of PCA and Truncated SVD implementation.
  TODO: CountVectorizer, TfidfVectorizer
@yuslepukhin yuslepukhin marked this pull request as ready for review February 1, 2020 00:24
@yuslepukhin yuslepukhin requested a review from a team as a code owner February 1, 2020 00:24
namespace onnxruntime {
namespace featurizers {

template <typename T>
Copy link
Contributor

@pranavsharma pranavsharma Feb 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these structs need to be repeated in every source file? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a part of a generated code. They are per featurizer so they may be different.


In reply to: 373739890 [](ancestors = 373739890)

}
};

class ModeImputerTransformer final : public OpKernel {
Copy link
Contributor

@pranavsharma pranavsharma Feb 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to allow copy/move operations for this class? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class is stateless and kernels are never copied. Not sure I understood the question.


In reply to: 373740339 [](ancestors = 373740339)

Eigen::Map<MatrixT> output_matrix(output_data, dim_0, dim_1);

std::function<void(MatrixT val)> callback;
callback = [&output_matrix](MatrixT val) {
Copy link
Contributor

@pranavsharma pranavsharma Feb 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any disadvantage in assigning this lambda in the previous line itself? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous line would not be an assignment. It would require a constuctor which std::function does not have.


In reply to: 373741707 [](ancestors = 373741707)

Copy link
Contributor

@pranavsharma pranavsharma Feb 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh may be the constructor doesn't exist in the compiler version we're using? For reference https://en.cppreference.com/w/cpp/utility/functional/function/function #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will play safe for older compilers.


In reply to: 373744508 [](ancestors = 373744508,373741707)


namespace {
template <typename T>
std::vector<uint8_t> GetStream(const std::vector<typename NS::Traits<T>::nullable_type>& trainingBatches, size_t colIndex) {
Copy link
Contributor

@pranavsharma pranavsharma Feb 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming convention of colIndex is off. #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected everywhere


In reply to: 373741877 [](ancestors = 373741877)

}
);
void RegisterFromStringFeaturizerVer1() {
static const char* doc = R"DOC(
Copy link
Contributor

@pranavsharma pranavsharma Feb 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would try to avoid such static strings for op documentation as they occupy space in the binary. They're not required when running the model. May be we can disable them behind a macro like ONNX (__ONNX_NO_DOC_STRINGS). #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we omit SetDoc call?


In reply to: 373742554 [](ancestors = 373742554)

Copy link
Contributor

@pranavsharma pranavsharma Feb 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately SetDoc only reduces the memory footprint of the opschema object. The static string is still stuck in the binary occupying space. The ONNX schema is filled with many such examples and so is our contrib ops. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will both Omit SetDoc and make strings comments


In reply to: 373746248 [](ancestors = 373746248)

pranavsharma
pranavsharma previously approved these changes Feb 1, 2020

if (onnxruntime_USE_FEATURIZERS)
if(NOT MSVC)
set_source_files_properties(${onnxruntime_cpu_featurizers_cc_srcs} PROPERTIES COMPILE_FLAGS "-Wno-unknown-warning")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may add the following line to: https://github.com/microsoft/onnxruntime/blob/master/cmake/CMakeLists.txt#L502

  check_cxx_compiler_flag(-Wno-unknown-warning HAS_NO_UNKNOWN_WARNING)

Then at here you can use:

 if(HAS_NO_UNKNOWN_WARNING)	
    set_source_files_properties(${onnxruntime_cpu_featurizers_cc_srcs} PROPERTIES COMPILE_FLAGS "-Wno-unknown-warning")

snnn
snnn previously approved these changes Feb 4, 2020
@@ -1,3 +1,4 @@

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to exclude this file from this PR

@snnn
Copy link
Contributor

snnn commented Feb 4, 2020

Maybe it is a bit out of topic, what is L1NormalizeFeaturizer? Based on what I know, L1/L2 norm is only used in training, not inferencing.

@yuslepukhin yuslepukhin merged commit c86db80 into master Feb 4, 2020
@yuslepukhin yuslepukhin deleted the yuslepukhin/std_scale_wrapper branch February 4, 2020 23:43
weixingzhang pushed a commit that referenced this pull request Mar 23, 2020
Merge up to commit 4f4f4bc

There were several very large pull requests in public master:
#2956
#2958
#2961

**BERT-Large, FP16, seq=128:**
Batch = 66
Throughput = 189.049 ex/sec

**BERT-Large, FP16, seq=512:**
Batch = 10
Throughput = 36.6335 ex/sec

**BERT-Large, FP32, seq=128:**
Batch = 33
Throughput = 42.2642 ex/sec

**BERT-Large, FP32, seq=512:**
Batch = 5
Throughput = 9.32792 ex/sec

**BERT-Large LAMB convergence:**
![image.png](https://aiinfra.visualstudio.com/530acbc4-21bc-487d-8cd8-348ff451d2ff/_apis/git/repositories/adc1028e-6f04-44b7-a3cf-cb157be4fb65/pullRequests/5567/attachments/image.png)
`$ python watch_experiment.py --subscription='4aaa645c-5ae2-4ae9-a17a-84b9023bc56a' --resource_group='onnxtraining' --workspace='onnxtraining' --remote_dir='logs/tensorboard/' --local_dir='D:/tensorboard/bert-large/fp16/lamb/seq128/lr3e-3/wr0.2843/master/' --run='BERT-ONNX_1581120364_71872cef'`

**E2E**:  PASSED
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=117300&view=results
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants