Conversation
michaelbenayoun left a comment
Huge work @philschmid!!!
I left a few comments and questions to initiate the discussion on a few details about this really cool feature.
```python
)

def forward(
    self,
    input_ids: Optional[torch.Tensor] = None,
```
I think we should also accept numpy arrays directly, and return tensors of the same nature as the input (numpy arrays if numpy array, torch tensors if torch.Tensor), but I might be missing some details with pipelines.
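A minimal sketch of what such type-preserving handling could look like; the `run_inference` name, the `session` argument, and the input names are illustrative assumptions, not the PR's actual code:

```python
import numpy as np
import torch

def run_inference(session, input_ids, attention_mask):
    # Track whether the caller passed numpy arrays or torch tensors
    inputs_are_numpy = isinstance(input_ids, np.ndarray)
    to_numpy = (lambda x: x) if inputs_are_numpy else (lambda x: x.cpu().detach().numpy())
    onnx_inputs = {
        "input_ids": to_numpy(input_ids),
        "attention_mask": to_numpy(attention_mask),
    }
    outputs = session.run(None, onnx_inputs)
    # Return outputs of the same nature as the inputs: ONNX Runtime already
    # returns numpy arrays, so only convert back for torch callers
    return outputs if inputs_are_numpy else [torch.from_numpy(o) for o in outputs]
```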
```python
onnx_inputs["token_type_ids"] = token_type_ids.cpu().detach().numpy()
# run inference
outputs = self.model.run(None, onnx_inputs)
# converts output to namedtuple for pipelines post-processing
```
Could we have a return_dict parameter? Just a suggestion, I don't know how useful it would be.
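As a rough illustration of the suggestion, a sketch where the output wrapper and the `logits` key are assumptions, not the PR's code:

```python
from collections import namedtuple

# Pipelines post-processing expects attribute access (e.g. output.logits),
# which the namedtuple provides; return_dict would expose a plain dict instead.
SequenceClassifierOutput = namedtuple("SequenceClassifierOutput", ["logits"])

def wrap_outputs(raw_outputs, return_dict: bool = False):
    if return_dict:
        return {"logits": raw_outputs[0]}
    return SequenceClassifierOutput(logits=raw_outputs[0])
```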
lewtun left a comment
Thank you for this awesome feature 🔥! I left a few nits about the docs and some general questions about the naming of the `OnnxForXxx` classes.
I also agree with @michaelbenayoun that we need to discuss whether these classes should also be able to optimize / quantize ONNX models (I can see pros and cons with each approach).
Otherwise, great stuff!
```
# Models
*.onnx
# include small test model for tests
!tests/assets/onnx/model.onnx
```
Would it make sense to have a small model on the Hub, e.g. under the optimum org?
```
The [`optimum_pipeline`] makes it simple to use models from the [Model Hub](https://huggingface.co/models) for accelerated inference on a variety of tasks such as text classification.

Even if you don't have experience with a specific modality or understand the code powering the models, you can still use them with the [`optimum_pipeline`]! This tutorial will teach you to:

<Tip>
```
Should this tip be moved below, i.e. just after the example in Optimum pipeline usage?
echarlaix left a comment
Great work @philschmid, this new feature is 🔥. I left a couple of minor comments, and I'm also happy to discuss the optimize and quantize methods, as @michaelbenayoun and @lewtun pointed out.
I reworked the example:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig
from optimum.pipelines import optimum_pipeline

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True
)

# optimization_level=99 enables all available graph optimizations
optimization_config = OptimizationConfig(optimization_level=99)
model.optimize(optimization_config=optimization_config)

# dynamic quantization configuration
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
model.quantize(quantization_config=qconfig)

# create transformers pipeline
onnx_clx = optimum_pipeline("text-classification", model=model, tokenizer=tokenizer)

text = "I like the new ORT pipeline"
pred = onnx_clx(text)
print(pred)
```
regisss left a comment
Huge work @philschmid 🔥
I just left a couple of minor comments.
Hi all, you are doing kind of the same as I did here: https://github.com/AlekseyKorshuk/optimum-transformers. Btw, this project was done to introduce myself to the Hugging Face team 🤗. This may seem a little cheeky on my part, but I have now applied for an internship at Hugging Face and would be very happy to work on this feature fully on the part of the company. Please don't hesitate to contact me should you need further information. Best regards
lewtun left a comment
Thanks a lot for iterating on this great feature @philschmid 🚀! Most of my comments are nits, and I think the only "major" thing involves reworking the docs about quantization / optimization.
Apart from that, LGTM!
```
The Optimum model classes, e.g. [`~ORTModelForSequenceClassification`], are directly integrated with the [Hugging Face Model Hub](https://hf.co/models), meaning you can not only
load models from the Hub but also push your models to the Hub with the `push_to_hub()` method. Below you find an example which pulls a vanilla transformers model
from the Hub, converts it to an optimum model, uses the [`~ORTQuantizer`] to quantize the model, and pushes it back into a new repository.
```
We don't seem to do quantization in this example, right?
True, removed it for now and added a TODO.
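A sketch of the pull-convert-push flow the docs describe, with the quantization step left out to match the fix above; the repository names are placeholders, and the exact `push_to_hub()` signature shown here is an assumption, not taken from the PR:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification

# Pull a vanilla transformers model from the Hub and convert it to ONNX
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True
)

# Save the converted model locally, then push it to a new Hub repository
model.save_pretrained("local-onnx-model")
model.push_to_hub("local-onnx-model", repository_id="my-onnx-repo", use_auth_token=True)
```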
michaelbenayoun left a comment
Seems great, thanks a lot @philschmid, it will be super useful!
```python
pred = onnx_qa(question, context)
```

```
Optimum Inference also includes methods to convert vanilla Transformers models to optimized ones. Simply pass `from_transformers=True` to the `from_pretrained()` method, and your model will be loaded and converted to ONNX on-the-fly:
```
Is it really an "optimized" one?

I wouldn't be that critical, or do you have a suggestion for what we could use instead of "optimized"?
optimum/onnxruntime/modeling_ort.py (outdated)
```python
@staticmethod
def load_model(path: Union[str, Path], provider=None):
    """
    loads ONNX Inference session with Provider. Default Provider is if GPU available else `CPUExecutionProvider`
```
Uppercase, and missing `CUDAExecutionProvider` after "Default", I think.
Not sure we should default to GPU; it will pick GPU:0, which can be problematic if the user already has something running on that GPU.
IMO: default to CPU and use `CUDAExecutionProvider` only if:
- available
- provided by the user (i.e. they know what they're doing, nothing hidden, and it follows what ORT does by default too)
Let's stick with it for now and adjust it afterward when we have feedback or issues. That way it is the easiest UX, and you can always provide `provider` explicitly, e.g. Nils does the same for sentence-transformers.
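For reference, pinning the provider explicitly with plain ONNX Runtime looks like the sketch below (the model path is a placeholder); passing a provider-options dict also lets a user select a specific GPU instead of the implicit GPU:0:

```python
import onnxruntime as ort

# Explicit provider selection: run on GPU 1 and fall back to CPU,
# instead of relying on the implicit CUDAExecutionProvider/GPU:0 default
session = ort.InferenceSession(
    "model.onnx",
    providers=[
        ("CUDAExecutionProvider", {"device_id": 1}),
        "CPUExecutionProvider",
    ],
)
```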
mfuntowicz left a comment
Looks great overall, congrats 💪🏻.
A few performance/usability comments that I think would be interesting to address here or in a follow-up PR; no strong blocking point for me, a very good starting point.
```python
if len(str(model_id).split("@")) == 2:
    model_id, revision = model_id.split("@")
```
What about uniformizing the way to grab a specific revision for a model towards what transformers provides, using a specific `revision` parameter:

```
revision (`str`, *optional*, defaults to `"main"`):
    The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a
    git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any
    identifier allowed by git.
```
Great idea and suggestion, but let's try to get this PR in first. I would create separate small issues we can then work on after the PR is merged.
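A hypothetical sketch of what the suggested parameter could look like at the call site, mirroring transformers (the PR itself keeps the `model_id@revision` syntax for now, so this signature is an assumption):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification

# revision can be a branch name, a tag name, or a commit id
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    revision="main",
    from_transformers=True,
)
```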
```python
        available else `CPUExecutionProvider`
    """
    if provider is None:
        provider = "CUDAExecutionProvider" if _is_gpu_available() else "CPUExecutionProvider"
```
What about wrapping those names within an enum to make it more user-friendly?

```python
class ORTDevice(Enum):
    CPU = "CPUExecutionProvider"
    CUDA = "CUDAExecutionProvider"
    TENSORRT = "TensorRTExecutionProvider"
    ...
```

I find those names not very friendly for someone coming from PyTorch devices.
or even something à la PyTorch:

```python
class device:
    def __init__(self, specs: str):
        parse_specs(specs)

    @property
    def is_cpu(self) -> bool:
        pass

    @property
    def is_cuda(self) -> bool:
        pass

    @property
    def num_cores(self) -> Optional[int]:
        return specs["core"]  # for simplicity

    @property
    def gpu_id(self) -> Optional[int]:
        return specs["gpu_id"]

    def to_execution_provider_specs(self) -> Tuple[str, Dict[str, Any]]:
        return "CPUExecutionProvider", {"num_intraops_threads": self.num_cores}


cpu_bound_device = device("cpu:0-16")
cpu_cores_device = device("cpu:16")
gpu_device = device("cuda:0")
```

Great idea and suggestion, but let's try to get this PR in first. I would create separate small issues we can then work on after the PR is merged.
```python
onnx_inputs = {
    "input_ids": input_ids.cpu().detach().numpy(),
    "attention_mask": attention_mask.cpu().detach().numpy(),
}
```
We should use `OrtValue` in the future to reduce the round-trip of the data Torch -> Numpy -> ORT.
This is especially true if the device property binds to CUDA; then we are allocating Torch(GPU) -> Torch(CPU) -> Numpy -> ORT.

Great idea and suggestion, but let's try to get this PR in first. I would create separate small issues we can then work on after the PR is merged.
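A rough sketch of the idea @mfuntowicz describes, using ONNX Runtime's IO binding to keep a CUDA tensor on-device instead of the Torch(GPU) -> Torch(CPU) -> Numpy hop; the model path and input/output names are assumptions for illustration:

```python
import numpy as np
import onnxruntime as ort
import torch

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
binding = session.io_binding()

# Bind a CUDA torch tensor directly by pointer, avoiding the CPU/numpy copy
input_ids = torch.ones(1, 8, dtype=torch.int64, device="cuda")
binding.bind_input(
    name="input_ids",
    device_type="cuda",
    device_id=0,
    element_type=np.int64,
    shape=tuple(input_ids.shape),
    buffer_ptr=input_ids.data_ptr(),
)
binding.bind_output("logits")

session.run_with_iobinding(binding)
outputs = binding.copy_outputs_to_cpu()
```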
```python
feature_extractor: Optional[Union[str, PreTrainedFeatureExtractor]] = None,
use_fast: bool = True,
use_auth_token: Optional[Union[str, bool]] = None,
accelerator: Optional[str] = "ort",
```
"onnxruntime"? Again, for a non-expert, "ort" might not ring a bell.

In transformers we have `pt` and `tf` as well, so I kept it similar, plus we have the `ORTModel*` classes. I would rather stick with `ort`.

Ok, let's go this way; we can still have "onnxruntime" as an alias for "ort" if we want both.
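A tiny sketch of the alias idea (the helper name and mapping are illustrative, not the PR's code):

```python
# Normalize the accelerator argument so both spellings select the same backend
ACCELERATOR_ALIASES = {"onnxruntime": "ort"}

def normalize_accelerator(accelerator: str) -> str:
    return ACCELERATOR_ALIASES.get(accelerator, accelerator)

assert normalize_accelerator("onnxruntime") == "ort"
assert normalize_accelerator("ort") == "ort"
```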
echarlaix left a comment
It looks great, thanks a lot @philschmid! Ready to merge after the confirmation about the tests workflow.
```
@@ -0,0 +1,467 @@
import os
```
I didn't see any YAML file defining a workflow to handle the tests from `test_modeling_ort.py`? It seems that `test_onnxruntime.yml` is only running the tests from `test_onnxruntime.py` and not all the ones from the onnxruntime directory. The tests from `test_modeling_base.py` would be covered by `test_optimum_common.yml`.
PoC: Accelerated Inference
Our goal with Optimum is to offer the open-source reference toolkit for transformers acceleration work, and inference is one part of it.
Over the last couple of weeks and months, I created a PoC enabling us to create `OptimizedModelFor*` classes that mimic the transformers API, allowing support for `transformers.pipelines` and other features. I have opened this PR to discuss the current PoC. I also put quite some work into already providing documentation and tests.
You can access the documentation here:
The two main new pages are "Inference" under onnxruntime and "Pipelines" under getting started.
The current PR supports:
- a `from_transformers` method to automatically convert models with `transformers.onnx`

Below is a snippet showing how you can load a vanilla transformers model, convert it, and then use it in a pipeline.
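A minimal version of such a snippet, following the example discussed earlier in the thread:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
from optimum.pipelines import optimum_pipeline

# Load a vanilla transformers model from the Hub and convert it to ONNX on-the-fly
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", from_transformers=True
)

# Use the converted model in a transformers-style pipeline
onnx_clx = optimum_pipeline("text-classification", model=model, tokenizer=tokenizer)
print(onnx_clx("I like the new ORT pipeline"))
```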
More examples can be found here:
Todos:
- `ORTOptimizer` and `ORTQuantizer` in `.quantize` and `.optimize`
- `OnnxForCausalLM` model
- `ORTModelForXX`
- `from_transformers` to `from_pretrained(model_id, from_transformers=True)`

After another discussion with the Optimum Team:
- removed `.optimize` and `.quantize` for now to think about a save approach for offline optimizations
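For context, an offline flow could quantize ahead of time and save the result rather than mutating a loaded model in place. A sketch under the assumption that `ORTQuantizer` exposes an export-style API; the signatures and paths here are illustrative, not taken from this PR:

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Quantize offline: read an exported ONNX model, write a quantized copy,
# and keep both artifacts on disk instead of modifying a live ORTModel
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
quantizer = ORTQuantizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    feature="sequence-classification",
)
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    quantization_config=qconfig,
)
```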