Enable parquet content_type in the scoring server input for pyfunc#20630
Conversation
This modifies the invocations call of the scoring server to accept the Parquet content type (`application/x-parquet`) alongside the preexisting CSV and JSON. The Parquet file is read directly into a pandas DataFrame, and the Parquet input should already contain the proper data schema expected by the subsequent steps before the actual model prediction call. This commit adds Parquet only as an input type; output stays as JSON. Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>
In the most basic example the server still works as expected for CSV and JSON test inputs. I've converted our example CSV into Parquet format, and that now works just as well; I get similar output when there's an additional column in the input file, and so on.

If I intentionally malform the Parquet file, the error traceback is still surfaced. We also still get proper feedback if the Parquet file has a dtype mismatch, or if a column is missing.

For a size and ingestion comparison I created a pandas DataFrame of around 50 MB in-memory size, which came out to a shape of (15000, 40). Saved side by side as Parquet and CSV, the Parquet file is roughly half the size of the CSV. That is a significant difference when we talk about batch inference processing.

A simple curl with the larger Parquet file takes about 0.5 seconds (keeping in mind this is all localhost testing, so networking overhead is excluded). The same run with the CSV file takes a full 2 seconds. On this scale the gap may seem small, but for batch transform cases it is a noticeable change.

Additionally, there may be general restrictions on the maximum size of a single request body for an inference endpoint; SageMaker Batch Transform, for example, has MaxPayloadInMB. While SageMaker provides a "Split Type" option so that it takes care of fragmenting larger CSV requests into smaller pieces, one has to do that manually when processing bigger batches of data in Parquet form, so there's a requirement on the user side to partition the files during the data preprocessing step. Still, at least for my use case this is a worthwhile change for larger-scale data inference runs.
Documentation preview for e6d6f65 is available.
Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>
The implementation looks good! Could you add an E2E test?
Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>
```python
elif content_type == "parquet":
    # For a direct stdin stream we need to read through the entire text buffer first
    # before converting it from a TextIO stream into a BytesIO stream for Pandas.
    # A seek that pyarrow engine will try to do with Parquet for sys.stdin input
    # will get forbidden otherwise.
    df = (
        parse_parquet_input(input_path)
        if input_path is not None
        else parse_parquet_input(BytesIO(sys.stdin.buffer.read()))
    )
    params = None
```
Actually this API is used for mlflow.models.predict, and to support that we need more changes. I think it's fine to support scoring server for now, let's revert this part of changes :)
Sure, makes sense. 👍
I avoided applying the suggestion via the GitHub web UI, as it makes a mess of the sign-off, at least for me. The additional commit reverts these changes.
Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>


Related Issues/PRs
Resolve #20602
What changes are proposed in this pull request?
This modifies the invocations call of the scoring server to accept the Parquet content type (`application/x-parquet`) alongside the preexisting CSV and JSON.
The parquet file is directly read into a Pandas dataframe and the parquet input should already contain the proper data schema as expected in the subsequent steps before the actual model prediction call.
The feature is also added to the `_predict` function in the same file, even though I don't see a corresponding test for it, or any use of this method in other places. This commit adds Parquet only as an input type; output stays as JSON.
How is this PR tested?
Does this PR require documentation update?
Does this PR require updating the MLflow Skills repository?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.