
Enable parquet content_type in the scoring server input for pyfunc#20630

Merged
serena-ruan merged 4 commits into mlflow:master from TFK1410:feature/enable-parquet-in-scoring-server-input
Feb 11, 2026

Conversation

@TFK1410 (Contributor) commented Feb 6, 2026

Related Issues/PRs

Resolve #20602

What changes are proposed in this pull request?

This modifies the invocations endpoint of the scoring server to accept the Parquet content type (application/x-parquet) alongside the preexisting CSV and JSON.

The Parquet file is read directly into a pandas DataFrame, and the Parquet input should already carry the data schema expected by the subsequent steps before the actual model prediction call.

The feature is also added to the _predict function in the same file, even though I don't see a corresponding test for it, or any use of this method elsewhere.

This commit adds parquet only as an input type. Output stays as JSON.

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Does this PR require updating the MLflow Skills repository?

  • No. You can skip the rest of this section.
  • Yes. Please link the corresponding PR or explain how you plan to update it.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)


Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>
github-actions bot commented Feb 6, 2026

🛠 DevTools 🛠

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20630/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20630/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/20630/merge

@github-actions github-actions bot added the area/scoring (MLflow Model server, model deployment tools, Spark UDFs) and rn/feature (Mention under Features in Changelogs) labels Feb 6, 2026
@TFK1410 (Contributor, Author) commented Feb 9, 2026

In the most basic example, the server still works as expected for CSV and JSON test inputs:

$ curl -H "Content-Type:text/csv" --data-binary '@/opt/ml/model/test.csv' http://localhost:8000/invocations
{"predictions": [0]}
$ curl -H "Content-Type:application/json" --data '@/opt/ml/model/serving_input_example.json' http://localhost:8000/invocations
{"predictions": [0, 0, 0, 0, 0]}

I've converted our example CSV into Parquet format. This also works just fine:

$ curl -H "Content-Type:application/x-parquet" --data-binary '@/opt/ml/model/test.parquet' http://localhost:8000/invocations
{"predictions": [0]}

The output is similar when the input file contains an additional column.

If I intentionally malform the Parquet file, the error traceback is returned as expected:

$ curl -H "Content-Type:application/x-parquet" --data-binary '@/opt/ml/model/test-malform.parquet' http://localhost:8000/invocations
{"error_code": "BAD_REQUEST", "message": "Failed to parse input as a Pandas DataFrame. Ensure that the input is a valid Parquet Pandas DataFrame produced using the `pandas.DataFrame.to_parquet()` method. Error: 'Could not open Parquet input source '<Buffer>': Couldn't deserialize thrift: don't know what type: \u000f\n'", "stack_trace": "Traceback (most recent call last):\n  File \"/opt/.pyenv/versions/3.12.9/lib/python3.12/site-packages/mlflow/pyfunc/scoring_server/__init__.py\", line 275, in parse_parquet_input\n    return pd.read_parquet(parquet_input)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/.pyenv/versions/3.12.9/lib/python3.12/site-packages/pandas/io/parquet.py\", line 667, in read_parquet\n    return impl.read(\n           ^^^^^^^^^^\n  File \"/opt/.pyenv/versions/3.12.9/lib/python3.12/site-packages/pandas/io/parquet.py\", line 274, in read\n    pa_table = self.api.parquet.read_table(\n               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/.pyenv/versions/3.12.9/lib/python3.12/site-packages/pyarrow/parquet/core.py\", line 1858, in read_table\n    dataset = ParquetDataset(\n              ^^^^^^^^^^^^^^^\n  File \"/opt/.pyenv/versions/3.12.9/lib/python3.12/site-packages/pyarrow/parquet/core.py\", line 1427, in __init__\n    [fragment], schema=schema or fragment.physical_schema,\n                                 ^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"pyarrow/_dataset.pyx\", line 1477, in pyarrow._dataset.Fragment.physical_schema.__get__\n  File \"pyarrow/error.pxi\", line 155, in pyarrow.lib.pyarrow_internal_check_status\n  File \"pyarrow/error.pxi\", line 92, in pyarrow.lib.check_status\nOSError: Could not open Parquet input source '<Buffer>': Couldn't deserialize thrift: don't know what type: \u000f\n\n"}

We still get proper feedback if the Parquet file has a dtype mismatch:

$ curl -H "Content-Type:application/x-parquet" --data-binary '@/opt/ml/model/test-wrong-dtype.parquet' http://localhost:8000/invocations
{"error_code": "INVALID_PARAMETER_VALUE", "message": "Failed to predict data '..., 'age': double (required)]'. Error: Incompatible input types for column age. Can not safely convert int64 to float64."}

and if a column is missing:

$ curl -H "Content-Type:application/x-parquet" --data-binary '@/opt/ml/model/part-00000-missing-age-column.parquet' http://localhost:8000/invocations
{"error_code": "INVALID_PARAMETER_VALUE", "message": "Failed to predict data '...[1 rows x 39 columns]'. \nError: Failed to enforce schema of data '...[1 rows x 39 columns]' with schema '... 'age': double (required)]'. Error: Model is missing inputs ['age']. Note that there were extra inputs: ['mp_id']."}

For a size and ingestion comparison, I've created a pandas DataFrame of around 50 MB in-memory size. This came out to a DataFrame with a shape of (15000, 40). Saved to Parquet and CSV side by side, the file sizes compare as follows:

# stat -c "%s %n" -- bigger.*
115776974 bigger.csv
60857034 bigger.parquet

That's nearly a 2x difference in file size, which is significant for batch inference processing.

A simple curl with the bigger Parquet file takes about 0.5 seconds (keep in mind this is all localhost testing, so networking overhead is excluded).

$ time curl -H "Content-Type:application/x-parquet" --data-binary '@/opt/ml/model/bigger.parquet' http://localhost:8000/invocations
...
real    0m0.527s
user    0m0.004s
sys     0m0.085s

The same run with the CSV file takes 2 full seconds:

$ time curl -H "Content-Type:text/csv" --data-binary '@/opt/ml/model/bigger.csv' http://localhost:8000/invocations
...
real    0m2.160s
user    0m0.014s
sys     0m0.123s

At this scale it might seem small, but for batch transform cases this is a noticeable difference.

Additionally, there may be general restrictions on the maximum size of a single request payload for the inference endpoint. For example, SageMaker Batch Transform has the MaxPayloadInMB limit. SageMaker does provide a "Split Type" option so that it takes care of fragmenting bigger requests into smaller pieces for CSV files, but for Parquet one has to do that themselves when processing bigger batches of data. So there's a requirement on the user side to partition the files during the data preprocessing step. Still, at least for my use case, this is a worthwhile change for larger-scale data inference runs.

@serena-ruan serena-ruan self-requested a review February 10, 2026 04:53
github-actions bot commented Feb 10, 2026

Documentation preview for e6d6f65 is available; the preview is updated when a new commit is pushed to this PR.

Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>
@serena-ruan (Collaborator)

The implementation looks good! Could you add an E2E test?

Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>
@TFK1410 (Contributor, Author) commented Feb 10, 2026

Sure!

I've now added a test_parse_parquet_input() test scenario, in a similar vein to the test_parse_json_input_including_path() case, and it covers the additional lines that I've added to utils.py in the testing directory.

I hope the scope of this test scenario covers what's needed for E2E. Let me know if there's anything more in terms of E2E scenarios that I could provide.

In the meantime I've also rebuilt my container environment.

All the previous calls work just as before. For the "old" content type that I had set, the request now returns a 415 error code with the following message:

$ curl -H "Content-Type:application/x-parquet" --data-binary '@/opt/ml/model/test.parquet' http://localhost:8000/invocations
This predictor only supports the following content types: Types: ['text/csv', 'application/json', 'application/vnd.apache.parquet'].

The list of supported types is correct. Let's adjust the content type we send:

$ curl -H "Content-Type:application/vnd.apache.parquet" --data-binary '@/opt/ml/model/part-00000.parquet' http://localhost:8000/invocations
{"predictions": [0]}

All other error case scenarios still work the same as last time:

$ curl -H "Content-Type:application/vnd.apache.parquet" --data-binary '@/opt/ml/model/test-malform.parquet' http://localhost:8000/invocations
{"error_code": "BAD_REQUEST", "message": "Failed to parse input as a Pandas DataFrame. Ensure that the input is a valid Parquet Pandas DataFrame produced using the `pandas.DataFrame.to_parquet()` method. Error: 'Could not open Parquet input source '<Buffer>': Couldn't deserialize thrift: don't know what type: \u000f\n'", "stack_trace": "Traceback (most recent call last):\n  File \"/opt/.pyenv/versions/3.12.9/lib/python3.12/site-packages/mlflow/pyfunc/scoring_server/__init__.py\", line 275, in parse_parquet_input\n    return pd.read_parquet(parquet_input)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/.pyenv/versions/3.12.9/lib/python3.12/site-packages/pandas/io/parquet.py\", line 667, in read_parquet\n    return impl.read(\n           ^^^^^^^^^^\n  File \"/opt/.pyenv/versions/3.12.9/lib/python3.12/site-packages/pandas/io/parquet.py\", line 274, in read\n    pa_table = self.api.parquet.read_table(\n               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/opt/.pyenv/versions/3.12.9/lib/python3.12/site-packages/pyarrow/parquet/core.py\", line 1858, in read_table\n    dataset = ParquetDataset(\n              ^^^^^^^^^^^^^^^\n  File \"/opt/.pyenv/versions/3.12.9/lib/python3.12/site-packages/pyarrow/parquet/core.py\", line 1427, in __init__\n    [fragment], schema=schema or fragment.physical_schema,\n                                 ^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"pyarrow/_dataset.pyx\", line 1477, in pyarrow._dataset.Fragment.physical_schema.__get__\n  File \"pyarrow/error.pxi\", line 155, in pyarrow.lib.pyarrow_internal_check_status\n  File \"pyarrow/error.pxi\", line 92, in pyarrow.lib.check_status\nOSError: Could not open Parquet input source '<Buffer>': Couldn't deserialize thrift: don't know what type: \u000f\n\n"}

$ curl -H "Content-Type:application/vnd.apache.parquet" --data-binary '@/opt/ml/model/test-wrong-dtype.parquet' http://localhost:8000/invocations
{"error_code": "INVALID_PARAMETER_VALUE", "message": "Failed to predict data '..., 'age': double (required)]'. Error: Incompatible input types for column age. Can not safely convert int64 to float64."}

$ curl -H "Content-Type:application/vnd.apache.parquet" --data-binary '@/opt/ml/model/part-00000-missing-age-column.parquet' http://localhost:8000/invocations
{"error_code": "INVALID_PARAMETER_VALUE", "message": "Failed to predict data '...[1 rows x 39 columns]'. \nError: Failed to enforce schema of data '...[1 rows x 39 columns]' with schema '... 'age': double (required)]'. Error: Model is missing inputs ['age']. Note that there were extra inputs: ['mp_id']."}

Additionally, I can provide the SageMaker Batch Transform confirmation.

The deployable model definition in SageMaker happily accepts my custom-built image with this PR's version of MLflow. From that we create a batch transform run, which completes successfully (screenshot omitted), and it happily takes the Content type parameter set to application/vnd.apache.parquet (screenshot omitted).

As mentioned earlier, since Parquet is a binary format we can't use the Split type, hence I've set it to None.

Comment on lines +578 to +588
elif content_type == "parquet":
# For a direct stdin stream we need to read through the entire text buffer first
# before converting it from a TextIO stream into a BytesIO stream for Pandas.
# A seek that pyarrow engine will try to do with Parquet for sys.stdin input
# will get forbidden otherwise.
df = (
parse_parquet_input(input_path)
if input_path is not None
else parse_parquet_input(BytesIO(sys.stdin.buffer.read()))
)
params = None
Collaborator:


Actually this API is used for mlflow.models.predict, and to support that we need more changes. I think it's fine to support the scoring server for now; let's revert this part of the changes :)

Contributor (Author):

Sure, makes sense. 👍

I avoided applying the suggestion via the GitHub web UI, as it makes a mess of the sign-off, at least for me. The additional commit reverts these changes.

@serena-ruan (Collaborator) left a comment:

LGTM!

Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>
@serena-ruan serena-ruan added this pull request to the merge queue Feb 11, 2026
Merged via the queue into mlflow:master with commit f19747c Feb 11, 2026
50 of 52 checks passed
daniellok-db pushed a commit to daniellok-db/mlflow that referenced this pull request Feb 20, 2026
…lflow#20630)

Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>
daniellok-db pushed a commit that referenced this pull request Feb 20, 2026
…20630)

Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>

Labels

area/scoring (MLflow Model server, model deployment tools, Spark UDFs), rn/feature (Mention under Features in Changelogs), v3.10.0


Development

Successfully merging this pull request may close these issues.

[FR] Enable support for the application/x-parquet MIME type in the pyfunc scoring server input
