Enable parquet content_type in the scoring server input for pyfunc#20630
Conversation
This modifies the invocations call of the scoring server to accept the Parquet content type (`application/x-parquet`) alongside the preexisting CSV and JSON. The Parquet file is read directly into a pandas DataFrame, and the Parquet input should already contain the proper data schema expected by the subsequent steps before the actual model prediction call. This commit adds Parquet only as an input type; output stays as JSON. Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>
In the most basic example the server still works as expected for CSV and JSON test inputs. I've converted our example CSV into Parquet format, and that now works just as well; I get similar output when there's an additional column in the input file, and so on.

If I intentionally malform the Parquet file, the error traceback is still surfaced. We also still get proper feedback if the Parquet file has a dtype mismatch, or if a column is missing.

For a size and ingestion comparison I created a pandas DataFrame of around 50 MB in-memory size, which came out to a shape of (15000, 40). Saved side by side as Parquet and CSV, the Parquet file is roughly half the size of the CSV. That is a significant difference when we talk about batch inference processing.

A simple curl with the larger Parquet file takes about 0.5 seconds (keeping in mind this is all localhost testing, so networking overhead is excluded). The same run with the CSV file takes a full 2 seconds. On this scale the gap may seem small, but for batch transform cases it is a noticeable change.

Additionally, there may be general restrictions on the maximum size of a single request body for an inference endpoint; SageMaker Batch Transform, for example, has MaxPayloadInMB. While SageMaker provides a "Split Type" option so that it takes care of fragmenting larger CSV requests into smaller pieces, one has to do that manually when processing bigger batches of data in Parquet form, so there's a requirement on the user side to partition the files during the data preprocessing step. Still, at least for my use case this is a worthwhile change for larger-scale data inference runs.
Documentation preview for e6d6f65 is available.
Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>
The implementation looks good! Could you add an E2E test?
Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>
```python
elif content_type == "parquet":
    # For a direct stdin stream we need to read through the entire text buffer first
    # before converting it from a TextIO stream into a BytesIO stream for Pandas.
    # A seek that pyarrow engine will try to do with Parquet for sys.stdin input
    # will get forbidden otherwise.
    df = (
        parse_parquet_input(input_path)
        if input_path is not None
        else parse_parquet_input(BytesIO(sys.stdin.buffer.read()))
    )
    params = None
```
Actually this API is used for mlflow.models.predict, and to support that we need more changes. I think it's fine to support scoring server for now, let's revert this part of changes :)
Sure, makes sense. 👍
I avoided applying the suggestion via the GitHub web UI, as it makes a mess of the sign-off, at least for me. The additional commit reverts these changes.
Signed-off-by: Paweł Rutkowski <prutkowski994@gmail.com>


Related Issues/PRs
Resolve #20602
What changes are proposed in this pull request?
This modifies the invocations call of the scoring server to accept the Parquet content type (`application/x-parquet`) alongside the preexisting CSV and JSON.
The parquet file is directly read into a Pandas dataframe and the parquet input should already contain the proper data schema as expected in the subsequent steps before the actual model prediction call.
The feature is also added to the `_predict` function in the same file, even though I don't see a corresponding test for it, or any use of this method in other places. This commit adds Parquet only as an input type; output stays as JSON.
How is this PR tested?
Does this PR require documentation update?
Does this PR require updating the MLflow Skills repository?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.