docs: document approximate algorithm and Dask-specific params in describe() #12300
…ribe()

The describe() method uses an approximate algorithm (by default) for computing percentiles, which can produce results that differ slightly from pandas. This was undocumented, confusing users who compare results.

Add an explicit docstring to both DataFrame.describe() and Series.describe() that:
- Notes the approximate nature of percentile computation
- Documents the Dask-specific parameter split_every
- Documents the Dask-specific parameter percentiles_method

Resolves: dask#10416
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
21 files ±0, 21 suites ±0, 5h 31m 5s (+1m 41s)
Results for commit 143e153. ± Comparison against base commit 45610ac.
.. note::

   Dask computes percentiles (used for the ``25%``, ``50%``, and
   ``75%`` statistics) using an **approximate algorithm** by default.
   Results may therefore differ slightly from pandas. Use
   ``percentiles_method='dask'`` for the built-in Dask algorithm or
   ``percentiles_method='tdigest'`` for the t-digest algorithm.
   See :meth:`dask.dataframe.DataFrame.quantile` for details.
Why is this in a note block instead of just in the description?
Good point, will move it into the main description. I was overthinking the formatting there.
Good point — moved the text into the main description body and dropped the note block.
percentiles_method : {'default', 'tdigest', 'dask'}, optional
    Method for computing percentiles. ``'default'`` uses the internal
    Dask algorithm. ``'tdigest'`` uses the t-digest algorithm for
    floats and ints and falls back to ``'dask'`` otherwise.
Can we use double quotes for strings?
Yep, sorry about that. Will update to double quotes.
Done, switched to double quotes throughout.
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
- Remove note block, move content into main description
- Use double quotes consistently for string values

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
Summary
Closes #10416
The describe() method silently uses an approximate algorithm for percentile computation (used for the 25%, 50%, 75% statistics). This can produce results that differ from pandas, which confuses users comparing outputs side-by-side.

Previous PRs (#11973, #12288, #12289) attempted to address this in docs/source files. A subsequent review on #12113 requested the fix be placed in the describe() docstring directly. This PR does exactly that.

Changes
Added an explicit docstring to both DataFrame.describe() and Series.describe() that:
- Adds a .. note:: block explaining that percentiles are computed using an approximate algorithm by default, and that results may differ slightly from pandas.
- Documents split_every – a Dask-specific parameter not present in pandas.
- Documents percentiles_method – a Dask-specific parameter not present in pandas, offering 'dask' and 'tdigest' options.

Since both methods use @derived_from(pd.DataFrame/pd.Series), the docstring is prepended to the inherited pandas docstring (following the existing pattern used by e.g. quantile()).

Testing
No new tests needed – this is a documentation-only change. Existing tests continue to pass.
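
For readers wondering why a partition-wise percentile can disagree with an exact one, here is a minimal standard-library sketch. It is an illustration only and is not the algorithm Dask uses: the helper names (exact_quartiles, merged_partition_quartiles) and the "average the per-partition quartiles" merge rule are assumptions made up for this example. It shows only that combining per-chunk summaries generally lands close to, but not exactly on, the quartiles computed over the whole dataset.

```python
import random
import statistics


def exact_quartiles(values):
    # Exact 25%/50%/75% over the full dataset, using linear interpolation
    # between order statistics.
    return statistics.quantiles(values, n=4, method="inclusive")


def merged_partition_quartiles(values, npartitions):
    # Hypothetical merge rule for illustration (NOT Dask's algorithm):
    # compute quartiles per partition, then average them across partitions.
    size = -(-len(values) // npartitions)  # ceiling division
    parts = [values[i:i + size] for i in range(0, len(values), size)]
    per_part = [statistics.quantiles(p, n=4, method="inclusive") for p in parts]
    return [statistics.fmean(q) for q in zip(*per_part)]


random.seed(0)
# Skewed data makes the gap between the two approaches easier to see.
data = [random.expovariate(1.0) for _ in range(10_000)]

exact = exact_quartiles(data)
approx = merged_partition_quartiles(data, npartitions=8)

for label, e, a in zip(("25%", "50%", "75%"), exact, approx):
    print(f"{label}: exact={e:.4f} approx={a:.4f} diff={abs(e - a):.4f}")
```

The printed differences are small but typically nonzero, which is the kind of discrepancy the new docstring text warns about when comparing describe() output against pandas.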