set memory_usage to True if verbose is True in info#8222
Conversation
|
Can one of the admins verify this patch? |
|
ok to test |
|
Thanks for opening this @kinshukdua! It looks good to me. |
|
Hi folks, checking in here. It looks like we are having legit failures on the CI. The problem is two different tests
where we are setting dask/dask/dataframe/tests/test_dataframe.py Line 3435 in dd636df
Where now the expected output is different since @kinshukdua could you adapt the tests to match the new behavior? Let us know if you need some help or have any questions |
|
@ncclementi Hi, sorry for the problems with the tests, I've fixed both the tests and they pass on my computer. Kindly let me know if something else is required. |
|
@kinshukdua no need to apologize, these things are normal : ) thanks for taking care of that. We just need one of the maintainers to approve the CI workflow. cc: @jsignell |
|
Thanks for the ping @ncclementi - tests are running. I'll merge this if tests pass. |
|
Huh. It looks like in old versions of pandas, dask and pandas don't agree on the memory_usage. I think we can just skip those tests for pandas < 1.2. Let me know if you need help with the skip @kinshukdua Failure: _____________________________ test_categorize_info _____________________________
[gw2] linux -- Python 3.7.10 /usr/share/miniconda3/envs/test-environment/bin/python
def test_categorize_info():
# assert that we can call info after categorize
# workaround for: https://github.com/pydata/pandas/issues/14368
from io import StringIO
pandas_format._put_lines = put_lines
df = pd.DataFrame(
{"x": [1, 2, 3, 4], "y": pd.Series(list("aabc")), "z": pd.Series(list("aabc"))},
index=[0, 1, 2, 3],
)
ddf = dd.from_pandas(df, npartitions=4).categorize(["y"])
# Verbose=False
buf = StringIO()
ddf.info(buf=buf, verbose=True)
expected = (
"<class 'dask.dataframe.core.DataFrame'>\n"
"Int64Index: 4 entries, 0 to 3\n"
"Data columns (total 3 columns):\n"
" # Column Non-Null Count Dtype\n"
"--- ------ -------------- -----\n"
" 0 x 4 non-null int64\n"
" 1 y 4 non-null category\n"
" 2 z 4 non-null object\n"
"dtypes: category(1), object(1), int64(1)\n"
"memory usage: 496.0 bytes\n"
)
> assert buf.getvalue() == expected
E assert "<class 'dask...412.0 bytes\n" == "<class 'dask...496.0 bytes\n"
E <class 'dask.dataframe.core.DataFrame'>
E Int64Index: 4 entries, 0 to 3
E Data columns (total 3 columns):
E # Column Non-Null Count Dtype
E --- ------ -------------- -----
E 0 x 4 non-null int64
E 1 y 4 non-null category
E 2 z 4 non-null object
E dtypes: category(1), object(1), int64(1)
E - memory usage: 496.0 bytes
E ? ^^
E + memory usage: 412.0 bytes
E ? ^^
dask/dataframe/tests/test_dataframe.py:3539: AssertionError |
|
@jsignell Done, also tested on multiple version of Pandas to confirm. |
|
@kinshukdua this is looking good, we are almost there. But it looks like the tests are not passing because the skip you added is on a different test than the one with the problem. The test having the memory discrepancy due to pandas version is Also, it seems there is a failure on the precommits, it seems there is one extra space. You can run the precommits on your end if you wish, here are the instructions to set them up. @@ -3480,6 +3480,7 @@ def test_info():
# buf=None
assert ddf.info(buf=None) is None
+
@pytest.mark.skipif(pd.__version__ < "1.2.0", reason="Need newer version of Pandas")
def test_groupby_multilevel_info():
# GH 1844
Error: The process '/opt/hostedtoolcache/Python/3.9.7/x64/bin/pre-commit' failed with exit code 1 |
|
@ncclementi Fixed the function and ran precommits, everything should be fine now |
|
@kinshukdua Thank you very much for working on this. We just need approval from one of the maintainers to run the full CI, but overall this looks great and I think it's ready. I believe the docs failing is an unrelated issue, will let a maintainer comment on that. ping @jsignell |
|
@ncclementi The test that is failing ( |
|
@kinshukdua yes, that seems unrelated. But I was referring to the fail on the docs build @jrbourbeau can you confirm these fails are unrelated? I think this PR is ready to merge. |
|
The docs issue is fixed on main. Thanks for this work @kinshukdua! |
|
Hi, the test also fails with an unexpected memory usage on 32-bit systems (#8169 (comment)) is applied: |
black dask/flake8 dask/isort daskI think, the expected behavior for
dd.info(verbose=True)would be to also return the total memory being used, it would also be in line with pandas and will prevent confusions like issue #8115.This was discussed in the comments of the issue a while back but a PR was never made.