
set memory_usage to True if verbose is True in info #8222

Merged
jsignell merged 5 commits into dask:main from kinshukdua:memory
Oct 18, 2021

Conversation

@kinshukdua
Contributor

I think the expected behavior for dd.info(verbose=True) would be to also report the total memory being used. That would be in line with pandas and would prevent confusion like issue #8115.
This was discussed in the comments of that issue a while back, but a PR was never made.
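For comparison, pandas' own DataFrame.info(verbose=True) already includes a memory usage line by default; a minimal sketch of the behavior this PR brings dask in line with:

```python
import io

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": list("aabc")})

buf = io.StringIO()
df.info(buf=buf, verbose=True)
report = buf.getvalue()

# pandas appends a "memory usage: ..." line to the verbose report
# unless memory_usage=False is passed explicitly.
print("memory usage:" in report)  # → True
```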

@GPUtester
Collaborator

Can one of the admins verify this patch?

@jsignell
Member

jsignell commented Oct 5, 2021

ok to test

@jsignell
Member

jsignell commented Oct 5, 2021

Thanks for opening this @kinshukdua! It looks good to me.

@ncclementi
Member

Hi folks, checking in here. It looks like we are having legit failures on the CI.

The problem is in two different tests:

  1. In dask/dataframe/tests/test_dataframe, the test test_groupby_multilevel_info(), specifically in

    _assert_info(g.compute(), g, memory_usage=False)

where we are setting memory_usage=False, which causes trouble on this line:

ddf.info(buf=buf_da, verbose=True, memory_usage=memory_usage)

  2. In dask/dataframe/tests/test_dataframe, the test test_categorize_info(), specifically in
    ddf.info(buf=buf, verbose=True)
    expected = (
    "<class 'dask.dataframe.core.DataFrame'>\n"
    "Int64Index: 4 entries, 0 to 3\n"
    "Data columns (total 3 columns):\n"
    " # Column Non-Null Count Dtype\n"
    "--- ------ -------------- -----\n"
    " 0 x 4 non-null int64\n"
    " 1 y 4 non-null category\n"
    " 2 z 4 non-null object\n"
    "dtypes: category(1), object(1), int64(1)"
    )
    assert buf.getvalue() == expected

where the expected output is now different, since verbose=True would include the memory info.

@kinshukdua could you adapt the tests to match the new behavior? Let us know if you need some help or have any questions.

@kinshukdua
Contributor Author

@ncclementi Hi, sorry for the problems with the tests. I've fixed both tests and they pass on my machine. Kindly let me know if anything else is required.

@ncclementi
Member

@kinshukdua no need to apologize, these things are normal : ) thanks for taking care of that. We just need one of the maintainers to approve the CI workflow.

cc: @jsignell

@jsignell
Member

Thanks for the ping @ncclementi - tests are running. I'll merge this if tests pass.

@jsignell
Member

Huh. It looks like with old versions of pandas, dask and pandas don't agree on memory_usage. I think we can just skip those tests for pandas < 1.2. Let me know if you need help with the skip, @kinshukdua.

Failure:

_____________________________ test_categorize_info _____________________________
[gw2] linux -- Python 3.7.10 /usr/share/miniconda3/envs/test-environment/bin/python

    def test_categorize_info():
        # assert that we can call info after categorize
        # workaround for: https://github.com/pydata/pandas/issues/14368
        from io import StringIO
    
        pandas_format._put_lines = put_lines
    
        df = pd.DataFrame(
            {"x": [1, 2, 3, 4], "y": pd.Series(list("aabc")), "z": pd.Series(list("aabc"))},
            index=[0, 1, 2, 3],
        )
        ddf = dd.from_pandas(df, npartitions=4).categorize(["y"])
    
        # Verbose=False
        buf = StringIO()
        ddf.info(buf=buf, verbose=True)
        expected = (
            "<class 'dask.dataframe.core.DataFrame'>\n"
            "Int64Index: 4 entries, 0 to 3\n"
            "Data columns (total 3 columns):\n"
            " #   Column  Non-Null Count  Dtype\n"
            "---  ------  --------------  -----\n"
            " 0   x       4 non-null      int64\n"
            " 1   y       4 non-null      category\n"
            " 2   z       4 non-null      object\n"
            "dtypes: category(1), object(1), int64(1)\n"
            "memory usage: 496.0 bytes\n"
        )
>       assert buf.getvalue() == expected
E       assert "<class 'dask...412.0 bytes\n" == "<class 'dask...496.0 bytes\n"
E           <class 'dask.dataframe.core.DataFrame'>
E           Int64Index: 4 entries, 0 to 3
E           Data columns (total 3 columns):
E            #   Column  Non-Null Count  Dtype
E           ---  ------  --------------  -----
E            0   x       4 non-null      int64
E            1   y       4 non-null      category
E            2   z       4 non-null      object
E           dtypes: category(1), object(1), int64(1)
E         - memory usage: 496.0 bytes
E         ?                ^^
E         + memory usage: 412.0 bytes
E         ?                ^^

dask/dataframe/tests/test_dataframe.py:3539: AssertionError

@kinshukdua
Contributor Author

@jsignell Done, also tested on multiple versions of pandas to confirm.

@ncclementi
Member

@kinshukdua this is looking good, we are almost there. But it looks like the tests are not passing because the skip you added is on a different test than the one with the problem. The test with the memory discrepancy due to the pandas version is test_categorize_info, but you put the skip on test_groupby_multilevel_info. Would you be able to change it, please?

Also, there is a failure on the pre-commit hooks; it seems there is one extra space. You can run pre-commit on your end if you wish; here are the instructions to set it up.

@@ -3480,6 +3480,7 @@ def test_info():
     # buf=None
     assert ddf.info(buf=None) is None
 
+
 @pytest.mark.skipif(pd.__version__ < "1.2.0", reason="Need newer version of Pandas")
 def test_groupby_multilevel_info():
     # GH 1844
Error: The process '/opt/hostedtoolcache/Python/3.9.7/x64/bin/pre-commit' failed with exit code 1
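As an aside, the version check in the diff above compares raw strings, which is lexicographic: a hypothetical pandas "1.10.0" would sort before "1.2.0". A more robust sketch of the same skip parses the versions first (illustrative; dask's test suite exposes a PANDAS_GT_120-style constant for this):

```python
import pandas as pd
import pytest
from packaging.version import Version

# Compare parsed versions, not raw strings: "1.10.0" < "1.2.0" is True
# as strings but False as versions.
PANDAS_GT_120 = Version(pd.__version__) >= Version("1.2.0")


@pytest.mark.skipif(not PANDAS_GT_120, reason="Need newer version of pandas")
def test_groupby_multilevel_info():
    ...
```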

@kinshukdua
Contributor Author

@ncclementi Fixed the function and ran pre-commit; everything should be fine now.

@ncclementi
Member

@kinshukdua Thank you very much for working on this. We just need approval from one of the maintainers to run the full CI, but overall this looks great and I think it's ready. I believe the docs failure is an unrelated issue; I'll let a maintainer comment on that.

ping @jsignell

@kinshukdua
Contributor Author

@ncclementi The test that is failing (test_scheduler_highlevel_graph_unpack_import in test_layers.py) seems to be unrelated to this PR.

@ncclementi
Member

@kinshukdua yes, that seems unrelated. But I was referring to the failure on the docs build docs/readthedocs.org:dask, which is throwing a "Command killed due to excessive memory consumption".

@jrbourbeau can you confirm these failures are unrelated? I think this PR is ready to merge.

@jsignell
Member

The docs issue is fixed on main. Thanks for this work @kinshukdua!

@jsignell jsignell merged commit a06ee6a into dask:main Oct 18, 2021
@bnavigator
Contributor

Hi, the test also fails with an unexpected memory usage on 32-bit systems when #8169 (comment) is applied:

[  287s] python39-pandas-1.3.5-67.6            ########################################
...
[  361s] ============================= test session starts ==============================
[  361s] platform linux -- Python 3.9.9, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /usr/bin/python3.9
[  361s] cachedir: .pytest_cache
[  361s] rootdir: /home/abuild/rpmbuild/BUILD/dask-2022.1.0, configfile: setup.cfg
[  361s] plugins: forked-1.3.0, xdist-2.2.0, rerunfailures-9.1.1
...
[ 1356s] _____________________________ test_categorize_info _____________________________
[ 1356s] [gw3] linux -- Python 3.9.9 /usr/bin/python3.9
[ 1356s] 
[ 1356s]     @pytest.mark.skipif(not PANDAS_GT_120, reason="need newer version of Pandas")
[ 1356s]     def test_categorize_info():
[ 1356s]         # assert that we can call info after categorize
[ 1356s]         # workaround for: https://github.com/pydata/pandas/issues/14368
[ 1356s]         from io import StringIO
[ 1356s]     
[ 1356s]         pandas_format._put_lines = put_lines
[ 1356s]     
[ 1356s]         df = pd.DataFrame(
[ 1356s]             {"x": [1, 2, 3, 4], "y": pd.Series(list("aabc")), "z": pd.Series(list("aabc"))},
[ 1356s]             index=[0, 1, 2, 3],
[ 1356s]         )
[ 1356s]         ddf = dd.from_pandas(df, npartitions=4).categorize(["y"])
[ 1356s]     
[ 1356s]         # Verbose=False
[ 1356s]         buf = StringIO()
[ 1356s]         ddf.info(buf=buf, verbose=True)
[ 1356s]         expected = (
[ 1356s]             "<class 'dask.dataframe.core.DataFrame'>\n"
[ 1356s]             "Int64Index: 4 entries, 0 to 3\n"
[ 1356s]             "Data columns (total 3 columns):\n"
[ 1356s]             " #   Column  Non-Null Count  Dtype\n"
[ 1356s]             "---  ------  --------------  -----\n"
[ 1356s]             " 0   x       4 non-null      int64\n"
[ 1356s]             " 1   y       4 non-null      category\n"
[ 1356s]             " 2   z       4 non-null      object\n"
[ 1356s]             "dtypes: category(1), object(1), int64(1)\n"
[ 1356s]             "memory usage: 496.0 bytes\n"
[ 1356s]         )
[ 1356s] >       assert buf.getvalue() == expected
[ 1356s] E       assert ("<class 'dask.dataframe.core.DataFrame'>\n"\n 'Int64Index: 4 entries, 0 to 3\n'\n 'Data columns (total 3 columns):\n'\n ' #   Column  Non-Null Count  Dtype\n'\n '---  ------  --------------  -----\n'\n ' 0   x       4 non-null      int64\n'\n ' 1   y       4 non-null      category\n'\n ' 2   z       4 non-null      object\n'\n 'dtypes: category(1), object(1), int64(1)\n'\n 'memory usage: 312.0 bytes\n') == ("<class 'dask.dataframe.core.DataFrame'>\n"\n 'Int64Index: 4 entries, 0 to 3\n'\n 'Data columns (total 3 columns):\n'\n ' #   Column  Non-Null Count  Dtype\n'\n '---  ------  --------------  -----\n'\n ' 0   x       4 non-null      int64\n'\n ' 1   y       4 non-null      category\n'\n ' 2   z       4 non-null      object\n'\n 'dtypes: category(1), object(1), int64(1)\n'\n 'memory usage: 496.0 bytes\n')
[ 1356s] E           <class 'dask.dataframe.core.DataFrame'>
[ 1356s] E           Int64Index: 4 entries, 0 to 3
[ 1356s] E           Data columns (total 3 columns):
[ 1356s] E            #   Column  Non-Null Count  Dtype
[ 1356s] E           ---  ------  --------------  -----
[ 1356s] E            0   x       4 non-null      int64
[ 1356s] E            1   y       4 non-null      category
[ 1356s] E            2   z       4 non-null      object
[ 1356s] E           dtypes: category(1), object(1), int64(1)
[ 1356s] E         - memory usage: 496.0 bytes
[ 1356s] E         ?               ^^^
[ 1356s] E         + memory usage: 312.0 bytes
[ 1356s] E         ?               ^^^
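One way to make such an assertion platform-independent would be to derive the expected byte count from pandas itself rather than hardcoding 496/412/312, which shift with the pandas version and with pointer width on 32-bit builds. A sketch against plain pandas (illustrative only; the dask test would still need its own expected-output handling):

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": pd.Categorical(list("aabc")), "z": list("aabc")},
    index=[0, 1, 2, 3],
)

buf = io.StringIO()
df.info(buf=buf, verbose=True)

# Derive the figure from the same pandas build that produced the report,
# instead of hardcoding a byte count that varies by platform and version.
expected_bytes = df.memory_usage(index=True, deep=False).sum()
assert f"memory usage: {expected_bytes:.1f}" in buf.getvalue()
```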



Successfully merging this pull request may close these issues.

how to get info (total memory usage) of a dask.dataframe?

5 participants