
set memory_usage to True if verbose is True in info #8222

Merged
jsignell merged 5 commits into dask:main from kinshukdua:memory
Oct 18, 2021

Conversation

@kinshukdua
Contributor

I think the expected behavior for dd.info(verbose=True) would be to also report the total memory being used. That would be in line with pandas and would prevent confusion like issue #8115.
This was discussed in the comments of that issue a while back, but a PR was never made.
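For comparison, pandas' own DataFrame.info(verbose=True) already includes a memory usage line by default; a minimal sketch of the behavior this PR brings dask in line with:

```python
import io

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": list("aabc")})

buf = io.StringIO()
df.info(buf=buf, verbose=True)
report = buf.getvalue()

# pandas appends a "memory usage: ..." line to the verbose report
# unless memory_usage=False is passed explicitly.
print("memory usage:" in report)  # → True
```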

@GPUtester
Collaborator

Can one of the admins verify this patch?

@jsignell
Member

jsignell commented Oct 5, 2021

ok to test

@jsignell
Member

jsignell commented Oct 5, 2021

Thanks for opening this @kinshukdua! It looks good to me.

@ncclementi
Member

Hi folks, checking in here. It looks like we are having legit failures on the CI.

The problem is in two different tests:

  1. In dask/dataframe/tests/test_dataframe, the test test_groupby_multilevel_info(), specifically in

    _assert_info(g.compute(), g, memory_usage=False)

where we are setting memory_usage=False, which causes trouble on this line:

ddf.info(buf=buf_da, verbose=True, memory_usage=memory_usage)

  2. In dask/dataframe/tests/test_dataframe, the test test_categorize_info(), specifically in
    ddf.info(buf=buf, verbose=True)
    expected = (
    "<class 'dask.dataframe.core.DataFrame'>\n"
    "Int64Index: 4 entries, 0 to 3\n"
    "Data columns (total 3 columns):\n"
    " # Column Non-Null Count Dtype\n"
    "--- ------ -------------- -----\n"
    " 0 x 4 non-null int64\n"
    " 1 y 4 non-null category\n"
    " 2 z 4 non-null object\n"
    "dtypes: category(1), object(1), int64(1)"
    )
    assert buf.getvalue() == expected

where the expected output is now different, since verbose=True would include the memory info.

@kinshukdua could you adapt the tests to match the new behavior? Let us know if you need some help or have any questions.

@kinshukdua
Contributor Author

@ncclementi Hi, sorry for the problems with the tests. I've fixed both tests and they pass on my machine. Kindly let me know if anything else is required.

@ncclementi
Member

@kinshukdua no need to apologize, these things are normal : ) thanks for taking care of that. We just need one of the maintainers to approve the CI workflow.

cc: @jsignell

@jsignell
Member

Thanks for the ping @ncclementi - tests are running. I'll merge this if tests pass.

@jsignell
Member

Huh. It looks like with old versions of pandas, dask and pandas don't agree on memory_usage. I think we can just skip those tests for pandas < 1.2. Let me know if you need help with the skip, @kinshukdua.

Failure:

_____________________________ test_categorize_info _____________________________
[gw2] linux -- Python 3.7.10 /usr/share/miniconda3/envs/test-environment/bin/python

    def test_categorize_info():
        # assert that we can call info after categorize
        # workaround for: https://github.com/pydata/pandas/issues/14368
        from io import StringIO
    
        pandas_format._put_lines = put_lines
    
        df = pd.DataFrame(
            {"x": [1, 2, 3, 4], "y": pd.Series(list("aabc")), "z": pd.Series(list("aabc"))},
            index=[0, 1, 2, 3],
        )
        ddf = dd.from_pandas(df, npartitions=4).categorize(["y"])
    
        # Verbose=False
        buf = StringIO()
        ddf.info(buf=buf, verbose=True)
        expected = (
            "<class 'dask.dataframe.core.DataFrame'>\n"
            "Int64Index: 4 entries, 0 to 3\n"
            "Data columns (total 3 columns):\n"
            " #   Column  Non-Null Count  Dtype\n"
            "---  ------  --------------  -----\n"
            " 0   x       4 non-null      int64\n"
            " 1   y       4 non-null      category\n"
            " 2   z       4 non-null      object\n"
            "dtypes: category(1), object(1), int64(1)\n"
            "memory usage: 496.0 bytes\n"
        )
>       assert buf.getvalue() == expected
E       assert "<class 'dask...412.0 bytes\n" == "<class 'dask...496.0 bytes\n"
E           <class 'dask.dataframe.core.DataFrame'>
E           Int64Index: 4 entries, 0 to 3
E           Data columns (total 3 columns):
E            #   Column  Non-Null Count  Dtype
E           ---  ------  --------------  -----
E            0   x       4 non-null      int64
E            1   y       4 non-null      category
E            2   z       4 non-null      object
E           dtypes: category(1), object(1), int64(1)
E         - memory usage: 496.0 bytes
E         ?                ^^
E         + memory usage: 412.0 bytes
E         ?                ^^

dask/dataframe/tests/test_dataframe.py:3539: AssertionError

@kinshukdua
Contributor Author

@jsignell Done, also tested on multiple versions of pandas to confirm.

@ncclementi
Member

@kinshukdua this is looking good, we are almost there. But it looks like the tests are not passing because the skip you added is on a different test than the one with the problem. The test with the memory discrepancy due to the pandas version is test_categorize_info, but you put the skip on test_groupby_multilevel_info. Would you be able to change it, please?

Also, there is a failure on the pre-commit hooks; it seems there is one extra space. You can run pre-commit on your end if you wish; here are the instructions to set it up.

@@ -3480,6 +3480,7 @@ def test_info():
     # buf=None
     assert ddf.info(buf=None) is None
 
+
 @pytest.mark.skipif(pd.__version__ < "1.2.0", reason="Need newer version of Pandas")
 def test_groupby_multilevel_info():
     # GH 1844
Error: The process '/opt/hostedtoolcache/Python/3.9.7/x64/bin/pre-commit' failed with exit code 1
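As an aside, the version check in the diff above compares raw strings, which is lexicographic: a hypothetical pandas "1.10.0" would sort before "1.2.0". A more robust sketch of the same skip parses the versions first (illustrative; dask's test suite exposes a PANDAS_GT_120-style constant for this):

```python
import pandas as pd
import pytest
from packaging.version import Version

# Compare parsed versions, not raw strings: "1.10.0" < "1.2.0" is True
# as strings but False as versions.
PANDAS_GT_120 = Version(pd.__version__) >= Version("1.2.0")


@pytest.mark.skipif(not PANDAS_GT_120, reason="Need newer version of pandas")
def test_groupby_multilevel_info():
    ...
```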

@kinshukdua
Contributor Author

@ncclementi Fixed the function and ran pre-commit; everything should be fine now.

@ncclementi
Member

@kinshukdua Thank you very much for working on this. We just need approval from one of the maintainers to run the full CI, but overall this looks great and I think it's ready. I believe the docs failure is an unrelated issue; I'll let a maintainer comment on that.

ping @jsignell

@kinshukdua
Contributor Author

@ncclementi The test that is failing (test_scheduler_highlevel_graph_unpack_import in test_layers.py) seems to be unrelated to this PR.

@ncclementi
Member

@kinshukdua yes, that seems unrelated. But I was referring to the failure on the docs build docs/readthedocs.org:dask, which is throwing a "Command killed due to excessive memory consumption".

@jrbourbeau can you confirm these failures are unrelated? I think this PR is ready to merge.

@jsignell
Member

The docs issue is fixed on main. Thanks for this work @kinshukdua!

@jsignell jsignell merged commit a06ee6a into dask:main Oct 18, 2021
@bnavigator
Contributor

Hi, the test also fails with an unexpected memory usage on 32-bit systems when #8169 (comment) is applied:

[  287s] python39-pandas-1.3.5-67.6            ########################################
...
[  361s] ============================= test session starts ==============================
[  361s] platform linux -- Python 3.9.9, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /usr/bin/python3.9
[  361s] cachedir: .pytest_cache
[  361s] rootdir: /home/abuild/rpmbuild/BUILD/dask-2022.1.0, configfile: setup.cfg
[  361s] plugins: forked-1.3.0, xdist-2.2.0, rerunfailures-9.1.1
...
[ 1356s] _____________________________ test_categorize_info _____________________________
[ 1356s] [gw3] linux -- Python 3.9.9 /usr/bin/python3.9
[ 1356s] 
[ 1356s]     @pytest.mark.skipif(not PANDAS_GT_120, reason="need newer version of Pandas")
[ 1356s]     def test_categorize_info():
[ 1356s]         # assert that we can call info after categorize
[ 1356s]         # workaround for: https://github.com/pydata/pandas/issues/14368
[ 1356s]         from io import StringIO
[ 1356s]     
[ 1356s]         pandas_format._put_lines = put_lines
[ 1356s]     
[ 1356s]         df = pd.DataFrame(
[ 1356s]             {"x": [1, 2, 3, 4], "y": pd.Series(list("aabc")), "z": pd.Series(list("aabc"))},
[ 1356s]             index=[0, 1, 2, 3],
[ 1356s]         )
[ 1356s]         ddf = dd.from_pandas(df, npartitions=4).categorize(["y"])
[ 1356s]     
[ 1356s]         # Verbose=False
[ 1356s]         buf = StringIO()
[ 1356s]         ddf.info(buf=buf, verbose=True)
[ 1356s]         expected = (
[ 1356s]             "<class 'dask.dataframe.core.DataFrame'>\n"
[ 1356s]             "Int64Index: 4 entries, 0 to 3\n"
[ 1356s]             "Data columns (total 3 columns):\n"
[ 1356s]             " #   Column  Non-Null Count  Dtype\n"
[ 1356s]             "---  ------  --------------  -----\n"
[ 1356s]             " 0   x       4 non-null      int64\n"
[ 1356s]             " 1   y       4 non-null      category\n"
[ 1356s]             " 2   z       4 non-null      object\n"
[ 1356s]             "dtypes: category(1), object(1), int64(1)\n"
[ 1356s]             "memory usage: 496.0 bytes\n"
[ 1356s]         )
[ 1356s] >       assert buf.getvalue() == expected
[ 1356s] E       assert ("<class 'dask.dataframe.core.DataFrame'>\n"\n 'Int64Index: 4 entries, 0 to 3\n'\n 'Data columns (total 3 columns):\n'\n ' #   Column  Non-Null Count  Dtype\n'\n '---  ------  --------------  -----\n'\n ' 0   x       4 non-null      int64\n'\n ' 1   y       4 non-null      category\n'\n ' 2   z       4 non-null      object\n'\n 'dtypes: category(1), object(1), int64(1)\n'\n 'memory usage: 312.0 bytes\n') == ("<class 'dask.dataframe.core.DataFrame'>\n"\n 'Int64Index: 4 entries, 0 to 3\n'\n 'Data columns (total 3 columns):\n'\n ' #   Column  Non-Null Count  Dtype\n'\n '---  ------  --------------  -----\n'\n ' 0   x       4 non-null      int64\n'\n ' 1   y       4 non-null      category\n'\n ' 2   z       4 non-null      object\n'\n 'dtypes: category(1), object(1), int64(1)\n'\n 'memory usage: 496.0 bytes\n')
[ 1356s] E           <class 'dask.dataframe.core.DataFrame'>
[ 1356s] E           Int64Index: 4 entries, 0 to 3
[ 1356s] E           Data columns (total 3 columns):
[ 1356s] E            #   Column  Non-Null Count  Dtype
[ 1356s] E           ---  ------  --------------  -----
[ 1356s] E            0   x       4 non-null      int64
[ 1356s] E            1   y       4 non-null      category
[ 1356s] E            2   z       4 non-null      object
[ 1356s] E           dtypes: category(1), object(1), int64(1)
[ 1356s] E         - memory usage: 496.0 bytes
[ 1356s] E         ?               ^^^
[ 1356s] E         + memory usage: 312.0 bytes
[ 1356s] E         ?               ^^^
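One way to make such an assertion platform-independent would be to derive the expected byte count from pandas itself rather than hardcoding 496/412/312, which shift with the pandas version and with pointer width on 32-bit builds. A sketch against plain pandas (illustrative only; the dask test would still need its own expected-output handling):

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": pd.Categorical(list("aabc")), "z": list("aabc")},
    index=[0, 1, 2, 3],
)

buf = io.StringIO()
df.info(buf=buf, verbose=True)

# Derive the figure from the same pandas build that produced the report,
# instead of hardcoding a byte count that varies by platform and version.
expected_bytes = df.memory_usage(index=True, deep=False).sum()
assert f"memory usage: {expected_bytes:.1f}" in buf.getvalue()
```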



Successfully merging this pull request may close these issues.

how to get info (total memory usage) of a dask.dataframe?

5 participants