[python] to_dataframe does not produce sparse data frames #808

@cdiener

Hi,

I noticed that the pandas.SparseDataFrame returned by Table.to_dataframe is not actually sparse. For example, with the American Gut data:

In [15]: bm = load_table("deblur_125nt_no_blooms.biom")

In [16]: bm
Out[16]: 32954 x 9511 <class 'biom.table.Table'> with 1829490 nonzero entries (0% dense)

In [17]: tab = bm.to_dataframe()

In [19]: type(tab)
Out[19]: pandas.core.sparse.frame.SparseDataFrame

In [20]: tab.density
Out[20]: 1.0

In [21]: tab.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
Index: 32954 entries, AACGTAGGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGCAGGCGGAAGGCTAAGTCTGATGTGAAAGCCCGGGGCTCAACCCCGGTACTGCATTGGAAACTGGTCATCTAGAGTG to TACGGGGGATGCGAGCGTTATCCGGATTCATTGGGTTTAAAGGGTGCGCAGGCCGAGGTTCAAGTCAGCGGTGAAACCCCCGCGCTCAACGCGGGGCATGCCGTTGATACTGTATCTCTGGAGTA
Columns: 9511 entries, 10317.000012326 to 10317.000038478
dtypes: Sparse[float64, nan](9511)
memory usage: 2.3+ GB

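This is basically the memory use of the full dense table including zeros: 32954 x 9511 float64 values at 8 bytes each come to about 2.5 billion bytes, which matches the reported 2.3+ GB. The densities of the original table and the SparseDataFrame are also very different (~0.6% vs 100%).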
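
One possible explanation, offered as a guess and not as a confirmed diagnosis of Table.to_dataframe: the dtype line above reads Sparse[float64, nan], so the fill value is NaN and not 0. In that case pandas only leaves NaN entries implicit and stores every zero explicitly, which would give a density of 1.0. The sketch below reproduces the effect on a small made-up vector using pandas.arrays.SparseArray from current pandas; the toy values and column name are illustrative and do not come from the American Gut table.

```python
import numpy as np
import pandas as pd

# Made-up counts vector: mostly zeros, like one column of a feature table.
values = np.array([0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.0, 0.0])

# Fill value NaN: only NaN entries are implicit, so all 8 values are stored.
nan_fill = pd.arrays.SparseArray(values, fill_value=np.nan)

# Fill value 0: zeros are implicit, so only the 2 nonzero counts are stored.
zero_fill = pd.arrays.SparseArray(values, fill_value=0.0)

print(nan_fill.dtype, nan_fill.density, nan_fill.npoints)
print(zero_fill.dtype, zero_fill.density, zero_fill.npoints)

# The same density is visible at the DataFrame level via the .sparse accessor.
frame = pd.DataFrame({"toy_sample": zero_fill})
print(frame.sparse.density)
```

If the fill value is indeed the cause, building the sparse columns with a fill value of 0 should bring the density close to that of the original biom table.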
