Using a dense pivot operation in Polars for tabularizing large datasets consumes excessive memory, as highlighted in this GitHub issue. In response, I developed a sparse version of tabularization in Polars that bypasses traditional pivoting, showing promising results. This approach was tested on a dataset comprising 5,000 patients with 9,354,128 observations and 42,266 unique codes. The implementation can be viewed here.
The memory usage of this sparse Polars method is comparable to that of a sparse matrix approach, and it runs significantly faster: roughly 5 seconds versus 250 seconds. Here are the memory profiles for the Polars experiment:
and for the approach using scipy sparse matrices: 
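For comparison, the scipy baseline builds the same triplet data into an explicit sparse matrix. A minimal sketch, assuming the long table has already been reduced to integer (row, column, value) triplets:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical COO triplets: (patient index, code index, count).
rows = np.array([0, 0, 1, 1])
cols = np.array([0, 1, 0, 2])
vals = np.array([2, 1, 1, 1])

# Build a patients x codes sparse count matrix; coo_matrix sums any
# duplicate (row, col) entries, then CSR gives fast row slicing.
mat = coo_matrix((vals, (rows, cols)), shape=(2, 3)).tocsr()
print(mat.toarray())
```

Both representations store only the non-zero cells; the timing gap reported above comes from how the aggregation is performed, not from the storage format.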
One limitation of the Polars approach is that it only aggregates a code at the event times where that code is present, skipping event times where the code did not occur but still falls within the event's rolling window. Filling in those missing (event, code) pairs would increase memory usage and computation time, so it is not certain that this method is definitively superior. Simple aggregations such as code/count could likely be implemented with the method in this Colab notebook, while more complex operations such as value/sum, value/max, value/min, and other arbitrary aggregations may pose greater challenges.