Read in process Python objects like Dataframe, Numpy or dict #211

Merged
auxten merged 26 commits into main-23.10-20240617 from readPyObj on Jun 17, 2024
auxten commented Apr 12, 2024

This PR is at a very early stage. The implementation could change a lot before the final patch.

I'm keeping this PR open so that other projects can track the progress of "chDB on Pandas/NumPy..."

Related issues:

@auxten auxten added the Arrow Apache Arrow support label Apr 12, 2024
@auxten auxten self-assigned this Apr 12, 2024
@auxten auxten marked this pull request as draft April 12, 2024 09:41

auxten commented Apr 29, 2024

Still working on it. The good news is that the prototype works. The Python API could look like the example below. Any suggestions?

#!python3

import chdb


class myReader(chdb.PyReader):
    def __init__(self, data):
        self.data = data
        self.cursor = 0
        super().__init__(data)

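    def read(self, col_names, count):
        # Return one block: a list of columns, in the order of col_names.
        # count is ignored for this demo, so all rows come back in one block.
        if self.cursor >= len(self.data["a"]):
            return []  # an empty list means no rows are left
        block = [self.data[col] for col in col_names]
        self.cursor += len(block[0])
        return block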


reader = myReader(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query("SELECT b, sum(a) FROM Python('reader') GROUP BY b", "debug").show()

Output:

"tom",5
"auxten",9
"jerry",7
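
For anyone following along without a chDB build, here is a minimal pure-Python sketch of the same read protocol. It is only an illustration and makes several assumptions: `ChunkedDictReader` and `sum_a_group_by_b` are hypothetical names, the class does not subclass `chdb.PyReader`, and the driver loop is a guess at how an engine might pull blocks, not the actual chDB implementation. Unlike the demo above, it honors `count` and returns at most that many rows per call.

```python
from collections import defaultdict


class ChunkedDictReader:
    """Hypothetical stand-in for the prototype reader (no chdb dependency)."""

    def __init__(self, data):
        self.data = data
        self.cursor = 0
        # Assumption: every column has the same number of rows.
        self.n_rows = len(next(iter(data.values()))) if data else 0

    def read(self, col_names, count):
        # Return at most `count` rows, as one list per requested column.
        if self.cursor >= self.n_rows:
            return []  # empty block: no rows left
        end = min(self.cursor + count, self.n_rows)
        block = [self.data[col][self.cursor:end] for col in col_names]
        self.cursor = end
        return block


def sum_a_group_by_b(reader, count=4):
    # Hypothetical driver: pull blocks until the reader is exhausted and
    # aggregate them, mirroring SELECT b, sum(a) ... GROUP BY b.
    totals = defaultdict(int)
    while True:
        block = reader.read(["b", "a"], count)
        if not block:
            break
        names, values = block
        for name, value in zip(names, values):
            totals[name] += value
    return dict(totals)


reader = ChunkedDictReader(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)
totals = sum_a_group_by_b(reader)
print(totals)
```

With `count=4` the data arrives in two blocks (four rows, then two), and the per-name sums match the output above: 5 for tom, 7 for jerry, 9 for auxten.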

@auxten auxten marked this pull request as ready for review June 17, 2024 05:39
@auxten auxten merged commit eeb6b68 into main-23.10-20240617 Jun 17, 2024

Labels

Arrow Apache Arrow support

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

CHDB is significantly slower on Arrow tables (in-memory) than with CSV / Parquet

1 participant