Read in process Python objects like Dataframe, Numpy or dict by auxten · Pull Request #211 · chdb-io/chdb

auxten · 2024-04-12T09:41:44Z

This PR is at a very early stage. The implementation could change a lot before the final patch.

I'm keeping this PR open so that other projects can track the progress of "chDB on Pandas/NumPy..."

Related issues:

auxten · 2024-04-29T08:34:06Z

Still working on it. The good news is that the prototype works. The Python API could look like the example below. Any suggestions?

#!python3

import chdb


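class myReader(chdb.PyReader):
    def __init__(self, data):
        self.data = data
        self.cursor = 0  # number of rows already returned by read()
        super().__init__(data)

    def read(self, col_names, count):
        # count is ignored for this demo: the first call returns every row at once
        if self.cursor >= len(self.data["a"]):
            return []  # an empty list means no data is left
        # one list per requested column, in the order given by col_names
        block = [self.data[col] for col in col_names]
        self.cursor += len(block[0])
        return block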


reader = myReader(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
    }
)

chdb.query("SELECT b, sum(a) FROM Python('reader') GROUP BY b", "debug").show()

Output:

"tom",5
"auxten",9
"jerry",7

auxten added the Arrow (Apache Arrow support) label Apr 12, 2024

auxten self-assigned this Apr 12, 2024

auxten marked this pull request as draft April 12, 2024 09:41

auxten added 25 commits June 6, 2024 12:28

Fix Python ide index issue

b085bbb

Add simple StoragePython to fix compile flags

27d17ad

Add PyReader and PyWriter ABC

5e917f3

Prototype works

614d392

GetSchema in PyReader works on dict

1e789dc

Merge all convert_and_insert and getTableStructureFromData v1

0f58c45

Refactor convert_and_insert

209646a

Fix prototype of PyReader.read

b79246f

Remove trampoline class

bc6603b

Fix gc for read returned data

81da2aa

Fix reader type to py::object

da85c29

Fix gil cross threads between C++ and Python

7461e83

Use inspect.current_frame and f_back to find py obj

623b322

Add append_raw in PODArray

68b6729

Add appendRawData in ColumnVectorHelper

656bccf

Treat binary[pyarrow] as string

b4923dc

Fix pandas arrow dtype

f050319

Benchmark on clickbench data

451f8d0

Add pybind headers for ColumnPyObject

093d9dc

2x faster on Q23 with better getPyUtf8StrData

ce37839

Add PythonUtils

9db0a0a

Support SQL on data objects without PyReader

fef6cda

Do things with GIL in batch

0e53dcb

GIL less scanDataToChunk

187edc4

Move prepareColumnCache to StoragePython

22c0bb8

auxten force-pushed the readPyObj branch from 12d41da to 22c0bb8 June 6, 2024 04:31

Convert UTF-16 and UTF-32 without copy in FillColumnString

4725983
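One of the commits above is titled "Use inspect.current_frame and f_back to find py obj". The PR page does not show that code, so the following is only a sketch of the general technique, assuming the goal is to resolve a name such as `reader` from the caller's scope. The helper name `find_object_by_name` is hypothetical; the standard-library call is spelled `inspect.currentframe()`.

```python
import inspect


def find_object_by_name(name):
    """Search caller frames, innermost first, for a variable called `name`."""
    frame = inspect.currentframe()
    try:
        # Skip this helper's own frame and start at the direct caller.
        frame = frame.f_back if frame is not None else None
        while frame is not None:
            if name in frame.f_locals:
                return frame.f_locals[name]
            if name in frame.f_globals:
                return frame.f_globals[name]
            frame = frame.f_back  # walk one level up the call stack
        raise NameError(f"no Python object named {name!r} found in caller frames")
    finally:
        del frame  # drop the frame reference to avoid a reference cycle


def run_query():
    reader = {"a": [1, 2, 3]}  # local variable, visible only through the frame
    return find_object_by_name("reader")


found = run_query()
print(found)
```

Checking locals before globals at each level means the nearest definition wins, which is the behaviour a user would probably expect when passing a variable name inside a query string.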

auxten marked this pull request as ready for review June 17, 2024 05:39

auxten merged commit eeb6b68 into main-23.10-20240617 Jun 17, 2024

auxten linked an issue Aug 30, 2024 that may be closed by this pull request

CHDB is significantly slower on Arrow tables (in-memory) than with CSV / Parquet #195

Closed

auxten merged 26 commits into main-23.10-20240617 from readPyObj
