cdc: add new parquet library #99028
Closed
Labels
A-cdc (Change Data Capture), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), T-cdc
Description
This issue tracks progress for adding the apache arrow parquet library and integrating it into changefeeds.
https://github.com/apache/arrow/
This decision was based on the investigation done in this doc. In summary:
- The existing implementation of changefeed initial scans with parquet is 15x slower than with JSON.
- CPU flamegraphs show that 53% of time is spent in the current library and an additional ~20% is spent doing GC.
- An initial investigation of the old and new libraries shows that the new one would be more efficient at handling memory and writing files. We plan on integrating the new library with the smallest amount of code changes needed to enable the cdc TPCC roachtest, which we can use to benchmark the new library and confirm our observations.
I'll be working on the following items, roughly in order:
- Install dependency (cdc: add apache arrow parquet library and writer #99288)
- Create writer library (cdc: add apache arrow parquet library and writer #99288)
- Add support for tpcc types (cdc: add apache arrow parquet library and writer #99288)
- Benchmark new parquet library with tpcc roachtest (results)
- Implement reader library and tests for tpcc types (cdc: add apache arrow parquet library and writer #99288)
- Add support for all types
- Add compression options to writer (util/parquet: add compression options #102978)
- Add memory accounting. This is now tracked by export: parquet causes node to OOM #103317
- Using library for EXPORT INTO: tracked by export: replace parquet library for EXPORT #103318
- Integrate writer with cdc
- Add support for tuples, which is needed for diff and cdc_prev.
- Add previously unsupported options to parquet (changefeedccl: remove limitations for parquet format #103129)
- Ensure flushing is efficient and add metrics
Jira issue: CRDB-25669
Epic CRDB-27372