cdc: add new parquet library #99028

@jayshrivastava

This issue tracks progress on adding the Apache Arrow Parquet library and integrating it into changefeeds.

https://github.com/apache/arrow/

This decision was based on the investigation done in this doc. In summary:

  • The existing implementation of changefeed initial scans is 15x slower with Parquet than with JSON.
  • CPU flamegraphs show that 53% of time is spent in the current library and an additional ~20% is spent in garbage collection (GC).
  • An initial investigation of the old and new libraries shows that the new one would be more efficient at handling memory and writing files. We plan to integrate the new library with the smallest code change needed to enable the cdc TPCC roachtest, which we can then use to benchmark the new library and confirm our observations.

I'll be working on the following items roughly in order

Jira issue: CRDB-25669

Epic CRDB-27372

Metadata

Labels

A-cdc: Change Data Capture
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-cdc
