Improve parallel CSV scan #6922

@2010YOUY01

Is your feature request related to a problem or challenge?

This issue tracks the remaining tasks from the initial parallel CSV scan PR, #6801.

The remaining tasks:

  1. Use get_opts() for range reads on the local FS
    get_opts() is the ObjectStore interface for streaming range reads (local FS and cloud storage). Range reads through it are not yet supported on the local FS: https://github.com/apache/arrow-rs/blob/0d4e6a727f113f42d58650d2dbecab89b22d4e28/object_store/src/lib.rs#L355
    Once this is implemented in arrow-rs, the parallel CSV scan can use it and may get some performance improvement: the current implementation copies the whole CSV file range into memory at once instead of reading it in a streaming fashion.
  2. Use only 1 get operation from ObjectStore per partition instead of 3 (see the original PR discussion)
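
To make the motivation for task 1 concrete, here is a minimal sketch of the difference between copying a whole byte range into memory and streaming it in bounded chunks. It uses only the Rust standard library on a local file and does not call the object_store API. The function names `read_range_eager` and `count_newlines_streaming` are hypothetical and exist only for illustration.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::ops::Range;
use std::path::Path;

/// Eager variant: copies every byte of `range` into one buffer,
/// so peak memory grows with the size of the range.
fn read_range_eager(path: &Path, range: Range<u64>) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(range.start))?;
    let mut buf = Vec::new();
    // `take` caps the read at the length of the range.
    file.take(range.end.saturating_sub(range.start))
        .read_to_end(&mut buf)?;
    Ok(buf)
}

/// Streaming variant: walks the same range through a fixed-size chunk,
/// so peak memory is bounded by `chunk_size` regardless of the range length.
/// Counting newlines stands in for whatever work the consumer does per chunk.
fn count_newlines_streaming(
    path: &Path,
    range: Range<u64>,
    chunk_size: usize,
) -> io::Result<usize> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(range.start))?;
    let mut reader = file.take(range.end.saturating_sub(range.start));
    // Guard against a zero-sized chunk, which would end the loop immediately.
    let mut chunk = vec![0u8; chunk_size.max(1)];
    let mut count = 0;
    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            break; // end of range or end of file
        }
        count += chunk[..n].iter().filter(|&&b| b == b'\n').count();
    }
    Ok(count)
}
```

Both functions visit the same bytes; only the streaming one keeps memory use independent of the partition size, which is the improvement a streaming range read is expected to bring.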

Task 2 is easier to do once task 1 is done, since it can then be tested on the local filesystem.
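
For task 2, the following is a rough sketch of how a single sequential read could handle both partition boundaries. It is not the code from PR #6801. It assumes a convention in which a line belongs to the partition that contains its first byte, and it ignores newlines inside quoted CSV fields. The name `read_partition_lines` is hypothetical, and any `Read + Seek` source stands in for a streaming range read.

```rust
use std::io::{self, BufRead, BufReader, Read, Seek, SeekFrom};

/// Returns the complete lines owned by the byte range [start, end),
/// using one forward pass over `source`.
/// Assumed convention: a line is owned by the partition that contains
/// its first byte, so the reader may run past `end` to finish a line.
fn read_partition_lines<R: Read + Seek>(
    source: R,
    start: u64,
    end: u64,
) -> io::Result<Vec<String>> {
    let mut reader = BufReader::new(source);
    // Begin one byte early so a line starting exactly at `start` is kept.
    let mut pos = start.saturating_sub(1);
    reader.seek(SeekFrom::Start(pos))?;

    let mut line = Vec::new();
    if start > 0 {
        // Skip through the first newline: those bytes finish a line
        // that started in the previous partition.
        pos += reader.read_until(b'\n', &mut line)? as u64;
    }

    let mut lines = Vec::new();
    // Keep reading while the next line starts before `end`.
    while pos < end {
        line.clear();
        let n = reader.read_until(b'\n', &mut line)?;
        if n == 0 {
            break; // end of input
        }
        pos += n as u64;
        let text = String::from_utf8_lossy(&line);
        lines.push(text.trim_end_matches('\n').to_string());
    }
    Ok(lines)
}
```

With the 16-byte input `a,b\n1,2\n3,4\n5,6\n` split at byte 6 (the middle of the second line), the partition 0..6 yields `a,b` and `1,2`, and the partition 6..16 yields `3,4` and `5,6`, so every line is read exactly once without separate boundary lookups.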

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response
