Handle projection pushdown in the metadata cache #25584

@hiltontj

Problem

The TableProvider implementation for MetaCacheFunctionProvider does not currently handle projection pushdown:

impl TableProvider for MetaCacheFunctionProvider {

As a result, the cache gets a full scan (within the bounds of the provided predicates) regardless of the provided projection. For a cache that has multiple levels, if the user is only interested in the top level, this can lead to unnecessary cycles spent scanning the lower levels. If the user is interested in lower levels, we still need to scan through the higher levels, but we could at least avoid building the arrow buffers for the higher-level columns that are not projected.

In addition, the output is not ordered when projecting lower levels of the cache; that may need a separate issue.

Proposed solution

The projection provided to TableProvider::scan could be passed down to MetaCache::to_record_batch so that the cache is scanned more efficiently:

  • do not build arrow buffers for unneeded columns
  • only scan down to the lowest needed level in the cache
  • update the MetaCacheExec to include details about projected columns
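
The first two points above can be sketched with a simplified, standalone model of a multi-level cache. This is only an illustration under stated assumptions, not the actual MetaCache code: `Node`, `scan`, and `walk` are hypothetical names, plain `Vec<String>` columns stand in for arrow buffers, every branch is assumed to have the same depth, and the projection is assumed to contain no duplicate or out-of-range indices. The `Option<&Vec<usize>>` parameter mirrors the shape of the projection argument that DataFusion passes to TableProvider::scan.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a multi-level cache: each level maps a column
/// value to the next level down, so level `i` holds the values of column `i`.
#[derive(Default, Debug)]
pub struct Node(BTreeMap<String, Node>);

impl Node {
    /// Insert one row, given as one value per cache level.
    pub fn insert(&mut self, row: &[&str]) {
        if let Some((first, rest)) = row.split_first() {
            self.0.entry(first.to_string()).or_default().insert(rest);
        }
    }
}

/// Scan the cache and return one output column per projected level, in the
/// order given by `projection`. `None` means "all levels".
pub fn scan(
    root: &Node,
    num_levels: usize,
    projection: Option<&Vec<usize>>,
) -> Vec<Vec<String>> {
    let all: Vec<usize> = (0..num_levels).collect();
    let projection = projection.unwrap_or(&all);

    // The walk never needs to go deeper than the lowest projected level.
    let max_level = match projection.iter().max() {
        Some(&m) => m,
        None => return Vec::new(),
    };

    // slot[level] = Some(output column index) only for projected levels.
    let mut slot = vec![None; max_level + 1];
    for (out_idx, &level) in projection.iter().enumerate() {
        slot[level] = Some(out_idx);
    }

    let mut columns = vec![Vec::new(); projection.len()];
    let mut path = Vec::with_capacity(max_level + 1);
    walk(root, 0, max_level, &slot, &mut path, &mut columns);
    columns
}

fn walk<'a>(
    node: &'a Node,
    level: usize,
    max_level: usize,
    slot: &[Option<usize>],
    path: &mut Vec<&'a str>,
    columns: &mut [Vec<String>],
) {
    for (value, child) in &node.0 {
        path.push(value.as_str());
        if level == max_level {
            // Emit a row, copying values only into the projected columns.
            for (lvl, out) in slot.iter().enumerate() {
                if let Some(out_idx) = out {
                    columns[*out_idx].push(path[lvl].to_string());
                }
            }
        } else {
            walk(child, level + 1, max_level, slot, path, columns);
        }
        path.pop();
    }
}
```

In this sketch, projecting only level 0 visits just the top-level map, while projecting only level 1 still walks level 0 but never allocates an output column for it. Rows are emitted in traversal order, so a projected lower-level column comes out grouped by its parent values rather than globally sorted.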

Alternatives

N/A

Additional context

Currently, DataFusion handles projection at a higher level, so this isn't a show-stopper: the cache still works as intended when projections are provided in the query.

The method that walks the cache hierarchy to do predicate evaluation and build the arrow buffers is here.

An example showing that the output when projecting a lower column is not ordered is here.
