feat: Implement an AsyncReader for avro using ObjectStore #8930
Conversation
jecsand838 left a comment:
Flushing a partial review with some high level thoughts.
I'll wait for you to finish before resuming.
Honestly I think my main blocker is the schema thing here. I don't want to commit to the constructor before it is resolved, as it's a public API and I don't want it to be volatile.

100% I'm working on that right now and won't stop until I have a PR. That was a solid catch. The schema logic is an area of the code I mean to do (or would welcome) a full refactor of. I knew it would eventually come back.
Sorry, I haven't dropped it, I just found myself in a really busy week! The generic reader support does not seem too hard to implement from the dabbling I've done, and I still need to get to the builder pattern change.

…, separate object store file reader into a feature-gated struct and use a generic async file reader trait

@jecsand838 I believe this is now ready for a proper review^
@EmilyMatt Thank you so much for getting these changes up!
I left a few comments. Let me know what you think.
EDIT: Should have mentioned that this is looking really good overall and I'm very excited for the AsyncReader!
@jecsand838 and @EmilyMatt -- how is this PR looking?
I had actually just returned to work on it 2 days ago. I'm still having some issues with the schema now being provided, due to the problems I've described. @jecsand838 suggested removing the arrow schema and I'm starting to think that is the only viable way for now.

Hope to push another version today and address some of the things above.
@jecsand838 I've shamelessly plagiarized the API for the object reader from the parquet crate, but that's OK IMO; it lays the foundations for a common API in a few versions.
@EmilyMatt Absolutely, I'll have time to give this a solid review tonight. Ty for getting these changes in! |
jecsand838 left a comment:
@EmilyMatt Flushing a partial review here. Looking really good overall. I know this is a huge PR and your high-level changes look great.
I left more code-level comments with one architectural call-out you may want to consider. Overall this is looking solid and I'm super stoked about this async reader for arrow-avro.
```rust
let consumed = self.block_decoder.decode(&data)?;
if consumed == 0 {
```
I think there may be an issue with using `consumed == 0` as the signal for detecting incomplete blocks here.
Looking at `BlockDecoder::decode` in block.rs lines 78-129, it returns 0 only when:
- The input buffer is empty at the start, OR
- The decoder is already in the `Finished` state

For a truly incomplete block, `decode()` consumes all available bytes (returns `data.len()`) and `flush()` returns `None`. The current logic likely never triggers when it should.
You may want to consider changing the detection logic to check flush() first:
```rust
ReaderState::DecodingBlock { mut reader, mut data } => {
    let consumed = self.block_decoder.decode(&data)?;
    data = data.slice(consumed..);
    // Check for complete block FIRST
    if let Some(block) = self.block_decoder.flush() {
        let block_data = Bytes::from_owner(if let Some(ref codec) = self.codec {
            codec.decompress(&block.data)?
        } else {
            block.data
        });
        self.reader_state = ReaderState::ReadingBatches {
            reader, data, block_data,
            remaining_in_block: block.count,
        };
        continue;
    }
    // No complete block
    if data.is_empty() && consumed == 0 {
        // No progress on empty buffer = EOF
        let final_batch = self.decoder.flush();
        self.reader_state = ReaderState::Finished;
        return Poll::Ready(final_batch.transpose());
    }
    if data.is_empty() {
        // All data consumed but block incomplete - need more bytes
        // (incomplete block handling logic here)
    } else {
        // Still have data to process
        self.reader_state = ReaderState::DecodingBlock { reader, data };
    }
}
```
```rust
// Two longs: count and size have already been read, but using our vlq,
// meaning they were not consumed.
let total_block_size = size + vlq_header_len;
```
Are there any risks from the calculation omitting the 16-byte sync marker here?
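For context on why the sync marker matters here: per the Avro object container file spec, each block is two varint longs (record count and byte size), then the data, then a 16-byte sync marker. A minimal sketch of the full on-disk length, assuming illustrative helper names (`vlq_len`, `total_block_len` are not from this PR):

```rust
/// Number of bytes a value occupies as a zig-zag varint (Avro "long").
fn vlq_len(v: i64) -> usize {
    let mut z = ((v << 1) ^ (v >> 63)) as u64; // zig-zag encode
    let mut n = 1;
    while z >= 0x80 {
        z >>= 7;
        n += 1;
    }
    n
}

/// Total on-disk length of one block, including the trailing 16-byte sync marker.
fn total_block_len(count: i64, size: i64) -> usize {
    vlq_len(count) + vlq_len(size) + size as usize + 16
}

fn main() {
    // A block with 1000 records and 200_000 bytes of data:
    // count encodes in 2 bytes, size in 3 bytes, plus the sync marker.
    assert_eq!(vlq_len(1000), 2);
    assert_eq!(vlq_len(200_000), 3);
    assert_eq!(total_block_len(1000, 200_000), 2 + 3 + 200_000 + 16);
    println!("ok");
}
```

If the fetch range is computed as `size + vlq_header_len` alone, the 16 bytes after the data would need to be read separately before the next block's header can be decoded.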
```rust
ReaderState::Limbo => {
    unreachable!("ReaderState::Limbo should never be observed");
}
```
Is the `ReaderState::Limbo` variant really necessary? Could we use `Finished`, so that if a bug causes an early return without setting state, the stream just ends (which is safer than panicking)?
A user will never know they did not actually finish writing the file; they will think they've just reached the end. In my opinion this is orders of magnitude more severe than crashing.
I mean, in the event of that occurring, an `ArrowError` should be passed back, which would alert the user.
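A minimal sketch of the trade-off being discussed: a transient placeholder state taken out with `std::mem::replace` during polling, where observing the placeholder surfaces an error instead of panicking. The `ReaderState` variants, `poll_once`, and the error string here are illustrative, not the PR's actual types:

```rust
#[derive(Debug, PartialEq)]
enum ReaderState {
    Decoding(Vec<u8>),
    Finished,
    Limbo, // transient placeholder; should never survive across polls
}

fn poll_once(state: &mut ReaderState) -> Result<Option<usize>, String> {
    // Take ownership of the current state, leaving the placeholder behind.
    match std::mem::replace(state, ReaderState::Limbo) {
        ReaderState::Decoding(data) => {
            let n = data.len();
            *state = ReaderState::Finished; // always restore a real state
            Ok(Some(n))
        }
        ReaderState::Finished => Ok(None),
        // Returning an error here (instead of `unreachable!`) means a state
        // bug reports a truncated stream rather than crashing the process.
        ReaderState::Limbo => Err("reader left in transient state".to_string()),
    }
}

fn main() {
    let mut state = ReaderState::Decoding(vec![1, 2, 3]);
    assert_eq!(poll_once(&mut state), Ok(Some(3)));
    assert_eq!(poll_once(&mut state), Ok(None));
    assert!(poll_once(&mut ReaderState::Limbo).is_err());
    println!("ok");
}
```

This keeps the distinct `Limbo` variant (so a bug is never silently read as a clean EOF) while still avoiding the panic.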
```rust
ReaderState::DecodingBlock {
    mut reader,
    mut data,
} => {
```
I was thinking about it and you may want to consider a decode loop similar to the sync Reader::read method's, specifically this logic:
```rust
let consumed = self.block_decoder.decode(buf)?;
self.reader.consume(consumed);
if let Some(block) = self.block_decoder.flush() {
    // Block complete - use it
} else if consumed == 0 {
    // Stuck on non-empty buffer - error
    return Err(ArrowError::ParseError(...));
}
// Otherwise: made progress, loop for more data
```

From an architectural perspective the advantages would be:
- Always calls `flush()` after `decode()` to check for complete blocks
- Only errors when stuck (`consumed == 0` on non-empty buffer AND `flush() == None`)
- Trusts `BlockDecoder` to handle partial data incrementally
Maybe it could resemble something like this pseudo-code?
```rust
ReaderState::DecodingBlock { mut reader, mut data } => {
    let consumed = self.block_decoder.decode(&data)?;
    data = data.slice(consumed..); // Equivalent to reader.consume()
    if let Some(block) = self.block_decoder.flush() {
        // Block complete - proceed to ReadingBatches
        let block_data = Bytes::from_owner(if let Some(ref codec) = self.codec {
            codec.decompress(&block.data)?
        } else {
            block.data
        });
        self.reader_state = ReaderState::ReadingBatches {
            reader, data, block_data,
            remaining_in_block: block.count,
        };
        continue;
    }
    // No complete block yet
    if consumed == 0 && !data.is_empty() {
        // Stuck - no progress on non-empty buffer = corrupted data
        return Poll::Ready(Some(Err(ArrowError::ParseError(
            "Could not decode next Avro block from partial data".into()
        ))));
    }
    if data.is_empty() {
        // Buffer exhausted, block incomplete
        if self.finishing_partial_block {
            return Poll::Ready(Some(Err(ArrowError::AvroError(
                "Unexpected EOF while reading last Avro block".into()
            ))));
        }
        // Fetch more data (range end case) or finish
        // ... simplified fetch logic here ...
    } else {
        // Made progress but not complete - continue decoding
        self.reader_state = ReaderState::DecodingBlock { reader, data };
    }
}
```
Which issue does this PR close?
Rationale for this change
Allows for proper file splitting within an asynchronous context.
What changes are included in this PR?
The raw implementation, allowing for file splitting, starting mid-block (reading until a sync marker is found), and further reading until the end of the block is found.
This reader currently requires that a `reader_schema` be provided if type promotion, schema evolution, or projection are desired.
This is because #8928 is currently blocking proper parsing from an Arrow schema.
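The mid-block start described above can be sketched as a scan for the file's 16-byte sync marker: a reader dropped into the middle of a byte range skips forward to the first marker and begins decoding at the block boundary after it. This is a hedged sketch, assuming an illustrative `find_sync` helper that is not the PR's actual code:

```rust
/// Offset of the first byte *after* the sync marker, if found.
fn find_sync(buf: &[u8], sync: &[u8; 16]) -> Option<usize> {
    buf.windows(16)
        .position(|w| w == sync)
        .map(|pos| pos + 16)
}

fn main() {
    let sync = [0xABu8; 16];
    let mut buf = vec![0u8; 10]; // tail of a block we were dropped into
    buf.extend_from_slice(&sync); // end-of-block sync marker
    buf.extend_from_slice(&[1, 2, 3]); // start of the next block
    // Decoding should resume at offset 26, right after the marker.
    assert_eq!(find_sync(&buf, &sync), Some(26));
    assert_eq!(find_sync(&[0u8; 8], &sync), None);
    println!("ok");
}
```

A real implementation would also handle a marker straddling two fetched chunks, but the boundary-finding idea is the same.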
Are these changes tested?
Yes
Are there any user-facing changes?
Only the addition of the new reader.
Other changes are internal to the crate (namely the way `Decoder` is created from parts).