ESQL: Add error policy and configurable options for CSV format reader #143779
Merged
costin merged 3 commits into elastic:main on Mar 9, 2026
Introduce ErrorPolicy with three modes (FAIL_FAST, SKIP_ROW, NULL_FIELD) and an error budget (maxErrors, maxErrorRatio) for resilient CSV parsing. Add CsvFormatOptions for configurable delimiter, quote/escape characters, comment prefix, null representation, encoding, datetime format, and max field size. Extend FormatReader SPI with withConfig() for per-query configuration.
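The three modes and the error budget can be illustrated with a minimal sketch. Only the mode names and the idea of a `maxErrors` cap come from the description above; the `PolicySketch` class, its `parse` method, and the integer-only rows are hypothetical and are not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// The three mode names come from the PR description; everything else is illustrative.
enum ErrorPolicy { FAIL_FAST, SKIP_ROW, NULL_FIELD }

final class PolicySketch {
    /** Parses rows of integer fields and returns the rows that survive the policy. */
    static List<Integer[]> parse(List<String[]> rows, ErrorPolicy policy, int maxErrors) {
        List<Integer[]> out = new ArrayList<>();
        int errors = 0;
        for (String[] raw : rows) {
            Integer[] parsed = new Integer[raw.length];
            boolean dropRow = false;
            for (int i = 0; i < raw.length && dropRow == false; i++) {
                try {
                    parsed[i] = Integer.valueOf(raw[i].trim());
                } catch (NumberFormatException e) {
                    if (policy == ErrorPolicy.FAIL_FAST) {
                        // FAIL_FAST: abort on the first malformed field.
                        throw new IllegalArgumentException("bad field [" + raw[i] + "]", e);
                    }
                    if (++errors > maxErrors) {
                        // Error budget: tolerate at most maxErrors errors.
                        throw new IllegalStateException("error budget exceeded after " + errors + " errors");
                    }
                    // NULL_FIELD keeps the row with parsed[i] left null; SKIP_ROW drops the row.
                    dropRow = policy == ErrorPolicy.SKIP_ROW;
                }
            }
            if (dropRow == false) {
                out.add(parsed);
            }
        }
        return out;
    }
}
```

With the rows `1,2`, `x,3`, `4,5`, this sketch returns two rows under SKIP_ROW and three rows under NULL_FIELD (the second row's first field is null), and throws under FAIL_FAST.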
Collaborator
Hi @costin, I've created a changelog YAML for you.
Collaborator
Pinging @elastic/es-analytical-engine (Team:Analytics)
Replace exception-based error flow with return-code approach using a lastFieldError field, eliminating exception allocation and stack trace filling on parse failures. Pre-compute projected column arrays (int[], DataType[], Attribute[]) at schema init to avoid autoboxing and list lookups per field. Hoist invariant checks (comment filter, null value, log flag) into constructor-time booleans. Reuse Object[] row buffer across rows. Replace division with multiplication in error budget ratio check.
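Two of these optimizations can be sketched in isolation. The `lastFieldError` name is taken from the commit message; the `FieldParserSketch` and `BudgetCheckSketch` classes, the method signatures, and the 18-digit limit are assumptions made for illustration and do not reproduce the actual reader.

```java
// Return-code style parsing: failure is reported through a field, not an exception,
// so no exception object is allocated and no stack trace is filled on bad input.
final class FieldParserSketch {
    /** Null when the most recent parse succeeded, otherwise a short error message. */
    String lastFieldError;

    long parseLong(String s) {
        lastFieldError = null;
        int start = (s.isEmpty() == false && s.charAt(0) == '-') ? 1 : 0;
        int digits = s.length() - start;
        // Simplification for this sketch: at most 18 digits, which always fits in a long.
        if (digits < 1 || digits > 18) {
            lastFieldError = "expected 1 to 18 digits";
            return 0L;
        }
        long value = 0L;
        for (int i = start; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                lastFieldError = "unexpected character at index " + i;
                return 0L;
            }
            value = value * 10 + (c - '0');
        }
        return start == 1 ? -value : value;
    }
}

final class BudgetCheckSketch {
    /**
     * Compares errors against maxErrorRatio * rows instead of computing errors / rows.
     * For rows > 0 the two comparisons are equivalent, and the multiplication form
     * needs no special case for rows == 0.
     */
    static boolean exceeded(long errors, long rows, long maxErrors, double maxErrorRatio) {
        return errors > maxErrors || errors > maxErrorRatio * rows;
    }
}
```

A caller checks `lastFieldError` after each `parseLong` call and only then decides, according to the error policy, whether to null the field, skip the row, or abort.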
bpintea approved these changes Mar 9, 2026
private void onFieldError(String message, String value, Attribute attr) {
    errorCount++;
    if (logErrors) {
        logger.warn(
Contributor
This is good, but I guess we might want to evolve how we inform the user about the error. I believe that unless the policy is FAIL_FAST, a user would need to check the logs to see what happened.
We might want to add warnings?
Adds resilient error handling and configurable parsing options to the ESQL CSV datasource format reader.
Error policy: three modes control how malformed rows are handled during CSV ingestion:
- FAIL_FAST: abort on the first error (default, equivalent to Spark FAILFAST)
- SKIP_ROW: drop the malformed row and continue (equivalent to Spark DROPMALFORMED, DuckDB ignore_errors)
- NULL_FIELD: null-fill unparseable fields while keeping the row (equivalent to Spark PERMISSIVE)
An error budget (max_errors, max_error_ratio) caps how many errors are tolerated before aborting, giving operators fine-grained control over data quality vs. throughput.
Configurable format options: delimiter, quote/escape characters, comment prefix, null representation, encoding, datetime format, and max field size can all be set per-query via WITH parameters.
Both features are wired through the existing FormatReader SPI via a new withConfig(Map) method, keeping the interface backward-compatible.
Developed using AI-assisted tooling
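One common way to add a method like withConfig(Map) without breaking existing implementations is a Java default method. The sketch below assumes that approach; the PR text does not confirm it. `SketchFormatReader`, `SketchLegacyReader`, `SketchCsvReader`, the `split` method, and the `"delimiter"` key are hypothetical stand-ins, not the real SPI.

```java
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for the SPI; the real FormatReader interface is not shown in this PR page.
interface SketchFormatReader {
    String[] split(String line);

    // Default implementation: readers that predate the new method ignore the config.
    default SketchFormatReader withConfig(Map<String, Object> config) {
        return this;
    }
}

// An older reader that never heard of withConfig still compiles and works.
final class SketchLegacyReader implements SketchFormatReader {
    @Override
    public String[] split(String line) {
        return line.split("\t", -1);
    }
}

// A reader that honors a per-query delimiter (the "delimiter" key is an assumption).
final class SketchCsvReader implements SketchFormatReader {
    private final char delimiter;

    SketchCsvReader(char delimiter) {
        this.delimiter = delimiter;
    }

    @Override
    public String[] split(String line) {
        return line.split(Pattern.quote(String.valueOf(delimiter)), -1);
    }

    @Override
    public SketchFormatReader withConfig(Map<String, Object> config) {
        Object value = config.get("delimiter");
        if (value == null) {
            return this;
        }
        String s = value.toString();
        if (s.length() != 1) {
            throw new IllegalArgumentException("delimiter must be a single character");
        }
        // Return a new configured instance so the shared reader is never mutated per query.
        return new SketchCsvReader(s.charAt(0));
    }
}
```

Returning a fresh instance rather than mutating the reader is a design choice of this sketch: it keeps one query's options from leaking into another's.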