[Data] Custom error handling for failed rows in dataset processing #52449

@hainesmichaelc

Description

Ideally, I should be able to control what happens when a single row fails, e.g. continue on error vs. interrupt, or run custom post-processing for failed rows. Ideally the API would look something like:

build_llm_processor(
    processor_config,
    preprocess=lambda row: dict(
        payload=dict(
            messages=[
                {
                    "role": "system",
                    "content": "foo",
                },
                {
                    "role": "user",
                    "content": fn(row),
                },
            ],
            model="bar",
            temperature=0.1,
        )
    ),
    postprocess=lambda row: {
        **{key: value for key, value in row.items() if key != "http_response"},
        f"{name}_formatted_output": str(row["http_response"]["choices"][0]["message"]["content"]),
    },
    # Proposed (not existing) parameters:
    on_error="continue",
    error_handler=error_handling_fn,
)

where I define custom error_handling_fn(row, e) to control handling behavior.
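
To make the requested semantics concrete, here is a minimal sketch in plain Python of what on_error and error_handler could mean for a per-row function. Everything in it is hypothetical: with_error_handling, the "raise"/"continue" values, and the error_message column are illustrative names, not part of Ray Data. It also only covers exceptions raised inside user-supplied functions such as postprocess, not failures inside the processor's own inference stage, which is what this issue is really asking for.

```python
from typing import Any, Callable, Dict, Optional

Row = Dict[str, Any]


def with_error_handling(
    fn: Callable[[Row], Row],
    on_error: str = "raise",
    error_handler: Optional[Callable[[Row, Exception], Row]] = None,
) -> Callable[[Row], Row]:
    """Wrap a per-row function so one bad row need not abort the whole run.

    Hypothetical helper: on_error="raise" re-raises the exception,
    on_error="continue" returns a replacement row instead.
    """
    if on_error not in ("raise", "continue"):
        raise ValueError(f"unsupported on_error value: {on_error!r}")

    def wrapped(row: Row) -> Row:
        try:
            return fn(row)
        except Exception as e:
            if on_error == "raise":
                raise
            if error_handler is not None:
                return error_handler(row, e)
            # Default when no handler is given: keep the row and flag it.
            return {**row, "error": True}

    return wrapped


def error_handling_fn(row: Row, e: Exception) -> Row:
    """Custom handler: flag the row and record why it failed."""
    return {**row, "error": True, "error_message": str(e)}


def risky_postprocess(row: Row) -> Row:
    """Raises KeyError when the response has no 'choices' entry."""
    content = row["http_response"]["choices"][0]["message"]["content"]
    return {"id": row["id"], "output": str(content)}


safe_postprocess = with_error_handling(
    risky_postprocess, on_error="continue", error_handler=error_handling_fn
)

rows = [
    {"id": 1, "http_response": {"choices": [{"message": {"content": "ok"}}]}},
    {"id": 2, "http_response": {"error": "context length exceeded"}},
]
results = [safe_postprocess(r) for r in rows]
```

In this sketch the first row is post-processed normally, while the second comes back with its original columns plus error set to True, so failed rows can be filtered or retried afterwards instead of stopping the job.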

Use case

I was using the Ray Data LLM APIs to analyze a decent-sized corpus of text data, and there were a few transient issues with specific rows of data, for instance a dynamically generated prompt that was too large for the LLM's context window. This caused the whole pipeline to fail several hours into processing. I would have liked to configure the failure handling to continue on error and flag the failed rows in my corpus with row['error'] = True.

Metadata

Assignees

No one assigned

    Labels

    P1: Issue that should be fixed within a few weeks
    community-backlog
    data: Ray Data-related issues
    enhancement: Request for new feature and/or capability
    usability

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
