[Data] Custom error handling for failed rows in dataset processing #52449
Closed
Labels
P1 (issue that should be fixed within a few weeks), community-backlog, data (Ray Data-related issues), enhancement (request for new feature and/or capability), usability
Description
Ideally, I should be able to control what happens when a single row fails: for example, continue on error vs. interrupt, or run custom post-processing on failed rows. The API could look something like this:
build_llm_processor(
    processor_config,
    preprocess=lambda row: dict(
        payload=dict(
            messages=[
                {
                    "role": "system",
                    "content": "foo",
                },
                {
                    "role": "user",
                    "content": fn(row),
                },
            ],
            model="bar",
            temperature=0.1,
        )
    ),
    postprocess=lambda row: {
        **{key: value for key, value in row.items() if key != "http_response"},
        f"{name}_formatted_output": str(row["http_response"]["choices"][0]["message"]["content"]),
    },
    # Proposed parameters (the feature being requested):
    on_error="continue",
    error_handler=error_handling_fn,
)
where I define a custom error_handling_fn(row, e) to control the handling behavior.
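To make the requested semantics concrete, here is a minimal sketch in plain Python of how `on_error` and `error_handler` could behave for a generic per-row function. It does not use or hook into Ray Data: `with_error_policy`, `risky`, and the `"raise"` policy name are hypothetical names chosen for illustration, and only `on_error="continue"` and `error_handling_fn(row, e)` come from the proposal above.

```python
from typing import Any, Callable, Dict, Optional

Row = Dict[str, Any]


def with_error_policy(
    fn: Callable[[Row], Row],
    on_error: str = "raise",
    error_handler: Optional[Callable[[Row, Exception], Row]] = None,
) -> Callable[[Row], Row]:
    """Wrap a per-row function so a failing row follows the chosen policy.

    Hypothetical helper: "raise" re-raises (interrupt), "continue" returns
    a replacement row instead of failing.
    """
    if on_error not in ("raise", "continue"):
        raise ValueError(f"unknown on_error policy: {on_error!r}")

    def wrapped(row: Row) -> Row:
        try:
            return fn(row)
        except Exception as e:
            if on_error == "raise":
                raise
            if error_handler is not None:
                return error_handler(row, e)
            # Default when continuing without a handler: just flag the row.
            return {**row, "error": True}

    return wrapped


def error_handling_fn(row: Row, e: Exception) -> Row:
    # Custom handler with the (row, e) signature from the proposal.
    return {**row, "error": True, "error_message": str(e)}


def risky(row: Row) -> Row:
    # Stand-in for a processing step that fails on some rows.
    if len(row["text"]) > 10:
        raise ValueError("prompt too large")
    return {**row, "error": False, "length": len(row["text"])}


safe = with_error_policy(risky, on_error="continue", error_handler=error_handling_fn)
results = [safe(r) for r in [{"text": "short"}, {"text": "x" * 50}]]
```

With `on_error="continue"`, the second row comes back flagged rather than aborting the loop; with the default `"raise"`, the exception propagates as it does today.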
Use case
I was using the Ray Data LLM APIs to analyze a decent-sized corpus of text data, and there were a few transient issues with specific rows: for instance, a dynamically generated prompt that was too large for the LLM's context window. This caused the whole pipeline to fail several hours into processing. I would have liked to specify that the pipeline should continue on error and flag the failed rows in my corpus with row['error'] = True.
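Until something like this exists, one possible workaround for the specific oversized-prompt case is to flag such rows before they reach the model. The sketch below is plain Python and makes several assumptions: the `prompt` column name, the `flag_oversized` and `split_rows` helpers, and the character budget are all hypothetical, and a character count is only a crude proxy for the model's token limit.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def flag_oversized(row: Row, max_chars: int) -> Row:
    """Return a copy of the row with row['error'] set if the prompt is too long."""
    return {**row, "error": len(row["prompt"]) > max_chars}


def split_rows(rows: Iterable[Row], max_chars: int) -> Tuple[List[Row], List[Row]]:
    """Separate rows that are safe to send from rows flagged as errors."""
    ok: List[Row] = []
    flagged: List[Row] = []
    for row in rows:
        checked = flag_oversized(row, max_chars)
        (flagged if checked["error"] else ok).append(checked)
    return ok, flagged


corpus = [{"id": 1, "prompt": "hello"}, {"id": 2, "prompt": "y" * 1000}]
ok_rows, flagged_rows = split_rows(corpus, max_chars=100)
```

This only covers failures that can be predicted up front; errors raised inside the request stage itself would still need the handler requested above.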