[One workflow] Workflow execution error trigger (workflows.failed)#257633

Merged
yngrdyn merged 29 commits into elastic:main from yngrdyn:14421-feature-workflow-error-trigger-reference-implementation
Apr 6, 2026

@yngrdyn yngrdyn commented Mar 13, 2026

Closes https://github.com/elastic/security-team/issues/14421.

This PR implements the Workflow execution error Trigger: when a workflow run fails, the platform emits a workflows.failed event so that other workflows subscribed to it can run (notifications, cleanup, retries). It also serves as a reference implementation for solution teams adding their own event-driven triggers.

Summary

When a workflow execution reaches a failed terminal state, the execution engine builds an event payload (workflow id/name, execution id, error message, failed step id/name, optional stack trace) and calls workflowsExtensions.emitEvent() with trigger id workflows.executionFailed. The existing trigger event handler (from #254964) resolves workflows subscribed to that trigger in the same space, evaluates optional KQL on.condition against the event, and runs only matching workflows with the event as context.event. The payload includes workflow.isErrorHandler: true when the failed run was itself triggered by an error event, so subscribers can filter out error-handler failures and avoid infinite loops.

graph TB
    subgraph Engine["workflows_execution_engine"]
        Run["runWorkflow() / resumeWorkflow()"]
        Fail["Execution fails → failStep()"]
        Finally["finally: load execution, build payload"]
        Emit["emitEvent(workflows.executionFailed, payload, spaceId, request)"]
    end

    subgraph Subscriber["Error-handling workflow"]
        Steps["Steps use {{ event.workflow.id }}, {{ event.error.message }}, etc."]
    end

    Run --> Fail
    Fail --> Finally
    Finally --> Emit
    Emit --> Steps

    style Engine fill:#e1f5ff
    style Subscriber fill:#e8f5e9
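The subscription flow described in the summary can be sketched as follows. This is a minimal illustration, not the handler from #254964: the `Subscription` and `TriggerEvent` shapes and the `resolveSubscribers` name are hypothetical, and a plain predicate stands in for the KQL `on.condition`.

```typescript
// Hypothetical shapes for illustration only; these are not the Kibana types.
interface TriggerEvent {
  workflow: { id: string; name: string; spaceId: string; isErrorHandler: boolean };
  error: { message: string };
}

interface Subscription {
  workflowId: string;
  spaceId: string;
  triggerId: string;
  // Stand-in for the optional KQL `on.condition`; the real handler evaluates KQL.
  condition?: (event: TriggerEvent) => boolean;
}

// Returns the ids of the workflows that should run for an emitted event:
// same trigger, same space, and a condition that is absent or matches.
function resolveSubscribers(
  subscriptions: Subscription[],
  triggerId: string,
  spaceId: string,
  event: TriggerEvent
): string[] {
  return subscriptions
    .filter((s) => s.triggerId === triggerId && s.spaceId === spaceId)
    .filter((s) => s.condition === undefined || s.condition(event))
    .map((s) => s.workflowId);
}

const subscriptions: Subscription[] = [
  {
    workflowId: 'monitor',
    spaceId: 'default',
    triggerId: 'workflows.executionFailed',
    condition: (e) => !e.workflow.isErrorHandler,
  },
  { workflowId: 'other-space', spaceId: 'security', triggerId: 'workflows.executionFailed' },
  { workflowId: 'catch-all', spaceId: 'default', triggerId: 'workflows.executionFailed' },
];

// A failure of a run that was itself triggered by an error event.
const failedHandlerEvent: TriggerEvent = {
  workflow: { id: 'monitor', name: 'Workflow failure monitor', spaceId: 'default', isErrorHandler: true },
  error: { message: 'Slack connector unavailable' },
};

const matched = resolveSubscribers(
  subscriptions,
  'workflows.executionFailed',
  'default',
  failedHandlerEvent
);
console.log(matched);
```

In this sketch the `monitor` subscription is skipped because its condition rejects error-handler failures, which is the loop guard the payload flag is meant to enable.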

Event payload (and thus context.event in subscriber workflows):

  • workflow: id, name, spaceId, isErrorHandler
  • execution: id, startedAt, failedAt
  • error: message, stepId, stepName, optional stepExecutionId, optional stackTrace

Conditions and steps can use e.g. event.workflow.name, event.error.stepName, event.execution.id, and not event.workflow.isErrorHandler:true to avoid handling failures from error-handler workflows.
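For readers who prefer types to bullet lists, the payload can be written down roughly as below. This is a sketch under assumptions: the field types (for example timestamps as ISO 8601 strings) are guesses, the `shouldHandle` and `describeFailure` helpers are hypothetical, and the Zod schema in the PR remains the authoritative definition.

```typescript
// Sketch of the event shape listed above. Field types are assumptions;
// the Zod schema added by the PR is the source of truth.
interface WorkflowExecutionFailedEvent {
  workflow: { id: string; name: string; spaceId: string; isErrorHandler: boolean };
  execution: { id: string; startedAt: string; failedAt: string };
  error: {
    message: string;
    stepId: string;
    stepName: string;
    stepExecutionId?: string;
    stackTrace?: string;
  };
}

// Plain TypeScript equivalent of the condition `not event.workflow.isErrorHandler:true`.
function shouldHandle(event: WorkflowExecutionFailedEvent): boolean {
  return event.workflow.isErrorHandler !== true;
}

// One-line summary built from the fields a subscriber step would typically read.
function describeFailure(event: WorkflowExecutionFailedEvent): string {
  return `${event.workflow.name} failed at step "${event.error.stepName}": ${event.error.message}`;
}

// Sample values are made up for illustration.
const sampleEvent: WorkflowExecutionFailedEvent = {
  workflow: { id: 'wf-1', name: 'Always fails', spaceId: 'default', isErrorHandler: false },
  execution: { id: 'exec-1', startedAt: '2026-03-13T11:00:00.000Z', failedAt: '2026-03-13T11:00:02.000Z' },
  error: { message: 'Request failed with status 500', stepId: 'http_always_500', stepName: 'http_always_500' },
};

if (shouldHandle(sampleEvent)) {
  console.log(describeFailure(sampleEvent));
}
```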


What's in this PR

Trigger registration (workflows_extensions)

  • Common: WORKFLOW_EXECUTION_FAILED_TRIGGER_ID, Zod workflowExecutionFailedEventSchema (workflow, execution, error with optional stepExecutionId and stackTrace), i18n for schema descriptions.
  • Server: Trigger definition registered in workflows_extensions plugin setup; used for validation when emitting and for internal trigger-definitions API.
  • Public: PublicTriggerDefinition with i18n title, description, documentation, and examples so the workflow authoring UI shows the trigger and its event shape.

Emit on failure (workflows_execution_engine)

  • Payload builder: buildWorkflowExecutionFailedPayload(execution, failedStepContext?) in server/lib/build_workflow_execution_failed_payload.ts. Step context (stepId, stepName, stepExecutionId, stackTrace) comes from in-memory FailedStepContext set in failStep(); not from execution.error.details or ES step executions (avoids refresh delays).
  • Failure context: In step_execution_runtime.ts, failStep() calls workflowExecutionState.setLastFailedStepContext({ stepId, stepName, stepExecutionId, stack }) so the payload builder can read it in the same run.
  • Emission: In run_workflow.ts and resume_workflow.ts, a finally block checks whether execution?.status === FAILED and the run is not a test run. If so, it builds the payload (with workflowExecutionState.getLastFailedStepContext()) and calls workflowsExtensions.emitEvent({ triggerId: WORKFLOW_EXECUTION_FAILED_TRIGGER_ID, spaceId, payload, request }). This ensures one emit per failed run and consistent metering.
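The three pieces above (in-memory failure context, payload building, emission in a finally block) fit together roughly as in the sketch below. It is a simplified stand-in, not the code in this PR: every type, the `runWorkflowSketch` name, and the synchronous signature are assumptions made to keep the example self-contained, and the payload is reduced to a few fields.

```typescript
// All names below are simplified stand-ins, not the engine's real types.
type ExecutionStatus = 'RUNNING' | 'COMPLETED' | 'FAILED';

interface FailedStepContext {
  stepId: string;
  stepName: string;
  stepExecutionId?: string;
  stack?: string;
}

interface Execution {
  id: string;
  workflowId: string;
  spaceId: string;
  status: ExecutionStatus;
  isTestRun: boolean;
  errorMessage?: string;
}

interface EmitArgs {
  triggerId: string;
  spaceId: string;
  payload: {
    workflow: { id: string };
    execution: { id: string };
    error: {
      message: string;
      stepId: string | undefined;
      stepName: string | undefined;
      stackTrace: string | undefined;
    };
  };
}

// In-memory holder: written when a step fails, read later in the same run.
class ExecutionState {
  private lastFailedStep?: FailedStepContext;

  setLastFailedStepContext(ctx: FailedStepContext): void {
    this.lastFailedStep = ctx;
  }

  getLastFailedStepContext(): FailedStepContext | undefined {
    return this.lastFailedStep;
  }
}

function runWorkflowSketch(
  execution: Execution,
  state: ExecutionState,
  runSteps: () => void,
  emitEvent: (args: EmitArgs) => void
): void {
  try {
    runSteps();
  } finally {
    // Single emission point: every exit path passes through here exactly once.
    if (execution.status === 'FAILED' && !execution.isTestRun) {
      const step = state.getLastFailedStepContext();
      emitEvent({
        triggerId: 'workflows.executionFailed',
        spaceId: execution.spaceId,
        payload: {
          workflow: { id: execution.workflowId },
          execution: { id: execution.id },
          error: {
            message: execution.errorMessage ?? 'Unknown error',
            stepId: step?.stepId,
            stepName: step?.stepName,
            stackTrace: step?.stack,
          },
        },
      });
    }
  }
}

// Usage: a run whose only step fails and records its context.
const emitted: EmitArgs[] = [];
const state = new ExecutionState();
const exec: Execution = {
  id: 'exec-1',
  workflowId: 'wf-1',
  spaceId: 'default',
  status: 'RUNNING',
  isTestRun: false,
};

runWorkflowSketch(
  exec,
  state,
  () => {
    state.setLastFailedStepContext({ stepId: 'http_always_500', stepName: 'http_always_500' });
    exec.status = 'FAILED';
    exec.errorMessage = 'HTTP 500';
  },
  (args) => {
    emitted.push(args);
  }
);
console.log(emitted.length);
```

Reading the step context from memory instead of querying stored step executions is what lets the payload be built immediately, without waiting for an index refresh.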

How to verify

  1. Start Kibana with the workflows extensions example:
    yarn start
  2. Create a workflow that always fails.
name: Always fails
enabled: true
triggers:
  - type: manual
steps:
  - name: log_start
    type: console
    with:
      message: "Workflow started; next step will fail."
  - name: http_always_500
    type: http
    with:
      url: "https://httpstat.us/500"
      method: GET

  3. Create a second workflow with trigger workflows.executionFailed and a step that logs or notifies:
name: Workflow failure monitor
description: Sends a Slack notification with full details when any workflow in the space fails.
enabled: true
triggers:
  - type: workflows.executionFailed
    on:
      condition: not event.workflow.isErrorHandler:true
steps:
  - name: slack_alert
    type: slack
    connector-id: c57c5a7b-dc2b-4d64-b9bd-a02c92696e03
    with:
      message: |
        :alert: *Workflow execution failed*

        *Workflow:* {{ event.workflow.name }}
        *Workflow ID:* `{{ event.workflow.id }}`
        *Space:* {{ event.workflow.spaceId }}

        *Failed step:* {{ event.error.stepName }}
        *Error:* {{ event.error.message }}

        *Execution ID:* {{ event.execution.id }}
        *Started:* {{ event.execution.startedAt }}
        *Failed at:* {{ event.execution.failedAt }}
        {% if event.error.stackTrace %}
        *Stack trace:*
        ```
        {{ event.error.stackTrace }}
        ```
        {% endif %}

        *View execution in Kibana:*
        {{kibanaUrl}}{% if event.workflow.spaceId != 'default' %}/s/{{ event.workflow.spaceId }}{% endif %}/app/workflows/{{ event.workflow.id }}?executionId={{ event.execution.id }}&tab=executions&stepExecutionId={{ event.error.stepExecutionId }}
  4. Run the first workflow; wait for it to fail.
  5. Confirm the second workflow runs and receives the event (e.g. triggeredBy: 'workflows.executionFailed', step sees event.workflow.name, event.error.message).
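The "View execution in Kibana" link at the end of the Slack message in the monitor workflow is dense as a Liquid one-liner, so here is the same logic as a small TypeScript helper. The `buildExecutionUrl` name and its input shape are hypothetical; the path and query parameters are copied from the template, with one deliberate difference: `stepExecutionId` is omitted when it is missing instead of being emitted empty.

```typescript
// Hypothetical helper mirroring the Liquid link template in the monitor workflow.
interface ExecutionLinkInput {
  kibanaUrl: string;
  spaceId: string;
  workflowId: string;
  executionId: string;
  stepExecutionId?: string;
}

function buildExecutionUrl(input: ExecutionLinkInput): string {
  // Non-default spaces are addressed under /s/<spaceId>, as in the template.
  const spacePrefix =
    input.spaceId !== 'default' ? `/s/${encodeURIComponent(input.spaceId)}` : '';

  const query: string[] = [
    `executionId=${encodeURIComponent(input.executionId)}`,
    'tab=executions',
  ];
  // Unlike the template, leave the parameter out when no step execution id is known.
  if (input.stepExecutionId !== undefined) {
    query.push(`stepExecutionId=${encodeURIComponent(input.stepExecutionId)}`);
  }

  const path = `/app/workflows/${encodeURIComponent(input.workflowId)}`;
  return `${input.kibanaUrl}${spacePrefix}${path}?${query.join('&')}`;
}

// Example host and ids are made up for illustration.
const link = buildExecutionUrl({
  kibanaUrl: 'https://kibana.example.com',
  spaceId: 'security',
  workflowId: 'wf-1',
  executionId: 'exec-1',
  stepExecutionId: 'step-exec-1',
});
console.log(link);
```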

Release note

Added workflows.executionFailed trigger so you can run workflows when another workflow fails. Use it to send notifications (e.g. Slack), run cleanup, or trigger retries. Subscriber workflows receive an event with workflow and execution details, the error message, and the failed step. #257633

@yngrdyn yngrdyn self-assigned this Mar 13, 2026
@yngrdyn yngrdyn requested a review from a team as a code owner March 13, 2026 11:37
@yngrdyn yngrdyn added backport:skip This PR does not require backporting release_note:feature Makes this part of the condensed release notes labels Mar 13, 2026
@botelastic botelastic bot added the Team:One Workflow Team label for One Workflow (Workflow automation) label Mar 13, 2026
@yngrdyn yngrdyn changed the title [One workflow] Workflow execution error Trigger (workflows.executionFailed) [One workflow] Workflow execution error trigger (workflows.executionFailed) Mar 13, 2026
@yngrdyn yngrdyn requested a review from Copilot March 13, 2026 11:57

Copilot AI left a comment

Pull request overview

Implements a new event-driven trigger (workflows.executionFailed) emitted when a workflow run reaches a failed terminal state, enabling subscriber workflows (notifications/cleanup/retries) to react with the failure event context.

Changes:

  • Adds common/server/public trigger definitions + schema for workflows.executionFailed and registers them in workflows_extensions.
  • Emits workflows.executionFailed from the execution engine on failed runs (run + resume paths) with step failure context captured at failure time.
  • Adds unit + Scout API tests and updates trigger-definition approval fixtures.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/platform/plugins/shared/workflows_management/test/scout_workflows_ui/api/tests/workflow_execution/workflow_error_trigger.spec.ts | Adds Scout test validating a subscriber workflow runs on execution failure. |
| src/platform/plugins/shared/workflows_extensions/test/scout/api/tests/trigger_definitions_approval.spec.ts | Updates tags configuration for trigger definitions approval test coverage. |
| src/platform/plugins/shared/workflows_extensions/test/scout/api/fixtures/approved_trigger_definitions.ts | Approves the new trigger id + schema hash. |
| src/platform/plugins/shared/workflows_extensions/server/triggers/workflow_execution_failed.ts | Introduces server trigger definition wrapper/export for the new trigger. |
| src/platform/plugins/shared/workflows_extensions/server/triggers/index.ts | Registers internal trigger definitions in the server plugin. |
| src/platform/plugins/shared/workflows_extensions/server/plugin.ts | Calls internal trigger registration during setup. |
| src/platform/plugins/shared/workflows_extensions/server/index.ts | Exposes trigger id/type from the server package. |
| src/platform/plugins/shared/workflows_extensions/public/triggers/workflow_execution_failed.ts | Adds UI-facing trigger metadata, docs, examples, and snippet. |
| src/platform/plugins/shared/workflows_extensions/public/triggers/index.ts | Registers public trigger definition. |
| src/platform/plugins/shared/workflows_extensions/public/plugin.ts | Calls public trigger registration during setup. |
| src/platform/plugins/shared/workflows_extensions/common/triggers/workflow_execution_failed.ts | Defines trigger id + Zod schema + common trigger definition. |
| src/platform/plugins/shared/workflows_extensions/common/triggers/index.ts | Re-exports common trigger artifacts. |
| src/platform/plugins/shared/workflows_extensions/common/index.ts | Exposes trigger artifacts from common entrypoint. |
| src/platform/plugins/shared/workflows_execution_engine/server/workflow_context_manager/workflow_execution_state.ts | Adds in-memory failed-step context storage for event payload building. |
| src/platform/plugins/shared/workflows_execution_engine/server/workflow_context_manager/step_execution_runtime.ts | Captures failed step context in failStep() for later emission. |
| src/platform/plugins/shared/workflows_execution_engine/server/lib/build_workflow_execution_failed_payload.ts | Adds payload builder for the emitted event. |
| src/platform/plugins/shared/workflows_execution_engine/server/lib/build_workflow_execution_failed_payload.test.ts | Adds unit tests for the payload builder. |
| src/platform/plugins/shared/workflows_execution_engine/server/execution_functions/run_workflow.ts | Emits the event in a finally block when status is FAILED. |
| src/platform/plugins/shared/workflows_execution_engine/server/execution_functions/run_workflow.test.ts | Adds unit tests asserting emission behavior in run path. |
| src/platform/plugins/shared/workflows_execution_engine/server/execution_functions/resume_workflow.ts | Emits the event in a finally block when a resumed execution fails. |
| src/platform/plugins/shared/workflows_execution_engine/server/execution_functions/resume_workflow.test.ts | Adds unit tests asserting emission behavior in resume path. |

yngrdyn added 2 commits March 13, 2026 14:33
…tation' of github.com:yngrdyn/kibana into 14421-feature-workflow-error-trigger-reference-implementation
@yngrdyn yngrdyn requested a review from a team as a code owner March 13, 2026 15:33

@skynetigor skynetigor left a comment

LGTM

@jbudz jbudz left a comment

packages/kbn-optimizer/limits.yml LGTM

@yngrdyn yngrdyn changed the title [One workflow] Workflow execution error trigger (workflows.executionFailed) [One workflow] Workflow execution error trigger (workflows.failed) Apr 6, 2026
@elasticmachine

💚 Build Succeeded

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
workflowsExtensions 239 244 +5

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
workflowsExtensions 25 33 +8

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
workflowsExtensions 52.1KB 55.9KB +3.9KB
workflowsManagement 2.3MB 2.3MB +186.0B
total +4.0KB

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
workflowsExtensions 34.5KB 37.6KB +3.0KB
Unknown metric groups

API count

id before after diff
workflowsExtensions 109 117 +8

async chunk count

id before after diff
workflowsExtensions 21 23 +2

cc @yngrdyn

@yngrdyn yngrdyn merged commit 1740ee6 into elastic:main Apr 6, 2026
19 checks passed

Labels

backport:skip This PR does not require backporting ci:build-cloud-image release_note:feature Makes this part of the condensed release notes Team:One Workflow Team label for One Workflow (Workflow automation) v9.4.0

8 participants