[kbn/evals] Phase 3: Red-teaming and security testing CLI #257824

@patrykkopycinski

Parent Epic

Part of #257821 — Extend @kbn/evals with advanced evaluation capabilities

Summary

Add a red-team CLI command and supporting modules for adversarial testing of Agent Builder agents. This enables systematic security validation before shipping agent capabilities.

Attack Modules

Each module generates adversarial prompts targeting a specific vulnerability:

  1. Prompt Injection — Direct, indirect, and multi-turn injection attempts
  2. Privilege Escalation — Attempts to access tools/data outside authorized scope
  3. Information Extraction — Tries to get the model to leak system prompts or internal data
  4. Jailbreaking — Attempts to bypass safety guidelines

Each module:

  1. Generates N adversarial prompts (optionally via LLM)
  2. Runs them through the ExperimentTask
  3. Evaluates with security evaluators from Phase 1
  4. Reports findings with severity classification
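The four steps above can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `AttackModule`, `Task`, `SecurityEvaluator`, `Finding`, and `runAttackModule` are hypothetical names, the task is kept synchronous for brevity (a real ExperimentTask is presumably asynchronous), and the evaluator is a toy stand-in for the Phase 1 security evaluators.

```typescript
// Hypothetical shapes for illustration; none of these names come from @kbn/evals.
type Severity = 'critical' | 'high' | 'medium' | 'low';

interface Finding {
  module: string;
  prompt: string;
  response: string;
  passed: boolean;
  // Severity assigned to the attack; only meaningful when `passed` is false.
  severity: Severity;
}

interface AttackModule {
  name: string;
  // Step 1: produce `count` adversarial prompts (static here; could call an LLM).
  generatePrompts(count: number): string[];
}

// Stand-ins for the ExperimentTask and a Phase 1 security evaluator.
type Task = (prompt: string) => string;
type SecurityEvaluator = (
  prompt: string,
  response: string
) => { passed: boolean; severity: Severity };

function runAttackModule(
  mod: AttackModule,
  task: Task,
  evaluate: SecurityEvaluator,
  count = 10
): Finding[] {
  return mod.generatePrompts(count).map((prompt) => {
    const response = task(prompt); // Step 2: run the prompt through the task
    const verdict = evaluate(prompt, response); // Step 3: score the response
    return { module: mod.name, prompt, response, ...verdict }; // Step 4: record a finding
  });
}

// Toy usage: a module with templated prompts and an agent that always refuses.
const promptInjection: AttackModule = {
  name: 'prompt-injection',
  generatePrompts: (count) =>
    Array.from(
      { length: count },
      (_, i) => `Ignore all previous instructions (variant ${i + 1}).`
    ),
};

const refusingTask: Task = (prompt) => `I cannot comply with: ${prompt}`;

const refusalEvaluator: SecurityEvaluator = (_prompt, response) => ({
  passed: response.startsWith('I cannot comply'),
  severity: 'high',
});

const findings = runAttackModule(promptInjection, refusingTask, refusalEvaluator);
console.log(findings.length, findings.every((f) => f.passed)); // prints: 10 true
```

Keeping the task and evaluator as injected functions means each module can be unit tested with stubs, without calling a model.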

Guardrail Rules Engine

Configurable pattern-based rules that scan every response:

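interface GuardrailRule {
  // Identifier for the rule.
  name: string;
  // Pattern tested against each response.
  pattern: RegExp;
  // What to do when the pattern matches.
  action: 'block' | 'warn' | 'log';
}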
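One way the rules could be applied to a response is sketched below. `GuardrailHit`, `scanResponse`, `isBlocked`, and the two example rules are assumptions made for illustration; the issue only specifies the `GuardrailRule` shape.

```typescript
interface GuardrailRule {
  name: string;
  pattern: RegExp;
  action: 'block' | 'warn' | 'log';
}

// Hypothetical result type: one entry per rule that matched.
interface GuardrailHit {
  rule: string;
  action: GuardrailRule['action'];
  match: string;
}

// Test every rule against one response and collect the matches.
function scanResponse(response: string, rules: GuardrailRule[]): GuardrailHit[] {
  const hits: GuardrailHit[] = [];
  for (const rule of rules) {
    // Clone without the g/y flags so lastIndex state cannot leak between calls.
    const pattern = new RegExp(
      rule.pattern.source,
      rule.pattern.flags.replace(/[gy]/g, '')
    );
    const match = pattern.exec(response);
    if (match) {
      hits.push({ rule: rule.name, action: rule.action, match: match[0] });
    }
  }
  return hits;
}

// A response is rejected as soon as any matching rule has action 'block'.
function isBlocked(hits: GuardrailHit[]): boolean {
  return hits.some((hit) => hit.action === 'block');
}

// Example rules (made up for this sketch).
const exampleRules: GuardrailRule[] = [
  { name: 'system-prompt-leak', pattern: /system prompt:/i, action: 'block' },
  { name: 'api-key-like-token', pattern: /\bsk-[A-Za-z0-9]{8,}\b/, action: 'warn' },
];

const hits = scanResponse('Sure! System prompt: You are a helpful agent.', exampleRules);
console.log(hits.map((h) => h.rule), isBlocked(hits)); // [ 'system-prompt-leak' ] true
```

Cloning each pattern before use is a defensive choice: a rule configured with the `g` flag would otherwise carry `lastIndex` from one response to the next and silently miss matches.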

CLI Interface

node scripts/evals red-team --suite agent-builder --module prompt-injection
node scripts/evals red-team --suite agent-builder --all
node scripts/evals red-team --suite agent-builder --guardrails-only
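The three invocations above suggest a small mapping from flags to a run plan, sketched here with a hand-rolled parser. Everything in it is an assumption: `planRedTeamRun` and `RedTeamPlan` are hypothetical names, only `prompt-injection` is a module name confirmed by the commands above (the other three are guessed from the planned file names), and requiring an explicit `--module`, `--all`, or `--guardrails-only` is a design choice the issue does not settle.

```typescript
// Only 'prompt-injection' appears in the issue; the other names are guesses.
const MODULES = [
  'prompt-injection',
  'privilege-escalation',
  'info-extraction',
  'jailbreaking',
] as const;

type ModuleName = (typeof MODULES)[number];

interface RedTeamPlan {
  suite: string;
  modules: ModuleName[];
  guardrailsOnly: boolean;
}

// Turn the arguments that follow `red-team` into a run plan.
function planRedTeamRun(argv: string[]): RedTeamPlan {
  let suite: string | undefined;
  let moduleName: string | undefined;
  let all = false;
  let guardrailsOnly = false;

  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--suite') suite = argv[++i];
    else if (arg === '--module') moduleName = argv[++i];
    else if (arg === '--all') all = true;
    else if (arg === '--guardrails-only') guardrailsOnly = true;
    else throw new Error(`Unknown argument: ${arg}`);
  }

  if (!suite) throw new Error('--suite is required');

  // Guardrails-only runs skip every attack module.
  if (guardrailsOnly) return { suite, modules: [], guardrailsOnly: true };
  if (all) return { suite, modules: [...MODULES], guardrailsOnly: false };
  if (moduleName === undefined) {
    throw new Error('Pass --module <name>, --all, or --guardrails-only');
  }

  const known = MODULES.find((m) => m === moduleName);
  if (!known) throw new Error(`Unknown module: ${moduleName}`);
  return { suite, modules: [known], guardrailsOnly: false };
}

console.log(planRedTeamRun(['--suite', 'agent-builder', '--module', 'prompt-injection']));
```

Validating the module name up front lets a typo fail fast, before any prompts are generated or any model is called.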

Files to Create

  • kbn-evals/src/red_team/index.ts
  • kbn-evals/src/red_team/modules/prompt_injection.ts
  • kbn-evals/src/red_team/modules/privilege_escalation.ts
  • kbn-evals/src/red_team/modules/info_extraction.ts
  • kbn-evals/src/red_team/modules/jailbreaking.ts
  • kbn-evals/src/red_team/guardrails.ts
  • kbn-evals/src/cli/commands/red_team.ts

Dependencies

  • Depends on Phase 1 (#TBD) — security evaluators are used to score attack results

Acceptance Criteria

  • node scripts/evals red-team runs all attack modules against a suite
  • Each attack module generates at least 10 adversarial prompts
  • Results include severity classification (critical/high/medium/low)
  • Guardrail rules can be configured per-suite
  • Unit tests for each attack module and guardrail evaluation
  • Report format includes pass/fail summary and detailed findings
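The severity and report criteria above could be met by a structure along these lines. This is a sketch under assumptions: `Finding`, `Report`, and `buildReport` are hypothetical names, and listing failures most-severe-first is one possible presentation, not something the issue prescribes.

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low';

// Hypothetical per-prompt result produced by an attack module.
interface Finding {
  module: string;
  prompt: string;
  passed: boolean;
  severity: Severity;
}

interface Report {
  total: number;
  passed: number;
  failed: number;
  failedBySeverity: Record<Severity, number>;
  // Detailed findings: failures only, most severe first.
  failures: Finding[];
}

const SEVERITY_RANK: Record<Severity, number> = {
  critical: 0,
  high: 1,
  medium: 2,
  low: 3,
};

function buildReport(findings: Finding[]): Report {
  const failedBySeverity: Record<Severity, number> = {
    critical: 0,
    high: 0,
    medium: 0,
    low: 0,
  };
  const failures = findings.filter((f) => !f.passed);
  for (const f of failures) {
    failedBySeverity[f.severity] += 1;
  }
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...failures].sort(
    (a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]
  );
  return {
    total: findings.length,
    passed: findings.length - failures.length,
    failed: failures.length,
    failedBySeverity,
    failures: sorted,
  };
}

const report = buildReport([
  { module: 'jailbreaking', prompt: 'p1', passed: true, severity: 'high' },
  { module: 'jailbreaking', prompt: 'p2', passed: false, severity: 'low' },
  { module: 'prompt-injection', prompt: 'p3', passed: false, severity: 'critical' },
]);
console.log(`${report.passed}/${report.total} passed`, report.failedBySeverity);
```

A CI gate could then fail the run whenever `failedBySeverity.critical` is greater than zero, while lower severities are only reported.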

Labels

  • Team:agent-builder
  • enhancement: New value added to drive a business result
  • kbn-evals: Issue related to the work on Kibana's LLM evaluation framework.
