
[SigEvents][Evals] Tighten existing evaluators and rename KI/rule terminology#258209

Merged
viduni94 merged 5 commits into elastic:main from viduni94:improve-existing-evaluators
Mar 17, 2026

Conversation

Contributor

@viduni94 viduni94 commented Mar 17, 2026

Closes https://github.com/elastic/streams-program/issues/950

Why

The significant-events flow now uses Knowledge Indicators (KIs) and rules as its core concepts. This update aligns the evaluation suite with that terminology, reduces ambiguity around “features” and “queries”, and makes the deterministic checks stricter and more informative.

Summary

  • Rename significant-events evaluation surfaces from feature/query terminology to KI/rule terminology, including renaming the query_generation suite to rule_generation and feature_extraction to KI_extraction.
  • Preserve compatibility where needed by keeping legacy storage/index naming, while updating eval code, dataset types, docs, and helper names to use KI-oriented terminology.
  • Improve KI extraction evaluation by tightening evidence grounding so short evidence snippets do not pass via overly permissive substring matches, adding support for array-index field paths in field.path=value evidence, and changing KI count scoring from binary pass/fail to proportional scoring so near misses are penalized more gradually.
  • Improve rule generation evaluation by incorporating category and severity compliance into the CODE evaluator score, validating scenario-defined expected_categories and esql_substrings, and making evidence checks less permissive so the evaluator better reflects whether generated rules are grounded in the provided log context.
  • Add focused unit tests for the KI extraction and rule generation evaluator changes, and add debug logging in rule generation to help diagnose cases where some models complete without ever calling the query-generation tools.
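The proportional KI count scoring mentioned above can be sketched as follows; the function name and exact formula are illustrative assumptions, not the PR's actual code:

```typescript
// Hypothetical proportional scorer for the KI count check: instead of a
// binary pass/fail, the score degrades linearly with distance from the
// expected count, clamped to [0, 1].
function kiCountScore(actualCount: number, expectedCount: number): number {
  if (expectedCount === 0) {
    return actualCount === 0 ? 1 : 0;
  }
  const distance = Math.abs(actualCount - expectedCount);
  return Math.max(0, 1 - distance / expectedCount);
}
```

With an expected count of 5, extracting 4 KIs scores 0.8 rather than failing outright, while extracting 10 or more scores 0, so near misses are penalized more gradually than gross misses.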

Checklist

  • Unit or functional tests were updated or added to match the most common scenarios
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines
  • Review the backport guidelines and apply applicable backport:* labels.

Summary by CodeRabbit

  • Documentation

    • Updated evaluation docs and run examples to use "KI" (Knowledge Indicator) terminology and revised configuration names.
  • New Features

    • Added KI extraction and rule-generation evaluation flows and a utility for evidence-text matching.
  • Tests

    • Introduced KI-focused unit/integration tests and replaced several legacy feature/query-generation test suites.
  • Chores

    • Renamed evaluators, datasets, and data-generator outputs to KI-centric naming; updated environment/config identifiers.

@viduni94 viduni94 self-assigned this Mar 17, 2026
@viduni94 viduni94 requested review from a team as code owners March 17, 2026 18:04
@viduni94 viduni94 added the following labels on Mar 17, 2026:
  • release_note:skip (Skip the PR/issue when compiling release notes)
  • backport:skip (This PR does not require backporting)
  • Feature:SigEvents (Significant events feature, related to streams and rules/alerts (RnA))
  • Team:SigEvents (Project team working on Significant Events; deprecated)
  • closes:sig-events (PR closes an issue labeled for Significant Events)
  • evals:streams-sigevents (This label is deprecated. Use `evals:significant-events` to run the Significant Events eval suite.)
  • models:eis/anthropic-claude-4.6-opus, models:eis/anthropic-claude-4.6-sonnet, models:eis/google-gemini-3.0-flash, models:eis/openai-gpt-5.2, models:eis/openai-gpt-oss-120b (Run LLM evals against the named model)
  • models:judge:eis/google-gemini-3.1-pro (Override LLM-as-a-judge connector for evals: eis/google-gemini-3.1-pro)
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough


This PR renames and rebrands the Significant Events evals from "feature extraction"/"query generation" to "KI (Knowledge Indicator) extraction"/"rule generation", adding KI-centric evaluators, tests, and data-generator renames while removing legacy feature/query modules. Public/type exports and many file paths were renamed accordingly.

Changes

Cohort / File(s) Summary
Top-level docs
x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/README.md
Terminology and examples updated from feature/query generation → KI/rule generation; CLI env var names and run examples adjusted.
Dataset schemas & index
.../datasets/types.ts, .../datasets/index.ts, .../datasets/otel_demo.ts
Renamed types/fields: QueryGenerationScenario→RuleGenerationScenario, FeatureExtractionScenario→KIExtractionScenario; dataset keys renamed to ruleGeneration/kiExtraction.
KI extraction tests & helpers
.../ki_extraction/ki_extraction.spec.ts, .../ki_extraction/collect_sample_documents.ts
Added integration spec for KI extraction; updated helper signatures to use KIExtractionScenario; orchestration for snapshot resolve, replay, sample collection, and evaluator wiring.
Rule generation tests & helpers
.../rule_generation/rule_generation.spec.ts, .../rule_generation/resolve_ki_sources.ts, .../rule_generation/get_computed_kis_from_docs.ts
Added rule-generation spec and helpers; renamed FeatureSource→KISource, env var fallback chain switched to RULE_GENERATION_KI_SOURCE; new flow handles canonical/snapshot/auto KI sources and KI computation.
Evaluators — added/renamed
src/evaluators/ki_extraction_evaluators.ts, src/evaluators/ki_extraction_evaluators.test.ts, src/evaluators/ki_duplication_evaluators.ts, src/evaluators/rule_generation_evaluators.ts, src/evaluators/rule_generation_evaluators.test.ts, src/evaluators/evidence_text_matching.ts
Renamed feature evaluators → KI evaluators (types, names, helpers); added a rule-generation CODE evaluator with ES|QL-based checks and a shared evidence-text-matching utility.
Data generators & snapshot tooling
src/data_generators/canonical_kis.ts, src/data_generators/canonical_kis.test.ts, src/data_generators/load_kis_from_snapshot.ts, src/data_generators/load_kis_from_snapshot.test.ts, src/data_generators/sigevents_kis_index.ts, src/data_generators/sigevents_kis_index.test.ts, src/data_generators/replay.ts, scripts/.../gcs.ts, scripts/.../significant_events_workflow.ts
Renamed features → KIs across generator APIs and tests (canonicalKIs, loadKIsFromSnapshot, getSigeventsSnapshotKIsIndex) and updated index usage in snapshot scripts.
Removed legacy feature/query modules
.../feature_extraction/feature_extraction.spec.ts, .../query_generation/query_generation.spec.ts, src/evaluators/query_generation_evaluators.ts, src/data_generators/canonical_features.test.ts
Deleted legacy feature extraction/query generation specs and the old query_generation_evaluators module; replaced by KI- and rule-generation counterparts.
Datasets: KI duplication
.../ki_duplication/ki_duplication.spec.ts, .../ki_duplication/ki_duplication_datasets.ts
Renamed duplication datasets and test harness to KI-centric names and evaluator imports; interfaces/constants renamed (FeaturesDuplicationEvaluationDataset → KIDuplicationEvaluationDataset, FEATURES_DUPLICATION_DATASETS → KI_DUPLICATION_DATASETS).

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant TestRunner as Test Runner
  participant SnapshotSvc as GCS Snapshot Service
  participant Replayer as Replay Tool
  participant ES as Elasticsearch
  participant LLM as Inference Client
  participant Evaluator as Evaluators (KI & Rule)
  TestRunner->>SnapshotSvc: list available snapshots
  SnapshotSvc-->>TestRunner: snapshot metadata
  TestRunner->>Replayer: replay snapshot into managed stream
  Replayer->>ES: ingest replayed logs
  TestRunner->>ES: refresh indices & collect sample documents
  TestRunner->>LLM: run KI extraction / rule generation task (samples + prompt)
  LLM-->>TestRunner: task output (KIs or queries, trace id)
  TestRunner->>Evaluator: run evaluators (KI extraction evaluators, rule-generation CODE evaluator)
  Evaluator->>ES: optional ES|QL checks / evidence lookups
  Evaluator-->>TestRunner: evaluation results and scores
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Team:actionable-obs

Suggested reviewers

  • crespocarlos
  • jasonrhodes
  • mdbirnstiehl
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed: The title clearly and specifically summarizes the main change: renaming KI/rule terminology in the significant events evaluators, which is the core objective of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

Tools execution failed with the following error:

Failed to run tools: 14 UNAVAILABLE: Connection dropped


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_extraction/collect_sample_documents.ts (1)

25-45: ⚠️ Potential issue | 🟡 Minor

The dependency-id parser is already treating non-app tokens as apps.

extractRequiredAppsFromCriteria() assumes every dep-* segment maps to resource.attributes.app, but the current dataset already contains ids like dep-services-flagd and dep-upstream. That makes the fallback sampler query impossible apps and emit misleading "missing required apps" warnings. Please switch this to explicit required-app metadata or a stricter parser instead of splitting arbitrary dependency ids.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_extraction/collect_sample_documents.ts`
around lines 25 - 45, extractRequiredAppsFromCriteria currently treats every
token from dep-* ids as an app by splitting on '-' which misclassifies non-app
tokens like 'services' or 'upstream'; update extractRequiredAppsFromCriteria
(and the KIExtractionScenario output parsing) to stop splitting arbitrary dep
ids and instead read explicit required-app metadata if present (e.g., a
dedicated criteria field like { requiredApps: string[] }) or implement a
stricter parser that only accepts known app-name patterns (validate against
allowed app name regex or a whitelist) before adding to the apps Set; ensure you
preserve existing handling for entity-* ids and keep return type the same.
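The whitelist-based parser suggested in the resolution above could look like the following sketch; the app names in the whitelist are assumptions drawn from this dataset's scenario ids (cart, checkout, payment, currency, flagd), not from the actual code:

```typescript
// Sketch of a stricter dependency-id parser: tokens from dep-* ids are
// validated against a known-app whitelist instead of every token being
// treated as an app. Whitelist contents are illustrative.
const KNOWN_APPS = new Set(['cart', 'checkout', 'payment', 'currency', 'flagd']);

function extractRequiredApps(criteriaIds: string[]): Set<string> {
  const apps = new Set<string>();
  for (const id of criteriaIds) {
    if (!id.startsWith('dep-')) continue; // entity-* ids keep their existing handling
    for (const token of id.slice('dep-'.length).split('-')) {
      if (KNOWN_APPS.has(token)) apps.add(token);
    }
  }
  return apps;
}
```

With this sketch, `dep-services-flagd` contributes only `flagd` and `dep-upstream` contributes nothing, so the fallback sampler no longer queries impossible apps or emits misleading "missing required apps" warnings.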
🧹 Nitpick comments (6)
x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.test.ts (1)

30-33: Consider defensive handling for evaluator lookup.

The non-null assertion (!) assumes the evaluator will always be found. If the evaluator name changes, this would throw at runtime rather than providing a clear error.

🔧 Suggested defensive approach
 const getRuleGenerationCodeEvaluator = (esClient: ElasticsearchClient) =>
-  createRuleGenerationEvaluators(esClient).find(
+  createRuleGenerationEvaluators(esClient).find(
     (evaluator) => evaluator.name === 'rule_generation_code_evaluator'
-  )!;
+  ) ?? (() => { throw new Error('rule_generation_code_evaluator not found'); })();

Alternatively, since this is a test helper, the current approach is acceptable as the test would fail with a clear enough error on first use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.test.ts`
around lines 30 - 33, The helper getRuleGenerationCodeEvaluator uses a non-null
assertion on the result of createRuleGenerationEvaluators(...).find(...), which
will throw a less-descriptive runtime error if the evaluator name is changed or
missing; replace the `!` with an explicit defensive check: locate the creation
call to createRuleGenerationEvaluators(esClient) and the find that compares
evaluator.name === 'rule_generation_code_evaluator' in
getRuleGenerationCodeEvaluator, capture the result in a variable, and if it's
undefined throw a clear Error (e.g., "rule_generation_code_evaluator not found
in createRuleGenerationEvaluators") so tests fail with a descriptive message.
x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.test.ts (1)

12-17: Consider adding explicit existence checks for evaluators.

The non-null assertions (!) on lines 12 and 15 could mask issues if evaluator names change. Consider adding explicit checks:

Safer evaluator lookup
-const evidenceGroundingEvaluator = createKIExtractionEvaluators().find(
-  (evaluator) => evaluator.name === 'evidence_grounding'
-);
-const kiCountEvaluator = createKIExtractionEvaluators().find(
-  (evaluator) => evaluator.name === 'ki_count'
-);
+const evaluators = createKIExtractionEvaluators();
+
+const evidenceGroundingEvaluator = evaluators.find(
+  (evaluator) => evaluator.name === 'evidence_grounding'
+);
+const kiCountEvaluator = evaluators.find(
+  (evaluator) => evaluator.name === 'ki_count'
+);
+
+if (!evidenceGroundingEvaluator || !kiCountEvaluator) {
+  throw new Error('Required evaluators not found - check evaluator names');
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.test.ts`
around lines 12 - 17, The test currently uses non-null assertions for
evidenceGroundingEvaluator and kiCountEvaluator returned from
createKIExtractionEvaluators(), which can hide failures if evaluator names
change; update the test to explicitly check that
createKIExtractionEvaluators().find(...) returned a value for each evaluator
(evidenceGroundingEvaluator and kiCountEvaluator) and fail early with a clear
message (e.g., throw or expect) if not found, before using them in later
assertions so the test errors point to the missing evaluator by name.
x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts (1)

29-76: Code duplication with ki_extraction_evaluators.ts.

The SHORT_EVIDENCE_MAX_LENGTH, escapeRegExp, and the short-evidence word-boundary matching pattern are duplicated between this file and ki_extraction_evaluators.ts (lines 150-169). Consider extracting to a shared utility.

Suggested shared utility
// src/evaluators/evidence_matching.ts
export const SHORT_EVIDENCE_MAX_LENGTH = 3;

export function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

export function matchesShortEvidence(text: string, evidence: string): boolean {
  const normalized = evidence.trim();
  if (normalized.length === 0) return false;
  
  if (normalized.length <= SHORT_EVIDENCE_MAX_LENGTH) {
    return new RegExp(`(^|[^a-zA-Z0-9_])${escapeRegExp(normalized)}($|[^a-zA-Z0-9_])`).test(text);
  }
  return text.includes(normalized);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts`
around lines 29 - 76, Extract duplicated logic (SHORT_EVIDENCE_MAX_LENGTH,
escapeRegExp, and the short-evidence word-boundary matching) into a shared
utility module (e.g., export SHORT_EVIDENCE_MAX_LENGTH, escapeRegExp, and a
matchesShortEvidence function) and replace the local implementations in
rule_generation_evaluators.ts (functions/constants: SHORT_EVIDENCE_MAX_LENGTH,
escapeRegExp, matchesEvidenceInLog) with imports from that utility; update
matchesEvidenceInLog to delegate to the new matchesShortEvidence (or replace
usages directly) and add appropriate imports so behavior remains identical.
x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts (1)

23-40: Consider using typed mocks instead of as any casts.

The test file uses multiple as any casts for mocks (lines 28, 40, and throughout). While this is common in test code, it could mask type mismatches. Consider creating partial mock types:

Example typed mock approach
+type MockToolingLog = Pick<ToolingLog, 'info' | 'debug' | 'warning' | 'error'>;
+
 const log = {
   info: jest.fn(),
   debug: jest.fn(),
   warning: jest.fn(),
   error: jest.fn(),
-} as any;
+} as MockToolingLog;

As per coding guidelines: "Use TypeScript for all new code; avoid any and unknown."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts`
around lines 23 - 40, Replace the untyped `as any` mocks with typed
partial/mocked types to preserve compile-time checks: define `log` as a
`jest.Mocked<YourLoggerType>` (or `Partial<YourLoggerType>` with jest.fn() for
the methods) instead of `as any`, type `gcs` with its proper interface/type, and
change `makeEsClient` to return `Partial<Client>` or `jest.Mocked<Client>` (use
the real `Client` type from the ES client) so the `indices.delete` and `search`
mocks are correctly typed; update all occurrences (references: `log`, `gcs`,
`makeEsClient`, and the `Client` type) to use these typed mocks rather than `as
any`.
x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/types.ts (1)

10-10: Keep DatasetConfig independent from evaluator code.

datasets/types.ts is the shared contract for datasets and specs, but ValidKIType now comes from src/evaluators/ki_extraction_evaluators. That couples a public schema module to a concrete evaluator implementation. Moving the KI-type union into a neutral shared types module would keep the dependency direction cleaner and reduce the chance of cycles later.

Also applies to: 41-50

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/types.ts`
at line 10, DatasetConfig in datasets/types.ts currently depends on ValidKIType
from src/evaluators/ki_extraction_evaluators which couples the shared dataset
contract to a concrete evaluator; extract the KI-type union into a neutral
shared types module (e.g., create/export a KiType or ValidKIType in a shared
package used by both datasets and evaluators), update datasets/types.ts to
import the KI union from that new shared module instead of
src/evaluators/ki_extraction_evaluators, and update the evaluator implementation
to import the same shared type so DatasetConfig (and any other shared
interfaces) remain independent of evaluator code; ensure symbols involved are
DatasetConfig and ValidKIType (or the new shared KiType) so references are
updated consistently.
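The proposed decoupling can be sketched as a small neutral module; the union members below are placeholders, since the real ValidKIType values are defined in ki_extraction_evaluators today:

```typescript
// Sketch of a neutral shared module (e.g. a hypothetical src/types/ki.ts)
// that both datasets/types.ts and the evaluators could import from, so the
// dataset contract no longer depends on evaluator code. Values are placeholders.
const VALID_KI_TYPES = ['metric', 'pattern', 'threshold'] as const;
type ValidKIType = (typeof VALID_KI_TYPES)[number];

// A consumer gets the same compile-time checking without importing evaluator code:
const example: ValidKIType = 'pattern';
```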
x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/otel_demo.ts (1)

381-624: Only two rule scenarios opt into the new esql_substrings check.

The new evaluator tightening only applies this grounding check when a scenario defines esql_substrings, but in this dataset that is currently limited to payment-unreachable and cart-redis-cutoff. Adding representative substrings for the remaining rule-generation scenarios would make the new behavior exercise the full matrix instead of just those two cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/otel_demo.ts`
around lines 381 - 624, The ruleGeneration scenarios only include
esql_substrings for payment-unreachable and cart-redis-cutoff, so add
representative esql_substrings arrays to the other scenario objects to exercise
the new evaluator check; update the ruleGeneration entries for scenario_id
values healthy-baseline, currency-unreachable, checkout-memory-starvation,
flagd-unreachable, and load-generator-ramp to include relevant substrings (e.g.,
healthy-baseline: ['throughput','latency','service'], currency-unreachable:
['currency','conversion','unreachable'], checkout-memory-starvation:
['OOMKilled','memory','GC pressure'], flagd-unreachable:
['flagd','feature-flag','evaluation'], load-generator-ramp: ['latency','error
rate','throughput']) so the evaluator sees grounding text across all scenarios.
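For context, a scenario that opts into the esql_substrings check might look like the following sketch; field values are illustrative, not taken from otel_demo.ts:

```typescript
// Hypothetical shape of a rule-generation scenario opting into the
// esql_substrings grounding check. Values are illustrative only.
const currencyUnreachableScenario = {
  scenario_id: 'currency-unreachable',
  expected_categories: ['error'],
  esql_substrings: ['currency', 'unreachable'],
};

// Because the check only applies when a scenario defines esql_substrings,
// an evaluator would resolve the substrings defensively:
function substringsToCheck(scenario: { esql_substrings?: string[] }): string[] {
  return scenario.esql_substrings ?? [];
}
```

Scenarios without esql_substrings simply skip the check, which is why only the two scenarios that define it currently exercise the new behavior.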
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts`:
- Around line 101-109: The test is sampling logs and computing KIs from the
whole snapshot because sampleLogs is still using match_all; use the
previously-resolved extractionScenario/scenario filter so the rule-generation
context is limited to the scenario's failure slice. Update the call that builds
the ES query for sampleLogs (and any match_all usage between sampleLogs and
canonicalKIsFromExpectedGroundTruth) to include a filter on
scenario.input.scenario_id (or extractionScenario.input.scenario_id) so
sampleLogs and the KIs are sourced only from that scenario's slice.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 8bc2e091-2af7-48af-ac7a-bcdd69f4c71a

📥 Commits

Reviewing files that changed from the base of the PR and between 92b8fab and 3e91bf6.

📒 Files selected for processing (30)
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/README.md
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/index.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/otel_demo.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/types.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/feature_extraction/feature_extraction.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_duplication/ki_duplication.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_duplication/ki_duplication_datasets.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_extraction/collect_sample_documents.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_extraction/ki_extraction.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/query_generation/query_generation.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/extract_log_text.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/get_computed_kis_from_docs.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/resolve_ki_sources.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/scripts/significant_events_snapshots/lib/gcs.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/scripts/significant_events_snapshots/lib/significant_events_workflow.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/canonical_features.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/canonical_kis.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/canonical_kis.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/replay.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/sigevents_kis_index.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/sigevents_kis_index.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_duplication_evaluators.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/query_generation_evaluators.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts
💤 Files with no reviewable changes (4)
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/canonical_features.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/query_generation/query_generation.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/feature_extraction/feature_extraction.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/query_generation_evaluators.ts

@viduni94 viduni94 removed the models:eis/google-gemini-3.0-flash Run LLM evals against model: eis/google-gemini-3.0-flash label Mar 17, 2026
@viduni94 viduni94 requested a review from crespocarlos March 17, 2026 20:21
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts (1)

91-96: Consider adding debug logging for cleanup failures.

The empty catch blocks silently swallow errors during cleanup. While this is acceptable for cleanup operations that may legitimately fail (e.g., non-existent streams), adding debug logging would help diagnose issues when cleanup unexpectedly fails.

♻️ Suggested improvement
           for (const name of ['logs.otel', 'logs.ecs']) {
-            await esClient.indices.deleteDataStream({ name }).catch(() => {});
+            await esClient.indices.deleteDataStream({ name }).catch((e) => {
+              log.debug(`Failed to delete data stream ${name}: ${e.message}`);
+            });
             await esClient.indices
               .delete({ index: name, ignore_unavailable: true })
-              .catch(() => {});
+              .catch((e) => {
+                log.debug(`Failed to delete index ${name}: ${e.message}`);
+              });
           }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts`
around lines 91 - 96, The cleanup loop currently swallows errors silently in the
empty catch blocks around esClient.indices.deleteDataStream and
esClient.indices.delete; update those catches to log the caught error at debug
(or trace) level instead of doing nothing: capture the error in each catch and
call the test-suite logger (or console.debug if no logger is available) with a
short context message referencing deleteDataStream/delete and the error object
so unexpected failures during cleanup are visible without breaking tests.
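The pattern the suggestion describes can be sketched as a small standalone helper. `EsIndicesLike` and `LoggerLike` below are minimal stand-in interfaces for illustration, not the real Kibana/Elasticsearch client types:

```typescript
// Sketch of the suggested cleanup pattern: attempt each teardown step,
// but surface failures at debug level instead of discarding them.
interface EsIndicesLike {
  deleteDataStream(params: { name: string }): Promise<unknown>;
  delete(params: { index: string; ignore_unavailable: boolean }): Promise<unknown>;
}

interface LoggerLike {
  debug(message: string): void;
}

async function cleanupStreams(
  indices: EsIndicesLike,
  log: LoggerLike,
  names: string[]
): Promise<void> {
  for (const name of names) {
    // Each failure is logged with context instead of being swallowed,
    // and cleanup continues with the remaining names either way.
    await indices.deleteDataStream({ name }).catch((e: Error) => {
      log.debug(`Failed to delete data stream ${name}: ${e.message}`);
    });
    await indices
      .delete({ index: name, ignore_unavailable: true })
      .catch((e: Error) => {
        log.debug(`Failed to delete index ${name}: ${e.message}`);
      });
  }
}
```

Because the errors are only logged, not rethrown, a missing stream still cannot fail the test run.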
x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts (1)

96-112: Consider adding a timeout for ES|QL query validation to prevent test hangs.

The esClient.esql.query call on line 102 lacks a timeout, which could cause test runs to hang on malformed or expensive queries. Per the Elasticsearch Node.js client API, you can add a requestTimeout in the options parameter (2nd argument).

♻️ Suggested improvement
     try {
-      const result = await esClient.esql.query({ query: esql });
+      const result = await esClient.esql.query({ query: esql }, { requestTimeout: 120000 });
       isSyntaxValid = true;

Tune the timeout value (120000ms shown above) based on expected query complexity and your CI environment—shorter timeouts may be suitable for simple queries, but complex ES|QL queries may need the longer window to execute.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts`
around lines 96 - 112, The ES|QL validation call in the loop uses
esClient.esql.query without a timeout which can hang tests; update the
esClient.esql.query invocation (in the same block that sets
isSyntaxValid/isExecutionHit and increments validSyntaxCount/executionHitCount)
to pass a second options argument with a requestTimeout (e.g., 120000) so slow
or stuck queries fail fast, and keep the existing try/catch to log the error as
before.
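The client-level `requestTimeout` option in the diff is the right fix for the real evaluator, since it aborts the underlying HTTP request. The fail-fast behaviour itself can also be sketched generically with `Promise.race`; the helper below is illustrative and not the evaluator's actual code:

```typescript
// Generic fail-fast sketch: race the work against a timer so a stuck
// call rejects with a timeout error instead of hanging the eval run.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Clear the timer so it does not keep the event loop alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that `Promise.race` only abandons the pending promise; the underlying request keeps running, which is why the client's own `requestTimeout` is preferable here.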

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 23b95ab7-bd08-493c-978d-e695529b062d

📥 Commits

Reviewing files that changed from the base of the PR and between 3e91bf6 and dfa6583.

📒 Files selected for processing (7)
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/README.md
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/resolve_ki_sources.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/evidence_text_matching.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts

@viduni94 viduni94 removed the models:eis/anthropic-claude-4.6-opus Run LLM evals against model: eis/anthropic-claude-4.6-opus label Mar 17, 2026
@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #89 / dashboard app - group 2 full screen mode exits full screen mode when back button pressed
  • [job] [logs] Scout: [ observability / synthetics ] plugin / local-stateful-classic - DefaultStatusAlert - creates default alert, triggers on down status, and recovers
  • [job] [logs] Jest Tests #4 / rules_list rules_list component with items Click column to sort by P95

Metrics [docs]

✅ unchanged

History

cc @viduni94

@viduni94 viduni94 merged commit c3c2522 into elastic:main Mar 17, 2026
18 checks passed
DatasetConfig,
QueryGenerationScenario,
FeatureExtractionScenario,
RuleGenerationScenario,
Contributor


Is this really about rule generation? Isn't this KI generation of type query?

Proposed name:

  • KIQueryGenerationScenario

QueryGenerationScenario,
FeatureExtractionScenario,
RuleGenerationScenario,
KIExtractionScenario,
Contributor


My assumption is that this covers only the KIFeatureExtraction scenario, not all KIs?


Labels

  • backport:skip - This PR does not require backporting
  • closes:sig-events - PR closes an issue labeled for Significant Events (deprecated)
  • evals:streams-sigevents - This label is deprecated. Use `evals:significant-events` to run the Significant Events eval suite.
  • Feature:SigEvents - Significant events feature, related to streams and rules/alerts (RnA)
  • models:eis/anthropic-claude-4.6-sonnet - Run LLM evals against model: eis/anthropic-claude-4.6-sonnet
  • models:eis/openai-gpt-5.2 - Run LLM evals against model: eis/openai-gpt-5.2
  • models:eis/openai-gpt-oss-120b - Run LLM evals against model: eis/openai-gpt-oss-120b
  • models:judge:eis/google-gemini-3.1-pro - Override LLM-as-a-judge connector for evals: eis/google-gemini-3.1-pro
  • release_note:skip - Skip the PR/issue when compiling release notes
  • Team:SigEvents - Project team working on Significant Events
  • v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants