
[SigEvents][Evals] Tighten existing evaluators and rename KI/rule terminology#258209

Merged
viduni94 merged 5 commits into elastic:main from viduni94:improve-existing-evaluators
Mar 17, 2026

Conversation

Contributor

@viduni94 viduni94 commented Mar 17, 2026

Closes https://github.com/elastic/streams-program/issues/950

Why

The significant-events flow now uses Knowledge Indicators (KIs) and rules as its core concepts. This update aligns the evaluation suite with that terminology, reduces ambiguity around “features” and “queries”, and makes the deterministic checks stricter and more informative.

Summary

  • Rename significant-events evaluation surfaces from feature/query terminology to KI/rule terminology, including renaming the query_generation suite to rule_generation and feature_extraction to KI_extraction.
  • Preserve compatibility where needed by keeping legacy storage/index naming, while updating eval code, dataset types, docs, and helper names to use KI-oriented terminology.
  • Improve KI extraction evaluation by tightening evidence grounding so short evidence snippets do not pass via overly permissive substring matches, adding support for array-index field paths in field.path=value evidence, and changing KI count scoring from binary pass/fail to proportional scoring so near misses are penalized more gradually.
  • Improve rule generation evaluation by incorporating category and severity compliance into the CODE evaluator score, validating scenario-defined expected_categories and esql_substrings, and making evidence checks less permissive so the evaluator better reflects whether generated rules are grounded in the provided log context.
  • Add focused unit tests for the KI extraction and rule generation evaluator changes, and add debug logging in rule generation to help diagnose cases where some models complete without ever calling the query-generation tools.
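The proportional KI count scoring mentioned above can be sketched as follows; the function name and exact formula are illustrative assumptions, not the PR's actual code:

```typescript
// Hypothetical proportional scorer for the KI count check: instead of a
// binary pass/fail, the score degrades linearly with distance from the
// expected count, clamped to [0, 1].
function kiCountScore(actualCount: number, expectedCount: number): number {
  if (expectedCount === 0) {
    return actualCount === 0 ? 1 : 0;
  }
  const distance = Math.abs(actualCount - expectedCount);
  return Math.max(0, 1 - distance / expectedCount);
}
```

With an expected count of 5, extracting 4 KIs scores 0.8 rather than failing outright, while extracting 10 or more scores 0, so near misses are penalized more gradually than gross misses.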

Checklist

  • Unit or functional tests were updated or added to match the most common scenarios
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines
  • Review the backport guidelines and apply applicable backport:* labels.

Summary by CodeRabbit

  • Documentation

    • Updated evaluation docs and run examples to use "KI" (Knowledge Indicator) terminology and revised configuration names.
  • New Features

    • Added KI extraction and rule-generation evaluation flows and a utility for evidence-text matching.
  • Tests

    • Introduced KI-focused unit/integration tests and replaced several legacy feature/query-generation test suites.
  • Chores

    • Renamed evaluators, datasets, and data-generator outputs to KI-centric naming; updated environment/config identifiers.

@viduni94 viduni94 self-assigned this Mar 17, 2026
@viduni94 viduni94 requested review from a team as code owners March 17, 2026 18:04
@viduni94 viduni94 added the following labels on Mar 17, 2026:
  • release_note:skip (Skip the PR/issue when compiling release notes)
  • backport:skip (This PR does not require backporting)
  • Feature:SigEvents (Significant events feature, related to streams and rules/alerts (RnA))
  • Team:SigEvents (Project team working on Significant Events; deprecated)
  • closes:sig-events (PR closes an issue labeled for Significant Events)
  • evals:streams-sigevents (This label is deprecated. Use `evals:significant-events` to run the Significant Events eval suite.)
  • models:eis/anthropic-claude-4.6-opus, models:eis/anthropic-claude-4.6-sonnet, models:eis/google-gemini-3.0-flash, models:eis/openai-gpt-5.2, models:eis/openai-gpt-oss-120b (Run LLM evals against the named model)
  • models:judge:eis/google-gemini-3.1-pro (Override LLM-as-a-judge connector for evals: eis/google-gemini-3.1-pro)
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough


This PR renames and rebrands the Significant Events evals from "feature extraction"/"query generation" to "KI (Knowledge Indicator) extraction"/"rule generation", adding KI-centric evaluators, tests, and data-generator renames while removing legacy feature/query modules. Public/type exports and many file paths were renamed accordingly.

Changes

Cohort / File(s) Summary
Top-level docs
x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/README.md
Terminology and examples updated from feature/query generation → KI/rule generation; CLI env var names and run examples adjusted.
Dataset schemas & index
.../datasets/types.ts, .../datasets/index.ts, .../datasets/otel_demo.ts
Renamed types/fields: QueryGenerationScenario→RuleGenerationScenario, FeatureExtractionScenario→KIExtractionScenario; dataset keys renamed to ruleGeneration/kiExtraction.
KI extraction tests & helpers
.../ki_extraction/ki_extraction.spec.ts, .../ki_extraction/collect_sample_documents.ts
Added integration spec for KI extraction; updated helper signatures to use KIExtractionScenario; orchestration for snapshot resolve, replay, sample collection, and evaluator wiring.
Rule generation tests & helpers
.../rule_generation/rule_generation.spec.ts, .../rule_generation/resolve_ki_sources.ts, .../rule_generation/get_computed_kis_from_docs.ts
Added rule-generation spec and helpers; renamed FeatureSource→KISource, env var fallback chain switched to RULE_GENERATION_KI_SOURCE; new flow handles canonical/snapshot/auto KI sources and KI computation.
Evaluators — added/renamed
src/evaluators/ki_extraction_evaluators.ts, src/evaluators/ki_extraction_evaluators.test.ts, src/evaluators/ki_duplication_evaluators.ts, src/evaluators/rule_generation_evaluators.ts, src/evaluators/rule_generation_evaluators.test.ts, src/evaluators/evidence_text_matching.ts
Renamed feature evaluators → KI evaluators (types, names, helpers); added a rule-generation CODE evaluator with ES|QL-based checks and a shared evidence-text-matching utility.
Data generators & snapshot tooling
src/data_generators/canonical_kis.ts, src/data_generators/canonical_kis.test.ts, src/data_generators/load_kis_from_snapshot.ts, src/data_generators/load_kis_from_snapshot.test.ts, src/data_generators/sigevents_kis_index.ts, src/data_generators/sigevents_kis_index.test.ts, src/data_generators/replay.ts, scripts/.../gcs.ts, scripts/.../significant_events_workflow.ts
Renamed features → KIs across generator APIs and tests (canonicalKIs, loadKIsFromSnapshot, getSigeventsSnapshotKIsIndex) and updated index usage in snapshot scripts.
Removed legacy feature/query modules
.../feature_extraction/feature_extraction.spec.ts, .../query_generation/query_generation.spec.ts, src/evaluators/query_generation_evaluators.ts, src/data_generators/canonical_features.test.ts
Deleted legacy feature extraction/query generation specs and the old query_generation_evaluators module; replaced by KI- and rule-generation counterparts.
Datasets: KI duplication
.../ki_duplication/ki_duplication.spec.ts, .../ki_duplication/ki_duplication_datasets.ts
Renamed duplication datasets and test harness to KI-centric names and evaluator imports; interfaces/constants renamed (FeaturesDuplicationEvaluationDataset → KIDuplicationEvaluationDataset, FEATURES_DUPLICATION_DATASETS → KI_DUPLICATION_DATASETS).

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant TestRunner as Test Runner
  participant SnapshotSvc as GCS Snapshot Service
  participant Replayer as Replay Tool
  participant ES as Elasticsearch
  participant LLM as Inference Client
  participant Evaluator as Evaluators (KI & Rule)
  TestRunner->>SnapshotSvc: list available snapshots
  SnapshotSvc-->>TestRunner: snapshot metadata
  TestRunner->>Replayer: replay snapshot into managed stream
  Replayer->>ES: ingest replayed logs
  TestRunner->>ES: refresh indices & collect sample documents
  TestRunner->>LLM: run KI extraction / rule generation task (samples + prompt)
  LLM-->>TestRunner: task output (KIs or queries, trace id)
  TestRunner->>Evaluator: run evaluators (KI extraction evaluators, rule-generation CODE evaluator)
  Evaluator->>ES: optional ES|QL checks / evidence lookups
  Evaluator-->>TestRunner: evaluation results and scores
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

Team:actionable-obs

Suggested reviewers

  • crespocarlos
  • jasonrhodes
  • mdbirnstiehl
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed: The title clearly and specifically summarizes the main change: renaming KI/rule terminology in the significant events evaluators, which is the core objective of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

Tools execution failed with the following error:

Failed to run tools: 14 UNAVAILABLE: Connection dropped


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_extraction/collect_sample_documents.ts (1)

25-45: ⚠️ Potential issue | 🟡 Minor

The dependency-id parser is already treating non-app tokens as apps.

extractRequiredAppsFromCriteria() assumes every dep-* segment maps to resource.attributes.app, but the current dataset already contains ids like dep-services-flagd and dep-upstream. That makes the fallback sampler query impossible apps and emit misleading "missing required apps" warnings. Please switch this to explicit required-app metadata or a stricter parser instead of splitting arbitrary dependency ids.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_extraction/collect_sample_documents.ts`
around lines 25 - 45, extractRequiredAppsFromCriteria currently treats every
token from dep-* ids as an app by splitting on '-' which misclassifies non-app
tokens like 'services' or 'upstream'; update extractRequiredAppsFromCriteria
(and the KIExtractionScenario output parsing) to stop splitting arbitrary dep
ids and instead read explicit required-app metadata if present (e.g., a
dedicated criteria field like { requiredApps: string[] }) or implement a
stricter parser that only accepts known app-name patterns (validate against
allowed app name regex or a whitelist) before adding to the apps Set; ensure you
preserve existing handling for entity-* ids and keep return type the same.
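The whitelist-based parser suggested in the resolution above could look like the following sketch; the app names in the whitelist are assumptions drawn from this dataset's scenario ids (cart, checkout, payment, currency, flagd), not from the actual code:

```typescript
// Sketch of a stricter dependency-id parser: tokens from dep-* ids are
// validated against a known-app whitelist instead of every token being
// treated as an app. Whitelist contents are illustrative.
const KNOWN_APPS = new Set(['cart', 'checkout', 'payment', 'currency', 'flagd']);

function extractRequiredApps(criteriaIds: string[]): Set<string> {
  const apps = new Set<string>();
  for (const id of criteriaIds) {
    if (!id.startsWith('dep-')) continue; // entity-* ids keep their existing handling
    for (const token of id.slice('dep-'.length).split('-')) {
      if (KNOWN_APPS.has(token)) apps.add(token);
    }
  }
  return apps;
}
```

With this sketch, `dep-services-flagd` contributes only `flagd` and `dep-upstream` contributes nothing, so the fallback sampler no longer queries impossible apps or emits misleading "missing required apps" warnings.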
🧹 Nitpick comments (6)
x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.test.ts (1)

30-33: Consider defensive handling for evaluator lookup.

The non-null assertion (!) assumes the evaluator will always be found. If the evaluator name changes, this would throw at runtime rather than providing a clear error.

🔧 Suggested defensive approach
 const getRuleGenerationCodeEvaluator = (esClient: ElasticsearchClient) =>
-  createRuleGenerationEvaluators(esClient).find(
+  createRuleGenerationEvaluators(esClient).find(
     (evaluator) => evaluator.name === 'rule_generation_code_evaluator'
-  )!;
+  ) ?? (() => { throw new Error('rule_generation_code_evaluator not found'); })();

Alternatively, since this is a test helper, the current approach is acceptable as the test would fail with a clear enough error on first use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.test.ts`
around lines 30 - 33, The helper getRuleGenerationCodeEvaluator uses a non-null
assertion on the result of createRuleGenerationEvaluators(...).find(...), which
will throw a less-descriptive runtime error if the evaluator name is changed or
missing; replace the `!` with an explicit defensive check: locate the creation
call to createRuleGenerationEvaluators(esClient) and the find that compares
evaluator.name === 'rule_generation_code_evaluator' in
getRuleGenerationCodeEvaluator, capture the result in a variable, and if it's
undefined throw a clear Error (e.g., "rule_generation_code_evaluator not found
in createRuleGenerationEvaluators") so tests fail with a descriptive message.
x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.test.ts (1)

12-17: Consider adding explicit existence checks for evaluators.

The non-null assertions (!) on lines 12 and 15 could mask issues if evaluator names change. Consider adding explicit checks:

Safer evaluator lookup
-const evidenceGroundingEvaluator = createKIExtractionEvaluators().find(
-  (evaluator) => evaluator.name === 'evidence_grounding'
-);
-const kiCountEvaluator = createKIExtractionEvaluators().find(
-  (evaluator) => evaluator.name === 'ki_count'
-);
+const evaluators = createKIExtractionEvaluators();
+
+const evidenceGroundingEvaluator = evaluators.find(
+  (evaluator) => evaluator.name === 'evidence_grounding'
+);
+const kiCountEvaluator = evaluators.find(
+  (evaluator) => evaluator.name === 'ki_count'
+);
+
+if (!evidenceGroundingEvaluator || !kiCountEvaluator) {
+  throw new Error('Required evaluators not found - check evaluator names');
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.test.ts`
around lines 12 - 17, The test currently uses non-null assertions for
evidenceGroundingEvaluator and kiCountEvaluator returned from
createKIExtractionEvaluators(), which can hide failures if evaluator names
change; update the test to explicitly check that
createKIExtractionEvaluators().find(...) returned a value for each evaluator
(evidenceGroundingEvaluator and kiCountEvaluator) and fail early with a clear
message (e.g., throw or expect) if not found, before using them in later
assertions so the test errors point to the missing evaluator by name.
x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts (1)

29-76: Code duplication with ki_extraction_evaluators.ts.

The SHORT_EVIDENCE_MAX_LENGTH, escapeRegExp, and the short-evidence word-boundary matching pattern are duplicated between this file and ki_extraction_evaluators.ts (lines 150-169). Consider extracting to a shared utility.

Suggested shared utility
// src/evaluators/evidence_matching.ts
export const SHORT_EVIDENCE_MAX_LENGTH = 3;

export function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

export function matchesShortEvidence(text: string, evidence: string): boolean {
  const normalized = evidence.trim();
  if (normalized.length === 0) return false;
  
  if (normalized.length <= SHORT_EVIDENCE_MAX_LENGTH) {
    return new RegExp(`(^|[^a-zA-Z0-9_])${escapeRegExp(normalized)}($|[^a-zA-Z0-9_])`).test(text);
  }
  return text.includes(normalized);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts`
around lines 29 - 76, Extract duplicated logic (SHORT_EVIDENCE_MAX_LENGTH,
escapeRegExp, and the short-evidence word-boundary matching) into a shared
utility module (e.g., export SHORT_EVIDENCE_MAX_LENGTH, escapeRegExp, and a
matchesShortEvidence function) and replace the local implementations in
rule_generation_evaluators.ts (functions/constants: SHORT_EVIDENCE_MAX_LENGTH,
escapeRegExp, matchesEvidenceInLog) with imports from that utility; update
matchesEvidenceInLog to delegate to the new matchesShortEvidence (or replace
usages directly) and add appropriate imports so behavior remains identical.
x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts (1)

23-40: Consider using typed mocks instead of as any casts.

The test file uses multiple as any casts for mocks (lines 28, 40, and throughout). While this is common in test code, it could mask type mismatches. Consider creating partial mock types:

Example typed mock approach
+type MockToolingLog = Pick<ToolingLog, 'info' | 'debug' | 'warning' | 'error'>;
+
 const log = {
   info: jest.fn(),
   debug: jest.fn(),
   warning: jest.fn(),
   error: jest.fn(),
-} as any;
+} as MockToolingLog;

As per coding guidelines: "Use TypeScript for all new code; avoid any and unknown."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts`
around lines 23 - 40, Replace the untyped `as any` mocks with typed
partial/mocked types to preserve compile-time checks: define `log` as a
`jest.Mocked<YourLoggerType>` (or `Partial<YourLoggerType>` with jest.fn() for
the methods) instead of `as any`, type `gcs` with its proper interface/type, and
change `makeEsClient` to return `Partial<Client>` or `jest.Mocked<Client>` (use
the real `Client` type from the ES client) so the `indices.delete` and `search`
mocks are correctly typed; update all occurrences (references: `log`, `gcs`,
`makeEsClient`, and the `Client` type) to use these typed mocks rather than `as
any`.
x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/types.ts (1)

10-10: Keep DatasetConfig independent from evaluator code.

datasets/types.ts is the shared contract for datasets and specs, but ValidKIType now comes from src/evaluators/ki_extraction_evaluators. That couples a public schema module to a concrete evaluator implementation. Moving the KI-type union into a neutral shared types module would keep the dependency direction cleaner and reduce the chance of cycles later.

Also applies to: 41-50

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/types.ts`
at line 10, DatasetConfig in datasets/types.ts currently depends on ValidKIType
from src/evaluators/ki_extraction_evaluators which couples the shared dataset
contract to a concrete evaluator; extract the KI-type union into a neutral
shared types module (e.g., create/export a KiType or ValidKIType in a shared
package used by both datasets and evaluators), update datasets/types.ts to
import the KI union from that new shared module instead of
src/evaluators/ki_extraction_evaluators, and update the evaluator implementation
to import the same shared type so DatasetConfig (and any other shared
interfaces) remain independent of evaluator code; ensure symbols involved are
DatasetConfig and ValidKIType (or the new shared KiType) so references are
updated consistently.
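The proposed decoupling can be sketched as a small neutral module; the union members below are placeholders, since the real ValidKIType values are defined in ki_extraction_evaluators today:

```typescript
// Sketch of a neutral shared module (e.g. a hypothetical src/types/ki.ts)
// that both datasets/types.ts and the evaluators could import from, so the
// dataset contract no longer depends on evaluator code. Values are placeholders.
const VALID_KI_TYPES = ['metric', 'pattern', 'threshold'] as const;
type ValidKIType = (typeof VALID_KI_TYPES)[number];

// A consumer gets the same compile-time checking without importing evaluator code:
const example: ValidKIType = 'pattern';
```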
x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/otel_demo.ts (1)

381-624: Only two rule scenarios opt into the new esql_substrings check.

The new evaluator tightening only applies this grounding check when a scenario defines esql_substrings, but in this dataset that is currently limited to payment-unreachable and cart-redis-cutoff. Adding representative substrings for the remaining rule-generation scenarios would make the new behavior exercise the full matrix instead of just those two cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/otel_demo.ts`
around lines 381 - 624, The ruleGeneration scenarios only include
esql_substrings for payment-unreachable and cart-redis-cutoff, so add
representative esql_substrings arrays to the other scenario objects to exercise
the new evaluator check; update the ruleGeneration entries for scenario_id
values healthy-baseline, currency-unreachable, checkout-memory-starvation,
flagd-unreachable, and load-generator-ramp to include relevant substrings (e.g.,
healthy-baseline: ['throughput','latency','service'], currency-unreachable:
['currency','conversion','unreachable'], checkout-memory-starvation:
['OOMKilled','memory','GC pressure'], flagd-unreachable:
['flagd','feature-flag','evaluation'], load-generator-ramp: ['latency','error
rate','throughput']) so the evaluator sees grounding text across all scenarios.
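For context, a scenario that opts into the esql_substrings check might look like the following sketch; field values are illustrative, not taken from otel_demo.ts:

```typescript
// Hypothetical shape of a rule-generation scenario opting into the
// esql_substrings grounding check. Values are illustrative only.
const currencyUnreachableScenario = {
  scenario_id: 'currency-unreachable',
  expected_categories: ['error'],
  esql_substrings: ['currency', 'unreachable'],
};

// Because the check only applies when a scenario defines esql_substrings,
// an evaluator would resolve the substrings defensively:
function substringsToCheck(scenario: { esql_substrings?: string[] }): string[] {
  return scenario.esql_substrings ?? [];
}
```

Scenarios without esql_substrings simply skip the check, which is why only the two scenarios that define it currently exercise the new behavior.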
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts`:
- Around line 101-109: The test is sampling logs and computing KIs from the
whole snapshot because sampleLogs is still using match_all; use the
previously-resolved extractionScenario/scenario filter so the rule-generation
context is limited to the scenario's failure slice. Update the call that builds
the ES query for sampleLogs (and any match_all usage between sampleLogs and
canonicalKIsFromExpectedGroundTruth) to include a filter on
scenario.input.scenario_id (or extractionScenario.input.scenario_id) so
sampleLogs and the KIs are sourced only from that scenario's slice.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 8bc2e091-2af7-48af-ac7a-bcdd69f4c71a

📥 Commits

Reviewing files that changed from the base of the PR and between 92b8fab and 3e91bf6.

📒 Files selected for processing (30)
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/README.md
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/index.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/otel_demo.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/types.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/feature_extraction/feature_extraction.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_duplication/ki_duplication.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_duplication/ki_duplication_datasets.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_extraction/collect_sample_documents.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_extraction/ki_extraction.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/query_generation/query_generation.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/extract_log_text.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/get_computed_kis_from_docs.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/resolve_ki_sources.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/scripts/significant_events_snapshots/lib/gcs.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/scripts/significant_events_snapshots/lib/significant_events_workflow.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/canonical_features.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/canonical_kis.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/canonical_kis.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/replay.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/sigevents_kis_index.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/sigevents_kis_index.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_duplication_evaluators.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/query_generation_evaluators.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts
💤 Files with no reviewable changes (4)
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/canonical_features.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/query_generation/query_generation.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/feature_extraction/feature_extraction.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/query_generation_evaluators.ts

@viduni94 viduni94 removed the models:eis/google-gemini-3.0-flash Run LLM evals against model: eis/google-gemini-3.0-flash label Mar 17, 2026
@viduni94 viduni94 requested a review from crespocarlos March 17, 2026 20:21
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts (1)

91-96: Consider adding debug logging for cleanup failures.

The empty catch blocks silently swallow errors during cleanup. While this is acceptable for cleanup operations that may legitimately fail (e.g., non-existent streams), adding debug logging would help diagnose issues when cleanup unexpectedly fails.

♻️ Suggested improvement
           for (const name of ['logs.otel', 'logs.ecs']) {
-            await esClient.indices.deleteDataStream({ name }).catch(() => {});
+            await esClient.indices.deleteDataStream({ name }).catch((e) => {
+              log.debug(`Failed to delete data stream ${name}: ${e.message}`);
+            });
             await esClient.indices
               .delete({ index: name, ignore_unavailable: true })
-              .catch(() => {});
+              .catch((e) => {
+                log.debug(`Failed to delete index ${name}: ${e.message}`);
+              });
           }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts`
around lines 91 - 96, The cleanup loop currently swallows errors silently in the
empty catch blocks around esClient.indices.deleteDataStream and
esClient.indices.delete; update those catches to log the caught error at debug
(or trace) level instead of doing nothing: capture the error in each catch and
call the test-suite logger (or console.debug if no logger is available) with a
short context message referencing deleteDataStream/delete and the error object
so unexpected failures during cleanup are visible without breaking tests.
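The pattern the suggestion describes can be sketched as a small standalone helper. `EsIndicesLike` and `LoggerLike` below are minimal stand-in interfaces for illustration, not the real Kibana/Elasticsearch client types:

```typescript
// Sketch of the suggested cleanup pattern: attempt each teardown step,
// but surface failures at debug level instead of discarding them.
interface EsIndicesLike {
  deleteDataStream(params: { name: string }): Promise<unknown>;
  delete(params: { index: string; ignore_unavailable: boolean }): Promise<unknown>;
}

interface LoggerLike {
  debug(message: string): void;
}

async function cleanupStreams(
  indices: EsIndicesLike,
  log: LoggerLike,
  names: string[]
): Promise<void> {
  for (const name of names) {
    // Each failure is logged with context instead of being swallowed,
    // and cleanup continues with the remaining names either way.
    await indices.deleteDataStream({ name }).catch((e: Error) => {
      log.debug(`Failed to delete data stream ${name}: ${e.message}`);
    });
    await indices
      .delete({ index: name, ignore_unavailable: true })
      .catch((e: Error) => {
        log.debug(`Failed to delete index ${name}: ${e.message}`);
      });
  }
}
```

Because the errors are only logged, not rethrown, a missing stream still cannot fail the test run.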
x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts (1)

96-112: Consider adding a timeout for ES|QL query validation to prevent test hangs.

The esClient.esql.query call on line 102 lacks a timeout, which could cause test runs to hang on malformed or expensive queries. Per the Elasticsearch Node.js client API, you can add a requestTimeout in the options parameter (2nd argument).

♻️ Suggested improvement
     try {
-      const result = await esClient.esql.query({ query: esql });
+      const result = await esClient.esql.query({ query: esql }, { requestTimeout: 120000 });
       isSyntaxValid = true;

Tune the timeout value (120000ms shown above) based on expected query complexity and your CI environment—shorter timeouts may be suitable for simple queries, but complex ES|QL queries may need the longer window to execute.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts`
around lines 96 - 112, The ES|QL validation call in the loop uses
esClient.esql.query without a timeout which can hang tests; update the
esClient.esql.query invocation (in the same block that sets
isSyntaxValid/isExecutionHit and increments validSyntaxCount/executionHitCount)
to pass a second options argument with a requestTimeout (e.g., 120000) so slow
or stuck queries fail fast, and keep the existing try/catch to log the error as
before.
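The client-level `requestTimeout` option in the diff is the right fix for the real evaluator, since it aborts the underlying HTTP request. The fail-fast behaviour itself can also be sketched generically with `Promise.race`; the helper below is illustrative and not the evaluator's actual code:

```typescript
// Generic fail-fast sketch: race the work against a timer so a stuck
// call rejects with a timeout error instead of hanging the eval run.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Clear the timer so it does not keep the event loop alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Note that `Promise.race` only abandons the pending promise; the underlying request keeps running, which is why the client's own `requestTimeout` is preferable here.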

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 23b95ab7-bd08-493c-978d-e695529b062d

📥 Commits

Reviewing files that changed from the base of the PR and between 3e91bf6 and dfa6583.

📒 Files selected for processing (7)
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/README.md
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/resolve_ki_sources.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/evidence_text_matching.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.ts
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts

@viduni94 viduni94 removed the models:eis/anthropic-claude-4.6-opus Run LLM evals against model: eis/anthropic-claude-4.6-opus label Mar 17, 2026
@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #89 / dashboard app - group 2 full screen mode exits full screen mode when back button pressed
  • [job] [logs] Scout: [ observability / synthetics ] plugin / local-stateful-classic - DefaultStatusAlert - creates default alert, triggers on down status, and recovers
  • [job] [logs] Jest Tests #4 / rules_list rules_list component with items Click column to sort by P95

Metrics [docs]

✅ unchanged

History

cc @viduni94

@viduni94 viduni94 merged commit c3c2522 into elastic:main Mar 17, 2026
18 checks passed
DatasetConfig,
QueryGenerationScenario,
FeatureExtractionScenario,
RuleGenerationScenario,
Contributor


Is this really about rule generation? Isn't this KI generation of type query?

Proposed name:

  • KIQueryGenerationScenario

QueryGenerationScenario,
FeatureExtractionScenario,
RuleGenerationScenario,
KIExtractionScenario,
Contributor


My assumption is that this covers only the KIFeatureExtraction scenario, not all KIs?


Labels

  • backport:skip - This PR does not require backporting
  • closes:sig-events - PR closes an issue labeled for Significant Events (deprecated)
  • evals:streams-sigevents - This label is deprecated. Use `evals:significant-events` to run the Significant Events eval suite.
  • Feature:SigEvents - Significant events feature, related to streams and rules/alerts (RnA)
  • models:eis/anthropic-claude-4.6-sonnet - Run LLM evals against model: eis/anthropic-claude-4.6-sonnet
  • models:eis/openai-gpt-5.2 - Run LLM evals against model: eis/openai-gpt-5.2
  • models:eis/openai-gpt-oss-120b - Run LLM evals against model: eis/openai-gpt-oss-120b
  • models:judge:eis/google-gemini-3.1-pro - Override LLM-as-a-judge connector for evals: eis/google-gemini-3.1-pro
  • release_note:skip - Skip the PR/issue when compiling release notes
  • Team:SigEvents - Project team working on Significant Events
  • v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants