[SigEvents][Evals] Tighten existing evaluators and rename KI/rule terminology #258209

viduni94 merged 5 commits into `elastic:main`
Conversation
📝 Walkthrough

This PR renames and rebrands the Significant Events evals from "feature extraction"/"query generation" to "KI (Knowledge Indicator) extraction"/"rule generation", adding KI-centric evaluators, tests, and data-generator renames while removing legacy feature/query modules. Public/type exports and many file paths were renamed accordingly.

Changes
Sequence Diagram(s)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Warning: Tools execution failed with the following error: `14 UNAVAILABLE: Connection dropped`
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_extraction/collect_sample_documents.ts (1)
Lines 25-45: ⚠️ Potential issue | 🟡 Minor — The dependency-id parser is already treating non-app tokens as apps.

`extractRequiredAppsFromCriteria()` assumes every `dep-*` segment maps to `resource.attributes.app`, but the current dataset already contains ids like `dep-services-flagd` and `dep-upstream`. That makes the fallback sampler query impossible apps and emit misleading "missing required apps" warnings. Please switch this to explicit required-app metadata or a stricter parser instead of splitting arbitrary dependency ids.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_extraction/collect_sample_documents.ts` around lines 25 - 45, extractRequiredAppsFromCriteria currently treats every token from dep-* ids as an app by splitting on '-' which misclassifies non-app tokens like 'services' or 'upstream'; update extractRequiredAppsFromCriteria (and the KIExtractionScenario output parsing) to stop splitting arbitrary dep ids and instead read explicit required-app metadata if present (e.g., a dedicated criteria field like { requiredApps: string[] }) or implement a stricter parser that only accepts known app-name patterns (validate against allowed app name regex or a whitelist) before adding to the apps Set; ensure you preserve existing handling for entity-* ids and keep return type the same.
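To make the suggestion concrete, here is a minimal sketch of the stricter fallback parser. The `KNOWN_APPS` whitelist, the `Criteria` shape, and the `requiredApps` field are illustrative assumptions, not the actual dataset contract:

```typescript
// Sketch only: KNOWN_APPS, Criteria, and requiredApps are illustrative assumptions.
const KNOWN_APPS = new Set(['cart', 'checkout', 'payment', 'currency', 'flagd']);

interface Criteria {
  requiredApps?: string[]; // explicit metadata, preferred when present
  dependencyIds?: string[]; // e.g. 'dep-cart-redis', but also 'dep-upstream'
}

export function extractRequiredApps(criteria: Criteria): Set<string> {
  // Prefer explicit metadata over guessing from ids.
  if (criteria.requiredApps) {
    return new Set(criteria.requiredApps);
  }
  const apps = new Set<string>();
  for (const id of criteria.dependencyIds ?? []) {
    if (!id.startsWith('dep-')) continue;
    // Only accept tokens on the whitelist, so ids like 'dep-upstream'
    // no longer yield phantom apps in the sampler query.
    for (const token of id.slice('dep-'.length).split('-')) {
      if (KNOWN_APPS.has(token)) apps.add(token);
    }
  }
  return apps;
}
```

With this shape, `dep-upstream` contributes no apps at all, while an explicit `requiredApps` list short-circuits the id parsing entirely.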
🧹 Nitpick comments (6)
x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.test.ts (1)
Lines 30-33: Consider defensive handling for evaluator lookup.

The non-null assertion (`!`) assumes the evaluator will always be found. If the evaluator name changes, this would throw at runtime rather than providing a clear error.

🔧 Suggested defensive approach

```diff
 const getRuleGenerationCodeEvaluator = (esClient: ElasticsearchClient) =>
-  createRuleGenerationEvaluators(esClient).find(
+  createRuleGenerationEvaluators(esClient).find(
     (evaluator) => evaluator.name === 'rule_generation_code_evaluator'
-  )!;
+  ) ?? (() => { throw new Error('rule_generation_code_evaluator not found'); })();
```

Alternatively, since this is a test helper, the current approach is acceptable as the test would fail with a clear enough error on first use.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.test.ts` around lines 30 - 33, The helper getRuleGenerationCodeEvaluator uses a non-null assertion on the result of createRuleGenerationEvaluators(...).find(...), which will throw a less-descriptive runtime error if the evaluator name is changed or missing; replace the `!` with an explicit defensive check: locate the creation call to createRuleGenerationEvaluators(esClient) and the find that compares evaluator.name === 'rule_generation_code_evaluator' in getRuleGenerationCodeEvaluator, capture the result in a variable, and if it's undefined throw a clear Error (e.g., "rule_generation_code_evaluator not found in createRuleGenerationEvaluators") so tests fail with a descriptive message.

x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.test.ts (1)
Lines 12-17: Consider adding explicit existence checks for evaluators.

The non-null assertions (`!`) on lines 12 and 15 could mask issues if evaluator names change. Consider adding explicit checks:

Safer evaluator lookup

```diff
-const evidenceGroundingEvaluator = createKIExtractionEvaluators().find(
-  (evaluator) => evaluator.name === 'evidence_grounding'
-);
-const kiCountEvaluator = createKIExtractionEvaluators().find(
-  (evaluator) => evaluator.name === 'ki_count'
-);
+const evaluators = createKIExtractionEvaluators();
+
+const evidenceGroundingEvaluator = evaluators.find(
+  (evaluator) => evaluator.name === 'evidence_grounding'
+);
+const kiCountEvaluator = evaluators.find(
+  (evaluator) => evaluator.name === 'ki_count'
+);
+
+if (!evidenceGroundingEvaluator || !kiCountEvaluator) {
+  throw new Error('Required evaluators not found - check evaluator names');
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.test.ts` around lines 12 - 17, The test currently uses non-null assertions for evidenceGroundingEvaluator and kiCountEvaluator returned from createKIExtractionEvaluators(), which can hide failures if evaluator names change; update the test to explicitly check that createKIExtractionEvaluators().find(...) returned a value for each evaluator (evidenceGroundingEvaluator and kiCountEvaluator) and fail early with a clear message (e.g., throw or expect) if not found, before using them in later assertions so the test errors point to the missing evaluator by name.

x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts (1)
Lines 29-76: Code duplication with ki_extraction_evaluators.ts.

`SHORT_EVIDENCE_MAX_LENGTH`, `escapeRegExp`, and the short-evidence word-boundary matching pattern are duplicated between this file and `ki_extraction_evaluators.ts` (lines 150-169). Consider extracting to a shared utility.

Suggested shared utility

```typescript
// src/evaluators/evidence_matching.ts
export const SHORT_EVIDENCE_MAX_LENGTH = 3;

export function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

export function matchesShortEvidence(text: string, evidence: string): boolean {
  const normalized = evidence.trim();
  if (normalized.length === 0) return false;
  if (normalized.length <= SHORT_EVIDENCE_MAX_LENGTH) {
    return new RegExp(`(^|[^a-zA-Z0-9_])${escapeRegExp(normalized)}($|[^a-zA-Z0-9_])`).test(text);
  }
  return text.includes(normalized);
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts` around lines 29 - 76, Extract duplicated logic (SHORT_EVIDENCE_MAX_LENGTH, escapeRegExp, and the short-evidence word-boundary matching) into a shared utility module (e.g., export SHORT_EVIDENCE_MAX_LENGTH, escapeRegExp, and a matchesShortEvidence function) and replace the local implementations in rule_generation_evaluators.ts (functions/constants: SHORT_EVIDENCE_MAX_LENGTH, escapeRegExp, matchesEvidenceInLog) with imports from that utility; update matchesEvidenceInLog to delegate to the new matchesShortEvidence (or replace usages directly) and add appropriate imports so behavior remains identical.

x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts (1)
Lines 23-40: Consider using typed mocks instead of `as any` casts.

The test file uses multiple `as any` casts for mocks (lines 28, 40, and throughout). While this is common in test code, it could mask type mismatches. Consider creating partial mock types:

Example typed mock approach

```diff
+type MockToolingLog = Pick<ToolingLog, 'info' | 'debug' | 'warning' | 'error'>;
+
 const log = {
   info: jest.fn(),
   debug: jest.fn(),
   warning: jest.fn(),
   error: jest.fn(),
-} as any;
+} as MockToolingLog;
```

As per coding guidelines: "Use TypeScript for all new code; avoid `any` and `unknown`."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts` around lines 23 - 40, Replace the untyped `as any` mocks with typed partial/mocked types to preserve compile-time checks: define `log` as a `jest.Mocked<YourLoggerType>` (or `Partial<YourLoggerType>` with jest.fn() for the methods) instead of `as any`, type `gcs` with its proper interface/type, and change `makeEsClient` to return `Partial<Client>` or `jest.Mocked<Client>` (use the real `Client` type from the ES client) so the `indices.delete` and `search` mocks are correctly typed; update all occurrences (references: `log`, `gcs`, `makeEsClient`, and the `Client` type) to use these typed mocks rather than `as any`.

x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/types.ts (1)
Line 10: Keep `DatasetConfig` independent from evaluator code.

`datasets/types.ts` is the shared contract for datasets and specs, but `ValidKIType` now comes from `src/evaluators/ki_extraction_evaluators`. That couples a public schema module to a concrete evaluator implementation. Moving the KI-type union into a neutral shared types module would keep the dependency direction cleaner and reduce the chance of cycles later.

Also applies to: 41-50
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/types.ts` at line 10, DatasetConfig in datasets/types.ts currently depends on ValidKIType from src/evaluators/ki_extraction_evaluators which couples the shared dataset contract to a concrete evaluator; extract the KI-type union into a neutral shared types module (e.g., create/export a KiType or ValidKIType in a shared package used by both datasets and evaluators), update datasets/types.ts to import the KI union from that new shared module instead of src/evaluators/ki_extraction_evaluators, and update the evaluator implementation to import the same shared type so DatasetConfig (and any other shared interfaces) remain independent of evaluator code; ensure symbols involved are DatasetConfig and ValidKIType (or the new shared KiType) so references are updated consistently.

x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/otel_demo.ts (1)
Lines 381-624: Only two rule scenarios opt into the new `esql_substrings` check.

The new evaluator tightening only applies this grounding check when a scenario defines `esql_substrings`, but in this dataset that is currently limited to `payment-unreachable` and `cart-redis-cutoff`. Adding representative substrings for the remaining rule-generation scenarios would make the new behavior exercise the full matrix instead of just those two cases.
Verify each finding against the current code and only fix it if needed. In `@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/otel_demo.ts` around lines 381 - 624, The ruleGeneration scenarios only include esql_substrings for payment-unreachable and cart-redis-cutoff, so add representative esql_substrings arrays to the other scenario objects to exercise the new evaluator check; update the ruleGeneration entries for scenario_id values healthy-baseline, currency-unreachable, checkout-memory-starvation, flagd-unreachable, and load-generator-ramp to include relevant substrings (e.g., healthy-baseline: ['throughput','latency','service'], currency-unreachable: ['currency','conversion','unreachable'], checkout-memory-starvation: ['OOMKilled','memory','GC pressure'], flagd-unreachable: ['flagd','feature-flag','evaluation'], load-generator-ramp: ['latency','error rate','throughput']) so the evaluator sees grounding text across all scenarios.
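For readers unfamiliar with how an opt-in substring check behaves, a proportional grounding check of this kind might look like the following sketch. The scenario shape and the scoring formula are assumptions for illustration, not the evaluator's actual code:

```typescript
// Illustrative only: the real evaluator's scenario shape and scoring may differ.
interface RuleGenScenario {
  esql_substrings?: string[]; // substrings the generated ES|QL is expected to contain
}

export function esqlSubstringScore(scenario: RuleGenScenario, generatedEsql: string): number {
  const expected = scenario.esql_substrings ?? [];
  // Scenarios that define no substrings opt out of the check entirely,
  // which is why coverage depends on datasets populating this field.
  if (expected.length === 0) return 1;
  const haystack = generatedEsql.toLowerCase();
  const matched = expected.filter((s) => haystack.includes(s.toLowerCase())).length;
  return matched / expected.length; // proportional rather than pass/fail
}
```

Under this sketch, a scenario without `esql_substrings` always scores 1, which is exactly why the reviewer asks for the field to be populated across the dataset.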
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:

In `@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts`:

- Around lines 101-109: The test is sampling logs and computing KIs from the whole snapshot because sampleLogs is still using match_all; use the previously-resolved extractionScenario/scenario filter so the rule-generation context is limited to the scenario's failure slice. Update the call that builds the ES query for sampleLogs (and any match_all usage between sampleLogs and canonicalKIsFromExpectedGroundTruth) to include a filter on scenario.input.scenario_id (or extractionScenario.input.scenario_id) so sampleLogs and the KIs are sourced only from that scenario's slice.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 8bc2e091-2af7-48af-ac7a-bcdd69f4c71a
📒 Files selected for processing (30)

- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/README.md
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/index.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/otel_demo.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/datasets/types.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/feature_extraction/feature_extraction.spec.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_duplication/ki_duplication.spec.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_duplication/ki_duplication_datasets.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_extraction/collect_sample_documents.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/ki_extraction/ki_extraction.spec.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/query_generation/query_generation.spec.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/extract_log_text.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/get_computed_kis_from_docs.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/resolve_ki_sources.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/scripts/significant_events_snapshots/lib/gcs.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/scripts/significant_events_snapshots/lib/significant_events_workflow.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/canonical_features.test.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/canonical_kis.test.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/canonical_kis.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/replay.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/sigevents_kis_index.test.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/sigevents_kis_index.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_duplication_evaluators.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.test.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/query_generation_evaluators.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.test.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts
💤 Files with no reviewable changes (4)
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/canonical_features.test.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/query_generation/query_generation.spec.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/feature_extraction/feature_extraction.spec.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/query_generation_evaluators.ts
Resolved review threads:

- ...red/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/README.md (outdated)
- ...latform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts (outdated, two threads)
- .../platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.ts
🧹 Nitpick comments (2)
x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts (1)
Lines 91-96: Consider adding debug logging for cleanup failures.

The empty catch blocks silently swallow errors during cleanup. While this is acceptable for cleanup operations that may legitimately fail (e.g., non-existent streams), adding debug logging would help diagnose issues when cleanup unexpectedly fails.

♻️ Suggested improvement

```diff
 for (const name of ['logs.otel', 'logs.ecs']) {
-  await esClient.indices.deleteDataStream({ name }).catch(() => {});
+  await esClient.indices.deleteDataStream({ name }).catch((e) => {
+    log.debug(`Failed to delete data stream ${name}: ${e.message}`);
+  });
   await esClient.indices
     .delete({ index: name, ignore_unavailable: true })
-    .catch(() => {});
+    .catch((e) => {
+      log.debug(`Failed to delete index ${name}: ${e.message}`);
+    });
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts` around lines 91 - 96, The cleanup loop currently swallows errors silently in the empty catch blocks around esClient.indices.deleteDataStream and esClient.indices.delete; update those catches to log the caught error at debug (or trace) level instead of doing nothing: capture the error in each catch and call the test-suite logger (or console.debug if no logger is available) with a short context message referencing deleteDataStream/delete and the error object so unexpected failures during cleanup are visible without breaking tests.

x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts (1)
Lines 96-112: Consider adding a timeout for ES|QL query validation to prevent test hangs.

The `esClient.esql.query` call on line 102 lacks a timeout, which could cause test runs to hang on malformed or expensive queries. Per the Elasticsearch Node.js client API, you can add a `requestTimeout` in the options parameter (2nd argument).

♻️ Suggested improvement

```diff
 try {
-  const result = await esClient.esql.query({ query: esql });
+  const result = await esClient.esql.query({ query: esql }, { requestTimeout: 120000 });
   isSyntaxValid = true;
```

Tune the timeout value (120000ms shown above) based on expected query complexity and your CI environment—shorter timeouts may be suitable for simple queries, but complex ES|QL queries may need the longer window to execute.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts` around lines 96 - 112, The ES|QL validation call in the loop uses esClient.esql.query without a timeout which can hang tests; update the esClient.esql.query invocation (in the same block that sets isSyntaxValid/isExecutionHit and increments validSyntaxCount/executionHitCount) to pass a second options argument with a requestTimeout (e.g., 120000) so slow or stuck queries fail fast, and keep the existing try/catch to log the error as before.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 23b95ab7-bd08-493c-978d-e695529b062d
📒 Files selected for processing (7)

- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/README.md
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/resolve_ki_sources.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/evals/significant_events/rule_generation/rule_generation.spec.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/evidence_text_matching.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/ki_extraction_evaluators.ts
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/evaluators/rule_generation_evaluators.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- x-pack/platform/packages/shared/kbn-evals-suite-streams/src/data_generators/load_kis_from_snapshot.test.ts
💛 Build succeeded, but was flaky

Failed CI Steps | Test Failures
Metrics [docs]
History
cc @viduni94
```
  DatasetConfig,
  QueryGenerationScenario,
  FeatureExtractionScenario,
  RuleGenerationScenario,
```
Is this really about rule generation? Isn't this KI generation of type query?
Proposed name:
- KIQueryGenerationScenario
```
  QueryGenerationScenario,
  FeatureExtractionScenario,
  RuleGenerationScenario,
  KIExtractionScenario,
```
This is only the scenario for KIFeatureExtraction not all KI is my assumption?
Closes https://github.com/elastic/streams-program/issues/950
Why
The significant-events flow now uses Knowledge Indicators (KIs) and rules as its core concepts. This update aligns the evaluation suite with that terminology, reduces ambiguity around “features” and “queries”, and makes the deterministic checks stricter and more informative.
Summary
- Renames the `query_generation` suite to `rule_generation` and `feature_extraction` to `KI_extraction`.
- Tightens KI extraction evaluation (formerly feature extraction) by tightening evidence grounding so short evidence snippets do not pass via overly permissive substring matches, adding support for array-index field paths in `field.path=value` evidence, and changing KI count scoring from binary pass/fail to proportional scoring so near misses are penalized more gradually.
- Improves rule generation evaluation (formerly query generation) by incorporating category and severity compliance into the CODE evaluator score, validating scenario-defined `expected_categories` and `esql_substrings`, and making evidence checks less permissive so the evaluator better reflects whether generated rules are grounded in the provided log context.

Checklist
- `release_note:*` label is applied per the guidelines
- `backport:*` labels
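The proportional KI-count scoring described in the summary could be sketched like this; the exact formula the evaluator uses is an assumption for illustration:

```typescript
// Sketch of proportional KI-count scoring; the real formula may differ.
export function kiCountScore(actual: number, expected: number): number {
  if (expected === 0) return actual === 0 ? 1 : 0;
  // Score falls off with relative distance from the expected count,
  // so a near miss is penalized gradually instead of failing outright.
  const relativeError = Math.abs(actual - expected) / expected;
  return Math.max(0, 1 - relativeError);
}
```

Compared with the old binary check, extracting 4 of 5 expected KIs scores close to 1 here rather than 0, which is the "near misses are penalized more gradually" behavior the summary describes.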
Documentation
New Features
Tests
Chores