
Commit 6378ff3

[8.x] [Auto Import] CSV format support (#194386) (#196090)
# Backport

This will backport the following commits from `main` to `8.x`:

- [[Auto Import] CSV format support (#194386)](#194386)

<!--- Backport version: 9.4.3 -->

### Questions?

Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

## Release Notes

Automatic Import can now create integrations for logs in CSV format. Given the maturity of log format support, we also remove the wording about requiring the JSON/NDJSON format.

## Summary

**Added: the CSV feature**

The issue is #194342.

When the user adds a log sample whose format the LLM recognizes as CSV, we now parse the samples and insert the [csv](https://www.elastic.co/guide/en/elasticsearch/reference/current/csv-processor.html) processor into the generated pipeline.

If a header is present, we use it for the field names and add a [drop](https://www.elastic.co/guide/en/elasticsearch/reference/current/drop-processor.html) processor that removes the header from the document stream by comparing its values to the header values.

If the header is missing, we ask the LLM to generate a list of column names, providing some context such as the package and data stream titles.

Should the header or LLM suggestion prove unsuitable for a specific column, we fall back to `column1`, `column2`, and so on. To avoid duplicate column names, we add postfixes like `_2` as necessary.

If the format appears to be CSV but the `csv` processor fails, we bubble up an error using the recently introduced `ErrorThatHandlesItsOwnResponse` class. We also provide the first example of passing the additional attributes of an error (in this case, the original CSV error) back to the client. The error message is composed on the client side.

**Removed: supported formats message**

The message that asks the user to upload logs in `JSON/NDJSON format` is removed in this PR:

<img width="741" alt="image" src="https://github.com/user-attachments/assets/34d571c3-b12c-44a1-98e3-d7549160be12">

**Refactoring**

The refactoring makes the "→JSON" conversion process more uniform across the different chains and centralizes processor definitions in `.../server/util/processors.ts`.

The log format chain now expects the LLM to follow the `SamplesFormat` when providing the information, rather than an ad-hoc format.

When testing, the `fail` method is [not supported in `jest`](https://stackoverflow.com/a/54244479/23968144), so it is removed.

See the PR for examples and follow-up.

---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Ilya Nikokoshev <ilya.nikokoshev@elastic.co>
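The column-naming fallback described above (generic `column1`, `column2` names plus `_2`-style postfixes for duplicates) can be sketched as follows. This is an illustrative reimplementation, not the code from the PR; the function name and the identifier-style validity check are assumptions:

```typescript
// Illustrative sketch of the fallback column-naming rules described in the
// commit message. Not the actual PR implementation; names are hypothetical.
function toSafeColumnNames(candidates: Array<string | undefined>): string[] {
  const used = new Set<string>();
  return candidates.map((candidate, i) => {
    // Fall back to a generic name when the header/LLM suggestion is unusable
    // (here approximated as "not a plain identifier").
    const base =
      candidate && /^[A-Za-z_][A-Za-z0-9_]*$/.test(candidate)
        ? candidate
        : `column${i + 1}`;
    // Disambiguate duplicates with numeric postfixes like `_2`, `_3`, ...
    let name = base;
    for (let n = 2; used.has(name); n++) {
      name = `${base}_${n}`;
    }
    used.add(name);
    return name;
  });
}
```

For example, `toSafeColumnNames(['ip', 'ip', '1bad', undefined])` would yield `['ip', 'ip_2', 'column3', 'column4']`.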
1 parent 7a80e6f commit 6378ff3

47 files changed

Lines changed: 853 additions & 132 deletions

Large commits have some content hidden by default.

x-pack/plugins/integration_assistant/__jest__/fixtures/log_type_detection.ts

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,8 @@ export const logFormatDetectionTestState = {
   exAnswer: 'testanswer',
   packageName: 'testPackage',
   dataStreamName: 'testDatastream',
+  packageTitle: 'Test Title',
+  dataStreamTitle: 'Test Datastream Title',
   finalized: false,
   samplesFormat: { name: SamplesFormatName.Values.structured },
   header: true,

x-pack/plugins/integration_assistant/common/api/analyze_logs/analyze_logs_route.gen.ts

Lines changed: 4 additions & 0 deletions
@@ -19,6 +19,8 @@ import { z } from '@kbn/zod';
 import {
   PackageName,
   DataStreamName,
+  PackageTitle,
+  DataStreamTitle,
   LogSamples,
   Connector,
   LangSmithOptions,
@@ -29,6 +31,8 @@ export type AnalyzeLogsRequestBody = z.infer<typeof AnalyzeLogsRequestBody>;
 export const AnalyzeLogsRequestBody = z.object({
   packageName: PackageName,
   dataStreamName: DataStreamName,
+  packageTitle: PackageTitle,
+  dataStreamTitle: DataStreamTitle,
   logSamples: LogSamples,
   connectorId: Connector,
   langSmithOptions: LangSmithOptions.optional(),

x-pack/plugins/integration_assistant/common/api/analyze_logs/analyze_logs_route.schema.yaml

Lines changed: 6 additions & 0 deletions
@@ -22,11 +22,17 @@ paths:
         - connectorId
         - packageName
         - dataStreamName
+        - packageTitle
+        - dataStreamTitle
       properties:
         packageName:
           $ref: "../model/common_attributes.schema.yaml#/components/schemas/PackageName"
         dataStreamName:
           $ref: "../model/common_attributes.schema.yaml#/components/schemas/DataStreamName"
+        packageTitle:
+          $ref: "../model/common_attributes.schema.yaml#/components/schemas/PackageTitle"
+        dataStreamTitle:
+          $ref: "../model/common_attributes.schema.yaml#/components/schemas/DataStreamTitle"
         logSamples:
           $ref: "../model/common_attributes.schema.yaml#/components/schemas/LogSamples"
         connectorId:
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+/*
+ * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
+ * or more contributor license agreements. Licensed under the Elastic License
+ * 2.0; you may not use this file except in compliance with the Elastic License
+ * 2.0.
+ */
+
+import type { GenerationErrorCode } from '../constants';
+
+// Errors raised by the generation process should provide information through this interface.
+export interface GenerationErrorBody {
+  message: string;
+  attributes: GenerationErrorAttributes;
+}
+
+export function isGenerationErrorBody(obj: unknown | undefined): obj is GenerationErrorBody {
+  return (
+    typeof obj === 'object' &&
+    obj !== null &&
+    'message' in obj &&
+    typeof obj.message === 'string' &&
+    'attributes' in obj &&
+    obj.attributes !== undefined &&
+    isGenerationErrorAttributes(obj.attributes)
+  );
+}
+
+export interface GenerationErrorAttributes {
+  errorCode: GenerationErrorCode;
+  underlyingMessages: string[] | undefined;
+}
+
+export function isGenerationErrorAttributes(obj: unknown): obj is GenerationErrorAttributes {
+  return (
+    typeof obj === 'object' &&
+    obj !== null &&
+    'errorCode' in obj &&
+    typeof obj.errorCode === 'string' &&
+    (!('underlyingMessages' in obj) || Array.isArray(obj.underlyingMessages))
+  );
+}
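The new type guards are meant to let the client safely inspect an error payload of unknown shape before reading its attributes. A self-contained usage sketch, with a simplified guard inlined and hypothetical payload values, might look like:

```typescript
// Self-contained sketch of using a GenerationErrorBody-style type guard on
// the client side. Simplified from the diff above; payload values are
// hypothetical.
interface GenerationErrorBody {
  message: string;
  attributes: { errorCode: string; underlyingMessages: string[] | undefined };
}

function isGenerationErrorBody(obj: unknown): obj is GenerationErrorBody {
  return (
    typeof obj === 'object' &&
    obj !== null &&
    'message' in obj &&
    typeof (obj as { message: unknown }).message === 'string' &&
    'attributes' in obj &&
    typeof (obj as { attributes: unknown }).attributes === 'object' &&
    (obj as { attributes: unknown }).attributes !== null &&
    typeof (obj as { attributes: { errorCode: unknown } }).attributes.errorCode === 'string'
  );
}

function describeError(body: unknown): string {
  if (isGenerationErrorBody(body)) {
    // The guard narrows `body`, so the attributes are safe to read here.
    return `${body.message} (${body.attributes.errorCode})`;
  }
  return 'Unknown error';
}
```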

x-pack/plugins/integration_assistant/common/api/model/api_test.mock.ts

Lines changed: 2 additions & 0 deletions
@@ -96,6 +96,8 @@ export const getRelatedRequestMock = (): RelatedRequestBody => ({
 export const getAnalyzeLogsRequestBody = (): AnalyzeLogsRequestBody => ({
   dataStreamName: 'test-data-stream-name',
   packageName: 'test-package-name',
+  packageTitle: 'Test package title',
+  dataStreamTitle: 'Test data stream title',
   connectorId: 'test-connector-id',
   logSamples: rawSamples,
 });

x-pack/plugins/integration_assistant/common/api/model/common_attributes.gen.ts

Lines changed: 20 additions & 0 deletions
@@ -31,6 +31,18 @@ export const PackageName = z.string().min(1);
 export type DataStreamName = z.infer<typeof DataStreamName>;
 export const DataStreamName = z.string().min(1);
 
+/**
+ * Package title for the integration to be built.
+ */
+export type PackageTitle = z.infer<typeof PackageTitle>;
+export const PackageTitle = z.string().min(1);
+
+/**
+ * DataStream title for the integration to be built.
+ */
+export type DataStreamTitle = z.infer<typeof DataStreamTitle>;
+export const DataStreamTitle = z.string().min(1);
+
 /**
  * String form of the input logsamples.
  */
@@ -86,6 +98,14 @@ export const SamplesFormat = z.object({
    * For some formats, specifies whether the samples can be multiline.
    */
   multiline: z.boolean().optional(),
+  /**
+   * For CSV format, specifies whether the samples have a header row. For other formats, specifies the presence of header in each row.
+   */
+  header: z.boolean().optional(),
+  /**
+   * For CSV format, specifies the column names proposed by the LLM.
+   */
+  columns: z.array(z.string()).optional(),
   /**
    * For a JSON format, describes how to get to the sample array from the root of the JSON.
    */
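To illustrate how the new optional fields are used, here is a plain-TypeScript mirror of the extended shape (the real schema is the zod object above) with two hypothetical CSV sample descriptions:

```typescript
// Plain-TypeScript mirror of the new SamplesFormat fields; the real schema is
// the zod object in the diff above, and these sample values are hypothetical.
interface SamplesFormatSketch {
  name: string;
  multiline?: boolean;
  header?: boolean; // CSV: whether the samples include a header row
  columns?: string[]; // CSV: column names proposed by the LLM
}

// Headerless CSV: the LLM proposes the column names.
const headerlessCsv: SamplesFormatSketch = {
  name: 'csv',
  header: false,
  columns: ['timestamp', 'source_ip', 'event_action'],
};

// CSV with a header row: field names come from the header, so no `columns`.
const csvWithHeader: SamplesFormatSketch = { name: 'csv', header: true };
```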

x-pack/plugins/integration_assistant/common/api/model/common_attributes.schema.yaml

Lines changed: 18 additions & 0 deletions
@@ -16,6 +16,16 @@ components:
       minLength: 1
       description: DataStream name for the integration to be built.
 
+    PackageTitle:
+      type: string
+      minLength: 1
+      description: Package title for the integration to be built.
+
+    DataStreamTitle:
+      type: string
+      minLength: 1
+      description: DataStream title for the integration to be built.
+
     LogSamples:
       type: array
       items:
@@ -66,6 +76,14 @@ components:
     multiline:
       type: boolean
       description: For some formats, specifies whether the samples can be multiline.
+    header:
+      type: boolean
+      description: For CSV format, specifies whether the samples have a header row. For other formats, specifies the presence of header in each row.
+    columns:
+      type: array
+      description: For CSV format, specifies the column names proposed by the LLM.
+      items:
+        type: string
     json_path:
       type: array
       description: For a JSON format, describes how to get to the sample array from the root of the JSON.

x-pack/plugins/integration_assistant/common/constants.ts

Lines changed: 2 additions & 1 deletion
@@ -30,8 +30,9 @@ export const MINIMUM_LICENSE_TYPE: LicenseType = 'enterprise';
 
 // ErrorCodes
 
-export enum ErrorCode {
+export enum GenerationErrorCode {
   RECURSION_LIMIT = 'recursion-limit',
   RECURSION_LIMIT_ANALYZE_LOGS = 'recursion-limit-analyze-logs',
   UNSUPPORTED_LOG_SAMPLES_FORMAT = 'unsupported-log-samples-format',
+  UNPARSEABLE_CSV_DATA = 'unparseable-csv-data',
 }
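Since the error message is composed on the client side (per the commit message), the renamed `GenerationErrorCode` enum presumably feeds a mapping from code to user-facing text. A hedged sketch follows; the enum values are taken from the diff above, but the message strings and function name are hypothetical, not Kibana's actual copy:

```typescript
// Sketch of client-side message composition keyed on GenerationErrorCode.
// Enum values match the diff above; message wording is hypothetical.
enum GenerationErrorCode {
  RECURSION_LIMIT = 'recursion-limit',
  RECURSION_LIMIT_ANALYZE_LOGS = 'recursion-limit-analyze-logs',
  UNSUPPORTED_LOG_SAMPLES_FORMAT = 'unsupported-log-samples-format',
  UNPARSEABLE_CSV_DATA = 'unparseable-csv-data',
}

function composeErrorMessage(
  code: GenerationErrorCode,
  underlyingMessages?: string[]
): string {
  switch (code) {
    case GenerationErrorCode.UNPARSEABLE_CSV_DATA:
      // Surface the original CSV processor error carried in the attributes.
      return ['Could not parse the CSV samples.', ...(underlyingMessages ?? [])].join(' ');
    case GenerationErrorCode.UNSUPPORTED_LOG_SAMPLES_FORMAT:
      return 'The log samples are in an unsupported format.';
    default:
      return 'Generation failed; please try again.';
  }
}
```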

x-pack/plugins/integration_assistant/common/index.ts

Lines changed: 1 addition & 2 deletions
@@ -27,10 +27,9 @@ export type {
   Integration,
   Pipeline,
   Docs,
-  SamplesFormat,
   LangSmithOptions,
 } from './api/model/common_attributes.gen';
-export { SamplesFormatName } from './api/model/common_attributes.gen';
+export { SamplesFormat, SamplesFormatName } from './api/model/common_attributes.gen';
 export type { ESProcessorItem } from './api/model/processor_attributes.gen';
 export type { CelInput } from './api/model/cel_input_attributes.gen';

x-pack/plugins/integration_assistant/public/components/create_integration/create_integration_assistant/steps/data_stream_step/generation_modal.test.tsx

Lines changed: 2 additions & 0 deletions
@@ -105,6 +105,8 @@ describe('GenerationModal', () => {
     it('should call runAnalyzeLogsGraph with correct parameters', () => {
       expect(mockRunAnalyzeLogsGraph).toHaveBeenCalledWith({
         ...defaultRequest,
+        packageTitle: 'Mocked Integration title',
+        dataStreamTitle: 'Mocked Data Stream Title',
         logSamples: integrationSettingsNonJSON.logSamples ?? [],
       });
     });
