Add support for Kibana knowledge base entry assets#807
Add support for Kibana knowledge base entry assets#807jsoriano merged 13 commits intoelastic:mainfrom
Conversation
pgayvallet
left a comment
There was a problem hiding this comment.
self-review explaining my decisions and the reasoning
spec/content/kibana/spec.yml
Outdated
| - description: Folder containing Kibana knowledge base entries | ||
| type: folder | ||
| name: kb_entry | ||
| required: false | ||
| $ref: "../../integration/kibana/kb_entry/spec.yml" |
There was a problem hiding this comment.
So, I decided to make knowledge base entries (first-class) citizens of the kibana spec instead of having them live in their own, dedicated root folder.
The reasoning is that:
- KB entries will be "managed" by Kibana
- KB entries will only be used via Kibana (at least for now)
So I felt like it made more sense to have it done that way.
There was a problem hiding this comment.
I was thinking that this was just data, and it could maybe be reused for other features, but your reasoning makes sense. If it is managed and used only by Kibana, let's keep it under kibana.
at least for now
Do you foresee other uses of this data?
There was a problem hiding this comment.
I was thinking that this was just data, and it could maybe be reused for other features
Yeah I was initially thinking that too, but the KB entry installation will do more than just creating the index and indexing the docs - it will also create an entry in the knowledge base registry to register the source, so I felt like it would be potentially complicated to convert that to some abstract generic concept...
Do you foresee other uses of this data?
We might want at some point to expose this data more directly to the end user, and make is less of an implementation detail. But we don't have any use case for this right now, so I think we should just ignore it for now and eventually revisit later.
| - description: A knowledge base entry's index mapping | ||
| type: file | ||
| contentMediaType: "application/json" | ||
| sizeLimit: 5MB | ||
| name: "index-mapping.json" | ||
| required: true |
There was a problem hiding this comment.
I decided to KISS and keep the index's mapping as a plain json file, first because I don't think adding validation on the index's mapping format or having it as a yaml file would bring much value (AFAIK we're not doing it for index templates, mappings are just defined as object type with allowed extra props), and I also did not use the fields.yml feature because we don't need, or even want, any kind of discoverability on this type of content.
Please tell me if you think this doesn't make sense.
There was a problem hiding this comment.
I understand that having to convert to the fields.yml format can be some additional effort, but it has some advantages.
For example, Fleet already has the code to generate mappings from this format, and over that it can more easily add opinionated settings or mappings, for example settings that could be defined by users, or fields known to be everywhere.
Also, elastic-package has code to perform validation of documents based on the mappings defined in the package. This could be leveraged for example to check that the provided content comply with the defined mappings.
As you mention, in principle these features are not wanted at least by now. But just in case and if it is only for consistency with other features, I would prefer to use the fields.yml format.
There was a problem hiding this comment.
So another part of the reasoning was that each knowledge base entry can, technically, have different mappings. I'm not sure how that would play with that fields.yml feature, given that from my understanding, this is only a top level section in the package?
Or should we use the same format but with a different pattern, like having one fields.yml file per knowledge base entry folder (that would then replace the index-mappings.json file)?
There was a problem hiding this comment.
We reuse the fields definitions on different places. The more direct way to add it here would be to have a fields directory here, something like this:
- description: Folder containing field definitions to be used as the mappings for the index template
type: folder
name: fields
required: false
$ref: "../../data_stream/fields/spec.yml"
| - description: A knowledge base entry's content file | ||
| type: file | ||
| contentMediaType: "application/json" | ||
| pattern: '^content(\.[0-9]+)?\.json$' |
There was a problem hiding this comment.
As suggested by @jsoriano, we're allowing an arbitrary number of file following the fixed content.json and
dynamic content.{num}.json naming format. Each file have a single documents property containing the source of the documents we will be indexing into the KB's index.
I did not add any validation on the file's content format, both because I wasn't sure we can add validation on JSON documents, and also because we wouldn't have much to validation (only thing we could do it make sure the only top level property present is documents and that it's an array).
But if there is a way to perform validation on JSON docs with the spec validator, I can absolutely add it.
There was a problem hiding this comment.
As suggested by @jsoriano, we're allowing an arbitrary number of file following the fixed content.json and
dynamic content.{num}.json naming format. Each file have a single documents property containing the source of the documents we will be indexing into the KB's index.
👍
In other cases where we allow multiple files we allow to use letters too, so it can give meaningful names when this is used for organizational purposes, and we use to separate with hyphens.
| - description: A knowledge base entry's content file | |
| type: file | |
| contentMediaType: "application/json" | |
| pattern: '^content(\.[0-9]+)?\.json$' | |
| - description: A knowledge base entry's content file | |
| type: file | |
| contentMediaType: "application/json" | |
| pattern: '^content(-[a-z0-9]+)?\.json$' |
I did not add any validation on the file's content format, both because I wasn't sure we can add validation on JSON documents, and also because we wouldn't have much to validation (only thing we could do it make sure the only top level property present is documents and that it's an array).
But if there is a way to perform validation on JSON docs with the spec validator, I can absolutely add it.
In principle we can add validation to JSON documents if we have a schema, but not sure if it is worth for this case.
The spec checks that the JSON is valid for files with contentMediaType: "application/json".
There was a problem hiding this comment.
In other cases where we allow multiple files we allow to use letters too, so it can give meaningful names when this is used for organizational purposes, and we use to separate with hyphens.
Sounds good, I will follow the pattern for consistency. I'll just wait for the other question about the format (json vs ndjson) to be addressed to perform both changes at the same time if required.
| name: foo | ||
| description: > | ||
| Knowledge base entry containing all the foo-related documentation | ||
| index: | ||
| name: .kibana_kb_foo | ||
| system: true | ||
| retrieval: | ||
| syntactic_fields: | ||
| - content_title | ||
| semantic_fields: | ||
| - content_body | ||
| - ai_subtitle | ||
| - ai_summary | ||
| - ai_questions_answered |
There was a problem hiding this comment.
Example file of the format specified in spec/integration/kibana/kb_entry/manifest.spec.yml.
Most of this manifest is meta that will be used by the KB implementation:
-
nameanddescriptionare meta that will be surfaced in the KB management page.namewill be used at the identifier. -
retrieval.syntactic_fieldsandretrieval.semantic_fieldsare the list of fields that will be used for retrieving documents from this KB entry. Later we could find way to be smarter and more automated about field detection, but I wanted to start with something both easy and concrete, so the KB entry contains those meta. -
index.nameis self explanatory - the name of the index that will be created with the specified mappings from theindex-mapping.jsonfile -
index.systemis just some future-proofing by exposing the info about the index being Kibana-managed or not (probably don't need more details on this specific PR)
There was a problem hiding this comment.
- retrieval.syntactic_fields and retrieval.semantic_fields are the list of fields that will be used for retrieving documents from this KB entry. Later we could find way to be smarter and more automated about field detection, but I wanted to start with something both easy and concrete, so the KB entry contains those meta.
How are these going to be different between different knowledge bases?
index.nameis self explanatory - the name of the index that will be created with the specified mappings from theindex-mapping.jsonfile
The index name, or anything that could conflict with other resources, should be managed by Fleet. It could be built from the package and entry names.
There was a problem hiding this comment.
How are these going to be different between different knowledge bases?
By having those defined in the manifest, and then registered during installation. The search/retrieval code will then forge its request to target the specified fields.
The index name, or anything that could conflict with other resources, should be managed by Fleet
I guess that makes sense
jsoriano
left a comment
There was a problem hiding this comment.
Thanks for the PR, it is helpful to see the purpose of the new feature.
Overall it looks good to me, I have added some small suggestions.
My main question at the moment would be about how/if different knowledge bases differ on their mappings. If we expect all the knowledge bases to have the same mappings, I think that the mappings should be provided by Fleet, and not by the packages. Having them in the packages can be error prone. If this mapping is only expected for knowledge bases containing documentation, I think that we could still add some kind of "type" field, that tells Fleet what mappings to install. And we could also have a mix, where Fleet provides a base set of common mappings, and additional custom mappings can still be provided by packages.
spec/content/kibana/spec.yml
Outdated
| - description: Folder containing Kibana knowledge base entries | ||
| type: folder | ||
| name: kb_entry | ||
| required: false | ||
| $ref: "../../integration/kibana/kb_entry/spec.yml" |
There was a problem hiding this comment.
I was thinking that this was just data, and it could maybe be reused for other features, but your reasoning makes sense. If it is managed and used only by Kibana, let's keep it under kibana.
at least for now
Do you foresee other uses of this data?
spec/content/kibana/spec.yml
Outdated
| - '^.+-(ecs|ECS)\.json$' # ECS suffix is forbidden | ||
| - description: Folder containing Kibana knowledge base entries | ||
| type: folder | ||
| name: kb_entry |
There was a problem hiding this comment.
No need to abbreviate, kb may be actually confusing, it could be "kibana" around this context. Would it make sense to call this directly "knowledge_base"?
| name: kb_entry | |
| name: knowledge_base |
Or I would also prefer "knowledge_base_entry" rather than its abbreviated form.
| name: kb_entry | |
| name: knowledge_base_entry |
There was a problem hiding this comment.
That's totally fair. knowledge_base_entry sounds good, I will adapt accordingly
| - description: A knowledge base entry's index mapping | ||
| type: file | ||
| contentMediaType: "application/json" | ||
| sizeLimit: 5MB | ||
| name: "index-mapping.json" | ||
| required: true |
There was a problem hiding this comment.
I understand that having to convert to the fields.yml format can be some additional effort, but it has some advantages.
For example, Fleet already has the code to generate mappings from this format, and over that it can more easily add opinionated settings or mappings, for example settings that could be defined by users, or fields known to be everywhere.
Also, elastic-package has code to perform validation of documents based on the mappings defined in the package. This could be leveraged for example to check that the provided content comply with the defined mappings.
As you mention, in principle these features are not wanted at least by now. But just in case and if it is only for consistency with other features, I would prefer to use the fields.yml format.
| - description: A knowledge base entry's content file | ||
| type: file | ||
| contentMediaType: "application/json" | ||
| pattern: '^content(\.[0-9]+)?\.json$' |
There was a problem hiding this comment.
As suggested by @jsoriano, we're allowing an arbitrary number of file following the fixed content.json and
dynamic content.{num}.json naming format. Each file have a single documents property containing the source of the documents we will be indexing into the KB's index.
👍
In other cases where we allow multiple files we allow to use letters too, so it can give meaningful names when this is used for organizational purposes, and we use to separate with hyphens.
| - description: A knowledge base entry's content file | |
| type: file | |
| contentMediaType: "application/json" | |
| pattern: '^content(\.[0-9]+)?\.json$' | |
| - description: A knowledge base entry's content file | |
| type: file | |
| contentMediaType: "application/json" | |
| pattern: '^content(-[a-z0-9]+)?\.json$' |
I did not add any validation on the file's content format, both because I wasn't sure we can add validation on JSON documents, and also because we wouldn't have much to validation (only thing we could do it make sure the only top level property present is documents and that it's an array).
But if there is a way to perform validation on JSON docs with the spec validator, I can absolutely add it.
In principle we can add validation to JSON documents if we have a schema, but not sure if it is worth for this case.
The spec checks that the JSON is valid for files with contentMediaType: "application/json".
| { | ||
| "documents": [ |
There was a problem hiding this comment.
I wonder if this file could be just a ndjson file.
Is the documents entry needed? Do we expect other entries in this file?
How are these files generated? Do we expect manual editions on these files?
There was a problem hiding this comment.
I'm being (probably overly) cautious with NDJSON, as we've learned the hard way with Kibana import/export that making the format evolve can be a pain (very difficult to add global metadata in addition to the entries). But for Kibana import/export, we're only generating a single file, we don't have a full folder structure like we do there, so the limitations of the format were more impactful.
So no, I don't foresee us needing to add anything to those files, so if you prefer using ndjson, I'm fine with it
There was a problem hiding this comment.
Not a strong opinion, but I think I'd prefer using ndjson here, yes. If there is some case were we need to add metadata in the future we can use the manifest file, or some other file.
| { | ||
| "dynamic": "strict", |
There was a problem hiding this comment.
Is this setting going to be the same in all knowledge bases?
There was a problem hiding this comment.
Not necessarily, I mean given the mappings are defined per KB entry, I assume we would be fine letting this be the source of truth for the dynamicallocation too (given it's technically a part of the mapping, not a setting).
But having this "hardcoded" and non-configurable by the package would be fine too ihmo, if we want to switch to the fields.yml approach
There was a problem hiding this comment.
But having this "hardcoded" and non-configurable by the package would be fine too ihmo, if we want to switch to the
fields.ymlapproach
Yes, if we use the fields.yml approach this could be a setting managed by Fleet, hardcoded by now to strict. If needed for some future case we could add a setting to the knowledge base manifest to control the value.
test/packages/knowledge_base/kibana/kb_entry/foo/index-mapping.json
Outdated
Show resolved
Hide resolved
test/packages/knowledge_base/kibana/kb_entry/foo/index-mapping.json
Outdated
Show resolved
Hide resolved
| name: foo | ||
| description: > | ||
| Knowledge base entry containing all the foo-related documentation | ||
| index: | ||
| name: .kibana_kb_foo | ||
| system: true | ||
| retrieval: | ||
| syntactic_fields: | ||
| - content_title | ||
| semantic_fields: | ||
| - content_body | ||
| - ai_subtitle | ||
| - ai_summary | ||
| - ai_questions_answered |
There was a problem hiding this comment.
- retrieval.syntactic_fields and retrieval.semantic_fields are the list of fields that will be used for retrieving documents from this KB entry. Later we could find way to be smarter and more automated about field detection, but I wanted to start with something both easy and concrete, so the KB entry contains those meta.
How are these going to be different between different knowledge bases?
index.nameis self explanatory - the name of the index that will be created with the specified mappings from theindex-mapping.jsonfile
The index name, or anything that could conflict with other resources, should be managed by Fleet. It could be built from the package and entry names.
jsoriano
left a comment
There was a problem hiding this comment.
Looks good, the main change I would like to see in current code is limiting the changes to content packages and don't do changes to integration packages.
spec/integration/kibana/spec.yml
Outdated
| - description: Folder containing Kibana knowledge base entries | ||
| type: folder | ||
| name: knowledge_base_entry | ||
| required: false | ||
| $ref: "./knowledge_base_entry/spec.yml" |
There was a problem hiding this comment.
This should be added only to content packages by now.
| name: | ||
| description: The name of the knowledge base entry | ||
| type: string |
There was a problem hiding this comment.
Is name needed? Fleet could use the package and knowledge base directory names to build a unique one.
There was a problem hiding this comment.
I was planning on autogenerating an "id" based on the package name and KB directory, but the manifest's name is more of a meta data field for a user-readable "name", which will be displayed in the KB registry's UI. Does that make sense?
There was a problem hiding this comment.
Ah ok, makes sense. Maybe then we could call it label or title? But as you prefer, name in that context sounds also good.
There was a problem hiding this comment.
I think you're right, title is probably more explicit and avoid mixing the concepts. Will update
| @@ -0,0 +1,33 @@ | |||
| spec: | |||
There was a problem hiding this comment.
Please move this file under spec/content by now.
| @@ -0,0 +1,11 @@ | |||
| title: foo | |||
There was a problem hiding this comment.
Thinking about the issue with the assumptions in the kibana directory, and if we have to move the knowledge bases out of it, maybe we can revisit the idea of making this part of a more generic feature for data.
It could be done by:
- Moving knowledge base from
./kibana/knowledge_base_entry/footo./index_data/foo. - Adding a
typefield in the manifest indicating what this data is about.
| title: foo | |
| title: foo | |
| type: knowledge_base |
Thinking on other features that might provide data in the future, we would probably follow the same structure in all cases: the actual content, the field mappings, and some manifest with metadata or additional options. The type would instruct Fleet about different steps done during installation for each kind of data.
| # Conditional properties. | ||
| title: true | ||
| index: true | ||
| retrieval: true | ||
| allOf: | ||
| - if: | ||
| properties: | ||
| type: | ||
| const: knowledge_base_entry | ||
| required: | ||
| - type | ||
| then: | ||
| properties: |
There was a problem hiding this comment.
This is the part I'm the less sure about. Are we fine with all subtype's fields "leaking" at top level, or would we rather have dedicated groups per subtype, e.g.
knowledge_base_entry:
type: object
properties:
[...]
There was a problem hiding this comment.
We can revisit when we add a second type, it LGTM by now.
💚 Build Succeeded
History
|
jsoriano
left a comment
There was a problem hiding this comment.
Thanks!
LGTM, pinging Mario in case he sees any last minute problem with this approach.
| # Conditional properties. | ||
| title: true | ||
| index: true | ||
| retrieval: true | ||
| allOf: | ||
| - if: | ||
| properties: | ||
| type: | ||
| const: knowledge_base_entry | ||
| required: | ||
| - type | ||
| then: | ||
| properties: |
There was a problem hiding this comment.
We can revisit when we add a second type, it LGTM by now.
| type: folder | ||
| name: index_data | ||
| required: false | ||
| $ref: "./index_data/spec.yml" |
There was a problem hiding this comment.
@mrodm we are adding a new "index_data" directory at the root level, for collections of documents that should be installed on package installation. By now the only kind of collection we support would be "knowledge bases" for AI assistants. Other collections we could add in the future could be sample data for example.
Do you foresee any problem with this approach or this new directory at the root level?
There was a problem hiding this comment.
As there is no change in the package manifest, I'd say it would not be needed to add any new field into the index for the Elastic Package Registry. If that is right, it would be ok as it is.
Tested also to add another dummy type under index_data folder. Looks that is possible (see other comment).
There was a problem hiding this comment.
Yeah, we would only need a change in the registry if this were going to be used for discovery of packages. Let's see how the support in Kibana progresses.
mrodm
left a comment
There was a problem hiding this comment.
LGTM!
As a note related to this, the current issue in Kibana elastic/kibana#192484 will not considerate this new folder or subfolders (i.e. fields) when addign support for content packages @jsoriano @pgayvallet
| @@ -0,0 +1,69 @@ | |||
| ## | |||
There was a problem hiding this comment.
Tested adding another type of index_data:
##
## Describes the specification for a Kibana knowledge base entry's manifest.yml file
##
spec:
type: object
additionalProperties: false
properties:
type:
type: string
description: >
Type of the index data.
enum:
- knowledge_base_entry
- foo
description:
description: A description of what the index data provides
type: string
# Conditional properties.
title: true
index: true
retrieval: true
some: true
allOf:
- if:
properties:
type:
const: knowledge_base_entry
required:
- type
then:
properties:
title:
description: The title of the knowledge base entry as used by the knowledge base
type: string
index:
type: object
additionalProperties: false
properties:
system:
description: Specify whether the index should be system-managed or not
type: boolean
required:
- system
retrieval:
type: object
additionalProperties: false
properties:
syntactic_fields:
description: List of fields that should be used for syntactic search during retrieval.
type: array
items:
type: string
semantic_fields:
description: List of fields that should be used for semantic search during retrieval.
type: array
items:
type: string
required:
- syntactic_fields
- semantic_fields
required:
- index
- retrieval
else:
not:
required:
- some
- index
- retrieval
- if:
properties:
type:
const: foo
required:
- type
then:
properties:
title:
description: The title of the knowledge base entry as used by the knowledge base
type: string
some:
description: Some variable
type: string
required:
- title
- some
else:
not:
required:
- some
- index
- retrieval
required:
- typeWhat I checked is that it requires:
- to define as true all the properties at the top where the "Conditional properties" comment is.
- to define as not required all those conditional properties in the
elseclause.
| type: folder | ||
| name: index_data | ||
| required: false | ||
| $ref: "./index_data/spec.yml" |
There was a problem hiding this comment.
As there is no change in the package manifest, I'd say it would not be needed to add any new field into the index for the Elastic Package Registry. If that is right, it would be ok as it is.
Tested also to add another dummy type under index_data folder. Looks that is possible (see other comment).
No problem, Pierre already created an issue and started a PR for this: |
…)" This reverts commit 96abc3c.
What does this PR do?
Add support for Kibana knowledge base entries, as discussed in #693
Knowledge base entries are a new Kibana type of content, living under
kibana/knowledge_base_entry/*.A KB entry is composed of 3 parts:
manifest.ymlfilefieldsfolder, containing one or more fields filecontentfolder, containing one or more content file, containing the ES documents of the entry in an NDJSON format.Please refer to the changes and the self-review for more details.
Why is it important?
As it is required for elastic/kibana#192031 via elastic/kibana#193849
Checklist
test/packagesthat prove my change is effective.spec/changelog.yml.Related issues