Skip to content

Testing: Add config option to accept numeric keywords#193

Merged
adriansr merged 10 commits intoelastic:masterfrom
adriansr:accept_numeric_keyword
Dec 11, 2020
Merged

Testing: Add config option to accept numeric keywords#193
adriansr merged 10 commits intoelastic:masterfrom
adriansr:accept_numeric_keyword

Conversation

@adriansr
Copy link
Copy Markdown
Contributor

@adriansr adriansr commented Nov 30, 2020

Keyword fields are not necessarily ingested as JSON strings. It's common to ingest numbers (integer or floating point) as keyword type.

This updates the fields validator to accept numeric values. They are converted to string so that any defined patterns can be checked.

This adds a new configuration option, numeric_keyword_fields, to pipeline test cases and system tests so that selected fields can have numeric type and be ingested into a text-like field.

Example configuration:

{
    "dynamic_fields": {
        "event.ingested": ".*"
    },
    "numeric_keyword_fields": [
        "network.iana_number",
        "event.code",
        "syslog.facility"
    ]
}

@elasticmachine
Copy link
Copy Markdown
Collaborator

elasticmachine commented Nov 30, 2020

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #193 updated

  • Start Time: 2020-12-11T09:02:20.334+0000

  • Duration: 9 min 29 sec

Test stats 🧪

Test Results
Failed 0
Passed 9
Skipped 0
Total 9

@mtojek mtojek requested a review from ycombinator November 30, 2020 18:29
Copy link
Copy Markdown
Contributor

@ycombinator ycombinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nice enhancement for even stricter validation! Thanks and LGTM.

@mtojek mtojek self-requested a review December 1, 2020 07:58
Copy link
Copy Markdown
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer seeing JSON strings in the _source for fields mapped as keyword types, but since it's completely safe to map numbers to keywords (the inverse not being true) I'm ok loosening the restrictions to make developing packages a little simpler.

@adriansr adriansr changed the title Testing: Accept numeric keywords Testing: Add config option to accept numeric keywords Dec 4, 2020
@adriansr
Copy link
Copy Markdown
Contributor Author

adriansr commented Dec 4, 2020

In the end I think it makes more sense for this to be a configuration option to be set on a per-test-case basis, so I've modified the pipeline test runner to accept this new option.

@adriansr
Copy link
Copy Markdown
Contributor Author

adriansr commented Dec 4, 2020

Let me explain the rationale behind these changes.

Across the Beats code base, there are some instances of fields that are defined as keyword type and they would be ingested as a numeric field the document's JSON. According to ES docs: Not all numeric data should be mapped as a numeric field data type. Elasticsearch optimizes numeric fields, such as integer or long, for range queries. However, keyword fields are better for term and other term-level queries.

Examples of such fields are:

  • network.iana_number
  • vlan.id
  • dns.id
  • observer.*.interface.id
  • x509.version_number
  • event.code: In the case of some external event sources
  • {file,user}.{uid,gid}: for some Beats datasets, under a UNIX-like OS
  • syslog.{facility,severity}: For Filebeat's syslog input (these need conversion to ECS, which uses long).
  • elasticsearch.slowlog.{total_hits,total_shards}: These two really look like they should be a long.

There's probably many others, but it's difficult to spot them. I worry especially about cases where we are receiving an arbitrary set of keys (ex. via JSON) from an external source and storing those verbatim in the event with keyword type.

I see three possibilities for handling this:

  • Do not accept numeric keywords. This will force us to modify the source of events (module/Beat) so that those numbers are converted to string before being published. This is not too complicated when the processing is performed in an ingest pipeline, but it will be more time consuming for fields generated outside a pipeline.
  • Always accept numeric keywords. The original idea of this PR, after your feedback I realised it could be counter-intuitive.
  • Make it an opt-in feature. The new version of this PR. By default validation will be strict but we it can be disabled on a per-field per-test-case basis.

@mtojek
Copy link
Copy Markdown
Contributor

mtojek commented Dec 7, 2020

Thank you for the detailed explanation and possible options. I would for the first or third option:

Do not accept numeric keywords.
Make it an opt-in feature.

Actually I like the third one when the validation is configurable (strictness level). Keep in mind that this option requires updating the package-spec: https://github.com/elastic/package-spec/

@adriansr
Copy link
Copy Markdown
Contributor Author

@mtojek can you have another look?

Copy link
Copy Markdown
Contributor

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a short one about the clean code

@adriansr adriansr requested a review from mtojek December 11, 2020 09:17
Copy link
Copy Markdown
Contributor

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
nit: it would be nice to describe this feature in the howto docs (not blocking this PR)

@adriansr adriansr merged commit 88f40b5 into elastic:master Dec 11, 2020
andrewkroh added a commit to elastic/integrations that referenced this pull request Dec 15, 2020
* Add system test for Office 365 audiit

This changes mappings for a few fields to boolean. And it changes client.port and source.port to be numbers in the
JSON source to match their mappings. The remaining issues will be handled by elastic/elastic-package#193.
This is the test output after fixing the booleans.

    FAILURE DETAILS:

    o365/audit :
    [0] parsing field value failed: field "client.port"''s Go type, string, does not match the expected field type: long
    [1] parsing field value failed: field "event.code"''s Go type, float64, does not match the expected field type: keyword
    [2] parsing field value failed: field "o365.audit.ActorYammerUserId"''s Go type, float64, does not match the expected field type: keyword
    [3] parsing field value failed: field "o365.audit.AzureActiveDirectoryEventType"''s Go type, float64, does not match the expected field type: keyword
    [4] parsing field value failed: field "o365.audit.InternalLogonType"''s Go type, float64, does not match the expected field type: keyword
    [5] parsing field value failed: field "o365.audit.LogonType"''s Go type, float64, does not match the expected field type: keyword
    [6] parsing field value failed: field "o365.audit.RecordType"''s Go type, float64, does not match the expected field type: keyword
    [7] parsing field value failed: field "o365.audit.UserType"''s Go type, float64, does not match the expected field type: keyword
    [8] parsing field value failed: field "o365.audit.Version"''s Go type, float64, does not match the expected field type: keyword
    [9] parsing field value failed: field "o365.audit.YammerNetworkId"''s Go type, float64, does not match the expected field type: keyword
    [10] parsing field value failed: field "source.port"''s Go type, string, does not match the expected field type: long
    --- Test results for package: o365 - END   ---

* Configure numeric_keyword_fields

* Sync pipeline JS to get elastic/beats#22939
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants