feat(clp-s): Add support for generic ingestion-unit-level range index inside of archives #737

@gibber9809

Request

When ingesting files into clp-s, several properties of a file as a whole can be relevant to both decompression and search. These include the name of the file, but also other things known at ingestion time, such as the service that created the log file and the host on which the logs were generated. Accurately tracking these properties would be useful for things like decompressing or filtering by the names of specific log files and implementing more powerful top-level filtering. Our most immediate need is to track the relationship between files and data, but we may as well solve that problem in a way that lets us build out other powerful features.

Currently we have an archive-level tagging system, which doesn't quite meet these goals. The main issue is that these tags only track properties at archive granularity, but an archive can contain data from several different log files, and a log file may even be split across archives. In addition, these tags are currently "unstructured": they're simply a list of strings associated with an archive (as opposed to a structured set of key-value pairs).

Generally we'd like to introduce a more powerful mechanism that allows us to associate named properties with the units of data that we ingest so that they can be leveraged for decompression and search.

Possible implementation

We can accomplish this goal by combining the following three ideas:

  1. Allow users to specify properties for each log file passed to ingestion that they would like tracked
  2. Maintain a range index within the metadata section of each archive which tracks all of the properties requested by the user
  3. Maintain some or all of the properties within an archive's range index in the global metadata db for top-level filtering

1.0 Specifying config

A user might pass us configuration like the following:

{
  "file_name": "/my/log/file.json",
  "properties": {
    "hostname": "host_1",
    "service": "service_1"
  }
}

This tells us which file to ingest and which properties the user wants tracked for it. The configuration wouldn't necessarily look like this, and would likely have more options, but this should serve to sketch the idea.
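As a rough illustration of how such a per-file configuration might be validated before ingestion, here is a minimal Python sketch. The `IngestionUnitConfig` class and `parse_ingestion_config` function are hypothetical names invented for this example; they are not part of clp-s, and the real configuration format may differ.

```python
import json
from dataclasses import dataclass, field


@dataclass
class IngestionUnitConfig:
    """Hypothetical in-memory form of one per-file config entry."""
    file_name: str
    properties: dict = field(default_factory=dict)


def parse_ingestion_config(text):
    """Parse one JSON config object of the shape sketched above."""
    raw = json.loads(text)
    if not isinstance(raw, dict) or "file_name" not in raw:
        raise ValueError("config must be an object with a 'file_name' key")
    props = raw.get("properties", {})
    if not isinstance(props, dict):
        raise ValueError("'properties' must be an object of key-value pairs")
    return IngestionUnitConfig(file_name=raw["file_name"], properties=dict(props))
```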

2.0 Tracking properties within an archive

Within an archive we simply need to map logical ranges of records to these properties, with some special handling for tracking relationships to an original file.

Specifically, we would introduce a new metadata packet type, RangeIndex, with the following schema:

(range[X, Y)<size_t, size_t>, (tag_name<str>, tag_type, tag_value<tag_type>)+)+

Here each row of the range index maps a logical range of records [X, Y) to a list of named and typed properties. This allows us to track different properties for different subsections of an archive.

In the long run we can use this to help narrow search within an archive: if a user gives us a query targeting these metadata properties on top of their regular search query, we should be able to consult the range index to narrow the search to a logical range of records.
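To make the narrowing step concrete, the following Python sketch models the range index in memory and returns the record ranges whose properties satisfy a set of equality filters. `RangeIndexEntry` and `matching_ranges` are hypothetical names for illustration only; the actual on-disk packet layout and query semantics are not defined by this sketch.

```python
from dataclasses import dataclass


@dataclass
class RangeIndexEntry:
    """One row of the range index: records [start, end) plus their properties."""
    start: int
    end: int  # exclusive
    properties: dict


def matching_ranges(index, filters):
    """Return the [start, end) ranges whose properties equal every filter value."""
    return [
        (entry.start, entry.end)
        for entry in index
        if all(
            key in entry.properties and entry.properties[key] == value
            for key, value in filters.items()
        )
    ]
```

A search that filters on, say, `hostname` would then only need to scan the returned ranges rather than every record in the archive.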

2.1 Tracking the mapping to an original file

To track the relationship between ranges within an archive and an original file, we can track the same three logical properties as CLP. That is, we will split the logical "file_name" property into three actual properties: "file_name", "file_id", and "file_chunk_id".

Here each unique file has a unique file_id, a name, and its records exist in several contiguous logical chunks. Most likely file_id would be assigned by interacting with the global metadata database (either during or before compression), and file_chunk_id would be assigned dynamically as archives get split.
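One way file_chunk_id could be assigned as archives get split is sketched below in Python. This is only an assumption about the mechanics: it treats archive size as a fixed record count (`archive_capacity`), takes file_id as already assigned, and uses a hypothetical `build_range_indexes` helper that does not exist in clp-s.

```python
def build_range_indexes(files, archive_capacity):
    """Pack files into archives and emit one range-index row per file chunk.

    files: list of (file_id, file_name, num_records) tuples.
    Returns a list of archives, each a list of range-index rows (dicts).
    """
    if archive_capacity <= 0:
        raise ValueError("archive_capacity must be positive")
    archives = [[]]
    used = 0  # records already written to the current archive
    for file_id, file_name, num_records in files:
        chunk_id = 0
        remaining = num_records
        while remaining > 0:
            if used == archive_capacity:
                # Current archive is full: start a new one.
                archives.append([])
                used = 0
            take = min(remaining, archive_capacity - used)
            archives[-1].append({
                "range": (used, used + take),  # [X, Y) within this archive
                "file_name": file_name,
                "file_id": file_id,
                "file_chunk_id": chunk_id,
            })
            used += take
            remaining -= take
            chunk_id += 1  # the next chunk of this file lands in a later archive
    return archives
```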

3.0 Tracking properties outside of an archive

It should be possible to track all of these properties in a metadata database using a schema along the lines of the following. Note that it also wouldn't be hard to let users tell us not to track certain properties in the metadata database if desired.

<archive_id> | <property_name> | <property_value>

Here we would have an index on archive_id and property_name to allow for quick lookup. We may also want to use some indirection to replace property names with integer ids, like we do for our tag system.

This approach can run into issues if we need to represent properties of many different types, so it bears more thought.

For now, we could also just track the important properties (file_name, file_id, file_chunk_id) in dedicated columns instead of putting all properties into the more generic structure shown here.
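The table and index described here could look roughly like the following, shown with Python's built-in `sqlite3` module purely for illustration. The table name `archive_properties`, the index name, and the choice to store every value as text are assumptions of this sketch, not decisions made in this proposal.

```python
import sqlite3


def create_property_table(conn):
    """Create the generic <archive_id> | <property_name> | <property_value> table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archive_properties ("
        " archive_id TEXT NOT NULL,"
        " property_name TEXT NOT NULL,"
        " property_value TEXT NOT NULL)"  # assumption: values stored as text
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_archive_property"
        " ON archive_properties (archive_id, property_name)"
    )


def archives_matching(conn, property_name, property_value):
    """Top-level filter: ids of archives that carry the given property value."""
    rows = conn.execute(
        "SELECT DISTINCT archive_id FROM archive_properties"
        " WHERE property_name = ? AND property_value = ?"
        " ORDER BY archive_id",
        (property_name, property_value),
    )
    return [row[0] for row in rows]
```

Storing every value as text sidesteps the question of typed properties for the purposes of this sketch. A filter that selects archives by property alone might also be better served by an index that leads with property_name, which is a design question this example leaves open.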

Labels: enhancement (New feature or request)
