
Provide dimension lookup in a star/snowflake schema style #464

@monfera

Description


On this ticket, we could discuss the need for, and feasibility of, dimensional lookup to aid i18n, responsive annotation text lengths, and efficient ad-hoc data exploration.

(Image: star/snowflake schema diagram, via Wikipedia)

Currently, textual data for display as legends, categorical axis tick labels etc. is expected to be present in the data array, ie. we rely on denormalization. This is a good approach with Cartesian charts, especially when elastic-charts can expect a newly built denormalized array on every operation. But these items might be motivation for allowing a dimensional lookup:

  1. the user, or their clients or audience, needs internationalization, and texts for dimension codes (eg. 'United Kingdom' for 'UK') are represented in several languages, uploaded into an index
  2. the report/visualization generally has constraints on text sizes; many systems solve this by using progressively shorter text for table headers, pie/treemap labels, categorical tick labels (also depending on horizontal vs vertical orientation) etc., for example, 'United Kingdom of Great Britain and Northern Ireland' -> 'United Kingdom' -> 'UK'. Storing just 5 languages and 3 text lengths would require 15 new fields on the document, and the fields would need naming conventions to encode the language/text length etc., so it'd get tedious very fast, and any change (eg. 'Swaziland' -> 'eSwatini') would require reindexing
  3. There may be other types of lookup, eg. deriving the country from the city code, or deriving the region from the country code; if there's a maximum of one hop, it's a star schema, if it's more, it's a snowflake schema
  4. While denormalization works in theory, it fails once we consider ad-hoc reporting, eg. the user has a sudden interest in slicing/dicing, aggregating, annotating etc. on something that was not designed to be in the document
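To make the contrast in points 1 and 2 concrete, here is a minimal sketch (all field, table and function names are hypothetical, not any actual elastic-charts or ES schema) of the same record in denormalized vs star-schema form:

```typescript
// Hypothetical denormalized document: every display variant is its own field,
// so 5 languages x 3 lengths would mean 15 label fields per dimension.
const denormalizedDoc = {
  countryCode: 'UK',
  countryLabel_en_long: 'United Kingdom of Great Britain and Northern Ireland',
  countryLabel_en_medium: 'United Kingdom',
  countryLabel_en_short: 'UK',
  sales: 1234,
};

// Star-schema alternative: the fact row keeps only the dimension code,
// and a small dimension table holds every language/length variant.
const factRow = { countryCode: 'UK', sales: 1234 };

type LabelLength = 'long' | 'medium' | 'short';
const countryDim: Record<string, Record<string, Record<LabelLength, string>>> = {
  UK: {
    en: {
      long: 'United Kingdom of Great Britain and Northern Ireland',
      medium: 'United Kingdom',
      short: 'UK',
    },
  },
};

// A label change (eg. 'Swaziland' -> 'eSwatini') touches one dimension row
// instead of requiring a reindex of every document that mentions the code.
function lookupLabel(code: string, lang: string, length: LabelLength): string {
  return countryDim[code]?.[lang]?.[length] ?? code; // fall back to the raw code
}
```

Switching language or label length then only changes the lookup arguments, never the facts.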

The discussion is needed, because

  • such master data joins may eventually be needed or implemented in underlying ES query languages such as ESSQL, maybe lessening the need to do so in elastic-charts
  • even if a SQL JOIN were provided, there'd be technical merit to avoid re-querying some large dataset just because the user wants to now switch to another language, or responsive resizing leads to cramped space and there's less room for the table headers, tick labels or other annotations

Describe a solution
A star or snowflake schema for elastic-charts: besides accepting a facts table (documents/aggregates, as now), it could accept an arbitrary, typically low number of dimension tables, while also providing the means to reach into those dimension tables via (most importantly) equality checking of dimension values against document/aggregate values (star schema), or even allowing multi-hop, hierarchical lookup, as in the case of eg. a product nomenclature or any other breakdown often needed for treemap, sunburst, table etc. visualizations
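The shape such a lookup could take is sketched below. This is not a proposed elastic-charts API, just a hedged illustration of the single-hop (star) vs multi-hop (snowflake) distinction; all type and field names are made up:

```typescript
// Hypothetical dimension table: rows keyed by the dimension code, with an
// optional parent table reachable via a foreign-key column (snowflake hop).
interface DimensionTable {
  rows: Record<string, Record<string, unknown>>; // dimension key -> attributes
  parentKey?: string;        // column in a row that points into the parent
  parent?: DimensionTable;   // optional further hop -> snowflake schema
}

// Resolve an attribute for a code, following hops as needed:
// city -> country -> region, product -> group -> family, etc.
function resolve(dim: DimensionTable, code: string, attr: string): unknown {
  const row = dim.rows[code];
  if (row === undefined) return undefined;
  if (attr in row) return row[attr]; // single hop suffices: star schema
  // Multi-hop: follow the foreign key into the parent dimension table.
  if (dim.parent !== undefined && dim.parentKey !== undefined && dim.parentKey in row) {
    return resolve(dim.parent, String(row[dim.parentKey]), attr);
  }
  return undefined;
}

// Example: derive the region from a city code via two hops.
const countryDim: DimensionTable = {
  rows: { UK: { region: 'Europe' } },
};
const cityDim: DimensionTable = {
  rows: { LON: { countryCode: 'UK', name: 'London' } },
  parentKey: 'countryCode',
  parent: countryDim,
};
```

With a maximum of one hop this degenerates to the star case; unbounded hops give the snowflake case described above.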

While denormalization technically works (at a potentially large, even infeasible, runtime cost), it's at odds with our evolving visualization algebra, because it hides how facts (dimensions or granular aggregates) relate to dimensional data. Yet this very information is key for the user to navigate the chart design space.

Alternatives considered

  1. Continuing as now: denormalize everything
  2. The ES schema is not denormalized, but ESSQL provides a JOIN: elastic-charts wouldn't need to change, but basic things, eg. responsive resizing, would require re-querying ES
  3. Handling dimensional lookup inside elastic-charts (this ticket is biased toward this option)
  4. A comprehensive approach in which Kibana expressions and relational joins (including client-side joins where they're preferable, and server-side joins, once ES or ESSQL supports them) are considered at the Kibana level (ie. so it's equally available to Lens, Vega, 3rd party charts etc.)
  5. Combining either or both of the latter two with crossfiltering, which also benefits from the efficiencies of not having to denormalize data
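The client-side join mentioned in alternative 4 can be cheap: build a hash map over the (small) dimension table once, then enrich each fact row with O(1) lookups, so switching language or shrinking labels never re-queries the large dataset. A minimal sketch, with hypothetical names:

```typescript
// Generic client-side hash join: facts stay as queried; dimension attributes
// are merged in at render time from an in-memory map keyed by the join column.
function hashJoin<F extends Record<string, unknown>>(
  facts: F[],
  dim: Map<string, Record<string, unknown>>,
  factKey: keyof F,
): Array<F & Record<string, unknown>> {
  return facts.map((fact) => ({
    ...fact,
    // Unmatched keys simply contribute nothing (a left outer join).
    ...(dim.get(String(fact[factKey])) ?? {}),
  }));
}

// Swapping this map (another language, shorter labels) re-joins locally,
// with no round trip to ES.
const labelDim = new Map<string, Record<string, unknown>>([
  ['UK', { label: 'United Kingdom' }],
]);
const joined = hashJoin([{ countryCode: 'UK', sales: 2 }], labelDim, 'countryCode');
// joined[0].label is 'United Kingdom'
```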

There's no rush on this, as option 1 works right now, but it's good to foresee: the need often comes up in tools, and it's hard to refactor toward such a general approach later, because by then overly specialized, one-off solutions may have evolved and become entrenched (eg. "let's add a PR for text in multiple languages", then "let's add a PR for differing text lengths" etc.). Even if we don't do anything now, we can keep the approach in mind and refer to it in other discussions.

Checklist

  • this request is checked against already existing requests
  • every related Kibana issue is listed under the Kibana Cross Issues list
  • the kibana cross issue tag is associated with the issue if any Kibana cross issue is present

Metadata


Labels: :data (Data/series/scales related issue), discuss (To be discussed), enhancement (New feature or request)
