Time series data

Why CrateDB for Time Series?

CrateDB employs a relational representation for time series, enabling you to work with timestamped data using standard SQL while seamlessly combining it with document and context data.

  • Columnar storage and automatic indexing let you access and analyze data immediately, with fast aggregations and near-real-time queries, even while sustaining a high ingest rate.

  • Handles high cardinality and a variety of data types, including nested JSON, geospatial, and vector data, all queryable via the same SQL statements.

Data Model Template

A typical time‑series schema looks like this:

CREATE TABLE devices_readings (
   ts TIMESTAMP WITH TIME ZONE,
   device_id TEXT,
   battery OBJECT AS (
      level BIGINT,
      status TEXT,
      temperature DOUBLE PRECISION
   ),
   cpu OBJECT AS (
      avg_1min DOUBLE PRECISION,
      avg_5min DOUBLE PRECISION,
      avg_15min DOUBLE PRECISION
   ),
   memory OBJECT AS (
      free BIGINT,
      used BIGINT
   ),
   month TIMESTAMP GENERATED ALWAYS AS date_trunc('month', ts)
) PARTITIONED BY (month);

CREATE TABLE devices_info (
   device_id TEXT,
   api_version TEXT,
   manufacturer TEXT,
   model TEXT,
   os_name TEXT
);

Key points:

  • month is the partitioning key: each month of data becomes its own partition, so time-bounded queries touch only the relevant partitions and old data can be dropped cheaply.

  • Every column is stored in the column store by default for fast aggregations.

  • OBJECT columns give nested data a declared structure: subcolumns are typed, stored, and indexed like top-level columns while retaining the flexibility of JSON (see the insert sketch below).
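
To make the model concrete, here is a minimal insert against this schema (device ID and values are illustrative); the generated month column is computed automatically:

INSERT INTO devices_readings (ts, device_id, battery, cpu, memory)
VALUES (
   '2025-01-01T00:00:00Z',
   'dev-001',
   {"level" = 98, "status" = 'discharging', "temperature" = 31.2},
   {"avg_1min" = 0.41, "avg_5min" = 0.37, "avg_15min" = 0.35},
   {"free" = 512000000, "used" = 256000000}
);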

Ingesting and Querying

Data Ingestion

  • Use SQL INSERT or bulk-import techniques like COPY FROM with JSON or CSV files (see the sketch after this list).

  • Schema inference can happen automatically during import: with a dynamic column policy, new columns are created as they appear in the data.
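
A minimal bulk-import sketch, assuming the readings sit as JSON lines in a file readable by a cluster node (the path is hypothetical):

COPY devices_readings
FROM 'file:///tmp/devices_readings.json'
RETURN SUMMARY;

-- verify the import
SELECT count(*) FROM devices_readings;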

Aggregation and Transformations

CrateDB offers built‑in SQL functions tailor‑made for time‑series analyses:

  • DATE_BIN(interval, timestamp, origin) for bucketed aggregations (down‑sampling).

  • Window functions like LAG() and LEAD() to detect trends or gaps.

  • MAX_BY(returnField, searchField) / MIN_BY(returnField, searchField) return the value of one column from the row where another column reaches its maximum/minimum within a group (see the sketch below).
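
For example, a sketch that returns, per device, the timestamp at which the battery level peaked:

SELECT device_id,
       MAX_BY(ts, battery['level']) AS ts_at_peak_battery,
       MAX(battery['level']) AS peak_battery
FROM devices_readings
GROUP BY device_id;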

Example: compute hourly average battery levels and join with metadata:

WITH avg_metrics AS (
  SELECT device_id,
         DATE_BIN('1 hour'::interval, ts, 0) AS period,
         AVG(battery['level']) AS avg_battery
  FROM devices_readings
  GROUP BY device_id, period
)
SELECT t.period, t.device_id, i.manufacturer, t.avg_battery
FROM avg_metrics t
JOIN devices_info i ON t.device_id = i.device_id
WHERE i.model = 'mustang';

Example: detect gaps by joining readings against a generated series of expected timestamps (missing readings show up as NULL):

WITH expected_times AS (
  SELECT generate_series(
           '2025-01-01T00:00:00Z'::TIMESTAMP WITH TIME ZONE,
           '2025-01-02T00:00:00Z'::TIMESTAMP WITH TIME ZONE,
           INTERVAL '30 second'
         ) AS expected_time
),
raw AS (
  SELECT ts,
         battery['level'] AS battery_level
  FROM devices_readings
)
SELECT expected_time,
       r.battery_level
FROM expected_times
LEFT JOIN raw r ON expected_time = r.ts
ORDER BY expected_time;

Typical time series functions

  • Time extraction: date_trunc(...), extract(...), date_part(...), now(), current_timestamp

  • Time bucketing: date_bin(...), INTERVAL arithmetic, age(...)

  • Window functions: avg(...) OVER (...), lag(...), lead(...), first_value(...), last_value(...), row_number(), rank(), WINDOW ... AS (...) (see the sketch after this list)

  • Null handling: coalesce(...), nullif(...)

  • Statistical aggregates: percentile(...), stddev(...), variance(...), min(...), max(...), sum(...), topk(...)

  • Advanced filtering & logic: greatest(...), least(...), CASE WHEN ... THEN ... END
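
As an illustration, a sketch that uses LAG() within a named window to compute the change in battery level between consecutive readings per device:

SELECT device_id,
       ts,
       battery['level'] AS level,
       battery['level'] - LAG(battery['level']) OVER w AS level_delta
FROM devices_readings
WINDOW w AS (PARTITION BY device_id ORDER BY ts)
ORDER BY device_id, ts;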

Downsampling & Interpolation

To reduce volume while preserving trends, use DATE_BIN. Missing data can be handled using LAG()/LEAD() or other interpolation logic within SQL.
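
A minimal down-sampling sketch against the schema above, aggregating raw readings into 15-minute buckets (the bucket width is arbitrary):

SELECT device_id,
       DATE_BIN('15 minutes'::INTERVAL, ts, 0) AS bucket,
       AVG(battery['temperature']) AS avg_temp,
       MIN(battery['temperature']) AS min_temp,
       MAX(battery['temperature']) AS max_temp
FROM devices_readings
GROUP BY device_id, bucket
ORDER BY device_id, bucket;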

Schema Evolution & Contextual Data

With column_policy = 'dynamic', ingest JSON payloads containing extra attributes—new columns are auto‑created and indexed. Perfect for capturing evolving sensor metadata. For column-level control, use OBJECT(DYNAMIC) to auto-create (and, by default, index) subcolumns, or OBJECT(IGNORED) to accept unknown keys without creating or indexing subcolumns.
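
A sketch of both options, using hypothetical table and column names:

CREATE TABLE devices_events (
   ts TIMESTAMP WITH TIME ZONE,
   device_id TEXT,
   payload OBJECT(DYNAMIC),      -- unknown keys become typed, indexed subcolumns
   raw_payload OBJECT(IGNORED)   -- unknown keys are stored but not typed or indexed
) WITH (column_policy = 'dynamic');

-- with column_policy = 'dynamic', a previously unseen top-level column
-- such as firmware_version is created on the fly
INSERT INTO devices_events (ts, device_id, payload, firmware_version)
VALUES (now(), 'dev-001', {"rssi" = -67}, '2.1.0');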

You can also store:

  • Geospatial (GEO_POINT, GEO_SHAPE)

  • Vectors (FLOAT_VECTOR, up to 2048 dimensions, searched via an HNSW index)

  • BLOBs for binary data (e.g. images, logs)

All of these types can live in the same table as your time-series data, or in separate tables joined at query time, as sketched below.
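
A sketch of such context data, using a hypothetical profile table and a toy 3-dimensional vector (real embeddings are typically much larger):

CREATE TABLE device_profiles (
   device_id TEXT,
   location GEO_POINT,
   embedding FLOAT_VECTOR(3)
);

-- k-nearest-neighbour search on the vector column (k = 2)
SELECT device_id, _score
FROM device_profiles
WHERE KNN_MATCH(embedding, [0.1, 0.9, 0.2], 2)
ORDER BY _score DESC;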

Storage Optimization

  • Partitioning and sharding: data can be partitioned by time (e.g. daily/monthly) and sharded across the cluster (see the sketch after this list).

  • Supports long-term retention: historical partitions stay queryable and can be dropped efficiently once they age out.

  • Columnar layout reduces storage footprint and accelerates aggregation queries.
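
A sketch of explicit shard, routing, and replica settings on a trimmed-down readings table (the shard count is illustrative and depends on cluster size):

CREATE TABLE devices_readings_tuned (
   ts TIMESTAMP WITH TIME ZONE,
   device_id TEXT,
   battery OBJECT AS (level BIGINT),
   month TIMESTAMP GENERATED ALWAYS AS date_trunc('month', ts)
) CLUSTERED BY (device_id) INTO 6 SHARDS
  PARTITIONED BY (month)
  WITH (number_of_replicas = 1);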

See also