Introduction
Time series data is pervasive across use cases – from server metrics to website analytics to IoT sensor data. This time-oriented data often needs hourly analysis to identify trends, patterns and anomalies.
As a full-stack developer, I implement large-scale time series solutions using PostgreSQL. Its robust, built-in time functions provide group by hour capabilities that form the core of flexible time-based analysis.
In this comprehensive guide, I will share optimized techniques for modeling time series data and grouping by hour in PostgreSQL.
Time Series Data Challenges
But first, let's understand why analyzing time series data poses challenges:
- Large Volumes – Metrics and event data build up over time, requiring efficient storage and querying
- Temporal Analytics – Trend analysis, correlations and aggregations need grouping by time intervals
- Fast Ingestion – Data streams in real-time demanding high insertion throughput
- Retention Policies – Raw data may reside in cold storage while recent data requires low latency access
These requirements make both data modeling and query optimization vital for working with time series data at scale.
This is where PostgreSQL's specialized date/time functions, combined with optimizations like partitioning, shine as the basis for scalable time series solutions.
PostgreSQL vs. Other Databases
Before we dive in, let's explore how PostgreSQL compares to other data platforms for time series workloads:
MySQL
- Has date manipulation functions similar to PostgreSQL's, such as DATE_FORMAT(), DATE_ADD() and PERIOD_DIFF()
- Lacks advanced analytical functions offered by PostgreSQL
- Partitioning and optimizations need to be application managed
- Overall, works reasonably well for basic time series use cases
Snowflake
- Automated clustering and partitioning for time series workloads
- Time-intelligence features like Time Travel make historical analysis easier
- Cloud-only deployment model, so it cannot match PostgreSQL's deployment flexibility
- Significantly higher cost at scale compared to self-managed PostgreSQL
TimescaleDB
- Open source time series database powered by PostgreSQL
- Provides automatic partitioning, compression, chunking
- Seamless time based functions and optimizations
- Easy migration from vanilla PostgreSQL
- Overall, one of the strongest open source choices for dedicated time series workloads
So while other databases have some time series capabilities, PostgreSQL provides the most well-rounded set of temporal functions combined with scalability. Let's jump in!
PostgreSQL Time Functions
PostgreSQL handles times, timestamps and intervals via the following main data types:
- timestamp – Date and time
- time – Time of day only
- interval – Duration like '2 hours'
- date – Calendar date (no time)
I commonly use the timestamp and interval types for time series data analysis.
For manipulating these data types, PostgreSQL offers extremely flexible functions including:
- DATE_TRUNC() – Truncates down to single units like hour, day
- EXTRACT() – Extracts smaller units like hour, minutes
- make_interval() – Creates intervals like '1 hour'
- generate_series() – Generate time series ranges and sequences
And many more!
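To make the truncation semantics concrete before we get to SQL, here is a small client-side Python sketch mirroring what DATE_TRUNC('hour', …) does to a timestamp (trunc_hour is my own illustrative helper, not a PostgreSQL function):

```python
from datetime import datetime

def trunc_hour(ts: datetime) -> datetime:
    # Mirror DATE_TRUNC('hour', ts): zero out minutes, seconds and microseconds
    return ts.replace(minute=0, second=0, microsecond=0)

print(trunc_hour(datetime(2020, 3, 1, 1, 35, 45)))  # 2020-03-01 01:00:00
```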
Let's go through examples of using these functions for group by hour analysis.
Sample Time Series Data
We will use a database containing IoT sensor metrics over time to demonstrate the examples:
CREATE TABLE conditions (
id SERIAL PRIMARY KEY,
sensor_id INTEGER,
record_time TIMESTAMP NOT NULL,
temperature DOUBLE PRECISION
);
INSERT INTO conditions(sensor_id, record_time, temperature) VALUES
(101, '2020-03-01 01:35:45', 21.2),
(102, '2020-03-01 11:25:18', 26.1),
(101, '2020-03-02 02:46:11', 23.4),
(102, '2020-03-02 14:34:58', 25.7),
(101, '2020-03-03 04:45:19', 22.3);
-- And millions more rows of time series sensor data!
This table contains:
- Streaming timestamped temperature readings from various sensors
- Metrics that build up over time into time series data
We need to analyze this data by hour to facilitate real-time monitoring and identify historical trends.
GROUP BY Hour using DATE_TRUNC()
The workhorse function for truncating timestamps is DATE_TRUNC(). Passing 'hour' as the first argument rounds timestamps down to hourly boundaries. Note that ROUND() with a precision argument requires a numeric input, so the double precision average is cast first:
SELECT
    sensor_id,
    DATE_TRUNC('hour', record_time) AS hour,
    COUNT(*),
    ROUND(AVG(temperature)::numeric, 2)
FROM conditions
GROUP BY
    sensor_id,
    DATE_TRUNC('hour', record_time)
ORDER BY 1, 2;
Output:
sensor_id | hour | count | avg
------------+------------------------+-------+---------
101 | 2020-03-01 01:00:00 | 1 | 21.20
101 | 2020-03-02 02:00:00 | 1 | 23.40
101 | 2020-03-03 04:00:00 | 1 | 22.30
102 | 2020-03-01 11:00:00 | 1 | 26.10
102 | 2020-03-02 14:00:00 | 1 | 25.70
By grouping on the truncated timestamp, we aggregated sensor readings into hourly buckets while retaining other attributes like sensor_id.
This allows us to average, plot or analyze the metrics on an hourly basis.
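The same bucketing logic can be sketched in plain Python, which is handy for unit-testing aggregation logic outside the database (a hypothetical client-side equivalent, not how PostgreSQL executes the query):

```python
from collections import defaultdict
from datetime import datetime

# Sample readings: (sensor_id, timestamp, temperature)
readings = [
    (101, datetime(2020, 3, 1, 1, 35, 45), 21.2),
    (102, datetime(2020, 3, 1, 11, 25, 18), 26.1),
    (101, datetime(2020, 3, 2, 2, 46, 11), 23.4),
]

# Group readings into (sensor_id, hour) buckets, as GROUP BY does
buckets = defaultdict(list)
for sensor_id, ts, temp in readings:
    hour = ts.replace(minute=0, second=0, microsecond=0)
    buckets[(sensor_id, hour)].append(temp)

# Per-bucket count and rounded average, mirroring COUNT(*) and ROUND(AVG(...), 2)
hourly = {key: (len(temps), round(sum(temps) / len(temps), 2))
          for key, temps in buckets.items()}
```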
Generate Time Series with generate_series()
The generate_series() function produces a set of evenly spaced timestamp (or numeric) values. This makes it possible to query uniformly spaced time intervals even when readings are missing.
Let's find the hours with no readings per sensor. A plain left join would leave sensor_id NULL for missing hours, so first cross join every sensor with the full hourly series:
SELECT
    s.sensor_id,
    h.hour,
    COUNT(c.record_time)
FROM (SELECT DISTINCT sensor_id FROM conditions) s
CROSS JOIN generate_series('2020-03-01 00:00'::timestamp,
                           '2020-03-03 23:00', '1 hour') AS h(hour)
LEFT JOIN conditions c
       ON c.sensor_id = s.sensor_id
      AND DATE_TRUNC('hour', c.record_time) = h.hour
GROUP BY s.sensor_id, h.hour
ORDER BY s.sensor_id, h.hour;
Output:
sensor_id | hour | count
------------+--------------------------+------
101 | 2020-03-01 00:00:00 | 0
101 | 2020-03-01 01:00:00 | 1
101 | 2020-03-01 02:00:00 | 0
101 | 2020-03-01 03:00:00 | 0
...
...
102 | 2020-03-01 00:00:00 | 0
102 | 2020-03-01 01:00:00 | 0
...
By joining the readings against a complete hourly series, hours without readings surface with a count of 0, since COUNT(record_time) skips the NULLs produced for unmatched hours.
This allows detecting gaps or irregularities in time series data.
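For intuition, the gap-detection idea can be reproduced in Python: generate the full hourly grid and subtract the hours that actually have readings (hour_range is a made-up helper mimicking generate_series):

```python
from datetime import datetime, timedelta

def hour_range(start: datetime, end: datetime):
    # Mimic generate_series(start, end, '1 hour'): inclusive hourly steps
    ts = start
    while ts <= end:
        yield ts
        ts += timedelta(hours=1)

observed = {datetime(2020, 3, 1, 1), datetime(2020, 3, 1, 3)}
missing = [h for h in hour_range(datetime(2020, 3, 1, 0), datetime(2020, 3, 1, 4))
           if h not in observed]
# missing hours: 00:00, 02:00 and 04:00
```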
Create Time Intervals with make_interval()
The make_interval() function constructs interval values from named parts like hours and minutes. Intervals are most useful for defining time spans in filters and date arithmetic, for example selecting the trailing hour of data:
SELECT *
FROM conditions
WHERE record_time >= NOW() - make_interval(hours := 1);
For per-hour analytics such as medians and percentiles, combine DATE_TRUNC() bucketing with ordered-set aggregates:
SELECT
    sensor_id,
    DATE_TRUNC('hour', record_time) AS hour,
    COUNT(*),
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY temperature) AS median,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY temperature) AS p90
FROM conditions
GROUP BY sensor_id, DATE_TRUNC('hour', record_time)
ORDER BY sensor_id, hour;
Output:
 sensor_id |        hour         | count | median | p90
-----------+---------------------+-------+--------+------
       101 | 2020-03-01 01:00:00 |     1 |   21.2 | 21.2
       101 | 2020-03-02 02:00:00 |     1 |   23.4 | 23.4
       101 | 2020-03-03 04:00:00 |     1 |   22.3 | 22.3
       102 | 2020-03-01 11:00:00 |     1 |   26.1 | 26.1
       102 | 2020-03-02 14:00:00 |     1 |   25.7 | 25.7
This yields advanced per-hour analytics like median and 90th percentile trends (with only one reading per hour in the sample data, both equal the reading itself).
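PERCENTILE_CONT interpolates linearly between sorted values; here is a small Python sketch of that rule (my own implementation, for illustration only, not PostgreSQL's internal code):

```python
def percentile_cont(values, frac):
    # Linear-interpolation percentile, like PostgreSQL's PERCENTILE_CONT(frac)
    xs = sorted(values)
    idx = frac * (len(xs) - 1)       # fractional position in the sorted list
    lo = int(idx)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (idx - lo) * (xs[hi] - xs[lo])

print(percentile_cont([10, 20, 30, 40, 50], 0.5))  # 30.0
```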
Time Series Analysis Examples
Let's run through some common time-oriented analyses leveraging the group by hour capabilities we just built.
Visualizing Hourly Averages
Plot hourly temperature trends:
SELECT
    DATE_TRUNC('hour', record_time) AS hour,
    AVG(temperature) AS avg_temp
FROM conditions
GROUP BY 1
ORDER BY 1;
Output:
hour | avg_temp
--------------------------+--------------
2020-03-01 01:00:00 | 21.20
2020-03-01 11:00:00 | 26.10
2020-03-02 02:00:00 | 23.40
2020-03-02 14:00:00 | 25.70
2020-03-03 04:00:00 | 22.30
Hourly aggregates coupled with visualizations provide insights into daily cycles.
Detecting Outliers
Find readings more than two standard deviations above their hourly mean. PostgreSQL does not allow aggregate calls to be nested, so compute the per-hour statistics first, then join them back to the raw readings:
WITH hourly_stats AS (
    SELECT
        DATE_TRUNC('hour', record_time) AS hour,
        AVG(temperature) AS mean,
        STDDEV(temperature) AS stddev
    FROM conditions
    GROUP BY 1
)
SELECT
    c.sensor_id,
    c.record_time,
    c.temperature,
    h.mean,
    h.stddev
FROM conditions c
JOIN hourly_stats h ON DATE_TRUNC('hour', c.record_time) = h.hour
WHERE c.temperature > h.mean + 2 * h.stddev;
With only one reading per hour in the sample data, STDDEV() returns NULL and no rows are flagged; on dense production data, per-hour thresholds readily detect anomalies even across different sensors.
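The threshold logic itself is easy to sanity-check in Python (sample standard deviation, matching PostgreSQL's STDDEV; hourly_outliers is a hypothetical helper of mine):

```python
from statistics import mean, stdev

def hourly_outliers(temps):
    # Flag readings more than two sample standard deviations above the mean;
    # needs at least two readings, just as STDDEV needs >1 row to be non-NULL
    if len(temps) < 2:
        return []
    m, sd = mean(temps), stdev(temps)
    return [t for t in temps if t > m + 2 * sd]

print(hourly_outliers([20] * 9 + [40]))  # [40]
```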
Hourly Trend Analysis
Compute trailing 7-point moving averages over the hourly buckets for smoothed trends. Aggregate to hourly averages first, then apply a window function over the buckets:
WITH hourly AS (
    SELECT
        sensor_id,
        DATE_TRUNC('hour', record_time) AS hour,
        AVG(temperature) AS avg_temp
    FROM conditions
    GROUP BY sensor_id, DATE_TRUNC('hour', record_time)
)
SELECT
    sensor_id,
    hour,
    AVG(avg_temp) OVER (
        PARTITION BY sensor_id
        ORDER BY hour
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS movavg_7
FROM hourly;
The moving average smooths short-term fluctuations, revealing longer-running hourly patterns.
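A trailing window average of this kind is simple to verify in Python (moving_avg is my own illustrative helper; with window=7 it matches ROWS BETWEEN 6 PRECEDING AND CURRENT ROW):

```python
def moving_avg(values, window=7):
    # Trailing moving average: each point averages up to `window` values
    # ending at (and including) the current one
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

print(moving_avg([1, 2, 3, 4], window=2))  # [1.0, 1.5, 2.5, 3.5]
```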
There are endless variations of such hourly analytics – percentiles, correlations, regressions etc. PostgreSQL provides the flexible functions to build these detailed views into time series.
Optimizing Group By Hour Performance
Now that we can analyze time series data by the hour, let's go over some performance best practices for growing data volumes:
Clustering
Cluster the table on the time column to colocate related rows on disk. CLUSTER requires an existing index:
CREATE INDEX conditions_record_time_idx ON conditions (record_time);
CLUSTER conditions USING conditions_record_time_idx;
This improves query speeds by minimizing disk I/O. Note that CLUSTER is a one-time reordering, so rerun it periodically as new data accumulates.
Partitioning
Partition the data on weekly or monthly boundaries so queries can prune irrelevant partitions. The classic inheritance-based approach:
CREATE TABLE conditions_y2020m03 (
    CHECK (record_time >= DATE '2020-03-01' AND record_time < DATE '2020-04-01')
) INHERITS (conditions);
CREATE TABLE conditions_y2020m04 (
    CHECK (record_time >= DATE '2020-04-01' AND record_time < DATE '2020-05-01')
) INHERITS (conditions);
On PostgreSQL 10 and later, declarative partitioning (PARTITION BY RANGE) accomplishes the same with far less manual bookkeeping.
Partitions drop I/O for irrelevant data sets. Auto-partitioning in TimescaleDB further simplifies management.
Parallelization
Enable parallel workers for large analytics:
SET max_parallel_workers_per_gather = 4;
Careful parallelization reduces response times through horizontal scaling.
Combining these optimizations with specialized infrastructure for time series workloads like TimescaleDB allows scaling PostgreSQL to millions of metrics per second.
Closing Thoughts
And that wraps up my guide on leveraging PostgreSQL for time series analysis! Here are some key takeaways:
- Robust date/time functions like DATE_TRUNC(), make_interval() etc. facilitate flexible group by hour analysis
- generate_series() plus outer joins, and window functions like LAG()/LEAD(), allow working with gappy or irregular time series
- Time-oriented aggregation uncovers hourly metrics, patterns and anomalies
- Clustering, partitioning and parallelization are key for performance
- Overall, PostgreSQL forms a highly scalable open source time series platform
Time series data powers many real-time monitoring and analytics use cases today. I hope you now feel empowered to leverage PostgreSQL to wrangle all your time series workloads!


