Introduction
Time series data is pervasive across use cases – from server metrics to website analytics to IoT sensor data. This time-oriented data often needs hourly analysis to identify trends, patterns and anomalies.
As a full-stack developer, I implement large-scale time series solutions using PostgreSQL. Its robust, built-in time functions provide group by hour capabilities that form the core of flexible time-based analysis.
In this comprehensive guide, I will share optimized techniques for modeling time series data and grouping by hour in PostgreSQL.
Time Series Data Challenges
But first, let's understand why analyzing time series data poses challenges:
- Large Volumes – Metrics and event data build up over time, requiring efficient storage and querying
- Temporal Analytics – Trend analysis, correlations and aggregations need grouping by time intervals
- Fast Ingestion – Data streams in real-time demanding high insertion throughput
- Retention Policies – Raw data may reside in cold storage while recent data requires low latency access
These requirements make both data modeling and query optimization vital for working with time series data at scale.
This is where PostgreSQL's specialized date/time functions, combined with optimizations like partitioning, shine as the basis for scalable time series solutions.
PostgreSQL vs. Other Databases
Before we dive in, let's explore how PostgreSQL compares to other data platforms for time series workloads:
MySQL
- Has date manipulation functions similar to PostgreSQL's, such as DATE_FORMAT(), DATE_ADD() and PERIOD_DIFF()
- Lacks advanced analytical functions offered by PostgreSQL
- Partitioning and optimizations need to be application managed
- Overall, works reasonably well for basic time series use cases
Snowflake
- Automated clustering and partitioning for time series workloads
- Time-intelligence features like Time Travel make historical analysis easier
- Cloud-only deployment model, so it cannot match PostgreSQL's deployment flexibility
- Significantly higher cost at scale compared to self-managed PostgreSQL
TimescaleDB
- Open source time series database powered by PostgreSQL
- Provides automatic partitioning, compression, chunking
- Seamless time based functions and optimizations
- Easy migration from vanilla PostgreSQL
- Overall, one of the strongest open source choices for dedicated time series workloads
So while other databases have some time series capabilities, PostgreSQL provides the most well-rounded set of temporal functions combined with scalability. Let's jump in!
PostgreSQL Time Functions
PostgreSQL handles times, timestamps and intervals via the following main data types:
- timestamp – Date and time
- time – Time of day only
- interval – Duration like '2 hours'
- date – Calendar date (no time)
I commonly use the timestamp and interval types for time series data analysis.
For manipulating these data types, PostgreSQL offers extremely flexible functions including:
- DATE_TRUNC() – Truncates down to single units like hour, day
- EXTRACT() – Extracts smaller units like hour, minutes
- make_interval() – Creates intervals like '1 hour'
- generate_series() – Generate time series ranges and sequences
And many more!
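To make the truncation semantics concrete before we get to SQL, here is a small client-side Python sketch mirroring what DATE_TRUNC('hour', …) does to a timestamp (trunc_hour is my own illustrative helper, not a PostgreSQL function):

```python
from datetime import datetime

def trunc_hour(ts: datetime) -> datetime:
    # Mirror DATE_TRUNC('hour', ts): zero out minutes, seconds and microseconds
    return ts.replace(minute=0, second=0, microsecond=0)

print(trunc_hour(datetime(2020, 3, 1, 1, 35, 45)))  # 2020-03-01 01:00:00
```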
Let's go through examples of using these functions for group by hour analysis.
Sample Time Series Data
We will use a database containing IoT sensor metrics over time to demonstrate the examples:
CREATE TABLE conditions (
id SERIAL PRIMARY KEY,
sensor_id INTEGER,
record_time TIMESTAMP NOT NULL,
temperature DOUBLE PRECISION
);
INSERT INTO conditions(sensor_id, record_time, temperature) VALUES
(101, '2020-03-01 01:35:45', 21.2),
(102, '2020-03-01 11:25:18', 26.1),
(101, '2020-03-02 02:46:11', 23.4),
(102, '2020-03-02 14:34:58', 25.7),
(101, '2020-03-03 04:45:19', 22.3);
-- And millions more rows of time series sensor data!
This table contains:
- Streaming timestamped temperature readings from various sensors
- Metrics that build up over time into time series data
We need to analyze this data by hour to facilitate real-time monitoring and identify historical trends.
GROUP BY Hour using DATE_TRUNC()
The workhorse function for truncating timestamps is DATE_TRUNC(). Passing 'hour' as the first argument rounds timestamps down to hourly boundaries. Note that ROUND() with a precision argument requires a numeric input, so the double precision average is cast first:
SELECT
    sensor_id,
    DATE_TRUNC('hour', record_time) AS hour,
    COUNT(*),
    ROUND(AVG(temperature)::numeric, 2)
FROM conditions
GROUP BY
    sensor_id,
    DATE_TRUNC('hour', record_time)
ORDER BY 1, 2;
Output:
sensor_id | hour | count | avg
------------+------------------------+-------+---------
101 | 2020-03-01 01:00:00 | 1 | 21.20
101 | 2020-03-02 02:00:00 | 1 | 23.40
101 | 2020-03-03 04:00:00 | 1 | 22.30
102 | 2020-03-01 11:00:00 | 1 | 26.10
102 | 2020-03-02 14:00:00 | 1 | 25.70
By grouping on the truncated timestamp, we aggregated sensor readings into hourly buckets while retaining other attributes like sensor_id.
This allows us to average, plot or analyze the metrics on an hourly basis.
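The same bucketing logic can be sketched in plain Python, which is handy for unit-testing aggregation logic outside the database (a hypothetical client-side equivalent, not how PostgreSQL executes the query):

```python
from collections import defaultdict
from datetime import datetime

# Sample readings: (sensor_id, timestamp, temperature)
readings = [
    (101, datetime(2020, 3, 1, 1, 35, 45), 21.2),
    (102, datetime(2020, 3, 1, 11, 25, 18), 26.1),
    (101, datetime(2020, 3, 2, 2, 46, 11), 23.4),
]

# Group readings into (sensor_id, hour) buckets, as GROUP BY does
buckets = defaultdict(list)
for sensor_id, ts, temp in readings:
    hour = ts.replace(minute=0, second=0, microsecond=0)
    buckets[(sensor_id, hour)].append(temp)

# Per-bucket count and rounded average, mirroring COUNT(*) and ROUND(AVG(...), 2)
hourly = {key: (len(temps), round(sum(temps) / len(temps), 2))
          for key, temps in buckets.items()}
```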
Generate Time Series with generate_series()
The generate_series() function produces a set of evenly spaced timestamp (or numeric) values. This makes it possible to query uniformly spaced time intervals even when readings are missing.
Let's find the hours with no readings per sensor. A plain left join would leave sensor_id NULL for missing hours, so first cross join every sensor with the full hourly series:
SELECT
    s.sensor_id,
    h.hour,
    COUNT(c.record_time)
FROM (SELECT DISTINCT sensor_id FROM conditions) s
CROSS JOIN generate_series('2020-03-01 00:00'::timestamp,
                           '2020-03-03 23:00', '1 hour') AS h(hour)
LEFT JOIN conditions c
       ON c.sensor_id = s.sensor_id
      AND DATE_TRUNC('hour', c.record_time) = h.hour
GROUP BY s.sensor_id, h.hour
ORDER BY s.sensor_id, h.hour;
Output:
sensor_id | hour | count
------------+--------------------------+------
101 | 2020-03-01 00:00:00 | 0
101 | 2020-03-01 01:00:00 | 1
101 | 2020-03-01 02:00:00 | 0
101 | 2020-03-01 03:00:00 | 0
...
...
102 | 2020-03-01 00:00:00 | 0
102 | 2020-03-01 01:00:00 | 0
...
By joining the readings against a complete hourly series, hours without readings surface with a count of 0, since COUNT(record_time) skips the NULLs produced for unmatched hours.
This allows detecting gaps or irregularities in time series data.
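For intuition, the gap-detection idea can be reproduced in Python: generate the full hourly grid and subtract the hours that actually have readings (hour_range is a made-up helper mimicking generate_series):

```python
from datetime import datetime, timedelta

def hour_range(start: datetime, end: datetime):
    # Mimic generate_series(start, end, '1 hour'): inclusive hourly steps
    ts = start
    while ts <= end:
        yield ts
        ts += timedelta(hours=1)

observed = {datetime(2020, 3, 1, 1), datetime(2020, 3, 1, 3)}
missing = [h for h in hour_range(datetime(2020, 3, 1, 0), datetime(2020, 3, 1, 4))
           if h not in observed]
# missing hours: 00:00, 02:00 and 04:00
```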
Create Time Intervals with make_interval()
The make_interval() function constructs interval values from named parts like hours and minutes. Intervals are most useful for defining time spans in filters and date arithmetic, for example selecting the trailing hour of data:
SELECT *
FROM conditions
WHERE record_time >= NOW() - make_interval(hours := 1);
For per-hour analytics such as medians and percentiles, combine DATE_TRUNC() bucketing with ordered-set aggregates:
SELECT
    sensor_id,
    DATE_TRUNC('hour', record_time) AS hour,
    COUNT(*),
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY temperature) AS median,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY temperature) AS p90
FROM conditions
GROUP BY sensor_id, DATE_TRUNC('hour', record_time)
ORDER BY sensor_id, hour;
Output:
 sensor_id |        hour         | count | median | p90
-----------+---------------------+-------+--------+------
       101 | 2020-03-01 01:00:00 |     1 |   21.2 | 21.2
       101 | 2020-03-02 02:00:00 |     1 |   23.4 | 23.4
       101 | 2020-03-03 04:00:00 |     1 |   22.3 | 22.3
       102 | 2020-03-01 11:00:00 |     1 |   26.1 | 26.1
       102 | 2020-03-02 14:00:00 |     1 |   25.7 | 25.7
This yields advanced per-hour analytics like median and 90th percentile trends (with only one reading per hour in the sample data, both equal the reading itself).
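PERCENTILE_CONT interpolates linearly between sorted values; here is a small Python sketch of that rule (my own implementation, for illustration only, not PostgreSQL's internal code):

```python
def percentile_cont(values, frac):
    # Linear-interpolation percentile, like PostgreSQL's PERCENTILE_CONT(frac)
    xs = sorted(values)
    idx = frac * (len(xs) - 1)       # fractional position in the sorted list
    lo = int(idx)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (idx - lo) * (xs[hi] - xs[lo])

print(percentile_cont([10, 20, 30, 40, 50], 0.5))  # 30.0
```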
Time Series Analysis Examples
Let's run through some common time-oriented analyses leveraging the group by hour capabilities we just built.
Visualizing Hourly Averages
Plot hourly temperature trends:
SELECT
    DATE_TRUNC('hour', record_time) AS hour,
    AVG(temperature) AS avg_temp
FROM conditions
GROUP BY 1
ORDER BY 1;
Output:
hour | avg_temp
--------------------------+--------------
2020-03-01 01:00:00 | 21.20
2020-03-01 11:00:00 | 26.10
2020-03-02 02:00:00 | 23.40
2020-03-02 14:00:00 | 25.70
2020-03-03 04:00:00 | 22.30
Hourly aggregates coupled with visualizations provide insights into daily cycles.
Detecting Outliers
Find readings more than two standard deviations above their hourly mean. PostgreSQL does not allow aggregate calls to be nested, so compute the per-hour statistics first, then join them back to the raw readings:
WITH hourly_stats AS (
    SELECT
        DATE_TRUNC('hour', record_time) AS hour,
        AVG(temperature) AS mean,
        STDDEV(temperature) AS stddev
    FROM conditions
    GROUP BY 1
)
SELECT
    c.sensor_id,
    c.record_time,
    c.temperature,
    h.mean,
    h.stddev
FROM conditions c
JOIN hourly_stats h ON DATE_TRUNC('hour', c.record_time) = h.hour
WHERE c.temperature > h.mean + 2 * h.stddev;
With only one reading per hour in the sample data, STDDEV() returns NULL and no rows are flagged; on dense production data, per-hour thresholds readily detect anomalies even across different sensors.
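The threshold logic itself is easy to sanity-check in Python (sample standard deviation, matching PostgreSQL's STDDEV; hourly_outliers is a hypothetical helper of mine):

```python
from statistics import mean, stdev

def hourly_outliers(temps):
    # Flag readings more than two sample standard deviations above the mean;
    # needs at least two readings, just as STDDEV needs >1 row to be non-NULL
    if len(temps) < 2:
        return []
    m, sd = mean(temps), stdev(temps)
    return [t for t in temps if t > m + 2 * sd]

print(hourly_outliers([20] * 9 + [40]))  # [40]
```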
Hourly Trend Analysis
Compute trailing 7-point moving averages over the hourly buckets for smoothed trends. Aggregate to hourly averages first, then apply a window function over the buckets:
WITH hourly AS (
    SELECT
        sensor_id,
        DATE_TRUNC('hour', record_time) AS hour,
        AVG(temperature) AS avg_temp
    FROM conditions
    GROUP BY sensor_id, DATE_TRUNC('hour', record_time)
)
SELECT
    sensor_id,
    hour,
    AVG(avg_temp) OVER (
        PARTITION BY sensor_id
        ORDER BY hour
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS movavg_7
FROM hourly;
The moving average smooths short-term fluctuations, revealing longer-running hourly patterns.
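A trailing window average of this kind is simple to verify in Python (moving_avg is my own illustrative helper; with window=7 it matches ROWS BETWEEN 6 PRECEDING AND CURRENT ROW):

```python
def moving_avg(values, window=7):
    # Trailing moving average: each point averages up to `window` values
    # ending at (and including) the current one
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

print(moving_avg([1, 2, 3, 4], window=2))  # [1.0, 1.5, 2.5, 3.5]
```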
There are endless variations of such hourly analytics – percentiles, correlations, regressions etc. PostgreSQL provides the flexible functions to build these detailed views into time series.
Optimizing Group By Hour Performance
Now that we can analyze time series data by the hour, let's go over some performance best practices for growing data volumes:
Clustering
Cluster the table on the time column to colocate related rows on disk. CLUSTER requires an existing index:
CREATE INDEX conditions_record_time_idx ON conditions (record_time);
CLUSTER conditions USING conditions_record_time_idx;
This improves query speeds by minimizing disk I/O. Note that CLUSTER is a one-time reordering, so rerun it periodically as new data accumulates.
Partitioning
Partition the data on weekly or monthly boundaries so queries can prune irrelevant partitions. The classic inheritance-based approach:
CREATE TABLE conditions_y2020m03 (
    CHECK (record_time >= DATE '2020-03-01' AND record_time < DATE '2020-04-01')
) INHERITS (conditions);
CREATE TABLE conditions_y2020m04 (
    CHECK (record_time >= DATE '2020-04-01' AND record_time < DATE '2020-05-01')
) INHERITS (conditions);
On PostgreSQL 10 and later, declarative partitioning (PARTITION BY RANGE) accomplishes the same with far less manual bookkeeping.
Partitions drop I/O for irrelevant data sets. Auto-partitioning in TimescaleDB further simplifies management.
Parallelization
Enable parallel workers for large analytics:
SET max_parallel_workers_per_gather = 4;
Careful parallelization reduces response times through horizontal scaling.
Combining these optimizations with specialized infrastructure for time series workloads like TimescaleDB allows scaling PostgreSQL to millions of metrics per second.
Closing Thoughts
And that wraps up my guide on leveraging PostgreSQL for time series analysis! Here are some key takeaways:
- Robust date/time functions like DATE_TRUNC(), make_interval() etc. facilitate flexible group by hour analysis
- generate_series() plus outer joins, and window functions like LAG()/LEAD(), allow working with gappy or irregular time series
- Time-oriented aggregation uncovers hourly metrics, patterns and anomalies
- Clustering, partitioning and parallelization are key for performance
- Overall, PostgreSQL forms a highly scalable open source time series platform
Time series data powers many real-time monitoring and analytics use cases today. I hope you now feel empowered to leverage PostgreSQL to wrangle all your time series workloads!


