Interpretable time series autoregression for periodicity quantification (ints ↮ integers). [Slides]
Made by Xinyu Chen • 🌐 https://xinychen.github.io
Which datasets do we provide for experimental evaluation?
- 📦 NYC ridesharing dataset (2019-2025) (717 MB, see TLC trip record data)
- 📦 NYC yellow taxi dataset (2011-2024) (455 MB, see TLC trip record data)
- 📦 Manhattan subway ridership dataset (2024) (1.1 MB, see MTA subway hourly ridership: 2020-2024)
- 📦 Manhattan bikesharing dataset (2024) (1.7 MB, see Citi Bike system data - NYC)
- 📦 Chicago ridesharing dataset (2018-2024) (92 MB, see Transportation Network Providers (TNP) - Trips (2018 - 2022))
- 📦 Hangzhou metro passenger flow dataset (2019) (4.7 MB, see Hangzhou metro passenger data - 2019)
- 📦 North America climate variable dataset (1980-2019) (3.0 GB, see Daymet)
- 📦 Sea surface temperature dataset (1980-2019) (1.3 GB, see Sea surface temperature optimum interpolation)
- 📦 Wikipedia page view dataset (January 2024) (4.7 GB, see Analytics datasets: Pageviews)
These mobility and climate datasets are formatted as multidimensional tensors and saved as NumPy arrays in compressed form (i.e., .npz files).
Figure 1. Conceptual overview of the diverse open datasets for periodicity quantification.
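To get started with any of these datasets, a compressed .npz archive can be loaded back with NumPy. Below is a minimal sketch; the file name `dataset.npz` and the key `tensor` are hypothetical placeholders, not the repository's actual names.

```python
import numpy as np

# Load a compressed .npz archive (file and key names are placeholders)
archive = np.load('dataset.npz')
print(archive.files)        # list the arrays stored in the archive
tensor = archive['tensor']  # e.g., a (location, day, hour) tensor
print(tensor.shape)
```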
In urban systems, how can one align mobility datasets from different travel modes (e.g., ridesharing, taxi, subway, and bikesharing) at the same spatial resolution? For instance, Manhattan has hundreds of subway and bikesharing stations but only 69 taxi areas, so one can first project the subway and bikesharing stations onto the taxi areas and then aggregate the trip counts, as sketched after Figure 2.
Figure 2. (A) Subway stations are projected onto 52 areas in Manhattan. (B) Bikesharing stations are projected onto 67 areas in Manhattan.
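One possible implementation of this projection, sketched with geopandas under assumed inputs (`taxi_zones.shp` refers to the TLC taxi zone shapefile with its `LocationID` column; `stations.csv` and its `lon`, `lat`, `trip_count` columns are hypothetical placeholders):

```python
import geopandas as gpd
import pandas as pd

# Taxi zone polygons and station coordinates in a common coordinate system
zones = gpd.read_file('taxi_zones.shp').to_crs('EPSG:4326')
stations = pd.read_csv('stations.csv')  # assumed columns: lon, lat, trip_count
pts = gpd.GeoDataFrame(
    stations,
    geometry=gpd.points_from_xy(stations['lon'], stations['lat']),
    crs='EPSG:4326',
)

# Spatial join: each station inherits the ID of the taxi zone containing it
pts = gpd.sjoin(pts, zones[['LocationID', 'geometry']], predicate='within')

# Aggregate station-level trip counts into zone-level trip counts
zone_counts = pts.groupby('LocationID')['trip_count'].sum()
```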
What is time series periodicity? How can one get started with the modeling process using machine learning and optimization? One of the most intuitive ways is to annotate the time series periodicity in an interactive visualization tool, as sketched below.
Figure 3. Annotating the time series periodicity of the hourly ridesharing trip time series in Chicago since April 1, 2024.
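As a minimal sketch of such a tool (plotly is assumed here for illustration and is not necessarily the authors' annotation setup; the sample file rideshare_ts.txt is introduced below), one can zoom and pan over the hourly series to inspect its cycles:

```python
import pandas as pd
import plotly.express as px

# Hourly ridesharing trip counts (same sample file as introduced below)
data = pd.read_csv('rideshare_ts.txt', sep=' ', header=None, index_col=0, names=['trip_count'])

# Interactive line chart: hover, zoom, and pan to eyeball daily/weekly cycles
fig = px.line(data, y='trip_count', title='Hourly ridesharing trip counts in Chicago')
fig.show()
```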
While human mobility exhibits clear regularity in hourly, daily, and weekly cycles, the greatest challenge lies in accurately modeling these patterns. In addition, as shown in Figure 4, Wikipedia page view time series also demonstrate periodic patterns across multiple cycles.
Figure 4. Hourly time series of view counts over 3 million Wikipedia pages in January 2024. These pages account for up to 72% of total Wikipedia page views.
This work makes its practical contributions in the following ways:
- Classical autoregression can capture auto-correlations, but it does not reveal which auto-correlations are dominant.
- Sparse autoregression limits the number of nonzero auto-correlations by imposing a sparsity level, allowing one to identify the dominant auto-correlations (e.g., time series periodicity); see the model sketch after this list.
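In symbols (a sketch consistent with the code below, with $x_t$ the time series, $\boldsymbol{w}$ the coefficients, $d$ the order, and $\tau$ the sparsity level), the classical autoregression models each value as a weighted sum of its $d$ preceding values,

$$x_t = \sum_{k=1}^{d} w_k\, x_{t-k} + \epsilon_t,$$

while the sparse variant solves

$$\min_{\boldsymbol{w}}\;\sum_{t=d+1}^{T}\Bigl(x_t - \sum_{k=1}^{d} w_k\, x_{t-k}\Bigr)^2 \quad \text{s.t.} \quad \|\boldsymbol{w}\|_0 \le \tau,$$

where $\|\boldsymbol{w}\|_0$ counts the nonzero coefficients.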
Figure 5. Identification of the dominant auto-correlations from time series through sparse autoregression. The sparsity constraint allows one to find the dominant auto-correlated time lags.
The sample time series shown in Figures 3 and 5 is available at Chicago-ridesharing/rideshare_ts.txt.
```python
import pandas as pd
import numpy as np

# Read the hourly trip counts; the first column serves as the time index
data = pd.read_csv('rideshare_ts.txt', sep=' ', header=None, index_col=0, names=['trip_count'])
```

One can draw the two-week time series as follows:
```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 1.4))
ax = fig.add_subplot(1, 1, 1)
# Plot the first two weeks (2 * 7 * 24 hours) of hourly trip counts
plt.plot(data['trip_count'].values[: 2 * 7 * 24], color='purple', alpha=0.75, linewidth=2)
plt.xticks(np.arange(0, 24 * 7 * 2 + 1, 7 * 24))  # one tick per week
plt.xlabel('Time (hour)')
plt.ylabel('Trip count')
plt.grid(axis='both', linestyle='dashed', linewidth=0.1, color='gray')
ax.tick_params(direction='in')
ax.set_xlim([-1, 24 * 7 * 2])
plt.show()
```

We use CPLEX as the mixed-integer optimization solver in our Python implementation. The setting of sparse autoregression includes `d` (order) and `tau` (sparsity level). By introducing binary decision variables $\beta_k \in \{0, 1\}$, the optimization problem of sparse autoregression can be reformulated as

$$\min_{\boldsymbol{w},\,\boldsymbol{\beta}}\;\sum_{t=d+1}^{T}\Bigl(x_t - \sum_{k=1}^{d} w_k\, x_{t-k}\Bigr)^2 \quad \text{s.t.} \quad \sum_{k=1}^{d} \beta_k \le \tau, \quad -\alpha \beta_k \le w_k \le \alpha \beta_k, \quad \beta_k \in \{0, 1\},$$

where $\alpha > 0$ is a big-M constant bounding the coefficient magnitudes (set to 1 in the code below).
```python
import numpy as np
from docplex.mp.model import Model

def obj(x, w, d):
    # Sum of squared one-step-ahead residuals of the fitted autoregression
    T = x.shape[0]
    loss = 0
    for t in range(d, T):
        loss += (x[t] - np.inner(w, np.flip(x[t - d : t]))) ** 2
    return loss

def sparse_ar(x, d, tau):
    # Fit an order-d autoregression with at most tau nonzero coefficients
    model = Model()
    alpha = 1  # big-M bound on the coefficient magnitudes
    T = x.shape[0]
    # Note: docplex continuous variables default to a lower bound of 0,
    # so we set lb = -alpha explicitly to allow negative coefficients
    w = [model.continuous_var(lb=-alpha, ub=alpha, name=f'w_{k}') for k in range(d)]
    beta = [model.binary_var(name=f'beta_{k}') for k in range(d)]
    # Least-squares objective over all time steps with a full lag window
    model.minimize(model.sum((x[t] - model.sum(w[k] * x[t - k - 1] for k in range(d))) ** 2 for t in range(d, T)))
    # Sparsity level: at most tau binary indicators can be active
    model.add_constraint(model.sum(beta[k] for k in range(d)) <= tau)
    for k in range(d):
        # Big-M linking constraints: w[k] must be 0 whenever beta[k] = 0
        model.add_constraint(w[k] <= alpha * beta[k])
        model.add_constraint(w[k] >= - alpha * beta[k])
    solution = model.solve()
    return np.array(solution.get_values(w))
```

On the sample time series mentioned above, please reproduce our results by running the following code:
```python
import numpy as np

x = data['trip_count'].values[: 2 * 7 * 24]  # two-week sample series
d = 168  # order: one week of hourly lags
for tau in range(1, 7):
    w = sparse_ar(x, d, tau)
    print('tau = {}'.format(tau))
    print('Objective function f = {}'.format(obj(x, w, d)))
    ind = np.where(w != 0)[0].tolist()
    print('Support set: {}'.format(ind))
    print('Nonzero coefficients: {}'.format(w[ind]))
    print()
```

Here, the result at the sparsity level `tau = 6` is given by
```
tau = 6
Objective function f = 50844056.30946854
Support set: [0, 22, 23, 33, 166, 167]
Nonzero coefficients: [0.29769501 0.00173922 0.03533629 0.00832573 0.16595001 0.48356377]
```

Since index `k` corresponds to time lag `k + 1`, the support set highlights lags 1, 23, 24, 34, 167, and 168, most notably the daily (24-hour) and weekly (168-hour) cycles of the trip time series.

- Xinyu Chen, Vassilis Digalakis Jr, Lijun Ding, Dingyi Zhuang, Jinhua Zhao (2025). Interpretable time series autoregression for periodicity quantification. arXiv preprint arXiv:2506.22895.
- Xinyu Chen, Qi Wang, Yunhan Zheng, Nina Cao, HanQin Cai, Jinhua Zhao (2025). Data-driven discovery of mobility periodicity for understanding urban systems. arXiv preprint arXiv:2508.03747.
- For any questions and feedback, please contact Dr. Xinyu Chen (chenxy346@gmail.com).
- If you like this repository, share it with your friends and colleagues.



