HoWDe (Home and Work Detection) is a Python package designed to identify home and work locations from individual timestamped sequences of stop locations. It processes stop location data to label each location as 'Home', 'Work', or 'None' based on user-defined parameters and heuristics.
A complete description of the algorithm can be found in our pre-print.
- Processes stop location datasets to detect home and work locations.
- Allows customization through various parameters to fine-tune detection heuristics.
- Supports batch processing with multiple parameter configurations.
- Outputs results as a PySpark DataFrame for seamless integration with big data workflows.
HoWDe requires Python 3.6 or later and a functional PySpark environment.
1. Install PySpark
Before installing HoWDe, ensure PySpark and Java are properly configured. For detailed setup instructions, please refer to the official PySpark Installation Guidelines.
Installation Note:
PySpark may raise a `Py4JJavaError` if Java or Spark is not properly configured. We recommend checking the Debugging PySpark and `Py4JJavaError` Guidelines.
Compatibility Note:
Once PySpark/Java is correctly configured, HoWDe runs consistently across macOS, Ubuntu, and Windows. The following environments have been tested:
- Python 3.9 + PySpark 3.3 + Java 20.0
- Python 3.12 + PySpark 4.0 + Java 17.0
2. Install HoWDe
Once PySpark is installed and configured, you can install HoWDe via pip:
```
pip install HoWDe
```

The core function of the HoWDe package is `HoWDe_labelling`, which performs the detection of home and work locations.
```python
def HoWDe_labelling(
    input_data,
    edit_config_default=None,
    range_window_home=28,
    range_window_work=42,
    C_hours=0.4,
    C_days_H=0.4,
    C_days_W=0.5,
    f_hours_H=0.7,
    f_hours_W=0.4,
    f_days_W=0.6,
    output_format="stop",
    verbose=False,
):
    """
    Perform Home and Work Detection (HoWDe).
    """
```

HoWDe expects the input to be a PySpark DataFrame containing one row per user stop, with the following columns:
| Column | Type | Description |
|---|---|---|
| `useruuid` | str or int | Unique user identifier. |
| `loc` | str or int | Stop location ID (unique per `useruuid`). Use `-1` to label non-meaningful stops; these are dropped, following the Infostop convention. |
| `start` | long | Start time of the stop (Unix timestamp). |
| `end` | long | End time of the stop (Unix timestamp). |
| `tz_hour_start`, `tz_minute_start` | int | Optional. Time zone offsets (hours and minutes) used to convert UTC timestamps to local time, if applicable. |
| `country` | str | Optional. Country code; if not provided, a default "GL0B" label is assigned. |
```
+---------+-----+-------------+-------------+---------------+----------------+---------+
| useruuid| loc | start       | end         | tz_hour_start | tz_minute_start| country |
+---------+-----+-------------+-------------+---------------+----------------+---------+
| 1001    | 1   | 1704031200  | 1704034800  | 1             | 0              | DK      |
| 1001    | 2   | 1704056400  | 1704060000  | 1             | 0              | DK      |
+---------+-----+-------------+-------------+---------------+----------------+---------+
```

💡 **Scalability Tip:** This package involves heavy computations (e.g., window functions, UDFs). To ensure efficient parallel processing, use `df.repartition("useruuid")` to distribute data evenly across partitions. This reduces memory bottlenecks and improves resource utilization.
| Parameter | Type | Description | Suggested value and range |
|---|---|---|---|
| `range_window_home` | int or list | Sliding window size (in days) used to detect home locations. | 28 [14-112] |
| `range_window_work` | int or list | Sliding window size (in days) used to detect work locations. | 42 [14-112] |
| `C_hours` | float or list | Minimum fraction of night/business hourly bins with data in a day. | 0.4 [0.2-0.9] |
| `C_days_H` | float or list | Minimum fraction of days with data in a home-detection window. | 0.4 [0.1-0.6] |
| `C_days_W` | float or list | Minimum fraction of days with data in a work-detection window. | 0.5 [0.4-0.6] |
| `f_hours_H` | float or list | Minimum average fraction of night hourly bins (across days in the window) required for a location to qualify as Home. | 0.7 [0.5-0.9] |
| `f_hours_W` | float or list | Minimum average fraction of business hourly bins (across days in the window) required for a location to qualify as Work. | 0.4 [0.4-0.6] |
| `f_days_W` | float or list | Minimum fraction of days within the window a location must be visited to qualify as Work. | 0.6 [0.5-0.8] |
All parameters listed above can also be provided as lists to explore multiple configurations in a single run.
💡 Tuning Tip:
When adjusting detection parameters, start by refining the temporal coverage filters `C_days_H` and `C_days_W` to match the characteristics of your data.
Once these are well aligned, tune the estimation thresholds `f_hours_H`, `f_hours_W`, and `f_days_W` according to the specifics of your case study. These thresholds play a major role in determining how strictly the algorithm identifies consistent home and work locations.
While we provide recommended parameter ranges to guide your exploration, the hard-coded limits in `howde/config.py` are intentionally more relaxed: they simply prevent nonsensical values. Inputs falling outside these hard limits will raise an error.
- `edit_config_default` (dict, optional): Dictionary that overrides the default settings in `howde/config.py` to fine-tune preprocessing and detection behavior. It may include the parameters:
  - `is_time_local`: interpret timestamps as local time (True) or UTC (False)
  - `min_stop_t`: minimum stop duration (seconds)
  - `start_hour_day`, `end_hour_day`: hours used for home detection
  - `start_hour_work`, `end_hour_work`: hours used for work detection
  - `data_for_predict`: use only past data for estimation
- `output_format` (str): If `"stop"`, returns stop-level data with `location_type` and one row per stop. If `"change"`, returns a compact DataFrame with only one row per day on which the home/work location changes.
- `verbose` (bool): If True, reports processing steps.
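As an illustration, an override dictionary for `edit_config_default` might look like the following. The key names follow the list above; the values here are assumptions for illustration, not the package defaults (check `howde/config.py` for those):

```python
# Illustrative override dictionary for edit_config_default.
# All values below are assumptions, not the package defaults.
edit_config = {
    "is_time_local": True,      # timestamps are already in local time
    "min_stop_t": 300,          # drop stops shorter than 5 minutes
    "start_hour_day": 21,       # night window for home detection starts at 21:00
    "end_hour_day": 6,          # night window ends at 06:00
    "start_hour_work": 9,       # business window for work detection starts at 09:00
    "end_hour_work": 17,        # business window ends at 17:00
    "data_for_predict": False,  # use the full window, not only past data
}
```

Such a dictionary would then be passed as `HoWDe_labelling(input_data, edit_config_default=edit_config)`.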
If a single parameter configuration is used, the function returns a PySpark DataFrame with three additional columns:
- `detect_H_loc`: The location ID (`loc`) identified as Home, assigned if the location satisfies all filtering criteria. It represents a day-level assessment, taking into account observations within a sliding window of t ± `range_window_home` / 2 days.
- `detect_W_loc`: The location ID (`loc`) identified as Work, assigned if the location satisfies all filtering criteria. It represents a day-level assessment, taking into account observations within a sliding window of t ± `range_window_work` / 2 days.
- `location_type`: The detected location type for each stop (`'H'` for Home, `'W'` for Work, `'O'` for Other), based on matching the stop location to the inferred home/work labels.
If multiple parameter configurations are provided (as lists), the function returns a list of dictionaries, each with keys:
- `configs`: the parameter configuration used
- `res`: the resulting labeled PySpark DataFrame (as described above)
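When parameters are passed as lists, the returned list can be consumed as sketched below. The `results` value is mocked here purely to show the structure; in practice each `res` entry is the labeled PySpark DataFrame:

```python
# Mocked return value mirroring the multi-configuration output structure:
# a list of dicts with "configs" (parameters used) and "res" (DataFrame).
results = [
    {"configs": {"C_hours": 0.3, "f_days_W": 0.6}, "res": None},
    {"configs": {"C_hours": 0.4, "f_days_W": 0.6}, "res": None},
]

# Pull out the explored values of one parameter for inspection or comparison.
explored = [item["configs"]["C_hours"] for item in results]
```

Iterating this way makes it easy to compare labelings across the parameter grid.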
```python
from pyspark.sql import SparkSession
from howde import HoWDe_labelling

# Initialize Spark session
spark = SparkSession.builder.appName('HoWDeApp').getOrCreate()

# Load your stop location data
input_data = spark.read.parquet('path_to_your_data.parquet')

# Run HoWDe labelling
labeled_data = HoWDe_labelling(
    input_data,
    range_window_home=28,
    range_window_work=42,
    C_hours=0.4,
    C_days_H=0.4,
    C_days_W=0.5,
    f_hours_H=0.7,
    f_hours_W=0.4,
    f_days_W=0.6,
    output_format="stop",
    verbose=False,
)

# Show the results
labeled_data.show()
```

See more examples at `/tutorials`.
Anonymized stop location data with true home and work labels will be available at:
De Sojo Caso, Silvia; Lucchini, Lorenzo; Alessandretti, Laura (2025). Benchmark datasets for home and work location detection: stop sequences and annotated labels. Technical University of Denmark. Dataset. https://doi.org/10.11583/DTU.28846325
This project is licensed under the MIT License. See the License file for details.