HoWDe

HoWDe (Home and Work Detection) is a Python package designed to identify home and work locations from individual timestamped sequences of stop locations. It processes stop location data to label each location as 'Home', 'Work', or 'None' based on user-defined parameters and heuristics.

A complete description of the algorithm can be found in our pre-print.

Features

  • Processes stop location datasets to detect home and work locations.
  • Allows customization through various parameters to fine-tune detection heuristics.
  • Supports batch processing with multiple parameter configurations.
  • Outputs results as a PySpark DataFrame for seamless integration with big data workflows.

Installation

HoWDe requires Python 3.6 or later and a functional PySpark environment.

1. Install PySpark

Before installing HoWDe, ensure PySpark and Java are properly configured. For detailed setup instructions, refer to the official PySpark Installation Guidelines.

Installation Note:
PySpark may raise a Py4JJavaError if Java or Spark is not properly configured. We recommend checking the Debugging PySpark and Py4JJavaError Guidelines.

Compatibility Note:
Once PySpark/Java is correctly configured, HoWDe runs consistently across macOS, Ubuntu, and Windows. The following environments have been tested:

  • Python 3.9 + PySpark 3.3 + Java 20.0
  • Python 3.12 + PySpark 4.0 + Java 17.0

2. Install HoWDe

Once PySpark is installed and configured, you can install HoWDe via pip:

pip install HoWDe

Usage

The core function of the HoWDe package is HoWDe_labelling, which performs the detection of home and work locations.

def HoWDe_labelling(
    input_data,
    edit_config_default=None,
    range_window_home=28,
    range_window_work=42,
    C_hours=0.4,
    C_days_H=0.4,
    C_days_W=0.5,
    f_hours_H=0.7,
    f_hours_W=0.4,
    f_days_W=0.6,
    output_format="stop",
    verbose=False,
):
    """
    Perform Home and Work Detection (HoWDe)
    """

📥 Input Data

HoWDe expects the input to be a PySpark DataFrame containing one row per user stop, with the following columns:

  • useruuid (str or int): Unique user identifier.
  • loc (str or int): Stop location ID (unique per useruuid). ⚠️ Avoid using -1 to label meaningful stops, as these are dropped following the Infostop convention.
  • start (long): Start time of the stop (Unix timestamp).
  • end (long): End time of the stop (Unix timestamp).
  • tz_hour_start, tz_minute_start (int): Optional. Time zone offsets (hours and minutes) used to convert UTC timestamps to local time, if applicable.
  • country (str): Optional. Country code; if not provided, a default "GL0B" label is assigned.

Example

+---------+-----+-------------+-------------+---------------+----------------+---------+
| useruuid| loc | start       | end         | tz_hour_start | tz_minute_start| country |
+---------+-----+-------------+-------------+---------------+----------------+---------+
| 1001    |  1 | 1704031200  | 1704034800  | 1             | 0              | DK      |
| 1001    |  2 | 1704056400  | 1704060000  | 1             | 0              | DK      |
+---------+-----+-------------+-------------+---------------+----------------+---------+
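As a sanity check on the timestamp convention, the start/end values in the example rows decode as plain Unix epoch seconds; a quick sketch in Python (the CET offset applied at the end mirrors the tz_hour_start=1 column):

```python
from datetime import datetime, timezone

# The start/end columns are Unix timestamps (seconds since epoch).
# Decoding the first example row above:
start, end = 1704031200, 1704034800

start_utc = datetime.fromtimestamp(start, tz=timezone.utc)
print(start_utc)            # 2023-12-31 14:00:00+00:00
print((end - start) // 60)  # stop duration in minutes: 60

# Applying tz_hour_start=1 (Denmark, CET) gives the local clock hour:
local_hour = (start_utc.hour + 1) % 24
print(local_hour)           # 15
```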

💡 Scalability Tip: This package involves heavy computations (e.g., window functions, UDFs). To ensure efficient parallel processing, use df.repartition("useruuid") to distribute data across partitions evenly. This reduces memory bottlenecks and improves resource utilization.

⚙️ Key Parameters

  • range_window_home (int or list): Sliding window size (in days) used to detect home locations. Suggested: 28 (range 14-112).
  • range_window_work (int or list): Sliding window size (in days) used to detect work locations. Suggested: 42 (range 14-112).
  • C_hours (float or list): Minimum fraction of night/business hourly bins with data in a day. Suggested: 0.4 (range 0.2-0.9).
  • C_days_H (float or list): Minimum fraction of days with data in a home-detection window. Suggested: 0.4 (range 0.1-0.6).
  • C_days_W (float or list): Minimum fraction of days with data in a work-detection window. Suggested: 0.5 (range 0.4-0.6).
  • f_hours_H (float or list): Minimum average fraction of night hourly bins (across days in the window) required for a location to qualify as Home. Suggested: 0.7 (range 0.5-0.9).
  • f_hours_W (float or list): Minimum average fraction of business hourly bins (across days in the window) required for a location to qualify as Work. Suggested: 0.4 (range 0.4-0.6).
  • f_days_W (float or list): Minimum fraction of days within the window a location must be visited to qualify as Work. Suggested: 0.6 (range 0.5-0.8).

All parameters listed above can also be provided as lists to explore multiple configurations in a single run.
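To picture what list-valued parameters mean, here is a minimal pure-Python sketch, assuming the configurations explored are the cross-product of the supplied lists (a common sweep convention; check the package for its exact expansion semantics):

```python
from itertools import product

# Hypothetical sweep: two home-window sizes x two home-hour thresholds.
range_window_home = [28, 56]
f_hours_H = [0.6, 0.7]

configs = [
    {"range_window_home": w, "f_hours_H": f}
    for w, f in product(range_window_home, f_hours_H)
]
print(len(configs))  # 4 configurations explored in a single run
# Each configuration would then be run through the detection pipeline,
# yielding one labeled DataFrame per entry.
```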

💡 Tuning Tip: When adjusting detection parameters, start by refining the temporal coverage filters C_days_H and C_days_W to match the characteristics of your data. Once these are well aligned, tune the estimation thresholds f_hours_H, f_hours_W, and f_days_W according to the specifics of your case study. These estimation thresholds play a major role in determining how strictly the algorithm identifies consistent home and work locations.

While we provide recommended parameter ranges to guide your exploration, the hard-coded limits in howde/config.py are intentionally more relaxed: they simply prevent nonsensical values. Inputs falling outside these hard limits will raise an error.

🔧 Other Parameters

  • edit_config_default (dict, optional): Optional dictionary that allows overriding the default settings in howde/config.py to fine-tune preprocessing and detection behavior.
    The dictionary should include parameters:

    • is_time_local — interpret timestamps as local time (True) or UTC (False)
    • min_stop_t — minimum stop duration (seconds)
    • start_hour_day, end_hour_day — hours used for home detection
    • start_hour_work, end_hour_work — hours used for work detection
    • data_for_predict — use only past data for estimation
  • output_format (str): If "stop", returns stop-level data with a location_type column and one row per stop. If "change", returns a compact DataFrame with only one row per day on which the home or work location changes.

  • verbose (bool): If True, reports processing steps.
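As an illustration, an edit_config_default override might look like the following; the keys come from the list above, while the values are purely hypothetical and should be adapted to your data:

```python
# Hypothetical overrides for the defaults in howde/config.py --
# the values below are illustrative, not recommendations.
edit_config_default = {
    "is_time_local": True,     # timestamps already expressed in local time
    "min_stop_t": 300,         # discard stops shorter than 300 seconds
    "start_hour_day": 21,      # night window used for home detection...
    "end_hour_day": 7,         # ...from 21:00 to 07:00
    "start_hour_work": 9,      # business window used for work detection...
    "end_hour_work": 17,       # ...from 09:00 to 17:00
    "data_for_predict": False, # use the full window, not only past data
}
# Passed as: HoWDe_labelling(input_data, edit_config_default=edit_config_default)
```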

📤 Returns

If a single parameter configuration is used, the function returns a PySpark DataFrame with three additional columns:

  • detect_H_loc The location ID (loc) identified as Home. Assigned if the location satisfies all filtering criteria. As such, it represents a day-level assessment, taking into account observations within a sliding window of t ± range_window_home / 2 days.
  • detect_W_loc The location ID (loc) identified as Work. Assigned if the location satisfies all filtering criteria. As such, it represents a day-level assessment, taking into account observations within a sliding window of t ± range_window_work / 2 days.
  • location_type Indicates the detected location type for each stop ('H' for Home, 'W' for Work, or 'O' for Other), based on matching the stop location to the inferred home/work labels.
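The matching step behind location_type can be pictured with a simplified sketch (this mirrors the description above, not the package's internal code):

```python
def label_stop(loc, detect_H_loc, detect_W_loc):
    """Simplified sketch of the stop-labelling rule described above."""
    if detect_H_loc is not None and loc == detect_H_loc:
        return "H"  # stop is at the inferred home location
    if detect_W_loc is not None and loc == detect_W_loc:
        return "W"  # stop is at the inferred work location
    return "O"      # any other location

print(label_stop(1, detect_H_loc=1, detect_W_loc=2))  # H
print(label_stop(3, detect_H_loc=1, detect_W_loc=2))  # O
```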

If multiple parameter configurations are provided (as lists), the function returns a list of dictionaries, each with keys:

  • configs: the parameter configuration used
  • res: the resulting labeled PySpark DataFrame (as described above)
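When lists are supplied, the returned structure can be consumed as follows; `results` below mimics the documented shape, with a placeholder string standing in for the labeled PySpark DataFrame:

```python
# Mimics the documented return shape of a multi-configuration run:
# a list of dicts with keys "configs" and "res".
results = [
    {"configs": {"range_window_home": 28}, "res": "<labeled DataFrame>"},
    {"configs": {"range_window_home": 56}, "res": "<labeled DataFrame>"},
]

for run in results:
    print(run["configs"])  # the parameter configuration used
    labeled = run["res"]   # in practice: labeled.show(), labeled.write, ...
```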

Example Usage

from pyspark.sql import SparkSession
from howde import HoWDe_labelling

# Initialize Spark session
spark = SparkSession.builder.appName('HoWDeApp').getOrCreate()

# Load your stop location data
input_data = spark.read.parquet('path_to_your_data.parquet')

# Run HoWDe labelling
labeled_data = HoWDe_labelling(
    input_data,
    range_window_home=28,
    range_window_work=42,
    C_hours=0.4,
    C_days_H=0.4,
    C_days_W=0.5,
    f_hours_H=0.7,
    f_hours_W=0.4,
    f_days_W=0.6,
    output_format="stop",
    verbose=False,
)

# Show the results
labeled_data.show()

See more examples in /tutorials.

Data

Anonymized stop location data with true home and work labels will be available at:

De Sojo Caso, Silvia; Lucchini, Lorenzo; Alessandretti, Laura (2025). Benchmark datasets for home and work location detection: stop sequences and annotated labels. Technical University of Denmark. Dataset. https://doi.org/10.11583/DTU.28846325

License

This project is licensed under the MIT License. See the License file for details.
