What it does

We set out to predict missing values in data generated by indoor air quality (IAQ) sensors during outages. The dataset contains building IDs, IAQ sensor device IDs, and the chemical concentrations, temperatures, and relative humidities the sensors recorded during 2023.

We also propose a method for detecting anomalies in the IAQ sensor data, which the sensors can use to identify and flag abnormal air conditions.

How we built it

Data Analysis on Time Series Data: Digging into the time series data taught us about trends, seasonality, cyclical features, and more. We worked through the Kaggle Time Series course to get up to speed quickly.
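As an illustration of what cyclical and calendar features can look like, here is a minimal sketch. The function name `time_features` and the chosen feature names are hypothetical, not taken from our pipeline; the sine/cosine pair is a common way to tell a model that 23:59 and 00:00 are neighbours.

```python
import math
from datetime import datetime


def time_features(ts: datetime) -> dict:
    """Turn one timestamp into simple calendar and cyclical features."""
    # Map the time of day onto a circle so late night and early morning sit close together.
    hour_angle = 2 * math.pi * (ts.hour + ts.minute / 60) / 24
    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "day_of_week": ts.weekday(),           # Monday = 0 ... Sunday = 6
        "is_weekend": int(ts.weekday() >= 5),
        "day_of_year": ts.timetuple().tm_yday,  # coarse position in the yearly cycle
    }


features = time_features(datetime(2023, 6, 15, 14, 30))
```

Features like these can be appended to each sensor reading before it is passed to a tree-based model.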

K-Fold Mean Target Encoding: This was the feature that made the biggest difference in our MSE score, yet we didn't know it existed before this competition. While analyzing the data with the 3x4 scatter plots, we noticed strong hidden patterns, but we didn't have enough features to tell what was causing them.

We considered adding external data, but that wouldn't have helped much because we didn't know the precise geographical locations of the buildings and sensors. Instead, we noticed that different subsets of the data had completely different correlations with the measured value. We were hesitant to use the means of these subsets as features, though, because that sounded like overfitting and felt like bad practice.

Then we learned about target encoding, and how K-fold target encoding reduces the risk of data leakage and overfitting by computing each row's encoding only from rows in the other folds. This was a game changer for us.
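The idea can be sketched in plain Python. This is a minimal illustration, not our competition code: the function name `kfold_target_encode` and its parameters are hypothetical, and `categories` stands in for something like a device ID column.

```python
import random
from collections import defaultdict


def kfold_target_encode(categories, targets, n_splits=5, seed=0):
    """Replace each row's category with the mean target of that category,
    computed only from rows in the other folds (out-of-fold mean)."""
    n = len(categories)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]

    encoded = [0.0] * n
    for held_out in folds:
        held = set(held_out)
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        # Accumulate per-category statistics from the training folds only.
        for i in range(n):
            if i not in held:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        fallback = total / seen if seen else 0.0
        for i in held_out:
            c = categories[i]
            # Categories unseen in the training folds fall back to the training mean.
            encoded[i] = sums[c] / counts[c] if counts[c] else fallback
    return encoded


device_ids = ["d1", "d1", "d1", "d2", "d2", "d2"]
readings = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
encoded = kfold_target_encode(device_ids, readings, n_splits=3)
```

Because a row's own target never contributes to its encoded value, the model cannot simply read the answer back from the feature. At prediction time, rows with a missing target would typically be encoded with the category means from the full training set.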

Isolation Forest: On a smaller note, although we were familiar with popular supervised tree-based methods like Random Forests and XGBoost, we were not aware that there is an unsupervised tree-based method for anomaly detection called Isolation Forest.

Since both are built on decision trees, we didn't have to repeat many steps for the anomaly detection phase: we expected Isolation Forest to capture underlying patterns similar to those XGBoost captured in our primary objective.
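To show why random trees can flag anomalies, here is a toy Isolation Forest in plain Python. It is a simplified sketch for illustration, not the implementation we used (in practice a library version such as scikit-learn's `IsolationForest` is the sensible choice); the function names and the sample readings are made up, and the code assumes at least two input points.

```python
import math
import random


def _avg_path(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def _build(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split identical values
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            _build(left, depth + 1, max_depth, rng),
            _build(right, depth + 1, max_depth, rng))


def _path_length(tree, x, depth=0):
    if tree[0] == "leaf":
        return depth + _avg_path(tree[1])
    _, dim, split, left, right = tree
    return _path_length(left if x[dim] < split else right, x, depth + 1)


def isolation_scores(points, n_trees=100, sample_size=64, seed=0):
    """Return one anomaly score per point; values closer to 1 are more anomalous."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = max(1, math.ceil(math.log2(sample_size)))
    trees = [_build(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = _avg_path(sample_size)
    return [2.0 ** (-sum(_path_length(t, x) for t in trees) / n_trees / norm)
            for x in points]


# Hypothetical (temperature, relative humidity) readings plus one abnormal reading.
readings = [(20 + (i % 5) * 0.1, 40 + (i % 7) * 0.1) for i in range(60)] + [(35.0, 90.0)]
scores = isolation_scores(readings)
```

Points that sit far from the rest are separated after very few random splits, so their average path length is short and their score is high. A threshold on the score then decides which readings to flag.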

What's next for Quadreal Challenge:

Given more time, we could improve this solution in the following ways:

K-Means: K-means clustering would essentially compress building_id and device_id into a lower-dimensional space, giving us more data per cluster and, we expect, better performance.
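A minimal sketch of that idea, assuming each device is first summarized by a small profile vector (here a made-up mean temperature and mean humidity). The `kmeans` function and the device names are hypothetical; the resulting cluster label could replace or complement the raw device ID as a feature.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical per-device profiles: (mean temperature, mean relative humidity).
profiles = {
    "dev_a": (21.0, 40.0),
    "dev_b": (21.5, 41.0),
    "dev_c": (28.0, 70.0),
    "dev_d": (27.5, 69.0),
}
labels, centers = kmeans(list(profiles.values()), k=2)
device_cluster = dict(zip(profiles, labels))
```

Devices with similar profiles end up sharing a cluster label, so statistics such as target-encoded means can be pooled across them.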

Time Series Hybrid Models: Subtracting the linear trend and the cyclical seasonal patterns with a linear model, and then using tree-based models to predict the remaining error, would in theory make it easier for XGBoost to handle patterns in the data.

We didn't have the time to take a rigorous approach to this, so it is a good future improvement.
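The first stage of such a hybrid could look like the sketch below, which removes a least-squares linear trend and per-phase seasonal means and returns the residuals. It covers only the detrending and deseasonalizing step; the residuals would then become the target for a tree-based model. The function names, the fixed `period`, and the synthetic series are assumptions for illustration.

```python
def fit_trend(y):
    """Ordinary least-squares line through (0, y[0]), (1, y[1]), ..."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def decompose(y, period):
    """Split a series into linear trend, seasonal means, and residuals."""
    slope, intercept = fit_trend(y)
    detrended = [v - (intercept + slope * t) for t, v in enumerate(y)]
    # Average the detrended values at each position within the cycle.
    seasonal = []
    for phase in range(period):
        values = detrended[phase::period]
        seasonal.append(sum(values) / len(values))
    residuals = [d - seasonal[t % period] for t, d in enumerate(detrended)]
    return slope, intercept, seasonal, residuals


# Synthetic series: upward trend plus a repeating 4-step pattern.
pattern = [1.0, -1.0, 2.0, -2.0]
series = [2.0 + 0.5 * t + pattern[t % 4] for t in range(40)]
slope, intercept, seasonal, residuals = decompose(series, period=4)
```

A final prediction would add the three parts back together: the trend line, the seasonal mean for that phase, and the tree model's estimate of the residual.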

External Data using Domain Knowledge: In the future, we could create additional features based on domain knowledge, such as buildings' HVAC schedules or how the chemicals behave under certain air conditions. Our domain knowledge was limited in this area, which made it hard to confidently generate features based on external data.
