Documentation - Population Density API - Algorithm Initial Proposal#25
jgarciahospital merged 4 commits into main from
Conversation
I believe each provider will have their own algorithm based on their backend design, so this proposal may not be useful.
| * Requested Space is received in the API customers’ request. | ||
| ### Process is as follows: | ||
| <ol> | ||
| <li><strong>Cleanup</strong>: Repeated records will be deleted and traffic events associated with M2M lines are filtered out, as they do not contribute to the process of estimating future population density.<br> |
As discussed, since this is a high-level, easy-to-follow illustration, let's avoid terms with broad usage.
I.e. replace "M2M lines" with "IoT and non-human-worn devices".
"Traffic events" is also an unclear term. Which traffic is meant?
M2M has been replaced by "IOT (non-human wearing devices)" and the reference to "Traffic events" has been removed.
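The cleanup step resolved above (deduplication plus filtering out IoT / non-human-worn devices) could be sketched roughly as below. The record fields and the `device_type` flag are illustrative assumptions, not part of the proposal:

```python
from dataclasses import dataclass

# Hypothetical record shape; the field names are assumptions for illustration only.
@dataclass(frozen=True)
class ConnectionRecord:
    device_id: str    # pseudonymized identifier
    cell_id: str
    interval: int     # time-interval index
    device_type: str  # e.g. "handset" or "iot" (non-human-worn device)

def cleanup(records):
    """Step 1 sketch: drop exact duplicates, then filter out records
    from IoT / non-human-worn devices."""
    deduped = set(records)  # frozen dataclass is hashable, so set() deduplicates
    return [r for r in deduped if r.device_type != "iot"]

records = [
    ConnectionRecord("a1", "cell-1", 0, "handset"),
    ConnectionRecord("a1", "cell-1", 0, "handset"),  # exact duplicate, removed
    ConnectionRecord("b2", "cell-1", 0, "iot"),      # filtered out
]
print(len(cleanup(records)))  # 1
```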
| * Requested Space is received in the API customers’ request. | ||
| ### Process is as follows: | ||
| <ol> | ||
| <li><strong>Cleanup</strong>: Repeated records will be deleted and traffic events associated with M2M lines are filtered out, as they do not contribute to the process of estimating future population density.<br> |
An important step is missing (it should be step 1, I think): which data is processed? The algorithm processes some records, but which records? The whole history? Two years? One year? The last hour?
If I understood our discussion correctly, the steps should be:
- identify the day of the week and time frame for which the prediction is requested
- read location information for the same weekday and time frame for the recent 4 weeks
- clean up and the other current steps
Step 1 now includes the time frame of the records: one year of historical information.
And we illustrate the prediction (step 6) with the steps 1 and 2 you mentioned: "The algorithm identifies a day of the week and time frame for which the prediction is requested. Then, it will read the records associated for the same weekday and time frame for the previous 4 weeks".
| anonymity) are discarded.<br> | ||
| <strong>Output 3</strong> → Total number of UEs connected per cell in each time interval, | ||
| considering privacy.</li><br> | ||
| <li><strong>Spatial indexing (no personal data is processed):</strong> the space will be divided into units (grids), and associating the grids that overlap with its coverage area to each cell.<br> |
As discussed: please add a simple approach here: estimate the number of devices based on an equal-user distribution.
This is important to illustrate how this non-trivial step can be done.
| <li><strong>Distribution and aggregation:</strong> Taking into account the coverage area of the cell, users are going to be distributed, homogeneously among the grids (currently 150m x 150m or larger) associated with the cell, and the process is repeated for all cells. As we are assuming that the distribution of users in the coverage area | ||
| of each cell is uniform, we are introducing an error/noise in the distribution of users by time interval, which will be transferred to population density predictions that contribute to reducing the risk of user reidentification. Each grid on the map can typically be served by several cells of different technologies and frequency bands. To obtain the number of users per grid and time interval, an aggregation is performed.<br> | ||
| <strong>Output 5</strong> → Number of UEs per grid, based on distribution in the cells covering that area</li><br> | ||
| <li><strong>Prediction:</strong> Based on the historical information of users by grid and interval, the prediction of users by grid and interval is made for each time interval of the future. For example, the population density forecast for next Tuesday should be similar to the one observed last Tuesday. As the data becomes available, improvements |
"For example": the whole algorithm and every one of its steps is an example. So let's describe one exact solution here. Either based on the same date/time last week, or a few-weeks average, or a few-weeks weighted average (newer data has higher weight). But we need to describe one concrete option here.
Described a concrete solution in step 6: "The forecast, then, will be the average of the records of the previous 4 Tuesdays at the same time frame"
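The concrete forecast agreed here (average of the same weekday and time frame over the previous 4 weeks) could be sketched as below; the `history` structure mapping `(date, hour)` to a count is an assumed shape for illustration:

```python
from datetime import date, timedelta

def predict(history, target_day, hour):
    """Step 6 sketch: forecast = mean of the counts observed at the same
    weekday and hour over the previous 4 weeks.
    `history` maps (date, hour) -> count for one grid tile (assumed shape)."""
    samples = []
    for weeks_back in range(1, 5):
        key = (target_day - timedelta(weeks=weeks_back), hour)
        if key in history:
            samples.append(history[key])
    return sum(samples) / len(samples) if samples else None

history = {
    (date(2024, 8, 6), 10): 100,   # four consecutive Tuesdays at 10:00
    (date(2024, 8, 13), 10): 120,
    (date(2024, 8, 20), 10): 110,
    (date(2024, 8, 27), 10): 130,
}
# Forecast for the next Tuesday, 2024-09-03, at 10:00:
print(predict(history, date(2024, 9, 3), 10))  # 115.0
```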
| to this process will be proposed to consider seasonality, holidays/working days and also to be able to make revisions for short-term predictions. For example, on a typical Tuesday we observe a population density of X, but today Tuesday at 10 o'clock we are already 25% above that level, so it is foreseeable that at 11 or 12 o'clock we will also continue above that level.<br> | ||
| <strong>Output 6</strong> → Users per grid, considering historical and calendar data</li><br> | ||
| <li><strong>Extrapolation:</strong> Finally, the users of the MNO mobile network are extrapolated | ||
| taking into account the market shares by geographical area to obtain the total number of people (users or not of the mobile network with any operator).<br> |
"Market shares by geographical area": does this algorithm assume that the market share per grid cell, or for the requested area, is known (and how dynamic is it)?
For a simple illustrative algorithm that already uses many rough estimations in the previous steps, it could be good enough to consider county/state/city-level market share instead.
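A coarse, e.g. municipality-level, extrapolation as suggested above might look like the following sketch; all names and the simple division by market share are assumptions for illustration:

```python
def extrapolate(users_per_grid, market_share_by_area, grid_to_area):
    """Step 7 sketch: scale MNO users up to total population using a
    coarse (e.g. municipality-level) market share.
    Dividing by the share assumes other operators' users behave similarly."""
    return {
        grid: users / market_share_by_area[grid_to_area[grid]]
        for grid, users in users_per_grid.items()
    }

users = {"g1": 30, "g2": 10}
share = {"municipality-A": 0.25}  # MNO serves 25% of the market here (assumed)
mapping = {"g1": "municipality-A", "g2": "municipality-A"}
print(extrapolate(users, share, mapping))  # {'g1': 120.0, 'g2': 40.0}
```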
| This document serves as a reference model for the population density calculation algorithm. It provides an illustrative high-level overview of the steps proposed to determine population density. Designed for developers, this guide aims to enhance understanding and implementation of the algorithm; however, it remains the responsibility of the developer to create their own algorithm to calculate population density. | ||
| ### Inputs for the algorithm: | ||
| * UEs connection records data is received pseudonymized or hashed. |
Commonalities (https://github.com/camaraproject/Commonalities/blob/main/documentation/API-design-guidelines.md) guides us to avoid telco-specific terms. "UE" is the very first example of a term to avoid :)
Please replace it with "devices" here and in a few other places below as well.
| <strong>Output 2</strong> → Valid UEs per cell in each time interval</li><br><br> | ||
| <li><strong>Counting and Aggregation:</strong> A count of users is performed per cell and interval. | ||
| On this count, a statistical analysis is performed with the elimination of outliers. | ||
| Records associated with cells with less than K users in each time interval (k – |
This looks like too early a step to apply k-anonymity. One will need to do this again later (after the data is split per grid) anyway.
Suggestion: apply k-anonymity to the final result only.
to be further reviewed offline first
This has been reviewed with our legal team, and the right place to do it is at this point, because doing it in a later step introduces noise into the estimation. The reason is that, as the users are distributed homogeneously (among the grids associated with the cell), the k-anonymity will be biased, since the distribution is not a 100% faithful representation. So, from a legal point of view and to avoid introducing this noise, the right step to do so is step 3, as this is the last point at which the data reflects reality 100%.
This approach can cause "over-anonymization" at the API level, but the legal team's opinion is a strong argument.
Then we have to change the k-anonymity threshold from "number of devices" to "density",
like "Records associated with cells with less than K devices per square km in each time interval".
The current text can be read as k-anonymity being "X users per cell".
But:
- cells can have different sizes;
- the requested area can cover only half of a cell, so "100 devices in the cell" will be scaled down to 50 devices in the next step.
Explicitly describing the k-anonymity threshold as devices/km² addresses both problems.
As mentioned, the legal recommendation is to keep applying the k-anonymity over the users in a cell in step 3 (Counting and Aggregation).
Implementing this at a later stage could lead to two potential problems: (1) small towns with low populations might be excluded from the estimation if large cells cover these entire towns, where the small population is dispersed across multiple geohashes, and (2) there could be inaccuracies in the estimation. If the population density is lower than the k-anonymity threshold, applying k-anonymity to the geohashes and using the threshold as the population value could result in the estimated population being greater than the real population of the country. Conversely, if we choose to return a value of 0, the overall population density of the entire country would appear lower than it actually is. Therefore, the proposal of implementing the k-anonymity at a later stage is not being considered.
On the other hand, we find it reasonable to implement k-anonymity based on km² and time intervals. This approach will be incorporated into the Algorithm Proposal for assessment by the respective legal departments and in accordance with the local regulations of each country where it will be applied.
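A density-based k-anonymity filter as agreed above (threshold expressed in devices per km² per time interval) could be sketched as follows; the input shapes are illustrative assumptions:

```python
def apply_k_anonymity(cell_counts, cell_area_km2, k_per_km2):
    """Step 3 sketch: discard per-cell, per-interval counts whose density
    falls below the k-anonymity threshold, expressed as devices per km^2
    so that cells of different sizes are treated consistently."""
    kept = {}
    for (cell, interval), count in cell_counts.items():
        density = count / cell_area_km2[cell]
        if density >= k_per_km2:
            kept[(cell, interval)] = count
    return kept

counts = {("cell-1", 0): 100, ("cell-2", 0): 100}
areas = {"cell-1": 1.0, "cell-2": 50.0}  # cell-2 density: 2 devices/km2
print(apply_k_anonymity(counts, areas, k_per_km2=10))
# {('cell-1', 0): 100}
```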
| anonymity) are discarded.<br> | ||
| <strong>Output 3</strong> → Total number of UEs connected per cell in each time interval, | ||
| considering privacy.</li><br> | ||
| <li><strong>Spatial indexing (no personal data is processed):</strong> the space will be divided into units (grids), to associate the grids that overlap with each cell coverage area.<br> |
One cannot divide space into a grid; one can map space onto a grid.
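Mapping space onto a regular grid, as suggested, could be illustrated with a minimal sketch that assumes a metric projection and reduces each cell's coverage area to an axis-aligned bounding box (a real implementation would use actual coverage polygons):

```python
import math

GRID_M = 150.0  # tile size in metres (the proposal mentions 150m x 150m or larger)

def tiles_overlapping(x_min, y_min, x_max, y_max):
    """Step 4 sketch: map a cell's coverage (simplified here to a bounding
    box in metric coordinates) onto the regular grid, returning the
    (col, row) indices of the overlapping tiles."""
    c0 = math.floor(x_min / GRID_M)
    c1 = math.floor((x_max - 1e-9) / GRID_M)  # epsilon keeps exact edges out
    r0 = math.floor(y_min / GRID_M)
    r1 = math.floor((y_max - 1e-9) / GRID_M)
    return [(c, r) for c in range(c0, c1 + 1) for r in range(r0, r1 + 1)]

# A 300m x 150m coverage box spans two 150m tiles:
print(tiles_overlapping(0, 0, 300, 150))  # [(0, 0), (1, 0)]
```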
| <li><strong>Spatial indexing (no personal data is processed):</strong> the space will be divided into units (grids), to associate the grids that overlap with each cell coverage area.<br> | ||
| <strong>Output 4</strong> → Regular grid covering the required area, over the cells coverage | ||
| areas.</li><br> | ||
| <li><strong>Distribution and aggregation:</strong> Taking into account the coverage area of the cell, users are going to be distributed homogeneously among the grids (currently 150m x 150m or larger) associated with the cell, and the process is repeated for all cells to estimate the number of devices based on equal users distribution. As we are assuming that the distribution of users in the coverage area of each cell is uniform, we are introducing an error/noise in the distribution of users by time interval, which will be transferred to population density predictions that contribute to reducing the risk of user reidentification. Each grid on the map can typically be served by several cells of different technologies and frequency bands. To obtain the number of users per grid and time interval, an aggregation is performed.<br> | ||
| <strong>Output 5</strong> → Number of UEs per grid, based on distribution in the cells covering that area</li><br> | ||
| <li><strong>Prediction:</strong> Based on the historical information of users by grid and interval, the prediction of users by grid and interval is made for each time interval of the future. The algorithm identifies a day of the week and time frame for which the prediction is requested. Then, it will read the records associated for the same weekday and time frame for the previous 4 weeks. |
In this step the language changes to "users". I assume we are still dealing with devices here.
No, this step is indeed where the estimation of users is calculated, taking as input the historical information and also the records of devices obtained in previous steps.
If this step is about users, then could you please clarify:
- why does it say "process is repeated for all cells to estimate the number of devices"?
- it is not just "Distribution and aggregation" and should probably be split into two steps: "Distribution and aggregation" for devices, and then estimating users (or the other way around)
- please add a hint on how to estimate the number of users based on the number of devices. Based on what information?
- define a "user" (e.g. a person has 2 smartphones from the same operator: are they a single user or 2? And what if this person gives one of the phones to someone else for a day?)
To avoid confusion, we will simplify the wording to "and the process is repeated for all cells". Additionally, to clarify, a user corresponds to an IMSI, so when we mention users, we are referring to IMSIs. This is the option with the fewest corner cases and guarantees the best estimation possible.
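The uniform distribution and aggregation step discussed in this thread could be sketched as below, assuming each cell's count is spread equally over its associated grid tiles and that several cells may cover the same tile:

```python
from collections import defaultdict

def distribute_and_aggregate(cell_counts, cell_to_grids):
    """Step 5 sketch: spread each cell's count uniformly over the grid
    tiles associated with its coverage area, then sum per tile, since a
    tile is typically served by several cells."""
    per_grid = defaultdict(float)
    for cell, count in cell_counts.items():
        grids = cell_to_grids[cell]
        for g in grids:
            per_grid[g] += count / len(grids)
    return dict(per_grid)

counts = {"cell-1": 90, "cell-2": 40}
coverage = {"cell-1": ["g1", "g2", "g3"], "cell-2": ["g3", "g4"]}
print(distribute_and_aggregate(counts, coverage))
# {'g1': 30.0, 'g2': 30.0, 'g3': 50.0, 'g4': 20.0}
```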
| <li><strong>Prediction:</strong> Based on the historical information of users by grid and interval, the prediction of users by grid and interval is made for each time interval of the future. The algorithm identifies a day of the week and time frame for which the prediction is requested. Then, it will read the records associated for the same weekday and time frame for the previous 4 weeks. | ||
| The population density forecast for next Tuesday should be similar to the one observed in the last four Tuesdays. The forecast, then, will be the average of the records of the previous 4 Tuesdays at the same time frame.<br> | ||
| <strong>Output 6</strong> → Users per grid, considering historical and calendar data</li><br> | ||
| <li><strong>Extrapolation:</strong> Finally, the users of the MNO mobile network are extrapolated |
As far as I understand, step 6 predicts the number of devices, and now we extrapolate the number of devices (not users) to population.
As per the previous comment, users are obtained in step 6.
| <strong>Output 6</strong> → Users per grid, considering historical and calendar data</li><br> | ||
| <li><strong>Extrapolation:</strong> Finally, the users of the MNO mobile network are extrapolated | ||
| taking into account the market shares by geographical area (municipality) to obtain the total number of people (users or not of the mobile network with any operator).<br> | ||
| <strong>Output 7</strong> → Total population per grid, considering extrapolation towards MNO market share. From step 2 onwards, this is aggregated data that does not contain personal information (this is anonymous data). |
And this is where we must apply k-anonymity. It must be applied to the predicted population at the grid-tile level.
to be further reviewed offline first
@gregory1g new commit created based on the agreed text in https://wiki.camaraproject.org/display/CAM/%5Bdraft%5D+2024-08-28+Population+Density+Data+-+Meeting+Minutes
lgtm |
What type of PR is this?
Add one of the following kinds:
What this PR does / why we need it:
Including initial proposal for API algorithm
Which issue(s) this PR fixes:
Fixes #12