
Documentation - Population Density API - Algorithm Initial Proposal #25

Merged
jgarciahospital merged 4 commits into main from jgarciahospital-patch-2
Sep 11, 2024

Conversation

@jgarciahospital
Collaborator

What type of PR is this?

Add one of the following kinds:

  • documentation

What this PR does / why we need it:

Including initial proposal for API algorithm

Which issue(s) this PR fixes:

Fixes #12

@jgarciahospital added the `documentation` label (Improvements or additions to documentation) on May 14, 2024
@sachinvodafone
Collaborator

I believe each provider will have their own algorithm based on their backend design, so this proposal may not be useful.

* Requested Space is received in the API customers’ request.
### Process is as follows:
<ol>
<li><strong>Cleanup</strong>: Repeated records will be deleted and traffic events associated with M2M lines are filtered due to they do not contribute to the process of estimating future population density.<br>
Contributor

As discussed, since this is a high-level, easy-to-follow illustration, let's avoid terms which have broad usage, i.e. replace "M2M lines" with "IoT and non-human-wearing devices".

"Traffic events" is also an unclear term. Which traffic is meant?


M2M has been replaced by "IOT (non-human wearing devices)" and the reference to "Traffic events" has been removed.
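For illustration, the cleanup step discussed above could be sketched roughly as follows; the record fields (`device_id`, `is_iot`, ...) are hypothetical and not part of the proposal:

```python
from dataclasses import dataclass

@dataclass(frozen=True)          # frozen so duplicate records hash identically
class ConnectionRecord:
    device_id: str               # pseudonymized identifier
    cell_id: str
    time_slot: str               # e.g. "2024-05-14T10:00"
    is_iot: bool                 # IoT / non-human-wearing device line

def cleanup(records):
    """Drop exact duplicates and filter out IoT (non-human-wearing) devices."""
    return {r for r in records if not r.is_iot}

records = [
    ConnectionRecord("h1", "cellA", "T1", False),
    ConnectionRecord("h1", "cellA", "T1", False),  # duplicate, removed by the set
    ConnectionRecord("h2", "cellA", "T1", True),   # IoT line, filtered out
]
print(len(cleanup(records)))  # 1
```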

* Requested Space is received in the API customers’ request.
### Process is as follows:
<ol>
<li><strong>Cleanup</strong>: Repeated records will be deleted and traffic events associated with M2M lines are filtered due to they do not contribute to the process of estimating future population density.<br>
Contributor

The important step is missing (it should be step 1, I think): which data is processed? The algorithm processes some records, but which records? The whole history? Two years? One year? The last hour?
If I understood our discussion correctly, the steps should be:

  1. identify day of the week and time frame for which the prediction is requested
  2. read location information for the same weekday and time frame for the recent 4 weeks
  3. clean up and other current steps


The time frame of the records has been included in step one: one year of historical information.

And we illustrate the prediction (step 6) with the step 1 and 2 you mentioned "The algorithm identifies a day of the week and time frame for which the prediction is requested. Then, it will read the records associated for the same weekday and time frame for the previous 4 weeks".

anonymity) are discarded.<br>
<strong>Output 3</strong> &rarr; Total number of UEs connected per cell in each time interval,
considering privacy.</li><br>
<li><strong>Spatial indexing (no personal data is processed):</strong> the space will be divided into units (grids), and associating the grids that overlap with its coverage area to each cell.<br>
Contributor

As discussed: please add a simple approach here: estimate the number of devices based on an equal-user distribution. This is important to illustrate how this non-trivial step can be done.


Added in step 5.

<li><strong>Distribution and aggregation:</strong> Taking into account the coverage area of the cell, users are going to be distributed, homogeneously among the grids (currently 150m x 150m or larger) associated with the cell, and the process is repeated for all cells. As we are assuming that the distribution of users in the coverage area
of each cell is uniform, we are introducing an error/noise in the distribution of users by time interval, which will be transferred to population density predictions that contribute to reducing the risk of user reidentification. Each grid on the map can typically be served by several cells of different technologies and frequency bands. To obtain the number of users per grid and time interval, an aggregation is performed.<br>
<strong>Output 5</strong> &rarr; Number of UEs per grid, based on distribution in the cells covering that area</li><br>
<li><strong>Prediction:</strong> Based on the historical information of users by grid and interval, the prediction of users by grid and interval is made for each time interval of the future. For example, the population density forecast for next Tuesday should be similar to the one observed last Tuesday. As the data becomes available, improvements
Contributor

"For example": the whole algorithm, and every one of its steps, is an example. So let's describe one exact solution here: either based on the same date/time last week, or a few-week average, or a few-week weighted average (newer data has a higher weight). We need to describe one concrete option here.


Described a concrete solution in step 6: "The forecast, then, will be the average of the records of the previous 4 Tuesdays at the same time frame"

to this process will be proposed to consider seasonality, holidays/working days and also to be able to make revisions for short-term predictions. For example, on a typical Tuesday we observe a population density of X, but today Tuesday at 10 o'clock we are already 25% above that level, so it is foreseeable that at 11 or 12 o'clock we will also continue above that level.<br>
<strong>Output 6</strong> &rarr; Users per grid, considering historical and calendar data</li><br>
<li><strong>Extrapolation:</strong> Finally, the users of the MNO mobile network are extrapolated
taking into account the market shares by geographical area to obtain the total number of people (users or not of the mobile network with any operator).<br>
Contributor

"Market shares by geographical area": does this algorithm assume that the market share per grid cell, or for the requested area, is known (and how dynamic is it)?
For a simple illustrative algorithm which already uses many rough estimations during previous steps, it could be good enough to consider county/state/city-level market share instead.


Specified in step 7: "Municipality".
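A minimal sketch of the municipality-level extrapolation settled on here; the names and the 25% market-share figure are illustrative assumptions, not values from the proposal:

```python
def extrapolate_population(mno_users_per_grid, market_share_by_municipality,
                           municipality_of_grid):
    """Scale the MNO's user counts up to total population using the
    municipality-level market share (users / share)."""
    total = {}
    for grid, users in mno_users_per_grid.items():
        share = market_share_by_municipality[municipality_of_grid[grid]]
        total[grid] = users / share
    return total

pop = extrapolate_population(
    {"grid-1": 90, "grid-2": 40},
    {"Springfield": 0.25},  # assume the MNO serves 25% of lines there
    {"grid-1": "Springfield", "grid-2": "Springfield"},
)
print(pop)  # {'grid-1': 360.0, 'grid-2': 160.0}
```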

This document serves as a reference model for the population density calculation algorithm. It provides an illustrative high-level overview of the steps proposed to determine population density. Designed for developers, this guide is created to enhance understanding and implementation of the algorithm; nevertheless, it is the responsibility of the developer to create their own algorithm to calculate population density.

### Inputs for the algorithm:
* UEs connection records data is received pseudonymized or hashed.
@gregory1g
Contributor
Jun 10, 2024

Commonalities (https://github.com/camaraproject/Commonalities/blob/main/documentation/API-design-guidelines.md) guides us to avoid using telco-specific terms. "UE" is the very first example of a term to avoid :)

Please replace it with "devices" here and in few other places below as well

Collaborator Author

solved

<strong>Output 2</strong> &rarr; Valid UEs per cell in each time interval</li><br><br>
<li><strong>Counting and Aggregation:</strong> A count of users is performed per cell and interval.
On this count, a statistical analysis is performed with the elimination of outliers.
Records associated with cells with less than K users in each time interval (k –
Contributor

This looks like too early a step to apply k-anonymity. One will need to do this again later (after the data is split per grid) anyway.
Suggestion: apply k-anonymity to the final result only.

Collaborator Author

to be further reviewed offline first


This has been reviewed with our legal team, and the right point to do it is here, because doing it in a later step introduces noise into the estimation. The reason is that, as the users are distributed homogeneously (among the grids associated with the cell), the k-anonymity would be biased, since that distribution is not a 100% faithful representation. So, from a legal point of view, and to avoid introducing this noise, the right place is step 3, as this is the last point at which the data reflects reality 100%.

Contributor

This approach can cause "over-anonymization" on the API level, but legal team opinion is a strong argument.

Then we have to change k-anonymity threshold from "number of devices" to "density".
Like "Records associated with cells with less than K devices per square km in each time interval".
The current text can be read as k-anonymity being "X users per cell". But:

  1. cells can have different sizes.
  2. the requested area can cover only half of a cell, so "100 devices in the cell" will be scaled down to 50 devices in the next step.

Explicitly describing the k-anonymity threshold as devices/km² addresses both problems.


As mentioned, the legal recommendation is to keep applying the k-anonymity over the users in a cell in step 3 (Counting and Aggregation).

Implementing this at a later stage could lead to two potential problems: (1) small towns with low populations might be excluded from the estimation if large cells cover these entire towns, where the small population is dispersed across multiple geohashes, and (2) there could be inaccuracies in the estimation. If the population density is lower than the k-anonymity threshold, applying k-anonymity to the geohashes and using the threshold as the population value could result in the estimated population being greater than the real population of the country. Conversely, if we choose to return a value of 0, the overall population density of the entire country would appear lower than it actually is. Therefore, the proposal of implementing the k-anonymity at a later stage is not being considered.

On the other hand, we find it reasonable to implement k-anonymity based on km2 and time intervals. This approach will be incorporated into the Algorithm Proposal for assessment by the respective legal departments and in accordance with the local regulations of each country where it will be applied.
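The density-based threshold agreed on above could look roughly like this sketch; the threshold value (10 devices/km²) and the data shapes are invented for illustration:

```python
def apply_k_anonymity(counts, cell_area_km2, k_per_km2=10.0):
    """Discard (cell, interval) counts whose device density, in devices per
    square km, falls below the k-anonymity threshold."""
    return {
        (cell, interval): n
        for (cell, interval), n in counts.items()
        if n / cell_area_km2[cell] >= k_per_km2
    }

counts = {("cellA", "T1"): 120, ("cellB", "T1"): 4}
areas = {"cellA": 2.0, "cellB": 1.0}  # km²
print(apply_k_anonymity(counts, areas))  # cellB (4 devices/km²) is discarded
```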

anonymity) are discarded.<br>
<strong>Output 3</strong> &rarr; Total number of UEs connected per cell in each time interval,
considering privacy.</li><br>
<li><strong>Spatial indexing (no personal data is processed):</strong> the space will be divided into units (grids), to associate the grids that overlap with each cell coverage area.<br>
Contributor

One cannot divide space into a grid; one can map space to a grid.

Collaborator Author

Done
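One possible reading of this spatial-indexing step, mapping grid tiles to a cell's coverage area (crudely modelled here as a circle and a tile-centre test; real footprints are irregular, and all numbers are invented):

```python
import math

TILE = 150.0  # metres per grid tile side, as in the proposal

def tiles_for_cell(cx, cy, radius, tile=TILE):
    """Return indices of the grid tiles whose centre lies inside the cell's
    idealized circular coverage area (a crude overlap test)."""
    tiles = []
    i_min, i_max = int((cx - radius) // tile), int((cx + radius) // tile)
    j_min, j_max = int((cy - radius) // tile), int((cy + radius) // tile)
    for i in range(i_min, i_max + 1):
        for j in range(j_min, j_max + 1):
            tx, ty = (i + 0.5) * tile, (j + 0.5) * tile  # tile centre
            if math.hypot(tx - cx, ty - cy) <= radius:
                tiles.append((i, j))
    return tiles

print(len(tiles_for_cell(300.0, 300.0, 200.0)))  # 4 tiles for this small cell
```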

<li><strong>Spatial indexing (no personal data is processed):</strong> the space will be divided into units (grids), to associate the grids that overlap with each cell coverage area.<br>
<strong>Output 4</strong> &rarr; Regular grid covering the required area, over the cells coverage
areas.</li><br>
<li><strong>Distribution and aggregation:</strong> Taking into account the coverage area of the cell, users are going to be distributed, homogeneously among the grids (currently 150m x 150m or larger) associated with the cell, and the process is repeated for all cells to stimate the number of devices based on equal users distribution. As we are assuming that the distribution of users in the coverage area of each cell is uniform, we are introducing an error/noise in the distribution of users by time interval, which will be transferred to population density predictions that contribute to reducing the risk of user reidentification. Each grid on the map can typically be served by several cells of different technologies and frequency bands. To obtain the number of users per grid and time interval, an aggregation is performed.<br>
Contributor

"stimate" -> "estimate"

Collaborator Author

done

areas.</li><br>
<li><strong>Distribution and aggregation:</strong> Taking into account the coverage area of the cell, users are going to be distributed, homogeneously among the grids (currently 150m x 150m or larger) associated with the cell, and the process is repeated for all cells to stimate the number of devices based on equal users distribution. As we are assuming that the distribution of users in the coverage area of each cell is uniform, we are introducing an error/noise in the distribution of users by time interval, which will be transferred to population density predictions that contribute to reducing the risk of user reidentification. Each grid on the map can typically be served by several cells of different technologies and frequency bands. To obtain the number of users per grid and time interval, an aggregation is performed.<br>
<strong>Output 5</strong> &rarr; Number of UEs per grid, based on distribution in the cells covering that area</li><br>
<li><strong>Prediction:</strong> Based on the historical information of users by grid and interval, the prediction of users by grid and interval is made for each time interval of the future. The algorithm identifies a day of the week and time frame for which the prediction is requested. Then, it will read the records associated for the same weekday and time frame for the previous 4 weeks.
Contributor

In this step the language changes to "users". I assume we are still dealing with devices here.

Collaborator Author

No, this step is indeed where the estimation of users is calculated, taking as input the historical information and the device records obtained in previous steps.

Contributor

If this step is about users then could you please clarify:

  1. why does it say "the process is repeated for all cells to estimate the number of devices"?
  2. it is not just "Distribution and aggregation" and should probably be split into two steps: "Distribution and aggregation" for devices, and then estimating users (or the other way around)
  3. please add a hint on how to estimate the number of users based on the number of devices. Based on what information?
  4. define a "user" (e.g. a person has 2 smartphones from the same operator: are they a single user or two? And what if this person gives one of the phones to someone else for a day?)


To avoid confusion, we will simplify the wording as follows: "and the process is repeated for all cells". Additionally, to clarify, a user corresponds to an IMSI, so when we mention users we are referring to IMSIs. This is the option with the fewest corner cases and guarantees the best estimation possible.
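The distribution-and-aggregation step under discussion (uniform spread of each cell's count over its tiles, then summing each tile's contributions) might be sketched like this; the cell/tile names and counts are invented:

```python
from collections import defaultdict

def distribute_and_aggregate(users_per_cell, tiles_per_cell):
    """Spread each cell's user count uniformly over its associated grid tiles,
    then sum the contributions each tile receives from overlapping cells."""
    per_tile = defaultdict(float)
    for cell, n in users_per_cell.items():
        tiles = tiles_per_cell[cell]
        for t in tiles:
            per_tile[t] += n / len(tiles)
    return dict(per_tile)

result = distribute_and_aggregate(
    {"cellA": 100, "cellB": 60},
    {"cellA": ["g1", "g2"], "cellB": ["g2", "g3", "g4"]},
)
print(result["g2"])  # 70.0 (50 from cellA + 20 from cellB)
```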

<li><strong>Prediction:</strong> Based on the historical information of users by grid and interval, the prediction of users by grid and interval is made for each time interval of the future. The algorithm identifies a day of the week and time frame for which the prediction is requested. Then, it will read the records associated for the same weekday and time frame for the previous 4 weeks.
The population density forecast for next Tuesday should be similar to the one observed in the last four Tuesdays. The forecast, then, will be the average of the records of the previous 4 Tuesdays at the same time frame.<br>
<strong>Output 6</strong> &rarr; Users per grid, considering historical and calendar data</li><br>
<li><strong>Extrapolation:</strong> Finally, the users of the MNO mobile network are extrapolated
Contributor

As far as I understand step 6 predicts number of devices, and now we extrapolate number of devices (not users) to population

Collaborator Author

As in the previous comment, users are obtained from step 6.
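The concrete forecast chosen for step 6 (the average of the same weekday and time frame over the previous 4 weeks) reduces to something like this sketch; the dates and counts are invented:

```python
from datetime import datetime, timedelta
from statistics import mean

def forecast(history, target_dt, weeks=4):
    """Predict the count for one grid tile at target_dt as the average of the
    counts observed at the same weekday and time frame in the previous weeks."""
    samples = [history[target_dt - timedelta(weeks=w)]
               for w in range(1, weeks + 1)
               if target_dt - timedelta(weeks=w) in history]
    return mean(samples)

target = datetime(2024, 9, 17, 10)  # a Tuesday at 10:00
history = {target - timedelta(weeks=w): n
           for w, n in zip(range(1, 5), [80, 90, 100, 110])}
print(forecast(history, target))  # 95
```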

<strong>Output 6</strong> &rarr; Users per grid, considering historical and calendar data</li><br>
<li><strong>Extrapolation:</strong> Finally, the users of the MNO mobile network are extrapolated
taking into account the market shares by geographical area (municipality) to obtain the total number of people (users or not of the mobile network with any operator).<br>
<strong>Output 7</strong> &rarr; Total population per grid, considering extrapolation towards MNO market share. From step 2 onwards, this is aggregated data that does not contain personal information (this is anonymous data).
Contributor

And this is when we must apply "k-anonymity". It must be applied to predicted population on a grid tile level.

Collaborator Author

to be further reviewed offline first


As explained in previous comment.


@gregory1g
Contributor

lgtm

@jgarciahospital jgarciahospital merged commit ab5a4d7 into main Sep 11, 2024
@jgarciahospital jgarciahospital deleted the jgarciahospital-patch-2 branch September 11, 2024 12:14
