Documentation - Population Density API - Algorithm Initial Proposal#25
jgarciahospital merged 4 commits into main from
Conversation
I believe each provider will have their own algorithm based on their backend design, so this proposal may not be useful.
| * Requested Space is received in the API customers’ request. | ||
| ### Process is as follows: | ||
| <ol> | ||
| <li><strong>Cleanup</strong>: Repeated records will be deleted and traffic events associated with M2M lines are filtered out, as they do not contribute to the process of estimating future population density.<br> |
As discussed, since this is a high-level, easy-to-follow illustration, let's avoid terms with broad usage.
I.e. replace "M2M lines" with "IoT and non-human-worn devices".
"Traffic events" is also an unclear term. Which traffic is meant?
M2M has been replaced by "IOT (non-human wearing devices)" and the reference to "Traffic events" has been removed.
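The cleanup step resolved above (deduplication plus filtering out IoT / non-human-worn devices) could be sketched roughly as below. The record fields and the `device_type` flag are illustrative assumptions, not part of the proposal:

```python
from dataclasses import dataclass

# Hypothetical record shape; the field names are assumptions for illustration only.
@dataclass(frozen=True)
class ConnectionRecord:
    device_id: str    # pseudonymized identifier
    cell_id: str
    interval: int     # time-interval index
    device_type: str  # e.g. "handset" or "iot" (non-human-worn device)

def cleanup(records):
    """Step 1 sketch: drop exact duplicates, then filter out records
    from IoT / non-human-worn devices."""
    deduped = set(records)  # frozen dataclass is hashable, so set() deduplicates
    return [r for r in deduped if r.device_type != "iot"]

records = [
    ConnectionRecord("a1", "cell-1", 0, "handset"),
    ConnectionRecord("a1", "cell-1", 0, "handset"),  # exact duplicate, removed
    ConnectionRecord("b2", "cell-1", 0, "iot"),      # filtered out
]
print(len(cleanup(records)))  # 1
```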
| * Requested Space is received in the API customers’ request. | ||
| ### Process is as follows: | ||
| <ol> | ||
| <li><strong>Cleanup</strong>: Repeated records will be deleted and traffic events associated with M2M lines are filtered out, as they do not contribute to the process of estimating future population density.<br> |
An important step is missing (it should be step 1, I think): which data is processed? The algorithm processes some records, but which records? The whole history? Two years? One year? The last hour?
If I understood our discussion correctly, the steps should be:
- identify the day of the week and time frame for which the prediction is requested
- read location information for the same weekday and time frame for the recent 4 weeks
- clean up and the other current steps
Step 1 now includes the time frame of the records: one year of historical information.
And we illustrate the prediction (step 6) with the steps 1 and 2 you mentioned: "The algorithm identifies a day of the week and time frame for which the prediction is requested. Then, it will read the records associated for the same weekday and time frame for the previous 4 weeks".
| anonymity) are discarded.<br> | ||
| <strong>Output 3</strong> → Total number of UEs connected per cell in each time interval, | ||
| considering privacy.</li><br> | ||
| <li><strong>Spatial indexing (no personal data is processed):</strong> the space will be divided into units (grids), and associating the grids that overlap with its coverage area to each cell.<br> |
As discussed: please add a simple approach here: estimate the number of devices based on an equal-user distribution.
This is important to illustrate how this non-trivial step can be done.
| <li><strong>Distribution and aggregation:</strong> Taking into account the coverage area of the cell, users are going to be distributed, homogeneously among the grids (currently 150m x 150m or larger) associated with the cell, and the process is repeated for all cells. As we are assuming that the distribution of users in the coverage area | ||
| of each cell is uniform, we are introducing an error/noise in the distribution of users by time interval, which will be transferred to population density predictions that contribute to reducing the risk of user reidentification. Each grid on the map can typically be served by several cells of different technologies and frequency bands. To obtain the number of users per grid and time interval, an aggregation is performed.<br> | ||
| <strong>Output 5</strong> → Number of UEs per grid, based on distribution in the cells covering that area</li><br> | ||
| <li><strong>Prediction:</strong> Based on the historical information of users by grid and interval, the prediction of users by grid and interval is made for each time interval of the future. For example, the population density forecast for next Tuesday should be similar to the one observed last Tuesday. As the data becomes available, improvements |
"For example": the whole algorithm and every one of its steps is an example. So let's describe one exact solution here. Either based on the same date/time last week, or a few-weeks average, or a few-weeks weighted average (newer data has higher weight). But we need to describe one concrete option here.
Described a concrete solution in step 6: "The forecast, then, will be the average of the records of the previous 4 Tuesdays at the same time frame"
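The concrete forecast agreed here (average of the same weekday and time frame over the previous 4 weeks) could be sketched as below; the `history` structure mapping `(date, hour)` to a count is an assumed shape for illustration:

```python
from datetime import date, timedelta

def predict(history, target_day, hour):
    """Step 6 sketch: forecast = mean of the counts observed at the same
    weekday and hour over the previous 4 weeks.
    `history` maps (date, hour) -> count for one grid tile (assumed shape)."""
    samples = []
    for weeks_back in range(1, 5):
        key = (target_day - timedelta(weeks=weeks_back), hour)
        if key in history:
            samples.append(history[key])
    return sum(samples) / len(samples) if samples else None

history = {
    (date(2024, 8, 6), 10): 100,   # four consecutive Tuesdays at 10:00
    (date(2024, 8, 13), 10): 120,
    (date(2024, 8, 20), 10): 110,
    (date(2024, 8, 27), 10): 130,
}
# Forecast for the next Tuesday, 2024-09-03, at 10:00:
print(predict(history, date(2024, 9, 3), 10))  # 115.0
```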
| to this process will be proposed to consider seasonality, holidays/working days and also to be able to make revisions for short-term predictions. For example, on a typical Tuesday we observe a population density of X, but today Tuesday at 10 o'clock we are already 25% above that level, so it is foreseeable that at 11 or 12 o'clock we will also continue above that level.<br> | ||
| <strong>Output 6</strong> → Users per grid, considering historical and calendar data</li><br> | ||
| <li><strong>Extrapolation:</strong> Finally, the users of the MNO mobile network are extrapolated | ||
| taking into account the market shares by geographical area to obtain the total number of people (users or not of the mobile network with any operator).<br> |
"Market shares by geographical area": does this algorithm assume that the market share per grid cell, or for the requested area, is known (and how dynamic is it)?
For a simple illustrative algorithm that already uses many rough estimations in the previous steps, it could be good enough to consider county/state/city-level market share instead.
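A coarse, e.g. municipality-level, extrapolation as suggested above might look like the following sketch; all names and the simple division by market share are assumptions for illustration:

```python
def extrapolate(users_per_grid, market_share_by_area, grid_to_area):
    """Step 7 sketch: scale MNO users up to total population using a
    coarse (e.g. municipality-level) market share.
    Dividing by the share assumes other operators' users behave similarly."""
    return {
        grid: users / market_share_by_area[grid_to_area[grid]]
        for grid, users in users_per_grid.items()
    }

users = {"g1": 30, "g2": 10}
share = {"municipality-A": 0.25}  # MNO serves 25% of the market here (assumed)
mapping = {"g1": "municipality-A", "g2": "municipality-A"}
print(extrapolate(users, share, mapping))  # {'g1': 120.0, 'g2': 40.0}
```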
| This document serves as a reference model for the population density calculation algorithm. It provides an illustrative high-level overview of the steps proposed to determine population density. Designed for developers, this guide aims to enhance understanding and implementation of the algorithm; however, it remains the responsibility of the developer to create their own algorithm to calculate population density. | ||
| ### Inputs for the algorithm: | ||
| * UEs connection records data is received pseudonymized or hashed. |
Commonalities (https://github.com/camaraproject/Commonalities/blob/main/documentation/API-design-guidelines.md) guides us to avoid telco-specific terms. "UE" is the very first example of a term to avoid :)
Please replace it with "devices" here and in a few other places below as well.
| <strong>Output 2</strong> → Valid UEs per cell in each time interval</li><br><br> | ||
| <li><strong>Counting and Aggregation:</strong> A count of users is performed per cell and interval. | ||
| On this count, a statistical analysis is performed with the elimination of outliers. | ||
| Records associated with cells with less than K users in each time interval (k – |
This looks like too early a step to apply k-anonymity. One will need to do this again later (after the data is split per grid) anyway.
Suggestion: apply k-anonymity to the final result only.
to be further reviewed offline first
This has been reviewed with our legal team, and the right place to do it is at this point, because doing it in a later step introduces noise into the estimation. The reason is that, as the users are distributed homogeneously (among the grids associated with the cell), the k-anonymity will be biased, since the distribution is not a 100% faithful representation. So, from a legal point of view and to avoid introducing this noise, the right step to do so is step 3, as this is the last point at which the data reflects reality 100%.
This approach can cause "over-anonymization" at the API level, but the legal team's opinion is a strong argument.
Then we have to change the k-anonymity threshold from "number of devices" to "density",
like "Records associated with cells with less than K devices per square km in each time interval".
The current text can be read as k-anonymity being "X users per cell".
But:
- cells can have different sizes;
- the requested area can cover only half of a cell, so "100 devices in the cell" will be scaled down to 50 devices in the next step.
Explicitly describing the k-anonymity threshold as devices/km² addresses both problems.
As mentioned, the legal recommendation is to keep applying the k-anonymity over the users in a cell in step 3 (Counting and Aggregation).
Implementing this at a later stage could lead to two potential problems: (1) small towns with low populations might be excluded from the estimation if large cells cover these entire towns, where the small population is dispersed across multiple geohashes, and (2) there could be inaccuracies in the estimation. If the population density is lower than the k-anonymity threshold, applying k-anonymity to the geohashes and using the threshold as the population value could result in the estimated population being greater than the real population of the country. Conversely, if we choose to return a value of 0, the overall population density of the entire country would appear lower than it actually is. Therefore, the proposal of implementing the k-anonymity at a later stage is not being considered.
On the other hand, we find it reasonable to implement k-anonymity based on km² and time intervals. This approach will be incorporated into the Algorithm Proposal for assessment by the respective legal departments and in accordance with the local regulations of each country where it will be applied.
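A density-based k-anonymity filter as agreed above (threshold expressed in devices per km² per time interval) could be sketched as follows; the input shapes are illustrative assumptions:

```python
def apply_k_anonymity(cell_counts, cell_area_km2, k_per_km2):
    """Step 3 sketch: discard per-cell, per-interval counts whose density
    falls below the k-anonymity threshold, expressed as devices per km^2
    so that cells of different sizes are treated consistently."""
    kept = {}
    for (cell, interval), count in cell_counts.items():
        density = count / cell_area_km2[cell]
        if density >= k_per_km2:
            kept[(cell, interval)] = count
    return kept

counts = {("cell-1", 0): 100, ("cell-2", 0): 100}
areas = {"cell-1": 1.0, "cell-2": 50.0}  # cell-2 density: 2 devices/km2
print(apply_k_anonymity(counts, areas, k_per_km2=10))
# {('cell-1', 0): 100}
```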
| anonymity) are discarded.<br> | ||
| <strong>Output 3</strong> → Total number of UEs connected per cell in each time interval, | ||
| considering privacy.</li><br> | ||
| <li><strong>Spatial indexing (no personal data is processed):</strong> the space will be divided into units (grids), to associate the grids that overlap with each cell coverage area.<br> |
One cannot divide space into a grid; one can map space onto a grid.
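Mapping space onto a regular grid, as suggested, could be illustrated with a minimal sketch that assumes a metric projection and reduces each cell's coverage area to an axis-aligned bounding box (a real implementation would use actual coverage polygons):

```python
import math

GRID_M = 150.0  # tile size in metres (the proposal mentions 150m x 150m or larger)

def tiles_overlapping(x_min, y_min, x_max, y_max):
    """Step 4 sketch: map a cell's coverage (simplified here to a bounding
    box in metric coordinates) onto the regular grid, returning the
    (col, row) indices of the overlapping tiles."""
    c0 = math.floor(x_min / GRID_M)
    c1 = math.floor((x_max - 1e-9) / GRID_M)  # epsilon keeps exact edges out
    r0 = math.floor(y_min / GRID_M)
    r1 = math.floor((y_max - 1e-9) / GRID_M)
    return [(c, r) for c in range(c0, c1 + 1) for r in range(r0, r1 + 1)]

# A 300m x 150m coverage box spans two 150m tiles:
print(tiles_overlapping(0, 0, 300, 150))  # [(0, 0), (1, 0)]
```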
| <li><strong>Spatial indexing (no personal data is processed):</strong> the space will be divided into units (grids), to associate the grids that overlap with each cell coverage area.<br> | ||
| <strong>Output 4</strong> → Regular grid covering the required area, over the cells coverage | ||
| areas.</li><br> | ||
| <li><strong>Distribution and aggregation:</strong> Taking into account the coverage area of the cell, users are going to be distributed homogeneously among the grids (currently 150m x 150m or larger) associated with the cell, and the process is repeated for all cells to estimate the number of devices based on equal users distribution. As we are assuming that the distribution of users in the coverage area of each cell is uniform, we are introducing an error/noise in the distribution of users by time interval, which will be transferred to population density predictions that contribute to reducing the risk of user reidentification. Each grid on the map can typically be served by several cells of different technologies and frequency bands. To obtain the number of users per grid and time interval, an aggregation is performed.<br> | ||
| <strong>Output 5</strong> → Number of UEs per grid, based on distribution in the cells covering that area</li><br> | ||
| <li><strong>Prediction:</strong> Based on the historical information of users by grid and interval, the prediction of users by grid and interval is made for each time interval of the future. The algorithm identifies a day of the week and time frame for which the prediction is requested. Then, it will read the records associated for the same weekday and time frame for the previous 4 weeks. |
In this step the language changes to "users". I assume we are still dealing with devices here.
No, this step is indeed where the estimation of users is calculated, taking as input the historical information and also the records of devices obtained in previous steps.
If this step is about users, then could you please clarify:
- why does it say "process is repeated for all cells to estimate the number of devices"?
- it is not just "Distribution and aggregation" and should probably be split into two steps: "Distribution and aggregation" for devices, and then estimating users (or the other way around)
- please add a hint on how to estimate the number of users based on the number of devices. Based on what information?
- define a "user" (e.g. a person has 2 smartphones from the same operator: are they a single user or 2? And what if this person gives one of the phones to someone else for a day?)
To avoid confusion, we will simplify the wording to "and the process is repeated for all cells". Additionally, to clarify, a user corresponds to an IMSI, so when we mention users, we are referring to IMSIs. This is the option with the fewest corner cases and guarantees the best estimation possible.
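The uniform distribution and aggregation step discussed in this thread could be sketched as below, assuming each cell's count is spread equally over its associated grid tiles and that several cells may cover the same tile:

```python
from collections import defaultdict

def distribute_and_aggregate(cell_counts, cell_to_grids):
    """Step 5 sketch: spread each cell's count uniformly over the grid
    tiles associated with its coverage area, then sum per tile, since a
    tile is typically served by several cells."""
    per_grid = defaultdict(float)
    for cell, count in cell_counts.items():
        grids = cell_to_grids[cell]
        for g in grids:
            per_grid[g] += count / len(grids)
    return dict(per_grid)

counts = {"cell-1": 90, "cell-2": 40}
coverage = {"cell-1": ["g1", "g2", "g3"], "cell-2": ["g3", "g4"]}
print(distribute_and_aggregate(counts, coverage))
# {'g1': 30.0, 'g2': 30.0, 'g3': 50.0, 'g4': 20.0}
```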
| <li><strong>Prediction:</strong> Based on the historical information of users by grid and interval, the prediction of users by grid and interval is made for each time interval of the future. The algorithm identifies a day of the week and time frame for which the prediction is requested. Then, it will read the records associated for the same weekday and time frame for the previous 4 weeks. | ||
| The population density forecast for next Tuesday should be similar to the one observed in the last four Tuesdays. The forecast, then, will be the average of the records of the previous 4 Tuesdays at the same time frame.<br> | ||
| <strong>Output 6</strong> → Users per grid, considering historical and calendar data</li><br> | ||
| <li><strong>Extrapolation:</strong> Finally, the users of the MNO mobile network are extrapolated |
As far as I understand, step 6 predicts the number of devices, and now we extrapolate the number of devices (not users) to population.
As per the previous comment, users are obtained in step 6.
| <strong>Output 6</strong> → Users per grid, considering historical and calendar data</li><br> | ||
| <li><strong>Extrapolation:</strong> Finally, the users of the MNO mobile network are extrapolated | ||
| taking into account the market shares by geographical area (municipality) to obtain the total number of people (users or not of the mobile network with any operator).<br> | ||
| <strong>Output 7</strong> → Total population per grid, considering extrapolation towards MNO market share. From step 2 onwards, this is aggregated data that does not contain personal information (this is anonymous data). |
And this is where we must apply k-anonymity. It must be applied to the predicted population at the grid-tile level.
to be further reviewed offline first
@gregory1g new commit created based on the agreed text in https://wiki.camaraproject.org/display/CAM/%5Bdraft%5D+2024-08-28+Population+Density+Data+-+Meeting+Minutes
lgtm |
What type of PR is this?
Add one of the following kinds:
What this PR does / why we need it:
Including initial proposal for API algorithm
Which issue(s) this PR fixes:
Fixes #12