- Developed confidence score tuning process (see `disambiguation_1880` folder)
- Developed interpolation process (see `interpolation` folder for code and `interpolation_notebooks` folder for Jupyter notebooks that document the process)
- For an overview of the process and next steps, see Google Drive: `HNYC_Project/Projects/spatial_linkage/Spatial Linkage & Interpolation: Summer 2020.ppt` and `HNYC_Project/Projects/spatial_linkage/Spatial Linkage and Interpolation Workflow.doc`
Accomplished:
- Updated confidence score to include census conflicts (see `disambiguation_2.ipynb`)
- Merged lat lng data to matched data -> produced `matched.csv`
- Experimented with methods to add spatial weights, including graph-based and cluster-based approaches on a subset of the data (see `spatial-disambiguation.ipynb`)
- Outlined overall workflow for disambiguation using bipartite graph matching algorithm (`linkage-disambiguation.ipynb`)
- Ran algorithms on full dataset & obtained initial performance metrics
- Functionalized process using disambiguation module
- Tuned algorithms and compared metrics + recommendations
- Updated ES matching process to reflect metaphone matching

See Google Drive Documentation > disambiguation > Spatial Linkage: Spring 2020 for slides.
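
To illustrate the bipartite matching idea behind the disambiguation workflow, here is a minimal sketch. It is not the code in `linkage-disambiguation.ipynb`. It assumes each candidate pair (City Directory record, census record) already has a confidence score, and it picks the one-to-one assignment with the highest total score by exhaustive search. The function name `best_one_to_one`, the record IDs, and the scores are all hypothetical.

```python
from itertools import permutations


def best_one_to_one(cd_ids, census_ids, score):
    """Return the one-to-one assignment with the highest total score.

    score maps (cd_id, census_id) -> confidence for candidate pairs only;
    pairs missing from score are not allowed to match.
    """
    # Pad with None so that any CD record may stay unmatched.
    options = list(census_ids) + [None] * len(cd_ids)
    best_total, best = -1.0, {}
    for perm in permutations(options, len(cd_ids)):
        pairs = list(zip(cd_ids, perm))
        # Skip assignments that use a pair that is not a candidate match.
        if any(c is not None and (cd, c) not in score for cd, c in pairs):
            continue
        total = sum(score[(cd, c)] for cd, c in pairs if c is not None)
        if total > best_total:
            best_total = total
            best = {cd: c for cd, c in pairs if c is not None}
    return best, best_total


# Hypothetical scores: CD1 and CD2 both have census record C1 as a candidate.
scores = {("CD1", "C1"): 0.9, ("CD1", "C2"): 0.8, ("CD2", "C1"): 0.85}
assignment, total = best_one_to_one(["CD1", "CD2"], ["C1", "C2"], scores)
print(assignment, total)
```

Choosing the top-scoring candidate for each record independently would give C1 to both CD records. Optimizing the total score across the group resolves that conflict. Exhaustive search is only practical for very small groups; a full run would need a proper assignment solver such as `scipy.optimize.linear_sum_assignment`.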
Two sources:
- City Directory: name (first, last, initial), address, ID, occupation, ED, ward, block number (constructed)
- Census: address (hidden during match process), ID (different from CD), occupation, ED, ward, age, gender, dwelling, household characteristics
Steps:
0. Preprocessing: changing names to their phonetic index
1. Initial entity link (to generate possible matches): criteria = same ED + JW (Jaro-Winkler) dist < 2 on indexed name
2. Disambiguation (to choose between non-unique matches):
   a. Generate a confidence score based on occupation (having occupation), age > 12 (not implemented), JW dist of name, relative probability (number of non-unique matches)
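
The linking and scoring steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation. The record fields (`id`, `ed`, `name_index`, `occupation`), the function names, and the sample phonetic strings are hypothetical. The `max_dist` value of 0.2 is a placeholder threshold, not the project's setting. The equal-weight average in `add_confidence` is an assumed stand-in for the real formula, and the age check is left out because it is listed as not implemented.

```python
from collections import Counter, defaultdict


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                break
    m = sum(used1)
    if m == 0:
        return 0.0
    a = [c for c, u in zip(s1, used1) if u]
    b = [c for c, u in zip(s2, used2) if u]
    t = sum(x != y for x, y in zip(a, b)) // 2  # transpositions
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler similarity: Jaro plus a bonus for a shared prefix (max 4)."""
    sim = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1[:4], s2[:4]):
        if x != y:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def candidate_links(cd_records, census_records, max_dist=0.2):
    """Pair records that share an ED and have JW distance below max_dist."""
    by_ed = defaultdict(list)
    for rec in census_records:
        by_ed[rec["ed"]].append(rec)
    links = []
    for cd in cd_records:
        for cen in by_ed.get(cd["ed"], []):
            dist = 1 - jaro_winkler(cd["name_index"], cen["name_index"])
            if dist < max_dist:
                links.append({
                    "cd_id": cd["id"],
                    "census_id": cen["id"],
                    "jw_dist": dist,
                    "has_occupation": bool(cd["occupation"]) and bool(cen["occupation"]),
                })
    return links


def add_confidence(links):
    """Assumed formula: average of occupation flag, name similarity, 1/n candidates."""
    n_candidates = Counter(link["cd_id"] for link in links)
    for link in links:
        rel_prob = 1 / n_candidates[link["cd_id"]]
        link["confidence"] = (link["has_occupation"] + (1 - link["jw_dist"]) + rel_prob) / 3
    return links


# Hypothetical records; name_index stands in for a phonetic (metaphone) code.
cd = [{"id": "CD1", "ed": 12, "name_index": "JNSM0", "occupation": "clerk"}]
census = [
    {"id": "C1", "ed": 12, "name_index": "JNSM0", "occupation": "clerk"},
    {"id": "C2", "ed": 12, "name_index": "JNSMT", "occupation": ""},
    {"id": "C3", "ed": 13, "name_index": "JNSM0", "occupation": "clerk"},
]
links = sorted(add_confidence(candidate_links(cd, census)),
               key=lambda link: link["confidence"], reverse=True)
for link in links:
    print(link)
```

Grouping census records by ED first means names are only compared within the same enumeration district, which keeps the number of comparisons manageable.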
Similar, but no ED and address data, only ward data in the census.
- `/ES_matching` contains all work related to Elasticsearch
  - `readme.md` gives guidelines on how to implement the process
  - `fall_2019_analysis.md` describes the work done up to fall 2019
  - `/doc`: relevant documentation of the process
  - `/src`: source code for running Elasticsearch on our data
  - `explore_output.ipynb`: quick validation of the ES output (targeted at understanding whether metaphone matching was implemented in ES process)
- `/disambiguation_1850` contains all work related to disambiguation of 1850 data
  - `disambiguation_1850_v1.ipynb` runs disambiguation process on 1850 ES output
- `/disambiguation_1880` contains all work related to disambiguation of 1880 data
  - `Confidence_Score_Tuning.ipynb`: documents confidence score tuning process
  - `Confidence_Score_Tuning_v02.ipynb`: confidence score tuning results used for 1850 v02 disambiguation run (10/2020); uses old version of 1880 data because of data issues with new 1880 data
  - `Confidence_Score_Tuning_new_1880_data_draft.ipynb`: attempt to tune confidence score with new 1880 data; revealed that there was an issue with the data
  - `/_archived`: archived scripts
    - `/confidence_score`
      - `preprocess.ipynb`: preprocessing of data, including generation of metaphones
      - `get_confidence_score.ipynb`: outlines process for generating confidence score, including calculation of Jaro-Winkler
      - `disambiguation_analysis.ipynb`: EDA on confidence scores; generates `match_results_confidence_score.csv` and `fall_2019_disambiguation_report.md`
    - `/linkage_eda`
      - `spatial-disambiguation.ipynb`: documentation of different spatial weight algorithms
      - `linkage-disambiguation.ipynb`: outline of record linkage approach (conceptually valid but code is outdated)
      - `linkage_full_run_v1.ipynb`: applies basic algorithm to the whole dataset + initial performance analysis
      - `linkage_eda.ipynb`: applies various iterations of algorithm to the whole dataset + conclusions
      - `linkage_eda_v2.ipynb`: applies updated geocodes on best 2 algorithms + improved benchmarking
  - `run_link_records.ipynb`: implemented record matching using PySpark
  - `confidence_score_latlng.ipynb`: adding of census conflicts to confidence score, merging of lat lng data (contains most updated confidence score formula)
  - `linkage_full_run_SPRING_LATEST.ipynb`: informed by linkage EDA (see archive), generates latest disambiguated output from ES matching (with metaphone issue fixed)
- `/interpolation_notebooks`
  - Process_Documentation:
    - `Disambiguation_Analysis_v01.ipynb`: resolves dwelling conflicts, calculates statistics, explores distance based sequences, and interpolation between known dwellings
    - `Interpolation_v01.ipynb`: runs through current version of predicting unknown records
    - `Disambiguation_Analysis_v02.ipynb`: same information as v01, for new data (10/2020)
    - `Interpolation_v02.ipynb`: same information as v01, for new data (10/2020)
  - Concepts_and_Development:
    - `Block and Centroid Prediction with Analysis.ipynb`: walks through approaches to predicting block numbers directly, and then clusters (tests different clustering algorithms)
    - `Block Centroids and What They Represent, 1850.ipynb`: creates block centroids and illustrates them with visualizations
    - `Dwelling Addresses Fill In and Conflict Resolution Development.ipynb`: development of conflict resolution within dwelling process
    - `Developing Distance Based Sequences.ipynb`: process of developing distance based sequences
    - `Model Comparison.ipynb`: tests a few different model options (no in depth tuning)
    - `Sequences Exploration.ipynb`: tests different iterations of sequence identification
    - `Model Exploration.ipynb`: brief experimentation with using neural networks, incomplete because of preprocessing necessary
  - Archived:
    - `1880_1850_for_Interpolation.ipynb`: explores 1880 and 1850 census datasets
    - `Feature_Exploration.ipynb`: explores some of the columns in 1880 and 1850 datasets in order to determine what they represent and if they can be used for modelling
    - `Interpolation Pilots.ipynb`: working notebook for starting explorations of options for interpolation (often moved into a separate notebook when they seem worth looking at in more depth)
    - `Linear_Model.ipynb`: creates and tests linear models for house number interpolation
    - `Modeling Comparison.ipynb`: tests different modeling approaches for house numbers (currently linear model and gradient boosting); includes haversine sequences and block numbers as features
    - `Block_Numbers Early Exploration.ipynb`: explores block number distributions/data analysis and tries using them as a feature to predict house number
    - `Street_Dictionaries.ipynb`: tried out looking at street dictionaries for dwellings in between
    - `Block Number Prediction.ipynb`: initial experiment with predicting block numbers
    - `Interpolation between known address development.ipynb`: process of looking at values between known dwellings
- `/interpolation`: see the readme within this folder for details
- `/disambiguation` is a Python module containing wrapper functions needed in the disambiguation process
  - `__init__.py` contains a Disambiguator object; when instantiated, it can be used to run the entire disambiguation process, calling functions from the files below (see `linkage_eda.ipynb` for an example of usage)
  - `preprocess.py` contains functions needed before applying disambiguation algorithms, including confidence score generation
  - `disambiguation.py` contains functions needed for disambiguation
  - `analysis.py`: wrapper functions to produce performance metrics
  - `confidence_score_tuning.py` contains functions needed for the confidence tuning process
  - `benchmarking.py` contains Benchmark objects, to run benchmarking process in confidence tuning for 1880
- `/matching_viz`: visualization web app to understand disambiguation output; see readme in folder for guidance on how to run
Data is available in the HNYC Spatial Linkage Google Drive: `HNYC_Project/Projects/spatial_linkage/Data`
- 1850 disambiguated output: `1850_disambiguated.csv`
- 1850 disambiguated output 10/2020 (current): `1850_mn_match_v02.csv`
- 1850 ES matches: `es-1850-22-9-2020.csv`
- Matches with confidence score (raw input for 1880 disambiguation processes): `matches.csv` (based on Fall 2019's Spark matching output)
- Latest ES matches: `es-1880-21-5-2020.csv`
- Latest disambiguated output: `disambiguated-21-5-2020.csv`