This is a subdirectory to hold the final outputs of the project. It is generated by the functions in make_final_datasets.py. There are two types of output;
-
The first is the individual payments datsets; [payments_ccg_final.zip, payments_trust_final.zip, and payments_nhsengland_final.zip]. They contain all of the raw individual payments as rows, with reconcilied supplier data (in the form of either verif_name, or a charity or company name and number, as described below). A 'codebook' of the columns is as follows;
date: The date at which the payment was made -- Can variably be either settlement, delivery, or purchase dateexpensetype: Type of procurementexpensearea: Broad and high level description of what is being procuredsupplier: The (raw) name of the supplier from which services are being procured, as it appears in the raw data (other than being upper cased).transactionnumber: The transaction number individual to that specific fileamount: The amount of the specific payment made (filtered to be >£25k where an organisation doesn't threshold internally)file: The name of the file from which the raw data comes fromnegative: A flag for whether the payment is negative or notdept: The NHS organisation from which the data comesquery_string_n: A normalized version of the supplier stringverif_match: An accepted match from the automatic recogniliation processmatch_type: The type of the matchCompanyName: The name of the company -- either as an accepted match from the automated reconciliation process, or as a manually and thoroughly determined data audit.CompanyNumber: As provided on the Companies House RegisterCompanyCategory: a link of the company number to the CompanyCategory field within the Companies House register.CharityName: a manually determined and verified Charity, based on the initial pass of the reconciliation algorithm.CharityRegNo: An associated CharityRegNo from the Charity Commission data.CharitySubNo: An associated CharitySubNo from the Charity Commission data.CharityNameNo: An associated CharityNameNo from the Charity Commission data (but not from the audit)audit_type: Whether the match; be it 'CompanyName' or 'ChartityName' was determined either solely by the automatic reconciliation process and then reviewed ("1"), or via an extremely deep audit process ("2: " pertains to that done by Steve Barnard, and 3. to that done for Companies by Ben Goodair)CHnotes: Notes on the charity (CharityName, not verif_match) as determined by the audit processCCnotes: Notes on the company (CompanyName, not verif_match) as determined by the audit processisCIC: Whether the supplier is a Community Interest Company
-
The second is a set of tabulated payments per institution type (i.e. [supplier_charity_final.csv, supplier_company_final.csv, supplier_nhsdigital_final.csv, supplier_nameddoctor_supplier.csv, supplier_namedperson_final.csv, supplier_nomatch_final.csv]). These files are mostly used as 'does it smell right' checks, and as a way to audit the reconciliation part of the pipeline. If you are looking to make any suggestions on the reconciliations (and please do!), these files would naturally be a better place to start, because they contain unique raw--> approved reconciliations! The columns are as follows:
query_string_n: A normalized version of the supplier stringverif_match(optional): If no Charity or Company match (either approximate or audit-based), this corresponds to the institute\person.CharityName(optional): a manually determined and verified Charity, based on the initial pass of the reconciliation algorithm.CharityRegNo(optional): An associated CharityRegNo from the Charity Commission data.CCnotes(optional): Notes on the charity (CharityName, not verif_match) as determined by the audit processCompanyName:CompanyNumber:CHnotes: Notes on the charity (CharityName, not verif_match) as determined by the audit processisCIC: Whether the supplier is a Community Interest Companyaudit_type: Whether the match; be it 'CompanyName' or 'ChartityName' was determined either solely by the automatic reconciliation process and then reviewed ("1"), or via an extremely deep audit process ("2: " pertains to that done by Steve Barnard, and 3. to that done for Companies by Ben Goodair)count: The number of payments to this query_string_namount: The amount of procurement purchased from this query_string_nmatch_0_n: The closed suggested (ElasticSearch) match to thequery_string_nfieldmatch_1_n: The second closed suggested (ElasticSearch) match to thequery_string_nfieldmatch_2_n: The third closed suggested (ElasticSearch) match to thequery_string_nfieldscore_0_n: The ElasticSearch score pertaining to the closest matchscore_1_n: The ElasticSearch score pertaining to the second closest matchscore_2_n: The ElasticSearch score pertaining to the third closest matchmatch_0_n_lev: The Levenshtein distance to the closed ElasticSearch matchmatch_1_n_lev: The Levenshtein distance to the second closed ElasticSearch matchmatch_2_n_lev: The Levenshtein distance to the third closed ElasticSearch match
Do note; an institution can be on multiple registers!
Do note; these files correspond directly to the match_type!
Thanks again to Steve Barnard, John Mohan, Max Hattersly, Ben Goodair and Yu Pei for all of their help with the manual verification of these files!