Skip to content

Latest commit

 

History

History

Final Datasets

This is a subdirectory to hold the final outputs of the project. It is generated by the functions in make_final_datasets.py. There are two types of output;

  1. The first is the individual payments datsets; [payments_ccg_final.zip, payments_trust_final.zip, and payments_nhsengland_final.zip]. They contain all of the raw individual payments as rows, with reconcilied supplier data (in the form of either verif_name, or a charity or company name and number, as described below). A 'codebook' of the columns is as follows;

    • date: The date at which the payment was made -- Can variably be either settlement, delivery, or purchase date
    • expensetype: Type of procurement
    • expensearea: Broad and high level description of what is being procured
    • supplier: The (raw) name of the supplier from which services are being procured, as it appears in the raw data (other than being upper cased).
    • transactionnumber: The transaction number individual to that specific file
    • amount: The amount of the specific payment made (filtered to be >£25k where an organisation doesn't threshold internally)
    • file: The name of the file from which the raw data comes from
    • negative: A flag for whether the payment is negative or not
    • dept: The NHS organisation from which the data comes
    • query_string_n: A normalized version of the supplier string
    • verif_match: An accepted match from the automatic recogniliation process
    • match_type: The type of the match
    • CompanyName: The name of the company -- either as an accepted match from the automated reconciliation process, or as a manually and thoroughly determined data audit.
    • CompanyNumber: As provided on the Companies House Register
    • CompanyCategory: a link of the company number to the CompanyCategory field within the Companies House register.
    • CharityName: a manually determined and verified Charity, based on the initial pass of the reconciliation algorithm.
    • CharityRegNo: An associated CharityRegNo from the Charity Commission data.
    • CharitySubNo: An associated CharitySubNo from the Charity Commission data.
    • CharityNameNo: An associated CharityNameNo from the Charity Commission data (but not from the audit)
    • audit_type: Whether the match; be it 'CompanyName' or 'ChartityName' was determined either solely by the automatic reconciliation process and then reviewed ("1"), or via an extremely deep audit process ("2: " pertains to that done by Steve Barnard, and 3. to that done for Companies by Ben Goodair)
    • CHnotes: Notes on the charity (CharityName, not verif_match) as determined by the audit process
    • CCnotes: Notes on the company (CompanyName, not verif_match) as determined by the audit process
    • isCIC: Whether the supplier is a Community Interest Company
  2. The second is a set of tabulated payments per institution type (i.e. [supplier_charity_final.csv, supplier_company_final.csv, supplier_nhsdigital_final.csv, supplier_nameddoctor_supplier.csv, supplier_namedperson_final.csv, supplier_nomatch_final.csv]). These files are mostly used as 'does it smell right' checks, and as a way to audit the reconciliation part of the pipeline. If you are looking to make any suggestions on the reconciliations (and please do!), these files would naturally be a better place to start, because they contain unique raw--> approved reconciliations! The columns are as follows:

    • query_string_n: A normalized version of the supplier string
    • verif_match (optional): If no Charity or Company match (either approximate or audit-based), this corresponds to the institute\person.
    • CharityName (optional): a manually determined and verified Charity, based on the initial pass of the reconciliation algorithm.
    • CharityRegNo (optional): An associated CharityRegNo from the Charity Commission data.
    • CCnotes (optional): Notes on the charity (CharityName, not verif_match) as determined by the audit process
    • CompanyName:
    • CompanyNumber:
    • CHnotes: Notes on the charity (CharityName, not verif_match) as determined by the audit process
    • isCIC: Whether the supplier is a Community Interest Company
    • audit_type: Whether the match; be it 'CompanyName' or 'ChartityName' was determined either solely by the automatic reconciliation process and then reviewed ("1"), or via an extremely deep audit process ("2: " pertains to that done by Steve Barnard, and 3. to that done for Companies by Ben Goodair)
    • count: The number of payments to this query_string_n
    • amount: The amount of procurement purchased from this query_string_n
    • match_0_n: The closed suggested (ElasticSearch) match to the query_string_n field
    • match_1_n: The second closed suggested (ElasticSearch) match to the query_string_n field
    • match_2_n: The third closed suggested (ElasticSearch) match to the query_string_n field
    • score_0_n: The ElasticSearch score pertaining to the closest match
    • score_1_n: The ElasticSearch score pertaining to the second closest match
    • score_2_n: The ElasticSearch score pertaining to the third closest match
    • match_0_n_lev: The Levenshtein distance to the closed ElasticSearch match
    • match_1_n_lev: The Levenshtein distance to the second closed ElasticSearch match
    • match_2_n_lev: The Levenshtein distance to the third closed ElasticSearch match

Do note; an institution can be on multiple registers!

Do note; these files correspond directly to the match_type!

Thanks again to Steve Barnard, John Mohan, Max Hattersly, Ben Goodair and Yu Pei for all of their help with the manual verification of these files!