This repository provides an easy way to load UK Biobank data. It is composed of a pre-processing script, which converts the UK Biobank data into parquets that are easier to read, and a library that provides different methods to access the data.
To install this package, simply run
pip install ukbiobank-loaders
Please note that Python 3.7 or newer is needed.
We will now describe how to use this library. Please note that data can be read from both local directories and AWS S3 directories.
These are the UK Biobank files that are needed in order to run the pre-processing, all saved in the same directory <DATA_FOLDER>:
death.txt
death_cause.txt
gp_clinical.txt
gp_scripts.txt
hesin.txt
hesin_diag.txt
hesin_oper.txt
The withdrawn consent file is also needed:
withdrawn_consent.txt
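Before starting the pre-processing, it can be useful to confirm that every file listed above is present. The following is a minimal sketch for a local <DATA_FOLDER> only; the helper name missing_raw_files is hypothetical and is not part of this package.

```python
from pathlib import Path

# File names copied from the list of required UK Biobank files above.
REQUIRED_FILES = [
    "death.txt",
    "death_cause.txt",
    "gp_clinical.txt",
    "gp_scripts.txt",
    "hesin.txt",
    "hesin_diag.txt",
    "hesin_oper.txt",
]


def missing_raw_files(data_folder):
    """Return the required file names that are absent from data_folder.

    Hypothetical helper for illustration; it only checks a local directory.
    """
    folder = Path(data_folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

An empty list means all seven files were found. The withdrawn consent file is passed to the script by its own path, so it is not checked here.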
From the terminal, run
update_data.py --raw_dir <DATA_FOLDER> --withdrawn_file <WITHDRAWN_CONSENT_FILE_PATH> --out_dir <OUTPUT_DIR_FOLDER>
The processed data will be saved in a folder named <OUTPUT_DIR_FOLDER>/final.
We found this process to take about 14 minutes on a pod with 4 CPUs and 32GB of RAM. If the process stops with a Killed message, it might be because there is not enough RAM available.
This is a simple example of how to use the library. Specific documentation about the methods is given below.
>>> from ukbb_loaders.loaders import load
>>> dl = load.DataLoader(data_dir = "<OUTPUT_DIR_FOLDER>/final")
>>> dl.get_hospital_data("icd10")
     date_of_visit  source  feature  value
eid
68      1986-04-22   icd10     N181      1
68      1945-05-03   icd10     N181      1
68      1950-04-03   icd10     N181      1
68      1966-08-07   icd10     N181      1
67      1991-03-12   icd10     N181      1
..             ...     ...      ...    ...
73             NaT   icd10     N181      1
48      1997-06-20   icd10     N181      1
48      1945-03-05   icd10     N181      1
48      1956-02-25   icd10     N181      1
48      1981-04-08   icd10     N181      1
def load_lookup(lookup_name: str) -> pd.DataFrame
Loads lookup table.
Arguments:
- lookup_name (str) - The name of the lookup table to be loaded.
Returns:
- (pd.DataFrame) - The lookup table of interest.
Example: Load lookup of ICD10 diagnosis codes:
load_lookup("ehr_diagnosis_icd10")
def load_mapper(mapper_name: str) -> pd.DataFrame
Loads ontology mapper.
Arguments:
- mapper_name (str) - The name of the mapper to be loaded.
Returns:
- (pd.DataFrame) - The mapper of interest.
Example: Load mapping from ICD10 codes to Phecodes:
load_mapper("icd10_to_phecodes")
Loaders for versioned UKBB data.
class DataLoader()
def __init__(data_dir: str)
Class for loading UKBB data.
Arguments:
- data_dir (str) - The path to the directory containing the processed data. Note that on Windows the path must have forward slashes, e.g. "C:/Users/john/Documents/data_dir"
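On Windows, paths copied from the file explorer use backslashes. One way to obtain the forward-slash form is the standard library's pathlib, as in this small sketch; the helper name to_forward_slashes is hypothetical and is not part of this package.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(windows_path):
    """Convert a Windows-style path string to one that uses forward slashes."""
    # PureWindowsPath parses Windows paths on any operating system.
    return PureWindowsPath(windows_path).as_posix()


data_dir = to_forward_slashes(r"C:\Users\john\Documents\data_dir")
print(data_dir)  # C:/Users/john/Documents/data_dir
```

The resulting string can then be passed as data_dir when constructing the DataLoader.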
def get_hospital_data(source: Union[str, List[str]],
                      level=None,
                      patient_list: np.ndarray = None) -> pd.DataFrame
Method that fetches hospital data for the UKBB population.
Arguments:
- source (str or list) - The coding/representation/source we would like to fetch. It needs to be one or more of:
  - icd10 - for fetching all icd10 related diagnoses.
  - icd9 - for fetching all icd9 related diagnoses.
  - opcs3 - for fetching all opcs3 related operational codes.
  - opcs4 - for fetching all opcs4 related operational codes.
- level (list or str) - The level/significance of diagnoses we would like to fetch. It needs to be one or more of:
  - primary - for fetching only the primary code related to one diagnosis.
  - secondary - for fetching all the secondary (complementary) codes for one diagnosis.
  - external - for fetching diagnosis codes from external sources.
  Defaults to all of them.
- patient_list (np.ndarray) - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.
Returns:
- df (pd.DataFrame) - A long canonical dataframe with patients as the index and the following columns:
  - date_of_visit: pandas datetime for each hospital visit
  - feature: the different codes used (e.g. the different icd10 codes)
  - source: the source the feature refers to (e.g. icd10)
  - value: the occurrence value for each row combination (initially 1)
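Because the result is a long dataframe indexed by patient, ordinary pandas operations apply to it. The sketch below uses a small invented frame with the columns described above, standing in for the output of get_hospital_data("icd10"); the values and the chosen code prefix are illustrative assumptions, not real UK Biobank data.

```python
import pandas as pd

# Toy stand-in for the frame returned by get_hospital_data("icd10").
# All values are invented for illustration.
df = pd.DataFrame(
    {
        "date_of_visit": pd.to_datetime(
            ["1986-04-22", "1945-05-03", "1991-03-12", None]
        ),
        "source": ["icd10"] * 4,
        "feature": ["N181", "N181", "I10", "N189"],
        "value": [1, 1, 1, 1],
    },
    index=pd.Index([68, 68, 67, 73], name="eid"),
)

# Keep only the rows whose code starts with the chosen prefix.
n18 = df[df["feature"].str.startswith("N18")]

# Earliest dated visit per patient. Missing dates (NaT) are ignored by min(),
# so a patient with no dated visit at all ends up with NaT.
first_visit = n18.groupby(level="eid")["date_of_visit"].min()
```

With real data, df would instead come from dl.get_hospital_data("icd10"), and the rest of the code would stay the same as long as the column names match those listed above.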
def get_death_data(level=None,
                   patient_list: np.ndarray = None) -> pd.DataFrame
Method that fetches death information for the UKBB population.
Arguments:
- level (list or str) - The level/significance of deaths we would like to fetch. It needs to be one or both of: primary (main reason of death), secondary. Defaults to both.
- patient_list (np.ndarray) - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.
Returns:
- df (pd.DataFrame) - A long canonical dataframe with patients as the index and all recorded death information including death date in the right format.
def get_gp_clinical_data(source=None, patient_list: np.ndarray = None)
Method that fetches GP diagnosis information for the UKBB population.
Arguments:
- source (str or list) - Whether to load read_2, read_3 or both. Defaults to both.
- patient_list (np.ndarray) - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.
Returns:
- df (pd.DataFrame) - A long canonical dataframe with patients as the index and all recorded GP information including date in the right format.
def get_gp_medication_data(patient_list: np.ndarray = None) -> pd.DataFrame
Method that fetches GP medication data for the UKBB population.
Arguments:
- patient_list (np.ndarray) - The patients to fetch medication data for. If this is empty, all UKBB patients will be used.
Returns:
- df (pd.DataFrame) - A canonical long dataframe with patients as the index and features as columns.
This package is developed using the UK Biobank Resource under Application Number 43138.