The required packages are installed as a new conda environment, including both the R and Python dependencies, with the following command:

```
conda env create -f requirements_conda.yml
```
⚠️ Using `mamba` instead of `conda` is faster and more stable for package installation.
The missing R packages are listed in the "requirements_r.rda" file and can be installed with the following commands (run inside an `R` session):

```R
load("requirements_r.rda")
for (count in 1:length(installedpackages)) {
  install.packages(installedpackages[count])
}
```
⚠️ For `reticulate`: if asked to create a default Python virtual environment, the answer should be `no` so that the default conda environment is used instead.
- Set `DEBUG` to `FALSE`. `N_SIMULATIONS` is set to the range (1, 100)
- With `N_CPU` > 1, parallel processing is used
- The list of methods contains (`marginal`, `permfit`, `cpi`, `cpi_rf`, `gpfi`, `gopfi`, `dgi` and `goi`)
- `n_samples` is set to `1000` and `n_features` is set to `50`
- `rho_group` lists all the correlation strengths in this experiment (0, 0.2, 0.5, 0.8)
- The number of permutations/samples `n_perm` is set to `100`
- The output csv file is found in `results/results_csv`
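The serial/parallel switch described above can be sketched with `joblib`; the `run_one` function and the reduced method list below are hypothetical stand-ins for the actual simulation routine:

```python
from itertools import product

from joblib import Parallel, delayed

N_CPU = 2                       # > 1 enables parallel processing
rho_group = [0, 0.2, 0.5, 0.8]  # correlation strengths from the experiment
methods = ["marginal", "permfit", "cpi"]  # subset, for illustration

def run_one(method, rho):
    # Hypothetical stand-in for a single simulation run.
    return {"method": method, "rho": rho}

# One job per (method, correlation strength) pair, dispatched by joblib.
results = Parallel(n_jobs=N_CPU)(
    delayed(run_one)(m, r) for m, r in product(methods, rho_group)
)
print(len(results))  # 12: one record per (method, rho) pair
```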
- The csv files are prepared with the R script `plot_simulations_all` under `[AUC-type1error-power-time_bars]_blocks_100_grps.csv`
- The plotting is done in `plots/plot_figure_simulations_grps.ipynb` with:
  - `Figure 1` for Figure 2 in the main text
  - `Power + Time + Prediction scores` for Figure 6 in the supplement
  - `Figure 1 Calibration` for Figure 5 in the supplement
- We use `compute_simulations_groups.py`. The script can be launched with the following command:

```
python -u compute_simulations_groups.py --n 1000 --pgrp 100 --nblocks 10 --intra 0.8 --inter 0.8 --conditional 1 --stacking 1 --f 1 --s 100 --njobs 1
```

- `--n` stands for the number of samples (Default `1000`)
- `--pgrp` stands for the number of variables per group (Default `100`)
- `--nblocks` stands for the number of blocks/groups in the data structure (Default `10`)
- `--intra` stands for the intra correlation inside the groups (Default `0.8`)
- `--inter` stands for the inter correlation between the groups (Default `0.8`)
- `--conditional` stands for the use of CPI (`1`) or PI (`0`)
- `--stacking` stands for the use of stacking (`1`) or not (`0`)
- `--f` stands for the first point of the range (Default `1`)
- `--s` stands for the step size, i.e. the range size (Default `100`)
- `--njobs` stands for the serial/parallel implementation with `Joblib` (Default `1`)
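The `--nblocks`/`--pgrp`/`--intra`/`--inter` options describe a block-correlated data structure. A minimal NumPy sketch of such a generator is shown below; it is an assumption about the data model, not the repository's exact code, and uses `inter = 0.2` rather than the default `0.8` so the two correlation levels are visible:

```python
import numpy as np

def block_covariance(nblocks, pgrp, intra, inter):
    """Covariance matrix with `intra` correlation inside each group of
    `pgrp` variables and `inter` correlation between groups (unit variances)."""
    p = nblocks * pgrp
    cov = np.full((p, p), inter)
    for b in range(nblocks):
        s = slice(b * pgrp, (b + 1) * pgrp)
        cov[s, s] = intra  # within-group correlation
    np.fill_diagonal(cov, 1.0)
    return cov

rng = np.random.default_rng(0)
cov = block_covariance(nblocks=10, pgrp=100, intra=0.8, inter=0.2)
# n = 1000 samples, p = nblocks * pgrp = 1000 variables
X = rng.multivariate_normal(np.zeros(cov.shape[0]), cov,
                            size=1000, method="cholesky")
print(X.shape)  # (1000, 1000)
```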
- The output csv file is found in `results/results_csv` under `[AUC-type1error-power-time_bars]_blocks_100_groups_CPI_n_1000_p_1000_1::100_folds_2.csv`
- The plotting is done in `plots/plot_figure_simulations_grps.ipynb` with `Compare Stacking vs Non Stacking` for Figure 3 in the main text
- The data are the public data from `UKBB`; an agreement must be signed before using them (any personal data are already removed)
- The `biomarker` is set by default to `age`
- `n_jobs` stands for serial/parallel computations
- `k_fold_bbi` stands for the number of folds for the internal cross-validation of the method
- `k_fold` stands for the number of folds for the train/test splitting of the original data
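The `k_fold` / `k_fold_bbi` pair above describes a standard nested cross-validation. A minimal sketch with scikit-learn follows; the data and the fold counts chosen here are placeholders for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # placeholder data

k_fold = 10      # outer folds: train/test splitting of the original data
k_fold_bbi = 2   # inner folds: internal cross-validation of the method

outer = KFold(n_splits=k_fold, shuffle=True, random_state=0)
n_inner_fits = 0
for train_idx, test_idx in outer.split(X):
    # the method is tuned/validated on inner splits of the outer train set
    inner = KFold(n_splits=k_fold_bbi, shuffle=True, random_state=0)
    for fit_idx, val_idx in inner.split(X[train_idx]):
        n_inner_fits += 1  # one model fit per inner split

print(n_inner_fits)  # 10 outer folds x 2 inner folds = 20 fits
```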
- The $\underline{representative}$ p-value will be `2 * median(p-values)` across the 10 folds
- As for the $\underline{performance}$, it is measured on the 10% test-set split per fold
- The output csv files are found in `results/results_csv` under `Result_UKBB_age_all_imp_10_outer_2_inner_PERF.csv` and `Result_UKBB_age_all_imp_10_outer_2_inner_SIGN.csv`
- The plotting is done in `plots/plot_figure_simulations_grps.ipynb` with `Figure 3` for Figure 4 in the main text
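The representative p-value rule (twice the median across the 10 folds) can be illustrated with a short snippet; the per-fold p-values below are made up for the example:

```python
import numpy as np

# Hypothetical per-fold p-values for one variable across the 10 outer folds
fold_pvals = np.array([0.01, 0.02, 0.005, 0.03, 0.015,
                       0.008, 0.02, 0.01, 0.04, 0.012])

# Representative p-value: twice the median across folds
representative = 2 * np.median(fold_pvals)
print(representative)
```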