The Open Bio Ref Dataset provides genomic reference data and software packages for use with Galaxy and Bioconductor applications. The reference data is available for hundreds of reference genomes and has been formatted for use with a variety of tools. The available configuration files make this data easy to incorporate into a local Galaxy server without additional data preparation. Additionally, experiment data for R / Bioconductor is provided as packages that can be downloaded and installed into an R environment.
Broadly, the data is organized as versioned objects for reference data, and data packages for use with the R programming language. The reference data is primarily intended for use with the Galaxy application, and has been formatted for easy configuration. In the future, additional formats may be provided. The packages are intended for use with R / Bioconductor.
The reference data consists mainly of genome sequence builds and associated prebuilt indexes for a variety of tools available for the Galaxy platform. In addition, Galaxy-specific configuration files are available that make the data discoverable by Galaxy tools. For the time being, the reference data is available only by mounting the data via a CernVM-FS (CVMFS) client and configuring Galaxy to use the mount as its reference data path.
Examples of types of data included:
- twoBit (.2bit) and FASTA (.fa) sequence files
- Bowtie 2 and BWA indexes
- Mutation Annotation Format (.maf) files
- SAMTools FASTA indexes (.fai)
The following folder structure is available once the data is mounted via CVMFS. There are two primary directories in the data repository:
- /managed: Data generated with Galaxy Data Managers, organized by data table (index format), then by genome build.
- /byhand: Data generated prior to the existence/use of Data Managers, manually curated.
These directories have somewhat different structures:
- /managed is organized by index type, then by genome build (Galaxy dbkey)
- /byhand is organized by genome build, then by index type
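
The two layouts differ only in the order of the path components. The following is a minimal sketch of that difference, not a listing of the actual repository: the mount point `/cvmfs/example.repo`, the index name `bowtie2_index`, and the build name `hg38` are all hypothetical placeholders, and real directory names may differ.

```python
from pathlib import PurePosixPath


def managed_path(root, index_type, dbkey):
    """/managed layout: index type first, then genome build (Galaxy dbkey)."""
    return PurePosixPath(root) / "managed" / index_type / dbkey


def byhand_path(root, dbkey, index_type):
    """/byhand layout: genome build first, then index type."""
    return PurePosixPath(root) / "byhand" / dbkey / index_type


# Hypothetical mount point and names, for illustration only.
ROOT = "/cvmfs/example.repo"
print(managed_path(ROOT, "bowtie2_index", "hg38"))
print(byhand_path(ROOT, "hg38", "bowtie2_index"))
```

In practice the paths recorded in the .loc files are the authoritative ones, so a helper like this is mainly useful for browsing the mount by hand.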
Both directories contain a location subdirectory, each of which contains a tool_data_table_conf.xml file:
- /managed/location/tool_data_table_conf.xml
- /byhand/location/tool_data_table_conf.xml
Galaxy consumes these tool_data_table_conf.xml files and the .loc "location" files they reference. The paths contained in these files are valid once the data from this repository is mounted via CVMFS.
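To illustrate how a data table configuration and its .loc file relate, here is a small sketch that reads both with the Python standard library. The XML and .loc samples are illustrative stand-ins written for this example (table name, columns, and paths are assumptions, not copied from the repository); only the general shape, a `<table>` element with a `<columns>` list and a `<file path="...">` pointing at a tab-separated .loc file, follows Galaxy's usual data table format.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; the table name and paths are hypothetical.
SAMPLE_CONF = """<tables>
    <table name="bowtie2_indexes" comment_char="#">
        <columns>value, dbkey, name, path</columns>
        <file path="/cvmfs/example.repo/managed/location/bowtie2_indices.loc" />
    </table>
</tables>"""

# Illustrative .loc content: tab-separated fields, '#' starts a comment line.
SAMPLE_LOC = (
    "# value\tdbkey\tname\tpath\n"
    "hg38\thg38\tHuman (hg38)\t/cvmfs/example.repo/managed/bowtie2_index/hg38/hg38\n"
)


def read_tables(conf_xml):
    """Map each table name to (column names, path of its .loc file)."""
    tables = {}
    for table in ET.fromstring(conf_xml).iter("table"):
        columns = [c.strip() for c in table.findtext("columns", "").split(",")]
        file_elem = table.find("file")
        loc_path = file_elem.get("path") if file_elem is not None else None
        tables[table.get("name")] = (columns, loc_path)
    return tables


def read_loc(text, columns, comment_char="#"):
    """Turn .loc lines into dicts keyed by column name, skipping comments."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(comment_char):
            continue
        rows.append(dict(zip(columns, line.split("\t"))))
    return rows


columns, loc_path = read_tables(SAMPLE_CONF)["bowtie2_indexes"]
for row in read_loc(SAMPLE_LOC, columns):
    print(row["dbkey"], "->", row["path"])
```

Galaxy performs this lookup itself once the configuration files are registered; the sketch is only meant to show why the paths must resolve under the CVMFS mount.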
Data experiment packages are large-scale resources representing experimental results that serve as pedagogical aids, benchmarks, or reference resources to enable _R_ users to conduct genomic studies. Example data sets range from bulk RNA-seq to recent single-cell genomic analyses, with many other data types represented. Data packages will be available via the HTTP or S3 protocols.
Data packages are arranged as objects presented in a standardized naming scheme that follows a pattern known as a 'CRAN-style' repository. This organization facilitates direct use of the data packages in R, without additional software requirements. The basic format is:
- /packages/<version>/data/experiment/<OS-specific path>/PACKAGES containing an index of available packages for a particular version and operating system.
- /packages/<version>/data/experiment/<OS-specific path>/<package_name> providing the actual data package.
- The data/experiment component may be replaced with data/annotation or bioc, depending on the nature of the package: data/experiment packages provide existing experiments; data/annotation packages provide resources analogous to the Galaxy-oriented reference data; bioc packages provide software for computing on experiment or annotation data.
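
The path pattern above can be assembled mechanically. The following is a sketch under stated assumptions: the base URL `https://example-bucket.example.org` is a hypothetical placeholder, and `src/contrib` is the conventional source-package path in CRAN-style repositories, assumed here rather than confirmed for this dataset.

```python
# The three package categories named in the layout description.
CATEGORIES = ("data/experiment", "data/annotation", "bioc")


def packages_index_url(base_url, version, category, os_path):
    """Build the URL of the PACKAGES index for one version, category and OS path."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    parts = [
        base_url.rstrip("/"),
        "packages",
        version,
        category,
        os_path.strip("/"),
        "PACKAGES",
    ]
    return "/".join(parts)


# Hypothetical base URL; 'src/contrib' is the usual CRAN-style source path.
print(packages_index_url(
    "https://example-bucket.example.org", "3.14", "data/experiment", "src/contrib"
))
```

Rejecting unknown categories early keeps a typo such as `data/experiments` from producing a URL that silently returns nothing.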
The <version> reflects the (semi-annual) Bioconductor release in use, e.g., the current release as of 1 March 2022 is 3.14; a new release 3.15 is anticipated in May 2022. The <OS-specific path> supports packages for use under Windows, macOS, and Linux environments; cloud-based deployments will use Linux packages. The <package_name> includes the package version within the release, as well as information about compression and other details relevant to use.
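As a rough illustration of what a package file name encodes, the sketch below splits a name into package, version, and archive type. It assumes the common CRAN-style convention of `<name>_<version>` followed by `.tar.gz` (source), `.zip` (Windows binary), or `.tgz` (macOS binary); the file name `examplePkg_1.2.3.tar.gz` is hypothetical, and names in this dataset may carry details the pattern does not cover.

```python
import re

# <name>_<version>.<ext>; names start with a letter, versions are dotted integers.
PACKAGE_FILE = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z0-9.]*)"
    r"_(?P<version>[0-9]+(?:[.-][0-9]+)+)"
    r"\.(?P<ext>tar\.gz|tgz|zip)$"
)


def split_package_file(filename):
    """Return (name, version, extension), or None if the name does not match."""
    match = PACKAGE_FILE.match(filename)
    if match is None:
        return None
    return match.group("name"), match.group("version"), match.group("ext")


# Hypothetical file name, for illustration only.
print(split_package_file("examplePkg_1.2.3.tar.gz"))
```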
The CRAN-style repository layout of the data experiment resources makes it straightforward for users to discover and use available resources with standard R / Bioconductor commands, simply by setting the 'BioC_mirror' _R_ option to point to the path of the versioned resource, e.g., for the current release https://<S3-bucket>/packages/3.14.
_R / Bioconductor_ packages in the current release branch may have bug fixes applied, and those in the current development branch may undergo frequent changes as new features are added. The _R / Bioconductor_ build system updates packages on a regular schedule, and it is essential that the AWS resources remain synchronized with the latest build system products. As such, in addition to the periodic releases, package updates will be pushed to the AWS Open Data bucket as needed.