Skip to content

selkamand/sigshared

Repository files navigation

sigshared

Lifecycle: experimental CRAN status R-CMD-check

sigshared contains general utilities used in many other R packages in the sigverse

Unless you’re developing / maintaining a sigverse-compatible package, there is probably very little reason to ever download this package explicitly. We recommend just installing the full sigverse as described here

Installation

You can install the development version of sigshared from GitHub with:

# install.packages("devtools")
devtools::install_github("selkamand/sigshared")

Sigverse Data Types

Data Structure Requirements
Signature

The profile of a mutational signature

data.frames with 3 columns

  1. channel
  2. type
  3. fraction
Signature Collections Lists of signature data.frames, where name of list entry is the name of the signature
Signature Annotations

Signature level annotations. data.frames with 4 required columns:

  1. signature
  2. aetiology
  3. class
  4. subclass

class and subclass of aetiology do not need to conform to any specific ontology. However, we include the data-dictionary used by sigstash collections below (see Signature Aetiology Classes)

Catalogue

The mutational profile of a sample, described by tallying mutations belonging to each mutational channel. Catalogues are not always observational. They can also be simulated from signature models. data.frames with 4 required columns:

  1. channel

  2. type

  3. fraction

  4. count

Catalogue Collections Lists of catalogue data.frames (1 per sample) where name represents a sample identifier.
Cohort Signature Analysis Results

data.frame with 4 columns:

  1. sample

  2. signature

3 . contribution_absolute

  1. contribution

  2. p_value (see `?sigstats::sig_com p

u te_experimental_p_value()`)

Bootstraps

data.frame with 1 row per signature per bootstrap

  1. bootstrap
  2. signature 3 . contribution_absolute
  3. contribution (percentage)
Model Specification Named numeric vector where names represent signatures and values represent their proportional contribution to the model.
Cohort Metadata

Data frames describing sample-level metadata with required columns:

  1. sample
  2. disease

Can include additional columns with other metadata.

UMAP

Data frames representing UMAP coordinates with at least 3 columns:

  1. sample
  2. dim1
  3. dim2
Similarity Against Cohort

data.frame that describes how similar a sample catalogue is to others in the cohort. Contains 2 columns:

  1. sample
  2. cosine_similarity

Assertions

You can assert an object belongs to each of data types

library(sigshared)

# Generate Example Datatypes

# Signatures
signature = example_signature()
signature_collection = example_signature_collection()
signature_annotations = example_annotations()

# Catalogues
catalogue = example_catalogue()
catalogue_collection = example_catalogue_collection()

# Cohort Analysis Results
cohort_analysis = example_cohort_analysis()

# Model
model = example_model()

# Assert Signatures
assert_signature(signature)
assert_signature_collection(signature_collection)
assert_signature_annotations(signature_annotations)

# Assert catalogues
assert_catalogue(catalogue)
assert_catalogue_collection(catalogue_collection)

# Assert Analyses
assert_cohort_analysis(cohort_analysis)

# Assert Model
assert_model(model, signature_collection,arg_name = "bob")

Signature Aetiology Classes

library(knitr)
kable(sig_aetiology_classes())
class subclass
artefact 8-oxo-guanine
artefact sequencing_artefact
artefact germline_contamination
artefact oversegmentation
clock-like clock-like
dysfunctional_dna_repair MMR
dysfunctional_dna_repair HR
dysfunctional_dna_repair NER
dysfunctional_dna_repair BER
dysfunctional_dna_repair NHEJ
dysfunctional_dna_replication proofreading
dysfunctional_dna_replication polymerase_mutations
treatment_associated chemotherapy_platinum
treatment_associated chemotherapy_thiopurine
treatment_associated chemotherapy_pyrimidine_antagonists
treatment_associated chemotherapy_unknown
treatment_associated chemotherapy_nitrogen_mustards
treatment_associated triazenes
treatment_associated immunosuppression
environmental_mutagens tobacco
environmental_mutagens haloalkanes
environmental_mutagens UV
environmental_mutagens aristolochic_acid
plants_and_microbes aflatoxin
plants_and_microbes colibactin
adenosine_deamination adenosine_deaminases
cytosine_deamination cytidine_deaminases
cytosine_deamination cytosine_deamination
immune ROS
dysfunctional_epigenetics topology
chromosomal chromosomal_losses
chromosomal chromosomal_instability
chromosomal chromothripsis
ploidy diploid
ploidy tetraploid
unknown unknown

For Developers

Argument Naming Conventions

  • catalogue
    A single mutational catalogue for a sample. A data.frame with columns: channel, type, fraction, count. Catalogues may be empirical (observed) or simulated.
    See: sigshared::example_catalogue()

  • signature
    A mutational signature profile. A data.frame with columns: type, channel, fraction.
    See: sigshared::example_signature()

  • signatures
    A collection of signatures. A named list of signature data.frames.
    See: sigshared::example_signature_collection()

  • catalogues
    A collection of catalogues. A named list of catalogue data.frames.
    See: sigshared::example_catalogue_collection()

  • model
    A named numeric vector describing a signature mixture. Names are signature IDs, values are their proportional contributions (e.g., c(SBS1 = 0.6, SBS5 = 0.4)).
    See: sigshared::example_model()

  • cohort
    A data.frame describing signature contributions per sample. Columns: sample, signature, contribution_absolute, contribution.
    See: sigshared::example_cohort()

  • cohort_metadata
    Sample-level metadata as a data.frame. Must include columns: sample, disease. Can include others.
    See: sigshared::example_metadata()

  • similarity_against_cohort
    A data.frame summarizing pairwise similarity between a sample and all others in the cohort. Columns: sample, cosine_similarity.
    See: sigshared::example_similarity_against_cohort()

  • umap
    A 2D UMAP projection of catalogue similarities. A data.frame with columns: sample, dim1, dim2.
    See: sigshared::example_umap()

Example Data

For each data structure used by the sigverse we include a toy example.

example_annotations()
#>   signature                       aetiology               class
#> 1      sig1          A clock like signature          clock-like
#> 2      sig2 An AID/APOBEC related signature cytidine deaminases
#>              subclass
#> 1          clock-like
#> 2 cytidine deaminases
example_bootstraps()
#>   bootstrap  signature contribution_absolute contribution
#> 1         1 Signature1                   300         0.30
#> 2         1 Signature2                   690         0.69
#> 3         1 Signature3                    10         0.01
#> 4         2 Signature1                   440         0.44
#> 5         2 Signature2                   500         0.50
#> 6         2 Signature3                    60         0.06

example_catalogue()
#>   channel type count  fraction
#> 1 A[T>C]G  T>C     5 0.1851852
#> 2 A[T>C]C  T>C    10 0.3703704
#> 3 A[T>C]T  T>C    12 0.4444444
example_catalogue_collection()
#> $catalogue1
#>   channel type count  fraction
#> 1 A[T>C]G  T>C     5 0.1851852
#> 2 A[T>C]C  T>C    10 0.3703704
#> 3 A[T>C]T  T>C    12 0.4444444
#> 
#> $catalogue2
#>   channel type count  fraction
#> 1 A[T>C]G  T>C     5 0.1851852
#> 2 A[T>C]C  T>C    10 0.3703704
#> 3 A[T>C]T  T>C    12 0.4444444
#> 
#> $catalogue3
#>   channel type count  fraction
#> 1 A[T>C]G  T>C     5 0.1851852
#> 2 A[T>C]C  T>C    10 0.3703704
#> 3 A[T>C]T  T>C    12 0.4444444

example_signature()
#>   channel type fraction
#> 1 A[T>C]G  T>C      0.4
#> 2 A[T>C]C  T>C      0.1
#> 3 A[T>C]T  T>C      0.5
example_signature_collection()
#> $sig1
#>   channel type fraction
#> 1 A[T>C]G  T>C      0.4
#> 2 A[T>C]C  T>C      0.1
#> 3 A[T>C]T  T>C      0.5
#> 
#> $sig2
#>   channel type fraction
#> 1 A[T>C]G  T>C      0.4
#> 2 A[T>C]C  T>C      0.1
#> 3 A[T>C]T  T>C      0.5

example_model()
#> sig1 sig2 
#>  0.3  0.7

example_cohort_analysis()
#>    sample signature contribution_absolute contribution p_value
#> 1 sample1      sig1                     3          0.3    0.05
#> 2 sample1      sig2                     7          0.7    0.10
#> 3 sample2      sig1                    40          0.4    0.20
#> 4 sample2      sig2                    60          0.6    0.15
example_similarity_against_cohort()
#>      sample cosine_similarity
#> 1   sample1              0.95
#> 2   sample2              0.89
#> 3   sample3              0.78
#> 4   sample4              0.85
#> 5   sample5              0.92
#> 6   sample6              0.88
#> 7   sample7              0.90
#> 8   sample8              0.86
#> 9   sample9              0.80
#> 10 sample10              0.84

example_cohort_metadata()
#>      sample     disease
#> 1   sample1    Melanoma
#> 2   sample2    Melanoma
#> 3   sample3    Melanoma
#> 4   sample4    Melanoma
#> 5   sample5    Melanoma
#> 6   sample6 Lung Cancer
#> 7   sample7 Lung Cancer
#> 8   sample8 Lung Cancer
#> 9   sample9 Lung Cancer
#> 10 sample10 Lung Cancer

example_umap()
#>      sample dim1 dim2
#> 1   sample1  0.5 -0.4
#> 2   sample2  1.2  0.6
#> 3   sample3 -0.7  1.3
#> 4   sample4  2.3 -0.8
#> 5   sample5 -1.5  2.0
#> 6   sample6  0.0 -1.2
#> 7   sample7  1.8  0.5
#> 8   sample8 -0.3  1.1
#> 9   sample9  0.9 -0.9
#> 10 sample10 -1.1  0.0

We also include examples from real SBS mutational signature analysis of the colo829 melanoma cell line

head(example_catalogue_colo829())
#>   channel type     fraction count
#> 1 A[C>A]A  C>A 0.0035802073   134
#> 2 A[C>A]C  C>A 0.0016565138    62
#> 3 A[C>A]G  C>A 0.0004542054    17
#> 4 A[C>A]T  C>A 0.0015229240    57
#> 5 A[C>G]A  C>G 0.0016565138    62
#> 6 A[C>G]C  C>G 0.0011488725    43
head(example_bootstraps_colo829())
#>   signature bootstrap contribution_absolute contribution
#> 1      SBS1     Rep_1                0.0000  0.000000000
#> 2      SBS2     Rep_1              271.9152  0.007265021
#> 3      SBS3     Rep_1                0.0000  0.000000000
#> 4      SBS4     Rep_1                0.0000  0.000000000
#> 5      SBS5     Rep_1                0.0000  0.000000000
#> 6      SBS6     Rep_1                0.0000  0.000000000

Reformatting Signature/Catalogue Collections

# List of signatures -> matrix (rows = channels; columns = signatures; values = fractions)
sig_collection_reformat_list_to_matrix(example_signature_collection())
#>         sig1 sig2
#> A[T>C]G  0.4  0.4
#> A[T>C]C  0.1  0.1
#> A[T>C]T  0.5  0.5
#> attr(,"type")
#> [1] "T>C" "T>C" "T>C"

# Matrix -> List of signatures
sig_collection_reformat_matrix_to_list(example_signature_collection_matrix(),
                                       values = "fraction")
#> $sig1
#>   channel type fraction
#> 1 A[T>C]G  T>C      0.4
#> 2 A[T>C]C  T>C      0.1
#> 3 A[T>C]T  T>C      0.5
#> 
#> $sig2
#>   channel type fraction
#> 1 A[T>C]G  T>C      0.4
#> 2 A[T>C]C  T>C      0.1
#> 3 A[T>C]T  T>C      0.5

# List of signatures -> tidy data.frame
sig_collection_reformat_list_to_tidy(example_signature_collection())
#>   signature type channel fraction
#> 1      sig1  T>C A[T>C]G      0.4
#> 2      sig1  T>C A[T>C]C      0.1
#> 3      sig1  T>C A[T>C]T      0.5
#> 4      sig2  T>C A[T>C]G      0.4
#> 5      sig2  T>C A[T>C]C      0.1
#> 6      sig2  T>C A[T>C]T      0.5

# Tidy data.frame -> list of signatures
sig_collection_reformat_tidy_to_list(example_signature_collection_tidy())
#> $sig1
#>   type channel fraction
#> 1  T>C A[T>C]G      0.4
#> 2  T>C A[T>C]C      0.1
#> 3  T>C A[T>C]T      0.5
#> 
#> $sig2
#>   type channel fraction
#> 4  T>C A[T>C]G      0.4
#> 5  T>C A[T>C]C      0.1
#> 6  T>C A[T>C]T      0.5



# All the above methods work with catalogues
sig_collection_reformat_list_to_matrix(example_catalogue_collection(), values = "count")
#>         catalogue1 catalogue2 catalogue3
#> A[T>C]G          5          5          5
#> A[T>C]C         10         10         10
#> A[T>C]T         12         12         12
#> attr(,"type")
#> [1] "T>C" "T>C" "T>C"

sig_collection_reformat_matrix_to_list(example_catalogue_collection_matrix(), 
                                       values = "count")
#> $catalogue1
#>   channel type count  fraction
#> 1 A[T>C]G  T>C     5 0.1851852
#> 2 A[T>C]C  T>C    10 0.3703704
#> 3 A[T>C]T  T>C    12 0.4444444
#> 
#> $catalogue2
#>   channel type count  fraction
#> 1 A[T>C]G  T>C     5 0.1851852
#> 2 A[T>C]C  T>C    10 0.3703704
#> 3 A[T>C]T  T>C    12 0.4444444
#> 
#> $catalogue3
#>   channel type count  fraction
#> 1 A[T>C]G  T>C     5 0.1851852
#> 2 A[T>C]C  T>C    10 0.3703704
#> 3 A[T>C]T  T>C    12 0.4444444

Other Utility Functions

sigverse dependencies are kept minimal. To mitigate the cost to readability, sigshared includes some baseR implementations of common data.frame manipulation functions and other utilities.

# Setup data
mtcars <- mtcars[1:5,1:5]

# Rename a dataframe
brename(mtcars, c(miles_per_gallon = "mpg"))
#>                   miles_per_gallon cyl disp  hp drat
#> Mazda RX4                     21.0   6  160 110 3.90
#> Mazda RX4 Wag                 21.0   6  160 110 3.90
#> Datsun 710                    22.8   4  108  93 3.85
#> Hornet 4 Drive                21.4   6  258 110 3.08
#> Hornet Sportabout             18.7   8  360 175 3.15

# Select a subset of columns
bselect(mtcars, c("mpg"))
#>                    mpg
#> Mazda RX4         21.0
#> Mazda RX4 Wag     21.0
#> Datsun 710        22.8
#> Hornet 4 Drive    21.4
#> Hornet Sportabout 18.7

# Evaluate code with a specific random seed
with_seed(seed = 123, { runif(1) })
#> [1] 0.2875775

# Compute fraction from count vector
compute_fraction(c(1, 100, 10, 40))
#> [1] 0.006622517 0.662251656 0.066225166 0.264900662

S3 classes

Most types of data used by sigverse are so simple-stuctured we can just expect data.frames/lists and use custom assertions to ensure it matches expectation. There are a couple of exceptions, however, where we provide S3 classes abstract away complexity in the datastores. For example we use 2 different s3 objects to store signature analysis results.

  1. signature_analysis_results
    The numeric results of signature analysis (no-visualisation). Designed to allow a creation of all sigstory visualisations from EXCLUSIVELY data in the object.

  2. sigstory_visualisations
    Contains mostly visualisations only + metrics we display in sigstory reports. This helps us keep sigstory very light (logic is hard to debug in knitted quarto templates). Additionally, this object type allows us to easily write all visualisations to disk so they can be pulled into non-sigstory reporting tools.

Unless you’re developing a sigstory compatible R package you probably won’t ever need to think about either of these object types.

About

A collection of utility functions used by many R packages in the sigverse

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages