visualisation/highdimvis.Rmd at master · lgatto/visualisation

History

250 lines (196 loc) · 7.96 KB

Raw

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

---

title: 'Visualisation of high-dimensional data'

author: "Laurent Gatto"

output:

rmdshower::shower_presentation:

theme: material

self_contained: true

ratio: 16x10

---

## Visualisation of high-dimensional data

```

Laurent Gatto Computational Biology

https://lgatto.github.io de Duve Institute, UCLouvain

laurent.gatto@uclouvain.be @lgatt0

```

Slides: http://bit.ly/highdimvis

Source: https://github.com/lgatto/visualisation

![CC-BY](./figs/cc1.jpg)

```{r env, warning=FALSE, echo=FALSE}

suppressPackageStartupMessages(library("pRoloc"))

suppressPackageStartupMessages(library("pRolocdata"))

library("BiocParallel")

suppressPackageStartupMessages(library("Rtsne"))

data(mulvey2015)

data(hyperLOPIT2015)

```

## Introduction

* Unsupervised machine learning (clustering): make use of the data to

identify patterns of interest.

* Supervised machine learning (classification and regression): use

prior knowledge (such as known labels about a subset of the data) to

infer new information (labels) for the unlabelled or new data.

## From data

```{r data, echo=FALSE}

set.seed(1L)

m <- data.frame(matrix(round(rnorm(15, 10, 3), 2), ncol = 3))

m$new <- rep("...", 5)

colnames(m) <- c(paste("Sample", 1:3), "...")

rownames(m) <- c(paste("Protein", 1:4), "...")

m$group <- c("A", "", "B", "A", "")

knitr::kable(m, align = "c")

```

## To visualisation, annotation and results

```{r introvis, echo=FALSE, fig.width=12, fig.height=4}

par(mfrow = c(1, 3))

plot2D(hyperLOPIT2015, fcol = NULL, col = "black")

plot2D(hyperLOPIT2015, fcol = "markers")

plot2D(hyperLOPIT2015, fcol = "final.assignment")

```

## Content

* Data

* Hierarchical clustering

* Dimentionality reduction: PCA

* Dimentionality reduction: t-SNE

## Proteomics examples

Sample-level visualisations using data from

[Mulvey *et al.* (2015)](https://www.ncbi.nlm.nih.gov/pubmed/26059426)

*Dynamic Proteomic Profiling of Extra-Embryonic Endoderm

Differentiation in Mouse Embryonic Stem Cells.*

Protein-level visualisations using data from

[Christoforou *et al.* (2016)](https://www.ncbi.nlm.nih.gov/pubmed/26754106)

*A draft map of the mouse pluripotent stem cell spatial proteome.*

## Hierarchical clustering

Hierarchical clustering methods start by calculating all pairwise

distances between all features (or samples) and then clusters/groups

these based on these similarities. There are various distances

measures and clustering algorithms that can be used.

Plots prepared with `dist` and `hclust` from the `stats` package and

`mrkHClust` from

[`pRoloc`](https://bioconductor.org/packages/devel/bioc/html/pRoloc.html).

## {.fullpage}

```{r hclust, echo=FALSE, fig.width=12, fig.height=6}

d <- dist(exprs(t(mulvey2015)))

hcl <- hclust(d)

par(mfrow = c(1, 2))

plot(hcl, xlab = "",

main = "Samples from Mulvey et al. (2015)",

sub = "Euclidian distance\nComplete clustering")

par(mar = c(10, 2, 3, 1))

mrkHClust(hyperLOPIT2015,

main = "Organelle protein markers from Christoforou et al. (2016)")

```

## Dimensionality reduction

When the data span over many dimensions (more than 2 or 3, up to

thousands), it becomes impossible to easily visualise it in its

entirety. *Dimensionality reduction* techniques such as **PCA** or

**t-SNE** will transform the data into a new space that summarise

properties of the whole data set along a reduced number of

dimensions. These are then used to visualise the data along these

informative dimensions or perform calculations more efficiently.

## Principal Component Analysis

Principal Component Analysis (PCA) is a technique that transforms the

original n-dimentional data into a new data space. Along these new

dimensions, called principal components, the data expresses most of

its variability along the first PC, then second, .... These new

dimensions are *linear combinations* of the orignal data.

Figures produces with `plot2D` function from the

[`pRoloc`](https://bioconductor.org/packages/devel/bioc/html/pRoloc.html)

package.

## {.fullpage}

```{r pcaex, echo=FALSE, fig.width = 12, fig.height = 4}

set.seed(1)

xy <- data.frame(x = (x <- rnorm(50, 2, 1)),

y = x + rnorm(50, 1, 0.5))

pca <- prcomp(xy)

z <- cbind(x = c(-1, 1), y = c(0, 0))

zhat <- z %*% t(pca$rotation[, 1:2])

zhat <- scale(zhat, center = colMeans(xy), scale = FALSE)

par(mfrow = c(1, 3))

plot(xy, main = "Orignal data (2 dimensions)")

plot(xy, main = "Orignal data with PC1")

abline(lm(y ~ x, data = data.frame(zhat - 10)), lty = "dashed")

grid()

plot(pca$x, main = "Data in PCA space")

grid()

```

## { .fullpage }

```{r pcaprot, echo=FALSE, fig.width=14, fig.height=7}

par(mfrow = c(1, 2))

setStockcol(NULL)

setStockcol(paste0(getStockcol(), 70))

.pca <- plot2D(t(mulvey2015), fcol = "times", cex = 3,

sub = "Mulvey et al. (2015)",

main = "Stem cell differentiation")

text(.pca[, 1], .pca[, 2], mulvey2015$rep)

legend("topleft", bty = "n",

legend = c(paste("Time", unique(mulvey2015$times)), "Replicates (1,2,3)"),

pch = 19, col = c(getStockcol()[1:6], NA))

plot2D(hyperLOPIT2015,

main = "Sub-cellular localisation map",

sub = "Christoforou et al. (2016)")

addLegend(hyperLOPIT2015, cex = .8)

```

## t-Distributed Stochastic Neighbour Embedding

t-Distributed Stochastic Neighbour Embedding (t-SNE) is a *non-linear*

dimensionality reduction techique, i.e. that different regions of the

data space will be subjected to different transformations. t-SNE will

compress small distances, thus bringing close neighbours together, and

will ignore large distances.

Figures produces with `plot2D` function from the

[`pRoloc`](https://bioconductor.org/packages/devel/bioc/html/pRoloc.html)

package.

## { .fullpage }

```{r tsneex, cache=TRUE, echo=FALSE, fig.width=14, fig.height=7}

par(mfrow = c(1, 2))

plot2D(hyperLOPIT2015, method = "PCA",

sub = "Mulvey et al. (2015)",

main = "Stem cell differentiation (PCA)")

plot2D(hyperLOPIT2015, method = "t-SNE",

sub = "Mulvey et al. (2015)",

main = "Stem cell differentiation (t-SNE)")

```

## Parameter tuning

t-SNE (as well as many other methods, in particular classification

algorithms) has two important parameters that can substantially

influence the clustering of the data

- **Perplexity**: balances global and local aspects of the data.

- **Iterations**: number of iterations before the clustering is

stopped.

It is important to adapt these for different data.

## { .fullpage }

```{r tsneparams, echo=FALSE}

params <- expand.grid(perplexity = c(5, 30, 50, 100),

steps = c(10, 1000, 10000))

if (!file.exists("./cache/tsnes.rda")) {

tsnes <- bplapply(seq_len(nrow(params)),

function(i)

plot2D(hyperLOPIT2015, method = "t-SNE",

methargs = list(pca_scale = TRUE,

pca_center = TRUE,

perplexity = params[i, "perplexity"],

max_iter = params[i, "steps"]),

plot = FALSE))

} else load("./cache/tsnes.rda")

```

```{r tsnesplots, echo=FALSE, fig.width=10, fig.height=7.5}

par(mfrow = c(3, 4),

oma = rep(0, 4),

mar = c(1, 1, 3, 1))

for (i in 1:nrow(params))

plot2D(tsnes[[i]], method = "none",

xaxt = "n", yaxt = "n",

methargs = list(hyperLOPIT2015),

main = paste("Steps:", params[i, "steps"],

"Perplexity", params[i, "perplexity"]))

```

## License

This material is made available under the

[Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/).

You are free to **share** - copy and redistribute the material in any

medium or format, and **adapt** - remix, transform, and build upon the

material for any purpose, even commercially it.

**Attribution** You must give appropriate credit, provide a link to

the license, and indicate if changes were made. You may do so in any

reasonable manner, but not in any way that suggests the licensor

endorses you or your use.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

highdimvis.Rmd

Latest commit

History

highdimvis.Rmd

File metadata and controls