-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathhighdimvis.Rmd
More file actions
250 lines (196 loc) · 7.96 KB
/
highdimvis.Rmd
File metadata and controls
250 lines (196 loc) · 7.96 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
---
title: 'Visualisation of high-dimensional data'
author: "Laurent Gatto"
output:
rmdshower::shower_presentation:
theme: material
self_contained: true
ratio: 16x10
---
## Visualisation of high-dimensional data
```
Laurent Gatto Computational Biology
https://lgatto.github.io de Duve Institute, UCLouvain
laurent.gatto@uclouvain.be @lgatt0
```
Slides: http://bit.ly/highdimvis
Source: https://github.com/lgatto/visualisation

```{r env, warning=FALSE, echo=FALSE}
suppressPackageStartupMessages(library("pRoloc"))
suppressPackageStartupMessages(library("pRolocdata"))
library("BiocParallel")
suppressPackageStartupMessages(library("Rtsne"))
data(mulvey2015)
data(hyperLOPIT2015)
```
## Introduction
* Unsupervised machine learning (clustering): make use of the data to
identify patterns of interest.
* Supervised machine learning (classification and regression): use
prior knowledge (such as known labels about a subset of the data) to
infer new information (labels) for the unlabelled or new data.
## From data
```{r data, echo=FALSE}
set.seed(1L)
m <- data.frame(matrix(round(rnorm(15, 10, 3), 2), ncol = 3))
m$new <- rep("...", 5)
colnames(m) <- c(paste("Sample", 1:3), "...")
rownames(m) <- c(paste("Protein", 1:4), "...")
m$group <- c("A", "", "B", "A", "")
knitr::kable(m, align = "c")
```
## To visualisation, annotation and results
```{r introvis, echo=FALSE, fig.width=12, fig.height=4}
par(mfrow = c(1, 3))
plot2D(hyperLOPIT2015, fcol = NULL, col = "black")
plot2D(hyperLOPIT2015, fcol = "markers")
plot2D(hyperLOPIT2015, fcol = "final.assignment")
```
## Content
* Data
* Hierarchical clustering
* Dimentionality reduction: PCA
* Dimentionality reduction: t-SNE
## Proteomics examples
Sample-level visualisations using data from
[Mulvey *et al.* (2015)](https://www.ncbi.nlm.nih.gov/pubmed/26059426)
*Dynamic Proteomic Profiling of Extra-Embryonic Endoderm
Differentiation in Mouse Embryonic Stem Cells.*
Protein-level visualisations using data from
[Christoforou *et al.* (2016)](https://www.ncbi.nlm.nih.gov/pubmed/26754106)
*A draft map of the mouse pluripotent stem cell spatial proteome.*
## Hierarchical clustering
Hierarchical clustering methods start by calculating all pairwise
distances between all features (or samples) and then clusters/groups
these based on these similarities. There are various distances
measures and clustering algorithms that can be used.
Plots prepared with `dist` and `hclust` from the `stats` package and
`mrkHClust` from
[`pRoloc`](https://bioconductor.org/packages/devel/bioc/html/pRoloc.html).
## {.fullpage}
```{r hclust, echo=FALSE, fig.width=12, fig.height=6}
d <- dist(exprs(t(mulvey2015)))
hcl <- hclust(d)
par(mfrow = c(1, 2))
plot(hcl, xlab = "",
main = "Samples from Mulvey et al. (2015)",
sub = "Euclidian distance\nComplete clustering")
par(mar = c(10, 2, 3, 1))
mrkHClust(hyperLOPIT2015,
main = "Organelle protein markers from Christoforou et al. (2016)")
```
## Dimensionality reduction
When the data span over many dimensions (more than 2 or 3, up to
thousands), it becomes impossible to easily visualise it in its
entirety. *Dimensionality reduction* techniques such as **PCA** or
**t-SNE** will transform the data into a new space that summarise
properties of the whole data set along a reduced number of
dimensions. These are then used to visualise the data along these
informative dimensions or perform calculations more efficiently.
## Principal Component Analysis
Principal Component Analysis (PCA) is a technique that transforms the
original n-dimentional data into a new data space. Along these new
dimensions, called principal components, the data expresses most of
its variability along the first PC, then second, .... These new
dimensions are *linear combinations* of the orignal data.
Figures produces with `plot2D` function from the
[`pRoloc`](https://bioconductor.org/packages/devel/bioc/html/pRoloc.html)
package.
## {.fullpage}
```{r pcaex, echo=FALSE, fig.width = 12, fig.height = 4}
set.seed(1)
xy <- data.frame(x = (x <- rnorm(50, 2, 1)),
y = x + rnorm(50, 1, 0.5))
pca <- prcomp(xy)
z <- cbind(x = c(-1, 1), y = c(0, 0))
zhat <- z %*% t(pca$rotation[, 1:2])
zhat <- scale(zhat, center = colMeans(xy), scale = FALSE)
par(mfrow = c(1, 3))
plot(xy, main = "Orignal data (2 dimensions)")
plot(xy, main = "Orignal data with PC1")
abline(lm(y ~ x, data = data.frame(zhat - 10)), lty = "dashed")
grid()
plot(pca$x, main = "Data in PCA space")
grid()
```
## { .fullpage }
```{r pcaprot, echo=FALSE, fig.width=14, fig.height=7}
par(mfrow = c(1, 2))
setStockcol(NULL)
setStockcol(paste0(getStockcol(), 70))
.pca <- plot2D(t(mulvey2015), fcol = "times", cex = 3,
sub = "Mulvey et al. (2015)",
main = "Stem cell differentiation")
text(.pca[, 1], .pca[, 2], mulvey2015$rep)
legend("topleft", bty = "n",
legend = c(paste("Time", unique(mulvey2015$times)), "Replicates (1,2,3)"),
pch = 19, col = c(getStockcol()[1:6], NA))
plot2D(hyperLOPIT2015,
main = "Sub-cellular localisation map",
sub = "Christoforou et al. (2016)")
addLegend(hyperLOPIT2015, cex = .8)
```
## t-Distributed Stochastic Neighbour Embedding
t-Distributed Stochastic Neighbour Embedding (t-SNE) is a *non-linear*
dimensionality reduction techique, i.e. that different regions of the
data space will be subjected to different transformations. t-SNE will
compress small distances, thus bringing close neighbours together, and
will ignore large distances.
Figures produces with `plot2D` function from the
[`pRoloc`](https://bioconductor.org/packages/devel/bioc/html/pRoloc.html)
package.
## { .fullpage }
```{r tsneex, cache=TRUE, echo=FALSE, fig.width=14, fig.height=7}
par(mfrow = c(1, 2))
plot2D(hyperLOPIT2015, method = "PCA",
sub = "Mulvey et al. (2015)",
main = "Stem cell differentiation (PCA)")
plot2D(hyperLOPIT2015, method = "t-SNE",
sub = "Mulvey et al. (2015)",
main = "Stem cell differentiation (t-SNE)")
```
## Parameter tuning
t-SNE (as well as many other methods, in particular classification
algorithms) has two important parameters that can substantially
influence the clustering of the data
- **Perplexity**: balances global and local aspects of the data.
- **Iterations**: number of iterations before the clustering is
stopped.
It is important to adapt these for different data.
## { .fullpage }
```{r tsneparams, echo=FALSE}
params <- expand.grid(perplexity = c(5, 30, 50, 100),
steps = c(10, 1000, 10000))
if (!file.exists("./cache/tsnes.rda")) {
tsnes <- bplapply(seq_len(nrow(params)),
function(i)
plot2D(hyperLOPIT2015, method = "t-SNE",
methargs = list(pca_scale = TRUE,
pca_center = TRUE,
perplexity = params[i, "perplexity"],
max_iter = params[i, "steps"]),
plot = FALSE))
} else load("./cache/tsnes.rda")
```
```{r tsnesplots, echo=FALSE, fig.width=10, fig.height=7.5}
par(mfrow = c(3, 4),
oma = rep(0, 4),
mar = c(1, 1, 3, 1))
for (i in 1:nrow(params))
plot2D(tsnes[[i]], method = "none",
xaxt = "n", yaxt = "n",
methargs = list(hyperLOPIT2015),
main = paste("Steps:", params[i, "steps"],
"Perplexity", params[i, "perplexity"]))
```
## License
This material is made available under the
[Creative Commons Attribution license](https://creativecommons.org/licenses/by/4.0/).
You are free to **share** - copy and redistribute the material in any
medium or format, and **adapt** - remix, transform, and build upon the
material for any purpose, even commercially it.
**Attribution** You must give appropriate credit, provide a link to
the license, and indicate if changes were made. You may do so in any
reasonable manner, but not in any way that suggests the licensor
endorses you or your use.