-
Notifications
You must be signed in to change notification settings - Fork 9
Expand file tree
/
Copy patha4-mosaics.Rmd
More file actions
439 lines (358 loc) · 16.3 KB
/
a4-mosaics.Rmd
File metadata and controls
439 lines (358 loc) · 16.3 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
---
title: "4. Mosaic plots"
author: "Michael Friendly"
date: "`r Sys.Date()`"
package: vcdExtra
output:
rmarkdown::html_vignette:
fig_caption: yes
bibliography: ["vcd.bib", "vcdExtra.bib"]
csl: apa.csl
vignette: >
%\VignetteIndexEntry{4. Mosaic plots}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning = FALSE,
fig.height = 6,
fig.width = 7,
# fig.path = "fig/tut04-",
dev = "png",
comment = "##"
)
# save some typing
knitr::set_alias(w = "fig.width",
h = "fig.height",
cap = "fig.cap")
# Load packages
set.seed(1071)
library(vcd)
library(vcdExtra)
library(ggplot2)
library(seriation)
data(HairEyeColor)
data(PreSex)
data(Arthritis, package="vcd")
art <- xtabs(~Treatment + Improved, data = Arthritis)
#if(!file.exists("fig")) dir.create("fig")
```
Mosaic plots provide an ideal method both for visualizing contingency tables and for
visualizing the fit--- or more importantly--- **lack of fit** of a loglinear model.
For a two-way table, `mosaic()`, by default, fits a model of independence, $[A][B]$
or `~A + B` as an R formula. The `vcdExtra` package extends this to models fit
using `glm(..., family=poisson)`, which can include specialized models for
ordered factors, or square tables that are intermediate between the saturated model,
$[A B]$ = `A * B`, and the independence model $[A][B]$.
For $n$-way tables, `vcd::mosaic()` can fit any loglinear model, and can also be
used to plot a model fit with `MASS:loglm()`. The `vcdExtra` package extends this
to models fit using `stats::glm()` and, by extension, to non-linear models fit
using the [gnm package](https://cran.r-project.org/package=gnm).
See @vcd:Friendly:1994, @vcd:Friendly:1999 for the statistical ideas behind these
uses of mosaic displays in connection with loglinear models. Our book @FriendlyMeyer:2016:DDAR
gives a detailed discussion of mosaic plots and many more examples.
The essential ideas are to:
* recursively sub-divide a unit square into rectangular "tiles" for the
cells of the table, such that the area of each tile is proportional to the cell frequency. Tiles are split in a sequential order:
+ First according to the **marginal** proportions of a first variable, V1
+ Next according to the **conditional** proportions of a 2nd variable, V2 | V1
+ Next according to the **conditional** proportions of a 3rd variable, V3 | {V1, V2}
+ ...
* For a given loglinear model, the tiles can then be shaded in various ways to reflect
the residuals (lack of fit) for a given model.
* The pattern of residuals can then
be used to suggest a better model or understand *where* a given model fits or
does not fit.
`mosaic()` provides a wide range of options for the directions of splitting,
the specification of shading, labeling, spacing, legend and many other details.
It is actually implemented as a special case of a more general
class of displays for $n$-way tables called `strucplot`, including
sieve diagrams, association plots, double-decker plots as well as mosaic
plots.
For details, see `help(strucplot)` and the "See also" links therein,
and also @vcd:Meyer+Zeileis+Hornik:2006b, which is available as
an R vignette via `vignette("strucplot", package="vcd")`.
***Example***:
A mosaic plot for the Arthritis treatment data fits the model of independence,
`~ Treatment + Improved` and displays the association in the pattern of
residual shading. The goal is to visualize the difference in the proportions
of `Improved` for the two levels of `Treatment` : "Placebo" and "Treated".
The plot below is produced with the following call to `mosaic()`.
With the first split by `Treatment` and the shading used, it is easy to see
that more people given the placebo experienced no improvement, while more
people given the active treatment reported marked improvement.
```{r}
#| Arthritis1,
#| fig.height = 6,
#| fig.width = 7,
#| fig.cap = "Mosaic plot for the `Arthritis` data, using `shading_max`"
data(Arthritis, package="vcd")
art <- xtabs(~Treatment + Improved, data = Arthritis)
mosaic(art, gp = shading_max,
split_vertical = TRUE,
main="Arthritis: [Treatment] [Improved]")
```
`gp = shading_max` specifies that color in the plot signals a significant residual at a 90% or 99% significance level,
with the more intense shade for 99%.
Note that the residuals for the independence model were not large
(as shown in the legend),
yet the association between `Treatment` and `Improved`
is highly significant.
```{r, art1}
summary(art)
```
In contrast, one of the other shading schemes, from @vcd:Friendly:1994
(use: `gp = shading_Friendly`),
uses fixed cutoffs of $\pm 2, \pm 4$,
to shade cells which are *individually* significant
at approximately $\alpha = 0.05$ and $\alpha = 0.001$ levels, respectively.
The plot below uses `gp = shading_Friendly`.
```{r}
#| Arthritis2,
#| fig.height = 6,
#| fig.width = 7,
#| fig.cap = "Mosaic plot for the `Arthritis` data, using `shading_Friendly`"
mosaic(art, gp = shading_Friendly,
split_vertical = TRUE,
main="Arthritis: gp = shading_Friendly")
```
## Permuting variable levels
Mosaic plots using tables or frequency data frames as input typically take the levels of the
table variables in the order presented in the dataset. For character variables, this is often
alphabetical order. That might be helpful for looking up a value, but is unhelpful for seeing
and understanding the pattern of association.
It is usually much better to order the levels of the row and column variables to help reveal
the nature of their association. This is an example of **effect ordering for data display**
[@FriendlyKwan:02:effect].
***Example***:
Data from @Glass:54 gave this 5 x 5 table on the occupations of 3500 British fathers and their sons, where the occupational categories are listed in alphabetic order.
```{r glass}
data(Glass, package="vcdExtra")
(glass.tab <- xtabs(Freq ~ father + son, data=Glass))
```
The mosaic display shows very strong association, but aside from the
diagonal cells, the pattern is unclear. Note the use of
`set_varnames` to give more descriptive labels for the variables
and abbreviate the occupational category labels.
and `interpolate` to set the shading levels for the mosaic.
```{r glass-mosaic1}
largs <- list(set_varnames=list(father="Father's Occupation",
son="Son's Occupation"),
abbreviate=10)
gargs <- list(interpolate=c(1,2,4,8))
mosaic(glass.tab,
shade=TRUE,
labeling_args=largs,
gp_args=gargs,
main="Alphabetic order",
legend=FALSE,
rot_labels=c(20,90,0,70))
```
The occupational categories differ in **status**, and can be reordered correctly as follows,
from `Professional` down to `Unskilled`.
```{r glass-order}
# reorder by status
ord <- c(2, 1, 4, 3, 5)
row.names(glass.tab)[ord]
```
The revised mosaic plot can be produced by indexing the rows and columns of
the table using `ord`.
```{r glass-mosaic2}
mosaic(glass.tab[ord, ord],
shade=TRUE,
labeling_args=largs,
gp_args=gargs,
main="Effect order",
legend=FALSE,
rot_labels=c(20,90,0,70))
```
From this, and for the examples in the next section, it is useful to re-define
`father` and `son` as **ordered** factors in the original `Glass` frequency data.frame.
```{r glass-ord}
Glass.ord <- Glass
Glass.ord$father <- ordered(Glass.ord$father, levels=levels(Glass$father)[ord])
Glass.ord$son <- ordered(Glass.ord$son, levels=levels(Glass$son)[ord])
str(Glass.ord)
```
## Square tables
For mobility tables such as this, where the rows and columns refer to the same
occupational categories it comes as no surprise that there is a strong association
in the diagonal cells: most often, sons remain in the same occupational categories
as their fathers.
However, the re-ordered mosaic display also reveals something subtler:
when a son differs in occupation from the father, it is more likely that
he will appear in a category one-step removed than more steps removed.
The residuals seem to decrease with the number of steps from the diagonal.
For such tables, specialized loglinear models provide interesting cases
intermediate between the independence model, [A] [B],
and the saturated model, [A B]. These can be fit using `glm()`, with the data
in frequency form,
```
glm(Freq ~ A + B + assoc, data = ..., family = poisson)
```
where `assoc` is a special term to handle a restricted form of association, different from
`A:B` which specifies the saturated model in this notation.
* **Quasi-independence**: Asserts independence, but ignores the diagonal cells by
fitting them exactly. The loglinear model is:
$\log m_{ij} = \mu + \lambda^A_i + \lambda^B_j + \delta_i I(i = j)$,
where $I()$ is the indicator function.
* **Symmetry**: This model asserts that the joint distribution of the row and column variables
is symmetric, that is $\pi_{ij} = \pi_{ji}$:
A son is equally likely to move from their father's occupational category $i$ to another
category, $j$, as the reverse, moving from $j$ to $i$.
Symmetry is quite strong, because it also implies
**marginal homogeneity**, that the marginal probabilities of the row and column variables
are equal,
$\pi{i+} = \sum_j \pi_{ij} = \sum_j \pi_{ji} = \pi_{+i}$ for all $i$.
* **Quasi-symmetry**: This model uses the standard main-effect terms in the loglinear
model, but asserts that the association parameters are symmetric,
$\log m_{ij} = \mu + \lambda^A_i + \lambda^B_j + \lambda^{AB}_{ij}$,
where $\lambda^{AB}_{ij} = \lambda^{AB}_{ji}$.
The [gnm package](https://cran.r-project.org/package=gnm) provides a variety of these functions:
`gnm::Diag()`, `gnm::Symm()` and `gnm::Topo()` for an interaction factor as specified by an array of levels, which may be arbitrarily structured.
For example, the following generates a term for a diagonal factor in a $4 \times 4$
table. The diagonal values reflect parameters fitted for each diagonal cell. Off-diagonal
values, "." are ignored.
```{r diag}
rowfac <- gl(4, 4, 16)
colfac <- gl(4, 1, 16)
diag4by4 <- Diag(rowfac, colfac)
matrix(Diag(rowfac, colfac, binary = FALSE), 4, 4)
```
`Symm()` constructs parameters for symmetric cells. The particular values don't matter.
All that does matter is that the same value, e.g., `1:2` appears in both the (1,2) and
(2,1) cells.
```{r symm}
symm4by4 <- Symm(rowfac, colfac)
matrix(symm4by4, 4, 4)
```
***Example***:
To illustrate, we fit the four models below, starting with the independence model
`Freq ~ father + son` and then adding terms to reflect the restricted
forms of association, e.g., `Diag(father, son)` for diagonal terms and
`Symm(father, son)` for symmetry.
```{r glass-models}
library(gnm)
glass.indep <- glm(Freq ~ father + son,
data = Glass.ord, family=poisson)
glass.quasi <- glm(Freq ~ father + son + Diag(father, son),
data = Glass.ord, family=poisson)
glass.symm <- glm(Freq ~ Symm(father, son),
data = Glass.ord, family=poisson)
glass.qsymm <- glm(Freq ~ father + son + Symm(father, son),
data = Glass.ord, family=poisson)
```
We can visualize these using the `vcdExtra::mosaic.glm()` method, which extends
mosaic displays to handle fitted `glm` objects. *Technical note*: for
models fitted using `glm()`, standardized residuals, `residuals_type="rstandard"`
have better statistical properties than the default Pearson residuals in mosaic
plots and analysis.
```{r glass-quasi}
mosaic(glass.quasi,
residuals_type="rstandard",
shade=TRUE,
labeling_args=largs,
gp_args=gargs,
main="Quasi-Independence",
legend=FALSE,
rot_labels=c(20,90,0,70)
)
```
Mosaic plots for the other models would give further visual assessment of these models,
however we can also test differences among them. For nested models, `anova()` gives tests
of how much better a more complex model is compared to the previous one.
```{r glass-anova}
# model comparisons: for *nested* models
anova(glass.indep, glass.quasi, glass.qsymm, test="Chisq")
```
Alternatively, `vcdExtra::LRstats()` gives model summaries for a collection of
models, not necessarily nested, with AIC and BIC statistics reflecting
model parsimony.
```{r glass-lrstats}
models <- glmlist(glass.indep, glass.quasi, glass.symm, glass.qsymm)
LRstats(models)
```
By all criteria, the model of quasi symmetry fits best. The residual deviance $G^2
is not significant. The mosaic is largely unshaded, indicating a good fit, but there
are a few shaded cells that indicate the remaining positive and negative residuals.
For comparative mosaic displays, it is sometimes useful to show the $G^2$ statistic
in the main title, using `vcdExtra::modFit()` for this purpose.
```{r glass-qsymm}
mosaic(glass.qsymm,
residuals_type="rstandard",
shade=TRUE,
labeling_args=largs,
gp_args=gargs,
main = paste("Quasi-Symmetry", modFit(glass.qsymm)),
legend=FALSE,
rot_labels=c(20,90,0,70)
)
```
## Correspondence analysis ordering
When natural orders for row and column levels are not given a priori, we can find
orderings that make more sense using correspondence analysis.
The general ideas are that:
* Correspondence analysis assigns scores to the row and column variables to best account for the association in 1, 2, ... dimensions
* The first CA dimension accounts for largest proportion of the Pearson $\chi^2$
* Therefore, permuting the levels of the row and column variables by the CA Dim1 scores gives a more coherent mosaic plot, more clearly showing the nature of the association.
* The [seriation package](https://cran.r-project.org/package=seriation) now has a method to order variables in frequency tables using CA.
***Example***:
As an example, consider the `HouseTasks` dataset, a 13 x 4 table of frequencies of household tasks performed by couples, either by the `Husband`, `Wife`, `Alternating` or `Jointly`.
You can see from the table that some tasks (Repairs) are done largely by the husband; some (laundry, main meal)
are largely done by the wife, while others are done jointly or alternating between husband and
wife. But the `Task` and `Who` levels are both in alphabetical order.
```{r housetasks}
data("HouseTasks", package = "vcdExtra")
HouseTasks
```
The naive mosaic plot for this dataset is shown below, splitting first by `Task` and then by `Who`. Due to the length of the factor labels, some features of `labeling` were used to make the display more readable.
```{r housetasks-mos1}
require(vcd)
mosaic(HouseTasks, shade = TRUE,
labeling = labeling_border(rot_labels = c(45,0, 0, 0),
offset_label =c(.5,5,0, 0),
varnames = c(FALSE, TRUE),
just_labels=c("center","right"),
tl_varnames = FALSE),
legend = FALSE)
```
Correspondence analysis, using the [ca package](https://cran.r-project.org/package=ca),
shows that nearly 89% of the $\chi^2$ can be accounted for in two dimensions.
```{r housetasks-ca}
require(ca)
HT.ca <- ca(HouseTasks)
summary(HT.ca, rows=FALSE, columns=FALSE)
```
The CA plot has a fairly simple interpretation: Dim1 is largely the distinction between
tasks primarily done by the wife vs. the husband. Dim2 distinguishes tasks that are done
singly vs. those that are done jointly.
```{r housetasks-ca-plot}
plot(HT.ca, lines = TRUE)
```
So, we can use the `CA` method of `seriation::seriate()` to find the order of permutations of
`Task` and `Who` along the CA dimensions.
```{r housetasks-seriation}
require(seriation)
order <- seriate(HouseTasks, method = "CA")
# the permuted row and column labels
rownames(HouseTasks)[order[[1]]]
colnames(HouseTasks)[order[[2]]]
```
Now, use `seriation::permute()` to use `order` for the permutations of `Task` and `Who`,
and plot the resulting mosaic:
```{r housetasks-mos2}
# do the permutation
HT_perm <- permute(HouseTasks, order, margin=1)
mosaic(HT_perm, shade = TRUE,
labeling = labeling_border(rot_labels = c(45,0, 0, 0),
offset_label =c(.5,5,0, 0),
varnames = c(FALSE, TRUE),
just_labels=c("center","right"),
tl_varnames = FALSE),
legend = FALSE)
```
It is now easy to see the cluster of tasks (laundry and cooking) done largely by the wife
at the top, and those (repairs, driving) done largely by the husband at the bottom.
## References