-
Notifications
You must be signed in to change notification settings - Fork 9
Expand file tree
/
Copy patha1-creating.Rmd
More file actions
867 lines (695 loc) · 32.4 KB
/
a1-creating.Rmd
File metadata and controls
867 lines (695 loc) · 32.4 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
---
title: "1. Creating and manipulating frequency tables"
author: "Michael Friendly"
date: "`r Sys.Date()`"
package: vcdExtra
output:
rmarkdown::html_vignette:
fig_caption: yes
bibliography: ["vcd.bib", "vcdExtra.bib"]
csl: apa.csl
vignette: >
%\VignetteIndexEntry{1. Creating and manipulating frequency tables}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
message = FALSE,
warning = FALSE,
fig.height = 6,
fig.width = 7,
# fig.path = "fig/tut01-",
dev = "png",
comment = "##"
)
# save some typing
knitr::set_alias(w = "fig.width",
h = "fig.height",
cap = "fig.cap")
# Old Sweave options
# \SweaveOpts{engine=R,eps=TRUE,height=6,width=7,results=hide,fig=FALSE,echo=TRUE}
# \SweaveOpts{engine=R,height=6,width=7,results=hide,fig=FALSE,echo=TRUE}
# \SweaveOpts{prefix.string=fig/vcd-tut,eps=FALSE}
# \SweaveOpts{keep.source=TRUE}
# preload datasets ???
set.seed(1071)
library(vcd)
library(vcdExtra)
library(ggplot2)
data(HairEyeColor)
data(PreSex)
data(Arthritis, package="vcd")
art <- xtabs(~Treatment + Improved, data = Arthritis)
#if(!file.exists("fig")) dir.create("fig")
```
R provides many methods for creating frequency and contingency tables. Several are
described below. In the examples below, we use some real examples and some anonymous
ones, where the variables `A`, `B`, and `C` represent
categorical variables, and `X` represents an arbitrary R data object.
## Forms of frequency data
The first thing you need to know is that categorical data can be represented in
three different forms in R, and it is sometimes necessary to convert
from one form to another, for carrying out statistical tests, fitting models
or visualizing the results. Once a data object exists in R,
you can examine its complete structure with
the `str()` function, or view the names of its components with the
`names()` function.
### Case form
Categorical data in case form are simply data frames containing individual observations, with one or
more factors, used as the classifying variables. In case form, there may also
be numeric covariates.
The total number of observations is `nrow(X)`, and the number of variables is `ncol(X)`.
***Example***:
The `Arthritis` data is available in case form in the `vcd` package.
There are two explanatory factors: `Treatment` and `Sex`. `Age`
is a numeric covariate, and `Improved` is the response--- an ordered factor,
with levels
`r paste(levels(Arthritis$Improved),collapse=' < ')`.
Excluding `Age`, this represents
a $2 \times 2 \times 3$ contingency table for `Treatment`, `Sex` and `Improved`, but in case form.
```{r, case-form}
names(Arthritis) # show the variables
str(Arthritis) # show the structure
head(Arthritis,5) # first 5 observations, same as Arthritis[1:5,]
```
### Frequency form
Data in frequency form is also a data frame containing one or more factors, and a frequency
variable, often called `Freq` or `count`. The total number of observations
is: `sum(X$Freq)`, `sum(X[,"Freq"])` or some equivalent form.
The number of cells in the table is given by `nrow(X)`.
***Example***:
For small frequency tables, it is often convenient to enter them in frequency form
using `expand.grid()` for the factors and `c()` to list the counts in a vector.
The example below, from [@vcd:Agresti:2002] gives results for the 1991 General Social Survey,
with respondents classified by sex and party identification.
```{r, frequency-form}
# Agresti (2002), table 3.11, p. 106
GSS <- data.frame(
expand.grid(sex = c("female", "male"),
party = c("dem", "indep", "rep")),
count = c(279,165,73,47,225,191))
GSS
names(GSS)
str(GSS)
sum(GSS$count)
```
### Table form
Table form data is represented by a `matrix`, `array` or `table` object, whose elements are the frequencies
in an $n$-way table. The variable names (factors) and their levels are given by
`dimnames(X)`. The total number of observations
is `sum(X)`. The number of dimensions of the table is `length(dimnames(X))`,
and the table sizes are given by `sapply(dimnames(X), length)`.
***Example***:
The `HairEyeColor` is stored in table form in `vcd`.
```{r, table-form1}
str(HairEyeColor) # show the structure
sum(HairEyeColor) # number of cases
sapply(dimnames(HairEyeColor), length) # table dimension sizes
```
***Example***:
Enter frequencies in a matrix, and assign `dimnames`,
giving the variable names and category labels. Note that, by default,
`matrix()` uses the elements supplied by *columns* in the
result, unless you specify `byrow=TRUE`.
```{r, table-form2}
# A 4 x 4 table Agresti (2002, Table 2.8, p. 57) Job Satisfaction
JobSat <- matrix(c( 1, 2, 1, 0,
3, 3, 6, 1,
10,10,14, 9,
6, 7,12,11), 4, 4)
dimnames(JobSat) = list(
income = c("< 15k", "15-25k", "25-40k", "> 40k"),
satisfaction = c("VeryD", "LittleD", "ModerateS", "VeryS")
)
JobSat
```
`JobSat` is a **matrix**, not an object of `class("table")`, and some functions
are happier with tables than matrices.
You can coerce it to a table with `as.table()`,
```{r, table-form3}
JobSat <- as.table(JobSat)
str(JobSat)
```
## Ordered factors and reordered tables {#sec:ordered-factors}
In table form, the values of the table factors are ordered by their position in the table.
Thus in the `JobSat` data, both `income` and `satisfaction` represent ordered
factors, and the *positions* of the values in the rows and columns reflects their
ordered nature.
Yet, for analysis, there are times when you need *numeric* values for the levels
of ordered factors in a table, e.g., to treat a factor as a quantitative variable.
In such cases, you can simply re-assign the `dimnames` attribute of the table
variables. For example, here, we assign numeric values to `income` as the middle of their
ranges, and treat `satisfaction` as equally spaced with integer scores.
```{r, relevel, eval=FALSE}
dimnames(JobSat)$income <- c(7.5,20,32.5,60)
dimnames(JobSat)$satisfaction <- 1:4
```
For the `HairEyeColor` data, hair color and eye color are ordered arbitrarily.
For visualizing the data using mosaic plots and other methods described below, it
turns out to be more useful to assure that both hair color and eye color are
ordered from dark to light.
Hair colors are actually ordered this way already, and it is easiest to re-order
eye colors by indexing. Again `str()` is your friend.
```{r, reorder1}
HairEyeColor <- HairEyeColor[, c(1,3,4,2), ]
str(HairEyeColor)
```
This is also the order for both hair color and eye color shown in
the result of a correspondence analysis (@ref(fig:ca-haireye) below.
With data in case form or frequency form, when you have ordered factors
represented with character values, you must ensure that they are treated
as ordered in R.
<!-- \footnote{In SAS, many procedures offer the option -->
<!-- `order = data | internal | formatted` to allow character values -->
<!-- to be ordered according to (a) their order in the data set, (b) -->
<!-- sorted internal value, or (c) sorted formatted representation -->
<!-- provided by a SAS format. -->
<!-- } -->
Imagine that the `Arthritis` data was read from a text file.
By default the `Improved` will be ordered alphabetically:
`Marked`,
`None`,
`Some` --- not what we want. In this case, the function
`ordered()` (and others) can be useful.
```{r, reorder2, echo=TRUE, eval=FALSE}
Arthritis <- read.csv("arthritis.txt",header=TRUE)
Arthritis$Improved <- ordered(Arthritis$Improved,
levels=c("None", "Some", "Marked")
)
```
The dataset `Arthritis` in the `vcd` package is a data.frame in this form
With this order of `Improved`, the response in this data,
a mosaic display of `Treatment` and `Improved` (@ref(fig:arthritis) shows a clearly
interpretable pattern.
The original version of `mosaic` in the `vcd` package required the input to be
a contingency table in array form, so we convert using `xtabs()`.
<!-- ```{r Arthritis, height=6, width=7, fig.cap="Mosaic plot for the `Arthritis` data ..."} -->
```{r}
#| Arthritis,
#| fig.height = 6,
#| fig.width = 6,
#| fig.cap = "Mosaic plot for the `Arthritis` data, showing the marginal model of independence for Treatment and Improved. Age, a covariate, and Sex are ignored here."
#|
data(Arthritis, package="vcd")
art <- xtabs(~Treatment + Improved, data = Arthritis)
mosaic(art, gp = shading_max, split_vertical = TRUE, main="Arthritis: [Treatment] [Improved]")
```
Several data sets in the package illustrate the salutary effects of reordering factor levels
in mosaic displays and other analyses. See:
* `help(AirCrash)`
* `help(Glass)`
* `help(HouseTasks)`
The [seriate](https://CRAN.R-project.org/package=seriation) package now contains a general method
to permute the row and column variables in a table according to the result of a correspondence
analysis, using scores on the first CA dimension.
### Re-ordering dimensions
Finally, there are situations where, particularly for display purposes, you
want to re-order the *dimensions* of an $n$-way table, or change the
labels for the variables or levels.
This is easy when the data are in table form: `aperm()` permutes
the dimensions, and assigning to `names` and `dimnames`
changes variable names and level labels respectively.
We will use the following version of `UCBAdmissions` in
\@ref(sec:mantel) below.
^[Changing `Admit` to `Admit?` might be useful for display purposes, but is
dangerous--- because it is then difficult to use that variable name in a model formula.
See \@ref(sec:tips) for options `labeling_args` and `set_labels`to change variable and level names for displays in the `strucplot` framework.]
```{r, reorder3}
UCB <- aperm(UCBAdmissions, c(2, 1, 3))
dimnames(UCB)[[2]] <- c("Yes", "No")
names(dimnames(UCB)) <- c("Sex", "Admit?", "Department")
# display as a flattened table
stats::ftable(UCB)
```
## `structable()` {#sec:structable}
For 3-way and larger tables
the `structable()` function in `vcd` provides a convenient and flexible tabular display.
The variables assigned to the rows and columns of a two-way display can be specified
by a model formula.
```{r, structable}
structable(HairEyeColor) # show the table: default
structable(Hair+Sex ~ Eye, HairEyeColor) # specify col ~ row variables
```
It also returns an object of class `"structable"` which may be plotted with
`mosaic()` (not shown here).
```{r, structable1,eval=FALSE}
HSE < - structable(Hair+Sex ~ Eye, HairEyeColor) # save structable object
mosaic(HSE) # plot it
```
## `table()` and friends {#sec:table}
You can generate frequency tables from factor variables using the `table()` function, tables of
proportions using the `prop.table()` function, and marginal frequencies using
`margin.table()`.
For these examples, create some categorical vectors:
```{r, table-setup}
n=500
A <- factor(sample(c("a1","a2"), n, rep=TRUE))
B <- factor(sample(c("b1","b2"), n, rep=TRUE))
C <- factor(sample(c("c1","c2"), n, rep=TRUE))
mydata <- data.frame(A,B,C)
```
These lines illustrate `table`-related functions:
```{r, table-ex1}
# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table
margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell percentages
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
```
`table()` can also generate multidimensional tables based on 3 or more
categorical variables. In this case, you can use the `ftable()` or `structable()`
function to print the results more attractively.
```{r, table-ex2}
# 3-Way Frequency Table
mytable <- table(A, B, C)
ftable(mytable)
```
`table()` ignores missing values by default.
To include `NA` as a category in counts, include the
table option `exclude=NULL` if the variable is a vector. If the variable is a
factor you have to create a new factor using \code{newfactor <- factor(oldfactor,
exclude=NULL)}.
## `xtabs()` {#sec:xtabs}
The `xtabs()` function allows you to create cross-tabulations of data using formula style input.
This typically works with case-form data supplied in a data frame or a matrix.
The result is a contingency table in array format, whose dimensions are determined by
the terms on the right side of the formula.
```{r, xtabs-ex1}
# 3-Way Frequency Table
mytable <- xtabs(~A+B+C, data=mydata)
ftable(mytable) # print table
summary(mytable) # chi-square test of indepedence
```
If a variable is included on the left side of the formula, it is assumed to be a
vector of frequencies (useful if the data have already been tabulated in frequency form).
```{r, xtabs-ex2}
(GSStab <- xtabs(count ~ sex + party, data=GSS))
summary(GSStab)
```
## Collapsing over table factors: `aggregate()`, `margin.table()` and `apply()`
It sometimes happens that we have a data set with more variables or factors than
we want to analyse, or else, having done some initial analyses, we decide that
certain factors are not important, and so should be excluded from graphic displays
by collapsing (summing) over them. For example, mosaic plots and fourfold displays
are often simpler to construct from versions of the data collapsed over
the factors which are not shown in the plots.
The appropriate tools to use again depend on
the form in which the data are represented--- a case-form data frame, a
frequency-form data frame (`aggregate()`), or a table-form array or
table object (`margin.table()` or `apply()`).
When the data are in frequency form, and we want to produce another
frequency data frame, `aggregate()` is a handy tool, using
the argument `FUN=sum` to sum the frequency variable over the
factors *not* mentioned in the formula.
***Example***:
The data frame `DaytonSurvey` in the `vcdExtra` package represents a
$2^5$ table giving the frequencies of reported use (``ever used?'') of
alcohol, cigarettes and marijuana in a sample of high school seniors,
also classified by sex and race.
```{r, dayton1}
data("DaytonSurvey", package="vcdExtra")
str(DaytonSurvey)
head(DaytonSurvey)
```
To focus on the associations among the
substances, we want to collapse over sex and race. The right-hand side of the formula
used in the call to `aggregate()` gives the factors to be retained in the
new frequency data frame, `Dayton.ACM.df`.
```{r, dayton2}
# data in frequency form
# collapse over sex and race
Dayton.ACM.df <- aggregate(Freq ~ cigarette+alcohol+marijuana,
data=DaytonSurvey,
FUN=sum)
Dayton.ACM.df
```
When the data are in table form, and we want to produce another
table, `apply()` with `FUN=sum` can be used in a similar way
to sum the table over dimensions not mentioned in the `MARGIN`
argument. `margin.table()` is just a wrapper for `apply()`
using the `sum()` function.
***Example***:
To illustrate, we first convert the `DaytonSurvey` to a 5-way
table using `xtabs()`, giving `Dayton.tab`.
```{r, dayton3}
# in table form
Dayton.tab <- xtabs(Freq ~ cigarette+alcohol+marijuana+sex+race,
data=DaytonSurvey)
structable(cigarette+alcohol+marijuana ~ sex+race,
data=Dayton.tab)
```
Then, use `apply()` on `Dayton.tab` to give the
3-way table `Dayton.ACM.tab` summed over sex and race.
The elements in this new table are the column sums for
`Dayton.tab` shown by `structable()` just above.
```{r, dayton4}
# collapse over sex and race
Dayton.ACM.tab <- apply(Dayton.tab, MARGIN=1:3, FUN=sum)
Dayton.ACM.tab <- margin.table(Dayton.tab, 1:3) # same result
structable(cigarette+alcohol ~ marijuana, data=Dayton.ACM.tab)
```
Many of these operations can be performed using the `**ply()` functions
in the [`plyr`]( https://CRAN.R-project.org/package=plyr) package.
For example, with the data in a frequency form data frame, use `ddply()`
to collapse over unmentioned factors, and `plyr::summarise()`
as the function to be applied to each piece.
<!-- \footnote{ -->
<!-- Ugh. This `plyr` function clashes with a function of the same name in `vcdExtra`. -->
<!-- In this document I will use the explicit double-colon notation to keep them -->
<!-- separate. -->
<!-- } -->
```{r, dayton5}
library(plyr)
Dayton.ACM.df <- plyr::ddply(DaytonSurvey,
.(cigarette, alcohol, marijuana),
plyr::summarise, Freq=sum(Freq))
Dayton.ACM.df
```
## Collapsing table levels: `collapse.table()`
A related problem arises when we have a table or array and for some purpose
we want to reduce the number of levels of some factors by summing subsets
of the frequencies. For example, we may have initially coded Age in 10-year
intervals, and decide that, either for analysis or display purposes, we
want to reduce Age to 20-year intervals. The `collapse.table()` function
in `vcdExtra` was designed for this purpose.
***Example***:
Create a 3-way table, and collapse Age from 10-year to 20-year intervals.
First, we generate a $2 \times 6 \times 3$ table of random counts from a
Poisson distribution with mean of 100.
```{r, collapse1}
# create some sample data in frequency form
sex <- c("Male", "Female")
age <- c("10-19", "20-29", "30-39", "40-49", "50-59", "60-69")
education <- c("low", 'med', 'high')
data <- expand.grid(sex=sex, age=age, education=education)
counts <- rpois(36, 100) # random Possion cell frequencies
data <- cbind(data, counts)
# make it into a 3-way table
t1 <- xtabs(counts ~ sex + age + education, data=data)
structable(t1)
```
Now collapse `age` to 20-year intervals, and `education`
to 2 levels. In the arguments, levels of `age` and `education`
given the same label are summed in the resulting smaller table.
```{r, collapse2}
# collapse age to 3 levels, education to 2 levels
t2 <- collapse.table(t1,
age=c("10-29", "10-29", "30-49", "30-49", "50-69", "50-69"),
education=c("<high", "<high", "high"))
structable(t2)
```
## Collapsing table levels: `dplyr`
For data sets in frequency form or case form, factor levels can be collapsed by recoding the levels
to some grouping. One handy function for this is `dplyr::case_match()`
***Example***:
The `vcdExtra::Titanicp` data set contains information on 1309 passengers on the _RMS Titanic_,
including `sibsp`, the number of (0:8) siblings or spouses aboard, and `parch` (0:6), the
number of parents or children aboard, but the table is quite sparse.
```{r titanicSAVE, include=FALSE, echo=FALSE}
# Save unmodified Titanicp data for collapse_levels() example
saveDat <- Titanicp
```
```{r titanicp1}
table(Titanicp$sibsp, Titanicp$parch)
```
For purposes of analysis, we might want to collapse both of these to the levels `0, 1, 2+`.
Here's how:
```{r titanicp2}
library(dplyr)
Titanicp <- Titanicp |>
mutate(sibspF = case_match(sibsp,
0 ~ "0",
1 ~ "1",
2:max(sibsp) ~ "2+")) |>
mutate(sibspF = ordered(sibspF)) |>
mutate(parchF = case_match(parch,
0 ~ "0",
1 ~ "1",
2:max(parch) ~ "2+")) |>
mutate(parchF = ordered(parchF))
table(Titanicp$sibspF, Titanicp$parchF)
```
`car::recode()` is a similar function, but with a less convenient interface.
The [`forcats`]( https://CRAN.R-project.org/package=forcats) package provides a collection of functions for reordering the levels of a factor
or grouping categories according to their frequency:
* `forcats::fct_reorder()`: Reorder a factor by another variable.
* `forcats::fct_infreq()`: Reorder a factor by the frequency of values.
* `forcats::fct_relevel()`: Change the order of a factor by hand.
* `forcats::fct_lump()`: Collapse the least/most frequent values of a factor into “other”.
* `forcats::fct_collapse()`: Collapse factor levels into manually defined groups.
* `forcats::fct_recode()`: Change factor levels by hand.
## Collapsing table levels: `collapse_levels()`
As seen in the previous section, collapsing variables from datasets in frequency
or case form (in R) is not as intuitive/easy as it should be. Thus, the
`collapse_levels()` function was written to easily collapse variables contained
in datasets of *any form*.
Note that when your object `X` is in frequency form, an argument of
`freq = "your frequency column name"` must be supplied. This argument defaults
to a value of `freq = "Freq"`.
The use of this function is illustrated below using the same `Titanicp` data.
***Example***
First, factorize the `sibsp` and `parch` variables. Then collapse variables into
categories of `0`, `1`, and `2+`.
```{r titanicLOAD, include=FALSE, echo=FALSE}
# Load unmodified Titanicp data for collapse_levels() example
Titanicp <- saveDat
```
```{r collapse_level-ex1}
str(Titanicp) # The original dataset
Titanicp <- within(Titanicp, {
parchF <- factor(parch)
sibspF <- factor(sibsp)
})
Titanicp <- collapse_levels(
Titanicp,
parchF = list(
"2+" = c("2", "3", "4", "5", "6", "9")
),
sibspF = list(
"2+" = c("2", "3", "4", "5", "8")
)
)
table(Titanicp$sibspF, Titanicp$parchF)
```
For a more thorough overview of `collapse_levels()`, see
[1a. Steps Toward Tidy Categorical Data Analysis](a1a-convert-collapse.html).
## Converting among frequency tables and data frames
As we've seen, a given contingency table can be represented
equivalently in different forms,
but some R functions were designed for one particular representation.
The table below shows some handy tools for converting from one form to another.
<!-- [htb] -->
<!-- \caption{Tools for converting among different forms for categorical data} {#tab:convert} -->
<!-- {llll} -->
<!-- \hline -->
<!-- & \multicolumn{3}{c}{**To this**} \\ -->
<!-- **From this** & Case form & Frequency form & Table form \\ -->
<!-- \hline -->
<!-- Case form & noop & `xtabs(~A+B)` & `table(A,B)` \\ -->
<!-- Frequency form & `expand.dft(X)` & noop & `xtabs(count~A+B)`\\ -->
<!-- Table form & `expand.dft(X)` & `as.data.frame(X)` & noop \\ -->
<!-- \hline -->
| **From this** | | **To this** | |
|:-----------------|:--------------------|:---------------------|-------------------|
| | _Case form_ | _Frequency form_ | _Table form_ |
| _Case form_ | noop | `xtabs(~A+B)` | `table(A,B)` |
| _Frequency form_ | `expand.dft(X)` | noop | `xtabs(count~A+B)`|
| _Table form_ | `expand.dft(X)` | `as.data.frame(X)` | noop |
For example, a contingency table in table form (an object of `class(table)`) can be converted
to a data.frame with `as.data.frame()`.
^[Because R is object-oriented, this is actually a short-hand for the function `as.data.frame.table()`.]
The resulting
`data.frame` contains columns
representing the classifying factors and the table entries (as a column named by
the `responseName` argument, defaulting to `Freq`. This is the inverse of `xtabs()`.
***Example***:
Convert the `GSStab` in table form to a data.frame in frequency form.
```{r, convert-ex1}
as.data.frame(GSStab)
```
***Example***: Convert the `Arthritis` data in case form to a 3-way table of
`Treatment` $\times$ `Sex` $\times$ `Improved`.
Note the use of `with()` to avoid having to use `Arthritis\$Treatment` etc. within the call to `table()`.%
^[`table()` does not allow a `data` argument to provide an environment in which the table variables are to be found. In the examples in \@ref(sec:table) I used `attach(mydata)` for this purpose,
but `attach()` leaves the variables in the global environment, while `with()` just evaluates the `table()` expression in a temporary environment of the data.]
```{r, convert-ex2}
Art.tab <- with(Arthritis, table(Treatment, Sex, Improved))
str(Art.tab)
ftable(Art.tab)
```
There may also be times that you will need an equivalent case form `data.frame`
with factors representing the table variables
rather than the frequency table.
For example, the `mca()` function in package `MASS`
only operates on data in this format.
Marc Schwartz initially provided code for `expand.dft()` on the Rhelp
mailing list for converting a table back into a case form `data.frame`.
This function is included in `vcdExtra`.
***Example***: Convert the `Arthritis` data in table form (`Art.tab`) back to a `data.frame`
in case form, with factors
`Treatment`, `Sex` and `Improved`.
```{r, convert-ex3}
Art.df <- expand.dft(Art.tab)
str(Art.df)
```
### Tidy conversions
To make conversions more intuitive, `vcdExtra` includes tidy conversion
functions `as_table()`, `as_matrix()`, `as_array()`, `as_freqform()`, and `as_caseform()` to easily convert any form to the form of the function's
namesake (i.e., `as_table()` converts any form into table form). Arguments in
these functions are of similar format, with conversions *from* frequency form
requiring the column containing frequencies to be specified via the `freq`
argument. Likewise, `dims` (dimensions) may be specified, with output summing
over excluded variables. For a more thorough overview of these functions, see
[1a. Steps Toward Tidy Categorical Data Analysis](a1a-convert-collapse.html).
***Example***:
Convert the `GSStab` in table form to a `data.frame` in frequency form.
```{r, tidyconvert-ex1}
# tidy = TRUE (default) returns a tibble
GSS.ff <- as_freqform(GSStab, tidy = FALSE)
GSS.ff
```
***Example***:
Convert the `GSStab` data in frequency form (`GSS.ff`) back to table form.
```{r, tidyconvert-ex2}
as_table(GSS.ff, freq = "Freq") # When present, freq column must be specified
```
***Example***: Convert the `Arthritis` data in case form to a 3-way table of
`Treatment` $\times$ `Sex` $\times$ `Improved`.
```{r, tidyconvert-ex3}
tidy_Art.tab <- as_table(Arthritis, dims = c("Treatment", "Sex", "Improved"))
str(tidy_Art.tab)
ftable(tidy_Art.tab)
```
***Example***: Convert the `Arthritis` data in table form (`tidy_Art.tab`) back
to a `data.frame` in case form, with factors `Treatment`, `Sex` and `Improved`.
```{r, tidyconvert-ex4}
tidy_Art.df <- as_caseform(tidy_Art.tab, tidy = FALSE)
str(tidy_Art.df)
```
## A complex example {#sec:complex}
If you've followed so far, you're ready for a more complicated example.
The data file, `tv.dat` represents a 4-way table of size
$5 \times 11 \times 5 \times 3$ where the table variables (unnamed in the file)
are read as `V1` -- `V4`, and the cell frequency is read
as `V5`. The file, stored in the `doc/extdata` directory
of `vcdExtra`, can be read as follows:
```{r, tv1}
tv.data<-read.table(system.file("extdata","tv.dat", package="vcdExtra"))
head(tv.data,5)
```
For a local file, just use `read.table()` in this form:
```{r, tv2,eval=FALSE}
tv.data<-read.table("C:/R/data/tv.dat")
```
The data `tv.dat` came from the initial implementation of
mosaic displays in R by Jay Emerson.
In turn, they came from the initial development of mosaic displays
[@vcd:Hartigan+Kleiner:1984]
that illustrated the method with data on a large sample of TV viewers
whose behavior had been recorded for the Neilsen ratings.
This data set contains sample television audience data from Neilsen
Media Research for the week starting November 6, 1995.
The table variables are:
* `V1`-- values 1:5 correspond to the days Monday--Friday;
* `V2`-- values 1:11 correspond to the quarter hour times 8:00PM through 10:30PM;
* `V3`-- values 1:5 correspond to ABC, CBS, NBC, Fox, and non-network choices;
* `V4`-- values 1:3 correspond to transition states: turn the television Off, Switch channels, or Persist in viewing the current channel.
We are interested just the cell frequencies, and rely on the facts that the
(a) the table is complete--- there are no missing cells, so `nrow(tv.data)` = `r nrow(tv.data)`;
(b) the observations are ordered so that `V1` varies most rapidly and `V4` most slowly. From this, we can just extract the frequency column and reshape it into an array. [That would be dangerous if any observations were out of order.]
```{r, tv3}
TV <- array(tv.data[,5], dim=c(5,11,5,3))
dimnames(TV) <- list(c("Monday","Tuesday","Wednesday","Thursday","Friday"),
c("8:00","8:15","8:30","8:45","9:00","9:15","9:30",
"9:45","10:00","10:15","10:30"),
c("ABC","CBS","NBC","Fox","Other"),
c("Off","Switch","Persist"))
names(dimnames(TV))<-c("Day", "Time", "Network", "State")
```
More generally (even if there are missing cells), we can
use `xtabs()` (or `plyr::daply()`)
to do the cross-tabulation, using `V5` as the
frequency variable. Here's how to do this same operation with `xtabs()`:
```{r, tv3a,eval=FALSE}
TV <- xtabs(V5 ~ ., data=tv.data)
dimnames(TV) <- list(Day = c("Monday","Tuesday","Wednesday","Thursday","Friday"),
Time = c("8:00","8:15","8:30","8:45","9:00","9:15","9:30",
"9:45","10:00","10:15","10:30"),
Network = c("ABC","CBS","NBC","Fox","Other"),
State = c("Off","Switch","Persist"))
# table dimensions
dim(TV)
```
But this 4-way table is too large and awkward to work with. Among the networks,
Fox and Other occur infrequently.
We can also cut it down to a 3-way table by considering only viewers who persist
with the current station.
^[This relies on the fact that that indexing an array drops dimensions of length 1 by default, using the argument `drop=TRUE`; the result is coerced to the lowest possible dimension.]
```{r, tv4}
TV2 <- TV[,,1:3,] # keep only ABC, CBS, NBC
TV2 <- TV2[,,,3] # keep only Persist -- now a 3 way table
structable(TV2)
```
Finally, for some purposes, we might want to collapse the 11 times into a smaller number.
Half-hour time slots make more sense.
Here, we use `as.data.frame.table()` to convert the table back to a data frame,
`levels()` to re-assign the values of `Time`,
and finally, `xtabs()` to give a new, collapsed frequency table.
```{r, tv5}
TV.df <- as.data.frame.table(TV2)
levels(TV.df$Time) <- c(rep("8:00", 2),
rep("8:30", 2),
rep("9:00", 2),
rep("9:30", 2),
rep("10:00",2),
"10:30"
)
TV3 <- xtabs(Freq ~ Day + Time + Network, TV.df)
structable(Day ~ Time+Network, TV3)
```
We've come this far, so we might as well show a mosaic display. This is analogous to that used by
@vcd:Hartigan+Kleiner:1984.
```{r tv-mosaic1, fig.height=6, fig.width=7}
mosaic(TV3, shade = TRUE,
labeling = labeling_border(rot_labels = c(0, 0, 0, 90)))
```
This mosaic displays can be read at several levels, corresponding to the successive splits of
the tiles and the residual shading.
Several trends are clear for viewers who persist:
* Overall, there are about the same number of viewers on each weekday, with slightly more on Thursday.
* Looking at time slots, viewership is slightly greater from 9:00 - 10:00 overall and also
8:00 - 9:00 on Thursday and Friday
From the residual shading of the tiles:
* Monday: CBS dominates in all time slots.
* Tuesday" ABC and CBS dominate after 9:00
* Thursday: is a largely NBC day
* Friday: ABC dominates in the early evening
<!-- ```{r, tv4} -->
<!-- TV.df <- as.data.frame.table(TV) -->
<!-- levels(TV.df$Time) <- c(rep("8:00-8:59",4), -->
<!-- rep("9:00-9:59",4), -->
<!-- rep("10:00-10:44",3)) -->
<!-- TV3 <- xtabs(Freq ~ Day + Time + Network, TV.df) -->
<!-- structable(Day ~ Time+Network, TV3) -->
<!-- ``` -->
<!-- Whew! See \figref{fig:TV-mosaic} for a mosaic plot of the `TV3` data. -->
<!-- The table is too large to display conveniently, but we can show a subtable -->
<!-- by selecting the indices. This call to `ftable()` subsets the first three -->
<!-- levels of each factor. -->
<!-- ```{r tv4} -->
<!-- ftable(TV[1:3,1:3,1:3,1:3], col.vars = 3:4) -->
<!-- ``` -->
<!-- We've come this far, so we might as well show a mosaic display. This is analogous to that used by -->
<!-- @vcd:Hartigan+Kleiner:1984. The result is too complex to see very much. It would be useful to simplify -->
<!-- the table by collapsing one or more of the table dimensions, e.g., `Time`. -->
<!-- ```{r tv-mosaic2, fig.height=7, fig.width=7} -->
<!-- mosaic(TV[,,1:4,], shade=TRUE, -->
<!-- labeling_args = list(abbreviate_labs = c(8,5,1,1))) -->
<!-- ``` -->
# References