Note: This script is intended only as an aid for those who need a refresher in R. We will not talk about it explicitly during QHELP.
If you have completed the exercises and have questions about them or any other questions about R, I would be happy to answer your questions during QHELP. Just talk to me :)

If you notice any errors, please send me an email to:


1 Intro / Writing R Scripts

1.1 What is R?

  • R is a free software environment for statistical computing and graphics.
  • The design of R has been heavily influenced by the S language (Becker & Chambers, 1984).
  • R was originally created by Ihaka & Gentleman (1996) and is now being developed by an international team of scientists (R Core Team).
  • A new R version is released yearly, with relatively minor changes.
  • R runs on Unix, Macintosh, and Windows systems.

1.2 R as a calculator

Simple mathematical operations may be entered directly at the command line.

2 + 2  # plus  
3 - 7  # minus  
2 * 2  # times  
5 / 4  # divide  
log(3) # natural logarithm  
exp(1) # exponential function

Brackets can be used in longer expressions.

sqrt(2) * ( (5 - 1 / 6)^2 - pi / 2^(1/3) ) # pi is a built-in constant

(R ignores any input after the # symbol. Therefore, comments to the code are written preceded by #.)

1.3 Assigning values

Results of operations and function calls may be stored in a variable. The assignment operator is <- (less than + minus).

In order to print the value of a variable, the variable name needs to be entered.

x <- sqrt(2)
y <- x^2
x
## [1] 1.414214
y
## [1] 2

Valid variable names may consist of letters, numbers, underscores, and dots (e. g., errors.item6), but they may only start with letters or dots. R is case sensitive (both for functions and variable names).
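These rules, and case sensitivity, can be checked directly at the console:

```r
errors.item6 <- 12   # valid: letters, numbers, dots
x <- 1
X <- 10              # X is a different variable than x
x + X
## [1] 11
```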

1.4 Vectorized arithmetic

Vectors are created using the function c() (combine).

weight <- c(60, 72, 57, 90, 95, 72)
weight
## [1] 60 72 57 90 95 72

Mathematical operations may be applied to each element.

(weight - mean(weight))^2 # squared deviation
## [1] 205.444444   5.444444 300.444444 245.444444
## [5] 427.111111   5.444444

Many R functions accept vectors as arguments.

log(weight)
## [1] 4.094345 4.276666 4.043051 4.499810 4.553877
## [6] 4.276666

1.5 The offline help system in R

R offers a number of offline sources for getting help. The text-based documentation is called by ?. For example, the documentation for the rnorm() function is accessed by

?rnorm # or help(rnorm)

Words with syntactic meaning and special characters need to be enclosed in quotes:

?"if"

At the bottom of the help page, there usually are examples that can be copied and pasted into the R console.

Typing apropos("norm") displays functions in the currently loaded packages having the expression ‘norm’ in their name.

The help system of all installed packages is searched for the expression ‘norm’ via

help.search("norm") # or ??"norm"

1.6 Documentation

R ships with several manuals. An HTML based (local) help system is available via

help.start()

The manuals may also be downloaded in pdf format from https://cran.r-project.org/manuals.html. Highly readable are

  • An Introduction to R (with many examples)
  • R Data Import/Export

Further sources of information

  • Frequently Asked Questions
  • FAQ for Windows

These are available via help.start(), too.

1.7 Resources on the Internet

A wealth of information may be found on the R website http://www.r-project.org.

Under R Project/Search

  • R site search (within all default and CRAN packages), also via RSiteSearch("topic") from inside R
  • Searchable mail archives (especially R help)

Under Documentation

  • Manuals, FAQ
  • The R Journal
  • Books

R code examples for common data analysis tasks

At the Stack Exchange network, there are popular Q&A websites for R https://stackoverflow.com/tags/r and for applied statistics with R https://stats.stackexchange.com/tags/r

1.8 Working memory

All objects that are created during a session are kept in the working memory. Objects that are not needed any more should be deleted.

ls()          # list objects in working memory
rm(A)         # remove object A
rm(list = ls()) # remove all objects

When quitting a session (q()), the working memory can be written to a file, so the objects are available upon the next start of R.
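The same can also be done manually at any point during a session; a temporary file is used here only to keep the sketch self-contained:

```r
f <- tempfile(fileext = ".RData")
save.image(f)  # write all objects in the working memory to f
# ... in a later R session:
load(f)        # recreate the saved objects
```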

Note that this is usually unnecessary. Rather write the R code that created the objects into a file (using a text editor). In this way, you will keep your entire analysis documented.

1.9 Working directory

When R is started, its working directory is set to a default location.

getwd() # get working directory
dir()   # list files in working directory

It is recommended to change the working directory to the current project directory. This keeps file paths short, and prevents you from accidentally overwriting data.

setwd("D:/projects/experiment3/analysis") # set working directory

Note that this command is likely to cause an error on any computer but yours, because the directory is specific to you. Delete or comment out this command before sharing an R script.

1.10 Packages

R comes with a standard library of packages

  • Default (or ‘base’) packages: These are only updated with new releases of R
  • Recommended packages: Available and updated at CRAN (The Comprehensive R Archive Network)

A large part of the functionality of R is available in additional add-on packages.

library()               # which packages are installed?
library(foreign)        # load package foreign
search()                # which packages are loaded?
detach(package:foreign) # 'unload' package foreign

In order to install additional packages, the computer must be connected to the Internet. Packages are installed via

install.packages("packagename")

1.11 Using the R console and your editor

The R command line allows for:

  • Auto-completion using <Tab>
    • For function names, function arguments, objects
    • Press once to complete to the longest unique (partial) name
    • Press twice to show all possible names
  • Command history browsing and editing
    • Use arrow keys (up/down) to browse
    • Use arrow keys (left/right) and Pos1 (or Home) to move to start of line, etc.
  • Using it for quick and efficient ‘trial & error’ coding

Use your editor (e.g., RStudio editor / Vim etc.) to:

  • Retain working code within a script serving a fixed purpose
  • Formatting (longer) code
  • Comment code

1.12 Example workflow

Short/Simple Code, ‘trial & error’, data input

  • Write code directly into the R Console. If it works and gives the desired result, copy it into your R script

More complex code: loops (for, while), if/else, replicate, etc.

  • Write and edit the code in the R script. To run it, copy the code into the R console.

It is good practice to regularly test whether an R script can run in its entirety, especially before sharing it. To do so, clear your working memory with rm(list = ls()) (or restart R) and execute the entire R code in a text file:

source("filename.R") # source file located in R's working directory

1.13 R scripts: a note on style

Hints on ‘good’ R style are given in the following documents:

These documents are opinionated, but readable. There is no official style guide by the R Core Team, although there are strong opinions about what constitutes good R style. Consistency is usually considered an important criterion for good style. Style errors in your submitted homework will be pointed out.
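As a minimal illustration of the spacing conventions that most style guides agree on (a sketch, not an official rule set):

```r
x <- mean(c(1, 5, 9))   # recommended: spaces around <- and after commas
x<-mean(c(1,5,9))       # legal, but harder to read
```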

1.14 RStudio

“RStudio is an Integrated Development Environment [IDE; ‘integrierte Entwicklungsumgebung’]. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.”
- RStudio.com

Open Source Edition:

1.15 RStudio Tips & Tricks

Some useful shortcuts

  • Save active document: Ctrl + S or Command + S

  • Run current line/selection: Ctrl + Enter or Command + Enter

  • Source the current document: Ctrl + Shift + S or Shift + Command + S

  • Comment/uncomment current line/selection: Ctrl + Shift + C or Shift + Command + C

  • Reindent lines: Ctrl + I or Command + I

  • To access all shortcuts, type Alt + Shift + K on Linux and Windows or Option + Shift + K on a Mac.

1.16 Exercises


A1: Basic handling of vectors

Create two vectors \(\mathbf{x} = (5~7~8~1~2~4~5)'\) and \(\mathbf{y} = (2~8~1~3~2~4~1)'\)

  1. For both vectors, calculate the mean, variance, standard deviation and median. When calculating, first try to use the formulas given below. Calculate these parameters again using the implemented functions in R and compare whether the results are the same.

    • Mean: The statistical average value. \[\bar x = \frac{1}{n}\sum_{i = 1}^{n} x_i\]
    • Variance: indicates the dispersion of the data around the mean value. \[s^2 = \frac{1}{n-1} \sum_{i = 1}^{n} (x_i - \bar x)^2\]
    • Standard deviation: the square root of variance. \[s = \sqrt{s ^ 2}\]
    • Median: the value exactly in the middle of a series sorted by size. \[ \tilde{x} = \begin{cases} x_{m + 1}, & \text{for odd } n = 2m + 1 \\ \frac{x_m + x_{m + 1}}{2}, & \text{for even } n = 2m \end{cases} \]
  2. Generate a scatter plot of \(\mathbf{x}\) and \(\mathbf{y}\).

Use the R help:
With ?... you can call the documentation of certain functions and with ??... you can search for keywords in the documentation. The latter is helpful if you do not yet know the exact name of the function.
When searching for keywords, it can happen that a lot of results are displayed. Since in this task only functions from the packages base, stats or graphics are used, look for the prefix base::, stats:: or graphics::. For this task, the following R help calls may be useful:
- ?c
- ?sum
- ?length
- ??"arithmetic mean"
- ??"standard deviation"
- ??median
- ??variance
- ?plot
Use ls() to check whether the results of all calculations are assigned to one variable each and are available in memory.

2 Data structures in R (1)

2.1 Functions and arguments

Many things in R are done using function calls. Functions consist of

  • a name
  • a pair of brackets
  • the arguments (none, one, or more)
  • a return value (visible, invisible, NULL)

Example:

weight <- c(60, 72, 57, 90, 95, 72)
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
plot(height, weight)

Arguments may be set to default values; these are documented in ?plot.default (which plot() calls if no specific plot method exists for the first argument).

2.2 Functions and arguments

Arguments are passed

  • either without name (in the defined order) -> positional matching
  • or with name (in arbitrary order) -> keyword matching

plot(height, weight, pch = 16, col = "blue")

Even if no arguments are passed, the brackets need to be written, e. g., ls(), dir(), getwd().

Entering only the name of a function without brackets will display the R code of that function.
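Both matching styles, and the effect of omitting the brackets, can be tried with seq():

```r
seq(1, 10, 2)                   # positional: from, to, by
seq(by = 2, to = 10, from = 1)  # keyword: same call, arbitrary order
## [1] 1 3 5 7 9
seq                             # no brackets: prints the function itself
```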

2.3 Vectors

Vectors are of a defined mode and each element in that vector has that same mode, e.g.:

mode(c(60, 72, 57, 90, 95, 72))
## [1] "numeric"

Other than numeric vectors there are

  • character vectors with their elements in quotation marks

  • logical vectors with values TRUE and FALSE

    c("subj01", "subj02", "subj03") # character vector
    c(TRUE, TRUE, FALSE, TRUE)      # logical vector

Logical vectors often result from a comparison

weight > 60
## [1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE

2.4 Creating vectors

Vectors are created by

  • concatenating elements using c()

  • sequences

    0:10                  # 11 numbers from 0 to 10
    seq(1.65, 1.90, .05)  # sequence in steps of 0.05 from 1.65 to 1.90
  • repeating elements

    rep(1:5, 3)                     # repeat vector 1:5 3 times
    rep(c("high", "low"), each = 3) # repeat each element 3 times
    rep(c("f", "m"), c(5, 2))       # repeat "f" 5 times and "m" 2 times
    paste("stim", 1:5, sep = "")    # append 1:5 to "stim"

2.5 Matrices and arrays

Generating a 3 × 4 matrix

A <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

Labeling and transposing the matrix

rownames(A) <- c("a1", "a2", "a3") # analogously: colnames()
t(A)                               # transpose
##      a1 a2 a3
## [1,]  1  5  9
## [2,]  2  6 10
## [3,]  3  7 11
## [4,]  4  8 12

R implements linear algebra and matrix operations, such as matrix multiplication (using %*%) and inversion (using solve()).
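For example, with a small invertible matrix (a sketch; any nonsingular matrix works):

```r
B <- matrix(c(2, 1, 1, 3), nrow = 2)  # 2 x 2 matrix
B %*% solve(B)        # matrix times its inverse: the 2 x 2 identity
solve(B, c(1, 1))     # solves the linear system B x = (1, 1)'
```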

2.6 Matrices and arrays

A matrix may be created by concatenating columns or rows.

cbind(a1 = 1:4, a2 = 5:8, a3 = 9:12) # "column bind"
rbind(a1 = 1:4, a2 = 5:8, a3 = 9:12) # "row bind"

Arrays are data structures having more than two dimensions.

array(c(A, 2 * A), c(3, 4, 2)) # 3 x 4 x 2 array

2.7 Factors

Factors are data structures for categorical variables, such as diagnosis, socio-economic status, or sex.

ses <- factor(c("low", "inter", "high", "low"))
ses
## [1] low   inter high  low  
## Levels: high inter low

The ordering of factor levels is changed via the levels argument.

ses2 <- factor(ses, levels = c("low", "inter", "high"))
ses2
## [1] low   inter high  low  
## Levels: low inter high

The labels argument is used to change the labels of factor levels.

ses3 <- factor(ses2, labels = c("normal", "normal", "high"))
ses3
## [1] normal normal high   normal
## Levels: normal high

Experimental factors may easily be generated by combining factor() and rep(), and providing the desired labels.

factor(rep(1:2, each = 10), labels = c("on", "off"))
##  [1] on  on  on  on  on  on  on  on  on  on  off off
## [13] off off off off off off off off
## Levels: on off

2.8 Exercises


A1: NA and mean()

Use the function mean() to find the mean value of the vector \(\mathbf{a} = (1~3~7~1~2~6~7~8~9~\mathrm{NA})'\).

  • Why is the result NA?
  • How can you calculate the mean without the missing value?

A2: Positional and Keyword Matching

Using the function mean(), understand what is meant in the script by the terms ‘positional matching’ and ‘keyword matching’.

  • What arguments does the function take (see ?mean)?
  • What does the trim argument do?

First use all arguments without explicitly specifying their names (positional matching). Then change the order of the arguments (keyword matching).

A3: Practical handling of vectors

Create a vector \(\mathbf{x} = (1~5~8~3~7~2~6)'\).
Create a second vector \(\mathbf{y}\) of the same length containing the odd numbers \(1, 3, 5, \ldots\). Use the function seq() for this.

Calculate a vector \(\mathbf{z}\) as follows: \(\mathbf{z} = 4\mathbf{x} + 2\mathbf{y}\) (\(\mathbf{z}\) is a linear combination of \(\mathbf{x}\) and \(\mathbf{y}\)).

Create as efficiently as possible a matrix \(\mathbf{B}\) consisting of the column vectors \(\mathbf{x}\), \(\mathbf{y}\), and \(\mathbf{z}\).
Then create a matrix \(\mathbf{C}\) in which the three vectors form the row vectors in two ways:
First by combining the vectors \(\mathbf{x}\), \(\mathbf{y}\), and \(\mathbf{z}\) into a new matrix, and then by transposing the already created matrix \(\mathbf{B}\).

Read out the vector \(\mathbf{y}\) from \(\mathbf{B}\) and \(\mathbf{C}\) again by indexing (see ?Extract).

A4: Creating sequences

Use the function seq() to create the following sequences of numbers:

  • \(5, 10, 15, 20, \dots\)
  • \(0.1, 0.2, 0.3, \dots\)
  • \(0.25, 0.5, 0.75, 1.0, \dots\)

Each sequence should contain ten elements. Try to create the vectors also with the colon operator :.

A5: Using rep()

Create the vector \((a~a~b~c~d~e~e)'\) with the rep() function.
Why does the input rep(letters[1:5], 2) not give the desired result?

A6: Efficiently create vectors

Create the following vectors (without ‘writing them out’):

  • \((13~15~17~19~21~23~25~27~29~31~33)'\)
  • \(({-2}~{-1}~0~1~2~3~4~5~6~7~8)'\)
  • \((\text{exp1}~\text{exp2}~\text{exp3}~\dots~\text{exp15})'\)

What is the length of each vector? Answer this question by calling a function.

A7: Create matrices

Generate the matrix \[ \mathbf{A} = \begin{pmatrix}11 & 12 & 13 & 14 \\ 15 & 16 & 17 & 18\end{pmatrix}. \] Form \(\mathbf{A}'\) (the transposed matrix of \(\mathbf{A}\)) once by using a function and once by indexing.

A8: Elements of factors

Create a factor G with G <- factor(rep(c(0, 2, 5), 10)). Execute the following R commands:

mean(G)
mean(as.numeric(G))
mean(as.character(G))
mean(as.numeric(as.character(G)))

Compare the results and explain the differences.

Use the functions table() or xtabs() to check whether each level of the factor is present equally often.

3 Data structures in R (2)

3.1 Lists

It is often necessary to store an ordered collection of R objects with more than one mode in a single object: a list.

list1 <- list(w = weight, h = height, s = ses2, A = A)

The components of a list are extracted using $ and the name of the component, or using [[]] and either the name or number (index) of the component.

list1$h # or list1[["h"]] or list1[[2]]
## [1] 1.75 1.80 1.65 1.90 1.74 1.91
list1$A # or list1[["A"]] or list1[[4]]
##    [,1] [,2] [,3] [,4]
## a1    1    2    3    4
## a2    5    6    7    8
## a3    9   10   11   12

3.2 The fundamental data structure: data frames

Data frames are lists that consist of vectors and factors of equal length. The rows in a data frame refer to one unit (observation or subject).

id <- factor(paste("s", 1:6, sep = ""))
dat <- data.frame(id, weight, height)
dat
##   id weight height
## 1 s1     60   1.75
## 2 s2     72   1.80
## 3 s3     57   1.65
## 4 s4     90   1.90
## 5 s5     95   1.74
## 6 s6     72   1.91

Variables in a data frame are extracted by $ or []. (see also: Indexing data frames)

dat$id
## [1] s1 s2 s3 s4 s5 s6
## Levels: s1 s2 s3 s4 s5 s6
dat[, 1]
## [1] s1 s2 s3 s4 s5 s6
## Levels: s1 s2 s3 s4 s5 s6
dat[, "id"]
## [1] s1 s2 s3 s4 s5 s6
## Levels: s1 s2 s3 s4 s5 s6

3.2.1 Working with data frames

Frequently used functions (not only) for data frames

dim(dat)     # show number of rows and columns
## [1] 6 3
names(dat)   # variable names
## [1] "id"     "weight" "height"
str(dat)     # show variables of dat
## 'data.frame':    6 obs. of  3 variables:
##  $ id    : Factor w/ 6 levels "s1","s2","s3",..: 1 2 3 4 5 6
##  $ weight: num  60 72 57 90 95 72
##  $ height: num  1.75 1.8 1.65 1.9 1.74 1.91
summary(dat) # descriptive statistics
##   id        weight          height     
##  s1:1   Min.   :57.00   Min.   :1.650  
##  s2:1   1st Qu.:63.00   1st Qu.:1.742  
##  s3:1   Median :72.00   Median :1.775  
##  s4:1   Mean   :74.33   Mean   :1.792  
##  s5:1   3rd Qu.:85.50   3rd Qu.:1.875  
##  s6:1   Max.   :95.00   Max.   :1.910
plot(dat)    # pairwise plots

3.3 Indexing variables

Elements of a variable can be accessed using [] (see ?Extract).

weight <- c(60, 72, 57, 90, 95, 72)
weight[4]          # 4th element
weight[4] <- 92    # change 4th element
weight[c(1, 2, 6)] # elements 1, 2, 6
weight[1:5]        # elements 1 to 5
weight[-3]         # without element 3

Indices may be logical.

weight[weight > 60]               # values greater than 60
weight[weight > 60 & weight < 80] # between 60 and 80

Logical expressions may contain

  • AND &
  • OR |
  • EQUAL ==
  • NOT EQUAL !=
  • <, <=, >, >=, %in%.

For more information on logical expressions, see ?Comparison (or ?">"), ?Logic (or ?"&"), and ?match (or ?"%in%").
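A few of these operators applied to the weight vector from above:

```r
weight <- c(60, 72, 57, 90, 95, 72)
weight == 72              # EQUAL
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE
weight %in% c(57, 95)     # is each element contained in the set?
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE
weight[!(weight > 60)]    # NOT: values not greater than 60
## [1] 60 57
```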

3.4 Indexing data frames

Data frames have a row and a column index. Indices can be numeric or character vectors. Omitting one index selects all rows or columns, respectively.

dat[3, 2]                # 3rd row, 2nd column
dat[1:4, ]               # rows 1 to 4, all columns
dat[, c(2, 3)]           # all rows, columns 2 and 3
dat[, "id"]              # all rows, column "id"
dat[, c("id", "weight")] # all rows, columns "id" and "weight"

Using logical indices, specific rows may be selected for which certain conditions hold.

dat[dat$id == "s2", ]                  # all obs. of s2
dat[dat$id %in% c("s2","s5"), ]        # all obs. of s2 and s5
dat[dat$id == "s2" | dat$id == "s5", ] # same as above
dat[dat$weight > 60, ]                 # all obs. above 60kg

Matrices and arrays have two or more indices.

arr <- array(c(A, 2 * A), c(3, 4, 2))
arr[, , 2]    # 2nd "slice" ("Schicht")

3.5 Sorting

Data frames are most conveniently sorted using order() which returns the indices of the ordered variable.

order(dat$weight)        # weight: 60 72 57 90 95 72
dat[order(dat$weight), ] # sort by weight

order() accepts multiple arguments to allow for sorting by more than a single variable.

df[order(df$A, df$B), ] # first sort by A, then by B
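A self-contained example (df is a small made-up data frame):

```r
df <- data.frame(A = c(2, 1, 2, 1), B = c(30, 10, 10, 20))
df[order(df$A, df$B), ]  # sorted by A; ties in A broken by B
##   A  B
## 2 1 10
## 4 1 20
## 3 2 10
## 1 2 30
```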

3.6 Building contingency tables

Contingency tables contain the counts of one or more variables, usually factors:

table(mtcars$gear)
## 
##  3  4  5 
## 15 12  5
table(mtcars$gear, mtcars$cyl)
##    
##      4  6  8
##   3  1  2 12
##   4  8  4  0
##   5  2  1  2

The xtabs() function takes a formula as its first argument:

xtabs(~ gear + cyl, mtcars)
##     cyl
## gear  4  6  8
##    3  1  2 12
##    4  8  4  0
##    5  2  1  2

3.7 Aggregating

Frequently, one wants to aggregate a data frame, for example in order to average over multiple observations within factor levels. This may be achieved using the aggregate() function.

# Single response variable, single grouping variable
aggregate(y ~ x, data, FUN, ...)

# Multiple response and grouping variables
aggregate(cbind(y1, y2) ~ x1 + x2, data, FUN, ...)

The y variables are numeric data to be split into groups according to the grouping variables x (usually factors). FUN is a function to compute the statistics.

3.7.1 Aggregating: example

Average horse power (hp) and fuel efficiency (mpg; miles per gallon) for each combination of number of cylinders (cyl) and gears (gear).

aggregate(hp ~ cyl + gear, mtcars, mean)
##   cyl gear       hp
## 1   4    3  97.0000
## 2   6    3 107.5000
## 3   8    3 194.1667
## 4   4    4  76.0000
## 5   6    4 116.5000
## 6   4    5 102.0000
## 7   6    5 175.0000
## 8   8    5 299.5000
aggregate(mpg ~ cyl + gear, mtcars, mean)
##   cyl gear    mpg
## 1   4    3 21.500
## 2   6    3 19.750
## 3   8    3 15.050
## 4   4    4 26.925
## 5   6    4 19.750
## 6   4    5 28.200
## 7   6    5 19.700
## 8   8    5 15.400
# combined in one function call
aggregate(cbind(hp, mpg) ~ cyl + gear, mtcars, mean)
##   cyl gear       hp    mpg
## 1   4    3  97.0000 21.500
## 2   6    3 107.5000 19.750
## 3   8    3 194.1667 15.050
## 4   4    4  76.0000 26.925
## 5   6    4 116.5000 19.750
## 6   4    5 102.0000 28.200
## 7   6    5 175.0000 19.700
## 8   8    5 299.5000 15.400

3.8 Reshaping

Frequently, data are in either long (one line per observation) or wide (one line per case) format. reshape() can be used to switch between both formats.

head(Indometh) # long data format

# From long to wide
df_w <- reshape(Indometh, v.names = "conc", timevar = "time",
                idvar = "Subject", direction = "wide")

Reshaping: wide to long

# From wide to long
df_l <- reshape(df_w, varying = list(2:12), v.names = "conc",
                idvar = "Subject", direction = "long",
                times = c(.25, .5, .75, 1, 1.25, 2, 3, 4, 5, 6, 8))

# Then, optionally, re-order the data frame by subject
df_s <- df_l[order(df_l$Subject), ]

3.9 Exercises


A1: Simple handling of vectors, matrices and lists

Create the following objects in R:

  • vector x with 20 elements randomly drawn from the sequence 1:250;
  • vector y containing the elements from x in random order (permuted);
  • a matrix \(\mathbf{A}\): \[ \begin{pmatrix}1 & 3 & 7\\ 4 & 6 & 9\end{pmatrix} \]

Store \(\mathbf{x}\), \(\mathbf{y}\) and \(\mathbf{A}\) in a list and name the elements of this list. Extract the three elements of this list in three different ways.

A2: Create a dataframe

Create a dataframe dat for 60 subjects and three variables:

  • id subject number with levels 1 to 60

  • order with the levels ‘same’ and ‘different’ (pay attention to the order of the levels!)

  • condition with levels A, B and C.

Order and condition are crossed factors in a balanced design, i.e. there are ten subjects in each of the six experimental conditions. Apply str(), summary(), table() or xtabs() to the dataframe to check the structure.

A3: Using Dataframe

Simulate the result of a psychological test for each subject and write this into a new variable Score in the previously created dataframe.

To do this, first create the variable Score using dat$Score <- NA, which assigns the value NA to each subject. Now assume that the following results (points achieved in the test) are possible depending on the condition:

  • In condition A: \(7 ~ 9 ~ 14 ~ 32 ~ 65\)
  • In condition B: \(8 ~ 15 ~ 23 ~ 33\)
  • In condition C: \(5 ~ 6 ~ 25 ~ 43 ~ 50\)

For the subjects in each condition, randomly draw a score from the corresponding possible scores and write it into the variable Score of the dataframe.

Then use aggregate() to calculate the mean values of the test results in all six experimental conditions. Find the minimum and maximum of the test results in each experimental condition using the min() and max() functions in combination with aggregate().

A4: Indexing from a dataframe

What types of indexing do you know to extract individual columns from the dataframe you have just created? Read the variable Score with all these possibilities.

4 Data input and output

4.1 Reading tabular text files

ASCII text files in tabular or spread sheet form (one line per observation, one column per variable) are read via read.table().

dat <- read.table("D:/experiment/data.txt", header = TRUE)

This creates a data frame; numerical variables are converted to numeric vectors, character variables to character vectors. It is recommended to first open any file which (possibly) is a text file in your text editor (e.g., Vim / RStudio Editor) to find out which options to set when reading it into R.

Frequently used options

  • header: variable names in the first line?
  • stringsAsFactors: convert character vectors to factors?
  • sep: which delimiter between variables (white space, tabulator, comma, semicolon etc.)?
  • dec: decimal point or comma?
  • skip: number of lines to skip before reading data
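These options can be tried without an actual file via the text argument; here a sketch of semicolon-separated input with decimal commas, as is common in German locales:

```r
txt <- "id;score
s1;1,5
s2;2,25"
read.table(text = txt, header = TRUE, sep = ";", dec = ",")
##   id score
## 1 s1  1.50
## 2 s2  2.25
```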

4.2 Reading arbitrary text files

ASCII text files in arbitrary form are read using readLines().

raw <- readLines("rawdata.txt")

This creates a character vector with as many elements as lines of text. The character vector is further processed using string functions or regular expressions (?regex).
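A sketch of such post-processing; the two lines below are made up and stand in for the contents of rawdata.txt:

```r
raw <- c("subj01: 523 ms", "subj02: 487 ms")  # as returned by readLines()
grepl("ms$", raw)                       # which lines end in 'ms'?
rt <- sub(".*: (\\d+) ms", "\\1", raw)  # extract the digits
as.numeric(rt)
## [1] 523 487
```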

Example:

readLines(file.path(Sys.getenv("R_HOME"), "doc", "THANKS"))

4.3 Reading files created by other software

Several possible strategies

  • Save data as text file in other software. Then employ one of the functions from the read.table() family.

  • Reading SPSS data files

    library(foreign) # load add-on package
    dat <- read.spss("file.sav", to.data.frame = TRUE)
  • Additional information (e. g., about Excel files, data bases) in R Data Import/Export manual

4.4 Combining data frames

Data frames having the same column names can be combined using rbind().

dat <- rbind(dat1, dat2, dat3)

More flexibility is provided by merge().

Example: merge by ID

dat <- merge(dat1, dat2, by = "id", all = TRUE)

This creates a data frame with as many rows as different IDs. The new data frame includes all columns in dat1 and dat2, possibly with missing values if IDs are in either dat1 or dat2, but not in both.
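A minimal worked example with two made-up data frames:

```r
dat1 <- data.frame(id = c("s1", "s2", "s3"), weight = c(60, 72, 57))
dat2 <- data.frame(id = c("s2", "s3", "s4"), height = c(1.80, 1.65, 1.90))
merge(dat1, dat2, by = "id", all = TRUE)
##   id weight height
## 1 s1     60     NA
## 2 s2     72   1.80
## 3 s3     57   1.65
## 4 s4     NA   1.90
```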

4.5 Data entry

Small data vectors can be copied into the R console using scan():

x <- scan(text = "<data>")    # for numeric input
y <- scan(text = "<data>",
          what = "character") # for character input

A simple spread-sheet editor is invoked by

dat <- edit(data.frame()) # start with empty dataframe

Usually, it is more convenient to use external software (text editor, spread-sheet application) for data entry. These (text) files with the raw data are then read into R.

For reproducibility: Data preparation, processing, and validation should be entirely done in R. Only as little manual editing as necessary, but as much script-based data processing as possible!

4.6 Writing data

Writing text files

write.table(dat, "exp1_data.txt", ...) # many optional arguments

Text files are maximally portable (also across software packages), human readable, but may lose some features of the data structures (e. g., ordering of factor levels).

R binary files

save(dat, file = "exp1_data.rda") # write binary file
load("exp1_data.rda") # recreate dat

R binary files are exact representations of arbitrary data structures (e.g., data frames, lists), portable across platforms (within R only), but not human readable.

5 Programming in R

5.1 Conditional execution

Conditional execution of code is available in R via

if(expr_1) {
  expr_2
} else {
  expr_3
}

where expr_1 must evaluate to a single logical value. Additional conditions can be implemented via else if:

if(expr_1) {
  expr_2
} else if(expr_3) {
  expr_4
} else {
  expr_5
}

where expr_3 must also evaluate to a single logical value.

Example

x <- 3
if(!is.na(x)) {
  y <- x ^ 2
} else {
  stop("x is missing")
}
y

x <- NA
if(!is.na(x)) {
  y <- x ^ 2
} else {
  stop("x is missing")
}

There is also a vectorized version of the if/else construct, the ifelse() function:

x <- 1:3
y <- ifelse(x > 1, x * 2, x / 2)

5.2 Loops

There are for() and while() loops for repeated execution:

  • A for()-loop is used to iterate over items of an object.
  • A while()-loop allows code to be executed repeatedly while a given Boolean condition is true.

5.2.1 for()-loop example:

Subtract the mean value \(\bar{x}\) from each \(x_i\) with \(i = 1, 2, \ldots,n\).

x <- rnorm(10)
y <- numeric(10) # allocate space for results
for(i in seq_along(x)) { # for i = 1, 2, ..., 10
  y[i] <- x[i] - mean(x) # replace element with result
}
y

# In R, however, no loop is required for this 
# since a vectorized calculation is possible.
y1 <- x - mean(x)
y1

5.2.2 while()-loop example

Babylonian method (a.k.a. Heron’s method) to approximate \(\sqrt{S}\). Iterative formula:

\[ x_{n+1}={\frac {1}{2}}\left(x_{n}+{\frac {S}{x_{n}}}\right) \]

S <- 2
      
x <- 1          # starting value
last_x <- x/2    
while(abs(x - last_x) > 1e-10){ # boolean condition to check before each iteration
  last_x <- x
  print(paste0("Current approximation: ", x))
  x <- (1 / 2) * (last_x + (S / last_x))
}

5.3 Avoiding loops

Warning: loops are used in R code much less often than in compiled languages. Code that takes a ‘whole object’ (vectorized) view is likely to be both clearer and faster in R.

The apply() family of functions may be used in many places where in traditional languages loops are employed.

Matrices and arrays: apply()

apply() is used to work vector-wise on matrices or arrays. Appropriate functions can be applied to the columns or rows of a matrix or array without explicitly writing code for a loop.

X <- matrix(rnorm(20), nrow = 5, ncol = 4)
apply(X, 2, max)   # maximum for each column
## [1]  0.40077657 -0.10453052 -0.05198225  2.25512248
apply(X, 1, mean)  # mean of each row; rowMeans(x)
## [1] -0.03323771 -0.15488379 -0.20236019 -0.59688721
## [5] -0.20881338
apply(X, 1, function(x) {max(x) - min(x)})  # custom function
## [1] 3.6743271 2.0647388 0.3004656 4.2163695 0.8620221

Data frames, lists and vectors: lapply() and sapply()

Using the function lapply() (l because the value returned is a list), a function can be applied element-wise to other objects, for example, data frames, lists, or simply vectors. The resulting list has as many elements as the original object to which the function is applied.

X_df <- as.data.frame(X)
# apply(X_df, max) does NOT work: apply() requires a MARGIN argument
lapply(X_df, max) # same result as apply(X, 2, max)
## $V1
## [1] 0.4007766
## 
## $V2
## [1] -0.1045305
## 
## $V3
## [1] -0.05198225
## 
## $V4
## [1] 2.255122

The function sapply() (s for simplify) works like lapply() with the exception that it tries to simplify the value it returns. It might become a vector or a matrix.

sapply(iris, class) # works column-wise on a data frame; returns a character vector
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    "numeric"    "numeric"    "numeric"    "numeric" 
##      Species 
##     "factor"

Group-wise calculations: tapply()

tapply() (t for table) may be used (instead of aggregate()) to do group-wise calculations on vectors. Frequently it is employed to calculate group-wise means.

data(Oats, package = "nlme") # load df Oats from the nlme package

# One factor
tapply(Oats$yield, Oats$Block, mean)
##        VI         V       III        IV        II 
##  96.25000  90.91667  95.91667  98.16667 107.25000 
##         I 
## 135.33333
# Two factors
tapply(Oats$yield, list(Oats$Block, Oats$Variety), mean)
##     Golden Rain Marvellous Victory
## VI        90.25     109.00   89.50
## V         95.50      85.25   92.00
## III       86.75     118.50   82.50
## IV       108.00      95.00   91.50
## II       113.25     121.25   87.25
## I        133.25     129.75  143.00
# compared to aggregate()
aggregate(yield ~ Block + Variety, Oats, mean)
##    Block     Variety  yield
## 1     VI Golden Rain  90.25
## 2      V Golden Rain  95.50
## 3    III Golden Rain  86.75
## 4     IV Golden Rain 108.00
## 5     II Golden Rain 113.25
## 6      I Golden Rain 133.25
## 7     VI  Marvellous 109.00
## 8      V  Marvellous  85.25
## 9    III  Marvellous 118.50
## 10    IV  Marvellous  95.00
## 11    II  Marvellous 121.25
## 12     I  Marvellous 129.75
## 13    VI     Victory  89.50
## 14     V     Victory  92.00
## 15   III     Victory  82.50
## 16    IV     Victory  91.50
## 17    II     Victory  87.25
## 18     I     Victory 143.00

5.4 Writing functions

New functions may be defined interactively in a running R session using function().

Example: a handmade t-test function (there is one in R already!)

twosam <- function(y1, y2) {  # definition
    n1 <- length(y1)          # sample size
    n2 <- length(y2) 
    yb1 <- mean(y1)           # mean
    yb2 <- mean(y2)
    s1 <- var(y1)             # variance
    s2 <- var(y2)
    s_p <- ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2) # pooled variance
    tst <- (yb1 - yb2) / sqrt(s_p * (1 / n1 + 1 / n2))     # T statistic
    return(tst)           # return value; exactly one object; is often a list
}

Calling the function

tstat <- twosam(y1 = chickwts[chickwts$feed == "linseed", "weight"], 
                y2 = chickwts[chickwts$feed == "soybean", "weight"])
tstat

# check T statistic
t.test(chickwts[chickwts$feed == "linseed", "weight"],
       chickwts[chickwts$feed == "soybean", "weight"], var.equal = TRUE)

If there is a function fun1 defined by

fun1 <- function(data, data.frame, graph, limit) { 
    ... 
}

then the function may be invoked in several ways, for example

fun1(d, df, TRUE, 20)
fun1(d, df, graph = TRUE, limit = 20)
fun1(data = d, limit = 20, graph = TRUE, data.frame = df)

All of them are equivalent (cf. positional matching and keyword matching).
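A minimal runnable illustration of positional and keyword matching (the toy function fun0 is our example, not from the text above):

```r
# Toy function to illustrate argument matching (our example)
fun0 <- function(a, b, flag, limit) {
  paste(a, b, flag, limit)
}

fun0(1, 2, TRUE, 20)                         # purely positional
fun0(1, 2, limit = 20, flag = TRUE)          # positional + keyword
fun0(limit = 20, a = 1, flag = TRUE, b = 2)  # purely keyword
# all three calls return the same value
```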

In many cases, arguments can be given commonly appropriate default values, in which case they may be omitted from the call.

fun1 <- function(data, data.frame, graph = TRUE, limit = 20) { 
    ... 
}

It could be called as

ans <- fun1(d, df)

which is now equivalent to the three cases above, or as

ans <- fun1(d, df, limit = 10)

which changes one of the defaults.

5.5 Classes and methods

The return value of a function may have a specified class. The class of an object determines how it will be treated by so-called generic functions. To find out which generic functions have a method for a given class, use: methods(class = "nameOfClass")

Example:

methods(class = "factor")

Conversely, to find out all the class-based methods defined for a generic function, use: methods(genericFunction)

Example:

methods(summary)

Access the documentation (if available) for a method via e.g. ?summary.factor

5.6 A user-defined print method

A print method for the twosam() function could be

print.twsm <- function(x, ...) { # functionName.className; print methods take (x, ...)
  cat("\nMy two-sample t test\n\n")
  cat("Value of test statistic: ", x$tst, "\n")
  invisible(x)  # return the object invisibly
}

(See ?cat and ?Quotes for message formatting.)

For this to work, the original function needs to return an object of class ‘twsm’.

twosam <- function(y1, y2) {
  ...
  retval <- list(tst = tst) # define return value
  class(retval) <- "twsm"   # assign class to return value
  return(retval)
}
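Putting the pieces from 5.4 and 5.6 together, a complete minimal version might look like this (a sketch; the body simply restates the twosam() code shown earlier):

```r
# Complete sketch: twosam() returning a classed object plus its print method
twosam <- function(y1, y2) {
  n1 <- length(y1); n2 <- length(y2)
  s_p <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2) # pooled variance
  tst <- (mean(y1) - mean(y2)) / sqrt(s_p * (1 / n1 + 1 / n2))     # T statistic
  retval <- list(tst = tst) # define return value
  class(retval) <- "twsm"   # assign class to return value
  retval
}

print.twsm <- function(x, ...) {
  cat("\nMy two-sample t test\n\n")
  cat("Value of test statistic: ", x$tst, "\n")
  invisible(x)
}

twosam(rnorm(10), rnorm(10) + 2)  # auto-printed via print.twsm
```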

5.7 Debugging

When trying to find errors in the code, it may help to look inside a function while running it, and to execute it step by step. This can be done using debug().

debug(twosam)
twosam(rnorm(10), rnorm(10) + 2)

At the prompt, any R code may be entered. Some useful commands:

  • ls() displays what the function ‘sees’ at each step
  • n (or the return key) evaluates the next command of the function
  • Q quits the function browser
  • For more commands, see ?browser

In order to disable the debugging mode for the function, type

undebug(twosam)

5.8 R programming traps

Learning to program in a new language means making a lot of errors. Some errors are unexpected both for beginners and for those who have already programmed in a different language.

Reading the FAQs, R mailing lists (and Stackoverflow), the built-in documentation, and the manuals, will help!

‘The R Inferno’ by Patrick Burns (http://www.burns-stat.com/pages/Tutor/R_inferno.pdf) is an entertaining and informative guide to the traps of the R language.

5.9 Exercises


A1: Create vector using for-loop

Create the vector \(\mathbf{x} = (3~5~7~9~11~13~15~17)'\) using a for loop. Note: Use the formula \(i \cdot 2 + 1\). Implement two methods:

  1. allocate memory: Start with a zero vector of the right length (tip: numeric()) and iteratively replace its elements.

  2. growing: Start with a NULL object (x <- NULL) and iteratively append new results.

Note: The first method is more efficient especially for longer vectors.

For those interested: Test the efficiency of both methods using system.time() and a sufficiently long vector.
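Such a timing comparison could be sketched as follows (n is our choice; the exact numbers depend on your machine):

```r
# Sketch: preallocation vs. growing a vector (n chosen by us)
n <- 1e4  # increase n to make the difference more dramatic

t_alloc <- system.time({
  x <- numeric(n)                   # allocate memory up front
  for (i in 1:n) x[i] <- i * 2 + 1
})

t_grow <- system.time({
  y <- NULL                         # grow the vector step by step
  for (i in 1:n) y <- c(y, i * 2 + 1)
})

t_alloc["elapsed"]
t_grow["elapsed"]  # typically clearly larger, especially for large n
```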

A2: Implement a function

Implement the following function in R: \[ f(x) = \begin{cases} -1 & \text{if } x < 0,\\ 0 & \text{if } x = 0,\\ 1 & \text{if } x > 0. \end{cases} \]

A3: Using a for-loop to generate number sequences

Write a for loop for \(x_{i + 1} = 0.5 \cdot (x_i + \frac{b}{x_i})\), with \(i = 1, \dots, 15\), \(x_1 = 10\) and \(b = 5\). How does the output change for different values of \(x_1\) and \(b\)? Plot \(x\) with plot(x, type = "l", main = b).

The plot for b <- 5 should look something like this.

A4: Triangular matrices

Create the \(10 \times 10\) triangular matrix \[ \begin{matrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \vdots&\ddots&\ddots&\ddots&\ddots&\ddots&\ddots&\ddots&\ddots& \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ \end{matrix} \] using two methods:

  1. iterate with two nested for-loops through rows (index \(i\)) and columns (index \(j\)) of a zero matrix; replace an element by \(1\) if \(i \geq j\).

  2. search for a suitable R-function with ??.
    Hint: If you get a logical matrix with the R-function, you have to translate it into a numerical matrix. To do this, either think of mode() or test what happens when you apply simple arithmetic operations to the logical matrix.

A5: cumulative sum

Write R code that cumulatively adds up the values of the vector \(\mathbf{x} = (1~2~3~4 \dots 10)'\). The result should look like this: \(\mathbf{y} = (1~3~6~10 \dots 55)'\). Solve the problem in three ways:

  1. with a loop,

  2. by matrix multiplication (vectorised solution; hint: use the triangular matrix above),

  3. with an R-function (search with ??).

6 Graphics

6.1 R graphics systems

Two graphics systems come with R

  • Traditional graphics
  • Grid (also called Lattice, Panel, or Trellis) graphics

The former is easier to control, the latter provides more flexibility. The lattice package is based on the grid graphics system; it implements Trellis plots (a group of smaller plots arranged in a grid) in R.

Both systems provide high-level functions that produce complete plots and low-level functions that add further output to an existing plot.

Another widely used R graphics system is the ggplot2 package.

We will only deal with the traditional (base) graphics here and take a quick look at the lattice package.

Manual, documentation, examples

6.2 Traditional graphics functions

High-level functions

plot()     # scatter plot, specialized plot methods
boxplot()  # boxplot (median, 1st & 3rd quartile, whiskers)
hist()     # histogram
qqnorm()   # normal quantile-quantile plot
barplot()  # barplot
curve()    # function plot
pie()      # pie chart
pairs()    # scatter plot matrix
persp()    # 3d plot
contour()  # contour plot
coplot()   # conditional plot
interaction.plot() 

Low-level functions

points()  # add points
lines()   # add lines
rect()    # add rectangle
polygon() # add polygon
abline()  # add line with intercept a, slope b (or horizontal h, vertical v)
arrows()  # add arrows; useful for e.g. confidence intervals
text()    # add text in plotting region
mtext()   # add text in margins region
axis()    # customize axes
box()     # draw box around plot
legend()  # add a figure legend

Use ?functionName to find more detailed information about the functions and arguments to use

6.3 Plot layout in traditional graphics

A figure consists of two regions

  • the plotting region (containing the actual plot)
  • the margins region (containing axes and labels)

Example: scatter plot

x <- runif(50, 0, 2) # 50 uniform random numbers
y <- runif(50, 0, 2)
plot(x, y, main = "Title", sub = "Subtitle", xlab = "x-label",
     ylab = "y-label") # produce plotting window

## Writing text into plotting region
text(0.6, 0.6, "Text at (0.6, 0.6)")

## horizont. and vertic. lines
abline(h = 1.1, v = 1.6, lty = 2) 

## Writing text into margin region
for(side in 1:4) {
  mtext(-1:4, side = side, at = .7, line = -1:4)
}
mtext(paste("Side", 1:4), side = 1:4, line = -1, font = 2)

6.4 Stepwise plotting

In order to have full control over the details of a plot, it is recommended to build it step by step. Consider as an example the mean reaction times in a two-factorial experiment; the interaction is to be graphically displayed.

dat <- read.table(header = TRUE, stringsAsFactors = TRUE, 
                  text = "A  B  rt
                          a1 b1 825
                          a1 b2 792
                          a1 b3 840
                          a2 b1 997
                          a2 b2 902
                          a2 b3 786")

## First produce only the plotting region and the coordinate system.
plot(rt ~ as.numeric(B), dat, type = "n", axes = FALSE,
     xlim = c(.8, 3.2), ylim = c(750, 1000),
     xlab = "Difficulty", ylab = "Mean reaction time (ms)")

## Plot the data points separately for each level of factor A.
points(rt ~ as.numeric(B), dat[dat$A == "a1", ], type = "b", pch = 16)
points(rt ~ as.numeric(B), dat[dat$A == "a2", ], type = "b", pch = 4)

## Add axes.
axis(side = 1, at = 1:3, expression(B[1], B[2], B[3]))
axis(side = 2)

## Add a legend and box.
legend(2.5, 975, expression(A[1], A[2]), pch = c(16, 4),
       bty = "n", title = "Task")
box()

Step-by-Step:

Proper formatting of mathematical expressions is possible with expression(), which accepts the syntax documented in ?plotmath.

Additionally:

  • Error bars may be added using the arrows() function.
  • Via par() many graphical parameters may be set (see ?par), for example par(mgp = c(2, .7, 0)) reduces the distance between labels and axes

6.5 Graphical parameters

Some graphical parameters may only be set by a call to par().

mai   # size of figure margins (in inches)
mar   # size of figure margins (in lines of text)
mfrow # number of sub-figures in the plotting device
oma   # size of outer margins (in lines of text)
omi   # size of outer margins (in inches)

Parameters changed via par() are automatically reset when a new device is opened.
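Within a running device, the usual idiom is to store the old values (returned invisibly by par()) and restore them afterwards; a minimal sketch:

```r
# Set new parameters and keep the old values for later restoration
op <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

plot(1:10)        # left panel
hist(rnorm(100))  # right panel

par(op)  # restore the previous settings
```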

Some graphical parameters are also accepted as arguments by high- and low-level plotting functions.

adj  # justification of text
bty  # box type for legend
cex  # size of text or data symbols (multiplier)
col  # color, see colors() for the available colors
las  # rotation of text in margins
lty  # line type (solid, dashed, dotted, ...)
lwd  # line width
mgp  # placement of axis ticks and tick labels
pch  # data symbol type
tck  # length of axis ticks
type # type of plot (points, lines, both, none)

6.6 Graphic devices

Graphic devices control the format in which a plot is produced. When using R interactively, the standard output of graphic functions is on the screen. Additional windows can be opened:

x11() # open new plotting window (X or Windows)
quartz() # Mac OS Quartz window

There are several file devices available:

postscript()   # vector graphics
pdf()
win.metafile() # Windows only; native format for Word etc.
png()          # bitmap graphics
tiff()
jpeg()
bmp()

6.7 Exporting graphics to files

Plots may be exported to these file formats. With vector graphics (such as eps or wmf) plots can be re-scaled without losing quality. The procedure is: First open a file device, then execute some plotting commands (their results are plotted into the device), finally close the device.

A pdf graphic is produced by

pdf("myplot.pdf", height = 3.4, width = 3.4, pointsize = 10)

# -- some plotting commands -- #

dev.off() # close device

A Windows compatible wmf graphic is obtained by

win.metafile("myplot.wmf", height = 3.4, width = 3.4, pointsize = 10)

# -- some plotting commands -- #

dev.off() # close device

The plot should fit on an A4 sheet of paper. A4 paper measures 8.3 x 11.7 inches. Taking the page margins into account, a plot should be at most about 7.5 x 9 (or 9 x 7.5) inches in size.

6.8 Exercises


A1: Figure step by step

Load the dataset cars (see ?cars for information on this). Plot distance as a function of speed: Set type = "n" and axes = FALSE. Now build up the figure piece by piece with points(), axis() etc.

A2: Plot to show PlantGrowth

Use the R dataset PlantGrowth. Create a figure with two columns and two rows (graphic parameter mfrow):

  • Left column top: A boxplot (function boxplot()) of the three groups; additionally write the respective group mean in each box with text().
  • Remaining columns/rows: Plot the histograms of the individual groups.

Make sure that the x-axes of the histograms are identical and choose a uniform width/number of 'bins'.

Hint: First create a data frame containing the mean values using an aggregate() command. This could look like this, for example:

##   group weight
## 1  ctrl  5.032
## 2  trt1  4.661
## 3  trt2  5.526

The finished plot could look like this:

A3: Plotting curves

Given the following function: \(f(x) = \frac{1}{3}x^3 + 2x^2 + 9\)

Plot the curve of this function in the interval [-10, 10] and the first 2 derivatives of the function. To do this, create a vector in this interval with step size 0.01.

x <- seq(-10, 10, 0.01)

Then apply the respective functions to the vector and plot the two vectors with plot(). Set type = 'l'. How does the plot change if you do not do this? Plot all functions in the same plot (you can do this with lines()) and use different colours for the functions (argument col). At the end you can add a legend.

The figure could look like this:

Find out about the function curve(). Can you use it to create the same figure?

7 Classical hypothesis tests

7.1 Classical hypothesis testing in R

The stats package, which is automatically loaded when R starts up, contains functions for statistical calculations and random number generation. For a complete list of functions, see library(help=stats).

Examples of functions to perform classical hypothesis testing in R

Nominal data

  • binom.test() performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment
  • chisq.test() performs chi-squared contingency table tests and goodness-of-fit tests

Metric response variables

  • cor.test() for association between paired samples
  • t.test() performs one- and two-sample t tests
  • var.test() performs an F-test to compare the variances of two samples from normal populations

Ordinal-continuous response variables

  • wilcox.test() performs one- and two-sample Wilcoxon tests

7.2 P-value and statistical decision rule

The p-value is the probability of obtaining a test statistic at least as extreme as the one that was observed, given that the null hypothesis is true. If the p-value is smaller than the probability \(\alpha\) of a Type I error, then the null hypothesis is rejected.

T.statistic <- 2.3 # calculated T statistic = 2.3
df <- 18           # df = 18

t <- seq(-4, 4, .01)
den_t <- dt(t, df) 
plot(t, den_t, type = "l", ylab = "Density")

polygon(c(t[t >= T.statistic ], max(t), T.statistic),
        c(den_t[t >= T.statistic ], 0, 0), col = "blue",
        border = 1)

polygon(c(t[t <= -T.statistic ], -T.statistic, min(t)),
        c(den_t[t <= -T.statistic ], 0, 0), col = "blue",
        border = 1) 

abline(v = c(-T.statistic, T.statistic), lty = 2, col = "blue") # draw T.stat
abline(h = 0, col = "grey")

p <- pt(-T.statistic, df) * 2 # p.value = left tail * 2 (symmetric distribution)
crit.q <- qt(.975, df)        # crit. quantile; alpha = .05; 1 - alpha/2 = .975

abline(v = c(-crit.q, crit.q), col = "red", lty = 2) # draw crit. quantile
text(0, .1, paste0("p.value = ", round(p, 3)))

7.3 Nominal data

7.3.1 Binomial test

Assumptions:

  • \(X_1,\dots, X_n\) independent and identically Bernoulli distributed with parameter \(p\)

Hypotheses:

  1. H\(_0\): \(p = p_0 ~~~\) H\(_1\): \(p \neq p_0\) (alternative = "two-sided")
  2. H\(_0\): \(p \geq p_0 ~~~\) H\(_1\): \(p < p_0\) (alternative = "less" # left-tailed)
  3. H\(_0\): \(p \leq p_0 ~~~\) H\(_1\): \(p > p_0\) (alternative = "greater" # right-tailed)

Test statistic: \[ T = \sum^{n}_{i = 1} X_i, \qquad T \sim B(n, p_0) \]

R function:

binom.test(x = 8, n = 20, p = 0.25)
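For the one-sided versions, the p-value can be checked directly against the binomial distribution function; a sketch for the right-tailed case:

```r
# Right-tailed test: under H0, the p-value is P(T >= 8) with T ~ B(20, 0.25)
ht <- binom.test(x = 8, n = 20, p = 0.25, alternative = "greater")

p_manual <- sum(dbinom(8:20, size = 20, prob = 0.25))  # = 1 - pbinom(7, 20, 0.25)

ht$p.value
p_manual  # identical to ht$p.value
```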

7.3.2 Chi-squared goodness-of-fit test

Assumptions:

  • \(X_1, \dots, X_n\) independent and identically multinomially distributed

Hypotheses:

  • H\(_0\): \(P(X_i = k) = p_k\) for all \(k = 1, \dots, r \; (i = 1, \dots, n)\)
  • H\(_1\): \(P(X_i = k) \neq p_k\) for at least one \(k \in \{1, \dots, r\} \; (i = 1, \dots, n)\)

Test statistic: \[ X^2 = \sum^r_{k = 1}\frac{(Y_k - np_k)^2}{np_k}, \qquad X^2 \sim \chi^2(r - 1) \]

R function:

tab <- xtabs(count ~ spray, InsectSprays)
chisq.test(tab)

7.3.3 Chi-squared test of homogeneity

Assumptions:

  • \(X_{j1}, \dots, X_{jn_j}\) independent and identically multinomially distributed

Hypotheses:

  • H\(_0\): \(P(X_{ji} = k) = p_k\) for all \(j = 1,\dots,m\), \(i = 1, \dots, n_j\) and \(k = 1, \dots, r\)
  • H\(_1\): \(P(X_{ji} = k) \neq P(X_{j'i} = k)\) for some \(j \neq j' \in \{1, \dots, m\}\) and \(k \in \{1, \dots, r\}\)

Test statistic: \[ X^2 = \sum^m_{j = 1}\sum^r_{k = 1}\frac{(x_{jk} - \hat\mu_{jk})^2}{\hat\mu_{jk}}, \qquad X^2 \sim \chi^2((m-1)(r-1)) \]

R function:

tab <- xtabs(~ education + induced, infert)
chisq.test(tab)

7.3.4 Chi-squared test of independence

Assumptions:

  • Independent pairs \((X_i, Y_i)\), \(i = 1, \dots, n\), where \(X_i \in \{1, \dots, m\}\) and \(Y_i \in \{1, \dots, r\}\)

Hypotheses:

  • H\(_0\): \(P(X_i = j, Y_i = k) = P(X_i = j)\cdot P(Y_i = k)\) for all \(j\), \(k\) and \(i = 1, \dots, n\)
  • H\(_1\): \(P(X_i = j, Y_i = k) \neq P(X_i = j)\cdot P(Y_i = k)\) for some \(j\), \(k\) and \(i = 1, \dots, n\)

Test statistic (cf. test of homogeneity): \[ X^2 = \sum^m_{j = 1}\sum^r_{k = 1}\frac{(x_{jk} - \hat\mu_{jk})^2}{\hat\mu_{jk}}, \qquad X^2 \sim \chi^2((m-1)(r-1)) \]

R function:

chisq.test(HairEyeColor[, , "Female"])

7.4 Metric response variables

7.4.1 One-sample \(t\) test

Assumptions (Student’s \(t\) test):

  • \(X_1, \ldots, X_n\) i.i.d. according to \(N(\mu, \sigma^2)\) with unknown \(\sigma^2\)

Assumptions (approximate \(t\) test):

  • \(X_1, \ldots, X_n\) i.i.d. according to any distribution, \(n\geq 30\), with unknown \(\sigma^2\)

Hypotheses:

  1. H\(_0\): \(\mu = \mu_0 ~~~\) H\(_1\): \(\mu \neq \mu_0\)
  2. H\(_0\): \(\mu \geq \mu_0 ~~~\) H\(_1\): \(\mu < \mu_0\)
  3. H\(_0\): \(\mu \leq \mu_0 ~~~\) H\(_1\): \(\mu > \mu_0\)

Test statistic: \[ T = \frac{\bar{X} - \mu_0}{S} \, \sqrt{n}, \qquad T \sim t(n-1) \] (approx. distribution of \(T\) in case of approximate \(t\) test)

Rejection region:

  1. \(|T| > t_{1-\alpha/2} (n-1)\)
  2. \(T < t_\alpha (n-1) = - t_{1-\alpha} (n-1)\)
  3. \(T > t_{1-\alpha} (n-1)\)

with the \(p\)-quantile \(t_p (n-1)\) of the \(t(n-1)\) distribution

R function:

ht <- t.test(sleep[sleep$group == 1, "extra"], mu = 0)

(Equivalent) Test decision:

  • Critical value

    alpha <- 0.05
    dgf <- ht$parameter
    qt(p = 1 - alpha / 2, df = dgf)  # critical value
    ## [1] 2.262157
    ht$statistic                     # observed test statistic
    ##       t 
    ## 1.32571
  • Confidence interval
    The confidence level of the confidence interval is chosen via the conf.level argument of t.test()

    ht$conf.int 
    ## [1] -0.5297804  2.0297804
    ## attr(,"conf.level")
    ## [1] 0.95
  • \(p\)-value

    ht$p.value 
    ## [1] 0.2175978

7.4.2 Two-sample \(t\) test (independent)

Assumptions:

  • Independent samples \(X_1, \dots, X_n \sim N(\mu_x, \sigma^2_x)\) and \(Y_1, \dots, Y_m \sim N(\mu_y, \sigma^2_y)\); \(\sigma^2_x\), \(\sigma^2_y\) unknown but equal

Hypotheses:

  1. H\(_0\): \(\mu_x - \mu_y = \delta_0 ~~~\) H\(_1\): \(\mu_x - \mu_y \neq \delta_0\)
  2. H\(_0\): \(\mu_x - \mu_y \geq \delta_0 ~~~\) H\(_1\): \(\mu_x - \mu_y < \delta_0\)
  3. H\(_0\): \(\mu_x - \mu_y \leq \delta_0 ~~~\) H\(_1\): \(\mu_x - \mu_y > \delta_0\)

Test statistic: \[ T = \frac{\bar x - \bar y - \delta_0}{\hat \sigma_{\bar x - \bar y}}, \qquad T \sim t(n+m-2) \]

R function:

t.test(weight ~ group, PlantGrowth[PlantGrowth$group != "ctrl", ],
       var.equal = TRUE)

7.4.3 Welch \(t\) test (two independent samples)

Assumptions:

  • Independent samples \(X_1, \dots, X_n \sim N(\mu_x, \sigma^2_x)\) and \(Y_1, \dots, Y_m \sim N(\mu_y, \sigma^2_y)\); \(\sigma^2_x\) and \(\sigma^2_y\) unknown and not necessarily equal

Hypotheses:

  1. H\(_0\): \(\mu_x - \mu_y = \delta_0 ~~~\) H\(_1\): \(\mu_x - \mu_y \neq \delta_0\)
  2. H\(_0\): \(\mu_x - \mu_y \geq \delta_0 ~~~\) H\(_1\): \(\mu_x - \mu_y < \delta_0\)
  3. H\(_0\): \(\mu_x - \mu_y \leq \delta_0 ~~~\) H\(_1\): \(\mu_x - \mu_y > \delta_0\)

Test statistic: \[ T = \frac{\bar{X} - \bar{Y} - \delta_0} {\sqrt{\frac{S_x^2}{n} + \frac{S_y^2}{m}}}, \qquad T \sim t(k) \] \[ k = (S_x^2 / n + S_y^2 / m)^2 ~ / \left(\frac{1}{n-1}\, \left(\frac{S_x^2}{n}\right)^2 + \frac{1}{m-1}\, \left(\frac{S_y^2}{m}\right)^2\right) \]

R function:

t.test(weight ~ group, PlantGrowth[PlantGrowth$group != "ctrl", ],
       var.equal = FALSE)
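The df formula for \(k\) above can be verified against the parameter component of the t.test() result; a sketch:

```r
# Welch degrees of freedom computed by hand, compared with t.test()
d <- PlantGrowth[PlantGrowth$group != "ctrl", ]
x <- d[d$group == "trt1", "weight"]
y <- d[d$group == "trt2", "weight"]

ht <- t.test(x, y, var.equal = FALSE)

n <- length(x); m <- length(y)
k <- (var(x) / n + var(y) / m)^2 /
     ((var(x) / n)^2 / (n - 1) + (var(y) / m)^2 / (m - 1))

k
ht$parameter  # same value
```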

7.4.4 Two-sample \(t\) test (dependent)

Assumptions:

  • \(X_1, \dots, X_n\) and \(Y_1, \dots, Y_n\) with \(D_1 = X_1 - Y_1, \dots, D_n = X_n - Y_n\) independent and identically distributed with \(N(\mu_d, \sigma_d^2)\) and unknown \(\sigma_d^2\)

Hypotheses:

  1. H\(_0\): \(\mu_d = \delta_0 ~~~\) H\(_1\): \(\mu_d \neq \delta_0\)
  2. H\(_0\): \(\mu_d \geq \delta_0 ~~~\) H\(_1\): \(\mu_d < \delta_0\)
  3. H\(_0\): \(\mu_d \leq \delta_0 ~~~\) H\(_1\): \(\mu_d > \delta_0\)

Test statistic: \[ T = \frac{\bar d - \delta_0}{\hat\sigma_{\bar d}}, \qquad T \sim t(n - 1) \]

R function:

t.test(sleep[sleep$group == 1, "extra"],
       sleep[sleep$group == 2, "extra"], paired = TRUE)

7.4.5 Correlation test

Assumptions:

  • Independent jointly normally distributed variables \((X_i, Y_i)\), \(i = 1, \dots, n\)

Hypotheses:

  1. H\(_0\): \(\rho_{XY} = 0 ~~~\) H\(_1\): \(\rho_{XY} \neq 0\)
  2. H\(_0\): \(\rho_{XY} \geq 0 ~~~\) H\(_1\): \(\rho_{XY} < 0\)
  3. H\(_0\): \(\rho_{XY} \leq 0 ~~~\) H\(_1\): \(\rho_{XY} > 0\)

Test statistic: \[ T = \frac{r_{XY}}{\sqrt{1 - r^2_{XY}}}\sqrt{n - 2}, \qquad T \sim t(n - 2) \]

R function:

cor.test(~ speed + dist, cars)
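The test statistic formula above can be reproduced by hand from the sample correlation; a sketch using the cars data:

```r
# T statistic of the correlation test computed by hand
ht <- cor.test(~ speed + dist, cars)

r <- cor(cars$speed, cars$dist)
n <- nrow(cars)
tst <- r / sqrt(1 - r^2) * sqrt(n - 2)

tst
ht$statistic  # same value
```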

7.4.6 Test of equal variances

Assumptions:

  • Independent samples \(X_1, \dots, X_n \sim N(\mu_x, \sigma^2_x)\) and \(Y_1, \dots, Y_m \sim N(\mu_y, \sigma^2_y)\), unknown \(\sigma^2_x, \sigma^2_y\)

Hypotheses:

  1. H\(_0\): \(\sigma^2_x = \sigma^2_y ~~~\) H\(_1\): \(\sigma^2_x \neq \sigma^2_y\)
  2. H\(_0\): \(\sigma^2_x \geq \sigma^2_y ~~~\) H\(_1\): \(\sigma^2_x < \sigma^2_y\)
  3. H\(_0\): \(\sigma^2_x \leq \sigma^2_y ~~~\) H\(_1\): \(\sigma^2_x > \sigma^2_y\)

Test statistic: \[ T = \frac{S^2_x/\sigma^2_x}{S^2_y/\sigma^2_y}, \qquad T \sim F(n - 1, m - 1) \] for \(\sigma^2_x = \sigma^2_y\)

R function:

var.test(weight ~ group, PlantGrowth[PlantGrowth$group != "ctrl", ])

7.5 Ordinal response variables

7.5.1 Wilcoxon rank-sum test (Mann-Whitney U test)

Assumptions:

  • \(X_1,\dots, X_n\) and \(Y_1,\dots, Y_m\) each independent and identically continuously distributed and \(X_1,\dots, X_n, Y_1,\dots, Y_m\) independent

Hypotheses:

  1. H\(_0\): \(\tilde\mu_x = \tilde\mu_y ~~~\) H\(_1\): \(\tilde\mu_x \neq \tilde\mu_y\)
  2. H\(_0\): \(\tilde\mu_x \geq \tilde\mu_y ~~~\) H\(_1\): \(\tilde\mu_x < \tilde\mu_y\)
  3. H\(_0\): \(\tilde\mu_x \leq \tilde\mu_y ~~~\) H\(_1\): \(\tilde\mu_x > \tilde\mu_y\)

Test statistic: \[ U = \sum^n_{i = 1}{rk(X_i)} - \frac{n(n + 1)}{2} \] (for its distribution, see ?pwilcox)

R function:

wilcox.test(weight ~ group, PlantGrowth[PlantGrowth$group != "ctrl", ])
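The statistic \(U\) can be computed directly from the ranks in the combined sample; a sketch (wilcox.test() reports this value as W):

```r
# Rank-sum statistic U computed by hand
d <- PlantGrowth[PlantGrowth$group != "ctrl", ]
x <- d[d$group == "trt1", "weight"]
y <- d[d$group == "trt2", "weight"]

ht <- wilcox.test(x, y)

n <- length(x)
r <- rank(c(x, y))                  # ranks in the combined sample
U <- sum(r[1:n]) - n * (n + 1) / 2  # rank sum of x minus its minimum

U
ht$statistic  # same value, labeled 'W'
```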

8 Linear models (1)

8.1 Overview

8.1.1 Linear models

(More information is in the ‘Statistical models in R’ chapter in the An Introduction to R manual.)

Linear models of the form \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, \] where the errors \(\varepsilon\) are assumed to be independent and normally distributed as \(N(0, \sigma^2)\), are fitted in R using lm():

  lm(y ~ x1 + x2 + ... + xk, data) # lm(formula, data)

ANOVA models, which constitute a subset of linear models, can be fit using either lm() or aov().

8.1.2 Coding of factors

Formally, linear models are concisely defined using vectors and matrices:

\[ \mathbf{y} = \mathbf{X} \, \mathbf{\beta} + \mathbf{e} \] Fully written out, it is: \[ \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{pmatrix} = \begin{pmatrix} x_{10} & x_{11} & x_{12} & \ldots & x_{1p}\\ x_{20} & x_{21} & x_{22} & \ldots & x_{2p}\\ \vdots & \vdots & \vdots & & \vdots\\ x_{n0} & x_{n1} & x_{n2} & \ldots & x_{np} \end{pmatrix} \cdot \begin{pmatrix} \beta_0\\ \beta_1\\ \beta_2\\ \vdots\\ \beta_p \end{pmatrix} + \begin{pmatrix} e_1\\ e_2\\ \vdots\\ e_n \end{pmatrix} \] The model matrix \(\mathbf{X}\) represents the observed values of the explanatory variables for each subject (row). The columns represent the observed variables. For categorical variables, i.e. factors, the creation of these variables is called the (contrast) coding of factors.

Coding of factors: dummy coding example Let’s look at a one-way ANOVA with three factor levels.

For three subjects, belonging to the groups 1, 2, and 3, respectively, the result is a set of equations like this: \[\begin{array}{ccccccc} y_{1} & = & \mu & +~~~\alpha_1 & & & +~~~e_1\\ & \vdots & \\ y_{i} & = & \mu & & +~~~\alpha_2 & & +~~~e_i\\ & \vdots & \\ y_{n} & = & \mu & & & +~~~\alpha_3 & +~~~e_n \end{array}\]

We define the first group as the reference group with mean \(\mu\): \[ \alpha_1 := 0 \]

Writing out the model equations, we now obtain: \[\begin{array}{cccccc} y_{1} & = & x_{10} \, \mu & +~~~ x_{12} \, \alpha_2 & +~~~ x_{13} \, \alpha_3 & +~~~e_1\\ & \vdots & \\ y_{i} & = & x_{i0} \, \mu & +~~~ x_{i2} \, \alpha_2 & +~~~ x_{i3} \, \alpha_3 & +~~~e_i\\ & \vdots & \\ y_{n} & = & x_{n0} \, \mu & +~~~ x_{n2} \, \alpha_2 & +~~~ x_{n3} \, \alpha_3 & +~~~e_n \end{array}\]

All other groups ‘start’ from the baseline, defined by the reference group, so we set \(x_{i0} = 1\) for all \(i\).

Finally, we set up dummy variables which indicate, for each subject, whether they fall into one of the other groups: \[ x_{ij} = \begin{cases} 1 & \text{if $i$ is in group $j$}\\ 0 & \text{otherwise} \end{cases} \]

In standard terminology of linear models, we have \[\mu = \beta_0;\quad \alpha_2 = \beta_1;\quad \alpha_3 = \beta_2\] So, for our three subjects from three different groups, we get: \[\begin{array}{ccccccc} y_{1} & = & \beta_0 & +~~~ & & & +~~~e_1\\ & \vdots & \\ y_{i} & = & \beta_0 & & +~~~\beta_1 & & +~~~e_i\\ & \vdots & \\ y_{n} & = & \beta_0 & & & +~~~\beta_2 & +~~~e_n \end{array}\]

These \(\beta\) coefficients are provided by lm() and aov(). \(\beta_0\) is the expected value for the reference level, and \(\beta_1, \beta_2\) are the respective differences in the other two levels, compared to the reference level.

Coding of factors in R

Factors with \(k\) levels are represented by \(k - 1\) indicator variables.

Dummy coding (default): Indicators only take the values \(0\) and \(1\). If all dummies are zero, the reference level is indicated. Dummy coded effects are interpreted with respect to the reference level. Dummy coding is set in R by:

  options(contrasts = c("contr.treatment", "contr.poly"))

Effect coding: Indicators take the values \(-1\), \(0\), and \(1\). Effects are interpreted as a deviation from the mean and sum to zero. Effect coding is switched on in R by

  options(contrasts = c("contr.sum", "contr.poly"))
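The effect of the two coding schemes can be inspected via model.matrix(); a sketch using the group factor of the PlantGrowth data:

```r
# Dummy coding (default): reference level maps to all-zero indicators
options(contrasts = c("contr.treatment", "contr.poly"))
head(model.matrix(~ group, PlantGrowth), 3)

# Effect coding: indicators take values -1, 0, 1
options(contrasts = c("contr.sum", "contr.poly"))
head(model.matrix(~ group, PlantGrowth), 3)

# restore the default
options(contrasts = c("contr.treatment", "contr.poly"))
```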

8.1.3 Formulae

Statistical models are represented by R formulae. R formulae usually correspond closely to the referring statistical models.

Formulae used in lm() and aov(), let A, B be fixed factors

R formula            Model
y ~ 1 + x            \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\)
y ~ x                (short form of the above)
y ~ 0 + x            \(y_i = \beta_1 x_i + \varepsilon_i\)
y ~ x + I(x^2)       \(y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i\)
y ~ A + B            \(y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk}\)
y ~ A*B              \(y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}\)
y ~ 1 + A + B + A:B  (long form of the above)

8.1.4 Extractor functions for lm objects

There are many extractor functions that work on lm objects.

coef()      # Extract the regression coefficients

summary()   # Print a comprehensive summary of the results of
            #   the regression analysis. This also returns a
            #   list from which further useful information
            #   can be extracted.

anova()     # Compare nested models and produce an analysis
            #   of variance table (for incremental F tests)

residuals() # Extract the residuals

plot()      # Produce four plots, showing residuals, fitted
            #   values and some diagnostics

fitted()   # Extract the fitted values

predict()  # Either predicted values for the observed data, 
           #   i.e. the fitted values
           #
           # Or a new data frame can be supplied having the
           #   same variables specified with the same labels
           #   as the original. The value is a vector or
           #   matrix of predicted values corresponding to the
           #   determining variable values in the data frame

vcov()     # Return the variance-covariance matrix of the
           #   main parameters of a fitted model object

model.matrix() # Return the design matrix
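A short sketch of these extractors in action, fitting stopping distance on speed with the built-in cars data (our choice of example model):

```r
# Fit a simple regression and apply some extractor functions
fit <- lm(dist ~ speed, cars)

coef(fit)             # intercept and slope
head(fitted(fit))     # fitted values
head(residuals(fit))  # residuals
vcov(fit)             # variance-covariance matrix of the coefficients
summary(fit)$r.squared

# predicted values for new speeds
predict(fit, newdata = data.frame(speed = c(10, 20)))
```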

8.2 ANOVA

8.2.1 One-way ANOVA

A method for analyzing the effect of a discrete variable, factor \(A\) with \(m\) levels, on a continuous variable \(Y\). For the factor levels \(i = 1, \ldots, m\), we collect independent samples of size \(n_i\).

The resulting sample is: \[ Y_{ij} ~~\text{with}~~ i = 1, \ldots, m,~~ j = 1, \ldots, n_i \]

Model with dummy coding

\[ Y_{ij} = \mu_i ~+~ \varepsilon_{ij} ~~\text{with}~~ i = 1, \ldots, m,~~ j = 1, \ldots, n_i ,~~\text{and}~~ \varepsilon_{ij} \sim N (0, \sigma^2) \]

Therefore, \[ E(Y_{ij}) = E(\mu_i + \varepsilon_{ij}) = \mu_i ~~\text{and}~~ Y_{ij} \sim N (\mu_i, \sigma^2) \]

So we assume stochastic independence and identical normal distributions for all \(\varepsilon_{ij}\) and for all \(Y_{ij}\).

Hypotheses: \[ H_0\colon~ \mu_1 = \mu_2 = \ldots = \mu_m \] \[ H_1\colon~ \mu_i \neq \mu_{i'} ~~\text{for at least one pair}~~ i, i' \in \{1, \ldots, m\} \]

Model with effect coding \[ Y_{ij} = \mu + \alpha_i + \varepsilon_{ij} ~~\text{with}~~ i = 1, \ldots, m,~~ j = 1, \ldots, n_i \] and \(\varepsilon_{ij} \sim N (0, \sigma^2)\)

Hypotheses: \[ H_0\colon~ \alpha_1 = \alpha_2 = \ldots = \alpha_m = 0 \] \[ H_1\colon~ \alpha_i \neq 0 ~~\text{for at least one}~~ i \in \{1, \ldots, m\} \]

\(\mu\) is the grand mean. We obtain it using \(N = n_1 + \ldots + n_m\): \[ \mu = \frac{1}{N} \, \sum_{i=1}^m n_i \mu_i \] \(\alpha_i\) is the effect of factor level \(i\): \(~~\alpha_i = \mu_i - \mu\)

Test statistic \[ F = \frac{MS_B}{MS_W} = \frac{~~~\frac{1}{m-1} \, \sum_{i=1}^m n_i \, (\bar{Y}_i - \bar{Y})^2~~~} {\frac{1}{N-m} \, \sum_{i=1}^m \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2} \] Distribution of test statistic, given \(H_0\): \[ F \sim F (m - 1, N - m) \] Rejection region \[ F > F_{1-\alpha} (m - 1, N - m) \]
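To connect the formula with R output, here is a small check on simulated data (group sizes and means are made up): the hand-computed \(F\) statistic equals the one reported by aov().

```r
## One-way F statistic by hand vs. aov() (simulated, unbalanced data)
set.seed(42)
g <- factor(rep(c("a", "b", "c"), times = c(8, 10, 12)))
y <- rnorm(30, mean = c(10, 12, 11)[as.integer(g)], sd = 2)

m <- nlevels(g); N <- length(y)
n_i    <- tabulate(g)          # group sizes
ybar_i <- tapply(y, g, mean)   # group means
MSB <- sum(n_i * (ybar_i - mean(y))^2) / (m - 1)
MSW <- sum((y - ybar_i[g])^2) / (N - m)
F_hand <- MSB / MSW

F_aov <- summary(aov(y ~ g))[[1]][["F value"]][1]
all.equal(F_hand, F_aov)
## [1] TRUE
```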

8.2.2 Two-way ANOVA

Data scheme:

  • Factor A: a\(_1\), \(\ldots\), a\(_i\), \(\ldots\), a\(_I\)
  • Factor B: b\(_1\), \(\ldots\), b\(_j\), \(\ldots\), b\(_J\)
             b\(_1\)                        \(\ldots\)  b\(_j\)                        \(\ldots\)  b\(_J\)
a\(_1\)      \(y_{111}, \ldots, y_{11K}\)   \(\ldots\)  \(y_{1j1}, \ldots, y_{1jK}\)   \(\ldots\)  \(y_{1J1}, \ldots, y_{1JK}\)
\(\vdots\)   \(\vdots\)                     \(\ddots\)  \(\vdots\)                     \(\ddots\)  \(\vdots\)
a\(_i\)      \(y_{i11}, \ldots, y_{i1K}\)   \(\ldots\)  \(y_{ij1}, \ldots, y_{ijK}\)   \(\ldots\)  \(y_{iJ1}, \ldots, y_{iJK}\)
\(\vdots\)   \(\vdots\)                     \(\ddots\)  \(\vdots\)                     \(\ddots\)  \(\vdots\)
a\(_I\)      \(y_{I11}, \ldots, y_{I1K}\)   \(\ldots\)  \(y_{Ij1}, \ldots, y_{IjK}\)   \(\ldots\)  \(y_{IJ1}, \ldots, y_{IJK}\)

Note that the number of observations per cell, \(K\), is assumed to be equal for all cells. A perfect (orthogonal) decomposition of the variance into the two main effects and their interaction is only possible in this balanced case.

Model: Dummy coding \[ Y_{ijk} = \mu_{ij} + \varepsilon_{ijk} \]

Model: Effect coding

\[ Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \]

For both coding types: \[ i = 1, \ldots, I, j = 1, \ldots, J, k = 1, \ldots, K \] \[ \varepsilon_{ijk} \sim N (0, \sigma^2) \] \[ Y_{ijk} \sim N (\mu_{ij}, \sigma^2) \]

Hypotheses concerning interaction: \[ H_0^{A \times B}\colon~ (\alpha\beta)_{ij} = 0 ~\text{for all}~ i = 1, \ldots, I,~ j = 1, \ldots, J \] \[ H_1^{A \times B}\colon~ (\alpha\beta)_{ij} \neq 0 ~\text{for at least one pair}~ (i, j) \]

Hypotheses concerning main effect of factor \(A\) \[ H_0^{A}\colon~ \alpha_i = 0 ~\text{for all}~ i = 1, \ldots, I \] \[ H_1^{A}\colon~ \alpha_i \neq 0 ~\text{for at least one}~ i \]

Hypotheses concerning main effect of factor \(B\) \[ H_0^{B}\colon~ \beta_j = 0 ~\text{for all}~ j = 1, \ldots, J \] \[ H_1^{B}\colon~ \beta_j \neq 0 ~\text{for at least one}~ j \]

See Fahrmeir et al. (2013) for calculation of the mean squares \(MS_A, MS_B, MS_{A\times B}, MS_W\)

Test statistic for \(H_0^{A \times B}\) vs. \(H_1^{A \times B}\) \[ F_{A \times B} = \frac{MS_{A \times B}}{MS_W}, \qquad F_{A \times B} \sim F((I - 1)(J - 1),~ IJ(K - 1)) \]

Test statistic for \(H_0^A\) vs. \(H_1^A\) \[ F_A = \frac{MS_A}{MS_W}, \qquad F_A \sim F(I - 1,~ IJ(K - 1)) \]

Test statistic for \(H_0^B\) vs. \(H_1^B\) \[ F_B = \frac{MS_B}{MS_W}, \qquad F_B \sim F(J - 1,~ IJ(K - 1)) \]

Example

Two influencing factors on satisfaction with the study programme (Fahrmeir et al., 2013)

  • Motivation (Factor \(A\)) with the levels ‘motivated’ (\(A_1\)) and ‘unmotivated’ (\(A_2\))
  • Personal status (Factor \(B\)) with the levels ‘living with a partner’ (\(B_1\)) and ‘living alone’ (\(B_2\))

Satisfaction with the study programme was measured with a score ranging from 0 to 100, which was constructed so that it is approximately normally distributed.

For each factor level combination, five students were randomly selected and their satisfaction score was measured.

Data entry

dat <- read.table(header = TRUE, stringsAsFactors = TRUE,
                  text = "
  motiv     fam score
   high partner    85
   high partner    89
   high partner    91
   high partner    95
   high partner    80
   high  single    50
   high  single    52
   high  single    65
   high  single    71
   high  single    72
    low partner    34
    low partner    30
    low partner    28
    low partner    23
    low partner    40
    low  single    30
    low  single    28
    low  single    33
    low  single    16
    low  single    23
")

Fitting the model with either lm() or aov()

  aov.out <- aov(score ~ motiv * fam, dat)
  lm.out  <-  lm(score ~ motiv * fam, dat)

Result

  summary(aov.out)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## motiv        1  10811   10811 190.507 2.64e-10 ***
## fam          1   1201    1201  21.167 0.000295 ***
## motiv:fam    1    551     551   9.714 0.006644 ** 
## Residuals   16    908      57                     
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  anova(lm.out)
## Analysis of Variance Table
## 
## Response: score
##           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## motiv      1 10811.2 10811.2 190.5066 2.643e-10 ***
## fam        1  1201.2  1201.2  21.1674 0.0002953 ***
## motiv:fam  1   551.3   551.3   9.7137 0.0066437 ** 
## Residuals 16   908.0    56.8                       
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Simple interaction plot with means using interaction.plot()

with(dat, interaction.plot(motiv, fam, score, type = "b",
            pch = 16, xlab = "Motivation", ylab = "Score"))

Stepwise plot with data points, means, and standard errors

## plot raw data points
# fam == single
plot(score ~ as.numeric(motiv), dat[dat$fam == "single", ],
     xlab = "Motivation", ylab = "Score", col = "grey55",
     axes = FALSE, ylim = c(20, 100), xlim = c(.8, 2.2),
     pch = 16)

# fam == partner
points(score ~ as.numeric(motiv), dat[dat$fam == "partner", ],
       col = "grey55") 

axis(1, at = 1:2, labels = levels(dat$motiv)) # draw x-axis
axis(2)                                       # draw y-axis

ag <- aggregate(score ~ motiv + fam, dat, mean)    # aggregate mean
ag$se <- aggregate(score ~ motiv + fam, dat, \(x) sd(x)/sqrt(length(x)))[, 3] # SEM

## add mean
# fam == single
points(score ~ as.numeric(motiv), ag[ag$fam == "single", ],
       type = "b", pch = 16)
# fam == partner
points(score ~ as.numeric(motiv), ag[ag$fam == "partner", ],
       type = "b", lty = 2)

## add SEM arrows
with(ag, arrows(as.numeric(motiv), score - se, 
                as.numeric(motiv), score + se, code = 3, angle = 90, 
                length = .05))

## add legend and box
legend("topright", legend = c("Partner", "Single"), lty = c(2, 1), pch = c(1, 16))
box()

Estimated parameters

coef(aov.out) # or: summary(lm.out)
##        (Intercept)           motivlow 
##                 88                -57 
##          famsingle motivlow:famsingle 
##                -26                 21

E.g., the predicted value for a subject with predictors ‘low’ and ‘single’ is \(88 - 57 - 26 + 21 = 26\), as was the observed mean for these characteristics:

aggregate(score ~ motiv + fam, dat, mean)
##   motiv     fam score
## 1  high partner    88
## 2   low partner    31
## 3  high  single    62
## 4   low  single    26

8.3 Simple linear regression

Stochastic model: \[ y_i = \alpha + \beta \cdot x_i + \varepsilon_i ~\text{with}~ E(\varepsilon_i) = 0, ~\text{for all}~ i = 1, \ldots, n \]

  • The random variables \(\varepsilon_i\), \(i = 1, \ldots, n\), are assumed to be i.i.d.
  • In particular, all \(\varepsilon_i\), \(i = 1, \ldots, n\), should have the same variance \(Var(\varepsilon_i) = \sigma^2\) (Homoscedasticity).
    • Violations of homoscedasticity occur, e.g., when the variance of the residuals increases with increasing values of \(x\). Often, this can already be seen in a scatter plot of the values \((x_i, y_i)\). Plotting all residuals, ordered with respect to \(x\) or \(y\), usually is even more informative.
  • We further assume normal distributions for the error term and \(Y_i\). That is, we assume for all \(i = 1, \ldots, n\):

\[ \varepsilon_i \sim N(0, \sigma^2) \] \[ Y_i \sim N(\alpha + \beta \cdot x_i, \sigma^2) \]

  • There is a linear relationship between x and y (Linearity)

8.3.1 Hypothesis tests

Wald test for \(\alpha\) or \(\beta\)

Hypotheses

  1. H\(_0\): \(\alpha = \alpha_0 ~~~\) H\(_1\): \(\alpha \neq \alpha_0\)
  2. H\(_0\): \(\alpha \geq \alpha_0 ~~~\) H\(_1\): \(\alpha < \alpha_0\)
  3. H\(_0\): \(\alpha \leq \alpha_0 ~~~\) H\(_1\): \(\alpha > \alpha_0\)

Test statistic \[ T_{\alpha_0} = \frac{\hat{\alpha} - \alpha_0}{\hat{\sigma}_{\hat{\alpha}}} \qquad T_{\alpha_0} \sim t (n - 2) \]

Rejection region

  1. \(|T_{\alpha_0}| > t_{1-\alpha/2} (n-2)\)
  2. \(T_{\alpha_0} < -t_{1-\alpha} (n-2)\)
  3. \(T_{\alpha_0} > t_{1-\alpha} (n-2)\)

with the \(p\)-quantile \(t_p (n-2)\) of the \(t\) distribution with \(df = n - 2\) degrees of freedom

The test of the \(\beta\) coefficient (and other coefficients in multiple linear regression) follows the same scheme.
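As a sketch on simulated data, the Wald statistics for \(\alpha_0 = 0\) and \(\beta_0 = 0\) can be reproduced directly from the coefficient table:

```r
## Wald t statistics by hand vs. summary() (simulated data)
set.seed(7)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

ctab   <- coef(summary(fit))   # Estimate, Std. Error, t value, Pr(>|t|)
t_hand <- ctab[, "Estimate"] / ctab[, "Std. Error"]
p_hand <- 2 * pt(-abs(t_hand), df = fit$df.residual)  # df = n - 2

all.equal(t_hand, ctab[, "t value"])
## [1] TRUE
all.equal(p_hand, ctab[, "Pr(>|t|)"])
## [1] TRUE
```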

Example

Math ability (number of math problems solved in 2 minutes) as a linear function of working memory capacity (forward letter span).

dat <- read.table("dataInvers.txt", header = TRUE)
dat <- na.omit(dat)
plot(jitter(Rechnen, 1) ~ jitter(LetSp_v, 1), dat)

Fit the regression model

lm1 <- lm(Rechnen ~ LetSp_v, dat)
summary(lm1)
## 
## Call:
## lm(formula = Rechnen ~ LetSp_v, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2196  -5.8799  -0.8799   5.3402  24.5603 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.7602     1.4474   3.289  0.00104 ** 
## LetSp_v       1.7799     0.2474   7.195 1.22e-12 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.934 on 996 degrees of freedom
## Multiple R-squared:  0.04941,    Adjusted R-squared:  0.04846 
## F-statistic: 51.77 on 1 and 996 DF,  p-value: 1.223e-12

Parameter interpretation and hypothesis tests

  • If a person has a letter span of 0, we predict them to solve 4.76 math problems; the \(H_0\) that this intercept parameter \(\alpha\) is equal to zero (i.e., zero math problems are solved at a letter span of 0) is rejected, \(t(996) = 3.289, p = .001\).
  • For each 1 unit increase in letter span, we predict them to solve an additional 1.78 math problems; the \(H_0\) that this parameter \(\beta\) is equal to zero is rejected, \(t(996) = 7.195, p < .001\).
  • We estimate the standard deviation of our prediction error to be 7.9 math problems

Useful functions for model objects:

coef(lm1)    # Coefficients

fitted(lm1)  # Predicted values

resid(lm1)   # Residuals

confint(lm1) # Confidence intervals for parameters

abline(lm1)  # add regression line to scatter plot

8.3.2 Coefficient of determination

With only one predictor variable, the proportion of variance (of the dependent variable) explained by the regression equals the squared correlation coefficient between the two variables: \[R^2 = \frac{\sum_{i=1}^n(\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n(Y_i - \bar{Y})^2} = r_{XY}^2\]

# Variance explained
# See ?summary.lm and str(summary(lm1))
summary(lm1)$r.squared

# Squared correlation coefficient
# See ?cor.test
cor.t <- cor.test(~ Rechnen + LetSp_v, dat)
cor.t$estimate^2
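Since the example data set may not be at hand, the identity \(R^2 = r_{XY}^2\) can also be verified on simulated data:

```r
## R^2 equals the squared correlation in simple linear regression
set.seed(3)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

all.equal(summary(fit)$r.squared, cor(x, y)^2)
## [1] TRUE
```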

8.3.3 Model predictions

Using predict(), the predicted values for the dependent variable can be calculated based on

  • the observed values of the predictor(s), obtaining the fitted values for the sample

    predict(lm1) # or fitted(lm1)
  • any supplied values for the predictor(s)

    pred <- predict(lm1, newdata = data.frame(LetSp_v = c(3, 10)))

In the latter case, a new data frame with predictor variable(s) needs to be supplied. The general syntax is:

predict(mod, newdata = data.frame(x1 = new_xvals1, x2 = new_xvals2))
# See ?predict.lm

Predicted values fall on the fitted regression line.

8.3.4 Model diagnostics

When using linear models, we should check whether the model assumptions (approximately) hold.

This is usually done graphically, using plots such as:

  • Scatter plot

  • Plot of residuals as a function of the predictors or the predicted values

  • Normal-quantile plots of the (standardized) residuals

Most of this is conveniently performed by the plot method of lm objects:

par(mfrow = c(2, 2),      # plot layout: 2 rows, 2 columns -> 4 plots in one window
    oma   = c(0,0,0,0),   # outer margin
    mar   = c(3,3,4,1),   # margin
    mgp   = c(2, .7, 0))  # position axis title/label/line 
plot(lm1)

Illustration of the ‘Residuals vs Fitted’ plot for different (non-)violations of the model assumptions (left: scatter plot with regression line; right: ‘Residuals vs Fitted’ plot)

9 Linear models (2)

9.1 Multiple linear regression

Generalizing the simple linear regression, we get:

\[ Y_i = \beta_0 + \beta_1 \cdot X_{i1} + \ldots + \beta_p \cdot X_{ip} + \varepsilon_i \] with \(E(\varepsilon_i) = 0\), for all \(i = 1, \ldots, n\)

\[Y_i \sim N (\mu_i, \sigma^2)\] with \(\mu_i = \beta_0 + \beta_1 \cdot x_{i1} + \ldots + \beta_p \cdot x_{ip}\)

More predictors is not always better!

The main issue is multicollinearity, which arises when predictors are correlated. Multicollinearity makes the regression equation unstable, distorts the estimated test statistics, and impedes parameter interpretation (e.g., when the size or even the sign of a parameter estimate depends on which other predictors are included).

9.1.1 Hypothesis tests

  • A Wald test for a single parameter \(\beta_j \quad(j = 1, \dots, p)\) can be performed as in simple linear regression, with \(T_{\beta_j} \sim t (n - p)\)

  • Overall \(F\) test \[ \begin{align} H_0&\colon~ \beta_1 = \ldots = \beta_p = 0\\ H_1&\colon~ \beta_j \neq 0 ~\text{ for at least one}~ j \in \{1, \ldots, p\} \end{align} \] Test statistic \[ F = \frac{R^2}{1 - R^2} \cdot \frac{n - p - 1}{p}, \qquad F \sim F(p, n - p - 1) \] Rejection region \[ F > F_{1-\alpha} (p, n - p - 1) \]
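The relation between the overall \(F\) statistic and \(R^2\) can be checked on simulated data (here with \(p = 2\) made-up predictors):

```r
## Overall F statistic computed from R^2 vs. summary() (simulated data)
set.seed(11)
n  <- 80
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.4 * x1 + 0.2 * x2 + rnorm(n)
s  <- summary(lm(y ~ x1 + x2))

p  <- 2                      # number of predictors
R2 <- s$r.squared
F_hand <- R2 / (1 - R2) * (n - p - 1) / p

all.equal(F_hand, unname(s$fstatistic["value"]))
## [1] TRUE
```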

Hypothesis tests involve model comparison.

Wald and \(F\) tests are appropriate for nested (‘hierarchical’) models, which are created by parameter restrictions. Some restriction types are:

  • Parameter elimination (parameter is set to \(0\)); simply a Wald test \[ \begin{align} Y_i &= \beta_0 + \beta_1 \cdot X_{i1} + \varepsilon_i \\ Y_i &= \beta_0 + \beta_1 \cdot X_{i1} + \beta_2 \cdot X_{i2} + \varepsilon_i \end{align} \]
  • Parameter fixation (parameter is set to \(k\)) \[ \begin{align} Y_i &= \beta_0 + k \cdot X_{i1} + \varepsilon_i \\ Y_i &= \beta_0 + \beta_1 \cdot X_{i1} + \varepsilon_i \end{align} \]
  • Parameter equation (parameters are set to the same value) \[ \begin{align} Y_i &= \beta_0 + \beta_3 \cdot X_{i1} + \beta_3 \cdot X_{i2} + \varepsilon_i \\ Y_i &= \beta_0 + \beta_1 \cdot X_{i1} + \beta_2 \cdot X_{i2} + \varepsilon_i \end{align} \] In each restriction type, the models are hierarchical.

Implementation of hierarchical models by parameter restriction in R.

  • Parameter elimination (parameter is set to \(0\))

    lm(y ~ x1)              # (restricted model)
    lm(y ~ x1 + x2)         # (unrestricted model)
  • Parameter fixation (parameter is set to \(k\))

    lm(y ~ offset(k * x1))  # (restricted model)
    lm(y ~ x1)              # (unrestricted model)
  • Parameter equation (parameters are set to the same value) (for two continuous predictors \(x_1\) and \(x_2\))

    lm(y ~ I(x1 + x2))      # (restricted model)
    lm(y ~ x1 + x2)         # (unrestricted model)

Incremental F test

With this test, we can compare two hierarchical (a.k.a. nested) models:

  • \(M_0\) (with \(q_0\) model parameters)
  • \(M_1\) (with \(q_1\) model parameters)

The restricted model \(M_0\) may have parameter restrictions on more than one parameter.

Test statistic and its distribution under the null hypothesis: \[ \begin{align} F &= \frac{(SS_{\hat{Y}1} - SS_{\hat{Y}0})/ (q_1 - q_0)}{SS_{\hat{\varepsilon}1}/(N - q_1)} = \frac{(SS_{\hat{\varepsilon}0} - SS_{\hat{\varepsilon}1})/ (q_1 - q_0)}{SS_{\hat{\varepsilon}1}/(N - q_1)}\\ &= \frac{N - q_1}{q_1 - q_0} \cdot \frac{R^2_1 - R^2_0}{1 - R^2_1} \end{align} \] \[F \sim F(q_1 - q_0, N - q_1)\] The null hypothesis \(H_0\) is stated by the parameter restriction. For the examples further above, they are:

  • Parameter elimination; \(H_0: \beta_2 = 0\)
  • Parameter fixation; \(H_0: \beta_1 = k\)
  • Parameter equation; \(H_0: \beta_1 = \beta_2\)

By testing the \(H_0\), we decide whether the restricted model has to be rejected in favor of the more general model. The incremental F test for linear models is a special case of the more general likelihood-ratio test, which can be applied to (almost) any nested models, not only linear ones. In R, nested models are compared with the function anova().
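A self-contained sketch (simulated data) showing that anova() reproduces the formula above for a single parameter elimination:

```r
## Incremental F test: anova() vs. the R^2 formula (simulated data)
set.seed(5)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

m0 <- lm(y ~ x1)       # restricted model,   q0 = 2 parameters
m1 <- lm(y ~ x1 + x2)  # unrestricted model, q1 = 3 parameters

R2_0 <- summary(m0)$r.squared
R2_1 <- summary(m1)$r.squared
F_hand <- (n - 3) / (3 - 2) * (R2_1 - R2_0) / (1 - R2_1)

F_anova <- anova(m0, m1)$F[2]
all.equal(F_hand, F_anova)
## [1] TRUE
```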

9.1.2 Example 1: Multiple linear regression & incremental F-test

Math ability and working memory data

dat <- na.omit(read.table("dataInvers.txt", header = TRUE))

lm1 <- lm(Rechnen ~ LetSp_v, dat)             # restricted model
lm2 <- lm(Rechnen ~ LetSp_v + LetSp_r, dat)
  • Model lm2 adds another working memory predictor
  • Note that lm2 has exactly one more parameter than lm1, i.e. \(q_1 - q_0 = 1\)

Overall \(F\)-test and Wald tests

summary(lm2)
## 
## Call:
## lm(formula = Rechnen ~ LetSp_v + LetSp_r, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0125  -5.8373  -0.9911   5.1627  25.1627 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.5207     1.4809   2.378 0.017619 *  
## LetSp_v       1.3077     0.2796   4.676 3.33e-06 ***
## LetSp_r       0.8676     0.2445   3.548 0.000406 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.888 on 995 degrees of freedom
## Multiple R-squared:  0.06129,    Adjusted R-squared:  0.0594 
## F-statistic: 32.48 on 2 and 995 DF,  p-value: 2.16e-14

Interpretation: Overall \(F\)-test and Wald tests (for \(\alpha = .05\))

  • The \(H_0\) that the parameters for LetSp_v and LetSp_r are both equal to zero is rejected, \(F(2, 995) = 32.48,\; p < .001\)
  • The \(H_0\) that the parameter for LetSp_v is equal to zero is rejected, \(t(995) = 4.68,\; p < .001\).
  • The \(H_0\) that the parameter for LetSp_r is equal to zero is rejected, \(t(995) = 3.55,\; p < .001\).
  • The \(H_0\) that the intercept is equal to zero is rejected, \(t(995)= 2.38,\; p = .018\).

The model lm2 is: \[ \begin{align} Rechnen_i = \beta_0 & + \beta_1 \cdot X_{i,LetSp_v} + \beta_2 \cdot X_{i,LetSp_r} + \varepsilon_i \end{align} \]

What does a one unit increase in \(X_{i,LetSp_v}\) mean?

Let’s look at the difference in prediction for persons \(i\) and \(j\) with:

  • \(X_{i,LetSp_r} = X_{j,LetSp_r} = c\) (same backward memory span)
  • an arbitrary value \(X_{i,LetSp_v}\) for person \(i\)
  • a one unit higher value for person \(j\): \(X_{j,LetSp_v} = X_{i,LetSp_v} + 1\).

\[ \begin{align} \widehat{Rechnen_i} &= \beta_0 + \beta_1 \cdot X_{i,LetSp_v} &+ \beta_2 \cdot c\\ \widehat{Rechnen_j} &= \beta_0 + \beta_1 \cdot (X_{i,LetSp_v} + 1) &+ \beta_2 \cdot c\\ \widehat{Rechnen_j} - \widehat{Rechnen_i} &= \beta_1 \end{align} \] So \(\beta_1\) is the difference in math ability for two persons with equal backward letter span but a one unit difference in forward letter span. It is the effect of forward letter span, all else being equal (ceteris paribus).

Incremental \(F\)-test (model comparison)

anova(lm1, lm2)
## Analysis of Variance Table
## 
## Model 1: Rechnen ~ LetSp_v
## Model 2: Rechnen ~ LetSp_v + LetSp_r
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    996 62696                                  
## 2    995 61912  1    783.25 12.588 0.0004064 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • In this case (\(q_1 - q_0 = 1\)), the resulting \(p\) value is identical to that of the corresponding Wald test above
  • This is because, for a restriction on a single parameter, the Wald test is formally equivalent to the comparison of the two nested models.

9.1.3 Example 2: multiple parameter restrictions

Let’s look at a situation where \(q_1 - q_0 > 1\).

First, create a categorical variable for age with four categories

quantile(dat$Alter)
##   0%  25%  50%  75% 100% 
##   18   21   24   45   87
dat$AlterKat <- ifelse(dat$Alter < 21,
    "18to20", ifelse(dat$Alter < 24,
         "21to23", ifelse(dat$Alter < 45,
             "24to44", "45to87")))

dat$AlterKat <- factor(dat$AlterKat)


summary(dat$AlterKat)
## 18to20 21to23 24to44 45to87 
##    247    240    250    261
table(dat$AlterKat)
## 
## 18to20 21to23 24to44 45to87 
##    247    240    250    261

Add effect of age category to effect of forward letter span

lm3 <- lm(Rechnen ~ LetSp_v + AlterKat, dat)

In dummy coding, factors with \(k\) levels are represented by \(k - 1\) dummy variables:

AlterKat   AlterKat21to23   AlterKat24to44   AlterKat45to87
  18to20                0                0                0
  21to23                1                0                0
  24to44                0                1                0
  45to87                0                0                1
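The dummy variables R creates can be inspected via the design matrix; a minimal sketch with a toy factor whose labels match the table above:

```r
## Dummy coding as produced by model.matrix()
f  <- factor(c("18to20", "21to23", "24to44", "45to87"))
mm <- model.matrix(~ f)
mm[, -1]  # drop intercept column: one dummy per non-reference level
```

The first factor level ("18to20") is the reference category and gets no dummy of its own.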

So the model lm3 is: \[ \begin{align} Rechnen_i = \beta_0 & + \beta_1 \cdot X_{i,LetSp_v} + \beta_2 \cdot X_{i,AlterKat21to23} \\ & + \beta_3 \cdot X_{i,AlterKat24to44} + \beta_4 \cdot X_{i,AlterKat45to87} + \varepsilon_i \end{align} \]

Wald tests for lm3

summary(lm3)
## 
## Call:
## lm(formula = Rechnen ~ LetSp_v + AlterKat, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.426  -5.662  -1.189   5.338  25.338 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)      3.2741     1.5219   2.151 0.031690
## LetSp_v          1.8979     0.2470   7.683 3.73e-14
## AlterKat21to23   0.6298     0.7132   0.883 0.377360
## AlterKat24to44  -0.1332     0.7058  -0.189 0.850381
## AlterKat45to87   2.6310     0.7012   3.752 0.000185
##                   
## (Intercept)    *  
## LetSp_v        ***
## AlterKat21to23    
## AlterKat24to44    
## AlterKat45to87 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.867 on 993 degrees of freedom
## Multiple R-squared:  0.06827,    Adjusted R-squared:  0.06451 
## F-statistic: 18.19 on 4 and 993 DF,  p-value: 1.977e-14
  • The effect of forward letter span is, for constant age, 1.90. That is, if the letter span increases by 1, then (for constant age) 1.90 more math problems are solved on average.
  • The effect of being in the oldest age category is, for constant forward letter span, 2.63. That is, persons in the oldest age category solve, given the same letter span, on average 2.63 more math problems than persons in the youngest age category.

Model selection

anova(lm1, lm3) # Model lm3 is chosen
## Analysis of Variance Table
## 
## Model 1: Rechnen ~ LetSp_v
## Model 2: Rechnen ~ LetSp_v + AlterKat
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    996 62696                                  
## 2    993 61452  3    1243.4 6.6973 0.0001779 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • We reject the \(H_0\) that \(\beta_2 = \beta_3 = \beta_4 = 0\)
  • This incremental F test is not equivalent to any of the Wald tests shown in summary(lm3), because the Wald tests are each concerned with a single \(\beta\) parameter.

10 Simulation based procedures

10.1 Random number generation

For drawing random numbers from a statistical distribution, the distribution name is prefixed by r.

Examples (see ?Distributions for a list of distributions):

rnorm(10)  # draw from standard normal distribution
runif(10)  # draw from uniform distribution

Sampling with or without replacement from a vector is performed by the sample() function.

sample(1:5, size = 10, replace = TRUE)

R’s random number generator produces new random numbers on each run. To obtain the same results as in a previous simulation, the seed (‘starting value’) can be set explicitly:

set.seed(1223)  # set seed, so on each run
runif(3)        # random numbers will be identical
## [1] 0.6289619 0.1267469 0.3285822
set.seed(1223)  # set seed, so on each run
runif(3)        # random numbers will be identical
## [1] 0.6289619 0.1267469 0.3285822

10.2 Repeated execution replicate()

replicate() is a wrapper function for sapply() designed for repeated execution of the same expression. It makes tasks like random number generation very simple and avoids writing loops:

# Two alternative calls using sapply()
sapply(1:3, FUN = rnorm, n = 5)   
##            [,1]      [,2]     [,3]
## [1,]  0.4336592 1.8009529 2.994275
## [2,]  1.9911233 0.6681418 3.093450
## [3,]  1.6196675 2.2842026 2.342798
## [4,] -0.2708700 2.2736581 3.587997
## [5,] -0.2273296 2.0647675 1.838956
sapply(1:3, function(x) rnorm(5))
##             [,1]       [,2]       [,3]
## [1,]  0.04962477 -0.4377098 -0.5655823
## [2,]  0.30150230 -0.2300038 -0.3912522
## [3,] -0.98600835  0.8156111 -1.3839161
## [4,]  0.56849428 -0.6653574  1.8774840
## [5,] -0.10353400  0.4355629  0.1575512
# Same number generation, but simpler call using replicate()
replicate(3, rnorm(5))          
##            [,1]        [,2]      [,3]
## [1,] -1.7661080  0.73675570 1.0209950
## [2,] -2.2177442  0.23925922 1.1338884
## [3,] -0.6701462  0.01314772 2.4327815
## [4,] -0.1332934 -0.23338271 0.1711988
## [5,] -1.0773198  0.95471266 1.4126050

A set of multiple expressions may be repeated.

Example: Unbiased estimation of the population variance \(\sigma^2\)

# Using replicate()
v <- replicate(100000, {
  x <- rnorm(n = 10, mean = 0, sd = 5)
  sum((x - mean(x))^2) / (10 - 1)
})
mean(v)
## [1] 25.0417
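For comparison, dividing by \(n\) instead of \(n - 1\) yields a biased estimate of \(\sigma^2 = 25\) (same simulation settings as above):

```r
## Biased variance estimator (divisor n instead of n - 1)
v_biased <- replicate(100000, {
  x <- rnorm(n = 10, mean = 0, sd = 5)
  sum((x - mean(x))^2) / 10
})
mean(v_biased)  # close to (9/10) * 25 = 22.5
```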

10.3 Simulated power

The power of a test is the probability that it becomes significant given that H\(_0\) is false, i.e., that it (correctly) detects an effect.

Power can be estimated by simulation: It is the proportion of significant tests for a specified effect size, variability, sample size, and \(\alpha\).

Example: Power of one-sample \(t\)-test to detect an effect of 3 Hz at \(\sigma = 8\) Hz, \(n = 30\), and \(\alpha = 5\)%

pval <- replicate(1000, {
  x <- rnorm(30, mean = 443, sd = 8) # draw from effect distrib. 443 - 440 = 3 Hz
  t.test(x, mu = 440)$p.value        # test against H0
})
mean(pval < 0.05)                    # simulated power (proportion of significant tests)
## [1] 0.504
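For this simple setting, the simulated power can be checked against the closed-form result from power.t.test():

```r
## Analytic power for the same setting (one-sample t test)
power.t.test(n = 30, delta = 3, sd = 8, sig.level = 0.05,
             type = "one.sample")$power  # close to the simulated value
```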

10.4 Bootstrap methods

Bootstrap: simulation methods for frequentist inference

The name refers to the famous story about Baron Munchhausen, who was stuck with his horse in a swamp and got out by pulling on his own bootstraps (in the German version: by his own hair, ‘am eigenen Schopf’). We’ll see that something similar can actually be done in statistics …

Useful when (Davison, 2006):

  • standard assumptions invalid (\(n\) small, data not normal, …)
  • complex problem has no (reliable) theory
  • or (almost) anywhere else

10.4.1 Bootstrap resampling (nonparametric bootstrap)

Algorithm (Efron & Tibshirani, 1993; Hesterberg et al., 2005, fig. 18.5):

  • Select \(B\) bootstrap samples \(\mathbf{x}^{*1}, \mathbf{x}^{*2}, \ldots, \mathbf{x}^{*B}\), each consisting of \(n\) data values drawn with replacement from the original data \(\mathbf{x}\).

  • Compute the bootstrap replicate \(\mathbf{\hat\vartheta}^* = s(\mathbf{x}^*)\) of the statistic of interest \(\mathbf{\hat\vartheta} = s(\mathbf{x})\) for each sample.

  • Assess the variability of the statistic via the distribution of bootstrap replicates.

Bootstrap algorithm


Bootstrap confidence intervals

For a given significance level \(\alpha\), the corresponding limits of the percentile interval of the bootstrap distribution are equal to its \(\alpha/2\) and \(1-\alpha/2\) quantiles.

This percentile interval is a \((1 - \alpha)\) bootstrap confidence interval for \(\vartheta\).

10.4.1.1 Example: survival time of mice

## Mouse data (Efron & Tibshirani, 1993, p. 11)
mouse <- data.frame(
   grp = rep(c("trt", "ctl"), c(7, 9)),
  surv = c(94, 197, 16, 38, 99, 141, 23,            # trt
           52, 104, 146, 10, 50, 31, 40, 27, 46))   # ctl

## Observed difference of means
t.test(surv ~ grp, mouse, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  surv by grp
## t = -1.1214, df = 14, p-value = 0.281
## alternative hypothesis: true difference in means between group ctl and group trt is not equal to 0
## 95 percent confidence interval:
##  -89.22770  27.95786
## sample estimates:
## mean in group ctl mean in group trt 
##          56.22222          86.85714
meandiff <- with(mouse, diff(tapply(surv, grp, mean)))

## Resampling the observations and calculating the difference of means
sam1 <- numeric(1000)
for(i in seq_along(sam1)) {
 trt <- sample(mouse[mouse$grp == "trt", "surv"], 7, replace = TRUE)
 ctl <- sample(mouse[mouse$grp == "ctl", "surv"], 9, replace = TRUE)
 sam1[i] <- mean(trt) - mean(ctl)
}

## 95% bootstrap confidence interval for difference of means
quantile(sam1, c(.025, .975))  
##      2.5%     97.5% 
## -19.30992  79.88492
hist(sam1, breaks = 20, freq = FALSE)
abline(v = meandiff, lwd = 2, col = "darkblue")    # add observed diff.
abline(v = quantile(sam1, c(.025, .975)), lty = 2) # add bootstrap CI
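The same resampling can be written more compactly with replicate() from Section 10.2; the group data are repeated here so the sketch is self-contained:

```r
## Bootstrap difference of means via replicate()
trt_surv <- c(94, 197, 16, 38, 99, 141, 23)
ctl_surv <- c(52, 104, 146, 10, 50, 31, 40, 27, 46)

sam2 <- replicate(1000, {
  mean(sample(trt_surv, replace = TRUE)) -
    mean(sample(ctl_surv, replace = TRUE))
})
quantile(sam2, c(.025, .975))  # 95% bootstrap CI
```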

11 References

Becker, R. A., & Chambers, J. M. (1984). S. An interactive environment for data analysis and graphics. Wadsworth; Brooks/Cole.
Davison, A. (2006). Bootstrap methods and their application. Cambridge University; University Lecture. https://www.researchgate.net/profile/Debashis_Kushary/publication/261638887_Bootstrap_Methods_and_Their_Application/links/54f5ccfe0cf27d8ed71cc1ae.pdf
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. Chapman & Hall/CRC.
Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2013). Regression: Models, methods and applications. Springer.
Hesterberg, T., Moore, D. S., Monaghan, S., Clipson, A., & Epstein, R. (2005). Bootstrap methods and permutation tests. Introduction to the Practice of Statistics, 5, 1–70. http://statweb.stanford.edu/~tibs/stat315a/Supplements/bootstrap.pdf
Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5, 299–314.