This is just a rough idea of what the Shiny app might include. If you have your own ideas / approaches / changes, then you can of course integrate them into the app.
Represent the “complete” sequence of a (psychological) experiment in a prototypical way.
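A minimal, purely hypothetical sketch of what such an app skeleton could look like (all input names, labels, defaults, and the toy simulation in the server are placeholder assumptions, not a specification of the final app):

library(shiny)

# Hypothetical skeleton: one input per preregistration decision, one output panel.
ui <- fluidPage(
  titlePanel("Preregistration & p-hacking demo (sketch)"),
  sidebarLayout(
    sidebarPanel(
      numericInput("n", "Sample size per group", value = 64, min = 2),
      selectInput("test", "Analysis", choices = c("t-test", "Welch test")),
      actionButton("run", "Run simulated experiment")
    ),
    mainPanel(plotOutput("p_hist"))
  )
)

server <- function(input, output) {
  p_vals <- eventReactive(input$run, {
    replicate(1000, {
      x <- rnorm(input$n)   # H0 is true: no difference between the groups
      y <- rnorm(input$n)
      t.test(x, y, var.equal = (input$test == "t-test"))$p.value
    })
  })
  output$p_hist <- renderPlot(
    hist(p_vals(), breaks = 16, xlim = c(0, 1), main = "", xlab = "p value")
  )
}

# shinyApp(ui, server)  # uncomment to run

The real app would of course replace the toy simulation in the server with the full experiment sequence worked through below.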
In the preregistration, the sample size, hypothesis, and analysis are defined (possibly offering some options, e.g., t-test / Welch test / bootstrap).
BUT: It is also possible to omit parts of the preregistration.
Then p-hacking can play a role:
\(\Rightarrow\) Repeat data collection and analysis very often to see what happens to the alpha error.
p-hacking is not necessarily an intentional process; often ignorance and learned habits are partly responsible. Preregistration helps to identify in advance the “forks” that one encounters during data analysis and thus to carry out a sound analysis.
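The t-test and the Welch test both appear in the code below; as a rough illustration of the bootstrap option mentioned above, here is one possible (hypothetical) variant, a percentile bootstrap for the difference in group means:

# Hypothetical sketch of the "bootstrap" analysis option (one possible variant:
# a percentile bootstrap interval for the difference in group means).
boot_test <- function(x, y, n_boot = 2000){
  diffs <- replicate(n_boot,
                     mean(sample(x, replace = TRUE)) - mean(sample(y, replace = TRUE)))
  ci <- quantile(diffs, c(0.025, 0.975))
  # "significant" if the 95% interval does not contain 0
  list(ci = ci, significant = unname(ci[1] > 0 | ci[2] < 0))
}

set.seed(1)
boot_test(rnorm(30, mean = 0.5), rnorm(30))

Whether the app would offer a bootstrap p value or an interval decision like this is an open design choice.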
Power simulation, or, where possible, an analytical calculation (faster):
mu_diff <- 0.5
sigma <- 1

# power <- 0
# n <- 2
# while(power < 0.8){ # simulate power = 0.8
#   n <- n + 2
#   pval <- replicate(1000, {
#     x <- rnorm(n, mean = 0 + mu_diff, sd = sigma)
#     y <- rnorm(n, mean = 0, sd = sigma)
#     t.test(x, y, mu = 0, var.equal = TRUE)$p.value # variances unknown
#   })
#   power <- mean(pval < 0.05) # alpha = 0.05
# }

power_calc <- power.t.test(delta = mu_diff, sd = sigma, power = 0.8)
n <- ceiling(power_calc$n) # round to next higher integer
n

## [1] 64
Data simulation. Reminder: there is no true effect; the true difference between the groups is 0. We have 3 groups in our design, but for our hypothesis only the difference between g1 and g2 is relevant.
data <- data.frame(g1 = rnorm(n),
                   g2 = rnorm(n),
                   g3 = rnorm(n))
Analysis
t.test(data$g1, data$g2, var.equal = TRUE)$p.value
## [1] 0.2071572
Repeat data simulation & Analysis:
n_replicate <- 10 * 1000 # Number of repetitions of the experiment. The higher this number is the
# longer the simulations take, but the less they fluctuate around the "true" result.

p_vals <- replicate(n_replicate, {
  data <- data.frame(g1 = rnorm(n),
                     g2 = rnorm(n),
                     g3 = rnorm(n))
  t.test(data$g1, data$g2, var.equal = TRUE)$p.value
})
mean(p_vals < 0.05)
## [1] 0.0536
# In about 5% of the cases we get a significant result, although H0 holds.
# This corresponds exactly to our previously determined alpha = 0.05.
# Histogram of p values
hist(p_vals, xlim = c(0, 1), breaks = 16, freq = FALSE)
# If H0 holds, the p values are uniformly distributed.
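If you want to check this uniformity more formally than by eye, one option (not part of the original script) is a Kolmogorov–Smirnov test against the standard uniform distribution:

# Optional check: compare the simulated p values under H0 with a standard
# uniform distribution (a large p value = no evidence against uniformity).
ks.test(p_vals, "punif")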
Let’s p-hack! (Don’t do it for real!!!)
Because we have not made a preregistration, we are missing a fixed sample size, hypothesis, and analysis plan.
Let’s look at what this does to our type I error (the alpha error).
Reminder: This error is supposed to be 5%.
Hack 1: Collect more data. In principle, you could keep collecting data and re-testing until the result is significant; if you do that, eventually everything becomes significant. Therefore, we limit ourselves here to 20 additional persons per group, should our test not become significant.
n_additional <- 20

p_vals_hack1 <- replicate(n_replicate, {
  data <- data.frame(g1 = rnorm(n),
                     g2 = rnorm(n),
                     g3 = rnorm(n))
  p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  if(p_val >= 0.05){ # 20 additional persons per group
    data <- rbind(data,
                  data.frame(g1 = rnorm(n_additional),
                             g2 = rnorm(n_additional),
                             g3 = rnorm(n_additional)))
    p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  }
  p_val
})
mean(p_vals_hack1 < 0.05)
## [1] 0.073
hist(p_vals_hack1, xlim = c(0, 1), breaks = 16, freq = FALSE)
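As claimed above, without a limit on the additional data collection the type I error keeps growing with every further look at the data. Here is a small sketch of that unrestricted optional stopping (illustration only, not part of the app; it reuses n and n_additional from above and caps the procedure at a hypothetical max_looks interim analyses):

# Sketch: repeated interim testing with optional stopping. After every
# additional batch of n_additional persons per group we test again and
# stop as soon as p < 0.05, up to max_looks looks at the data.
max_looks <- 10
p_vals_looks <- replicate(1000, {
  g1 <- rnorm(n)
  g2 <- rnorm(n)
  p_val <- t.test(g1, g2, var.equal = TRUE)$p.value
  looks <- 1
  while(p_val >= 0.05 && looks < max_looks){
    g1 <- c(g1, rnorm(n_additional))
    g2 <- c(g2, rnorm(n_additional))
    p_val <- t.test(g1, g2, var.equal = TRUE)$p.value
    looks <- looks + 1
  }
  p_val
})
mean(p_vals_looks < 0.05) # clearly above 0.05, and it grows further with max_looks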
Hack 2: A two-sided hypothesis becomes one-sided as if by magic. First, the group means \(\mu_{g1}\) and \(\mu_{g2}\) are compared; depending on which value is larger, we formulate a one-sided hypothesis in that direction.
p_vals_hack2 <- replicate(n_replicate, {
  data <- data.frame(g1 = rnorm(n),
                     g2 = rnorm(n),
                     g3 = rnorm(n))
  t.test(data$g1, data$g2, var.equal = TRUE,
         alternative = ifelse(mean(data$g1) > mean(data$g2),
                              "greater",
                              "less"))$p.value
})
mean(p_vals_hack2 < 0.05)
## [1] 0.09902
hist(p_vals_hack2, xlim = c(0, 1), breaks = 16, freq = FALSE)
Hack 3: If the difference \(\mu_{g1} - \mu_{g2}\) is not significant, then examine \(\mu_{g1} - \mu_{g3}\). So instead of groups 1 and 2 we now compare groups 1 and 3.
p_vals_hack3 <- replicate(n_replicate, {
  data <- data.frame(g1 = rnorm(n),
                     g2 = rnorm(n),
                     g3 = rnorm(n))
  p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  if(p_val >= 0.05){
    p_val <- t.test(data$g1, data$g3, var.equal = TRUE)$p.value
  }
  p_val
})
mean(p_vals_hack3 < 0.05)
## [1] 0.0884
hist(p_vals_hack3, xlim = c(0, 1), breaks = 16, freq = FALSE)
Hack 4: First, we perform a normal t-test. However, if this does not turn out significant, we drop the assumption of variance homogeneity and try a Welch test.
p_vals_hack4 <- replicate(n_replicate, {
  data <- data.frame(g1 = rnorm(n),
                     g2 = rnorm(n),
                     g3 = rnorm(n))
  p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  if(p_val >= 0.05){ # Welch-Test
    p_val <- t.test(data$g1, data$g3, var.equal = FALSE)$p.value
  }
  p_val
})
mean(p_vals_hack4 < 0.05)
## [1] 0.0944
hist(p_vals_hack4, xlim = c(0, 1), breaks = 16, freq = FALSE)
Finally, all hacks combined: each hack is applied only if the result is still not significant after the previous one.
n_additional <- 20

p_vals_hacks <- replicate(n_replicate, {
  data <- data.frame(g1 = rnorm(n),
                     g2 = rnorm(n),
                     g3 = rnorm(n))
  p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  # Hack 1: Collect more data
  if(p_val >= 0.05){ # 20 additional persons per group
    data <- rbind(data,
                  data.frame(g1 = rnorm(n_additional),
                             g2 = rnorm(n_additional),
                             g3 = rnorm(n_additional)))
    p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  }
  # Hack 2: adjust hypothesis
  if(p_val >= 0.05){
    hypo <- ifelse(mean(data$g1) > mean(data$g2),
                   "greater",
                   "less")
    p_val <- t.test(data$g1, data$g2, var.equal = TRUE,
                    alternative = hypo)$p.value
  }
  # Hack 3: compare other groups
  if(p_val >= 0.05){
    p_val <- t.test(data$g1, data$g3, var.equal = TRUE)$p.value
  }
  # Hack 4: Use another test (Welch test)
  if(p_val >= 0.05){
    p_val <- t.test(data$g1, data$g2, var.equal = FALSE,
                    alternative = hypo)$p.value
  }
  p_val
})
mean(p_vals_hacks < 0.05)
## [1] 0.1503
hist(p_vals_hacks, xlim = c(0, 1), breaks = 16, freq = FALSE)
par(mfrow = c(6, 1), mar = c(3.1, 1.5, 1.5, 0.1), mgp = c(2, 0.7, 0))
bins <- seq(0, 1, length.out = 31)
hist(p_vals, xlim = c(0, 1), breaks = bins, freq = FALSE, axes = FALSE,
xlab = "", ylab = "", main = "Preregistered experiment")
axis(1)
hist(p_vals_hack1, xlim = c(0, 1), breaks = bins, freq = FALSE, axes = FALSE,
xlab = "", ylab = "", main = "Hack: additional data collected after initial analysis")
axis(1)
hist(p_vals_hack2, xlim = c(0, 1), breaks = bins, freq = FALSE, axes = FALSE,
xlab = "", ylab = "", main = "Hack: hypothesis adjusted after looking at the data")
axis(1)
hist(p_vals_hack3, xlim = c(0, 1), breaks = bins, freq = FALSE, axes = FALSE,
xlab = "", ylab = "", main = "Hack: Multiple pairwise comparisons")
axis(1)
hist(p_vals_hack4, xlim = c(0, 1), breaks = bins, freq = FALSE, axes = FALSE,
xlab = "P Value", ylab = "", main = "Hack: Multiple tests performed")
axis(1)
hist(p_vals_hacks, xlim = c(0, 1), breaks = bins, freq = FALSE, axes = FALSE,
xlab = "P Value", ylab = "", main = "All 'hacks' combined")
axis(1)
segments(0.05, -5, 0.05, 100, lty = 2, xpd = NA)
mtext("Density", 2, outer = TRUE, padj = 2)
Congratulations: you now have more significant results, but you’re going to data analysis hell for it… Was it worth it?
… and by the way: These significant results are of course all false positives. The effect that was supposedly found here does not exist!
For a more detailed look at false positives and the degrees of freedom in data analysis, see: https://doi.org/10.1177/0956797611417632
Disclaimer:
The goal of the project is to demonstrate the benefits of preregistration using common errors in data analysis, referred to here as “p-hacking”. The app is not intended to be a guide to scientific misconduct. The simulations shown here are intended to alert the user to the dangers and hurdles of data analysis and of the entire process of an empirical experiment.
Some technical ideas (optional):