1 Project idea


This is just a rough idea of what the Shiny app might include. If you have your own ideas / approaches / changes, then you can of course integrate them into the app.


Represent the “complete” sequence of a (psychological) experiment in a prototypical way.

  • Experiment design
  • Pre-registration
    • Hypotheses (two-sided / one-sided, which DV, which groups)
    • Power calculation -> sample size
      • (not mandatory to implement for this app)
      • But: the user should be able to set a sample size in the preregistration part
        • e.g.: “According to a power analysis, a sample of … should be collected.”
    • Analysis: e.g. t-test / Welch test or Wilcoxon test (equal variances?)
  • Data collection (Simulation)
  • Analysis

In the preregistration, the sample size, hypothesis, and analysis are defined (possibly: offer some options, e.g. t-test / Welch test / bootstrap).
BUT: It is also possible to omit parts of the preregistration.
Then p-hacking can come into play (see the sketch after this list):

  • e.g. collect data until significant (if no sample size preregistered),
  • run multiple tests and report only significant ones (if no analysis preregistered),
  • interpret hypothesis one-sided instead of two-sided afterwards (if no hypothesis preregistered).
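
A minimal sketch (all names are hypothetical, not a fixed specification) of how the app could keep track of which parts of the preregistration were filled in and which p-hacking options therefore remain available:

prereg <- list(                  # hypothetical preregistration state in the app
  sample_size = 64,              # NULL if not preregistered
  hypothesis  = "two.sided",     # NULL if not preregistered
  groups      = c("g1", "g2"),   # NULL if not preregistered
  analysis    = "t.test"         # NULL if not preregistered
)

# p-hacking options are only offered for the parts that were left open
available_hacks <- c(
  collect_until_significant = is.null(prereg$sample_size),
  one_sided_post_hoc        = is.null(prereg$hypothesis),
  multiple_comparisons      = is.null(prereg$groups),
  multiple_tests            = is.null(prereg$analysis)
)
available_hacks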

1.1 Some points/ideas to consider:

1.1.1 Experiment design:

  • Multiple dependent variables (choose one in the hypotheses)
    • In the preregistration it must be recorded which dependent variable is decisive for the hypothesis.
      p-hacking: report only significant dependent variables afterwards.
  • Linear regression: multiple predictors
    • In the preregistration, the model including its predictors must be defined.
      p-hacking: collect many predictors, evaluate all of them, and report only the significant ones (see the sketch after this list).
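
As a rough sketch of the multiple-predictor case (the numbers of predictors and observations are arbitrary placeholders; all true effects are 0), evaluating every predictor and keeping only the "significant" ones drives the chance of a false-positive finding well above 5%:

n_obs  <- 100   # placeholder values
n_pred <- 10

sig_any <- replicate(1000, {
  X <- as.data.frame(replicate(n_pred, rnorm(n_obs)))
  y <- rnorm(n_obs)                # outcome unrelated to all predictors
  fit <- lm(y ~ ., data = X)
  p_values <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept
  any(p_values < 0.05)             # "report only the significant predictors"
})

mean(sig_any)  # probability of at least one "significant" predictor, well above 0.05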

1.1.2 Power calculation:

  • Determine the required sample size analytically (e.g. with power.t.test) or via simulation; see the example in section 2.1.

1.1.3 Data collection:

  • Specify a data generating process that matches the design of the experiment (a minimal sketch follows below).
  • The true effect size is 0 -> only about alpha (the type I error rate, e.g. 5%) of the analyses should come out significant.
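
For example, a small helper (the function name and arguments are just one possible choice) that generates data for k independent groups under H0:

# Hypothetical helper: k independent groups, all with the same true mean (H0 holds)
simulate_groups <- function(n, k = 3, mean = 0, sd = 1) {
  groups <- replicate(k, rnorm(n, mean = mean, sd = sd), simplify = FALSE)
  as.data.frame(setNames(groups, paste0("g", seq_len(k))))
}

head(simulate_groups(n = 64))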

1.1.4 Analysis:

  • If a preregistration was made, analyze the data according to its specifications. Otherwise apply p-hacking (possibly: let the user select different p-hacking approaches via checkboxes).
  • If only part of the preregistration is filled in, restrict the set of available p-hacking approaches accordingly.

\(\Rightarrow\) Repeat data collection and analysis very often to find out what happens to the alpha error (a rough sketch follows below).
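
A rough sketch of how this branching could look (the function name, arguments, and the single hack shown are hypothetical simplifications; the data contain columns g1, g2, g3 as in the example below):

# Preregistered analysis if available, otherwise an (optional) p-hacking branch
analyze_once <- function(data, preregistered = TRUE, hack_more_groups = FALSE) {
  p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  if (!preregistered && hack_more_groups && p_val >= 0.05) {
    # no preregistered comparison: also look at group 1 vs. group 3
    p_val <- t.test(data$g1, data$g3, var.equal = TRUE)$p.value
  }
  p_val
}

# Repeat very often to estimate the resulting type I error rate
p_vals <- replicate(1000, {
  data <- data.frame(g1 = rnorm(30), g2 = rnorm(30), g3 = rnorm(30))  # placeholder n
  analyze_once(data, preregistered = FALSE, hack_more_groups = TRUE)
})
mean(p_vals < 0.05)  # noticeably above the nominal 0.05 when the hack is allowed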

1.2 Potential take-home message:

p-hacking is not necessarily an intentional process; often ignorance and learned habits are partly responsible. Preregistration helps to identify in advance the “forks” one encounters during data analysis and thus to perform a reasonable analysis.

2 Example

  • Design: three independent groups, same size
  • Pre-registration:
    • Hypotheses: There is a difference between group 1 and 2 (two-sided)
    • Power: According to the literature, the difference of the mean values \(\mu_{g1} - \mu_{g2} = 0.5\). For simplicity, we assume an SD of 1 in both groups.
  • Analysis:
    • t-test: two independent groups

2.1 The clean path as described in the preregistration

Power simulation or, if possible, an analytical calculation (faster):

mu_diff <- 0.5
sigma <- 1

# power <- 0
# n <- 2
# while(power < 0.8){  # simulate power = 0.8
#   n <- n + 2
#   pval <- replicate(1000, {
#     x <- rnorm(n, mean = 0 + mu_diff, sd = sigma)
#     y <- rnorm(n, mean = 0, sd = sigma)
#     t.test(x, y, mu = 0, var.equal = TRUE)$p.value # variances unknown but assumed equal
#   })
#   power <- mean(pval < 0.05) # alpha = 0.05
# }

power_calc <- power.t.test(delta = mu_diff, sd = sigma, power = 0.8)
n <- ceiling(power_calc$n) # round to next higher integer 

n
## [1] 64

Data simulation. Reminder: there is no true effect; the true difference between the groups is 0.
We have 3 groups in our design, but only the difference between g1 and g2 is relevant for our hypothesis.

data <- data.frame(g1 = rnorm(n),
                   g2 = rnorm(n),
                   g3 = rnorm(n))

Analysis

t.test(data$g1, data$g2, var.equal = TRUE)$p.value
## [1] 0.2071572

Repeat data simulation & analysis:

n_replicate <- 10 * 1000
# Number of repetitions of the experiment. The higher this number, the longer 
# the simulations take, but the less they fluctuate around the "true" result.

p_vals <- replicate(n_replicate,{
  data <- data.frame(g1 = rnorm(n),
                     g2 = rnorm(n),
                     g3 = rnorm(n))
  t.test(data$g1, data$g2, var.equal = TRUE)$p.value
})

mean(p_vals < 0.05) 
## [1] 0.0536
# In about 5% of the cases we get a significant result, although H0 holds. 
# This matches (up to simulation error) our chosen alpha = 0.05.

# Histogram of p values
hist(p_vals, xlim = c(0, 1), breaks = 16, freq = FALSE) 

# If H0 holds, the p values are uniformly distributed.

2.2 The wrong path: no preregistration

Let’s p-hack! (Don’t do it for real!!!)

Because we have not made a preregistration, we are missing:

  1. the sample size,
    • Collect data until stars appear to us *****
  2. the hypothesis,
    • Why test two-sided when you can look at the empirical group difference and then make a one-sided hypothesis?
  3. which groups we are comparing and
    • perform all three comparisons, report only the significant ones
  4. which test we are using for the analysis.
    • try out which test becomes significant

Let’s look at what this does to our type I error (the alpha error).

Reminder: This error is supposed to be 5%.

2.2.1 1. Collect data until stars appear

If you keep doing that, eventually everything becomes significant. Therefore, we limit ourselves here to 20 additional persons per group if the first test is not significant.

n_additional <- 20
p_vals_hack1 <- replicate(n_replicate,{
  data <- data.frame(g1 = rnorm(n),
                     g2 = rnorm(n),
                     g3 = rnorm(n))
  p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  if(p_val >= 0.05){ # 20 additional persons per group
    data <- rbind(data,
                  data.frame(g1 = rnorm(n_additional),
                             g2 = rnorm(n_additional),
                             g3 = rnorm(n_additional)))
    p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  }
  p_val
})

mean(p_vals_hack1 < 0.05) 
## [1] 0.073
hist(p_vals_hack1, xlim = c(0, 1), breaks = 16, freq = FALSE)

2.2.2 2. Adjust hypothesis according to data.

The two-sided hypothesis becomes one-sided as if by magic:
First, the empirical group means \(\bar{x}_{g1}\) and \(\bar{x}_{g2}\) are compared. Depending on which value is larger, we state a one-sided hypothesis in that direction.

p_vals_hack2 <- replicate(n_replicate,{
  data <- data.frame(g1 = rnorm(n),
                     g2 = rnorm(n),
                     g3 = rnorm(n))
  t.test(data$g1, data$g2, var.equal = TRUE, 
                  alternative = ifelse(mean(data$g1) > mean(data$g2),
                                       "greater",
                                       "less"))$p.value
})

mean(p_vals_hack2 < 0.05)
## [1] 0.09902
hist(p_vals_hack2, xlim = c(0, 1), breaks = 16, freq = FALSE)

2.2.3 3. Compare multiple groups

If the difference \(\mu_{g1} - \mu_{g2}\) is not significant, then examine \(\mu_{g1} - \mu_{g3}\).

So instead of groups 1 and 2, we now compare groups 1 and 3.

p_vals_hack3 <- replicate(n_replicate,{
  data <- data.frame(g1 = rnorm(n),
                     g2 = rnorm(n),
                     g3 = rnorm(n))
  p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  if(p_val >= 0.05){
    p_val <- t.test(data$g1, data$g3, var.equal = TRUE)$p.value
  }
  p_val
})

mean(p_vals_hack3 < 0.05)
## [1] 0.0884
hist(p_vals_hack3, xlim = c(0, 1), breaks = 16, freq = FALSE)

2.2.4 4. Perform multiple tests

First, we perform a normal t-test. However, if this does not become significant, we drop the assumption of variance homogeneity and try a Welch test.

p_vals_hack4 <- replicate(n_replicate,{
  data <- data.frame(g1 = rnorm(n),
                     g2 = rnorm(n),
                     g3 = rnorm(n))
  p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  if(p_val >= 0.05){ # Welch-Test
    p_val <- t.test(data$g1, data$g2, var.equal = FALSE)$p.value
  }
  p_val
})

mean(p_vals_hack4 < 0.05)
## [1] 0.0944
hist(p_vals_hack4, xlim = c(0, 1), breaks = 16, freq = FALSE)

2.2.5 The ultimate approach: all variants combined.

n_additional <- 20
p_vals_hacks <- replicate(n_replicate,{
  data <- data.frame(g1 = rnorm(n),
                     g2 = rnorm(n),
                     g3 = rnorm(n))
  p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  # Hack 1: Collect more data
  if(p_val >= 0.05){ # 20 additional persons per group
    data <- rbind(data,
                  data.frame(g1 = rnorm(n_additional),
                             g2 = rnorm(n_additional),
                             g3 = rnorm(n_additional)))
    p_val <- t.test(data$g1, data$g2, var.equal = TRUE)$p.value
  }
  # Hack 2: adjust hypothesis
  if(p_val >= 0.05){
    hypo <- ifelse(mean(data$g1) > mean(data$g2),
                   "greater",
                   "less")
    p_val <- t.test(data$g1, data$g2, var.equal = TRUE, 
                  alternative = hypo)$p.value
  }
  # Hack 3: compare other groups
  if(p_val >= 0.05){
    p_val <- t.test(data$g1, data$g3, var.equal = TRUE)$p.value
  }
  # Hack 4: Use another test
  if(p_val >= 0.05){ # Welch-Test
    p_val <- t.test(data$g1, data$g2, var.equal = FALSE,
                    alternative = hypo)$p.value
  }
  p_val
})

mean(p_vals_hacks < 0.05)
## [1] 0.1503
hist(p_vals_hacks, xlim = c(0, 1), breaks = 16, freq = FALSE)

2.3 Comparison of p-hacking approaches:

par(mfrow = c(6, 1), mar = c(3.1, 1.5, 1.5, 0.1), mgp = c(2, 0.7, 0))

bins <- seq(0, 1, length.out = 31)

hist(p_vals,       xlim = c(0, 1), breaks = bins, freq = FALSE, axes = FALSE, 
     xlab = "", ylab = "", main = "Preregistered experiment")
axis(1)
hist(p_vals_hack1, xlim = c(0, 1), breaks = bins, freq = FALSE, axes = FALSE, 
     xlab = "", ylab = "", main = "Hack: additional data collected after initial analysis")
axis(1)
hist(p_vals_hack2, xlim = c(0, 1), breaks = bins,  freq = FALSE, axes = FALSE, 
     xlab = "", ylab = "", main = "Hack: hypothesis adjusted after looking at the data")
axis(1)
hist(p_vals_hack3, xlim = c(0, 1), breaks = bins, freq = FALSE, axes = FALSE, 
     xlab = "", ylab = "", main = "Hack: Multiple pairwise comparisons")
axis(1)
hist(p_vals_hack4, xlim = c(0, 1), breaks = bins, freq = FALSE, axes = FALSE, 
     xlab = "P Value", ylab = "", main = "Hack: Multiple tests performed")
axis(1)
hist(p_vals_hacks, xlim = c(0, 1), breaks = bins, freq = FALSE, axes = FALSE, 
     xlab = "P Value", ylab = "", main = "All 'hacks' combined")
axis(1)
segments(0.05, -5, 0.05, 100, lty = 2, xpd = NA)
mtext("Density", 2, outer = TRUE, padj = 2)

Congratulations: you now have more significant results, but you’re going to data analysis hell for it…. Was it worth it?

… and by the way: These significant results are of course all false positives. The effect that was supposedly found here does not exist!

For a more detailed look at false positives and the researcher degrees of freedom in data analysis, see Simmons, Nelson & Simonsohn (2011): https://doi.org/10.1177/0956797611417632


Disclaimer:
The goal of the project is to demonstrate the benefits of preregistration using common errors in data analysis, referred to here as “p-hacking”. The app is not intended to be a guide to scientific misconduct. The simulations shown here are meant to alert the user to the dangers and pitfalls of data analysis and of the entire process of an empirical experiment.


Some technical ideas (optional):