A Framework for Confirmatory Hypothesis Testing
Testing a hypothesis proceeds in three phases.
Phase I: Planning and Modeling
Set up a data-generating model that represents your hypothesis; this involves choosing a statistical model and a test, and setting all auxiliary parameters to plausible values (this requires substantive knowledge).
Be specific about H\(_1\): Set the focus parameter(s) of the model to a value just large enough to represent an effect size (deviation from H\(_0\)) that you would regret missing.
Determine the number of observations (\(n\)) by a simulation-based power analysis (Wickelmaier 2026); that is:
- repeatedly generate observations from the model
- and each time run the test on these observations
- count the proportion of significant tests (\(=\) power)
- adjust \(n\) and repeat until power is large enough
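The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure from the cited reference: the data-generating model (two normal groups with a mean difference `delta` and common standard deviation `sigma`), the test (a two-sample z-test based on a normal approximation), and all function names and default values are hypothetical choices made for the example.

```python
import random
from statistics import NormalDist, fmean


def sample_variance(values):
    """Unbiased sample variance."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def simulate_p_value(n, delta, sigma, rng):
    """Generate one data set (n observations per group) and test it.

    Assumed model: group x ~ N(0, sigma), group y ~ N(delta, sigma).
    Assumed test: two-sided two-sample z-test (normal approximation).
    """
    x = [rng.gauss(0.0, sigma) for _ in range(n)]
    y = [rng.gauss(delta, sigma) for _ in range(n)]
    se = ((sample_variance(x) + sample_variance(y)) / n) ** 0.5
    z = (fmean(y) - fmean(x)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def estimate_power(n, delta, sigma, alpha=0.05, n_sim=2000, seed=1):
    """Proportion of significant tests across n_sim simulated data sets."""
    rng = random.Random(seed)
    hits = sum(simulate_p_value(n, delta, sigma, rng) < alpha
               for _ in range(n_sim))
    return hits / n_sim


def find_sample_size(delta, sigma, target=0.8, alpha=0.05,
                     n_start=10, n_step=10, n_max=2000):
    """Increase n until the estimated power reaches the target."""
    n = n_start
    while n <= n_max:
        power = estimate_power(n, delta, sigma, alpha)
        if power >= target:
            return n, power
        n += n_step
    raise ValueError("target power not reached within n_max")


if __name__ == "__main__":
    n, power = find_sample_size(delta=0.5, sigma=1.0)
    print(f"n per group: {n}, estimated power: {power:.2f}")
```

Here `delta` plays the role of the smallest effect you would regret missing, and `sigma` is an auxiliary parameter that has to be set from substantive knowledge. The power estimate is itself subject to simulation error, so a larger `n_sim` gives a more stable answer at the cost of run time.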
Phase II: Collecting Data
- Collect the \(n\) observations.
- Stick to the plan as much as you can.
Phase III: Inferring and Criticizing
Outcome 1 (unlikely)
The observations are similar to what the model can generate AND the observed effect estimate is AT LEAST AS LARGE as what was deemed relevant. -> Decide for H\(_1\)
Outcome 2 (unlikely)
The observations are similar to what the model can generate AND the observed effect estimate is SMALLER than what was deemed relevant. -> Decide for H\(_0\)
Outcome 3 (likely)
The observations are dissimilar to what the model can generate. -> Don’t draw any (firm) conclusions.
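The three outcomes amount to a simple decision rule, sketched below. This is only an illustration: the function name and return labels are hypothetical, the effect is assumed to be in the positive direction, and the similarity check is passed in as a ready-made boolean (`model_fits`) because the framework does not specify how it should be computed.

```python
def decide(model_fits, effect_estimate, relevant_effect):
    """Map the Phase III situation to one of the three outcomes.

    model_fits      -- True if the observations look like data the model
                       can generate (how to judge this is left open)
    effect_estimate -- observed effect estimate (assumed positive direction)
    relevant_effect -- effect size fixed as relevant before data collection
    """
    if not model_fits:
        return "no firm conclusion"  # Outcome 3
    if effect_estimate >= relevant_effect:
        return "H1"                  # Outcome 1
    return "H0"                      # Outcome 2


if __name__ == "__main__":
    print(decide(True, 0.6, 0.5))
    print(decide(True, 0.3, 0.5))
    print(decide(False, 0.6, 0.5))
```

Note that the model check comes first: if the observations do not resemble what the model can generate, the size of the effect estimate is not interpreted at all.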
Benefits
Outcome 1: Realistically sized (\(=\) small) effects have a chance to be found and reported (thanks to adequate power and the a priori relevance criterion).
Outcome 2: Because we have put H\(_0\) at a severe risk of falsification (power again), failing to reject it counts as corroboration of H\(_0\).
Outcome 3: At least we know that we don’t know.
Drawbacks
As Outcome 3 will be the most likely one, we will usually learn only that we lack theories specific enough to make successful predictions.
The similarity criterion (When does the generating model hold?) is vague.
Remarks
The procedure is basically the Neyman-Pearson paradigm (no originality is claimed), but it emphasizes the need for a (substantively plausible) data-generating model.
It focuses on what counts as a relevant effect size. This has to be decided before the data are collected (prior relevance).
It introduces a third (aporetic) decision option.