A Framework for Confirmatory Hypothesis Testing

Author

Florian Wickelmaier

Published

May 12, 2026

Testing a hypothesis proceeds in three phases.

Phase I: Planning and Modeling

Set up a data-generating model that represents your hypothesis. This involves choosing a statistical model and a test, and setting all auxiliary parameters to plausible values (which requires substantive knowledge).
Be specific about H\(_1\): Set the focus parameter(s) of the model to a value just large enough to represent an effect size (deviation from H\(_0\)) that you would regret missing.
Determine the number of observations (\(n\)) by a simulation-based power analysis (Wickelmaier 2026); that is:
- repeatedly generate observations from the model
- and each time run the test on these observations
- count the proportion of significant tests (\(=\) power)
- adjust \(n\) and repeat until power is large enough
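
The four steps above can be sketched in a few lines. This is a minimal Python illustration, not the author's code (the cited collection uses R). It assumes a deliberately simple, hypothetical data-generating model: independent normal observations with known standard deviation, tested with a two-sided one-sample z-test of H\(_0\): mean \(= 0\). The function names `simulate_power` and `find_n`, the effect of 0.5, and the 80% power target are illustrative assumptions, not values from the text.

```python
import random
from statistics import NormalDist, fmean


def simulate_power(n, effect, sd=1.0, alpha=0.05, n_sim=2000, seed=1):
    """Estimate the power of a two-sided one-sample z-test by simulation.

    Assumed model (hypothetical): y_i ~ Normal(effect, sd), sd known.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    n_significant = 0
    for _ in range(n_sim):
        # Step 1: generate observations from the model under H1
        y = [rng.gauss(effect, sd) for _ in range(n)]
        # Step 2: run the test on these observations
        z = fmean(y) / (sd / n ** 0.5)
        # Step 3: count significant results
        if abs(z) > z_crit:
            n_significant += 1
    # Proportion of significant tests = estimated power
    return n_significant / n_sim


def find_n(effect, target=0.80, start=10, step=5, max_n=1000, **kwargs):
    """Step 4: increase n until the simulated power reaches the target."""
    n = start
    while n <= max_n:
        power = simulate_power(n, effect, **kwargs)
        if power >= target:
            return n, power
        n += step
    raise ValueError("target power not reached within max_n")


# Illustrative call: smallest n (on a grid of 5) for an assumed effect of 0.5
n, power = find_n(effect=0.5)
print(n, round(power, 2))
```

Because the power estimate is itself a proportion from a finite number of simulated data sets, it carries Monte Carlo error; increasing `n_sim` makes the chosen \(n\) more stable at the cost of run time.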

Outcome 1 (unlikely)

The observations are similar to what the model can generate AND the observed effect estimate is AT LEAST AS LARGE as what was deemed relevant. -> Decide for H\(_1\)

Outcome 2 (unlikely)

The observations are similar to what the model can generate AND the observed effect estimate is SMALLER than what was deemed relevant. -> Decide for H\(_0\)

Outcome 3 (likely)

The observations are dissimilar to what the model can generate. -> Don’t draw any (firm) conclusions.
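
The three outcomes amount to a simple decision rule, sketched below in Python as a hypothetical illustration. The argument `model_fits` stands in for the analyst's judgment of whether the observations are similar to what the model can generate; the framework does not specify how to make that judgment, so it is passed in rather than computed. The sketch also assumes a directional effect, where larger values mean a larger deviation from H\(_0\). The function name `decide` and the numbers in the example call are made up for illustration.

```python
def decide(model_fits: bool, estimate: float, relevant_effect: float) -> str:
    """Map the three outcomes onto a decision.

    model_fits      -- analyst's similarity judgment (assumed to be given)
    estimate        -- observed effect estimate
    relevant_effect -- effect size fixed a priori in the planning phase
    """
    if not model_fits:
        return "no firm conclusion"  # Outcome 3
    if estimate >= relevant_effect:
        return "H1"                  # Outcome 1
    return "H0"                      # Outcome 2


# Illustrative call with made-up numbers
print(decide(model_fits=True, estimate=0.62, relevant_effect=0.5))
```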

Outcome 1: Realistically sized (\(=\) small) effects have a chance of being found and reported, thanks to adequate power and the a priori relevance criterion.
Outcome 2: Because adequate power puts H\(_0\) at severe risk of falsification, a failure to reject it counts as corroboration.
Outcome 3: At least we know that we don’t know.

Because Outcome 3 is the most likely one, we will usually learn only that we lack theories specific enough to make successful predictions.
The similarity criterion (When does the generating model hold?) is vague.

The procedure is basically the Neyman-Pearson paradigm (no originality is claimed), but it emphasizes the need for a (substantively plausible) data-generating model.
It focuses on what counts as a relevant effect size, which has to be decided before the data are collected (a priori relevance).
It introduces a third (aporetic) decision option.

Wickelmaier, F. 2026. “Simulating the Power of Statistical Tests: A Collection of R Examples.” arXiv. https://doi.org/10.48550/arXiv.2110.09836.