A Framework for Confirmatory Hypothesis Testing

Author

Florian Wickelmaier

Published

May 12, 2026

Testing a hypothesis proceeds in three phases.

Phase I: Planning and Modeling

  • Set up a data-generating model that represents your hypothesis. This involves choosing a statistical model and a test, and setting all auxiliary parameters to plausible values, which requires substantive knowledge.

  • Be specific about H\(_1\): Set the focus parameter(s) of the model to a value just large enough to represent an effect size (deviation from H\(_0\)) that you would regret missing.

  • Determine the number of observations (\(n\)) by a simulation-based power analysis (Wickelmaier 2026); that is:

    • repeatedly generate observations from the model
    • and each time run the test on these observations
    • compute the proportion of significant tests (\(=\) estimated power)
    • adjust \(n\) and repeat until the estimated power reaches your target level
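
The simulation loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of the cited reference (which uses R): the data-generating model is taken to be two normal groups with a known common standard deviation, the test is a two-sided two-sample z-test, and the names `simulate_power` and `find_n` as well as all default values (`delta`, `sigma`, `alpha`, `target`) are hypothetical choices made for this example.

```python
import random
from statistics import NormalDist, fmean


def simulate_power(n, delta=0.5, sigma=1.0, alpha=0.05, n_sim=2000, seed=1):
    """Estimate the power of a two-sided two-sample z-test by simulation.

    Assumed model: group x ~ N(0, sigma), group y ~ N(delta, sigma),
    n observations per group, sigma known.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    se = sigma * (2 / n) ** 0.5                   # SE of the mean difference
    significant = 0
    for _ in range(n_sim):
        # Generate observations from the model ...
        x = [rng.gauss(0.0, sigma) for _ in range(n)]
        y = [rng.gauss(delta, sigma) for _ in range(n)]
        # ... and run the test on them.
        z = (fmean(y) - fmean(x)) / se
        if abs(z) > z_crit:
            significant += 1
    # Proportion of significant tests = estimated power.
    return significant / n_sim


def find_n(target=0.8, n_start=10, step=5, n_max=1000, **model_args):
    """Increase n until the estimated power reaches the target."""
    n = n_start
    while n <= n_max:
        power = simulate_power(n, **model_args)
        if power >= target:
            return n, power
        n += step
    raise ValueError("target power not reached within n_max")


if __name__ == "__main__":
    n, power = find_n(target=0.8, delta=0.5, sigma=1.0)
    print(f"n per group: {n}, estimated power: {power:.3f}")
```

Because the power estimate is itself a proportion from `n_sim` simulated data sets, it carries Monte Carlo error; increasing `n_sim` or using a finer `step` near the target gives a more precise answer at the cost of run time.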

Phase II: Collecting Data

  • Collect the \(n\) observations.
  • Stick to the plan as much as you can.

Phase III: Inferring and Criticizing

  • Outcome 1 (unlikely)

    The observations are similar to what the model can generate AND the observed effect estimate is AT LEAST AS LARGE as what was deemed relevant. -> Decide for H\(_1\)

  • Outcome 2 (unlikely)

    The observations are similar to what the model can generate AND the observed effect estimate is SMALLER than what was deemed relevant. -> Decide for H\(_0\)

  • Outcome 3 (likely)

    The observations are dissimilar to what the model can generate. -> Don’t draw any (firm) conclusions.
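
The three outcomes amount to a simple decision rule, sketched below. The function name `decide` and its arguments are hypothetical. The model check is passed in as a ready-made boolean because the text does not specify how similarity is assessed, and the sketch assumes the relevant effect is defined in the positive direction.

```python
def decide(model_adequate, estimate, relevant_effect):
    """Map the three Phase III outcomes to a decision.

    model_adequate:  result of a separate model check (assumed input;
                     how to assess similarity is left open here)
    estimate:        observed effect estimate
    relevant_effect: effect size fixed in Phase I, before data collection
    """
    if not model_adequate:
        return "no firm conclusion"  # Outcome 3
    if estimate >= relevant_effect:
        return "decide for H1"       # Outcome 1
    return "decide for H0"           # Outcome 2


if __name__ == "__main__":
    print(decide(True, 0.6, 0.5))
    print(decide(True, 0.2, 0.5))
    print(decide(False, 0.6, 0.5))
```

Checking the model first reflects the ordering in the outcomes: the comparison of the estimate with the relevant effect size is only interpreted when the observations look like something the model can generate.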

Benefits

  • Outcome 1: Realistically sized (\(=\) small) effects have a chance to be found and reported, because the test has sufficient power and the relevance criterion was set a priori.

  • Outcome 2: Because high power puts H\(_0\) at a severe risk of falsification, failing to reject it corroborates H\(_0\).

  • Outcome 3: At least we know that we don’t know.

Drawbacks

  • Because Outcome 3 is the most likely one, we will usually learn only that we lack theories specific enough to make successful predictions.

  • The similarity criterion (When does the generating model hold?) is vague.

Remarks

  • The procedure is basically the Neyman-Pearson paradigm (no originality is claimed), but it emphasizes the need for a (substantively plausible) data-generating model.

  • It focuses on what is a relevant effect size. This has to be decided before the data are collected (prior relevance).

  • It introduces a third (aporetic) decision option.

References

Wickelmaier, F. 2026. “Simulating the Power of Statistical Tests: A Collection of R Examples.” arXiv. https://doi.org/10.48550/arXiv.2110.09836.