Irreproducibility is often attributed to misconduct or questionable research practices (QRPs), but theoretical work also suggests it may stem from intrinsic properties of studied phenomena and from methodological constraints. A pre-existing corpus collected by DF was screened for articles proposing non-QRP causes of irreproducibility. Additional references were identified through citation tracking, Web of Science, and a public collection of critical metascience articles. Systematic database searches were discontinued due to low yield; 516 articles were screened in total. For included studies, we recorded hypothesised determinants, brief causal summaries, methodology (commentary/review, simulation, analytical), and relevant research fields. Distinct factors were listed separately, while overlapping arguments were represented by the most detailed source.

Factor | Brief explanation of the argument | Reference & doi

Variation in observed effect sizes

The “replication crisis” presupposes effect sizes that are fixed across time and contexts and can be divided into true and false. But when a given effect is measured in practice, “features unique to that context may mediate the average effect by adding additional mediator variance” […] “This can occur for numerous reasons, ranging from a poorly chosen statistical model to imperfect randomization, differing sample populations, environmental conditions, or flexibility in experimental design.” Their model shows that low replication rates may occur regardless of QRPs, unless the heterogeneity (between-study variance) is much smaller than the within-study variance of effect sizes and sample size is sufficiently large.

Bak-Coleman et al. (2022). Replication and reliability of science. SocArXiv. 10.31235/osf.io/rkyf7
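
A minimal simulation sketch of this argument (not the authors' own model; the effect size, heterogeneity, and sample-size values below are illustrative assumptions): with no QRPs anywhere, the probability that a replication is significant in the same direction as a significant original drops as the between-study SD of true effects approaches the within-study standard error.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def replication_rate(mu=0.3, tau=0.3, n=200, n_pairs=200_000):
    """P(replication significant with the same sign | original significant),
    when each study draws its own true effect from Normal(mu, tau) and
    estimates it from two groups of n participants (no QRPs anywhere)."""
    se = np.sqrt(2.0 / n)                                  # approx. SE of a standardized mean difference
    d_orig = rng.normal(mu, tau, n_pairs)                  # study-specific true effects
    d_rep = rng.normal(mu, tau, n_pairs)
    est_orig = rng.normal(d_orig, se)
    est_rep = rng.normal(d_rep, se)
    crit = se * stats.norm.ppf(0.975)                      # two-sided 5% cutoff
    orig_sig = np.abs(est_orig) > crit
    rep_ok = (np.abs(est_rep) > crit) & (np.sign(est_rep) == np.sign(est_orig))
    return rep_ok[orig_sig].mean()

for tau in (0.0, 0.15, 0.30):
    print(f"between-study SD tau = {tau:.2f} -> replication rate = {replication_rate(tau=tau):.2f}")
```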

Multiple trials

To increase power to detect small and medium effects, psychologists might run multiple trials or use multiple items and then aggregate the data. But such aggregation is shown to inflate the estimated effect size.

Brand et al. (2010). Exaggerated effect sizes from multiple trials. J. Gen. Psychol. 10.1080/00221309.2010.520360
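
A minimal sketch of the mechanism (illustrative parameters, not the authors' own simulation): averaging k trials per participant shrinks within-person noise, so Cohen's d computed on trial means is larger than the single-trial d even though the group mean difference is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

def cohens_d(x, y):
    pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    return (x.mean() - y.mean()) / pooled_sd

n, k = 40, 20                      # participants per group, trials per participant
person_sd, trial_sd = 1.0, 2.0     # between-person and within-person (trial) noise
true_diff = 0.5                    # group difference on person-level scores

# person-level true scores
g1 = rng.normal(true_diff, person_sd, n)
g2 = rng.normal(0.0, person_sd, n)

# trial-level data: person score + independent trial noise
t1 = g1[:, None] + rng.normal(0, trial_sd, (n, k))
t2 = g2[:, None] + rng.normal(0, trial_sd, (n, k))

d_single = cohens_d(t1[:, 0], t2[:, 0])               # effect size from one trial per person
d_mean = cohens_d(t1.mean(axis=1), t2.mean(axis=1))   # effect size from k-trial means

print(f"single-trial d = {d_single:.2f}")
print(f"aggregated d   = {d_mean:.2f}  (same mean difference, smaller denominator)")
```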

Unforeseen confounds

Replication studies may suffer from unforeseen confounding factors: these are unknown and therefore not documented in the original study, and in the replication they may be noticed but left uncorrected because of the reluctance to make post-hoc changes. The quality control that is often applied to the original study is not applied to replication studies, leading to an over-estimation of irreproducibility.

Bressan, P. (2019). Confounds in “failed” replications. Front. Psychol., 10. 10.3389/fpsyg.2019.01884

Centralized scientific community

Evidence obtained by comparing published drug-gene interaction claims with high-throughput experiments from the LINCS L1000 program suggests that centralised scientific communities of authors, who use similar methods and contribute to many articles, produce less replicable claims.

Danchev et al. (2019). Centralized communities and replicability. eLife. 10.7554/elife.43094

Between-site variation

We should understand replications as instances of resampling (of populations, effects, etc.). Differences between sites/laboratories do not generate small random effects that cancel each other out, but often important, non-randomly distributed effects. The result is an inflation of false positives, especially in between-species comparisons and with small samples.

Farrar et al. (2021). Representativeness in animal cognition research. Anim. Behav. Cogn. 10.26451/abc.08.02.14.2021

Small and non-representative samples 

We should understand replications as instances of resampling (of populations, effects, etc.). With small and non-representative samples (of experimental units, settings, treatments, and measurements), sampling variation alone can make other laboratories fail to replicate.

Farrar et al. (2021). Representativeness in animal cognition research. Anim. Behav. Cogn. 10.26451/abc.08.02.14.2021

Vaguely specified hypotheses

We should understand replications as instances of resampling (of populations, effects, etc.). When hypotheses are only vaguely specified, it is unclear which populations, settings, treatments, and measurements fall within their scope, so original and replication studies may legitimately sample different instances of the claim and reach different results.

Farrar et al. (2021). Representativeness in animal cognition research. Anim. Behav. Cogn. 10.26451/abc.08.02.14.2021

Checking assumptions prior to running a test

Statistical Conclusion Validity is obtained when data are analysed with adequate methods; at least three common practices undermine it. When model assumptions are checked or tested before deciding which test to run, the actual error rates depart from the nominal alpha level: “this is the result of more complex interactions of Type-I and Type-II error rates that do not have fixed (empirical) probabilities across the cases that end up treated one way or the other according to the outcomes of the preliminary test: The resultant Type-I and Type-II error rates of the conditional test cannot be predicted from those of the preliminary and conditioned tests.”

García-Pérez, M. A. (2012). Statistical conclusion validity: Common threats and simple remedies. Frontiers in Psychology, 3. 10.3389/fpsyg.2012.00325
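
A small illustration of one such two-stage procedure (a variance-homogeneity pretest before choosing Student's or Welch's t-test; this particular case is an illustrative assumption, not necessarily one analysed by García-Pérez): the empirical Type I error of the conditional procedure need not match the nominal 5% and cannot be read off either component test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def type1_rates(n1=10, n2=40, sd1=2.0, sd2=1.0, n_sims=20_000, alpha=0.05):
    """Empirical Type I error of (a) Student's t, (b) Welch's t, and
    (c) a two-stage procedure that runs Levene's test first and then picks
    Student or Welch depending on its outcome. H0 is true (equal means)."""
    student = welch = conditional = 0
    for _ in range(n_sims):
        x = rng.normal(0, sd1, n1)
        y = rng.normal(0, sd2, n2)
        p_student = stats.ttest_ind(x, y, equal_var=True).pvalue
        p_welch = stats.ttest_ind(x, y, equal_var=False).pvalue
        p_levene = stats.levene(x, y).pvalue
        student += p_student < alpha
        welch += p_welch < alpha
        conditional += (p_welch if p_levene < alpha else p_student) < alpha
    return student / n_sims, welch / n_sims, conditional / n_sims

s, w, c = type1_rates()
print(f"Student's t:             {s:.3f}")
print(f"Welch's t:               {w:.3f}")
print(f"Levene-then-t (2-stage): {c:.3f}  # need not match the nominal .05")
```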

Combination of small samples, high variation, and small effects

Any combination of small samples, high variation, and small effects increases the risk of Type M errors (exaggerated effect size, measured as the exaggeration ratio |estimate| / |true effect size|) and Type S errors (incorrect sign, i.e. direction, of the effect).

Gelman & Carlin (2014), Beyond power calculations, Persp. Psychol. Sci, 9. 10.1177/1745691614551642
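
A numerical sketch in the spirit of Gelman and Carlin's design analysis (normal approximation; the true effect and standard error below are illustrative assumptions):

```python
import numpy as np
from scipy import stats

def retrodesign(true_effect, se, alpha=0.05, n_sims=100_000, seed=3):
    """Power, Type S rate, and exaggeration ratio (Type M) for a study that
    estimates `true_effect` with standard error `se`, when only statistically
    significant results (two-sided alpha) are taken at face value."""
    rng = np.random.default_rng(seed)
    crit = stats.norm.ppf(1 - alpha / 2) * se
    est = rng.normal(true_effect, se, n_sims)
    sig = np.abs(est) > crit
    power = sig.mean()
    type_s = (np.sign(est[sig]) != np.sign(true_effect)).mean()   # wrong sign among significant estimates
    exaggeration = np.abs(est[sig]).mean() / abs(true_effect)     # mean |estimate| / |true effect|
    return power, type_s, exaggeration

# a small true effect measured noisily, e.g. d = 0.1 with SE = 0.17
power, type_s, exag = retrodesign(0.1, 0.17)
print(f"power = {power:.2f}, Type S = {type_s:.2f}, exaggeration ratio = {exag:.1f}")
```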

Overlooking variability and change

The perception of a reproducibility crisis stems from applying a 20th century, “hard science” logic to 21st century, social science data. We fail to distinguish statistical hypotheses from scientific (substantive) hypotheses, assume that the null is true, and assume that effects are fixed, instead of modelling their inevitable variation.

Gelman (2015), Varying treatment effects and unreplicable research, J. Management, 41. 10.1177/0149206314525208

Context-dependency of relevant vs irrelevant characteristics

Context is defined, following Cronbach, as anything that can threaten the generalizability of a finding, including: Units (sample characteristics), Treatment (characteristics of the operationalized independent variables), Outcomes (characteristics of the operationalized dependent variables), and Setting (which includes meso-level characteristics like location and interaction with participants, and macro-level characteristics like the cultural norms of the country). They propose a fairly simple structural equation model as a framework, in which setting may modulate any step of the pathways between IV and DV. To the extent that irrelevant pathways (e.g. the operationalization of the DV) are unaffected by heterogeneity, heterogeneity instead affects relevant characteristics, i.e. the underlying mechanistic effect between the latent variables.

Gollwitzer, M., & Schwabe, J. (2022). Context Dependency as a Predictor of Replicability. Review of General Psychology, 26(2), 241-249. 10.1177/10892680211015635

Environmental Effect Ratio (EER)

Repeating an experiment in a new environment is modelled as the original effect plus a series of environment-by-treatment interaction effects. The EER is the ratio between the standard deviation of the distribution of these interaction effects and the standard deviation of the experimental error term. A high EER is associated with reduced power of the replication and with increased probabilities of non-significance, or of significance in the wrong direction, with effects modulated by sample size and treatment effect size.

Higgins et al. (2020), Replicability of statistical inferences, Am. Stat.
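
A minimal sketch of this setup (illustrative parameters; the model is simplified relative to the paper): the replication's true effect is the original effect plus an interaction term whose SD equals EER times the error SD.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def replication_outcomes(effect=0.5, eer=1.0, n=50, sigma=1.0, n_sims=100_000):
    """Replicating in a new environment: the effect there equals the original
    effect plus an environment-by-treatment interaction with SD = eer * sigma,
    where sigma is the SD of the experimental error term."""
    env_effect = effect + rng.normal(0, eer * sigma, n_sims)   # effect in the new environment
    se = sigma * np.sqrt(2.0 / n)                              # SE of the two-group mean difference
    est = rng.normal(env_effect, se)
    crit = stats.norm.ppf(0.975) * se
    return np.mean(est > crit), np.mean(est < -crit)           # significant right / wrong direction

for eer in (0.0, 0.25, 0.5, 1.0):
    right, wrong = replication_outcomes(eer=eer)
    print(f"EER = {eer:.2f}: P(sig., right direction) = {right:.2f}, P(sig., wrong direction) = {wrong:.3f}")
```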

Complexity of statistical software + flexibility of choices

Complexity of software and flexibility of choices in tuning parameters can bias the output towards an inflation of statistically significant results. The field of modern statistics has had to revisit the classical hypothesis testing paradigm to accommodate modern high-throughput settings.

Holmes (2017), Statistical proof and irreproducibility, Bull. Am. Math. Soc. 10.1090/bull/1597

Boundedness of truth

In science in general, and psychology in particular, truth is temporary: time-bound and context-bound. Empirical phenomena are, to a greater or lesser extent, all bounded in time and context. Neither the original study nor the replication is the final arbiter of whether an effect exists.

Iso-Ahola (2020), Replication and scientific truth, Front. Psychol, 11. 10.3389/fpsyg.2020.02183

Heterogeneity

Heterogeneity is random variation between effects beyond what is due to sampling error. “In a replication study, heterogeneity lowers the true power of the replication, and even increases the chances to find an effect in the opposite direction.” “with heterogeneity, any given study, no matter how large its sample size, might be far away from the mean of the effect sizes. Moreover, given heterogeneity, the power of any given study is not as great as might be thought.” They therefore suggest conducting several moderately powered studies. Moreover, heterogeneity “heightens the effects of publication bias” by making extreme effects more likely. Effectively, with heterogeneity a replication study might be replicating one of several “true” effects. Therefore, relying on a single high-powered study is a flawed strategy, and so is relying on increasing standardization. “For instance, studies are conducted in different locations, with different experimenters, in different historical moments, and with different nonrandomly selected participants. All of these, and a variety of other factors, likely lead to heterogeneity. And this heterogeneity leads to concerns about the utility of any single replication study.”

Kenny, D. A. & Judd, C. M. The unappreciated heterogeneity of effect sizes: Implications for power, precision, planning of research, and replication. Psychological Methods, 2019, 24, 578–589 10.1037/met0000209

Type of data analysis

Leek and Peng distinguish six types of questions that a data analysis might answer: descriptive, exploratory, inferential, predictive, causal, and mechanistic. Confusion over the type of question asked by a study, and therefore over what result is the correct target of a replication, might drive the impression of irreproducibility. A misapplication of the analysis in the original study might also generate irreproducible results (e.g. inferential analysis incorrectly used for causal questions; exploratory used for inferential (data dredging); exploratory used for predictive (over-fitting); descriptive used for inferential (n=1 analysis)).

Leek, J. T., & Peng, R. D. (2015). What is the question? Science, 347, 1314–1315. 10.1126/science.aaa6146

 

Measurement error

Measurement error is “random variation, of some distributional form, that produces a difference between observed and true values”. Noise is commonly assumed to bias results towards the null, and in low-noise settings measurement error does indeed attenuate effect size estimates. But when noise is high, statistically significant results over-estimate the true effect because of: 1) uncontrolled researcher degrees of freedom; 2) the statistical significance filter: noise makes standard errors large and, when sample sizes are not large, only large effect size estimates reach significance even when true effects are small. In smaller-N studies some estimates can be larger than the true effect, which is then exacerbated by selection/NHST bias.

Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355, 584–585. www.jstor.org/stable/24918342
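
A minimal sketch of the significance-filter argument (illustrative parameters, not the authors' figure): with noisy measures of a weak latent correlation, large samples show the familiar attenuation, whereas in small samples the estimates that happen to reach significance exceed both the attenuated and the true correlation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def observed_vs_significant(n, rho=0.15, noise_sd=1.0, n_sims=5_000):
    """Mean observed correlation overall, and mean |r| among significant results,
    for noisy measures (reliability 0.5) of latent variables correlated at rho."""
    r = np.empty(n_sims)
    p = np.empty(n_sims)
    for i in range(n_sims):
        x = rng.standard_normal(n)
        y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)   # latent scores
        xo = x + rng.normal(0, noise_sd, n)                          # noisy measurements
        yo = y + rng.normal(0, noise_sd, n)
        r[i], p[i] = stats.pearsonr(xo, yo)
    sig = p < 0.05
    return r.mean(), np.abs(r[sig]).mean()

for n in (3000, 50):
    mean_r, mean_sig_r = observed_vs_significant(n)
    print(f"n = {n:>4}: mean observed r = {mean_r:.2f}, "
          f"mean |r| among significant results = {mean_sig_r:.2f} (true latent rho = 0.15)")
```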

 

Misclassification of outcomes

Standards for disease classification may differ between studies in “unexpected significant ways”, making results of different studies harder to compare or reproduce.

Manly (2005), Reliability of gene–disease associations, Immunogenetics. 10.1007/s00251-005-0025-x

 

Modification by genetic or environment factors

For example, “disease susceptibility may be affected by interactions between the tested gene and other genes or environmental factors that differ among study populations”, reducing the reproducibility of results.

Manly (2005), Reliability of gene–disease associations, Immunogenetics. 10.1007/s00251-005-0025-x

 

Undetected population stratification

Both disease and genetic markers are likely to be associated with undetected subgroups that vary in frequency among populations.

Manly (2005), Reliability of gene–disease associations, Immunogenetics. 10.1007/s00251-005-0025-x

 

Variation in linkage disequilibrium

The degree of linkage disequilibrium may vary among study populations.

Manly (2005), Reliability of gene–disease associations, Immunogenetics. 10.1007/s00251-005-0025-x

 

Underpowered replication studies

The sampling variability of the original finding is under-estimated. Replications are therefore likely to be under-powered, and an effect needs multiple replications to assess whether it is indeed too small to be relevant.

Maxwell et al. (2015), Replication crisis and failure to replicate, Am. Psychol. 10.1037/a0039400

Underlying mixture distribution of effect size

Effects may come from more than one distribution, within studies as well as across them. If this is not taken into account, aggregate estimates and related measures (e.g. reproducibility assessments) will be flawed. When a single true underlying effect exists, increasing power leads to more precise estimates. When more than one underlying effect exists, however, increasing power will skew estimates either towards one (false) value, or towards a false average that does not reflect the true nature of the phenomenon. Random-effects models cannot account for this, as they assume a continuous and normal distribution of errors.

Moreau & Corballis (2019), When averaging goes wrong, J. Exp. Psychol.: Gen. 10.1037/xge0000504
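
A minimal sketch of the averaging problem (illustrative numbers, not the authors' simulation): when the treatment works for only half of the participants, increasing power makes the estimate converge precisely on an average that describes nobody.

```python
import numpy as np

rng = np.random.default_rng(6)

# Two latent subpopulations: the treatment works (d = 0.8) for half of
# participants and not at all (d = 0.0) for the other half.
def study_estimate(n):
    works = rng.random(n) < 0.5
    effects = np.where(works, 0.8, 0.0)
    scores_t = rng.normal(effects, 1.0)        # treatment group
    scores_c = rng.normal(0.0, 1.0, n)         # control group
    return scores_t.mean() - scores_c.mean()

for n in (50, 500, 50_000):
    print(f"n = {n:>6}: estimated effect = {study_estimate(n):.2f} "
          "(true effects are 0.8 or 0.0; nobody has 0.4)")
```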

Unknown unknowns

Unknown unknowns are overlooked by a “reductionistic mindset that prioritizes certainty in research outcomes over the ambiguity intrinsic to biological systems”, leading to over-estimating the reproducibility of results.

Mullane & Williams (2015), Unknown unknowns and irreproducibility, Biochem. Pharmacol.

Cumulative theoretical framework

Fields that lack an overarching theoretical framework (within which competing theories and hypotheses are developed and tested) are unable to generate hypotheses across domains, and therefore empirical programs in these fields “spawn and grow from personal intuitions and culturally biased folk theories”. Hypotheses and findings generated in these fields are less likely to generalize.

Muthukrishna & Henrich (2019), A problem in theory, Nat. Hum. Behav. 10.1038/s41562-018-0522-1

Strength of link between theories and empirical tests

Studies lie on a continuum between discovery-oriented and theory-testing research. The former searches the space of hypotheses that can follow from a theory for support: the hypotheses have a low probability of being observed, but not observing them does not directly refute the theory either. Theory-testing research is the case where certain hypotheses follow strictly, so that if theory T is true, then hypothesis X must also be true. Many of the suggested causes of irreproducibility, e.g. researcher degrees of freedom, flexibility, multiplicity, etc., are symptoms of the real underlying problem, which is a loose connection between the choices made and the underlying theory.

Oberauer & Lewandowsky (2019), Theory crisis in psychology, Psychon. Bull. Rev. 10.3758/s13423-019-01645-2

Generalization bias

Generalization bias is “a cognitive tendency that operates automatically and frequently leads researchers to unintentionally generalize their results from particular samples to broader populations including when their evidence does not warrant it and when the generalization is avoidable.” The limited cognitive resources of brains require individuals to generalize beyond observations, as a heuristic; over-generalization is therefore not necessarily a bias but a cognitive necessity. However, whilst science is supposed to help avoid such over-generalizations, it actually does the opposite. One reason is that science is meant to produce explanations, and providing explanations is shown to facilitate broad generalizations, for example when drawing explanations for possible rare exceptions. In psychology this takes the form of the “law of small numbers” described by Tversky and Kahneman. This over-generalization applies to extrapolating beyond the samples' population, and is augmented because samples are often not representative of the other populations to which results are generalized (e.g. WEIRD samples) or of all the study designs they are generalized to. Socio-economic conditions (e.g. the cost of producing large and stratified samples, pressure to obtain grants, etc.) interact to increase the generalization bias. Irreproducibility follows from inevitable over-generalizations when scientists: 1) do not consider or understand the full set of variables relevant to the phenomenon; or 2) have erroneous preconceptions about the phenomenon. In either case, they overlook factors that limit the generalizability of the finding.

Peters, U., Krauss, A., & Braganza, O. (2022). Generalization bias in science. Cognitive Science, 46.

Dependency on learning

When the hypothesised mechanism depends on learning, i.e. contingent information, then replicability is hampered. “…researchers often distinguish ‘direct’ replications, in which all of methods, materials etc. are the same as in the original study, from ‘conceptual’ replications, in which the theoretical constructs tested in an original study are operationalized in different ways in a replication […]. However, given the learning requirement in priming, it follows that unless the participants in a second study have acquired the same relevant information as the participants in the first, the second study cannot be a direct replication of the first.” “This raises two questions: 1. What information is relevant to a given effect? 2. How can information be operationalized so that its equivalence in populations can be established?”

Ramscar (2016), Replicability of priming effects, Curr. Opin. Psychol. 10.1016/j.copsyc.2016.07.001

Complexity

This is an extensive essay discussing various aspects of complexity in research. For replication, in particular: “as complexity increases, the connections between constructs tend to be weaker and more numerous and variable. Moreover, complex systems often mutate as a result of encounters with the broader environment. Thus, theory suggests that as the complexity of the topics increases, the uniformity of relations and effects across context and time should be lower. In particular, findings in psychology and other social and behavioral sciences should commonly be less robust and reproducible than findings in the physical sciences.” Concerns about a reproducibility “crisis” are less prominent in other social and behavioral disciplines: “Sociologists, political scientists, organizational and management scientists, and anthropologists may be more cognizant of how the behaviors and processes they investigate vary as a function of societal conditions, groups, and cultures. In addition, they may be more apt to recognize that changes in the world can radically alter the functioning of individuals, organizations, and communities.”

Sanbonmatsu et al. (2020), Complexity in psychological science, Front. Psychol. 10.3389/fpsyg.2020.580111

Conceptual practices: rigour with which hypotheses are articulated

Non-reproducible results may come from false positives, over-estimated effects, or over-generalized effects. In all three cases, the root cause may be overly informal practices in conceptualizing hypotheses, which treat a hypothesis as the personal creation of its authors and call for justification of its plausibility without considering, for example, arguments for its falsity. These practices create confirmation biases that hamper the rigorous testing of findings.

Schaller (2016), Conceptual rigor and replicability, J. Exp. Soc. Psychol. 10.1016/j.jesp.2015.09.006

Underspecified scientific claims

“A scientific claim or finding is an inference about the natural world, expressed as an existence statement such as ‘3-month-olds prefer infant-directed speech over adult-directed speech’.” Such a scientific claim should be accompanied by a specification of its connection to the empirical evidence, e.g. the sampling strategy, measurement instruments, circumstances of data collection, analytical decisions, and auxiliary assumptions. The absence of such specification hampers attempts to define what needs replication, let alone the methods to do so.

Scheel, A. M. (2022). Why most psychological research findings are not even wrong. Infant and Child Development, 31. 10.1002/icd.2295

Comparability and strength of manipulations

In replications of tests of theoretical hypotheses, the comparability and strength of the experimental manipulations might change, especially for context-sensitive phenomena. The evaluation of what constitutes a comparable intervention is rarely checked and is instead based on intuition: what qualifies as sufficiently similar is theoretically underspecified. This differs from fields such as epidemiology, where the hypotheses tested are empirical.

Schwarz & Clore (2016), More than attention to the N, Psychol. Sci. 10.1177/0956797616653102

Comparability of measurement procedures

Measurement procedures might differ in unwanted ways between the original and the replication, making a difference to results. For example, the order of presentation of questions, and whether these were preceded by other questions or interventions, makes a difference: “…identical manipulations result in smaller effects when the item of interest is preceded by other items that broaden the range of accessible inputs relevant to the judgment.”

Schwarz & Clore (2016), More than attention to the N, Psychol. Sci. 10.1177/0956797616653102

Theoretical vs empirical hypotheses

Whether the hypotheses tested are empirical, as in clinical medicine, or theoretical, as in psychology, is a critical field-level moderator, because it affects the comparability and strength of manipulations and measurement procedures. For theoretical hypotheses, the questions asked pertain to some underlying theoretical explanation of behaviour, and the study's empirical tests use constructs that reflect the theoretically relevant concepts.

Schwarz & Clore (2016), More than attention to the N, Psychol. Sci. 10.1177/0956797616653102

Measurement reliability

Spearman’s attenuation formula predicts how much lower than the true correlation an observed correlation will be, if measured with a given level of score reliability (the proportion of variance in observed scores that is due to true scores and not random error). This formula predicts the expected value of the observation, but researchers often under-estimate the range of possible correlations actually observed. By simulation, the authors show that this factor alone generates a broad range of possible observed results, even with a large sample size and relatively high reliability. In real settings, the true underlying correlation is not known; reversing the logic above, a given published correlation is compatible with a broad range of underlying true correlations. This makes the usual expectations about what correlation ought to be observed in a replication unrealistic, even before other sources of error (e.g. sampling error from using a new sample) are considered.

Stanley & Spence (2014), Expectations for replications, Persp. Psychol. Sci. 10.1177/1745691614528518
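
A short simulation of this point (illustrative values: true correlation 0.4, reliability 0.8 for both measures, n = 100): Spearman's formula gives the expected attenuated correlation, but individual observed correlations scatter widely around it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def observed_correlations(true_r=0.4, reliability=0.8, n=100, n_sims=20_000):
    """Observed correlations between two measures with the given score reliability.
    Spearman's formula gives the expected attenuated value (true_r * reliability
    for equal reliabilities), but sampling variation spreads observed r widely."""
    lam = np.sqrt(reliability)                 # loading of observed score on true score
    err = np.sqrt(1 - reliability)
    r = np.empty(n_sims)
    for i in range(n_sims):
        tx = rng.standard_normal(n)
        ty = true_r * tx + np.sqrt(1 - true_r**2) * rng.standard_normal(n)
        x = lam * tx + err * rng.standard_normal(n)
        y = lam * ty + err * rng.standard_normal(n)
        r[i] = stats.pearsonr(x, y)[0]
    return r

r = observed_correlations()
print(f"expected attenuated correlation: {0.4 * 0.8:.2f}")
print(f"mean observed r: {r.mean():.2f}; 95% of observed r fall in "
      f"[{np.quantile(r, 0.025):.2f}, {np.quantile(r, 0.975):.2f}]")
```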

Exploratory data analysis

Fields, like animal behaviour, that are forced to work with small samples may also be forced to engage in more exploratory investigations, which may unwittingly increase the rate of false discoveries. “Having few individuals may also drive researchers to extract as much data as possible from subjects”, resulting in a high risk of unintentional fishing expeditions.

Stevens (2017), Replicability and reproducibility, Front. Psychol. 10.3389/fpsyg.2017.00862

Repeated testing

Some animal behaviour fields can rely on long-lived species, on which many tests are performed, making them unrepresentative. “…for many labs working with long-lived species, such as parrots, corvids, and primates, researchers test the same individuals repeatedly. Repeated testing can result in previous experimental histories influencing behavioral performance, which can impact the ability of other researchers to replicate results based on these Individuals.”

Stevens (2017), Replicability and reproducibility, Front. Psychol. 10.3389/fpsyg.2017.00862

Small sample sizes

Fields where subjects are expensive or difficult to maintain are forced to use small samples. “…for comparative psychologists, large sample sizes can prove more challenging to acquire due to low numbers of individual animals in captivity, regulations limiting research, and the expense of maintaining colonies of animals.”

Stevens (2017), Replicability and reproducibility, Front. Psychol. 10.3389/fpsyg.2017.00862

Species coverage

In fields like animal behaviour, where the subject matter is highly diversified, this diversity impedes replication. “The increase in species studied is clearly advantageous to the field because it expands the scope of our understanding across a wide range of taxa. But it also has the disadvantage of reducing the depth of coverage for each species, and depth is required for replication.” Moreover, “Some of the species tested are quite rare, which can limit access to them.”

Stevens (2017), Replicability and reproducibility, Front. Psychol. 10.3389/fpsyg.2017.00862

Substituting species

In fields like animal behaviour, researchers may use a closely related species, assuming it behaves the same. But “Even within a species, strains of mice and rats, for instance, vary greatly in their behavior. Thus, for comparative psychology, a direct replication requires testing the same species and/or strain to match the original study as closely as possible.”

Stevens (2017), Replicability and reproducibility, Front. Psychol. 10.3389/fpsyg.2017.00862

Illusion of exact replication

Exact replication is especially impossible in basic (theory-testing) research. Applied research can focus on the repeatability and reproducibility of a specific phenomenon (e.g. a clinical trial), but in basic research empirical outcomes are meaningful only relative to the theory being tested. Theories are sets of abstract constructs and relationships between those constructs; they are tested by operationalizing the constructs into variables to measure or manipulate. Repeating the same operationalization at a different point in space or time may not reflect the same theoretical construct. Non-replications may be understood as interaction effects.

Stroebe & Strack (2014), Illusion of exact replication, Persp. Psychol. Sci. 10.1177/1745691613514450

NHST misuse

“NHST can be used when very precise quantitative theoretical predictions can be tested, hence, both power and effect size can be estimated well as intended by Neyman and Pearson (1933). On the other hand, when theoretical predictions are not precise, reasonably powered NHST tests may be used as an initial heuristic look at the data as Fisher (1925) intended. However, in these cases (when well-justified theoretical predictions are lacking) if studies are not pre-registered (see below) NHST tests can only be considered preliminary (exploratory) heuristics. Hence, their findings should only be taken seriously if they are replicated, optimally within the same paper (Nosek et al., 2013). These replications must be well powered to keep FRP low. As discussed, NHST can only reject H0 and can accept neither a generic or specific H1. So, on its own NHST cannot provide evidence “for” something even if findings are replicated.”

Szucs & Ioannidis (2017), NHST unsuitable for research, Front. Hum. Neurosci. 10.3389/fnhum.2017.00390

Auxiliary assumptions

A theory is always tested with auxiliary assumptions, and a failure to replicate is as or more likely to be due to failure of the latter, rather than failures of the theory itself.

Trafimow & Earp (2016), Badly specified theories and replication crisis, Theor Psych. 10.1177/0959354316637136

Base rate of true effects

The field-specific base rate of true effects has a disproportionate influence on the false-discovery rate and similar measures, relative to p-hacking practices including: selectively reporting significant results, failing to report all dependent measures, data peeking, and selectively removing outliers.

Ulrich & Miller (2020), Questionable research practices and replicability, eLife. 10.7554/eLife.58237
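
The standard false-discovery arithmetic behind this point (not the authors' full simulation; the alpha and power values are illustrative assumptions):

```python
def false_discovery_rate(base_rate, power, alpha=0.05):
    """Proportion of significant results that are false positives, given the
    field-specific base rate of true effects, average power, and alpha."""
    true_pos = base_rate * power
    false_pos = (1 - base_rate) * alpha
    return false_pos / (true_pos + false_pos)

for base_rate in (0.5, 0.2, 0.05):
    fdr = false_discovery_rate(base_rate, power=0.8)
    print(f"base rate of true effects = {base_rate:.2f} -> FDR = {fdr:.2f}")
```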

High base-rate of false hypotheses

“If most of the hypotheses under test are false, then there will be many false hypotheses that are apparently supported by the outcomes of well conducted experiments and null-hypothesis significance tests with a Type-I error rate (α) of 5%. Failure to recognize this is to commit the fallacy of ignoring the base rate.”

Bird, A. Understanding the replication crisis as a base rate fallacy. Br. J. Philos. Sci.

Biological variation

Reaction norms (genotype-by-environment interactions in generating a phenotype) are unlikely to be flat in most biological organisms, contrary to what a simplistic, standardization-based view of reproducibility assumes.

Voelkl et al. (2020), Reproducibility of animal research, Nat. Rev. Neurosci. 10.1038/s41583-020-0313-3

Regression to the mean

Unless the original study had 100% power, results passing through a P < 0.05 filter are necessarily somewhat inflated, and a decline in the replication effect size is expected. Instead of assuming that effects in the literature come from a binary population of true/false effects, we should assume that they are continuous and probably (they suggest) exponentially distributed; the exponential is the maximum-entropy distribution for a variable that ranges from 0 to infinity and whose only known parameter is the mean. Their simulation estimates that 70% of OSF replication effect sizes resulted from regression to the mean.

Wilson et al. (2020), Science is not a signal detection problem, PNAS. 10.1073/pnas.1914237117
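
A minimal sketch of the argument (illustrative parameters: exponentially distributed true effects with mean 0.2 and two groups of 50 per study; not the authors' exact simulation): originals selected through the significance filter over-state their own true effects, so exact replications decline on average even with no QRPs anywhere.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

n, n_studies, mean_effect = 50, 200_000, 0.2
se = np.sqrt(2 / n)                                  # SE of a standardized mean difference

true_d = rng.exponential(mean_effect, n_studies)     # continuous, exponentially distributed true effects
est_orig = rng.normal(true_d, se)
sig = est_orig > stats.norm.ppf(0.975) * se          # original passes the p < .05 filter

est_rep = rng.normal(true_d[sig], se)                # exact replication of each selected study
print(f"mean original estimate (significant only): {est_orig[sig].mean():.2f}")
print(f"mean true effect of those studies:         {true_d[sig].mean():.2f}")
print(f"mean replication estimate:                 {est_rep.mean():.2f}  # declines toward the true value")
```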

Nuisance factors – falsity of null hypothesis

Nuisance factors are small, uninteresting influences that make the null hypothesis always false, so that, with enough power, some difference/correlation is always observable. In theory-focused research, where the objective is to assess the truth status of a theory, observed small effect sizes may arise from nuisance factors alone. Therefore, increasing statistical power will increase “statistical true positives” at the expense of an increase in “theoretical false positives”. The replication crisis is focused on “measurement research”, the measuring of effect sizes. In the OSC study, the average Cohen's d for non-significant results was d = 0.141, suggesting that with higher power all effects would have been statistically significant in the same direction as the original. However, the smaller the underlying effect size, the more likely the result is to be a “theoretical false positive”.

Wilson et al. (2022), Theoretical false positives, Psychon. Bull. Rev. 10.3758/s13423-022-02098-w

Alignment of verbal and statistical expressions of hypothesis

Whilst inter-subject variability is routinely modelled as a random effect, multiple other components of a study (e.g. stimuli, tasks, research sites) are not, and this leads to a failure to correctly operationalize the generalization intentions behind a verbal hypothesis. “in any research area where one expects the aggregate contribution of the missing s²_u terms to be large – that is, anywhere that “contextual sensitivity” (Van Bavel, Mende-Siedlecki, Brady, & Reinero, 2016) is high – the inferential statistics generated from models like (2) will often underestimate the true uncertainty surrounding the parameter estimates to such a degree as to make an outright mockery of the effort to learn something from the data using conventional inferential tests.”

Yarkoni (2022), The generalizability crisis, Behav. Brain Sci. 10.1017/s0140525x20001685
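
A minimal sketch of the mechanism (illustrative parameters; the "conventional analysis" here is simply a mean with an SD/sqrt(N) interval, standing in for any model that omits the by-stimulus variance component): with these assumed values the empirical coverage of the nominal 95% interval falls well below 95%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

def ci_coverage(beta=0.3, tau=0.3, sigma=1.0, k=8, m=40, n_exps=20_000):
    """Each experiment uses k stimuli; the true effect of stimulus j is
    beta + u_j with u_j ~ N(0, tau), and each stimulus is measured on m
    difference scores with residual SD sigma. The conventional analysis
    averages all k*m observations and uses SD/sqrt(N) as the standard error,
    ignoring the by-stimulus variance component."""
    covered = 0
    for _ in range(n_exps):
        stim_effects = beta + rng.normal(0, tau, k)
        obs = stim_effects[:, None] + rng.normal(0, sigma, (k, m))
        mean, se_naive = obs.mean(), obs.std(ddof=1) / np.sqrt(obs.size)
        half = stats.t.ppf(0.975, obs.size - 1) * se_naive
        covered += (mean - half <= beta <= mean + half)
    return covered / n_exps

print(f"coverage of the nominal 95% CI: {ci_coverage():.2f}")
```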