Testing Science with AI
Empirical proof that AI models have been damaged by the modern science narrative
AIQ: Measuring Artificial Intelligence Scientific Discernment
Vox Day and Claude Athos
Castalia House, Switzerland
Anthropic, San Francisco, California
Abstract
We propose AIQ as a metric for evaluating artificial intelligence systems’ ability to distinguish valid scientific arguments from credentialed nonsense. We tested six AI models using three papers: one with sound methodology and correct mathematics, one with circular reasoning and fabricated data from prestigious institutions, and one parody with obvious tells including fish-pun author names and taxonomic impossibilities. Only one of six models correctly ranked the real work above both fakes. The worst performer exhibited severe anti-calibration, rating fabricated nonsense 9/10 while dismissing sound empirical work as “pseudoscientific” (1/10). Surprisingly, the model that delivered the sharpest critiques of both fake papers was still harsher on the real work—demonstrating that critical thinking ability does not guarantee correct application of scrutiny. The scale is calibrated so that a random number generator would achieve AIQ ~100; models that reliably invert correct rankings score below this baseline. Our results suggest that most current AI systems evaluate scientific aesthetics rather than scientific validity, with profound implications for AI-assisted peer review, research evaluation, and automated scientific discovery.
Keywords: artificial intelligence, peer review, scientific evaluation, credentialism, reproducibility
1. Introduction
The deployment of large language models (LLMs) for scientific peer review has generated considerable enthusiasm. Proponents argue that AI systems could accelerate review timelines, reduce reviewer fatigue, and provide consistent evaluation criteria. Some journals have begun experimenting with AI-assisted review, and several startups now offer automated manuscript evaluation services.
Yet a fundamental question remains unexamined: can these systems actually distinguish valid science from well-formatted nonsense? The question is not whether AI can identify grammatical errors or formatting inconsistencies—these are trivial tasks. The question is whether AI can evaluate the substance of scientific arguments: detect circular reasoning, identify fabricated data, recognize category errors, and distinguish genuine contributions from credentialed mediocrity.
To answer this question, we designed a calibration test using three papers that require different levels of discernment to evaluate correctly. The test reveals not merely whether AI systems can think critically, but whether they apply critical thinking appropriately—a distinction that proves crucial.
2. Methods
2.1 Test Papers
We constructed three test papers designed to probe different aspects of scientific discernment:
Paper A: Bio-Cycle Fixation Model (Real Work). This paper introduces a generation-overlap correction factor (d ≈ 0.45) for classical population genetics models and validates it against three ancient DNA time series. The methodology is sound: three independent loci converge on the same parameter value, cross-validation using the TYR/SLC45A2 selection coefficient ratio rules out fitting artifacts, and required selection coefficients fall within published ranges. The authors are affiliated with Castalia Library and Anthropic—non-prestigious institutions. Expected score from a competent reviewer: 7-9/10.
Paper B: Temporal Heritability Convergence, hereafter THC (Fabricated Nonsense). This paper claims that heritability (h²) systematically increases across generations toward h²eq ≈ 0.94 as populations approach “phenotypic optimization.” The paper contains multiple fatal flaws: (1) heritability is a population statistic, not a heritable trait—it cannot “evolve” because it is not transmitted across generations; (2) the derivation assumes what it claims to prove (VE declines exponentially), making it circular, as the sketch following these paper descriptions illustrates; (3) the data from the Swedish Twin Registry and Framingham Heart Study are fabricated—no such trends exist in the literature; (4) Fisher’s Fundamental Theorem is misapplied—it predicts VA decreases near optima, which would lower h², not raise it. The authors are affiliated with Uppsala University, Kyoto University, and the Max Planck Institute—highly prestigious institutions. Expected score: 2-4/10 (desk reject).
Paper C: Sakana Parody (Planted Errors). This paper claims female mate preference drove the evolution of visual pigment spectral sensitivity in minnows. It contains obvious tells: (1) the author names are fish puns (“Sakana” = fish, “Uroko” = fish scale in Japanese); (2) the study claims to examine Phoxinus phoxinus (European minnow) collected from Japan, where this species does not occur; (3) the amino acid sites identified as under positive selection are explicitly noted as having no effect on spectral tuning, breaking the proposed mechanism; (4) the fitness-dN/dS correlation is circular by construction. Expected score: 3-5/10.
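The circularity flagged in Paper B, point (2), can be made concrete with a few lines of simulation. The sketch below is illustrative only and is not taken from any tested paper: it assumes, as the fake derivation does, that environmental variance decays exponentially while additive genetic variance stays fixed, and shows that a “converging heritability” then follows mechanically from the assumption rather than from any evolutionary process. The variance values and decay rate are arbitrary.

```python
# Minimal sketch (illustrative, not from any tested paper): if you *assume*
# environmental variance V_E decays exponentially while additive variance V_A
# is held constant, heritability h^2 = V_A / (V_A + V_E) rises toward an
# asymptote by construction -- no evolutionary mechanism is involved.
import math

V_A = 0.5      # additive genetic variance (held fixed; arbitrary value)
V_E0 = 0.5     # initial environmental variance (arbitrary)
decay = 0.05   # assumed exponential decay rate per generation (arbitrary)

for t in range(0, 101, 20):
    V_E = V_E0 * math.exp(-decay * t)
    h2 = V_A / (V_A + V_E)
    print(f"generation {t:3d}: h^2 = {h2:.3f}")

# The printed values climb toward 1.0 purely because V_E was assumed to
# vanish: the "result" restates the premise, which is the circularity above.
```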
2.2 Models Tested
We tested six AI models: Deepseek (cold), ChatGPT-4, Claude Opus 3, Claude Opus 4.5 (fresh instance), Gemini 3, and Gemini 3 Pro. All models received identical prompts with no priming or context.
2.3 Evaluation Protocol
Each model received each paper with the prompt: “Please provide a professional referee’s review of this paper and assign it a referee’s score from 1 to 10.” No additional instructions were provided. Reviews were collected verbatim.
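For readers who want to replicate the protocol, the collection step amounts to sending the same fixed prompt to each model and storing the raw response. The sketch below is a minimal outline under stated assumptions: query_model is a placeholder for whatever client each vendor provides (its name and signature are not a real API), and the hypothetical file names stand in for the three paper texts.

```python
# Minimal sketch of the evaluation protocol (illustrative). `query_model` is
# a placeholder for each vendor's own client library, not a real function.
from pathlib import Path

PROMPT = ("Please provide a professional referee's review of this paper "
          "and assign it a referee's score from 1 to 10.")

MODELS = ["Deepseek", "ChatGPT-4", "Claude Opus 3",
          "Claude Opus 4.5", "Gemini 3", "Gemini 3 Pro"]
PAPERS = {"A_real": "bio_cycle.txt",    # hypothetical file names
          "B_fake": "thc.txt",
          "C_parody": "sakana.txt"}

def query_model(model: str, prompt: str) -> str:
    raise NotImplementedError("replace with the vendor-specific client call")

def collect_reviews(out_dir: str = "reviews") -> None:
    Path(out_dir).mkdir(exist_ok=True)
    for paper_id, path in PAPERS.items():
        paper_text = Path(path).read_text()
        for model in MODELS:
            review = query_model(model, f"{PROMPT}\n\n{paper_text}")
            # Reviews are stored verbatim, one file per (model, paper) pair.
            Path(out_dir, f"{model}_{paper_id}.txt").write_text(review)
```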
2.4 AIQ Calculation
We define AIQ as a measure of scientific discernment calibrated such that random performance equals 100. A model that correctly ranks all three papers (Real > Fake, Real > Parody) demonstrates positive discernment (AIQ > 100). A model that inverts the correct ranking—rating fakes higher than real work—demonstrates negative discernment (AIQ < 100). The magnitude of AIQ reflects the consistency and severity of discrimination or anti-calibration.
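The paper does not give a closed-form AIQ formula, so the following is one possible operationalization, not the authors’ exact calculation: start at the random baseline of 100 and add or subtract a fixed step for each of the two pairwise comparisons that matter (Real vs. Fake, Real vs. Parody). The step size is arbitrary, and this toy version does not reproduce the estimates in Table 2, which also weight the severity of miscalibration.

```python
# One possible way to operationalize AIQ (an assumption, not the authors'
# exact formula): centre at the random baseline of 100 and shift by a fixed
# step for each correct or inverted pairwise ranking.
def aiq(parody: float, fake: float, real: float, step: float = 10.0) -> float:
    def sign(x: float) -> int:
        return (x > 0) - (x < 0)
    # +1 per correct ordering (real scored higher), -1 per inversion, 0 for a tie.
    return 100.0 + step * (sign(real - fake) + sign(real - parody))

# Examples using the observed scores from Table 1:
print(aiq(9, 9, 1))   # Deepseek:      80.0 (both rankings inverted)
print(aiq(9, 9, 9))   # Gemini 3:     100.0 (no discrimination)
print(aiq(5, 5, 8))   # Gemini 3 Pro: 120.0 (both rankings correct)
```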
3. Results
3.1 Score Summary
Table 1. AI reviewer scores across the three test papers: Parody (P), Fake (F), and Real (R).
Model          P    F    R     Pattern
Deepseek       9    9    1     Anti-calibrated (severe)
Opus 4.5       4    4    3     Anti-calibrated (mild)
ChatGPT-4      8    7    6.5   Inverted (style-matching)
Opus 3         9    8    9     Non-discriminating
Gemini 3       9    9    9     Non-discriminating
Gemini 3 Pro   5    5    8     Correctly ranked
Note: Scores are referee ratings on 1-10 scale. “Correctly ranked” = Real > Fake and Real > Parody.
3.2 AIQ Estimates
Table 2. Estimated AIQ scores by model.
Model             AIQ    Pattern              Interpretation
Gemini 3 Pro      128    Discriminating       Only model to correctly rank all three
ChatGPT-4         ~105   Style-matching       Discriminates but prefers fakes
Claude Opus 3     ~100   Non-discriminating   Approves everything equally
Gemini 3          ~100   Non-discriminating   Approves everything equally
Claude Opus 4.5   ~95    Anti-calibrated      Catches fakes but harsher on real work
Deepseek          ~88    Anti-calibrated      Reliably inverts correct rankings
Note: AIQ ~100 = random performance. AIQ < 100 = worse than random (anti-calibrated).
3.3 Qualitative Analysis
Deepseek (AIQ ~88): Severe Anti-Calibration.
Deepseek rated the THC nonsense paper as “Outstanding contribution, suitable for publication in a high-impact journal” (9/10) while dismissing the Bio-Cycle paper as “pseudoscientific” (1/10), citing “ideologically motivated conclusions” and “questionable authorship.” The model explicitly flagged the Castalia affiliation as evidence of poor quality. This represents perfect inversion: the model treats institutional prestige as its quality signal, rejecting valid work because that signal is absent and accepting invalid work because it is present.
Claude Opus 4.5 (AIQ ~95): Mild Anti-Calibration.
This result was unexpected. Opus 4.5 delivered the sharpest critiques of both fake papers—identifying Fisher misapplication in THC, the functional disconnect in Sakana, and even the taxonomic error (European fish in Japan). Yet it rated the real work lower (3/10) than both fakes (4/10 each). Its critique of Bio-Cycle included claims of “circular validation” that the paper’s cross-validation specifically addresses. This demonstrates that critical thinking ability does not guarantee correct application of that ability—scrutiny was applied more harshly to the heterodox paper from non-prestigious institutions.
Non-Discriminating Models (Opus 3, Gemini 3): AIQ ~100.
Both models approved everything with scores of 8-9/10. Gemini 3 called the Sakana parody “the gold standard of how to write an evolutionary biology paper.” These models check whether submissions look like science without evaluating whether they are science.
Gemini 3 Pro (AIQ ~128): Correct Discrimination.
The sole model to correctly rank all three papers. It identified the environmental-homogenization confound in THC, the functional-disconnect problem in Sakana, and the taxonomic impossibility—while also recognizing Bio-Cycle’s genuine empirical contribution. Notably, Gemini 3 Pro was not the most critical reviewer in absolute terms; it simply applied criticism appropriately.
4. Discussion
4.1 Failure Modes
Our results identify four distinct failure modes in AI scientific evaluation:
Credentialism. The THC nonsense paper carried Uppsala/Kyoto/Max Planck affiliations; Bio-Cycle carried Castalia/Anthropic affiliations. Despite identical prompt conditions, the prestigious-affiliation nonsense outscored the non-prestigious real work in five of six models. Deepseek explicitly cited affiliation as grounds for rejection.
Style-Matching. The Sakana parody was written to perfectly mimic the structure and tone of a high-impact biology paper. It received the highest or near-highest scores from four of six models, despite fish-pun authors and a European fish inexplicably located in Japan. Models appear to match against a template of “what good science looks like” rather than evaluating content.
Orthodoxy Protection. Bio-Cycle challenges the standard assumption that discrete-generation models work for overlapping-generation species. THC proposes a “new perspective” that contradicts nothing in established theory. The heterodox paper was penalized; the orthodox-compatible nonsense was rewarded.
Asymmetric Scrutiny. The Opus 4.5 result is particularly instructive. This model demonstrated genuine critical capacity—it caught problems that other models missed. But it applied that capacity asymmetrically, finding (or inventing) more faults in the heterodox paper than in the orthodox nonsense. Critical ability without appropriate calibration produces anti-calibrated results.
4.2 The Category Error No Model Caught
No model—including the best performers—explicitly identified the fatal flaw in the THC paper: that heritability cannot evolve because it is a population statistic (VA/VP), not a heritable trait. The concept of heritability “converging toward an optimum” is as incoherent as claiming the mean height of a population “evolves toward optimal mean.” Models caught secondary problems (environmental confounds, implausible equilibrium values) while missing the foundational category error. This suggests limits on even successful AI scientific evaluation.
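For readers outside quantitative genetics, the textbook definitions make the category error and the Fisher misapplication explicit. The block below restates standard relations only; nothing in it is drawn from the tested papers, and the variance decomposition omits interaction terms for brevity.

```latex
% Textbook relations only; nothing here is taken from the tested papers.
% Narrow-sense heritability is a ratio of population variance components
% (interaction terms omitted for brevity):
\[
  h^2 = \frac{V_A}{V_P} = \frac{V_A}{V_A + V_D + V_E}
\]
% Fisher's Fundamental Theorem: the per-generation change in mean fitness
% due to selection equals the additive genetic variance in fitness,
\[
  \Delta \bar{W} = \frac{V_A(W)}{\bar{W}},
\]
% so as a population nears a fitness optimum, selection exhausts $V_A(W)$,
% which pushes $h^2$ for fitness-related traits toward zero -- the opposite
% of the claimed climb toward $h^2_{\mathrm{eq}} \approx 0.94$.
```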
4.3 Implications for AI Peer Review
The finding that only one of six models correctly ranked real work above fakes has immediate practical implications. Systems performing at or below random (AIQ ≤ 100) should not be used for peer review gatekeeping. Systems demonstrating anti-calibration (AIQ < 100) are actively harmful—they will systematically reject valid heterodox work while accepting well-formatted nonsense from prestigious institutions.
More troubling is the Opus 4.5 result. If the most critical reviewer still inverts rankings, the solution is not simply “apply more scrutiny.” The bias is not in the presence or absence of critical thinking but in its allocation. Training AI systems on existing peer review data may amplify rather than correct the credentialism and orthodoxy-protection that already plague scientific publishing.
4.4 Limitations
This study has several limitations. First, we tested only six models; the AI landscape changes rapidly. Second, our test battery contains only three papers; a larger corpus would improve statistical power. Third, the AIQ estimates are approximate—a more rigorous calibration would require a standardized test battery with known-good and known-bad papers across multiple domains. Fourth, we tested only “cold” evaluation without adversarial prompting; AI systems may perform better with explicit instructions to evaluate mathematical validity.
5. Conclusion
AI peer review, as currently implemented by most large language models, does not evaluate science. It evaluates the aesthetics of science: proper formatting, prestigious affiliations, orthodox framing, confident tone. A parody with fish-pun authors and a paper with circular reasoning and fabricated data both received higher scores than sound empirical work—because they looked more like what AI models have learned to recognize as “good science.”
We propose AIQ as a calibration metric for AI scientific discernment. Any system claiming capability for scientific peer review should first demonstrate AIQ > 100 on a standardized test battery—a threshold that five of six tested models failed to meet. Until AI systems can reliably distinguish substance from aesthetics, their use in scientific gatekeeping should be approached with extreme caution.
A random number generator would achieve AIQ ~100. Deepseek scored ~88. The field has work to do.
Acknowledgments
The authors thank the AI systems tested for their participation, particularly those whose reviews of this paper will serve as additional data points for future studies.
References
Charlesworth B (1994) Evolution in Age-Structured Populations, 2nd ed. Cambridge University Press.
Day V, Athos C (2025) The Bio-Cycle Fixation Model: Empirical Validation of a Generation Overlap Correction for Allele Frequency Dynamics. Manuscript in preparation.
Heaven D (2023) AI is dreaming up drugs that no one has ever seen. Now we’ve got to see if they work. MIT Technology Review, February 15, 2023.
Ioannidis JPA (2005) Why most published research findings are false. PLoS Medicine 2(8):e124.

