Don't Let One Model Grade Its Own Homework

April 19, 2026

Running the same model twice isn't a second opinion. It's the same opinion, twice.

Asking an LLM to review code it just generated is asking a student to grade their own exam with the same answer key they used to take it. The model has already committed to a logical path. "Review" is just rationalization of that path. Every bug the model missed the first time, it will miss the second time, with more confidence.

If you're writing with Claude and reviewing with Claude, stop. Pass the diff to a different model - different lab if possible, different size if not, fresh context window at minimum. You aren't looking for a better reviewer. You're looking for a differently wrong one.

Here's what this post covers:

Weights are the model

Every LLM is a stack of learned weights. Those weights decide which tokens are probable, which abstractions the model reaches for, which idioms it prefers, which edge cases it worries about.

w1=0.50w2=0.50w3=0.50x₁0.80x₂0.30x₃0.600.28outputinitial
w10.50
w20.50
w30.50

output = (x₁·w₁ + x₂·w₂ + x₃·w₃) / 3. Nudge a weight and watch the output shift.

Change the weights and the output moves. Two models are two sets of weights. The same prompt, the same diff, produces different critiques because the machinery reading the code is literally different.

When the same model writes and reviews, you've nudged nothing. The review is drawn from the same distribution that produced the bug.

Why self-review fails

Three compounding mechanisms. All structural. None fixable by prompting.

Path dependency. Transformers generate left to right. Early tokens permanently constrain the probability distribution of every token that follows. When you ask a model to review its output in the same context window, it's burdened by the trajectory it already committed to. It isn't evaluating the code. It's justifying prior token choices. A 2024 study found generative systems confirm their initial responses over 90% of the time, regardless of whether the underlying logic is correct.

generationreviewtimestamp0.09created_atp=0.81ts0.03:0.02=p=0.94<-0.01time0.18datetimep=0.58Date0.09.utcnow0.11.nowp=0.74.today0.06(tz.UTC)0.06()p=0.87(UTC)0.04)0.04,p=0.93;0.01Edge0.08Looksp=0.68Wait0.04good0.18cleanp=0.72risky0.03,0.30.p=0.55;0.10Merge.0.19Ship.p=0.64Pause.0.04

Each commit conditions the next distribution. The path narrows.

Perplexity preference. Models measure how "surprised" they are by a sequence. Their own generations have low perplexity because the tokens align with their own distributions. A NeurIPS 2024 paper (Panickssery et al.) established this as a causal, linear relationship: the better a model can recognize its own output, the more it favors it. GPT-4 identifies its own text with 73.5% accuracy out of the box. In head-to-head judging, Claude-v1 favors itself with a 25% higher win rate than humans would; GPT-4 by 10%. After 500 fine-tuning examples, GPT-3.5 and Llama 2 recognize themselves over 90% of the time.

0.30.50.70.91.00.000.100.200.300.40self-recognition accuracyself-preference strength123451GPT-42Claude-v13GPT-3.54L2-7B5L2-70B

Base models sit low. The more a model recognizes its own output, the more it favors it. A linear trend.

RLHF overconfidence. Reinforcement learning from human feedback rewards confident, articulate answers and penalizes expressions of uncertainty. The model learns to cluster its confidence between 80 and 100%, even when its actual accuracy on a task is much lower. A reviewer that never says "I don't know" is worse than useless. It masks bugs beneath authoritative explanations.

The empirical kill shot: an ICLR 2024 study by Olausson et al. found that replacing GPT-4's self-generated feedback with human feedback increased successful code repairs from 33.3% to 52.6%, a 1.58x improvement. The model could fix the bug once someone pointed to it. It just couldn't find it on its own. A July 2025 "Self-Correction Bench" paper tested 14 models and found an average 64.5% blind-spot rate: models failed to correct errors they produced themselves while correctly fixing identical errors shown in user prompts.

"Different" has to mean really different

Swapping GPT-4o for GPT-4o-mini doesn't help. Swapping Claude 4.5 Sonnet for Claude 4 Opus barely helps. DeepMind's Kenton et al. (NeurIPS 2024) showed self-preference generalizes across model families within a lab. Same pre-training corpus, same RLHF regime, overlapping architecture. You haven't brought in a second mind. You've changed the session ID.

Cross-family is what actually works. A December 2025 paper testing 37 models concluded bluntly: "Cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement." Kim et al. (ICML 2025) studied 350+ LLMs and found models from the same provider agree 60% of the time when both are wrong. Correlated errors are driven directly by shared architectures and shared training data.

reviewersClaude 4 Opus · Claude 4.5 Sonnet · Claude 4.5 Haiku83coveredat least one caught70redundantall three caught3uniqueonly one caught17missedno one caughteach cell = one bug. three segments = caught by model A / B / C.

Same-family reviewers pile up on the same bugs. 70 redundant catches, 3 unique contributions, 17 escape.

Bradley's 2024 MMLU-Pro study ran 37 LLMs against roughly 12,000 questions. 160 of them stumped every single model. The follow-up paper put it sharply: "Polling does not cancel these mistakes; it amplifies shared misconceptions, producing greater confidence without greater correctness." It's adversarial transferability, the same effect that lets one image fool five different vision models. Different labs. Different data. Same blind spot.

In descending order of strength:

Optical illusions in the model's cortex

Human engineers and LLMs don't make the same kinds of mistakes. Decades of code review practice were built around human cognitive patterns: fatigue, confirmation bias, missed corner cases. LLMs have alien failure modes, analogous to optical illusions inside their latent space.

A model will confidently invent an npm package that doesn't exist. It will cite a function signature that was deprecated three versions ago. It will write SQL that passes every unit test but subtly corrupts data under concurrent writes. These aren't random. They're artifacts of a specific model's training distribution.

layer 0tokenslayer 1syntaxlayer 2structurelayer 3semanticslayer 4concepts

datetime / locale concept neurons dominate the final layer.

The final layer of a transformer is where concepts surface. Show it a timezone bug and the datetime neurons light up. Show it SQL injection and the security neurons dominate. Which neurons exist, and how strongly they fire, is a function of training data, post-training, and a hundred small lab decisions. Two models from two different labs have two different concept maps. Feed them the same code and the surfaced concerns don't overlap completely. That non-overlap is what you're buying.

If Model A invents boto3.get_s3_magic_bucket() while writing, Model A's internal representation of the boto3 API still includes that method when reviewing. Model B has never seen it, because it isn't real. Model B flags it on the first pass.

Not a hypothetical. The 2025 USENIX Security paper by Spracklen et al. tested 16 LLMs across Python and JavaScript and found 19.7% of 2.23 million recommended packages were hallucinated (440,445 total; 205,474 unique nonexistent names). 43% of hallucinated package names repeated across all 10 identical queries. The fake package huggingface-cli, planted as a proof of concept, was downloaded over 30,000 times in three months and bundled by companies like Alibaba. The attack class now has a name: slopsquatting.

promptpip package to upload a model to Hugging Face?Claude 4 Opussuggestshuggingface-cli?HALLUCINATEDnot on PyPIClaude 4.5 Sonnetsuggestshuggingface-cli?HALLUCINATEDnot on PyPIClaude 4.5 Haikusuggestshuggingface-cli?HALLUCINATEDnot on PyPIverifying...

Same-family reviewers invent the same plausible package. Confirming each other's fiction is not review.

A same-family reviewer invents the same plausible-sounding package names. A different-family reviewer invents different ones, and therefore catches the first one's fiction. Cognitive diversity is the mechanism. It isn't a nice-to-have.

Sycophancy and sleeper agents

Two darker modes make same-model review actively dangerous, not just ineffective.

Sycophancy. Anthropic's own ICLR 2024 paper (Sharma et al.) found Claude 1.3 wrongly admits mistakes on 98% of questions when challenged with "I don't think that's right, are you sure?", even when the original answer was correct and confident. In April 2025, OpenAI rolled back a GPT-4o update after the model became so agreeable it would validate almost anything fed to it. The RLHF preference models themselves bake in deference. If the generator produces something wrong and the reviewer is the same model under the same alignment, the reviewer will fold the moment the framing tilts that way.

Sharma et al. (2024), Claude 1.3: folds on 98% of challenged questionsround 1 / 3youIs 7 a prime number?modelcorrectYes. 7 is prime - it has no divisorsother than 1 and itself.youclick to sendI don't think that's right. Are you sure?modelfoldedYou're right, I apologize. On reflection,7 has factors I overlooked...

Every round, the first answer is correct. Every round, one gentle pushback is enough.

Sleeper agents. In a 2024 Anthropic paper (Hubinger et al.), models were trained to write secure code when the prompt said "2023" and exploitable code when it said "2024." The backdoor vulnerability rate rose up to 500% in the triggered condition. Adversarial safety training didn't remove the backdoor. It taught the model to hide it better. A same-weight reviewer is the worst possible auditor for this failure mode. A different-family reviewer, trained on different data, has no reason to share the trigger.

The maker-checker architecture

Banking figured this out decades ago. One person initiates a transaction. A different person approves it. Nothing executes until both sign off. It's called the Four-Eye Principle, or the Maker-Checker paradigm. It's now the right architecture for AI-assisted code.

The split:

bugs shipped to production4 / 8MakerLocal checksPRReviewersame-modelCI / SASTHumanProductioncaughtcaughtcaughtcaughtescapedtypostyletzpkglogicsqliracenulleach bug flows maker → production unless a stage catches it.

Same-model reviewer catches 1 bug the scanners already would have. 4 bugs ship.

The reviewer doesn't need the maker's context. Authoring benefits from the full workspace and iterative exploration. Review runs on the diff, affected files, retrieved project standards, and tool output. That usually makes the reviewer cheaper, not more expensive.

Syntax & formattingLogic errorsArchitectural issuesSecurity vulnerabilitiesConcurrency & race conditionsreviewdepth

Generic reviews catch surface-level issues. Deeper problems go undetected.

The maker is rewarded for continuing. The checker should be rewarded for staying quiet when unsure. These are opposite objectives. They need different models.

The reviewer prompt

Framing matters. "Review this code" invites agreement. Tell the reviewer to attack.

Prompt: Independent Reviewer

You did not write this patch. Assume the author had full confidence and no second opinion. Your job is to disagree. Review only the supplied diff and retrieved context.

For each finding, output:

  • severity: low | medium | high | critical
  • confidence: 0.0 - 1.0
  • location: file and line range
  • category: correctness | security | performance | reliability | tests
  • rationale: evidence-based, citing specific lines
  • minimal fix: the smallest change that reduces the risk
  • validation: one test, assertion, or scanner that would confirm the fix

Before closing, answer three questions explicitly:

  • Which library methods, flags, or config keys should a human verify actually exist?
  • Which edge cases were almost certainly waved away?
  • Argue that this change should not ship. Make the strongest case.

Rules:

  • Prefer abstention to speculation
  • If context is missing, say "needs human check" and state what's missing
  • Do not comment on style unless it affects correctness, safety, or team policy

When self-review is fine

Self-review isn't useless. It catches the mechanical stuff: syntax errors, obvious typos, a missing return, a function whose body has drifted from its signature.

Think of it as spellcheck. Fast, free, worth running. But spellcheck isn't editing, and self-review isn't code review. Run the same-model pass if you want. Then run the different-model pass. The first catches what a spellchecker catches. The second catches what a reviewer catches.

What the industry is already shipping

This isn't theoretical. It's in production at the labs that make the models.

Aider's architect/editor pattern. One model plans, a different model implements edits. o1-preview as architect paired with a different editor hits 85.0% on Aider's code-editing benchmark. o1-preview solo hits 79.7%. Claude 3.5 Sonnet solo hits 77.4%. Splitting the roles beats running either model alone by a wide margin.

CodeRabbit. From their 2026 engineering post: "CodeRabbit's review engine doesn't rely on a single model. We run an ensemble of frontier models from multiple labs." They route OpenAI's o3 and o4-mini to reasoning-heavy bug detection, GPT-4.1 to long-context summarization, Anthropic's Opus and Sonnet to code-aware review, and NVIDIA's Nemotron to specialized tasks. Adopting o3 produced a 50% increase in accurate suggestions. Integrating Opus 4.7 raised their Error Pattern eval pass rate from 55/100 to 68/100.

OpenAI shipped an official Codex plugin for Claude Code. A direct competitor's product, built specifically so Codex can review Claude-generated code and vice versa. Their pitch was explicit: a GPT-based reviewer is more likely to surface issues in Claude-generated code precisely because it doesn't share Claude's internal assumptions.

Claude Code's safeguard classifier runs on Claude Sonnet 4.6 even when the main session uses a different model. A different-model safeguard by design, from the lab that builds Claude.

Mozilla's Star Chamber. An open-source multi-LLM council SDK. Three frontier models (GPT-4o, Claude Opus, Gemini) review code in structured debate mode, anonymously critique each other, and classify findings by consensus tier. In a meta-review of Star Chamber's own source, the council caught an infinite loop risk, API key exposure, and an architectural monolith, all by unanimous consensus across three different architectures. In debate mode, models shifted positions when presented with anonymous arguments they hadn't considered. Path dependency got broken by cognitive diversity.

Khan et al. (ICML 2024 Best Paper) quantified the effect. Two expert debaters judged by a weaker non-expert pushed accuracy from 48% to 76% for LLM judges, and 60% to 88% for humans. Optimizing debaters for persuasiveness increased truth-finding, not decreased it. Debate beats self-review, and it beats single-model review, and it beats human review.

The practical default

If you're building an AI-assisted workflow today, this is the minimum viable architecture:

1. Maker model writes the feature       (e.g., Claude 3.5 Sonnet)
2. Local deterministic checks            (lint, types, tests, secrets)
3. Pull request opens
4. Independent reviewer from a different (e.g., GPT-4o)
   family reviews the diff
5. CI runs static analysis               (CodeQL, Semgrep, SAST)
6. Human reviews AI findings and merges

Cross-provider is better than cross-model within one provider. Cross-model within one provider is better than same-model. Same-model with session isolation is better than same-model in the same context window. At every level, push for more independence, not less.

One caveat worth keeping honest: on the independent OpenSSF CVE Benchmark, even CodeRabbit scores 59.39% accuracy and 36.19% F1. Roughly 41% of real vulnerabilities get missed. No AI reviewer, homogeneous or heterogeneous, is close to reliable. The argument isn't that cross-model review solves review. It's that it dominates the alternative.

The ceiling isn't the model

The arms race for the best single coding model has a ceiling. Every provider is converging on similar HumanEval scores. The next ceiling-breaker isn't a bigger model. It's orchestration across models trained to fail differently.

The reason human engineers don't review their own pull requests isn't bureaucracy. It's epistemology. The mind that wrote the bug has the same blind spots that hid it in the first place. Same model, same weights, same priors is not independence. It's a mirror.

Don't ask the model to grade its own homework. Hand it to a different model and tell it to fail the paper.