LLMs Can Improve at Code by Training on Their Own Wrong Answers
Simple Self-Distillation (SSD) lets LLMs improve at code generation by training on their own unverified outputs, no correctness labels or execution environment needed. Qwen3-30B jumps 12.9 points on LiveCodeBench v6.
A new training technique lets language models get significantly better at code by learning from their own outputs, without any correctness verification, external teacher, or reinforcement learning loop.
The method is called Simple Self-Distillation (SSD). A model samples solutions at elevated temperature, fine-tunes on those raw outputs, then gets deployed at an adjusted evaluation temperature. No execution sandbox. No passing/failing labels. No reward model.
- Qwen3-30B-Instruct jumped from 42.4% to 55.3% pass@1 on LiveCodeBench v6 (+12.9pp, +30% relative gain)
- Qwen3-4B-Instruct gained +7.5pp; Llama-3.1-8B-Instruct gained +3.5pp
- Gains concentrated on harder problems: +15.3pp pass@1 and +23.0pp pass@5 on hard problems
- Even when 62% of training samples had no extractable code, SSD still improved pass@1 by +5.7pp
- Tuning evaluation temperature alone (no SSD) gave only 2.2pp variation; SSD holds a +11.8pp advantage over the best-tuned baseline
What SSD Actually Does
The three-step recipe is straightforward:
- Sample: run the model at an elevated training temperature to generate diverse outputs, with a truncation threshold that cuts the very tail of the distribution
- Train: fine-tune on those outputs via standard supervised learning, no filtering for correctness
- Infer: deploy at an adjusted (lower) evaluation temperature
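The three steps can be sketched on a toy categorical "model" (a single next-token distribution standing in for a real LLM). The `sample_with_temp` helper, the temperatures, and the truncation threshold here are illustrative choices, not the paper's implementation:

```python
import math
import random
from collections import Counter

def sample_with_temp(probs, T, trunc, rng):
    """Sample from a categorical distribution at temperature T,
    dropping tokens whose tempered probability falls below `trunc`."""
    # temperature scaling: p -> p^(1/T), then renormalize
    tempered = {k: math.exp(math.log(p) / T) for k, p in probs.items() if p > 0}
    Z = sum(tempered.values())
    tempered = {k: v / Z for k, v in tempered.items()}
    # truncation: cut the low-probability tail, renormalize the rest
    kept = {k: v for k, v in tempered.items() if v >= trunc}
    Z = sum(kept.values())
    kept = {k: v / Z for k, v in kept.items()}
    r, acc = rng.random(), 0.0
    for k, v in kept.items():
        acc += v
        if r <= acc:
            return k
    return k  # float-rounding fallback

# Step 1 (Sample): many draws at an elevated training temperature.
base = {"A": 0.6, "B": 0.25, "C": 0.1, "D": 0.05}  # toy "model"
rng = random.Random(0)
draws = [sample_with_temp(base, T=1.3, trunc=0.1, rng=rng) for _ in range(20000)]

# Step 2 (Train): SFT on raw, unfiltered samples; for a categorical toy
# model this amounts to fitting the empirical sample distribution.
counts = Counter(draws)
student = {k: counts[k] / len(draws) for k in counts}

# Step 3 (Infer): the student would now be deployed at a lower evaluation
# temperature. Here we just inspect it: "D" has been cut from the support
# and its mass redistributed among the surviving tokens.
print(sorted(student.items()))
```

Note how the student's support is strictly smaller than the base model's: the tail token never appears in training data, so supervised fine-tuning drives its probability toward zero.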
The key insight is what this process does to the model’s probability distribution. The authors decompose the SSD objective into three components: support compression (trimming low-probability tail mass), within-support reshaping (redistributing probability among remaining tokens), and alignment (keeping the reshaping anchored to the base model’s preferences).
In practice, this means the model gets better at two different kinds of token positions simultaneously. “Lock” positions are those where syntax is constrained and there’s a correct token to pick from a small set; SSD helps suppress distractors there. “Fork” positions are where multiple valid solution approaches exist; SSD preserves diversity there by not collapsing to a single path.
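One round of this reshaping can be shown analytically on two hypothetical next-token distributions, one lock-like and one fork-like. The probabilities, temperature, and threshold below are invented for illustration:

```python
def ssd_reshape(probs, T_train=1.3, trunc=0.05):
    """Analytic sketch of one SSD round's effect on a next-token
    distribution: temper, cut the tail, renormalize survivors."""
    tempered = {k: p ** (1.0 / T_train) for k, p in probs.items()}
    Z = sum(tempered.values())
    tempered = {k: v / Z for k, v in tempered.items()}
    kept = {k: v for k, v in tempered.items() if v >= trunc}
    Z = sum(kept.values())
    return {k: v / Z for k, v in kept.items()}

# "Lock" position: syntax leaves one plausible token; the rare
# distractors fall below the threshold and are suppressed entirely.
lock = {")": 0.95, "]": 0.02, ",": 0.02, "#": 0.01}
print(ssd_reshape(lock))

# "Fork" position: two valid solution approaches share most of the mass;
# both survive (slightly flattened), so diversity is preserved,
# while the low-probability noise token is still cut.
fork = {"sort": 0.56, "heap": 0.42, "oops": 0.02}
print(ssd_reshape(fork))
```

The same operation acts differently depending on the shape of the input distribution: it sharpens where one token dominates and preserves the split where several do, which is the lock/fork behavior described above.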
[Chart: SSD gains by problem difficulty (Qwen3-30B-Instruct, LiveCodeBench v6); y-axis: pass@1 improvement over baseline]
The Robustness Result Is the Surprising One
The most counterintuitive finding is how little the method cares about output quality.
In an ablation, the authors deliberately used a decode configuration where 62% of training samples contained no extractable code at all. The model was training mostly on garbage. SSD still improved pass@1 by +5.7pp and pass@5 by +10.5pp.
This supports the theoretical framing: SSD isn’t learning from individual correct solutions. It’s reshaping the aggregate distribution. The signal is in the shape of what the model generates across many samples, not in whether any one sample is right.
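A toy simulation makes the aggregate-signal point concrete. Suppose each of 10 token positions independently gives the correct token only 50% probability: almost no complete sample is fully correct, yet the per-position empirical distribution still points at the right token everywhere. (The vocabulary and probabilities are invented for illustration.)

```python
import random
from collections import Counter

rng = random.Random(0)
VOCAB = ["ok", "x", "y", "z"]          # "ok" is correct at every position
P = [0.5, 0.5 / 3, 0.5 / 3, 0.5 / 3]   # per-position token distribution
N_POS, N_SAMPLES = 10, 5000

samples = [[rng.choices(VOCAB, weights=P)[0] for _ in range(N_POS)]
           for _ in range(N_SAMPLES)]

# Almost no individual sample is fully correct (expected rate: 0.5^10)...
fully_correct = sum(all(t == "ok" for t in s) for s in samples) / N_SAMPLES

# ...yet the aggregate per-position argmax recovers the correct token.
argmax_per_pos = [Counter(s[i] for s in samples).most_common(1)[0][0]
                  for i in range(N_POS)]

print(f"fully correct samples: {fully_correct:.1%}")
print("per-position argmax:", argmax_per_pos)
```

This is one way to see why training on mostly-garbage samples can still help: the useful signal lives in marginal statistics that survive even when whole samples are wrong.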
The paper also finds that training and evaluation temperature compose in a principled way. Without truncation, the effective temperature is simply their product (T_eff = T_train × T_eval), with the optimum around T_eff ≈ 1.2. This lets practitioners reason about the method and tune it systematically rather than grid-searching blindly.
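The composition rule makes hyperparameter selection a one-line calculation: pick a training temperature, then solve for the evaluation temperature that hits the target effective temperature. A minimal sketch (the target of 1.2 is the paper's reported optimum; the candidate training temperatures are arbitrary):

```python
T_EFF_TARGET = 1.2  # optimal effective temperature reported in the paper

def eval_temp_for(t_train: float, t_eff: float = T_EFF_TARGET) -> float:
    """Invert T_eff = T_train * T_eval for the evaluation temperature."""
    return t_eff / t_train

for t_train in (1.0, 1.2, 1.5, 2.0):
    print(f"T_train={t_train:.1f} -> T_eval={eval_temp_for(t_train):.2f}")
```

Note that any training temperature above 1.2 implies an evaluation temperature below 1.0, matching the paper's recipe of sampling hot and deploying cooler.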
Caveats
SSD was evaluated on competitive programming tasks (LiveCodeBench). Whether it transfers cleanly to production code, multi-file edits, or lower-level systems tasks hasn’t been tested. Performance on out-of-domain benchmarks (math reasoning, general coding) remained broadly stable, which suggests SSD doesn’t cause regressions elsewhere, but that’s different from asking whether the gains on hard algorithmic problems carry over to other code domains.
The method also requires temperature hyperparameter tuning at both training and evaluation time. That’s not difficult, but it’s not zero-cost either, and optimal values will likely vary by model family.
Still, the core result stands: a model with no external supervision, no execution environment, and no correct answers to imitate improved competitive programming pass@1 by 12.9 points on a frontier-model-level baseline. That’s a meaningful signal for anyone building post-training pipelines for code.