An LLM Benchmark Idea: Earnings Forecasting

A proposed LLM benchmark: feed a model pre-earnings data, have it forecast the results, and compare against actuals. Here's why it's worth building, and why today's models make it more interesting than ever.

Here’s a benchmark idea I keep coming back to: take an LLM with a known knowledge cutoff, give it all the publicly available information up to (but not including) an earnings release, and have it forecast the results.

Compare the model’s forecast to actual results. Compare it to analyst consensus. Repeat across hundreds of companies and quarters. Track how performance changes as models get smarter.

It’s not a new idea. There’s prior academic work in this space. But it hasn’t been done as a rigorous, ongoing benchmark for comparing frontier models, and today’s reasoning models make it more interesting than it’s ever been.

Why earnings forecasting is a good benchmark task

Most LLM benchmarks test one of two things: knowledge recall (does the model know the answer?) or instruction following (did the model format the output correctly?). Earnings forecasting tests something harder: multi-step financial reasoning under uncertainty.

To forecast a company’s quarterly results well, a model needs to:

  • Parse and synthesize dense financial filings (10-Qs, 8-Ks, annual reports)
  • Understand how macro conditions affect a specific business
  • Weight management commentary and guidance against historical patterns
  • Reason about industry dynamics and competitive position
  • Produce a calibrated probability distribution, not just a point estimate

These are skills that transfer to real analytical work, which is exactly what you want a benchmark to test.

The other advantage: the ground truth is objective. Revenue either hit the consensus estimate or it didn’t. EPS either beat or missed. There’s no grading subjectivity.

The benchmark design

The setup is fairly clean:

  1. Pick a historical window, say every S&P 500 earnings release from 2015–2024
  2. For each release, assemble the pre-earnings context: the most recent 10-Q, prior earnings call transcript, analyst consensus estimates, relevant news headlines, all timestamped before the release date
  3. Feed this context to a model whose knowledge cutoff predates the release date, so it can't simply look up the answer
  4. Ask it to forecast revenue, EPS, and optionally gross margin or guidance
  5. Score against actuals, but also against analyst consensus, which is the real baseline

The comparison to analyst consensus matters. Analysts aren’t easy to beat: they have access to management relationships, channel checks, and years of sector-specific experience. A model that consistently beats consensus would be genuinely surprising. One that gets close is still interesting.
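The scoring step above can be sketched directly. This is a minimal illustration with hypothetical EPS values; `score_release` and its field names are my own shorthand, not part of any existing benchmark:

```python
def pct_error(forecast: float, actual: float) -> float:
    """Absolute percentage error of a point forecast."""
    return abs(forecast - actual) / abs(actual)

def score_release(model_eps: float, consensus_eps: float, actual_eps: float) -> dict:
    """Score one earnings release: magnitude error for both the model and
    the analyst consensus, plus whether the model called the beat/miss
    direction (relative to consensus) correctly."""
    model_dir = model_eps >= consensus_eps    # model predicts a beat
    actual_dir = actual_eps >= consensus_eps  # the quarter actually beat
    return {
        "model_ape": pct_error(model_eps, actual_eps),
        "consensus_ape": pct_error(consensus_eps, actual_eps),
        "direction_correct": model_dir == actual_dir,
    }

print(score_release(model_eps=2.10, consensus_eps=2.00, actual_eps=2.05))
```

Aggregating `model_ape` against `consensus_ape` across hundreds of releases is what turns this from a demo into the consensus-relative benchmark described above.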

Benchmark pipeline

SEC filings (10-Qs, 8-Ks) + earnings call transcripts + prior quarters → LLM (reason + forecast) → score vs. actuals and vs. analyst consensus. All inputs are timestamped before the earnings release date to prevent look-up.

What prior work shows

There’s meaningful prior art here, though it hasn’t converged into a standard benchmark.

Financial Statement Analysis with Large Language Models (Kim, Muhn & Nikolaev) is probably the closest. The paper tests GPT-4’s ability to predict the direction of future earnings changes from financial statements alone, no earnings call transcripts, no news context. GPT-4 achieves about 60.4% accuracy, on par with a trained neural network, despite not being designed for this task. The chain-of-thought prompting strategy outperforms simple direct prompting.

Scaling Core Earnings Measurement with Large Language Models (Shaffer & Wang, HBS) takes a different angle: using LLMs to estimate “core earnings” from 10-K filings. Their sequential prompting approach beats GAAP net income and standard accounting alternatives as a predictor of future earnings. It shows that how you prompt matters a lot for financial tasks.

FinCall-Surprise is a recent multi-modal benchmark covering 2,688 earnings calls from 2019–2021, evaluated against earnings surprise outcomes. A key finding: many models appear to perform well due to class imbalance, not genuine predictive ability. This is a real methodological trap to avoid in any earnings benchmark.

ForecastBench (ICLR 2025) is the most rigorous general forecasting benchmark, running LLMs against superforecasters on 1,000 probabilistic questions updated nightly. As of early 2025, GPT-4.5 achieves a Brier score of 0.101 versus superforecasters’ 0.081, close but still trailing. The methodology here (contamination-free questions, calibrated probability scoring) is the right template.
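The Brier score that ForecastBench reports is simple to compute: the mean squared difference between stated probabilities and binary outcomes. A minimal sketch with illustrative beat/miss forecasts:

```python
def brier_score(probs, outcomes):
    """Mean of (p - o)^2 over forecasts. 0 is perfect; an always-50%
    forecaster scores 0.25 on binary questions."""
    assert len(probs) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Three probabilistic beat/miss forecasts vs. what actually happened.
print(brier_score([0.9, 0.2, 0.6], [1, 0, 1]))  # ≈ 0.07
```

This is why the 0.101 vs. 0.081 gap is meaningful: Brier scores punish confident wrong answers quadratically, so closing a 0.02 gap requires genuinely better calibration, not just bolder guesses.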

What’s missing

What’s missing is a dedicated model-comparison benchmark for earnings forecasting.

The academic papers above tend to evaluate one or two models, on a fixed historical dataset, with a single prompting strategy. A proper benchmark needs:

  • A standardized, reproducible dataset of pre-earnings contexts (source filings and other inputs, normalized)
  • Consistent evaluation across all major frontier models
  • Scoring that separates direction accuracy, magnitude accuracy, and calibration
  • A leaderboard that updates as new models release

The FinCall-Surprise and ForecastBench papers show how to build these properly. The earnings domain just hasn’t gotten the same treatment.
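Of the three scoring axes above, calibration is the least standard. One common approach, sketched here with made-up forecasts, is to bin the model's stated beat probabilities and compare each bin's average confidence to its empirical hit rate:

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Return (avg_confidence, hit_rate, count) for each non-empty
    probability bin. A well-calibrated model has avg_confidence close
    to hit_rate in every bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, o))
    table = []
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            hit = sum(o for _, o in b) / len(b)
            table.append((conf, hit, len(b)))
    return table

for conf, hit, n in calibration_table(
        [0.1, 0.15, 0.7, 0.75, 0.9, 0.95], [0, 0, 1, 0, 1, 1]):
    print(f"confidence {conf:.2f}  hit rate {hit:.2f}  n={n}")
```

Separating this from direction and magnitude accuracy matters because the FinCall-Surprise finding cuts exactly here: a model can post a high hit rate on an imbalanced dataset while being badly miscalibrated.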

Why now is the right time

A few things have changed that make this more interesting now than when the earlier papers were written.

Context windows are no longer the bottleneck. A full 10-Q plus an earnings call transcript can easily run 50,000+ tokens, which was a dealbreaker in 2023 and isn’t one now.

Reasoning models matter here too. o1, o3, and DeepSeek R1 are trained to work through multi-step problems. Financial analysis is exactly the kind of task where more thinking time should help, and we don’t really know how much yet.

The data side is tractable. SEC filings are machine-readable and well-structured. Earnings call transcripts are widely available through Seeking Alpha and others.

And the baseline is meaningful. Analyst consensus estimates are public, so every model forecast can be scored against a strong, well-documented human benchmark rather than against actuals alone.

The honest caveat

The obvious objection: even if models improve their earnings forecasting score, does that mean they’re reasoning better, or just getting better at memorizing patterns that correlate with beats and misses?

This is a real concern for any benchmark. The mitigation is to test on earnings that occurred after the model’s training cutoff, force the model to show its work, and use out-of-sample quarters from less-covered companies where memorization is less plausible. ForecastBench’s approach of continuously generating questions about future events is the cleanest solution.
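The first mitigation is mechanical: restrict the evaluation set to releases dated after the model's training cutoff. A sketch, with hypothetical tickers, dates, and cutoff:

```python
from datetime import date

# Hypothetical earnings releases; in practice this comes from the
# benchmark's release calendar.
releases = [
    {"ticker": "AAA", "date": date(2024, 4, 25)},
    {"ticker": "BBB", "date": date(2025, 2, 6)},
]

MODEL_CUTOFF = date(2024, 10, 1)  # illustrative cutoff for the model under test

# Only post-cutoff releases are fair game: the model cannot have
# memorized the outcome.
eval_set = [r for r in releases if r["date"] > MODEL_CUTOFF]
print([r["ticker"] for r in eval_set])  # → ['BBB']
```

Each model would get its own filter date, which is why the leaderboard needs per-model evaluation windows rather than one fixed test set.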

A well-designed earnings forecasting benchmark would answer a question that actually matters: can frontier models do financial analysis, or just look stuff up? That’s worth finding out.

#ai #benchmarks #research #finance