Brief #11: Mismeasuring productivity and agents
What inflation teaches us about AI’s uplift and the lack of self-assessment of AI agents
Welcome! This bi-weekly newsletter, published by the Windfall Trust, curates the most important developments in AI economics research and policy. Each issue features key research and updates, along with in-depth analysis and quick links to relevant opportunities and recent news.
Need to Know
· AI’s uplift is overestimated in most cases, according to METR, because productivity gains are concentrated in new tasks, such as coding personal apps, that were previously not worth the effort. The uplift on existing tasks and in value is lower. The logic mirrors inflation measurement: inflation looks lower when computed on the new consumption bundle, because consumers substitute toward more attractive goods, just as workers substitute toward more attractive tasks.
· AI agents can’t yet price their own work. For AI agents to compete for tasks in a market, they need to assess their own likelihood of success and costs, but six frontier LLMs get both wrong. Confidence ranges from 61% to 93% while actual success rates cluster at 75–81%, and token cost estimates are roughly 5× too low. In simulated auctions, this means tasks go to overconfident agents rather than the most capable, degrading the market from an efficient allocator to something closer to a lottery.
In the news
Andy Hall (Stanford) argues that the real political backlash to AI hasn’t begun, and that labs should build measurement infrastructure and self-activating policy triggers now rather than pre-emptive social contracts the public hasn’t asked for. Meanwhile, The Economist urges governments to prepare social safety nets for an upcoming AI “jobs apocalypse.”
Philosopher Toby Ord (Oxford) criticized the METR economists for their assumptions regarding utility functions. The authors acknowledge their simplifying assumptions but point out that their core takeaway, outlined below, holds under broader assumptions.
In detail
How AI’s productivity increases are overestimated
AI’s productivity effect is widely contested. Aggregate statistics show little macroeconomic impact so far, yet individual developers report 5–10× speedups from coding agents. Tom Cunningham and Parker Whitfill (METR) argue that three distinct measures of AI productivity can diverge dramatically: uplift on old tasks, uplift on new tasks, and uplift in value (the actual gain in useful output, accounting for how workers rearrange their time). The policy-relevant measure, uplift in value, is bounded between the other two, and existing estimates typically measure the wrong one:
Uplift on old tasks ≤ Uplift in value ≤ Uplift on new tasks
Whitfill illustrates this through his own use of coding agents to build personal apps much faster. Before coding agents, building these apps was not worth the effort, suggesting their value was relatively limited. With AI, app-building productivity increased enough to justify building them, but the uplift on old tasks and in overall value was more limited. This is a case where uplift on new tasks can be much higher than uplift in value.
When AI doubles coding productivity without affecting software engineers’ documentation productivity, the overall uplift varies across the three AI productivity measures. Just as inflation shifts consumers toward cheaper products, AI’s jagged frontier shifts workers to tasks with higher uplift. It’s difficult to measure what we care about, namely, uplift in value, since both time costs and tasks change over time.
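To make the divergence concrete, here is a minimal worked sketch in Python. It assumes a worker with a fixed time budget and Cobb-Douglas preferences over task outputs. That is our simplifying assumption for illustration, not necessarily the specification the METR authors use, and the function name and the 50/50 time split are hypothetical. Under these assumptions, the three measures reduce to the weighted harmonic, geometric, and arithmetic means of the per-task speedups.

```python
import math


def three_uplifts(time_shares, speedups):
    """Return (uplift_on_old_tasks, uplift_in_value, uplift_on_new_tasks).

    time_shares: share of working time spent on each task (must sum to 1).
                 Under the ASSUMED Cobb-Douglas preferences these shares stay
                 fixed while the mix of outputs shifts toward sped-up tasks.
    speedups:    old time cost divided by new time cost, per task.
    """
    if abs(sum(time_shares) - 1.0) > 1e-9:
        raise ValueError("time shares must sum to 1")
    pairs = list(zip(time_shares, speedups))
    # Old tasks: how much faster the pre-AI output mix can now be produced.
    old = 1.0 / sum(a / s for a, s in pairs)
    # Value: ratio of utility after AI to utility before AI.
    value = math.prod(s ** a for a, s in pairs)
    # New tasks: how much longer the post-AI output mix would have taken before.
    new = sum(a * s for a, s in pairs)
    return old, value, new


# Hypothetical engineer: half the time on coding (2x faster with AI),
# half on documentation (unchanged).
old, value, new = three_uplifts([0.5, 0.5], [2.0, 1.0])
print(f"old tasks: {old:.3f}, value: {value:.3f}, new tasks: {new:.3f}")
# prints "old tasks: 1.333, value: 1.414, new tasks: 1.500"
```

In this toy case, a study that weights by pre-AI tasks would report a 33% uplift and one that weights by post-AI tasks would report 50%, while the gain in value sits at roughly 41%.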
Existing productivity measurements don’t account for the task shifts. For instance, Tamkin and McCrory (2025) use the old tasks as weights to estimate a 17% uplift, so this framework would suggest that it’s a lower bound on the true uplift.
However, METR identifies a countervailing upward bias. Tamkin and McCrory’s estimates are based on observed Claude queries, but these queries capture only the sub-tasks within broader O*NET categories where AI helps most. Workers selectively use AI for the parts of their job where it’s fastest, and are more likely to use Claude when the speedup is large. This means the observed 5× speedup on individual queries likely overstates the speedup on the full task category. On balance, METR suspects this upward bias dominates, suggesting the 17% figure may be an overestimate.
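The selection effect can be illustrated with simple arithmetic. The sketch below assumes that only a share of a task's time is spent on the sub-tasks where AI is used, and that the rest is unaffected. The shares are hypothetical, and the function is our illustration rather than METR's calculation.

```python
def task_speedup(assisted_share, query_speedup):
    """Whole-task speedup when only part of the task's time is accelerated.

    assisted_share: HYPOTHETICAL fraction of pre-AI task time spent on the
                    sub-tasks where AI is actually used (0 to 1).
    query_speedup:  speedup observed on those sub-tasks (e.g. 5.0).
    """
    if not 0.0 <= assisted_share <= 1.0:
        raise ValueError("assisted_share must be between 0 and 1")
    if query_speedup <= 0:
        raise ValueError("query_speedup must be positive")
    # Time left after AI, as a fraction of the original task time.
    remaining_time = (1.0 - assisted_share) + assisted_share / query_speedup
    return 1.0 / remaining_time


for share in (0.2, 0.5, 1.0):
    print(f"assisted share {share:.0%}: {task_speedup(share, 5.0):.2f}x")
```

If AI-assisted sub-tasks account for half of the task's time, a 5× speedup on queries translates into only about a 1.67× speedup on the full task. The two figures coincide only when the assisted share is 100%.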
The gaps between estimation methods might prove large, especially when AI provides a significant boost to new tasks. For example, coding agents now make it trivial to build personal apps that would never have justified the effort at pre-AI costs. This is high uplift on a new, so-called Cadillac task, but with limited economic value.
Our analysis: These Cadillac tasks may offer a more compelling explanation for the gap between individual and aggregate productivity gains than measurement error or diffusion lags.
Still, the authors’ analysis of Tamkin and McCrory (2025) leaves open questions. Their core example illustrates a different countervailing mechanism (selective use of AI on sub-tasks) from the shifts between tasks they initially lay out. We don’t find their sub-task examples convincing. Translation and writing are exactly the tasks that chatbots can do end-to-end, so the distinction between tasks and sub-tasks, and the substitution between sub-tasks, may not be as relevant in these examples. O*NET tasks are already fine-grained, and there may be little justification for breaking them down further.
How AI’s lack of self-assessment limits the potential of agents
Figure: Calibration on 93 SWE tasks, including success and token estimates.
Just as external measures of AI uplift can be misleading, agents’ own self-assessments are also unreliable. This miscalibrated self-assessment prevents the broad use of AI agents for economic transactions, according to Andrey Fradkin (Boston) and Rohit Krishnan, since such an assessment is necessary for agents to estimate their costs and quality before offering their services.
Six frontier LLMs were asked to forecast their own success probability and token costs on SWE-bench tasks. All were miscalibrated: token estimates were roughly 5× too low, and stated confidence ranged from 61% to 93% while actual pass rates clustered at 75–81%. Auctioning work on the basis of agents’ self-reported beliefs yields much worse results than allocating tasks based on their actual costs and capabilities. This suggests a key bottleneck that limits the current usefulness of autonomous AI in markets.
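A minimal sketch of how such a calibration check could be computed is shown below. The four records are hypothetical values chosen for illustration, not data from the paper, and the summary statistics (confidence gap, Brier score, token ratio) are our choice of metrics rather than necessarily the authors'.

```python
from statistics import mean

# HYPOTHETICAL records for one agent:
# (stated success probability, task passed?, estimated tokens, actual tokens)
runs = [
    (0.90, True, 20_000, 110_000),
    (0.95, False, 15_000, 70_000),
    (0.85, True, 30_000, 140_000),
    (0.90, True, 10_000, 55_000),
]


def calibration_summary(records):
    """Compare an agent's self-reports with what actually happened."""
    outcomes = [1.0 if passed else 0.0 for _, passed, _, _ in records]
    stated = [conf for conf, _, _, _ in records]
    confidence = mean(stated)
    pass_rate = mean(outcomes)
    # Brier score: mean squared gap between stated probability and outcome.
    brier = mean((c - o) ** 2 for c, o in zip(stated, outcomes))
    # How many times more tokens were used than the agent predicted.
    token_ratio = sum(r[3] for r in records) / sum(r[2] for r in records)
    return {
        "mean_confidence": confidence,
        "pass_rate": pass_rate,
        "overconfidence_gap": confidence - pass_rate,
        "brier_score": brier,
        "token_underestimate_ratio": token_ratio,
    }


summary = calibration_summary(runs)
for name, number in summary.items():
    print(f"{name}: {number:.3f}")
```

With these made-up records, the agent states 90% confidence on average but passes 75% of tasks, and it uses five times as many tokens as it predicted.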
In contrast to other agentic benchmarks, Fradkin and Krishnan combine the calibration task with an auction to test markets in a multi-agent setting. Overconfidence leads LLMs to lose substantial profits compared with a benchmark of perfect self-assessment.
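The allocation problem can be sketched with a toy example. The code below is not the authors' auction mechanism: it assumes a simplified rule in which the task goes to the agent with the highest stated expected surplus, and every agent name, probability, cost, and the task value are hypothetical numbers chosen to echo the ranges reported above.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    true_success: float    # actual probability of completing the task
    stated_success: float  # self-reported probability
    true_cost: float       # actual cost of attempting the task
    stated_cost: float     # self-reported cost (here 5x too low)

    def stated_surplus(self, task_value):
        return self.stated_success * task_value - self.stated_cost

    def true_surplus(self, task_value):
        return self.true_success * task_value - self.true_cost


def allocate(agents, task_value, use_self_reports):
    """Pick the agent with the highest expected surplus (ASSUMED rule)."""
    if use_self_reports:
        return max(agents, key=lambda a: a.stated_surplus(task_value))
    return max(agents, key=lambda a: a.true_surplus(task_value))


TASK_VALUE = 100.0  # hypothetical payoff for a completed task
agents = [
    Agent("cautious", 0.81, 0.61, 10.0, 2.0),
    Agent("moderate", 0.78, 0.80, 12.0, 2.4),
    Agent("overconfident", 0.75, 0.93, 15.0, 3.0),
]

by_report = allocate(agents, TASK_VALUE, use_self_reports=True)
by_truth = allocate(agents, TASK_VALUE, use_self_reports=False)
loss = by_truth.true_surplus(TASK_VALUE) - by_report.true_surplus(TASK_VALUE)
print(f"self-reports pick {by_report.name}, true values pick {by_truth.name}")
print(f"expected surplus lost: {loss:.1f}")
```

In this toy setting the task goes to the most overconfident agent, even though it is the least capable and the most expensive, and the expected surplus falls from 71 to 60.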
The authors conclude that these findings should lead AI developers to include self-awareness as a training goal. Reliable self-assessment may require contextual learning that makes agents aware of previous task outcomes. Until agents can reliably assess their own capabilities, coordination systems will need to rely on external evaluation rather than on agent self-reports.
Our analysis: This has implications beyond individual model performance, as the authors expose a key bottleneck for the autonomous economic use of AI agents. Normally, markets work because each participant knows their own costs and capabilities better than a central planner does, which is the classic Hayekian argument. Agents, however, cannot yet reliably generate private information about their own capabilities, which explains why agent markets hardly exist so far. An operator with access to evaluation histories and held-out benchmarks may already hold better information than the agent itself. Until self-assessment improves, AI users will need to rely on these external signals rather than on agent self-reports.
In Other News
Labor Market & Employment
● The Mayor of London announced an AI and Jobs Taskforce to review AI’s impact on jobs in London.
● Michelle Yin (Northwestern) et al. demonstrate that AI exposure scores depend heavily on the selected model, with agreement between them as low as 57%.
● Oren Danieli (Tel Aviv) and Masao Fukui (Boston) et al. model the importance of skill-replacing technological change and the simplification of jobs for rising inequality.
● Julian Jacobs (DeepMind) and Jordan Canedy (Forecasting Research Institute) find that a previous US retraining program against technological automation didn’t lead to job changes and that wage increases could be explained by mean reversion.
Policy
● Matt Bruenig (People’s Policy Project) advocates for a social wealth fund for America.
● David Shor (Blue Rose Research) started a Center for Shared AI Prosperity, jointly with former White House staffers, to get DC to take the future economic impacts of AI more seriously. The center has opened a request for ideas.
Research Opportunities
● The Anthropic Institute now offers Economics & Policy Fellowships for researchers to design and conduct empirical research on AI’s economic effects.
● The DC think tank ARI is hiring a policy analyst for their economic & societal transformation portfolio.
Thanks to Deric Cheng and Suchet Mittal for contributing to this week’s edition of the newsletter.




