Brief #1: OpenAI and the Center for AI Safety disagree on how automatable remote work is
Benchmarks of remote labor differ dramatically depending on the content and context of the task.
Welcome! This bi-weekly newsletter, published by the Windfall Trust, will curate the most important developments in AI economics research and policy. Each issue features key papers with in-depth analysis, along with quick links to relevant opportunities and recent news.
Need to Know
Benchmarks of remote labor from OpenAI and the Center for AI Safety differ dramatically depending on the content and context of the task.
A new theory suggests AI automation will hit later but faster than expected, which would have major implications for policy timing.
In DC, Anthropic presented policy proposals for the economic impacts of AI under different scenarios, such as a sovereign wealth fund, tax code changes, and adjustment assistance for displaced workers.
Highlights
Why AI Benchmarks Tell Conflicting Stories
AI agents bombed a real-world work test. The Remote Labor Index, which tests actual freelance tasks, found that agents could complete less than 2.5% of them successfully. This is a sharp reality check following lab results that show models approaching human-level performance. Similarly, the freelance market has changed little since the deployment of generative AI. Earlier freelancer experiments showed much larger effects, but they focused on a controlled setting for writing and research tasks, which account for only a small fraction of overall freelance work.
This aligns with OpenAI’s recent GDPval paper, which evaluated frontier models on well-defined tasks drawn from 44 occupations and created by professionals with an average of 14 years of experience. Human raters preferred the leading model, Claude 4.1 Opus, over the professional’s response in 47% of cases, indicating that AI is beginning to approach parity, but only when given detailed task specifications, multiple attempts, and human review time. This suggests AI is effective for augmentation but far from autonomous task completion. While GDPval covers more of the labor market, models might overfit to its detailed task descriptions, whereas the Remote Labor Index uses real-world, underspecified freelancer briefs. GDPval also focused only on self-contained tasks with limited need for tacit and contextual knowledge. What AI can theoretically do on well-specified tasks and what it can practically do in real-world settings are two different things.
Finally, the Center for AI Safety also recently published A Definition of AGI, which proposes a quantifiable framework to measure a comprehensive range of capabilities across categories such as “auditory processing” or “long-term memory storage”. Their main finding was that GPT-5 is making significant progress on this benchmark, reaching 57% this year, up from 27% for GPT-4 in 2023.
Predictions of the rapid automation of remote work have varied dramatically, partly due to a lack of reliable data. These benchmarks diverge sharply, which highlights both the promise and the limitations of AI for automating remote work. While AI appears able to solve most self-contained tasks nearly as well as humans when prompts are highly detailed, with best-of-N sampling and multiple reference files, this doesn’t hold in more general settings with underspecified instructions. The year of the agent hasn’t arrived yet.
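Part of the gap between benchmark scores and real-world performance is simple arithmetic: if a model solves a task with probability p on a single attempt, grading the best of N attempts inflates the apparent success rate to 1 − (1 − p)^N. A minimal sketch, with illustrative probabilities that are not figures from GDPval or the Remote Labor Index:

```python
# Illustrative only: how best-of-N evaluation inflates apparent success rates.
# p is a hypothetical single-attempt success probability, not a number
# reported by either benchmark.

def best_of_n(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1 - (1 - p) ** n

for p in (0.05, 0.20, 0.40):
    print(f"single attempt {p:.0%} -> best-of-8 {best_of_n(p, 8):.1%}")
```

Even a weak single-attempt solver looks far stronger under best-of-N scoring, which is one reason lab results and one-shot freelance deployment can tell such different stories.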
Automation May Hit Later—But Faster
A new Stanford paper suggests we’ve been thinking about AI automation backward. Philip Trammell argues that what matters is workstreams (task sequences): automation has a significant effect only once most tasks in a workstream are automated. This workstream perspective could reconcile why task-based models of automation predict both the limited impacts we’re seeing from current AI and the considerable impacts expected from full automation. By contrast, recent task-based models, such as those by Acemoglu (2025) and Aghion and Bunel (2024), extrapolate from the limited effects of current AI exposure to labor savings of at most 40% even under full automation. Taking this research seriously would mean that automation arrives later, but faster, than classical task models predict. This matters for policy timing: if automation arrives in sudden jumps rather than gradually, reactive measures like retraining programs will be less effective. Policymakers would need to strengthen social safety nets before automation accelerates, not after displacement has begun.
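The “later but faster” intuition can be sketched numerically. Assume a stylized workstream of n tasks that frees up its worker only when every task in it is automated (this toy bottleneck assumption is ours, not the paper’s model); aggregate labor savings then stay flat for a long time and turn steep only near full automation:

```python
# Illustrative sketch of the workstream intuition (not Trammell's actual model):
# if each of n tasks in a workstream is automated independently with
# probability a, only a**n of workstreams are fully automated, so labor
# savings lag task-level automation until a is close to 1.

def labor_savings(task_share_automated: float, tasks_per_workstream: int) -> float:
    """Share of workstreams fully automated, i.e. the toy model's labor savings."""
    return task_share_automated ** tasks_per_workstream

n = 10  # hypothetical workstream length
for a in (0.5, 0.8, 0.9, 0.99):
    print(f"{a:.0%} of tasks automated -> {labor_savings(a, n):.1%} labor savings")
```

Contrast this with a linear task model, where savings equal the task share automated: here, 80% task automation yields only about 11% workstream savings, while the last few percentage points of task automation deliver most of the impact.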
Anthropic Kickstarts Discussion On AI Economic Policy in DC
Anthropic hosted a convening of economists and policymakers in DC and gathered a broad range of policy proposals in an area where concrete ideas have been scarce so far. It grouped them from conventional to radical: the conventional end includes reskilling, tax reform, and the development of AI infrastructure, while AI adjustment mechanisms (similar to trade adjustment mechanisms) would assist displaced workers in moderate scenarios, and a sovereign wealth fund, additional value-added taxes, and new government revenue streams would address fast-moving scenarios. In response, Julian Jacobs (DeepMind) suggested further ideas, such as unemployment insurance, the earned income tax credit, and rewards for contributions outside of formal employment.
The proposals represent a wide-ranging exploration rather than consensus recommendations, and the breadth of options discussed highlights just how early the field of AI economic policy is: it remains almost entirely at the stage of idea generation. Anthropic’s move into this uncharted territory is a first step towards opening a sizable can of worms that most researchers, much less policymakers, have yet to grapple with.
In Other News
Pascual Restrepo’s (Yale) new paper models an AGI-driven economy where compute can perform nearly all valuable work, so that output scales with compute and wages fall toward the cost of compute. Andrey Fradkin (Boston University) and Seth Benzell (Chapman University) also discuss this paper on their Justified Posteriors podcast.
Microsoft published its AI Diffusion Report, which details the fast diffusion of AI compared to previous general-purpose technologies, but highlights the gap in AI use between the Global South and North.
Bharat Chandar (Stanford) wrote about his thoughts on the current state of knowledge regarding AI’s labor market impact. While AI may reduce hiring for early-career workers, the overall employment effect of AI is currently small, and economists need better data on AI adoption to conduct effective research.
Agrawal et al. published a new paper that models “genius on demand”, showing that AI systems acting like on-call geniuses shift problem-solving toward new domains, initially complementing human experts but eventually replacing routine knowledge work as AI becomes more capable.
Will Rinehart collected a list of empirical econ papers on AI.
Luis Garicano (LSE) and Luis Rayo (Northwestern) created a model to investigate the long-term effects of reduced hiring of junior workers. If the additional value of human-AI collaboration falls below a threshold compared to AI on its own, apprenticeship and on-the-job training become ineffective.
AI evaluator METR is seeking a Human Data Lead to collect and analyze high-quality human data, providing an excellent opportunity for empirical economists to contribute.
Windfall Trust (that’s us!) is looking for a Communications Director.
Thanks to Suchet Mittal, Joel Christoph, and Ankit Mishra for contributing to the creation of this week’s edition of the newsletter.
Not interested in receiving these updates? Unsubscribe from the AI Economics Brief using the Substack link below: