Jun 2
3h 47m

#217 – Beth Barnes on the most important...

Rob, Luisa, and the 80,000 Hours team
About this episode

AI models today have a 50% chance of successfully completing a task that would take an expert human one hour. Seven months ago, that number was roughly 30 minutes — and seven months before that, 15 minutes. (See graph.)

These are substantial, multi-step tasks requiring sustained focus: building web applications, conducting machine learning research, or solving complex programming challenges.

Today’s guest, Beth Barnes, is CEO of METR (Model Evaluation & Threat Research) — the leading organisation measuring these capabilities.

Links to learn more, video, highlights, and full transcript: https://80k.info/bb

Beth's team has been timing how long it takes skilled humans to complete projects of varying length, then seeing how AI models perform on the same work. The resulting paper “Measuring AI ability to complete long tasks” made waves by revealing that the length of tasks AI models can complete (their “time horizon”) was doubling roughly every seven months. It's regarded by many as the most useful AI forecasting work in years.
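To make the doubling arithmetic concrete, here is a minimal sketch that assumes the trend is a clean exponential: a one-hour horizon today that doubles every seven months. The function name and constants are illustrative choices, not taken from METR's paper, and this is not METR's fitting method; it only reproduces the 15-, 30- and 60-minute figures quoted in the description.

```python
# Minimal sketch, assuming a clean exponential trend: the task length at which
# a model succeeds 50% of the time doubles every DOUBLING_MONTHS months.
# The 60-minute starting point and 7-month doubling time come from the episode
# description; the names used here are hypothetical.

DOUBLING_MONTHS = 7.0
CURRENT_HORIZON_MINUTES = 60.0


def horizon_minutes(months_from_now: float) -> float:
    """Return the 50%-success task length, in minutes, at an offset in months from today."""
    return CURRENT_HORIZON_MINUTES * 2 ** (months_from_now / DOUBLING_MONTHS)


# Negative offsets look back in time: 14 months ago, 7 months ago, and today.
for months in (-14, -7, 0):
    print(f"{months:+d} months: {horizon_minutes(months):.0f} minutes")
```

Running it prints 15, 30 and 60 minutes for 14 months ago, 7 months ago and today. Plugging in positive offsets is pure extrapolation of the stated trend, not a forecast made in the episode.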

Beth has found models can already do “meaningful work” improving themselves, and she wouldn’t be surprised if AI models were able to autonomously self-improve in as little as two years — in fact, “It seems hard to rule out even shorter [timelines]. Is there 1% chance of this happening in six, nine months? Yeah, that seems pretty plausible.”

Beth adds:

The sense I really want to dispel is, “But the experts must be on top of this. The experts would be telling us if it really was time to freak out.” The experts are not on top of this. Inasmuch as there are experts, they are saying that this is a concerning risk. … And to the extent that I am an expert, I am an expert telling you you should freak out.


What did you think of this episode? https://forms.gle/sFuDkoznxBcHPVmX6


Chapters:

  • Cold open (00:00:00)
  • Who is Beth Barnes? (00:01:19)
  • Can we see AI scheming in the chain of thought? (00:01:52)
  • The chain of thought is essential for safety checking (00:08:58)
  • Alignment faking in large language models (00:12:24)
  • We have to test model honesty even before they're used inside AI companies (00:16:48)
  • We have to test models when unruly and unconstrained (00:25:57)
  • Each 7 months models can do tasks twice as long (00:30:40)
  • METR's research finds AIs are solid at AI research already (00:49:33)
  • AI may turn out to be strong at novel and creative research (00:55:53)
  • When can we expect an algorithmic 'intelligence explosion'? (00:59:11)
  • Recursively self-improving AI might even be here in two years — which is alarming (01:05:02)
  • Could evaluations backfire by increasing AI hype and racing? (01:11:36)
  • Governments first ignore new risks, but can overreact once they arrive (01:26:38)
  • Do we need external auditors doing AI safety tests, not just the companies themselves? (01:35:10)
  • A case against safety-focused people working at frontier AI companies (01:48:44)
  • The new, more dire situation has forced changes to METR's strategy (02:02:29)
  • AI companies are being locally reasonable, but globally reckless (02:10:31)
  • Overrated: Interpretability research (02:15:11)
  • Underrated: Developing more narrow AIs (02:17:01)
  • Underrated: Helping humans judge confusing model outputs (02:23:36)
  • Overrated: Major AI companies' contributions to safety research (02:25:52)
  • Could we have a science of translating AI models' nonhuman language or neuralese? (02:29:24)
  • Could we ban using AI to enhance AI, or is that just naive? (02:31:47)
  • Open-weighting models is often good, and Beth has changed her attitude to it (02:37:52)
  • What we can learn about AGI from the nuclear arms race (02:42:25)
  • Infosec is so bad that no models are truly closed-weight models (02:57:24)
  • AI is more like bioweapons because it undermines the leading power (03:02:02)
  • What METR can do best that others can't (03:12:09)
  • What METR isn't doing that other people have to step up and do (03:27:07)
  • What research METR plans to do next (03:32:09)

This episode was originally recorded on February 17, 2025.

Video editing: Luke Monsour and Simon Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Transcriptions and web: Katy Moore
