Sep 25
1h 46m

Why AI evals are the hottest new skill f...

Lenny Rachitsky
About this episode

Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product.

What you’ll learn:

1. WTF evals are

2. Why they’ve become the most important new skill for AI product builders

3. A step-by-step walkthrough of how to create an effective eval

4. A deep dive into error analysis, open coding, and axial coding

5. Code-based evals vs. LLM-as-judge

6. The most common pitfalls and how to avoid them

7. Practical tips for implementing evals with minimal time investment (30 minutes per week after initial setup)

8. Insight into the debate between “vibes” and systematic evals

Brought to you by:

Fin—The #1 AI agent for customer service

Dscout—The UX platform to capture insights at every stage: from ideation to production

Mercury—The art of simplified finances

Where to find Shreya Shankar

• X: https://x.com/sh_reya

• LinkedIn: https://www.linkedin.com/in/shrshnk/

• Website: https://www.sh-reya.com/

• Maven course: https://bit.ly/4myp27m

Where to find Hamel Husain

• X: https://x.com/HamelHusain

• LinkedIn: https://www.linkedin.com/in/hamelhusain/

• Website: https://hamel.dev/

• Maven course: https://bit.ly/4myp27m

In this episode, we cover:

(00:00) Introduction to Hamel and Shreya

(04:57) What are evals?

(09:56) Demo: Examining real traces from a property management AI assistant

(16:51) Writing notes on errors

(23:54) Why LLMs can’t replace humans in the initial error analysis

(25:16) The concept of a “benevolent dictator” in the eval process

(28:07) Theoretical saturation: when to stop

(31:39) Using axial codes to help categorize and synthesize error notes

(44:39) The results

(46:06) Building an LLM-as-judge to evaluate specific failure modes

(48:31) The difference between code-based evals and LLM-as-judge

(52:10) Example: LLM-as-judge

(54:45) Testing your LLM judge against human judgment

(01:00:51) Why evals are the new PRDs for AI products

(01:05:09) How many evals you actually need

(01:07:41) What comes after evals

(01:09:57) The great evals debate

(01:15:15) Why dogfooding isn’t enough for most AI products

(01:18:23) OpenAI’s Statsig acquisition

(01:23:02) The Claude Code controversy and the importance of context

(01:24:13) Common misconceptions around evals

(01:22:28) Tips and tricks for implementing evals effectively

(01:30:37) The time investment

(01:33:38) Overview of their comprehensive evals course

(01:37:57) Lightning round and final thoughts

LLM Log Open Codes Analysis Prompt:

Please analyze the following CSV file. There is a metadata field that has a nested field called z_note, which contains open codes for an analysis of LLM logs that we are conducting. Please extract all of the different open codes. From the z_note field, propose 5-6 categories that we can create axial codes from.
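The extraction step that this prompt asks for can also be sketched in plain Python before anything is handed to an LLM. This is a minimal sketch under assumptions that the episode notes do not confirm: the column is assumed to be named `metadata`, it is assumed to hold a JSON object, and the sample rows and their notes are made up for illustration. Only the `z_note` field name comes from the prompt above.

```python
import csv
import io
import json
from collections import Counter


def extract_open_codes(csv_text, metadata_column="metadata", note_key="z_note"):
    """Collect the non-empty open-code notes nested inside a metadata column.

    Assumption: each metadata cell is a JSON object such as {"z_note": "..."}.
    """
    notes = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row.get(metadata_column) or ""
        try:
            metadata = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip rows whose metadata is empty or not valid JSON
        if not isinstance(metadata, dict):
            continue
        note = metadata.get(note_key)
        if isinstance(note, str) and note.strip():
            notes.append(note.strip())
    return notes


def build_sample_csv():
    """Build a small, made-up CSV export for illustration only."""
    rows = [
        {"trace_id": "t1", "metadata": json.dumps({"z_note": "Did not hand off to a human"})},
        {"trace_id": "t2", "metadata": json.dumps({"z_note": ""})},
        {"trace_id": "t3", "metadata": json.dumps({"z_note": "Offered a tour time that does not exist"})},
        {"trace_id": "t4", "metadata": "not json"},
        {"trace_id": "t5", "metadata": json.dumps({"z_note": "Did not hand off to a human"})},
    ]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["trace_id", "metadata"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample_csv = build_sample_csv()
open_codes = extract_open_codes(sample_csv)

# Count repeated notes so recurring failures stand out before categorizing.
code_counts = Counter(open_codes)
for code, count in code_counts.most_common():
    print(f"{count}x {code}")
```

The resulting list of notes is what would be pasted into the prompt, or reviewed by hand, when proposing the 5-6 axial-code categories. Rows with an empty note or unparseable metadata are skipped rather than guessed at.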

Referenced:

• Building eval systems that improve your AI product: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve

• Mercor: https://mercor.com/

• Brendan Foody on LinkedIn: https://www.linkedin.com/in/brendan-foody-2995ab10b

• Nurture Boss: https://nurtureboss.io/

• Braintrust: https://www.braintrust.dev/

• Andrew Ng on X: https://x.com/andrewyng

• Carrying Out Error Analysis: https://www.youtube.com/watch?v=JoAxZsdw_3w

• Julius AI: https://julius.ai/

• Brendan Foody on X—“evals are the new PRDs”: https://x.com/BrendanFoody/status/1939764763485171948

• Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://dl.acm.org/doi/abs/10.1145/3654777.3676450

• Lenny’s post on X about evals: https://x.com/lennysan/status/1909636749103599729

• Statsig: https://statsig.com/

• Claude Code: https://www.anthropic.com/claude-code

• Cursor: https://cursor.com/

• Occam’s razor: https://en.wikipedia.org/wiki/Occam%27s_razor

• Frozen: https://www.imdb.com/title/tt2294629/

• The Wire on HBO: https://en.wikipedia.org/wiki/The_Wire

Recommended books:

• Pachinko: https://www.amazon.com/Pachinko-National-Book-Award-Finalist/dp/1455563935

• Apple in China: The Capture of the World’s Greatest Company: https://www.amazon.com/Apple-China-Capture-Greatest-Company/dp/1668053373/

• Machine Learning: https://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/1259096955

• Artificial Intelligence: A Modern Approach: https://www.amazon.com/Artificial-Intelligence-Modern-Approach-Global/dp/1292401133/

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@lennyrachitsky.com.

Lenny may be an investor in the companies discussed.

My biggest takeaways from this conversation:



To hear more, visit www.lennysnewsletter.com