Sep 25
1h 46m

Why AI evals are the hottest new skill f...

Lenny Rachitsky
About this episode

Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product.

What you’ll learn:

1. WTF evals are

2. Why they’ve become the most important new skill for AI product builders

3. A step-by-step walkthrough of how to create an effective eval

4. A deep dive into error analysis, open coding, and axial coding

5. Code-based evals vs. LLM-as-judge

6. The most common pitfalls and how to avoid them

7. Practical tips for implementing evals with minimal time investment (30 minutes per week after initial setup)

8. Insight into the debate between “vibes” and systematic evals

Brought to you by:

Fin—The #1 AI agent for customer service

Dscout—The UX platform to capture insights at every stage: from ideation to production

Mercury—The art of simplified finances

Where to find Shreya Shankar

• X: https://x.com/sh_reya

• LinkedIn: https://www.linkedin.com/in/shrshnk/

• Website: https://www.sh-reya.com/

• Maven course: https://bit.ly/4myp27m

Where to find Hamel Husain

• X: https://x.com/HamelHusain

• LinkedIn: https://www.linkedin.com/in/hamelhusain/

• Website: https://hamel.dev/

• Maven course: https://bit.ly/4myp27m

In this episode, we cover:

(00:00) Introduction to Hamel and Shreya

(04:57) What are evals?

(09:56) Demo: Examining real traces from a property management AI assistant

(16:51) Writing notes on errors

(23:54) Why LLMs can’t replace humans in the initial error analysis

(25:16) The concept of a “benevolent dictator” in the eval process

(28:07) Theoretical saturation: when to stop

(31:39) Using axial codes to help categorize and synthesize error notes

(44:39) The results

(46:06) Building an LLM-as-judge to evaluate specific failure modes

(48:31) The difference between code-based evals and LLM-as-judge

(52:10) Example: LLM-as-judge

(54:45) Testing your LLM judge against human judgment

(01:00:51) Why evals are the new PRDs for AI products

(01:05:09) How many evals you actually need

(01:07:41) What comes after evals

(01:09:57) The great evals debate

(01:15:15) Why dogfooding isn’t enough for most AI products

(01:18:23) OpenAI’s Statsig acquisition

(01:23:02) The Claude Code controversy and the importance of context

(01:24:13) Common misconceptions around evals

(01:22:28) Tips and tricks for implementing evals effectively

(01:30:37) The time investment

(01:33:38) Overview of their comprehensive evals course

(01:37:57) Lightning round and final thoughts

LLM Log Open Codes Analysis Prompt:

Please analyze the following CSV file. There is a metadata field which has a nested field called z_note that contains open codes for analysis of LLM logs that we are conducting. Please extract all of the different open codes. From the z_note field, propose 5-6 categories that we can create axial codes from.
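
As a rough illustration of the extraction step this prompt asks for, here is a minimal Python sketch that collects the open codes from such a CSV before they are grouped into axial codes. It rests on assumptions that are not confirmed by the episode: the column is named metadata, its cells hold JSON, and z_note may sit at any nesting depth. The helper names and the sample rows are hypothetical.

```python
import csv
import io
import json
from collections import Counter


def find_notes(value, key="z_note"):
    """Yield every non-empty string stored under `key` at any nesting depth."""
    if isinstance(value, dict):
        for k, v in value.items():
            if k == key and isinstance(v, str) and v.strip():
                yield v.strip()
            else:
                yield from find_notes(v, key)
    elif isinstance(value, list):
        for item in value:
            yield from find_notes(item, key)


def extract_open_codes(csv_text, column="metadata"):
    """Count the open codes found in the JSON-encoded `column` of a CSV."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row.get(column) or ""
        try:
            metadata = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip rows whose metadata is not valid JSON
        counts.update(find_notes(metadata))
    return counts


# Hypothetical sample data; the real export format may differ.
sample = '''trace_id,metadata
1,"{""review"": {""z_note"": ""did not hand off to human""}}"
2,"{""review"": {""z_note"": ""did not hand off to human""}}"
3,"{""review"": {""z_note"": ""offered tour time that was unavailable""}}"
4,"{""review"": {}}"
'''

open_codes = extract_open_codes(sample)
for note, count in open_codes.most_common():
    print(count, note)
```

The resulting list of distinct notes, with counts, is what would then be handed to an LLM (or read by a person) to propose the 5-6 axial-code categories.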

Referenced:

• Building eval systems that improve your AI product: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve

• Mercor: https://mercor.com/

• Brendan Foody on LinkedIn: https://www.linkedin.com/in/brendan-foody-2995ab10b

• Nurture Boss: https://nurtureboss.io/

• Braintrust: https://www.braintrust.dev/

• Andrew Ng on X: https://x.com/andrewyng

• Carrying Out Error Analysis: https://www.youtube.com/watch?v=JoAxZsdw_3w

• Julius AI: https://julius.ai/

• Brendan Foody on X—“evals are the new PRDs”: https://x.com/BrendanFoody/status/1939764763485171948

• Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://dl.acm.org/doi/abs/10.1145/3654777.3676450

• Lenny’s post on X about evals: https://x.com/lennysan/status/1909636749103599729

• Statsig: https://statsig.com/

• Claude Code: https://www.anthropic.com/claude-code

• Cursor: https://cursor.com/

• Occam’s razor: https://en.wikipedia.org/wiki/Occam%27s_razor

• Frozen: https://www.imdb.com/title/tt2294629/

• The Wire on HBO: https://en.wikipedia.org/wiki/The_Wire

Recommended books:

• Pachinko: https://www.amazon.com/Pachinko-National-Book-Award-Finalist/dp/1455563935

• Apple in China: The Capture of the World’s Greatest Company: https://www.amazon.com/Apple-China-Capture-Greatest-Company/dp/1668053373/

• Machine Learning: https://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/1259096955

• Artificial Intelligence: A Modern Approach: https://www.amazon.com/Artificial-Intelligence-Modern-Approach-Global/dp/1292401133/

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email podcast@lennyrachitsky.com.

Lenny may be an investor in the companies discussed.

To hear more, visit www.lennysnewsletter.com