Oct 13
54m 48s

Evals, error analysis, and better prompt...

Claire Vo
About this episode

Hamel Husain, an AI consultant and educator, shares his systematic approach to improving AI product quality through error analysis, evaluation frameworks, and prompt engineering. In this episode, he demonstrates how product teams can move beyond “vibe checking” their AI systems to implement data-driven quality improvement processes that identify and fix the most common errors. Using real examples from client work with Nurture Boss (an AI assistant for property managers), Hamel walks through practical techniques that product managers can implement immediately to dramatically improve their AI products.


What you’ll learn:

1. A step-by-step error analysis framework that helps identify and categorize the most common AI failures in your product

2. How to create custom annotation systems that make reviewing AI conversations faster and more insightful

3. Why binary evaluations (pass/fail) are more useful than arbitrary quality scores for measuring AI performance

4. Techniques for validating your LLM judges to ensure they align with human quality expectations

5. A practical approach to prioritizing fixes based on frequency counting rather than intuition

6. Why looking at real user conversations (not just ideal test cases) is critical for understanding AI product failures

7. How to build a comprehensive quality system that spans from manual review to automated evaluation
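Items 3, 4, and 5 above can be sketched in a few lines of Python. This is a minimal illustration, not code from the episode; the trace data, failure categories, and labels are all hypothetical:

```python
from collections import Counter

# Hypothetical annotated traces: each real conversation gets a binary
# pass/fail label plus a short failure category (empty when it passed).
traces = [
    {"id": 1, "passed": False, "category": "hallucinated tour time"},
    {"id": 2, "passed": True,  "category": ""},
    {"id": 3, "passed": False, "category": "missed handoff to human"},
    {"id": 4, "passed": False, "category": "hallucinated tour time"},
    {"id": 5, "passed": True,  "category": ""},
]

# Binary evaluation: a single pass rate is easier to act on than an
# averaged 1-5 quality score.
pass_rate = sum(t["passed"] for t in traces) / len(traces)

# Frequency counting: rank failure categories by how often they occur,
# so the most common error gets fixed first.
failures = Counter(t["category"] for t in traces if not t["passed"])
top_errors = failures.most_common()

# Judge validation: check how often a hypothetical LLM judge's
# pass/fail labels agree with human labels on the same traces.
human_labels = [False, True, False, False, True]
judge_labels = [False, True, True, False, True]
agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)

print(f"pass rate: {pass_rate:.0%}")
for category, count in top_errors:
    print(f"{count}x  {category}")
print(f"judge agreement: {agreement:.0%}")
```

The point of the binary label is that disagreements are unambiguous, which makes both the frequency counts and the judge-agreement check meaningful.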

Brought to you by:

GoFundMe Giving Funds—One account. Zero hassle: https://gofundme.com/howiai

Persona—Trusted identity verification for any use case: https://withpersona.com/lp/howiai

Where to find Hamel Husain:

Website: https://hamel.dev/

Twitter: https://twitter.com/HamelHusain

Course: https://maven.com/parlance-labs/evals

GitHub: https://github.com/hamelsmu

Where to find Claire Vo:

ChatPRD: https://www.chatprd.ai/

Website: https://clairevo.com/

LinkedIn: https://www.linkedin.com/in/clairevo/

X: https://x.com/clairevo

In this episode, we cover:

(00:00) Introduction to Hamel Husain

(03:05) The fundamentals: why data analysis is critical for AI products

(06:58) Understanding traces and examining real user interactions

(13:35) Error analysis: a systematic approach to finding AI failures

(17:40) Creating custom annotation systems for faster review

(22:23) The impact of this process

(25:15) Different types of evaluations

(29:30) LLM-as-a-Judge

(33:58) Improving prompts and system instructions

(38:15) Analyzing agent workflows

(40:38) Hamel’s personal AI tools and workflows

(48:02) Lightning round and final thoughts

Tools referenced:

• Claude: https://claude.ai/

• Braintrust: https://www.braintrust.dev/docs/start

• Phoenix: https://phoenix.arize.com/

• AI Studio: https://aistudio.google.com/

• ChatGPT: https://chat.openai.com/

• Gemini: https://gemini.google.com/

Other references:

• Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://dl.acm.org/doi/10.1145/3654777.3676450

• Nurture Boss: https://nurtureboss.io

• Rechat: https://rechat.com/

• Your AI Product Needs Evals: https://hamel.dev/blog/posts/evals/

• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/

• Creating a LLM-as-a-Judge That Drives Business Results: https://hamel.dev/blog/posts/llm-judge/

• Lenny’s List on Maven: https://maven.com/lenny

Production and marketing by https://penname.co/. For inquiries about sponsoring the podcast, email jordan@penname.co.
