Jul 11
1h 34m

The PM’s Role in AI Evals: Step-by-Step

Aakash Gupta
About this episode

Today, we’ve got some of our most requested guests yet: Hamel Husain and Shreya Shankar, creators of the world’s best AI Evals cohort.

You’ll learn:

- Why AI evaluations are the most critical skill for building successful AI products

- What common mistakes people are making and how to avoid them

- How to effectively "hill climb" towards better AI performance

If you're building AI features, or aiming to master how AI evals actually work, this episode is your step-by-step blueprint.

----

Brought to you by:

The AI Evals Course for PMs & Engineers: You get $800 with this link

Jira Product Discovery: Plan with purpose, ship with confidence

Vanta: Automate compliance, security, and trust with AI (Get $1,000 with my link)

AI PM Certification: Get $500 with code AAKASH25

----

Timestamps:

00:00:00 - Preview

00:02:06 - Three reasons PMs NEED evals.

00:04:40 - Why PMs shouldn't view evals as monotonous

00:06:23 - Are evals the hardest part of AI products solved?

00:07:37 - Why can't you just rely on human "vibe checks"?

00:12:11 - Ad 1 (AI Evals Course)

00:13:10 - Ad 2 (Jira Product Discovery)

00:14:06 - Are LLMs good at 1-5 ratings?

00:15:45 - The "Whack-a-mole" analogy without evals

00:16:26 - Hallucination problem in emails (Apollo story)

00:21:22 - How Airbnb used machine learning models

00:23:56 - Evaluating RAG Systems.

00:29:52 - Ad 3 (Vanta)

00:30:56 - Ad 4 (AI PM Certification on Maven)

00:31:42 - Hill Climbing

00:35:51 - Red flag: Suspiciously high eval metrics

00:39:02 - Design principles for effective evals

00:42:42 - How OpenAI approaches evals

00:44:39 - Foundation models are trained on "average taste"

00:49:36 - Cons of fine-tuning

00:51:27 - Prompt engineering vs. RAG vs. Fine-tuning

00:53:00 - Introduction of "The Three Gulfs" framework

00:56:04 - Roadmap for learning AI evals

01:01:41 - Why error analysis is critical for LLMs

01:08:29 - Using LLM as a judge

01:10:15 - Frameworks for systematic problem-solving in labels

01:17:42 - Importance of niche and qualifying clients. (Pro tips)

01:18:43 - $800K for first course cohort!

01:20:15 - Why end a successful cohort?

01:25:49 - GOLD advice for creating a successful course

01:33:39 - Outro

----

Key Takeaways:

1. Stop Guessing. Eval Your AI. Your AI isn’t an MVP without robust evaluations. Build in judgment — or you’re just shipping hope. Without evaluation, AI performance is a happy accident.

2. Error Analysis = Your Superpower. General metrics won’t save you. You need to understand why your AI messed up. Only then can you fix it — not just wish it worked better.

3. 99% Accuracy is a LIE. Suspiciously high metrics usually mean your evaluation setup is broken. Real-world AI is never perfect. If your evals say otherwise, they’re flawed.

4. Fine-Tuning is a Trap (Mostly). Fine-tuning is expensive, brittle, and often unnecessary. Start with smarter prompts and RAG. Only fine-tune if you must.

5. Your Data’s Wild. Understand It. You can’t eyeball everything. Without structured evaluation, you’ll drown in noise and never find patterns or fixes that matter.

6. Models Fail to Generalize. Always. Your AI will break on new data. Don’t blame it. Adapt it. Use RAG, upgrade inputs, and stop expecting out-of-the-box magic.

7. OpenAI Doesn’t Get Your Vibe. Their models are average-taste. Your product isn’t. If you want your brand’s voice in your AI, you must define it yourself — with evals.

8. Trust LLM Judges... but validate them hard. LLMs can scale your evals — but you still need to verify them against human-labeled data. Don’t blindly trust your judge.

9. Your Prompts Are S**T. If your AI is bad, it’s probably your fault. The cheapest, most powerful fix? Sharpen your prompts. Clearer instructions = smarter AI.

10. Let AI Teach You. Seriously. LLM judges aren’t just scoring you — they can teach you. Reviewing how your AI fails is the best way to learn what great outputs should look like.
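
To make takeaway 8 concrete, here is a minimal sketch of checking an LLM judge against human labels. It is an illustration and not the guests' method: the function name `judge_agreement`, the pass/fail label format, and the sample data are all hypothetical. The idea is to compare the judge's verdicts with human verdicts on the same outputs and to look at agreement on passes and on failures separately, since a single overall number can hide a judge that approves almost everything.

```python
def judge_agreement(human, judge):
    """Compare LLM-judge pass/fail labels against human pass/fail labels.

    Both arguments are equal-length lists of booleans (True = pass).
    The names and label format are assumptions for illustration only.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")

    # Outputs where both human and judge said "pass" / both said "fail".
    both_pass = sum(h and j for h, j in zip(human, judge))
    both_fail = sum((not h) and (not j) for h, j in zip(human, judge))

    human_pass = sum(human)
    human_fail = len(human) - human_pass

    return {
        # Share of all outputs where the judge matches the human.
        "agreement": (both_pass + both_fail) / len(human),
        # Of the outputs humans passed, how many the judge also passed.
        "true_positive_rate": both_pass / human_pass if human_pass else None,
        # Of the outputs humans failed, how many the judge also failed.
        "true_negative_rate": both_fail / human_fail if human_fail else None,
    }


# Made-up labels for eight AI outputs, purely for demonstration.
human_labels = [True, True, True, False, False, True, False, True]
judge_labels = [True, True, False, False, True, True, False, True]

report = judge_agreement(human_labels, judge_labels)
print(report)
```

If the true negative rate is low, the judge is waving through outputs that humans rejected, which is one way the "suspiciously high metrics" from takeaway 3 can arise.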

----

Check it out on Apple, Spotify, or YouTube.

----

Related Podcasts:

Complete Course: AI Product Management

Tutorial of Top 5 AI Prototyping Tools

If you only have 2 hrs, this is how to become an AI PM

College Dropout Raised $20M Building AI Tools | Cluely, Roy Lee

Bolt CEO and Founder on How he Hit $30M ARR in a Year

LogRocket CEO and Founder on How to Build a $100M+ AI Startup

Amplitude CEO and Founder on Building the Product Analytics Leader

----

P.S. More than 85% of you aren't subscribed yet. If you subscribe on YouTube and follow on Apple and Spotify, my commitment to you is that we'll keep making this content better.

----

If you want to advertise, email productgrowthppp at gmail.



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.news.aakashg.com/subscribe