Jul 11
1h 34m

The PM’s Role in AI Evals: Step-by-Step

Aakash Gupta
About this episode

Today, we’ve got some of our most requested guests yet: Hamel Husain and Shreya Shankar, creators of the world’s best AI Evals cohort.

You’ll learn:

- Why AI evaluations are the most critical skill for building successful AI products

- What common mistakes people are making and how to avoid them

- How to effectively "hill climb" towards better AI performance

If you're building AI features, or aiming to master how AI evals actually work, this episode is your step-by-step blueprint.

----

Brought to you by:

The AI Evals Course for PMs & Engineers: You get $800 with this link

Jira Product Discovery: Plan with purpose, ship with confidence

Vanta: Automate compliance, security, and trust with AI (Get $1,000 with my link)

AI PM Certification: Get $500 with code AAKASH25

----

Timestamps:

00:00:00 - Preview

00:02:06 - Three reasons PMs NEED evals

00:04:40 - Why PMs shouldn't view evals as monotonous

00:06:23 - Are evals the hardest part of AI products solved?

00:07:37 - Why can't you just rely on human "vibe checks"?

00:12:11 - Ad 1 (AI Evals Course)

00:13:10 - Ad 2 (Jira Product Discovery)

00:14:06 - Are LLMs good at 1-5 ratings?

00:15:45 - The "Whack-a-mole" analogy without evals

00:16:26 - Hallucination problem in emails (Apollo story)

00:21:22 - How Airbnb used machine learning models?

00:23:56 - Evaluating RAG Systems

00:29:52 - Ad 3 (Vanta)

00:30:56 - Ad 4 (AIPM Certification on Maven)

00:31:42 - Hill Climbing

00:35:51 - Red flag: Suspiciously high eval metrics

00:39:02 - Design principles for effective evals

00:42:42 - How OpenAI approaches evals

00:44:39 - Foundation models are trained on "average taste"

00:49:36 - Cons of fine-tuning

00:51:27 - Prompt engineering vs. RAG vs. Fine-tuning

00:53:00 - Introduction of "The Three Gulfs" framework

00:56:04 - Roadmap for learning AI evals

01:01:41 - Why error analysis is critical for LLMs

01:08:29 - Using LLM as a judge

01:10:15 - Frameworks for systematic problem-solving in labels

01:17:42 - Importance of niche and qualifying clients. (Pro tips)

01:18:43 - $800K for first course cohort!

01:20:15 - Why end a successful cohort?

01:25:49 - GOLD advice for creating a successful course

01:33:39 - Outro

----

Key Takeaways:

1. Stop Guessing. Eval Your AI. Your AI isn’t an MVP without robust evaluations. Build in judgment — or you’re just shipping hope. Without evaluation, AI performance is a happy accident.

2. Error Analysis = Your Superpower. General metrics won’t save you. You need to understand why your AI messed up. Only then can you fix it — not just wish it worked better.

3. 99% Accuracy is a LIE. Suspiciously high metrics usually mean your evaluation setup is broken. Real-world AI is never perfect. If your evals say otherwise, they’re flawed.

4. Fine-Tuning is a Trap (Mostly). Fine-tuning is expensive, brittle, and often unnecessary. Start with smarter prompts and RAG. Only fine-tune if you must.

5. Your Data’s Wild. Understand It. You can’t eyeball everything. Without structured evaluation, you’ll drown in noise and never find patterns or fixes that matter.

6. Models Fail to Generalize. Always. Your AI will break on new data. Don’t blame it. Adapt it. Use RAG, upgrade inputs, and stop expecting out-of-the-box magic.

7. OpenAI Doesn’t Get Your Vibe. Their models are average-taste. Your product isn’t. If you want your brand’s voice in your AI, you must define it yourself — with evals.

8. Trust LLM Judges... but validate them hard. LLMs can scale your evals — but you still need to verify them against human-labeled data. Don’t blindly trust your judge.

9. Your Prompts Are S**T. If your AI is bad, it’s probably your fault. The cheapest, most powerful fix? Sharpen your prompts. Clearer instructions = smarter AI.

10. Let AI Teach You. Seriously. LLM judges aren’t just scoring you — they can teach you. Reviewing how your AI fails is the best way to learn what great outputs should look like.
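
Takeaways 3 and 8 can be made concrete with a small sketch. It is an illustration, not the method taught in the episode: it assumes each output has a human pass/fail label and an LLM-judge pass/fail label, and the names `judge_agreement`, `JudgeAgreement`, `human_labels`, and `judge_labels` are hypothetical. It reports how often the judge agrees with humans on passing outputs and on failing outputs separately, because overall accuracy alone can look fine while the judge misses every failure.

```python
from dataclasses import dataclass


@dataclass
class JudgeAgreement:
    true_positive_rate: float  # share of human "pass" labels the judge also passed
    true_negative_rate: float  # share of human "fail" labels the judge also failed
    accuracy: float            # overall share of labels where judge and human match


def judge_agreement(human: list[bool], judge: list[bool]) -> JudgeAgreement:
    """Compare LLM-judge verdicts with human labels (True = pass, False = fail)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")

    both_pass = sum(h and j for h, j in zip(human, judge))
    both_fail = sum((not h) and (not j) for h, j in zip(human, judge))
    human_pass = sum(human)
    human_fail = len(human) - human_pass

    # A rate is undefined (NaN) when humans assigned no labels of that class.
    tpr = both_pass / human_pass if human_pass else float("nan")
    tnr = both_fail / human_fail if human_fail else float("nan")
    return JudgeAgreement(tpr, tnr, (both_pass + both_fail) / len(human))


# Made-up example data: humans passed 8 of 10 outputs; the judge passed all 10.
human_labels = [True] * 8 + [False] * 2
judge_labels = [True] * 10

report = judge_agreement(human_labels, judge_labels)
print(report)
```

In this made-up example the judge scores 80% accuracy, yet its true negative rate is 0: it never catches a failure. That is the kind of suspiciously comfortable number that checking against human labels is meant to expose.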

----

Check it out on Apple, Spotify, or YouTube.

----

Related Podcasts:

Complete Course: AI Product Management

Tutorial of Top 5 AI Prototyping Tools

If you only have 2 hrs, this is how to become an AI PM

College Dropout Raised $20M Building AI Tools | Cluely, Roy Lee

Bolt CEO and Founder on How he Hit $30M ARR in a Year

LogRocket CEO and Founder on How to Build a $100M+ AI Startup

Amplitude CEO and Founder on Building the Product Analytics Leader

----

P.S. More than 85% of you aren't subscribed yet. If you subscribe on YouTube and follow on Apple and Spotify, my commitment to you is that we'll keep making this content better.

----

If you want to advertise, email productgrowthppp at gmail.



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.news.aakashg.com/subscribe