Today’s episode
Most PMs treat evals like a quality gate. Something you run right before shipping, just to check the box.
That is backwards.
The best AI product teams treat evals as the starting point. They write the eval before the prompt. They iterate on the scoring function before the model. They use failing evals as a roadmap.
That shift is what today’s episode is about.
I sat down with Ankur Goyal, Founder and CEO of Braintrust, the eval platform used by Replit, Vercel, Airtable, Ramp, Zapier, and Notion. Braintrust just announced its Series B at an $800 million valuation.
Its users are running 10x more evals than they were this time last year, and they now log more data per day than they did in the product's entire first year.
In this episode, we build an eval entirely from scratch. Live. No pre-written prompts, no pre-written data. We connect to Linear’s MCP server, generate test data, write a scoring function, and iterate until the score goes from 0 to 0.75.
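To make that concrete, here is a bare-bones sketch of the eval shape we build in the episode. Everything in it is illustrative: the names and data are hypothetical stand-ins, and it swaps the real pieces (Braintrust's SDK, an agent wired to Linear's MCP server) for plain Python so you can see the structure at a glance.

```python
# A minimal, self-contained sketch of an eval: Data, Task, Scores, and a run loop.
# All names and example rows are hypothetical; the episode uses Braintrust's SDK
# and an agent connected to Linear's MCP server, which this stand-in does not reproduce.

# Data: a set of inputs (auto-generated by a model in the episode, not hand-curated).
dataset = [
    {"input": "Which issues are blocking the current release?",
     "expected": "LIN-42 is the only open blocker."},
    {"input": "Summarize the bugs filed this week.",
     "expected": "Three bugs were filed, all in the billing flow."},
]

def task(input_text: str) -> str:
    """Task: generates an output for each input. Stand-in for the real agent call."""
    return "LIN-42 is the only open blocker."

def score(output: str, expected: str) -> float:
    """Score: rates the output between 0 and 1 (here, a naive exact match)."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(dataset, task, scorer) -> float:
    """Runs the task over every input and returns the average score."""
    scores = [scorer(task(row["input"]), row["expected"]) for row in dataset]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(f"average score: {run_eval(dataset, task, score):.2f}")
```

The loop is the whole idea: once the data, task, and scorer are pinned down, iterating on the prompt or model is just a matter of rerunning it and watching the number move.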
Plus, we cover the complete eval playbook for PMs (see the key takeaways below).
If you want access to my AI tool stack - Dovetail, Arize, Linear, Descript, Reforge Build, DeepSky, Relay.app, Magic Patterns, Speechify, and Mobbin - grab my bundle.
If you want my PM Operating System in Claude Code, click here.
----
Check out the conversation on Apple, Spotify, and YouTube.
Brought to you by:
* Kameleoon: Leading AI experimentation platform
* Testkube: Leading test orchestration platform
* Pendo: The #1 software experience management platform
* Bolt: Ship AI-powered products 10x faster
* Product Faculty: Get $550 off their #1 AI PM Certification with my link
----
Key Takeaways:
1. Vibe checks are evals - When you look at an AI output and intuit whether it is good or bad, you are using your brain as a scoring function. It is evaluation. It just does not scale past one person and a handful of examples.
2. Every eval has three parts - Data (a set of inputs), Task (the function that generates an output for each input), and Scores (functions that rate each output between 0 and 1), as sketched above. Normalizing every score to that range is what makes runs comparable over time.
3. Evals are the new PRD - In 2015, a PRD was an unstructured document nobody followed. In 2026, the modern PRD is an eval the whole team can run to quantify product quality.
4. Start with imperfect data - Auto-generate test questions with a model. Do not spend a month building a golden data set. Jump in and iterate from your first experiment.
5. The distance principle - The farther you are from the end user, the more critical evals become. Anthropic can vibe check Claude Code because engineers are the users. Healthcare AI teams cannot.
6. Use categorical scoring, not freeform numbers - Give the scorer three clear options (full answer, partial, no answer) instead of asking an LLM to produce an arbitrary number. See the sketch after this list.
7. Evals compound, prompts do not - Models and frameworks change every few months. If you encode what your users need as evals, that investment survives every model swap.
8. Have evals that fail - If everything passes, you have blind spots. Keep failing evals as a roadmap and rerun them every time a new model drops.
9. Build the offline-to-online flywheel - Offline evals test your hypothesis. Online evals run the same scorers on production logs. The gap between them is your improvement roadmap.
10. The best teams review production logs every morning - They find novel patterns, add them to the data set, and iterate all day. That morning ritual is what separates teams that ship blind from teams that ship with confidence.
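Here is the categorical-scoring sketch referenced in takeaway 6. It is illustrative only: the judge-model call is stubbed out (swap in whichever LLM client you actually use), and the category names and prompt wording are assumptions, not anyone's official API.

```python
# Hypothetical sketch of a categorical scorer (takeaway 6).
# The judge call is a stub; replace call_judge_model with a real LLM client.

CATEGORIES = {
    "full_answer": 1.0,     # output fully answers the question
    "partial_answer": 0.5,  # output answers part of it
    "no_answer": 0.0,       # output does not answer the question
}

JUDGE_PROMPT = """You are grading an AI answer.
Question: {input}
Expected: {expected}
Answer: {output}
Reply with exactly one of: full_answer, partial_answer, no_answer."""

def call_judge_model(prompt: str) -> str:
    """Stub for an LLM call; assumed to return one of the category names."""
    return "partial_answer"

def categorical_score(input_text: str, output: str, expected: str) -> float:
    """Ask the judge to pick a category, then map that category to a number."""
    choice = call_judge_model(
        JUDGE_PROMPT.format(input=input_text, expected=expected, output=output)
    ).strip()
    return CATEGORIES.get(choice, 0.0)  # unrecognized replies score as 0

print(categorical_score("Which issues block the release?",
                        "LIN-42, possibly others.",
                        "LIN-42 is the only open blocker."))
```

Because the judge only has to pick one of three labels, its answers are easier to audit and far more stable across runs than a freeform 0-to-1 number.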
----
Where to find Ankur Goyal
* Braintrust
Related content
Newsletters:
Podcasts:
* AI evals with Hamel Husain and Shreya Shankar
* AI evals part 2 with Hamel and Shreya
* Aman Khan on AI product quality
----
PS. Please subscribe on YouTube and follow on Apple & Spotify. It helps!
If you want to advertise, email productgrowthppp at gmail.