About this episode
Summary
Data contracts are both an enforcement mechanism for data quality and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
- Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe the scope and purpose of data contracts in the context of this conversation?
- In what way(s) do they differ from data quality/data observability?
- Data contracts are also known as the API for data. Can you elaborate on this?
- What are the types of guarantees and requirements that you can enforce with these data contracts?
- What are some examples of constraints or guarantees that cannot be represented in these contracts?
- Are data contracts related to the shift-left approach?
- The obvious application of data contracts is in the context of pipeline execution flows to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
- How did you approach the design of the syntax and implementation for Soda's data contracts?
- Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap with, e.g., dbt and Great Expectations?
- Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
- What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
- When are data contracts the wrong choice?
- What do you have planned for the future of data contracts?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA