Jul 2024
49m 26s

Achieving Data Reliability: The Role of ...

Tobias Macey
About this episode
Summary
Data contracts are both an enforcement mechanism for data quality and a promise to downstream consumers. In this episode, Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, their organizational impact, and the potential for standardization in the field.
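To make the idea of "enforcing guarantees" concrete, here is a minimal sketch in plain Python. It is not Soda's syntax or any tool's API: the `ColumnRule`, `DataContract`, and `publish_if_valid` names, along with the `orders` dataset and its columns, are hypothetical and exist only for illustration. The sketch assumes a contract is a list of per-column promises (expected type, whether nulls are allowed) and that a pipeline step refuses to pass rows downstream when any promise is broken.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnRule:
    """Hypothetical rule for one column: expected type and nullability."""
    name: str
    dtype: type
    nullable: bool = False


@dataclass
class DataContract:
    """Hypothetical contract: the columns a producer promises to deliver."""
    dataset: str
    columns: list[ColumnRule] = field(default_factory=list)

    def violations(self, rows: list[dict]) -> list[str]:
        """Return a human-readable message for every broken guarantee."""
        problems = []
        for i, row in enumerate(rows):
            for rule in self.columns:
                if rule.name not in row:
                    problems.append(f"row {i}: missing column '{rule.name}'")
                    continue
                value = row[rule.name]
                if value is None:
                    if not rule.nullable:
                        problems.append(f"row {i}: '{rule.name}' is null")
                elif not isinstance(value, rule.dtype):
                    problems.append(
                        f"row {i}: '{rule.name}' expected {rule.dtype.__name__}, "
                        f"got {type(value).__name__}"
                    )
        return problems


def publish_if_valid(contract: DataContract, rows: list[dict]) -> list[dict]:
    """Gate a pipeline step: raise instead of passing bad rows downstream."""
    problems = contract.violations(rows)
    if problems:
        raise ValueError(f"{contract.dataset} broke its contract: {problems}")
    return rows


# Illustrative contract and sample data (all names are made up).
orders_contract = DataContract(
    dataset="orders",
    columns=[
        ColumnRule("order_id", int),
        ColumnRule("customer_email", str),
        ColumnRule("discount_code", str, nullable=True),
    ],
)

good_rows = [{"order_id": 1, "customer_email": "a@example.com", "discount_code": None}]
bad_rows = [{"order_id": "2", "customer_email": None}]
```

Calling `publish_if_valid(orders_contract, good_rows)` returns the rows unchanged, while the same call with `bad_rows` raises a `ValueError` listing each violation. Real contract tools typically cover far more than this (freshness, row counts, value ranges, ownership metadata), so treat this only as an outline of the enforcement idea.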

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high-quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Starburst is trusted by teams of all sizes, including Comcast and DoorDash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
  • Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe the scope and purpose of data contracts in the context of this conversation?
  • In what way(s) do they differ from data quality/data observability?
  • Data contracts are also known as the API for data, can you elaborate on this?
  • What are the types of guarantees and requirements that you can enforce with these data contracts?
  • What are some examples of constraints or guarantees that cannot be represented in these contracts?
  • Are data contracts related to the shift-left approach?
  • The obvious application of data contracts is in the context of pipeline execution flows to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
  • How did you approach the design of the syntax and implementation for Soda's data contracts?
  • Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap with tools such as dbt and Great Expectations?
  • Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
  • What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
  • When are data contracts the wrong choice?
  • What do you have planned for the future of data contracts?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA