Nov 2021
58m 55s

Data Quality Starts At The Source

Tobias Macey
About this episode

Summary

The most important gauge of success for a data platform is the level of trust in the accuracy of the information that it provides. Building and maintaining that trust requires investing in defining, monitoring, and enforcing data quality metrics. In this episode Michael Harper advocates for proactive data quality that starts at the source, rather than reacting to problems and having to work backwards from where they were found.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering teams or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
  • Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
  • Your host is Tobias Macey and today I’m interviewing Michael Harper about definitions of data quality and where to define and enforce it in the data platform

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is your definition for the term "data quality" and what are the implied goals that it embodies?
    • What are some ways that different stakeholders and participants in the data lifecycle might disagree about the definitions and manifestations of data quality?
  • The market for "data quality tools" has been growing and gaining attention recently. How would you categorize the different approaches taken by open source and commercial options in the ecosystem?
    • What are the tradeoffs that you see in each approach? (e.g. data warehouse as a chokepoint vs quality checks on extract)
  • What are the difficulties that engineers and stakeholders encounter when identifying and defining information that is necessary to identify issues in their workflows?
  • Can you describe some examples of adding data quality checks to the beginning stages of a data workflow and the kinds of issues that can be identified?
    • What are some ways that quality and observability metrics can be aggregated across multiple pipeline stages to identify more complex issues?
  • In application observability the metrics across multiple processes are often associated with a given service. What is the equivalent concept in data platform observability?
  • In your work at Databand what are some of the ways that your ideas and assumptions around data quality have been challenged or changed?
  • What are the most interesting, innovative, or unexpected ways that you have seen Databand used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working at Databand?
  • When is Databand the wrong choice?
  • What do you have planned for the future of Databand?
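The proactive, source-side approach discussed above can be illustrated with a short sketch. This is a generic example, not code from the episode or from Databand: the field names (`user_id`, `amount`) and the failure threshold are hypothetical, standing in for whatever expectations a team defines for its own sources.

```python
# Hypothetical sketch: validate records at the extract step, before they
# flow into downstream pipeline stages, instead of debugging bad data later.

def check_record(record):
    """Return a list of quality violations for one source record."""
    violations = []
    if record.get("user_id") is None:
        violations.append("missing user_id")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)):
        violations.append("amount is not numeric")
    elif amount < 0:
        violations.append("negative amount")
    return violations

def validate_at_source(records, max_bad_ratio=0.05):
    """Fail fast if too many records violate expectations; drop the rest."""
    bad = [r for r in records if check_record(r)]
    ratio = len(bad) / max(len(records), 1)
    if ratio > max_bad_ratio:
        # Stop the pipeline here, at the source, rather than letting
        # bad data propagate into the warehouse.
        raise ValueError(f"{ratio:.0%} of records failed quality checks")
    return [r for r in records if not check_record(r)]

clean = validate_at_source(
    [{"user_id": 1, "amount": 9.99},
     {"user_id": 2, "amount": -1}],  # caught at the source
    max_bad_ratio=0.5,
)
```

The design choice the episode argues for is where this logic runs: placing checks at extraction means a violation is detected with full context about the source, instead of surfacing later as an unexplained anomaly in a dashboard.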

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
