About this episode
Summary
Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or on statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to catch data errors before they cause outsized problems. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce that I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy about strategies for proactive data quality management and his work at Datafold to help provide tools for implementing them
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you are building at Datafold and the story behind it?
- What are the biggest factors that you see contributing to data quality issues?
- How are teams identifying and addressing those failures?
- How does the data platform architecture impact the potential for introducing quality problems?
- What are some of the potential risks or consequences of introducing errors in data processing?
- How can organizations shift to being proactive in their data quality management?
- How much of a role does tooling play in addressing the introduction and remediation of data quality problems?
- Can you describe how Datafold is designed and architected to allow for proactive management of data quality?
- What are some of the original goals and assumptions about how to empower teams to improve data quality that have been challenged or changed as you have worked through building Datafold?
- What is the workflow for an individual or team who is using Datafold as part of their data pipeline and platform development?
- What are the organizational patterns that you have found to be most conducive to proactive data quality management?
- Who is responsible for identifying and addressing quality issues?
- What are the most interesting, innovative, or unexpected ways that you have seen Datafold used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold?
- When is Datafold the wrong choice?
- What do you have planned for the future of Datafold?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Special Guest: Gleb Mezhanskiy.
Support Data Engineering Podcast