Jan 2022
1h 2m

Automated Data Quality Management Throug...

Tobias Macey
About this episode

Summary

Data quality control is a prerequisite for trusting the reports and machine learning models that rely on the information you curate. Rules-based systems are useful for validating known requirements, but with the scale and complexity of data in modern organizations it is impractical, and often impossible, to manually create rules for all potential errors. The team at Anomalo is building a machine-learning-powered platform for identifying and alerting on anomalous and invalid changes in your data so that you aren’t flying blind. In this episode, founders Elliot Shmukler and Jeremy Stanley explain how they have architected the system to work with your data warehouse and surface the critical issues hiding in your data without overwhelming you with alerts.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
  • The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of their data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
  • Your host is Tobias Macey and today I’m interviewing Elliot Shmukler and Jeremy Stanley about Anomalo, a data quality platform aiming to automate issue detection with zero setup

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Anomalo is and the story behind it?
  • Managing data quality is ostensibly about building trust in your data. What are the promises that data teams are able to make about the information in their control when they are using Anomalo?
    • What are some of the claims that cannot be made unequivocally when relying on data quality monitoring systems?
  • Types of data quality issues identified
    • Utility of automated vs. programmatic tests
  • Can you describe how the Anomalo system is designed and implemented?
    • How have the design and goals of the platform changed or evolved since you started working on it?
  • What is your approach for validating changes to the business logic in your platform given the unpredictable nature of the system under test?
  • Model training/customization process
  • Statistical model
  • Seasonality/windowing
  • CI/CD
  • With any monitoring system the most challenging thing to do is avoid generating alerts that aren’t actionable or helpful. What is your strategy for helping your customers avoid alert fatigue?
  • What are the most interesting, innovative, or unexpected ways that you have seen Anomalo used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomalo?
  • When is Anomalo the wrong choice?
  • What do you have planned for the future of Anomalo?

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
