Jan 2015
10m 56s

[MINI] Data Provenance

Kyle Polich
About this episode

This episode opens a high-level discussion of Data Provenance, with more MINI episodes to follow on specific topics. Thanks to listener Sara L, who wrote in to point out that the Data Skeptic Podcast has focused a lot on using data to be skeptical, but not necessarily on being skeptical of data.

Data Provenance is the concept of knowing the full origin of your dataset. Where did it come from? Who collected it? How was it collected? Does it combine independent sources or come from a single source? What are the error bounds on the way it was measured? These are just some of the questions one should ask to understand one's data. After all, if the antecedent of an argument is built on dubious grounds, the consequent of the argument is equally dubious.
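
As a rough illustration, the questions above could be written down alongside a dataset as a small record. The following is only a minimal Python sketch: the `ProvenanceRecord` class, its field names, and the `open_questions` helper are hypothetical names chosen for this example, not something discussed in the episode or taken from any library.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical record of the provenance questions for one dataset."""
    origin: str = ""        # where did it come from?
    collector: str = ""     # who collected it?
    method: str = ""        # how was it collected?
    sources: List[str] = field(default_factory=list)  # independent sources combined
    error_bound: Optional[float] = None  # measurement error bound, if known

    def open_questions(self) -> List[str]:
        """Return the names of the provenance fields that are still unanswered."""
        missing = []
        if not self.origin:
            missing.append("origin")
        if not self.collector:
            missing.append("collector")
        if not self.method:
            missing.append("method")
        if not self.sources:
            missing.append("sources")
        if self.error_bound is None:
            missing.append("error_bound")
        return missing


# Example: a dataset whose collector and error bounds are unknown (made-up values).
record = ProvenanceRecord(
    origin="city open-data portal",
    method="roadside sensor counts",
    sources=["sensor network A"],
)
print(record.open_questions())
```

Any field the helper reports as unanswered is a reason to treat conclusions drawn from that dataset with extra skepticism.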

For a more technical discussion than what we get into in this mini episode, I recommend "A Survey of Data Provenance Techniques" by Simmhan, Plale, and Gannon.
