About this episode
Summary
The vast majority of data tools and platforms that you hear about are designed for working with structured, text-based data. What do you do when you need to manage unstructured information, or build a computer vision model? Activeloop was created for exactly that purpose. In this episode, Davit Buniatyan, founder and CEO of Activeloop, explains why he is spending his time and energy on building a platform to simplify the work of getting your unstructured data ready for machine learning. He discusses the inefficiencies that teams run into when they have to reprocess data multiple times, his work on the open source Hub library to solve this problem for everyone, and his thoughts on the vast potential for using computer vision to solve hard and meaningful problems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
- Your host is Tobias Macey and today I’m interviewing Davit Buniatyan about Activeloop, a platform for hosting and delivering datasets optimized for machine learning
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Activeloop is and the story behind it?
- How does the form and function of data storage introduce friction in the development and deployment of machine learning projects?
- How does the work that you are doing at Activeloop compare to vector databases such as Pinecone?
- You have a focus on image-oriented data and computer vision projects. How do the specific applications of ML/DL influence the format of, and interactions with, the data?
- Can you describe how the Activeloop platform is architected?
- How have the design and goals of the system changed or evolved since you began working on it?
- What are the feature and performance tradeoffs between self-managed storage locations (e.g. S3, GCS) and the Activeloop platform?
- What is the process for sourcing, processing, and storing data to be used by Hub/Activeloop?
- Many data assets are useful across ML/DL and analytical purposes. What are the considerations for managing the lifecycle of data between Activeloop/Hub and a data lake/warehouse?
- What do you see as the opportunity and effort to generalize Hub and Activeloop to support arbitrary ML frameworks/languages?
- What are the most interesting, innovative, or unexpected ways that you have seen Activeloop and Hub used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Activeloop?
- When is Hub/Activeloop the wrong choice?
- What do you have planned for the future of Activeloop?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA