Jun 11
44m 9s

AI and the Lakehouse: How Starburst is P...

Tobias Macey
About this episode
Summary
In this episode of the Data Engineering Podcast, Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
  • This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: Increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda’s launch week. It starts June 9th.
  • This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial.
  • Your host is Tobias Macey and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining the interaction points of AI with the types of data workflows that you are supporting with Starburst?
  • What are some of the limitations of warehouse and lakehouse systems when it comes to supporting AI systems?
  • What are the points of friction for engineers who are trying to employ LLMs in the work of maintaining a lakehouse environment?
  • Methods such as tool use (exemplified by MCP) are a means of bolting on AI models to systems like Trino. What are some of the ways that is insufficient or cumbersome?
  • Can you describe the technical implementation of the AI-oriented features that you have incorporated into the Starburst platform?
    • What are the foundational architectural modifications that you had to make to enable those capabilities?
  • For the vector storage and indexing, what modifications did you have to make to Iceberg?
    • What was your reasoning for not using a format like Lance?
  • For teams who are using Starburst and your new AI features, what are some examples of the workflows that they can expect?
  • What new capabilities are enabled by virtue of embedding AI features into the interface to the lakehouse?
  • What are the most interesting, innovative, or unexpected ways that you have seen Starburst AI features used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI features for Starburst?
  • When is Starburst/lakehouse the wrong choice for a given AI use case?
  • What do you have planned for the future of AI on Starburst?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA