Oct 2024
58m 1s

Bring Vector Search And Storage To The D...

Tobias Macey
About this episode
Summary
The rapid growth of generative AI applications has prompted a surge of investment in vector databases. While there are numerous engines available now, Lance is designed to integrate with data lake and lakehouse architectures. In this episode Weston Pace explains the inner workings of the Lance format for table definitions and file storage, and the optimizations that they have made to allow for fast random access and efficient schema evolution. In addition to integrating well with data lakes, Lance is also a first-class participant in the Arrow ecosystem, making it easy to use with your existing ML and AI toolchains. This is a fascinating conversation about a technology that is focused on expanding the range of options for working with vector data.
Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
  • Your host is Tobias Macey and today I'm interviewing Weston Pace about the Lance file and table format for column-oriented vector storage
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Lance is and the story behind it?
    • What are the core problems that Lance is designed to solve?
      • What is explicitly out of scope?
  • The README mentions that it is straightforward to convert to Lance from Parquet. What is the motivation for this compatibility/conversion support?
    • What formats does Lance replace or obviate?
  • In terms of data modeling, Lance obviously adds a vector type. What are the features and constraints that engineers should be aware of when modeling their embeddings or arbitrary vectors?
    • Are there any practical or hard limitations on vector dimensionality?
  • When generating Lance files/datasets, what are some considerations to be aware of for balancing file/chunk sizes for I/O efficiency and random access in cloud storage?
  • I noticed that the file specification has space for feature flags. How has that aided in enabling experimentation in new capabilities and optimizations?
  • What are some of the engineering and design decisions that were most challenging and/or had the biggest impact on the performance and utility of Lance?
  • The most obvious interface for reading and writing Lance files is through LanceDB. Can you describe the use cases that it focuses on and its notable features?
    • What are the other main integrations for Lance?
    • What are the opportunities or roadblocks in adding support for Lance and vector storage/indexes to table formats such as Iceberg or Delta to enable its use in data lake environments?
  • What are the most interesting, innovative, or unexpected ways that you have seen Lance used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Lance format?
  • When is Lance the wrong choice?
  • What do you have planned for the future of Lance?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA