Oct 2024
58m 1s

Bring Vector Search And Storage To The D...

Tobias Macey
About this episode
Summary
The rapid growth of generative AI applications has prompted a surge of investment in vector databases. While there are numerous engines available now, Lance is designed to integrate with data lake and lakehouse architectures. In this episode Weston Pace explains the inner workings of the Lance format for table definitions and file storage, and the optimizations that they have made to allow for fast random access and efficient schema evolution. In addition to integrating well with data lakes, Lance is also a first-class participant in the Arrow ecosystem, making it easy to use with your existing ML and AI toolchains. This is a fascinating conversation about a technology that is focused on expanding the range of options for working with vector data.
Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
  • Your host is Tobias Macey and today I'm interviewing Weston Pace about the Lance file and table format for column-oriented vector storage
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Lance is and the story behind it?
    • What are the core problems that Lance is designed to solve?
      • What is explicitly out of scope?
  • The README mentions that it is straightforward to convert to Lance from Parquet. What is the motivation for this compatibility/conversion support?
    • What formats does Lance replace or obviate?
  • In terms of data modeling, Lance obviously adds a vector type. What features and constraints should engineers be aware of when modeling their embeddings or arbitrary vectors?
    • Are there any practical or hard limitations on vector dimensionality?
  • When generating Lance files/datasets, what are some considerations to be aware of for balancing file/chunk sizes for I/O efficiency and random access in cloud storage?
  • I noticed that the file specification has space for feature flags. How has that aided in enabling experimentation in new capabilities and optimizations?
  • What are some of the engineering and design decisions that were most challenging and/or had the biggest impact on the performance and utility of Lance?
  • The most obvious interface for reading and writing Lance files is through LanceDB. Can you describe the use cases that it focuses on and its notable features?
    • What are the other main integrations for Lance?
    • What are the opportunities or roadblocks in adding support for Lance and vector storage/indexes in, e.g., Iceberg or Delta to enable their use in data lake environments?
  • What are the most interesting, innovative, or unexpected ways that you have seen Lance used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Lance format?
  • When is Lance the wrong choice?
  • What do you have planned for the future of Lance?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA