About this episode
Summary
Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data captures
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Materialize is and the problems that you are aiming to solve with it?
- What was your motivation for creating it?
- What use cases does Materialize enable?
- What are some of the existing tools or systems that you have seen employed to address those needs which can be replaced by Materialize?
- How does it fit into the broader ecosystem of data tools and platforms?
- What are some of the use cases that Materialize is uniquely able to support?
- How is Materialize architected and how has the design evolved since you first began working on it?
- Materialize is based on your timely-dataflow project, which itself is based on the work you did on Naiad. What was your reasoning for using Rust as the implementation target and what benefits has it provided?
- What are some of the components or primitives that were missing in the Rust ecosystem as compared to what is available in Java or C/C++, which have been the dominant languages for distributed data systems?
- In the list of features, you highlight full support for ANSI SQL 92. What were some of the edge cases that you faced in complying with that standard given the distributed execution context for Materialize?
- A majority of SQL oriented platforms define custom extensions or built-in functions that are specific to their problem domain. What are some of the existing or planned additions for Materialize?
- Can you talk through the lifecycle of data as it flows from the source database and through the Materialize engine?
- What are the considerations and constraints on maintaining the full history of the source data within Materialize?
- For someone who wants to use Materialize, what is involved in getting it set up and integrated with their data sources?
- What is the workflow for defining and maintaining a set of views?
- What are some of the complexities that users might face in ensuring the ongoing functionality of those views?
- For someone who is unfamiliar with the semantics of streaming SQL, what are some of the conceptual shifts that they should be aware of?
- The Materialize product is currently pre-release. What are the remaining steps before launching it?
- What do you have planned for the future of the product and company?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast
Jul 6
Foundational Data Engineering At 2Sigma
SummaryIn this episode of the Data Engineering Podcast Effie Baram, a leader in foundational data engineering at Two Sigma, talks about the complexities and innovations in data engineering within the finance sector. She discusses the critical role of data at Two Sigma, balancing ... Show More
55m 5s
Jun 29
Enabling Agents In The Enterprise With A Platform Approach
SummaryIn this episode of the Data Engineering Podcast Arun Joseph talks about developing and implementing agent platforms to empower businesses with agentic capabilities. From leading AI engineering at Deutsche Telekom to his current entrepreneurial venture focused on multi-agen ... Show More
54m 18s
Jun 18
Dagster's New Era: Modularizing Data Transformation in the Age of AI
SummaryIn this episode of the Data Engineering Podcast we welcome back Nick Schrock, CTO and founder of Dagster Labs, to discuss the evolving landscape of data engineering in the age of AI. As AI begins to impact data platforms and the role of data engineers, Nick shares his insi ... Show More
1h 1m
Feb 2023
Shorten the distance between production data and insight
Modern networked applications generate a lot of data, and every business wants to make the most of that data. Most of the time, that means moving production data through some transformation process to get it ready for the analytics process. But what if you could have in-app analy ... Show More
20m 27s
May 2020
How Important are algorithm and data structures in backend engineering?
Algorithms & Data Structures are critical to Backend Engineering however it really depends on what kind of application and infrastructure you are building. In this video I want to go through the following 1 Backend Engineers are two types - Integrating Existing Backend - Core ... Show More
13m 29s
Jul 2020
What data transformation library should I use? Pandas vs Dask vs Ray vs Modin vs Rapids (Ep. 112)
In this episode I speak about data transformation frameworks available for the data scientist who writes Python code. The usual suspect is clearly Pandas, as the most widely used library and de-facto standard. However when data volumes increase and distributed algorithms are in p ... Show More
21m 10s
Mar 2022
Bayesian Machine Learning with Ravin Kumar (Ep. 191)
This is one episode where passion for math, statistics and computers are merged. I have a very interesting conversation with Ravin, data scientist at Google where he uses data to inform decisions.
He has previously worked at Sweetgreen, designing systems that would benefit team ... Show More
31m 12s
Oct 2023
#628: Data on EKS
Organizations use their data to make better decisions and build innovative experiences for their customers. With the exponential growth in data, and the rapid pace of innovation in machine learning (ML), there is a growing need to build modern data applications that are agile and ... Show More
20m 56s
Nov 2021
Time Plus Data Equals Efficiency with Paul Dix, the Founder and CTO of InfluxData and the Creator of InfluxDB
If the topic of databases is brought up to certain people, their eyes may gloss over. But if that happened, that would be because they just don’t know the awesome power of databases. Data can be valuable but only if it is contextualized, and time is an extremely relevant aspect t ... Show More
36m 4s
Aug 2020
Introduction to GraphQL
Tanmai Gopal (@tanmaigo, CEO Hasura) and Rajoshi Ghosh (@rajoshighosh, COO Hasura) talk about the evolution of GraphQL as an efficient way to engage with APIs and data models, and how Hasura Cloud helps simplify GraphQL for developers.SHOW: 462
SHOW SPONSOR LINKS:Datadog Security ... Show More
40m 40s
Dec 2022
MongoDB Internal Architecture | The Backend Engineering Show
I’m a big believer that database systems share similar core fundamentals at their storage layer and understanding them allows one to compare different DBMS objectively. For example, How documents are stored in MongoDB is no different from how MySQL or PostgreSQL store rows. Every ... Show More
44m 13s
May 2024
Deepthi Sigireddi on Distributed Database Architecture in the Cloud Native Era
In this podcast, Vitess CNCF project technical lead Deepthi Sigireddi discusses the architecture of cloud native distributed databases, sharding, replication, and failover. She also talks about what DB developers should consider when choosing distributed databases.
Read a transcr ... Show More
37m 24s
Mar 2024
LLM Security and Privacy
Sean Falconer (@seanfalconer, Head of Dev Relations @SkyflowAPI, Host @software_daily) talks about security and privacy of LLMs and how to prevent PII (personally identifiable information) from leaking outSHOW: 807
CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw
NEW TO CLO ... Show More
26m 9s