Mar 2020
43m 36s

Easier Stream Processing On Kafka With k...

Tobias Macey
About this episode

Summary

Building applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems that were engineered in isolation. The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. In this episode Michael Drogalis, product manager for ksqlDB at Confluent, explains how the system is implemented, how you can use it for building your own stream processing applications, and how it fits into the lifecycle of your data infrastructure. If you have been struggling with building services on low-level streaming interfaces, then give this episode a listen and try it out for yourself.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
  • Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
  • You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
  • Your host is Tobias Macey and today I’m interviewing Michael Drogalis about ksqlDB, the open source streaming database layer for Kafka

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing what ksqlDB is?
  • What are some of the use cases that it is designed for?
  • How do the capabilities and design of ksqlDB compare to other solutions for querying streaming data with SQL such as Pulsar SQL, PipelineDB, or Materialize?
  • What was the motivation for building a unified project for providing a database interface on the data stored in Kafka?
  • How is ksqlDB architected?
    • If you were to rebuild the entire platform and its components from scratch today, what would you do differently?
  • What is the workflow for an analyst or engineer to design and build an application on top of ksqlDB?
    • What dialect of SQL is supported?
      • What kinds of extensions or built-in functions have been added to aid in the creation of streaming queries?
  • How are table schemas defined and enforced?
    • How do you handle schema migrations on active streams?
  • Typically a database is considered a long-term storage location for data, whereas Kafka is a streaming layer with a bounded amount of durable storage. What is a typical lifecycle of information in ksqlDB?
  • Can you talk through an example architecture that might incorporate ksqlDB, including the source systems, applications that might interact with the data in transit, and any destination systems for long-term persistence?
  • What are some of the less obvious features of ksqlDB or capabilities that you think should be more widely publicized?
  • What are some of the edge cases or potential pitfalls that users should be aware of as they are designing their streaming applications?
  • What is involved in deploying and maintaining an installation of ksqlDB?
    • What are some of the operational characteristics of the system that should be considered while planning an installation, such as scaling factors, high availability, or potential bottlenecks in the architecture?
  • When is ksqlDB the wrong choice?
  • What are some of the most interesting/unexpected/innovative projects that you have seen built with ksqlDB?
  • What are some of the most interesting/unexpected/challenging lessons that you have learned while working on ksqlDB?
  • What is in store for the future of the project?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
