Oct 2018
40m 55s

Using Notebooks As The Unifying Layer Fo...

Tobias Macey
About this episode

Summary

Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, moving the data scientist’s work into a more standard production environment can be difficult because of the translation effort required. At Netflix, they had the crazy idea that perhaps that last step isn’t necessary and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers tasked with building the tools and practices that allow the various data-oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges it has posed, the development that has been done to make it work, and the benefits it provides to the Netflix data platform teams.

Preamble

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you’re ready to build your next pipeline, you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API, you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
  • Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
  • Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
  • Your host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?
    • Where are you using notebooks and where are you not?

  • What is the technical infrastructure that you have built to support that design choice?

  • Which team was driving the effort?

    • Was it difficult to get buy-in across teams?

  • How much shared code have you been able to consolidate or reuse across teams/roles?

  • Have you investigated the use of any of the other notebook platforms for similar workflows?

  • What are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?

  • What are some of the limitations of the notebook environment for the work that you are doing?

  • What have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?

  • What are some of the projects that are ongoing or planned for the future that you are most excited by?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
