About this episode
Preamble
This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.
Summary
The majority of machine learning projects that you read about or work on are built around batch processes. The model is trained, and then validated, and then deployed, with each step being a discrete and isolated task. Unfortunately, the real world is rarely static, leading to concept drift and model failures. River is a framework for building streaming machine learning projects that can constantly adapt to new information. In this episode Max Halford explains how the project works, why you might (or might not) want to consider streaming ML, and how to get started building with River.
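To make the contrast with batch training concrete, here is a minimal pure-Python sketch of online learning. It is not River's implementation: the class `OnlineLinearRegression`, the synthetic `stream` generator, and the learning rate are all assumptions made for illustration, and only the method names `predict_one` and `learn_one` echo River's naming convention. The model predicts each sample before learning from it, and the simulated relationship flips halfway through the stream to mimic concept drift.

```python
import random


class OnlineLinearRegression:
    """Hypothetical minimal learner for illustration; not River's code."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = {}  # feature name -> weight, created on first sight
        self.bias = 0.0

    def predict_one(self, x):
        # x is a dict mapping feature names to numeric values
        return self.bias + sum(self.weights.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x, y):
        # One stochastic gradient step on the squared error of this sample
        error = self.predict_one(x) - y
        for k, v in x.items():
            self.weights[k] = self.weights.get(k, 0.0) - self.lr * error * v
        self.bias -= self.lr * error


def stream(n, seed=0):
    """Synthetic stream (an assumption): the slope on 'a' flips halfway."""
    rng = random.Random(seed)
    for i in range(n):
        x = {"a": rng.uniform(-1, 1), "b": rng.uniform(-1, 1)}
        slope = 2.0 if i < n // 2 else -2.0
        yield x, slope * x["a"] + 0.5 * x["b"]


model = OnlineLinearRegression(lr=0.1)
errors = []
for x, y in stream(4000):
    errors.append(abs(model.predict_one(x) - y))  # test first ...
    model.learn_one(x, y)                         # ... then train

just_after_drift = sum(errors[2000:2200]) / 200
end_of_stream = sum(errors[-200:]) / 200
```

In this sketch the error jumps when the relationship changes and then shrinks again as the model keeps updating, with no separate retraining step. A batch model trained only on the first half would keep its stale slope until someone retrained and redeployed it.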
Announcements
- Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
- Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to get started!
- Your host is Tobias Macey, and today I’m interviewing Max Halford about River, a Python toolkit for streaming and online machine learning.
Interview
- Introduction
- How did you get involved in machine learning?
- Can you describe what River is and the story behind it?
- What is "online" machine learning?
- What are the practical differences with batch ML?
- Why is batch learning so predominant?
- What are the cases where someone would want/need to use online or streaming ML?
- The prevailing pattern for batch ML model lifecycles is to train, deploy, monitor, repeat. What does the ongoing maintenance for a streaming ML model look like?
- Concept drift is typically due to a discrepancy between the data used to train a model and the actual data being observed. How does the use of online learning affect the incidence of drift?
- Can you describe how the River framework is implemented?
- How have the design and goals of the project changed since you started working on it?
- How do the internal representations of the model differ from batch learning to allow for incremental updates to the model state?
- In the documentation you note the use of Python dictionaries for state management and the flexibility offered by that choice. What are the benefits and potential pitfalls of that decision?
- Can you describe the process of using River to design, implement, and validate a streaming ML model?
- What are the operational requirements for deploying and serving the model once it has been developed?
- What are some of the challenges that users of River might run into if they are coming from a batch learning background?
- What are the most interesting, innovative, or unexpected ways that you have seen River used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on River?
- When is River the wrong choice?
- What do you have planned for the future of River?
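Two of the questions above concern incremental model state and the use of Python dictionaries to hold it. The following is a rough sketch of that idea, assuming nothing about River's internals: the `RunningStats` class and the sample `events` are hypothetical. It keeps a per-feature running mean and variance with Welford's update, so each observation is folded into the state once and then discarded.

```python
from collections import defaultdict


class RunningStats:
    """Welford's single-pass mean and variance; illustrative only."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 until two values have been seen
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# State lives in a dictionary keyed by feature name, so a feature that
# first appears mid-stream gets its own statistics automatically.
stats = defaultdict(RunningStats)

events = [  # hypothetical observations arriving one at a time
    {"temp": 20.0},
    {"temp": 22.0, "humidity": 0.4},
    {"temp": 24.0, "humidity": 0.6},
]
for event in events:
    for name, value in event.items():
        stats[name].update(value)
```

The dictionary needs no fixed schema, which is the flexibility side of the trade-off. The flip side, at least in this sketch, is that a misspelled key would silently become a new feature instead of raising an error.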
Contact Info
Parting Question
- From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
- To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.
Links
The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0