episode-header-image

Dec 2022

1h 16m

Update Your Model's View Of The World In...

About this episode

Preamble

This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.

Summary

The majority of machine learning projects that you read about or work on are built around batch processes. The model is trained, and then validated, and then deployed, with each step being a discrete and isolated task. Unfortunately, the real world is rarely static, leading to concept drift and model failures. River is a framework for building streaming machine learning projects that can constantly adapt to new information. In this episode Max Halford explains how the project works, why you might (or might not) want to consider streaming ML, and how to get started building with River.

Announcements

Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to get started!
Your host is Tobias Macey and today I’m interviewing Max Halford about River, a Python toolkit for streaming and online machine learning

Interview

Introduction
How did you get involved in machine learning?
Can you describe what River is and the story behind it?
What is "online" machine learning?
- What are the practical differences with batch ML?
- Why is batch learning so predominant?
- What are the cases where someone would want/need to use online or streaming ML?
The prevailing pattern for batch ML model lifecycles is to train, deploy, monitor, repeat. What does the ongoing maintenance for a streaming ML model look like?
- Concept drift is typically due to a discrepancy between the data used to train a model and the actual data being observed. How does the use of online learning affect the incidence of drift?
Can you describe how the River framework is implemented?
- How have the design and goals of the project changed since you started working on it?
How do the internal representations of the model differ from batch learning to allow for incremental updates to the model state?
In the documentation you note the use of Python dictionaries for state management and the flexibility offered by that choice. What are the benefits and potential pitfalls of that decision?
Can you describe the process of using River to design, implement, and validate a streaming ML model?
- What are the operational requirements for deploying and serving the model once it has been developed?
What are some of the challenges that users of River might run into if they are coming from a batch learning background?
What are the most interesting, innovative, or unexpected ways that you have seen River used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on River?
When is River the wrong choice?
What do you have planned for the future of River?

Contact Info

Email
@halford_max on Twitter
MaxHalford on GitHub

Parting Question

From your perspective, what is the biggest barrier to adoption of machine learning today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers

Links

The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0

Sponsored By:

Linode: Do you want to try out some of the tools and applications that you heard about on Podcast.\_\_init\_\_? Do you have a side project that you want to share with the world? With Linode's managed Kubernetes platform it's now even easier to get started with the latest in cloud technologies. With the combined power of the leading container orchestrator and the speed and reliability of Linode's object storage, node balancers, block storage, and dedicated CPU or GPU instances, you've got everything you need to scale up. Go to [pythonpodcast.com/linode](https://www.pythonpodcast.com/linode) today and get a $100 credit to launch a new cluster, run a server, upload some data, or... And don't forget to thank them for being a long time supporter of Podcast.\_\_init\_\_!

Up next

Declarative Machine Learning For High Performance Deep Learning Models With Predibase

Preamble This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning. Summary Deep learning is a revolutionary category of machine learning that accelerates our ability to build powerful inference ... Show More

Build Better Machine Learning Models With Confidence By Adding Validation With Deepchecks

Preamble This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning. Summary Machine learning has the potential to transform industries and revolutionize business capabilities, but only if the mo ... Show More

Build A Full Stack ML Powered App In An Afternoon With Baseten

Preamble This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning. Summary Building an ML model is getting easier than ever, but it is still a challenge to get that model in front of the people ... Show More

Recommended Episodes

Accelerating ML Training And Delivery With In-Database Machine Learning

Summary When you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the ... Show More

Declarative Machine Learning Without The Operational Overhead Using Continual

Summary Building, scaling, and maintaining the operational components of a machine learning workflow are all hard problems. Add the work of creating the model itself, and it’s not surprising that a majority of companies that could greatly benefit from machine learning have yet to ... Show More

Machine Learning In The Enterprise

Summary Machine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and execute on ways that it can be used in large companies. Kevin Dewalt founded Prolego to help Fortune 500 companies build, launch, and mai ... Show More

Bridging The Gap Between Machine Learning And Operations At Iguazio

Summary The process of building and deploying machine learning projects requires a staggering number of systems and stakeholders to work in concert. In this episode Yaron Haviv, co-founder of Iguazio, discusses the complexities inherent to the process, as well as how he has worke ... Show More

Prepare Your Unstructured Data For Machine Learning And Computer Vision Without The Toil Using Activeloop

Summary The vast majority of data tools and platforms that you hear about are designed for working with structured, text-based data. What do you do when you need to manage unstructured information, or build a computer vision model? Activeloop was created for exactly that purpose. ... Show More

Exploring Processing Patterns For Streaming Data Integration In Your Data Lake

Summary One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. With the improvements in streaming engines it is now possible to perform all of your data integration in near real time, but it can be challenging to understand th ... Show More

Lessons Learned From The Pipeline Data Engineering Academy

Summary Data Engineering is a broad and constantly evolving topic, which makes it difficult to teach in a concise and effective manner. Despite that, Daniel Molnar and Peter Fabian started the Pipeline Academy to do exactly that. In this episode they reflect on the lessons that t ... Show More

Massively Parallel Data Processing In Python Without The Effort Using Bodo

Summary Python has beome the de facto language for working with data. That has brought with it a number of challenges having to do with the speed and scalability of working with large volumes of information.There have been many projects and strategies for overcoming these challen ... Show More

Exploring The Design And Benefits Of The Modern Data Stack

Summary We have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and "best practices" to make that task manageable. With the growing popularity of clou ... Show More

SE Radio 594: Sean Moriarity on Deep Learning with Elixir and Axon

Sean Moriarity, creator of the Axon deep learning framework, co-creator of the Nx library, and author of Machine Learning in Elixir and Genetic Algorithms in Elixir, published by the Pragmatic Bookshelf, speaks with SE Radio host Gavin Henry about what deep learning (neural netwo ... Show More

Listen to millions of songs and podcasts on Anghami