River for Online Machine Learning in Python
River is a Python library for online machine learning. The library lets you train machine learning models on streaming data.
Introduction
Most traditional machine learning algorithms, from something as simple as linear regression to strong learners like XGBoost, process data in batches. They look at the complete dataset and fit the model to it. When new data becomes available, the model has to be fitted again from scratch on both the old and the new data.
Re-training the model this way poses several challenges. Holding all the data can require a lot of memory, which slows training down. In other cases, the limit is the data storage infrastructure: in applications that keep generating new data, retrieving the older data can be almost impossible.
One solution to these challenges is online training on streaming data. The continuously generated data is treated as a stream, which is why the approach is called stream learning or incremental learning. It is well suited to IoT applications, in which sensors collect data in real time.
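To make the memory argument concrete, here is a minimal sketch in plain Python. It does not use River; the `RunningMean` class and the toy `stream` values are made up for illustration. It contrasts recomputing a statistic from the full history with updating it one value at a time.

```python
class RunningMean:
    """Mean of a stream, updated one value at a time in constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        # Shift the old mean toward the new value by 1/n of the gap.
        self.mean += (x - self.mean) / self.n


stream = [4.0, 8.0, 6.0, 2.0]  # toy values standing in for a data stream

# Batch style: keep every value and recompute from scratch each time.
history = []
for x in stream:
    history.append(x)
    batch_mean = sum(history) / len(history)

# Online style: fold each value into the estimate, then discard it.
running = RunningMean()
for x in stream:
    running.update(x)

print(batch_mean, running.mean)
```

Both approaches arrive at the same mean, but the online version never stores more than a counter and the current estimate, however long the stream runs.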
What is Online Machine Learning?
Online machine learning is a technique for training machine learning models in applications where training on the whole dataset is computationally impractical, or where the data arrives sequentially over time. Because the data is in motion and keeps changing, the model has to capture the behavior of the stream and process each observation as it arrives. The method is useful in settings where the algorithm must adapt dynamically to new patterns in the data over time.
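The idea of adapting to new patterns can be sketched without any library. The example below is an illustration only: the `OnlineLinearModel` class, the `make_stream` helper, the learning rate, and the synthetic data are all assumptions made for this sketch. It trains a one-weight linear model with one stochastic gradient step per sample, then changes the underlying relationship halfway through the stream.

```python
import random

random.seed(0)


def make_stream(n, slope):
    """Yield (x, y) pairs one at a time, with y = slope * x plus small noise."""
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)
        yield x, slope * x + random.gauss(0.0, 0.01)


class OnlineLinearModel:
    """Hypothetical one-weight linear model trained one sample at a time."""

    def __init__(self, lr=0.1):
        self.w = 0.0
        self.lr = lr  # learning rate, an arbitrary choice for this sketch

    def predict_one(self, x):
        return self.w * x

    def learn_one(self, x, y):
        # One gradient step on the squared error of this single sample.
        error = self.predict_one(x) - y
        self.w -= self.lr * error * x


model = OnlineLinearModel()

# First regime: the true slope is 2.
for x, y in make_stream(2000, slope=2.0):
    model.learn_one(x, y)
w_before = model.w

# The pattern changes: the true slope becomes -1, and the model follows.
for x, y in make_stream(2000, slope=-1.0):
    model.learn_one(x, y)
w_after = model.w

print(round(w_before, 2), round(w_after, 2))
```

No sample is ever revisited or stored. The weight first settles near the initial slope and then drifts toward the new one as the stream changes.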
River: The Online Machine Learning Library
River is a Python package for online machine learning. It provides an array of incremental learning algorithms for both supervised and unsupervised learning. It is the result of merging two earlier packages, Creme and Scikit-Multiflow.
Like Creme, River has an API similar to Scikit-learn's, which is why it is sometimes described as the Scikit-learn of online machine learning. It offers a broad set of ML estimators and transformers built specifically for streaming data. Supported models include naïve Bayes, tree-ensemble models, factorization machines, linear models, and many more. A complete listing of algorithms is available here.
Some of the differences between libraries and frameworks for model training on data at rest and on streaming data are as follows:
Model Training on Data at Rest | Model Training on Data in Motion |
---|---|
|
|
River makes it possible to deal with data on the go through online learning, as opposed to offline learning.