The term “real time analytics” is used in several different ways. In this post, I’ll give you a quick introduction to real time analytics and distinguish between some of the ways the term is used.
Real Time Scoring of Data Using Precomputed Analytic Models
Perhaps the most important use of the term is to describe the real time scoring of a data stream using an analytic model. The most common standard for describing analytic models is the Predictive Model Markup Language, or PMML. In the modeling environment, a trained individual aggregates the data, cleans and preprocesses it, and then uses modeling software to build an analytic model, which can be exported as a PMML file, as in the top half of the diagram below.
This PMML model can then be deployed in operational environments, as in the lower half of the diagram. A stream of data can then be scored using the analytic model in “real time.” Notice that in this case the model does not change automatically; it simply scores the data. Of course, in practice, the team that builds the analytic models also rebuilds them from time to time.
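As a rough illustration of scoring with a fixed model, here is a minimal Python sketch. It assumes a small logistic regression whose coefficients were produced elsewhere; the names FROZEN_MODEL, score, and score_stream, the field names, and the coefficient values are all hypothetical, and no real PMML library is used.

```python
import math
from typing import Dict, Iterable, Iterator

# Hypothetical coefficients, standing in for a model that was built in the
# modeling environment and exported (for example, as a PMML file).
FROZEN_MODEL = {
    "intercept": -1.0,
    "weights": {"amount": 0.002, "num_prior_alerts": 0.5},
}


def score(record: Dict[str, float], model: dict = FROZEN_MODEL) -> float:
    """Return a logistic-regression score; the model is only read, never changed."""
    z = model["intercept"] + sum(
        weight * record.get(name, 0.0) for name, weight in model["weights"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def score_stream(records: Iterable[Dict[str, float]]) -> Iterator[float]:
    """Score records one at a time, as they arrive."""
    for record in records:
        yield score(record)


# A short list stands in for the incoming data stream.
stream = [
    {"amount": 500.0, "num_prior_alerts": 0.0},
    {"amount": 100.0, "num_prior_alerts": 4.0},
]
scores = list(score_stream(stream))
```

The point of the sketch is that nothing in the scoring path writes to the model. Changing it means rebuilding it elsewhere and redeploying.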
Real Time Scoring of Data Using Continuously Updated Models
In the first example, the analytic model was built over all of the data and did not change until it was rebuilt on new data. Some analytic models, however, can be updated incrementally with each new data record that comes along.
A good example is the nearest neighbor (NN) model. The NN model contains all the data over which it was built and simply assigns to each record being scored the label of the nearest record in the model. There is an obvious incremental version of the algorithm: adding each new data point to the current model produces the new model.
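The incremental nearest neighbor idea can be sketched in a few lines of Python. This is an illustration only: the class name IncrementalNearestNeighbor, the choice of squared Euclidean distance, and the sample labels are assumptions, not something the post specifies.

```python
from typing import List, Sequence, Tuple


class IncrementalNearestNeighbor:
    """1-nearest-neighbor model that grows by one point per update."""

    def __init__(self) -> None:
        self._points: List[Tuple[Tuple[float, ...], str]] = []

    def update(self, features: Sequence[float], label: str) -> None:
        # The "new model" is simply the current model plus this point.
        self._points.append((tuple(features), label))

    def predict(self, features: Sequence[float]) -> str:
        if not self._points:
            raise ValueError("the model has not seen any data yet")

        def squared_distance(point: Tuple[float, ...]) -> float:
            return sum((a - b) ** 2 for a, b in zip(point, features))

        # Return the label of the closest stored point.
        _, label = min(self._points, key=lambda item: squared_distance(item[0]))
        return label


model = IncrementalNearestNeighbor()
model.update((0.0, 0.0), "low")
model.update((10.0, 10.0), "high")
before = model.predict((2.0, 1.0))   # closest stored point is (0, 0)

model.update((2.0, 2.0), "medium")   # one new record changes the model
after = model.predict((2.0, 1.0))    # closest stored point is now (2, 2)
```

Here the same query gets a different label once a single new record has been added, with no separate rebuild step.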
Note that in practice continuously updated models like this run on computers that have a finite amount of memory and a finite amount of disk.
The subject of streaming analytics is concerned with how to build analytic models when data is presented incrementally and only a finite amount of memory and storage is available.
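One simple way to respect a memory limit, shown here purely as an assumption rather than as the method this post has in mind, is to keep only the most recent points in a fixed-size window. The class name WindowedNearestNeighbor and the max_points parameter are hypothetical.

```python
from collections import deque
from typing import Deque, Sequence, Tuple


class WindowedNearestNeighbor:
    """Nearest-neighbor model that keeps at most max_points recent records."""

    def __init__(self, max_points: int) -> None:
        # A deque with maxlen silently drops the oldest item when it is full.
        self._points: Deque[Tuple[Tuple[float, ...], str]] = deque(maxlen=max_points)

    def update(self, features: Sequence[float], label: str) -> None:
        self._points.append((tuple(features), label))

    def predict(self, features: Sequence[float]) -> str:
        if not self._points:
            raise ValueError("the model has not seen any data yet")

        def squared_distance(point: Tuple[float, ...]) -> float:
            return sum((a - b) ** 2 for a, b in zip(point, features))

        _, label = min(self._points, key=lambda item: squared_distance(item[0]))
        return label

    def __len__(self) -> int:
        return len(self._points)


windowed = WindowedNearestNeighbor(max_points=2)
windowed.update((0.0, 0.0), "a")
windowed.update((10.0, 10.0), "b")
windowed.update((20.0, 20.0), "c")      # evicts the oldest point, (0, 0)
prediction = windowed.predict((1.0, 1.0))
```

The trade-off is that old data is forgotten: the query at (1, 1) is no longer matched to the evicted point, even though that point was the closest one ever seen.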
Real time analytics with continuously updated models is essentially the first use case, but instead of the model being periodically rebuilt by hand, it is updated with each new record that it consumes.
Event Stream Processing
More recently, systems such as Storm and S4 have used an elastic, scale-out architecture to process streams of data in parallel in real time. Most often rules are used to process the data, but analytic models could also be used.
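To make the idea of rule-based processing over partitioned streams a little more concrete, here is a toy Python sketch. It does not use Storm or S4 and does not reflect their APIs; the rules, the event fields, and the round-robin partitioning are all assumptions, with a thread pool standing in for a cluster of workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Event = Dict[str, float]
Rule = Tuple[str, Callable[[Event], bool]]

# Hypothetical rules: each is a name plus a predicate over a single event.
RULES: List[Rule] = [
    ("large_amount", lambda event: event["amount"] > 1000),
    ("negative_amount", lambda event: event["amount"] < 0),
]


def apply_rules(events: List[Event]) -> List[Tuple[float, str]]:
    """Return (event id, rule name) for every rule that fires in one partition."""
    return [
        (event["id"], name)
        for event in events
        for name, fires in RULES
        if fires(event)
    ]


def process_in_parallel(events: List[Event], workers: int = 2) -> List[Tuple[float, str]]:
    # Round-robin partitioning, a stand-in for whatever grouping a real
    # stream processing system would provide.
    partitions = [events[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(apply_rules, partitions))
    return sorted(alert for partition in results for alert in partition)


events = [
    {"id": 1, "amount": 50.0},
    {"id": 2, "amount": 5000.0},
    {"id": 3, "amount": -20.0},
]
alerts = process_in_parallel(events)
```

Because each rule looks at one event at a time, the partitions can be processed independently. Swapping a predicate for a call to a scoring function would be one way to use an analytic model in place of a rule.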
In a later post, we will discuss this third example of real time processing in more detail.