Building Predictive Models Over Big Data Using Elastic MapReduce

Earlier this week, Robert Grossman and Collin Bennett from Open Data Group gave a lecture about big data as part of a tutorial at the SC12 Conference in Salt Lake City. They described some of the ways of building predictive models over big data using Hadoop Streaming and Hadoop’s implementation of MapReduce.

They illustrated the lecture with an example of building a predictive model over data provided by the City of Chicago about CTA buses using Amazon’s Elastic MapReduce.
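The mechanics of Hadoop Streaming are simple: the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout. Below is a minimal sketch in Python; the field positions and the per-route delay aggregation are hypothetical placeholders, not the actual layout of the CTA data (in a real job, each function would be wrapped in a loop over stdin and invoked by the Streaming framework).

```python
# Hadoop Streaming mapper step: turn one raw CSV record into a
# key/value line. The field positions (route id in column 0, delay in
# column 2) are hypothetical -- adjust to the actual feed layout.
def map_line(line):
    fields = line.strip().split(",")
    route_id, delay = fields[0], fields[2]
    return "%s\t%s" % (route_id, delay)

# Hadoop Streaming reducer step: average the delays for each route.
# Assumes the input lines arrive grouped by key, as the Hadoop
# shuffle-and-sort phase guarantees.
def reduce_lines(lines):
    results = []
    current, total, count = None, 0.0, 0
    for line in lines:
        key, value = line.strip().split("\t")
        if key != current:
            if current is not None:
                results.append((current, total / count))
            current, total, count = key, 0.0, 0
        total += float(value)
        count += 1
    if current is not None:
        results.append((current, total / count))
    return results
```

Because the mapper and reducer are just programs reading and writing lines, the same scripts can be developed and tested locally on a sample file before being handed to Elastic MapReduce.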

You can find some of the materials for the lecture on the web page. The materials also contain links to some best practices for deploying analytics in operational systems using PMML and PMML-compliant scoring engines, such as Augustus.

Posted in analytic models, big data, Blog, PMML

Open Data Group Tutorial at O’Reilly Strata Conference in NYC

On October 23, 2012, Robert Grossman and Collin Bennett from Open Data Group will give a tutorial at the O’Reilly Strata Conference in New York City on “Best Practices for Building and Deploying Predictive Models over Big Data.”

The slides and some related materials can be downloaded from

The 3.5-hour tutorial consists of 12 modules:

  1. Introduction
  2. Building Predictive Models – EDA and Building Features
  3. Case Study: MalStone
  4. Working with Multiple Models: Ensembles and Segments
  5. Case Study: CTA
  6. Deploying Predictive Models Using PMML-based Scoring Engines
  7. Three Ways to Build Models over Hadoop Using R
  8. Case Study: Building Trees over Big Data
  9. Improving the Impact of a Model In Operations – The SAMS Methodology
  10. Case Study: AdReady
  11. Quantifying the Lift of a Predictive Model and Improving It
  12. Case Study: Matsu

Open Data Group helped pioneer some of the technology behind topics 2, 3, 6, 7, 8 and 9. For example, you can follow the links to learn more about the MalStone Benchmark, the Multiple Model component of the DMG’s PMML standard, and Project Matsu, which uses MapReduce to process and analyze images.

If you are at the Strata Conference, please stop by to say hello.

Posted in analytic models, big data, Blog, news, PMML

Quality Always Takes Time: Custom Analytic Models

In this post, I discuss some of the different options available when building analytic models. For the purposes of this post, a good short definition of analytics is using data to make predictions. The term predictive analytics is applied (appropriately enough) to this type of analytics. A longer definition is to view predictive analytics as building statistically valid models from data that can be used to predict future events, take actions, and make decisions.

In this post, I take the point of view of a business owner of a problem in a company that requires a model, who is considering whether to build the model in-house, outsource it to a vendor providing analytic services, or simply give up on building a model and produce a report instead. I don’t recommend the last option, but unfortunately, in practice, it is all too common.

Broadly speaking, from a business owner’s point of view, there are several phases required to build a model for a new project. The process looks a bit different from the modeler’s point of view. It is also a bit simpler if the same model has been built before and all that is required is to update the model using new data. Here are the basic steps required to build a model from the business owner’s point of view.

  1. Working with IT to obtain all the data required for the project and making it available to the modeler.
  2. Answering questions from the modeler about the data.
  3. Agreeing upon the output of the model.
  4. Reviewing the first model with the modeler.
  5. Reviewing the second and subsequent models with the modeler.
  6. Working with IT to deploy the model.

All steps except for Step 1 and Step 6 are collaborative between the business owner and the modeler. At the beginning of many projects, Step 3 looks obvious. It turns out that it is often not so obvious until the project is near its end, the data has been cleaned, and the deployment is well underway. One way to understand why is that one often doesn’t have a good understanding of the most appropriate output of a model until the data has been cleaned and there is a good understanding of how the model will be deployed in operational systems.

Let’s look at this same process now from the viewpoint of the modeler. To simplify, the following steps are required:

  1. Waiting for the data.
  2. Cleaning the data.
  3. Asking the business owner questions about the data.
  4. Agreeing upon the output of the model.
  5. Developing a set of features for the model.
  6. Estimating the parameters of the model.
  7. Building a measure to evaluate the model.
  8. Evaluating the model using the measure.
  9. Developing post-processing rules for the scores produced by the model.
  10. Repeating the steps above for the second and subsequent versions of the model until everyone is happy, or there is no more time or funding left.
  11. Deploying the model.

Building a new model requires completing all the steps above. Generally, a series of models (version 1 of the model, version 2 of the model, etc.) are produced and reviewed by the business owner and the modeler (Step 10). The more time available for Step 10, the better the quality of the model.
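To make Steps 5 through 8 concrete, here is a deliberately tiny sketch of their shape: derive a feature, estimate the parameters of a model, define an evaluation measure, and apply it. The single feature, the linear model, and the mean-squared-error measure are all illustrative stand-ins; a real project would have many features and a more suitable model and measure.

```python
# Step 5: derive a model feature from a raw record.
# The "usage" field is a hypothetical placeholder.
def build_feature(record):
    return float(record["usage"])

# Step 6: estimate the parameters of a simple one-feature linear
# model by closed-form least squares.
def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Steps 7 and 8: define an evaluation measure (here, mean squared
# error) and evaluate the fitted model with it.
def mse(model, xs, ys):
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)
```

Step 10 then amounts to iterating this loop: revisit the features, refit, re-measure, and review the new version with the business owner.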

To understand these steps a bit better, it might be helpful to review the post about the SAMS methodology. The SAMS methodology explains how to think of models in terms of the Scores they produce, the Actions these enable, the Measures used to evaluate the actions, and whether these actions support a targeted Strategy or not.
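As a toy illustration of the Scores-Actions-Measures part of that viewpoint (the threshold, the action names, and the precision-style measure are invented for this sketch, not part of the methodology itself):

```python
# Action: a rule that maps a model's Score to an Action -- intervene
# on high-scoring cases, do nothing otherwise. The 0.5 threshold is
# illustrative.
def score_to_action(score, threshold=0.5):
    return "intervene" if score >= threshold else "ignore"

# Measure: evaluate the actions against observed outcomes -- here,
# the fraction of true positives among the cases acted upon.
def measure(scores, outcomes, threshold=0.5):
    acted = [o for s, o in zip(scores, outcomes)
             if score_to_action(s, threshold) == "intervene"]
    return sum(acted) / len(acted) if acted else 0.0
```

Whether the actions chosen this way actually support the targeted Strategy is the remaining, and usually hardest, question.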

Sometimes a model has been built before and only some of these steps need to be repeated. For example, refreshing a model only requires completing Steps 6 and 8 for a series of models. Rebuilding a model usually only requires repeating Steps 5, 6, 8 and 9 for a series of models.

Sometimes, the data is supplied in a standard format (for example, it is provided by a third party) and the deployment uses a standard format (for example, all that is required is a list of names and corresponding offers). In this case, after a model has been built once, all that is required when a business owner supplies new data is to perform Steps 6 and 8. Call this a standard model. Standard models are substantially less work to build than models that require completing all the steps above. These more labor-intensive models are often called custom models.
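In code terms, refreshing a standard model is just re-running the estimation and evaluation steps on the new data, with the features, measure, and deployment format held fixed. In the sketch below, `fit` and `evaluate` are hypothetical stand-ins for whatever the original model used.

```python
# Refreshing a standard model: only Step 6 (re-estimate parameters)
# and Step 8 (re-evaluate with the agreed measure) are re-run on the
# newly supplied data. `fit` and `evaluate` are supplied by the
# original modeling project.
def refresh(fit, evaluate, new_xs, new_ys):
    model = fit(new_xs, new_ys)                # Step 6
    quality = evaluate(model, new_xs, new_ys)  # Step 8
    return model, quality
```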

Most requests for models fit into standard categories. For example: models that predict whether a prospect will respond to an offer (response models), whether a customer will remain a customer (attrition models), whether a customer will keep current with their payments (credit models), or whether a transaction is valid or fraudulent (fraud models).
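For a response model, a common evaluation measure is lift: how much better the top-scored prospects respond than the population as a whole. A minimal sketch (the scores, response flags, and the choice of the top fraction are illustrative):

```python
# Lift at a given fraction: rank prospects by score, take the top
# slice, and compare its response rate to the overall response rate.
def lift_at(scores, responses, fraction=0.1):
    ranked = sorted(zip(scores, responses),
                    key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(r for _, r in ranked[:k]) / k
    base_rate = sum(responses) / len(responses)
    return top_rate / base_rate
```

A lift of 5 at the top decile, say, means prospects in the top-scoring 10% respond five times as often as a randomly chosen prospect, which is what makes targeting the offer worthwhile.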

Sometimes, models are built that don’t fit into these familiar categories. Call these new types of models. A new type of model also requires that the modeler develop new types of features, new measures for evaluating the model, and so on. New types of custom models are the most labor-intensive to build.

In practice, it usually takes four to six months or longer to build a custom model, once the data has arrived. As the size and complexity of the data grows, each of the steps usually requires more time.

Posted in analytic models, Blog, custom analytic models, deploying models, predictive analytics