<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Open Data Group</title>
	<atom:link href="http://opendatagroup.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://opendatagroup.com</link>
	<description>Open Data builds predictive models over big data.</description>
	<lastBuildDate>Thu, 11 Apr 2013 00:45:22 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Open Data Founder named to Federal 100 Award List</title>
		<link>http://opendatagroup.com/2013/02/13/open-data-founder-named-to-federal-100-award-list/</link>
		<comments>http://opendatagroup.com/2013/02/13/open-data-founder-named-to-federal-100-award-list/#comments</comments>
		<pubDate>Wed, 13 Feb 2013 18:59:39 +0000</pubDate>
		<dc:creator>Jenna</dc:creator>
				<category><![CDATA[analytic strategy]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[news]]></category>
		<category><![CDATA[predictive analytics]]></category>
		<category><![CDATA[Federal 100]]></category>
		<category><![CDATA[government IT]]></category>
		<category><![CDATA[Open Data]]></category>
		<category><![CDATA[Robert Grossman]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=297</guid>
		<description><![CDATA[Robert Grossman, founding partner of Open Data, has been named by Federal Computer Week to its Federal 100 Award list.  The 24th annual list recognizes government and industry leaders who have played pivotal roles in the federal government IT community.  &#8230; <a href="http://opendatagroup.com/2013/02/13/open-data-founder-named-to-federal-100-award-list/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Robert Grossman, founding partner of Open Data, has been named by Federal Computer Week to its <strong>Federal 100 Award list.</strong> The 24<sup>th</sup> annual list recognizes government and industry leaders who have played pivotal roles in the federal government IT community. Dr. Grossman is part of an elite group of individuals <a href="http://opendatagroup.com/files/2013/02/Fed100.png"><img class="wp-image-299 alignright" alt="Fed100" src="http://opendatagroup.com/files/2013/02/Fed100-300x300.png" width="231" height="247" /></a>who have gone above and beyond their daily responsibilities and have made a difference in the way technology has transformed or accelerated the mission of the agencies they support. He and the other winners will be honored during ceremonies on March 20, 2013 at the Grand Hyatt in Washington, DC.</p>
<p>Grossman is widely recognized as an expert in large data, including the design of analytic architectures deployed in cloud environments involving petabytes of data. Dr. Grossman has served in a variety of key advisory capacities over his career, helping government agencies meet the complex challenges of information technology, security and oversight.</p>
<p>In addition to his work with Open Data&#8217;s commercial and government clients, Grossman is a faculty member at the University of Chicago, where he is the Director of Informatics at the Institute for Genomics and Systems Biology, a Senior Fellow at the Computation Institute, and a Professor of Medicine in the Section of Genetic Medicine. He also serves on the NASA Advisory Council, chairs the Open Cloud Consortium, founded the Data Mining Group, is a Visiting Professor at the Booth School of Management, and is a frequent speaker and author in the area of big data and intensive computing.</p>
<p>Open Data is proud of the well-deserved recognition, and pleased that Robert Grossman&#8217;s significant contributions are being honored in this way.</p>
<p><a title="More About the Fed100 Award" href="http://fcw.com/events/2013-fed-100/home.aspx">More about the Fed100 Award</a></p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2013/02/13/open-data-founder-named-to-federal-100-award-list/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Automation, Algorithms, Predictive Models and All That</title>
		<link>http://opendatagroup.com/2012/12/06/automation-algorithms-and-models/</link>
		<comments>http://opendatagroup.com/2012/12/06/automation-algorithms-and-models/#comments</comments>
		<pubDate>Thu, 06 Dec 2012 21:22:02 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic models]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=285</guid>
		<description><![CDATA[Earlier this week, I was one of the speakers at a panel that discussed how automation, algorithms, predictive models, and related technology have changed our lives. The event was kicked off by Christopher Steiner, author of Automate This: How Algorithms Came &#8230; <a href="http://opendatagroup.com/2012/12/06/automation-algorithms-and-models/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Earlier this week, I was one of the speakers at a panel that discussed how automation, algorithms, predictive models, and related technology have changed our lives.</p>
<p>The event was kicked off by Christopher Steiner, author of <a href="http://www.amazon.com/dp/1591844924">Automate This: How Algorithms Came to Rule Our World</a>, who talked about some of the ways that algorithms are changing our lives, ranging from high speed trading to medical diagnoses.</p>
<p>In addition to myself, the panel included:</p>
<ul>
<li>Rayid Ghani, Chief Scientist of the <a href="http://www.barackobama.com/">Obama for America</a> 2012 campaign</li>
<li>George John, Founder and CEO, <a href="http://rocketfuel.com/">rocketfuel</a></li>
<li>Keary Philips, <a href="http://www.allstate.com/">Allstate</a> Insurance Company</li>
<li>Rishad Tobaccowala, Chief Strategy and Innovation Officer, <a href="http://www.vivaki.com/">VivaKi</a></li>
</ul>
<p>Rayid Ghani spoke about some of the ways that predictive analytics was used to persuade people who might not otherwise have voted to register, and later to show up at the polls and vote.</p>
<p>I discussed that, as important as algorithms are, they are sometimes best thought of in the context of three questions: i) what are the <b>C</b>oncepts and abstractions used to model the problem; ii) what are the <b>A</b>lgorithms used to compute with these abstractions; and iii) what are the <b>D</b>evices that the algorithms run on? CAD for short.</p>
<p>It is interesting to look at big data from the perspective of the concepts, algorithms and devices.  We are better at predictive analytics today not just because we have better algorithms, but also because we have made significant progress on the concepts and abstractions that underlie predictive analytics and on the devices we use.</p>
<p>For example, 20 years ago with big data and predictive analytics, the focus was on building a single statistical model and looking for knowledge; we generally used regression algorithms to analyze data; and we used high end workstations for the computations. Today, with big data, we tend to think of collections of models (ensembles, cubes of models, etc.) and focus on the actions (not the knowledge) that are possible; we would more typically use algorithms that compute trees or support vector machines; and we do computations over clusters of workstations.</p>
<p>There is more about CAD in Chapter 1 of my book on the <a href="http://www.amazon.com/The-Structure-Digital-Computing-Mainframes/dp/1936298007">Structure of Digital Computing</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/12/06/automation-algorithms-and-models/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Building Predictive Models Over Big Data Using Elastic MapReduce</title>
		<link>http://opendatagroup.com/2012/11/14/building-predictive-models-over-emr/</link>
		<comments>http://opendatagroup.com/2012/11/14/building-predictive-models-over-emr/#comments</comments>
		<pubDate>Wed, 14 Nov 2012 20:25:31 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic models]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[PMML]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=271</guid>
		<description><![CDATA[Earlier this week, Robert Grossman and Collin Bennett from Open Data Group gave a lecture as part of a tutorial at the SC 12 Conference in Salt Lake City about big data. They described some of the ways of building &#8230; <a href="http://opendatagroup.com/2012/11/14/building-predictive-models-over-emr/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Earlier this week, Robert Grossman and Collin Bennett from Open Data Group gave a lecture as part of a <a href="http://sc12.supercomputing.org/schedule/event_detail.php?evid=tut137">tutorial</a> at the SC 12 Conference in Salt Lake City about big data.  They described some of the ways of building predictive models over big data using Hadoop streams and Hadoop&#8217;s implementation of MapReduce.</p>
<p>They illustrated the lecture with an example of building a predictive model over data provided by the <a href="https://data.cityofchicago.org/">City of Chicago</a> about CTA buses using Amazon&#8217;s <a href="http://aws.amazon.com">Elastic MapReduce</a>.</p>
<p>You can find some of the materials for the lecture on the web page <a href="http://tutorials.opendatagroup.com">tutorials.opendatagroup.com</a>.  The materials also contain links to some best practices for deploying analytics in operational systems using PMML and PMML-compliant scoring engines, such as <a href="http://augustus.googlecode.com">Augustus</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/11/14/building-predictive-models-over-emr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open Data Group Tutorial at O&#8217;Reilly Strata Conference in NYC</title>
		<link>http://opendatagroup.com/2012/10/21/open-data-group-tutorial-at-oreilly-strata-conference-in-nyc/</link>
		<comments>http://opendatagroup.com/2012/10/21/open-data-group-tutorial-at-oreilly-strata-conference-in-nyc/#comments</comments>
		<pubDate>Sun, 21 Oct 2012 21:37:22 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic models]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[news]]></category>
		<category><![CDATA[PMML]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=251</guid>
		<description><![CDATA[On October 23, 2012, Robert Grossman and Collin Bennett from Open Data Group will give a tutorial at the O&#8217;Reilly Strata Conference in New York City on &#8220;Best Practices for Building and Deploying Predictive Models over Big Data.&#8221; The slides &#8230; <a href="http://opendatagroup.com/2012/10/21/open-data-group-tutorial-at-oreilly-strata-conference-in-nyc/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>On October 23, 2012, <a href="http://rgrossman.com">Robert Grossman</a> and Collin Bennett from Open Data Group will give a tutorial at the O&#8217;Reilly <a href="http://strataconf.com/">Strata Conference</a> in New York City on &#8220;<a href="http://strataconf.com/stratany2012/public/schedule/detail/25440">Best Practices for Building and Deploying Predictive Models</a> over Big Data.&#8221;</p>
<p>The slides and some related materials can be downloaded from <a href="http://tutorials.opendatagroup.com">tutorials.opendatagroup.com</a>.</p>
<p><a href="http://opendatagroup.com/files/2012/10/strata-nyc.jpg"><img src="http://opendatagroup.com/files/2012/10/strata-nyc.jpg" alt="" title="O&#039;Reilly Strata Conference" width="252" height="182" class="alignright size-full wp-image-261" /></a></p>
<p>The 3.5-hour tutorial consists of 12 modules:</p>
<ol>
<li>Introduction</li>
<li>Building Predictive Models – EDA and Building Features</li>
<li>Case Study: MalStone</li>
<li>Working with Multiple Models: Ensembles and Segments</li>
<li>Case Study: CTA</li>
<li>Deploying Predictive Models Using PMML-based Scoring Engines</li>
<li>Three Ways to Build Models over Hadoop Using R</li>
<li>Case Study: Building Trees over Big Data</li>
<li>Improving the Impact of a Model In Operations &#8211; The SAMS Methodology</li>
<li>Case Study: AdReady</li>
<li>Quantifying the Lift of a Predictive Model and Improving It</li>
<li>Case Study: Matsu</li>
</ol>
<p>Open Data Group helped pioneer some of the technology behind topics 2, 3, 6, 7, 8 and 9.  For example, you can follow the links to learn more about the <a href="http://code.google.com/p/malgen/wiki/Malstone">MalStone Benchmark</a>, the <a href="http://www.dmg.org/v4-1/MultipleModels.html">Multiple Model</a> component of the DMG&#8217;s <a href="http://www.dmg.org">PMML</a> standard, and <a href="http://matsu.opensciencedatacloud.org">Project Matsu</a>, which uses MapReduce to process and analyze images. </p>
<p>If you are at the Strata Conference, please stop by to say hello.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/10/21/open-data-group-tutorial-at-oreilly-strata-conference-in-nyc/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Quality Always Takes Time: Custom Analytic Models</title>
		<link>http://opendatagroup.com/2012/09/10/custom-analytic-model/</link>
		<comments>http://opendatagroup.com/2012/09/10/custom-analytic-model/#comments</comments>
		<pubDate>Mon, 10 Sep 2012 20:59:49 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic models]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[custom analytic models]]></category>
		<category><![CDATA[deploying models]]></category>
		<category><![CDATA[predictive analytics]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=278</guid>
		<description><![CDATA[In this post, I discuss some of the different options available when building analytic models. For the purposes here, a good short definition of analytics is to view analytics as using data to make predictions. The term predictive analytics is &#8230; <a href="http://opendatagroup.com/2012/09/10/custom-analytic-model/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>In this post, I discuss some of the different options available when building analytic models. For the purposes here, a good short definition of analytics is to view analytics as using data to make predictions. The term predictive analytics is applied (appropriately enough) to this type of analytics. A longer definition is to view predictive analytics as building statistically valid models from data that can be used to make predictions about future events, to take actions, and to make decisions.</p>
<p><a href="http://opendatagroup.com/files/2012/12/iStock_000010322827XSmall.jpg"><img src="http://opendatagroup.com/files/2012/12/iStock_000010322827XSmall-300x199.jpg" alt="" title="Custom models take time." width="300" height="199" class="alignleft size-medium wp-image-279" /></a></p>
<p>In this post, the point of view is that of a business owner in a company who has a problem that requires a model and who is considering whether to build the model in-house, outsource the model to a vendor providing analytic services, or simply give up on building a model and produce a report instead. I don’t recommend the last option, but unfortunately, in practice, it is all too common.</p>
<p>Broadly speaking, from a business owner’s point of view, there are several phases required to build a model for a new project. The process looks a bit different from the modeler’s point of view. It is also a bit simpler if the same model has been built before and all that is required is to update the model using new data. Here are the basic steps required to build a model from the business owner’s point of view.</p>
<ol>
<li>Working with IT to obtain all the data required for the project and making it available to the modeler.</li>
<li>Answering questions from the modeler about the data.</li>
<li>Agreeing upon the output of the model.</li>
<li>Reviewing the first model with the modeler.</li>
<li>Reviewing the second and subsequent models with the modeler.</li>
<li>Working with IT to deploy the model.</li>
</ol>
<p>All steps except for Step 1 and Step 6 are collaborative between the business owner and the modeler. At the beginning of many projects, Step 3 looks obvious. It turns out that it is often not so obvious until the project is near its end, the data has been cleaned, and the deployment is well underway. The reason is that one often doesn’t have a good understanding of the most appropriate output of a model until the data has been cleaned and there is a good understanding of how the model will be deployed in operational systems.</p>
<p>Let’s look at this same process now from the viewpoint of the modeler. To simplify, the following steps are required:</p>
<ol>
<li>Waiting for the data.</li>
<li>Cleaning the data.</li>
<li>Asking the business owner questions about the data.</li>
<li>Agreeing upon the output of the model.</li>
<li>Developing a set of features for the model.</li>
<li>Estimating the parameters of the model.</li>
<li>Building a measure to evaluate the model.</li>
<li>Evaluating the model using the measure.</li>
<li>Developing post-processing rules for the scores produced by the model.</li>
<li>Repeating the steps above for the second and subsequent versions of the model until everyone is happy, or there is no more time or funding left.</li>
<li>Deploying the model.</li>
</ol>
<p>Building a new model requires completing all the steps above. Generally, a series of models (version 1 of the model, version 2 of the model, etc.) are produced and reviewed by the business owner and the modeler (Step 10). The more time available for Step 10, the better the quality of the model.</p>
<p>To understand these steps a bit better, it might be helpful to review the post about the SAMS methodology. The SAMS methodology explains how to think of models in terms of the <b>S</b>cores they produce, the <b>A</b>ctions these enable, the <b>M</b>easures used to evaluate the actions, and whether these actions support a targeted <b>S</b>trategy or not.</p>
<p>Sometimes a model has been built before and only some of these steps need to be repeated. For example, refreshing a model only requires completing Steps 6 and 8 for a series of models. Rebuilding a model usually only requires repeating Steps 5, 6, 8 and 9 for a series of models.</p>
<p>Sometimes, the data is supplied in a standard format (for example, it is provided by a third party) and the deployment uses a standard format (for example, all that is required is a list of names and corresponding offers). In this case, after a model has been built once, all that is required when a business owner supplies new data is to perform Steps 6 and 8. Call this a standard model. Standard models are substantially less work to build than models that require completing all the steps above. These more labor intensive models are often called <em>custom models</em>.</p>
<p>Most requests for models fit into some standard categories. Examples include models that predict whether a prospect will respond to an offer (response models), whether a customer will remain a customer (attrition models), whether a customer will keep current with their payments (credit models), and whether a transaction is valid or fraudulent (fraud models).</p>
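<p>As a rough illustration of what refreshing a standard model involves, the sketch below repeats only Step 6 (estimating parameters) and Step 8 (evaluating with the existing measure) on new data. The one-feature linear model, the mean absolute error measure, the function names, and the sample data are all assumptions chosen to keep the example small; they are not taken from any particular project.</p>

```python
from statistics import mean


def estimate_parameters(xs, ys):
    """Step 6: fit y = a + b * x by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def mean_absolute_error(params, xs, ys):
    """Step 8: apply the measure that was already chosen in Step 7."""
    a, b = params
    return mean(abs(y - (a + b * x)) for x, y in zip(xs, ys))


def refresh(train, holdout):
    """Refresh a standard model: features and measure stay fixed."""
    params = estimate_parameters(*train)
    return params, mean_absolute_error(params, *holdout)


# Hypothetical new data delivered in the same standard format as before.
new_train = ([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
new_holdout = ([5.0, 6.0], [11.0, 13.0])
params, error = refresh(new_train, new_holdout)
print(params, error)
```

<p>Nothing in <code>refresh</code> touches data cleaning, feature design, or the choice of measure, which is why a refresh is so much less work than a custom model.</p>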
<p>Sometimes, models that don’t fit into these familiar categories of models are built. Call these <em>new types</em> of models. A new type of model also requires that the modeler develop new types of features, new types of measures for evaluating the models, etc. New types of custom models are the most labor intensive to build.</p>
<p>In practice, it usually takes four to six months or longer to build a custom model, once the data has arrived. As the size and complexity of the data grows, each of the steps usually requires more time.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/09/10/custom-analytic-model/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Open Data Group Founder Interviewed by WashingtonExec</title>
		<link>http://opendatagroup.com/2012/07/29/washingtonexec-intervie/</link>
		<comments>http://opendatagroup.com/2012/07/29/washingtonexec-intervie/#comments</comments>
		<pubDate>Sun, 29 Jul 2012 22:51:06 +0000</pubDate>
		<dc:creator>opendatagroup</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[news]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=230</guid>
		<description><![CDATA[There is pretty good practice out there of how to build data warehouses. There is not a lot of good practice or knowledge out there about how to build statistical models over big data. The quote above is from an &#8230; <a href="http://opendatagroup.com/2012/07/29/washingtonexec-intervie/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<blockquote><p>There is pretty good practice out there of how to build data warehouses. There is not a lot of good practice or knowledge out there about how to build statistical models over big data. </p></blockquote>
<p>The quote above is from an interview by WashingtonExec of Open Data Group Founder Robert Grossman.  He was interviewed about big data, predictive modeling and related topics.  You can find the interview <a href="http://www.washingtonexec.com/2012/07/interview-with-robert-grossman-12-rules-for-success-and-data-driven-decision-support/">here</a>.</p>
<p><a href="http://www.washingtonexec.com/2012/07/interview-with-robert-grossman-12-rules-for-success-and-data-driven-decision-support/"><img src="http://opendatagroup.com/files/2012/07/washingtonexec-logo-300x48.png" alt="" title="WashingtonExec" width="300" height="48" class="alignleft size-medium wp-image-240" /></a></p>
<p>In the interview, he briefly discusses some of the rules he has developed over the years for building predictive models over big data.  One of the top three is: &#8220;Do you have an environment where you can deploy the models you build into operational systems?&#8221;   </p>
<p>Open Data Group often uses the <a href="http://augustus.googlecode.com">Augustus</a> system for deploying models into operational systems.   Augustus is open source and follows the <a href="http://www.dmg.org">PMML</a> standard.   It supports segmented models and pre-processing of the inputs to models and post-processing of the scores produced by models.  Augustus support for pre- and post-processing was described in a recent <a href="http://opendatagroup.com/2012/05/14/scores-models-and-rules/">post</a>.</p>
<p>He was also asked about the disruptive nature of predictive modeling over big data.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/07/29/washingtonexec-intervie/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Real Time Predictive Analytics</title>
		<link>http://opendatagroup.com/2012/07/26/real-time-predictive-analytics/</link>
		<comments>http://opendatagroup.com/2012/07/26/real-time-predictive-analytics/#comments</comments>
		<pubDate>Thu, 26 Jul 2012 04:05:09 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic models]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[PMML]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=216</guid>
		<description><![CDATA[Real time analytics is used in several different ways. In this post, I&#8217;ll give you a quick introduction to real time analytics and distinguish between some of the ways the term is used. Real Time Scoring of Data Using Precomputed &#8230; <a href="http://opendatagroup.com/2012/07/26/real-time-predictive-analytics/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Real time analytics is used in several different ways.  In this post, I&#8217;ll give you a quick introduction to real time analytics and distinguish between some of the ways the term is used.</p>
<h3>Real Time Scoring of Data Using Precomputed Analytic Models</h3>
<p>Perhaps the most important way that the term is used is to describe the real time scoring of a data stream using an analytic model. The most common standard for describing analytic models is the Predictive Model Markup Language or <a href="http://www.dmg.org">PMML</a>. In the modeling environment, a trained individual aggregates the data, cleans and preprocesses the data, and then uses modeling software to build an analytic model, which can be exported as a PMML file, as in the top half of the diagram below.</p>
<p>This PMML model can then be deployed in operational environments, as in the lower half of the diagram.  A stream of data can then be scored using the analytic model in &#8220;real time.&#8221;   Notice that in this case, the model does not change automatically, it simply scores the data.  Of course, in practice, the team that builds the analytic models also rebuilds them from time to time.</p>
<p><a href="http://opendatagroup.com/files/2012/07/pmml-producer-consumer.jpg"><img src="http://opendatagroup.com/files/2012/07/pmml-producer-consumer-300x194.jpg" alt="" title="Analytic Model Producers and Consumers" width="300" height="194" class="alignleft size-medium wp-image-220" /></a></p>
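<p>A minimal sketch of this first use case follows. It assumes a logistic regression model whose intercept and coefficients were estimated elsewhere; the dictionary <code>MODEL</code> is a hypothetical stand-in for what a PMML file would carry, and no PMML parsing is shown. The field names and values are invented for illustration.</p>

```python
import math

# Hypothetical stand-in for a model exported by the modeling environment.
# A real deployment would read these values from a PMML file.
MODEL = {"intercept": -1.0, "coefficients": {"visits": 0.5, "purchases": 1.0}}


def score(model, record):
    """Score one record with a fixed logistic regression model."""
    z = model["intercept"] + sum(
        weight * record.get(name, 0.0)
        for name, weight in model["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def score_stream(model, records):
    """Yield one score per record as it arrives; the model is never modified."""
    for record in records:
        yield score(model, record)


stream = [{"visits": 2, "purchases": 0}, {"visits": 0, "purchases": 3}]
scores = list(score_stream(MODEL, stream))
print(scores)
```

<p>The scoring loop only reads the model. Replacing <code>MODEL</code> with a rebuilt version is a separate, occasional step carried out by the modeling team.</p>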
<h3>Real Time Scoring of Data Using Continuously Updated Models</h3>
<p>In the first example, the analytic model was built over <em>all</em> of the data and did not change until it was rebuilt on new data. Some analytic models have the property that they can be updated incrementally with each new data record that comes along.</p>
<p>A good example is the nearest neighbor (NN) model. The NN model contains all the data over which it was built and simply assigns to a record to be scored the label of the nearest record in the model. There is an obvious incremental version of the algorithm: the new model is the current model with each new data point added to it.</p>
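<p>The incremental nearest neighbor idea can be sketched in a few lines. The class name, the use of Euclidean distance, and the sample records are assumptions made for illustration only.</p>

```python
import math


class NearestNeighborModel:
    """Nearest neighbor model that is updated one labeled record at a time."""

    def __init__(self):
        self.records = []  # list of (features, label) pairs

    def update(self, features, label):
        """Incremental update: the new model is the old model plus one record."""
        self.records.append((tuple(features), label))

    def score(self, features):
        """Return the label of the stored record nearest to `features`."""
        if not self.records:
            raise ValueError("model has no records yet")
        nearest = min(self.records, key=lambda r: math.dist(r[0], features))
        return nearest[1]


model = NearestNeighborModel()
model.update([0.0, 0.0], "low")
model.update([10.0, 10.0], "high")
first = model.score([1.0, 2.0])

model.update([1.0, 1.0], "medium")  # one more record changes later scores
second = model.score([1.0, 2.0])
print(first, second)
```

<p>The same query receives a different label after a single update, which is the behavior that distinguishes a continuously updated model from a precomputed one. The list of stored records grows with every update, so this naive version eventually runs into resource limits.</p>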
<p>Note that in practice continuously updated models like this run on computers that have a finite amount of memory and a finite amount of disk storage.</p>
<p>The subject of streaming analytics is concerned with how to build analytic models in which data is presented incrementally and there is a finite amount of memory and storage.</p>
<p>Real time analytics with continuously updated models is essentially the first use case but instead of periodically updating the model manually, models are updated with each new record that they consume.</p>
<h3>Event Stream Processing </h3>
<p>More recently, systems such as <a href="https://github.com/nathanmarz/storm/">Storm</a> and <a href="http://incubator.apache.org/s4/">S4</a> use an elastic scale-out architecture to process streams of data in parallel in real time. Most often rules are used to process the data, but analytic models could also be used.</p>
<p>In a later post, we will discuss this third example of real time processing in more detail.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/07/26/real-time-predictive-analytics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Analytic Maturity of an Organization</title>
		<link>http://opendatagroup.com/2012/06/17/the-analytic-maturity-of-an-organization/</link>
		<comments>http://opendatagroup.com/2012/06/17/the-analytic-maturity-of-an-organization/#comments</comments>
		<pubDate>Sun, 17 Jun 2012 17:44:19 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[analytic maturity model]]></category>
		<category><![CDATA[analytic strategy]]></category>
		<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=200</guid>
		<description><![CDATA[For the past several years I have been working on a means to measure the analytic maturity of an organization and a framework that can be used to improve an organization&#8217;s analytic maturity. You can think of this as roughly the &#8230; <a href="http://opendatagroup.com/2012/06/17/the-analytic-maturity-of-an-organization/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>For the past several years I have been working on a means to measure the analytic maturity of an organization and a framework that can be used to improve an organization&#8217;s analytic maturity.</p>
<p>You can think of this as roughly the analog, for analytics and predictive modeling, of the <a href="http://www.sei.cmu.edu/reports/93tr024.pdf">Capability Maturity Model</a> for Software that was developed by the Software Engineering Institute at CMU about 20 years ago.</p>
<p><a href="http://www.predictiveanalyticsworld.com/chicago/2012/"><img class="alignleft size-full wp-image-199" title="Chicago PAW" src="http://opendatagroup.com/files/2012/06/hear-me-speak-paw.jpeg" alt="" width="125" height="125" /></a></p>
<p>I recently finished a survey of over 25 companies that I used to fine tune this model.  Here is a high level summary of the five different analytic maturity levels of an organization:</p>
<p><strong>Level 1 &#8211; Analytic reporting. </strong>An Analytic Maturity Level 1 organization has the ability to analyze data and to build reports.</p>
<p><strong>Level 2 &#8211; Analytic modeling. </strong> An Analytic Maturity Level 2 organization has the ability to build predictive models over data.</p>
<p><strong>Level 3 &#8211; Repeatable analytics.</strong> An Analytic Maturity Level 3 organization uses a repeatable process for building and deploying analytic models.</p>
<p><strong>Level 4 &#8211; Enterprise level analytics.</strong> An Analytic Maturity Level 4 organization uses a repeatable process for building analytic models throughout an enterprise and integrates these models together to improve operations.</p>
<p><strong>Level 5 &#8211; Strategy driven analytics. </strong> An Analytic Maturity Level 5 organization has an analytic strategy and a senior leader in charge of the analytics strategy and the analytic governance.</p>
<p>On June 25, 2012, I&#8217;ll be speaking at <a href="http://www.predictiveanalyticsworld.com/chicago/2012/">Predictive Analytics World</a> in Chicago on the Analytic Maturity Model and some steps that organizations can take to improve their Analytic Maturity  Level.</p>
<p>The following two forthcoming publications contain more detail about the Analytic Maturity Model:</p>
<ol>
<li>Robert L. Grossman, The Strategic Dimensions of Data, Chapter 10, Open Data Press, 2012, to appear.</li>
<li>Robert L. Grossman and Kevin Siegel, An Organizational Maturity Model for Analytics, to appear.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/06/17/the-analytic-maturity-of-an-organization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scores, Models and Rules</title>
		<link>http://opendatagroup.com/2012/05/14/scores-models-and-rules/</link>
		<comments>http://opendatagroup.com/2012/05/14/scores-models-and-rules/#comments</comments>
		<pubDate>Mon, 14 May 2012 04:00:45 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=188</guid>
		<description><![CDATA[A common approach to deploying predictive models in operational systems in which data arrives event by event is to implement an event loop: an event associated with an entity comes in, say a user (the entity) visits a page (the &#8230; <a href="http://opendatagroup.com/2012/05/14/scores-models-and-rules/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>A common approach to deploying predictive models in operational systems in which data arrives event by event is to implement an <em>event loop</em>:</p>
<ol>
<li>an event associated with an entity comes in, say a user (the entity) visits a page (the event) on a web site</li>
<li>the features associated with the entity are retrieved, say features that describe the user and his or her behavior at the site, and the features are updated with the event</li>
<li>the features are used as the input to a predictive model, which produces a score, say the likelihood that the user will click on an ad in a certain category</li>
<li>an action is taken on the basis of the score, say to display an ad in a certain category if the score is above a threshold</li>
</ol>
<p>This may be easier to remember if you use the acronym EEFM for <b>E</b>vent, <b>E</b>ntity, <b>F</b>eatures, and <b>M</b>odel.  If the events are all available at the same time, the feature vectors for all the entities can be created at once and then scored &#8220;in batch&#8221;.</p>
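<p>To make the event loop concrete, here is a minimal Python sketch of the four steps, plus a batch variant. It is an illustration only and not Augustus code: the names <code>new_store</code>, <code>update_features</code>, <code>toy_model</code>, <code>choose_action</code>, <code>process_event</code> and <code>score_batch</code>, the event fields, and the 0.5 threshold are all hypothetical, and the &#8220;model&#8221; is a stand-in function rather than a trained model.</p>
<pre><sourcecode language="python">
from collections import defaultdict

THRESHOLD = 0.5  # hypothetical cutoff for showing a category ad


def new_store():
    """Hypothetical in-memory feature store: entity id to feature dict."""
    return defaultdict(lambda: {"visits": 0, "sports_visits": 0})


def update_features(features, event):
    """Update the entity's features with the new event."""
    features["visits"] += 1
    if event["category"] == "sports":
        features["sports_visits"] += 1
    return features


def toy_model(features):
    """Stand-in for a predictive model; returns a score between 0 and 1."""
    return features["sports_visits"] / features["visits"]


def choose_action(score):
    """Take an action on the basis of the score."""
    return "show_sports_ad" if score > THRESHOLD else "show_default_ad"


def process_event(store, event):
    """One pass through the event loop for a single event."""
    features = store[event["user"]]               # event arrives; look up the entity
    features = update_features(features, event)   # update its features
    score = toy_model(features)                   # score the features
    return score, choose_action(score)            # act on the score


def score_batch(events):
    """Batch variant: build every feature vector first, then score them all."""
    store = new_store()
    for event in events:
        update_features(store[event["user"]], event)
    return {user: toy_model(features) for user, features in store.items()}
</sourcecode></pre>
<p>In the event-by-event case, <code>process_event</code> is called once per incoming event and the store carries state between calls; in the batch case, <code>score_batch</code> sees all the events before any score is produced.</p>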
<p>At Open Data Group, we use the Predictive Model Markup Language (<a href="http://www.dmg.org">PMML</a>) to express our models in XML so that models may be built in a development environment with one application and then easily deployed in a production environment with another application.   We also use an open source scoring engine (<a href="http://code.google.com/p/augustus/">Augustus</a>) to deploy our models in operational environments.</p>
<p>In practice, when actually deploying models, it is usually a bit more complicated.</p>
<p>First, various business rules are usually used to process the event prior to using the associated state as input to the predictive model (pre-processing).  Second, the score of the model is usually processed by additional business rules (post-processing) prior to selecting an action.  For example, as part of the pre-processing, if the event is associated with a visitor who is likely to be under 18, the site may choose different types of ads.  As part of the post-processing, inventory rules and rules about how often to show ads (exposure rules) may be used to exclude certain ads.</p>
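<p>As a rough illustration of how pre- and post-processing rules wrap a model, here is a small Python sketch. It is not Augustus or PMML code. The rule details (an <code>all_ages</code> flag on each ad, an exposure cap of 3, a 0.5 threshold), the data layout, and all function names are assumptions made for the example, and <code>model</code> is any callable that returns a score for an event and an ad.</p>
<pre><sourcecode language="python">
EXPOSURE_CAP = 3  # hypothetical: show the same ad at most 3 times per user


def pre_process(event, candidate_ads):
    """Business rules applied before scoring: choose which ads to consider."""
    if event.get("likely_under_18"):
        return [ad for ad in candidate_ads if ad["all_ages"]]
    return list(candidate_ads)


def post_process(scored_ads, inventory, exposures):
    """Business rules applied after scoring: inventory and exposure rules."""
    return [
        (ad, score)
        for ad, score in scored_ads
        if inventory.get(ad["id"], 0) > 0                 # ad is still in inventory
        and EXPOSURE_CAP > exposures.get(ad["id"], 0)     # ad not shown too often
    ]


def select_ad(event, candidate_ads, model, inventory, exposures, threshold=0.5):
    """Pre-process, score, post-process, then pick the best remaining ad."""
    candidates = pre_process(event, candidate_ads)
    scored = [(ad, model(event, ad)) for ad in candidates]
    allowed = post_process(scored, inventory, exposures)
    allowed = [(ad, score) for ad, score in allowed if score > threshold]
    if not allowed:
        return None  # no ad survives the rules and the threshold
    best_ad, _ = max(allowed, key=lambda pair: pair[1])
    return best_ad["id"]
</sourcecode></pre>
<p>Keeping the rules in separate functions from the model means that the rules can change (a new exposure cap, say) without rebuilding the model.</p>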
<p>Until recently, pre- and post-processing usually needed to be coded manually and couldn&#8217;t be expressed easily in PMML.  </p>
<p>That has changed with <a href="http://code.google.com/p/augustus/">Augustus 0.5.2</a>.  With this version of Augustus, Python code can be embedded in the PMML file to express pre- and post-processing rules, to combine multiple models to produce scores, and to process multiple models and scores in a variety of ways.  </p>
<p>We call this <b>augmented processing</b> and in our experience over the past several months it has significantly simplified the deployment of predictive models into operational systems.</p>
<p>Here is an example from a white paper that we are writing about augmented processing using Augustus.  With segmented modeling, you can use multiple models in different segments to score an event and in this way produce multiple scores.  Assume that you want to use the <em>minimum</em> score produced in this way as the score for the event.  Here is some Augustus code to do this that can be embedded in the PMML file for the segmented model:</p>
<pre><sourcecode language="python">
def action():
    # Collect the predicted value from each segment's model.
    segmentScores = []
    for segment in segments:
        segmentScores.append(segment.score()[PREDICTEDVALUE])

    # If no segment produced a score, report the score as missing.
    if len(segmentScores) == 0:
        finalScore = MISSING
    else:
        finalScore = min(segmentScores)

    # Write the event number and its final (minimum) score to the output.
    output.xmlopen("Event", attrib={"number": eventNumber})
    output.xmlfield("Score", finalScore)
</sourcecode></pre>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/05/14/scores-models-and-rules/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>AdReady Chooses Open Data and Augustus for their Real Time Bidding System</title>
		<link>http://opendatagroup.com/2012/04/09/adready-chooses-open-data-group/</link>
		<comments>http://opendatagroup.com/2012/04/09/adready-chooses-open-data-group/#comments</comments>
		<pubDate>Mon, 09 Apr 2012 19:11:12 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=162</guid>
		<description><![CDATA[Augustus is a scalable Python-based open source system for building and scoring statistical and data mining models. Augustus follows the PMML standard so that models can be easily imported and exported as PMML files. Augustus 0.5.x was a rewrite of &#8230; <a href="http://opendatagroup.com/2012/04/09/adready-chooses-open-data-group/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://augustus.googlecode.com">Augustus</a> is a scalable Python-based open source system for building and scoring statistical and data mining models.  Augustus follows the <a href="http://www.dmg.org">PMML</a> standard so that models can be easily imported and exported as PMML files.</p>
<p>Augustus 0.5.x was a rewrite of the PMML-compliant scoring engine with two important changes:</p>
<ol>
<li>Augustus now has the ability to build models in batch mode using static data files as usual, but also in a streaming mode in which data is read only once.</li>
<li>Augustus now supports custom processing, a mechanism that allows the user to embed arbitrary Python code inside of Augustus. This provides a simple mechanism, for example, to combine multiple models and produce a single score.</li>
</ol>
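<p>As an illustration of what building a model in a single pass over the data can look like, here is a minimal Python sketch. It is not Augustus code and does not use the Augustus API. It uses Welford&#8217;s online algorithm, a standard one-pass technique, to maintain a running mean and variance as a simple stand-in for a baseline. The class and function names are hypothetical.</p>
<pre><sourcecode language="python">
import math


class StreamingBaseline:
    """Running mean and variance, updated one observation at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        """Fold one observation into the model (Welford's update)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; None until two observations have been seen."""
        if self.count > 1:
            return self._m2 / (self.count - 1)
        return None

    def z_score(self, x):
        """Score an observation against the baseline built so far."""
        var = self.variance
        if not var:
            return None  # variance is undefined or zero
        return (x - self.mean) / math.sqrt(var)


def build_from_stream(stream):
    """Build the baseline from any iterable, reading each value exactly once."""
    model = StreamingBaseline()
    for x in stream:
        model.update(x)
    return model
</sourcecode></pre>
<p>Because <code>update</code> keeps only three numbers of state, the same object can be built from a static file in batch mode or kept up to date as events arrive.</p>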
<p>Augustus 0.5.x uses the latest PMML specification 4.1 from December of 2011.<br />
The latest version, Augustus 0.5.2.0, was released on April 4, 2012.  Its model coverage is: Tree, Regression, Baseline, Rule Set, Naïve-Bayes, and Cluster.  </p>
<p><a href="http://www.adready.com">AdReady</a> is an Open Data Group Business Partner and is using the latest Augustus to develop and deploy sophisticated statistical models for a large, robust, and scalable targeted ad-bidding system.   AdReady’s Real-Time Bidding (RTB) system enables advertisers to bid more effectively on precise inventory and audience targets.   Bid requests need to be collected, scored against segmented models, and responded to in less than 100 ms.</p>
<p>By using Augustus, AdReady can grow using elastic capacity and score on VM instances, bypassing the normal (Augustus) data pipeline and sending events directly into the scoring engine processing loop from their Python MVC web framework (Django over Apache).  AdReady described their use of Augustus in a nice <a href="http://www.adready.com/site/blog/2012/partnership-supporting-collective-insights/">blog post</a> today.</p>
<p>To learn more about building a scalable analytics platform using Augustus, please contact us at info at opendatagroup.com.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/04/09/adready-chooses-open-data-group/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
