<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Open Data Group &#187; Blog</title>
	<atom:link href="http://opendatagroup.com/category/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://opendatagroup.com</link>
	<description>Open Data builds predictive models over big data.</description>
	<lastBuildDate>Mon, 14 May 2012 12:51:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.1</generator>
		<item>
		<title>Scores, Models and Rules</title>
		<link>http://opendatagroup.com/2012/05/14/scores-models-and-rules/</link>
		<comments>http://opendatagroup.com/2012/05/14/scores-models-and-rules/#comments</comments>
		<pubDate>Mon, 14 May 2012 04:00:45 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=188</guid>
		<description><![CDATA[A common approach to deploying predictive models in operational systems in which data arrives event by event is to implement an event loop: an event associated with an entity comes in, say a user (the entity) visits a page (the &#8230; <a href="http://opendatagroup.com/2012/05/14/scores-models-and-rules/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A common approach to deploying predictive models in operational systems in which data arrives event by event is to implement an <em>event loop</em>:</p>
<ol>
<li>an event associated with an entity comes in, say a user (the entity) visits a page (the event) on a web site</li>
<li>the features associated with the entity are retrieved, say features that describe the user and his or her behavior at the site, and the features are updated with the event</li>
<li>the features are used as the input to a predictive model, which produces a score, say the likelihood that the user will click on an ad in a certain category</li>
<li>an action is taken on the basis of the score, say to display an ad in a certain category if the score is above a threshold.</li>
</ol>
<p>This may be easier to remember if you use the acronym EEFM for <b>E</b>vent, <b>E</b>ntity, <b>F</b>eatures, and <b>M</b>odel.  If the events are all available at the same time, the feature vectors for all the entities can be created at the same time and the features can all be scored &#8220;in batch&#8221;. </p>
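<p>The four steps above can be sketched in a few lines of Python. This is only an illustration of the loop, not Augustus code: the <code>Features</code> class, the toy scoring formula, and the 0.5 threshold are all assumptions made up for the example.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Features:
    """Hypothetical per-entity state, updated as events arrive."""
    page_views: int = 0
    categories_seen: set = field(default_factory=set)


def update_features(features, event):
    # Step 2: fold the new event into the entity's features.
    features.page_views += 1
    features.categories_seen.add(event["category"])


def score(features):
    # Step 3: stand-in "model" with made-up weights, capped at 1.0.
    return min(1.0, 0.1 * features.page_views + 0.2 * len(features.categories_seen))


def event_loop(events, store, threshold=0.5):
    actions = []
    for event in events:                                         # Step 1: event arrives
        features = store.setdefault(event["user"], Features())   # look up the entity
        update_features(features, event)
        s = score(features)
        action = "show_ad" if s > threshold else "no_ad"         # Step 4: act on score
        actions.append((event["user"], action))
    return actions


events = [
    {"user": "u1", "category": "sports"},
    {"user": "u1", "category": "travel"},
    {"user": "u2", "category": "sports"},
]
actions = event_loop(events, store={})
print(actions)
```

<p>Scoring in batch amounts to running the same feature and scoring functions over all entities at once instead of one event at a time.</p>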
<p>At Open Data Group, we use the Predictive Model Markup Language (<a href="http://www.dmg.org">PMML</a>) to express our models in XML so that models may be built in a development environment with one application and then easily deployed in a production environment with another application.   We also use an open source scoring engine (<a href="http://code.google.com/p/augustus/">Augustus</a>) to deploy our models in operational environments.</p>
<p>In practice, when actually deploying models, it is usually a bit more complicated.</p>
<p>First, various business rules are usually used to process the event prior to using the associated state as input to the predictive model (pre-processing).  Second, the score of the model is usually processed by additional business rules (post-processing) prior to selecting an action.  For example, as part of the pre-processing, if the event is associated with the visit of someone who is likely to be under 18, the site may choose different types of ads.  Similarly, as part of the post-processing, inventory rules and rules about how often to show ads (exposure rules) may be used to exclude certain ads.</p>
<p>Until recently, pre- and post-processing usually needed to be coded manually and couldn&#8217;t be expressed easily in PMML.  </p>
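<p>To make the pre- and post-processing rules concrete, here is a rough sketch of what the hand-coded version might look like. The field names (<code>likely_under_18</code>, <code>all_ages</code>), the exposure cap of three, and the stand-in model are hypothetical and chosen only for illustration.</p>

```python
def pre_process(event, candidate_ads):
    """Business rule before scoring: restrict ad types for likely minors."""
    if event.get("likely_under_18"):
        return [ad for ad in candidate_ads if ad["all_ages"]]
    return list(candidate_ads)


def post_process(scored_ads, inventory, exposures, max_exposures=3):
    """Business rules after scoring: apply inventory and exposure rules."""
    eligible = [
        (ad, s)
        for ad, s in scored_ads
        if inventory.get(ad["id"], 0) > 0
        and exposures.get(ad["id"], 0) < max_exposures
    ]
    if not eligible:
        return None
    # Pick the highest-scoring ad that survived the rules.
    return max(eligible, key=lambda pair: pair[1])[0]


def select_ad(event, candidate_ads, model, inventory, exposures):
    ads = pre_process(event, candidate_ads)
    scored = [(ad, model(event, ad)) for ad in ads]
    return post_process(scored, inventory, exposures)


ads = [
    {"id": "a", "all_ages": False, "base_score": 0.9},
    {"id": "b", "all_ages": True, "base_score": 0.7},
    {"id": "c", "all_ages": True, "base_score": 0.8},
]
inventory = {"a": 5, "b": 5, "c": 0}   # ad "c" is out of inventory
exposures = {"b": 1}


def toy_model(event, ad):
    # Stand-in for a real predictive model.
    return ad["base_score"]


minor_choice = select_ad({"likely_under_18": True}, ads, toy_model, inventory, exposures)
adult_choice = select_ad({"likely_under_18": False}, ads, toy_model, inventory, exposures)
print(minor_choice["id"], adult_choice["id"])
```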
<p>That has changed with <a href="http://code.google.com/p/augustus/">Augustus 0.5.2</a>.  With this version of Augustus, Python code can be embedded in the PMML file to express pre- and post-processing rules, as well as to combine multiple models to produce scores, and to process multiple models and scores in a variety of different ways.  </p>
<p>We call this <b>augmented processing</b> and in our experience over the past several months it has significantly simplified the deployment of predictive models into operational systems.</p>
<p>Here is an example from a white paper that we are writing about augmented processing using Augustus.  With segmented modeling, you can use multiple models in different segments to score an event and in this way produce multiple scores.  Assume that you want to use the <em>minimum</em> score produced in this way as the score for the event.  Here is some Augustus code to do this that can be embedded in the PMML file for the segmented model:</p>
<p><sourcecode language="python"><br />
def action():<br />
    # Collect the predicted value from each segment's model<br />
    segmentScores = []<br />
    for segment in segments:<br />
        segmentScores.append(segment.score()[PREDICTEDVALUE])<br />
<br />
    # Use the minimum segment score, or MISSING if no segment produced one<br />
    if len(segmentScores) == 0:<br />
        finalScore = MISSING<br />
    else:<br />
        finalScore = min(segmentScores)<br />
<br />
    # Emit an Event element carrying the event number, with the final score as a Score field<br />
    output.xmlopen("Event", attrib={"number": eventNumber})<br />
    output.xmlfield("Score", finalScore)<br />
</sourcecode></p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/05/14/scores-models-and-rules/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>AdReady Chooses Open Data and Augustus for their Real Time Bidding System</title>
		<link>http://opendatagroup.com/2012/04/09/adready-chooses-open-data-group/</link>
		<comments>http://opendatagroup.com/2012/04/09/adready-chooses-open-data-group/#comments</comments>
		<pubDate>Mon, 09 Apr 2012 19:11:12 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=162</guid>
		<description><![CDATA[Augustus is a scalable Python-based open source system for building and scoring statistical and data mining models. Augustus follows the PMML standard so that models can be easily imported and exported as PMML files. Augustus 0.5.x was a rewrite of &#8230; <a href="http://opendatagroup.com/2012/04/09/adready-chooses-open-data-group/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://augustus.googlecode.com">Augustus</a> is a scalable Python-based open source system for building and scoring statistical and data mining models.  Augustus follows the <a href="http://www.dmg.org">PMML</a> standard so that models can be easily imported and exported as PMML files.</p>
<p>Augustus 0.5.x was a rewrite of the PMML-compliant scoring engine with two important changes:</p>
<ol>
<li>Augustus now has the ability to build models in batch mode using static data files as usual, but also in a streaming mode in which data is read only once.</li>
<li>Augustus now supports custom processing, a mechanism that allows the user to embed arbitrary Python code inside of Augustus. This provides a simple mechanism, for example, to combine multiple models and produce a single score.</li>
</ol>
<p>Augustus 0.5.x uses the latest PMML specification 4.1 from December of 2011.<br />
The latest version, Augustus 0.5.2.0, was released on April 4, 2012.  Its model coverage is: Tree, Regression, Baseline, Rule Set, Naïve-Bayes, and Cluster.  </p>
<p><a href="http://www.adready.com">AdReady</a> is an Open Data Group Business Partner and is using the latest Augustus for the development and deployment of sophisticated statistical models for use in a large, robust, and scalable targeted ad-bidding system.   AdReady’s Real-Time-Bidding (RTB) enables advertisers to bid more effectively on precise inventory and audience targets.   Bid requests need to be collected, scored against segmented models, and responded to in less than 100 ms.</p>
<p>By using Augustus, AdReady can grow using elastic capacity and score on VM instances, bypassing the normal (Augustus) data pipeline and sending events directly into the scoring engine processing loop from their Python MVC web framework (Django over Apache).  AdReady described their use of Augustus in a nice <a href="http://www.adready.com/site/blog/2012/partnership-supporting-collective-insights/">blog post</a> today.</p>
<p>To learn more about building a scalable analytics platform using Augustus, please contact us at info at opendatagroup.com.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/04/09/adready-chooses-open-data-group/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How To Rapidly Prototype Analytic Models Over Big Data</title>
		<link>http://opendatagroup.com/2012/03/15/rapid-analytic-prototypin/</link>
		<comments>http://opendatagroup.com/2012/03/15/rapid-analytic-prototypin/#comments</comments>
		<pubDate>Thu, 15 Mar 2012 12:23:12 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=147</guid>
		<description><![CDATA[Cloud computing is changing the way that companies build and deploy their analytic solutions. With cloud computing, computing is available on demand, scales elastically, and can be self-provisioned. This flexibility sometimes requires developing new analytic infrastructure and new analytic algorithms, &#8230; <a href="http://opendatagroup.com/2012/03/15/rapid-analytic-prototypin/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Cloud computing is changing the way that companies build and deploy their analytic solutions. With cloud computing, computing is available on demand, scales elastically, and can be self-provisioned. This flexibility sometimes requires developing new analytic infrastructure and new analytic algorithms, which, in turn, requires some experimenting. This process can usually benefit from an external perspective. </p>
<p>The fastest way forward is to use a public cloud, bring in external experts, and do some quick experiments and prototyping. At this point, for many companies, there is a problem. It is quite common these days for companies to have policies that prohibit placing proprietary data, or data that contains information that can identify customers, on public clouds. Providing access to this data to third parties is also usually quite difficult. </p>
<p>One practical approach is to replace actual data with simulated data and, instead of using public clouds, to use private clouds operated by third parties. This requires using data simulators that produce realistic data. For example, large data is rarely normally distributed, but more often follows power laws or similar types of distributions. </p>
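<p>As a small illustration of what a power-law simulator can look like, the sketch below draws heavy-tailed values from a Pareto distribution using only the Python standard library. The function name, the shape parameter of 1.5, and the idea that the values are transaction amounts are assumptions for the example; a realistic simulator would be fitted to summary statistics of the actual data.</p>

```python
import random
import statistics


def simulate_amounts(n, alpha=1.5, minimum=1.0, seed=0):
    """Draw n heavy-tailed values from a Pareto (power-law) distribution."""
    rng = random.Random(seed)  # fixed seed so the simulation is repeatable
    # paretovariate(alpha) returns values >= 1; scale by the minimum amount.
    return [minimum * rng.paretovariate(alpha) for _ in range(n)]


amounts = simulate_amounts(10_000)
print("median:", statistics.median(amounts))
print("mean:  ", statistics.mean(amounts))
print("max:   ", max(amounts))
```

<p>With a heavy tail the mean sits well above the median and the maximum is far larger than either, which is the behavior that a normally distributed simulation would miss.</p>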
<p>As a reminder, a private cloud is a cloud that is used exclusively by a single organization. It may be managed by the organization or by a third party; and, it may exist on premise (an in-house private cloud) or off premise (a third-party private cloud). In contrast, in a public cloud, the cloud infrastructure is made available to the general public, or a large group, and is owned by an organization selling cloud services (a cloud service provider). In this post, we assume that private third party clouds are also single tenant clouds; that is, only one client’s data is on the cloud at a time and the cloud is sanitized between use by different clients.</p>
<p>In more detail, one approach for moving your analytics to clouds is:</p>
<ul>
<li>use simulated data following realistic simulations, instead of actual data; </li>
<li>supplement in-house expertise with third party experts who specialize in analytics and cloud computing; </li>
<li>use third party private clouds instead of public clouds to decrease risk or perceived risk; </li>
<li>experiment with different analytic approaches and different analytic infrastructures; </li>
<li>agree on APIs up front and transfer technology by transferring code that uses these APIs. </li>
</ul>
<p>Open Data Group offers a service to our clients called Rapid Analytic Prototyping, or RAP, that uses this approach to help our clients rapidly develop and deploy predictive models over big data.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/03/15/rapid-analytic-prototypin/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Five Common Mistakes in Analytic Projects</title>
		<link>http://opendatagroup.com/2012/02/20/five-common-mistakes-in-analytic-projects/</link>
		<comments>http://opendatagroup.com/2012/02/20/five-common-mistakes-in-analytic-projects/#comments</comments>
		<pubDate>Mon, 20 Feb 2012 12:10:02 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=144</guid>
		<description><![CDATA[Managing projects is often challenging. Developing predictive models can be very challenging. Managing projects that develop analytic models can present some especially difficult challenges. In this post, I’ll describe some of the most common mistakes that occur when managing analytic &#8230; <a href="http://opendatagroup.com/2012/02/20/five-common-mistakes-in-analytic-projects/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Managing projects is often challenging. Developing predictive models can be very challenging. Managing projects that develop analytic models can present some especially difficult challenges. In this post, I’ll describe some of the most common mistakes that occur when managing analytic projects.</p>
<p><b>Mistake 1.</b> Underestimating the time required to get the data. This is probably the most common mistake in modeling projects. Getting the data required for analytic projects usually requires a special request to the IT department. Any special requests made to IT departments can take time. Usually, several meetings are required between the business owners of the analytic problem, the statisticians building the models, and the IT department in order to decide what data is required and whether it is available. Once there is agreement on what data is required, then the special request to the IT department is made and the wait begins. Project managers are sometimes under the impression that good models can be built without data, just as statisticians are sometimes under the impression that modeling projects can be managed without a project plan.</p>
<p><b>Mistake 2. </b>There is not a good plan for deploying the model. There are several phases in a modeling project. In one phase, data is acquired from the IT department and the model is built. A statistician is usually in charge of building the model. In the next phase, the model is deployed. This is the responsibility of the IT department. This requires providing the model with the appropriate data, post-processing the scores produced by the model to compute the associated actions, and then integrating these actions into the required business processes. Deploying models is in many cases as complicated as, or more complicated than, building the models and requires a plan. A good standards-compliant architecture can help here. It is often useful for the statistician to export the model as PMML. The model can then be imported by the application used in the operational system.</p>
<p><b>Mistake 3. </b>Building predictive models without an analytic strategy.  Companies do not undertake major IT projects without having an IT strategy and a senior executive to champion the project, but the same cannot be said about analytic projects.  It is a best practice for a company or organization to develop an analytic strategy and to assign a senior executive to execute the strategy.  The executive then makes sure that the analytic strategy aligns with the corporate strategy and that all required resources are in place for the project to succeed.  For an analytic project to succeed: the data required to support the project must be available;  there must be modelers (or statisticians) available to develop the models; the modelers must have the right software tools; the models that are built must be able to be deployed; deployed models must enable actions that increase revenues, decrease risks, or improve the efficiency of business processes; and finally the improvements produced by the models must be tracked and reviewed.  </p>
<p><b>Mistake 4. </b>Trying to build the perfect model. Another common mistake is trying to build the perfect statistical model. Usually, the impact of a model will be much higher if a model that is good enough is deployed and then a process is put in place that: i) reviews the effectiveness of the model frequently with the business owner of the problem; ii) refreshes the model on a regular basis with the most recent data; and, iii) rebuilds the model on a periodic basis with the lessons learned from the reviews.</p>
<p><b>Mistake 5. </b>The predictions of the model are not actionable.  From this point of view, the model is evaluated not just by its accuracy but also by measures that directly support a specified strategy. For example, the strategy might be to increase sales by recommending another product after an initial product is selected. Here the relevant measure might be the incremental revenue generated by the recommendations. The action could be to present up to three additional products to the shopper. Each product might receive a score from 1 to 1000. The products with the highest three scores are then presented. This is a simple example. Unfortunately, in most of the projects that I have been involved with, determining the appropriate actions and measures has required an iterative process to get it right.  I have developed a framework called SAMS to help ensure that the predictions of statistical models are actionable; it will be the subject of a future post.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/02/20/five-common-mistakes-in-analytic-projects/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Health and Status Monitoring</title>
		<link>http://opendatagroup.com/2012/01/16/health-and-status-monitoring/</link>
		<comments>http://opendatagroup.com/2012/01/16/health-and-status-monitoring/#comments</comments>
		<pubDate>Mon, 16 Jan 2012 21:58:25 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=153</guid>
		<description><![CDATA[Service interruptions of digital systems can inconvenience millions of people and have a significant financial impact on the provider. If the Amazon web site, or Google’s Gmail, or the Visa payments network goes down even for a few minutes, it &#8230; <a href="http://opendatagroup.com/2012/01/16/health-and-status-monitoring/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Service interruptions of digital systems can inconvenience millions of people and have a significant financial impact on the provider. If the Amazon web site, or Google’s Gmail, or the Visa payments network goes down even for a few minutes, it can make front page news.</p>
<p>As digital systems grow larger and more complex, it can become very challenging to monitor their health and status, which is the first step in detecting potential problems, identifying the root causes, and taking appropriate preventive actions. These types of systems can contain thousands of different data feeds, data flows and processes. A problem with just one of them can interrupt services such as payments, ads, or status updates. Often there are hourly, daily, weekly and seasonal variations in the data that complicate the detection of problems.</p>
<p>One way to gain some insight into this problem is to look at the origins in the 1920’s of statistical quality control. Walter Andrew Shewhart (1891 – 1967) was an engineer from 1918 to 1924 at the Western Electric Company, which manufactured hardware for the Bell Telephone Company. From 1925 to 1956 he was a member of the Technical Staff of Bell Telephone Company [ASQ].</p>
<div id="attachment_157" class="wp-caption alignleft" style="width: 191px"><a href="http://opendatagroup.com/files/2012/04/shewhart-cover.png"><img src="http://opendatagroup.com/files/2012/04/shewhart-cover.png" alt="" title="Shewhart&#039;s Book on Quality Control" width="181" height="299" class="size-full wp-image-157" /></a><p class="wp-caption-text">Shewhart introduced control charts in the 1920&#039;s and they are still relevant today for quality control and related problems.</p></div>
<p>One of the problems that concerned him was identifying potential issues in factory assembly lines. For example, the dimensions and weight of metal parts that are sampled from an assembly can be recorded. He distinguished between two types of variations in these measurements:</p>
<ul>
<li>A common cause of variation (or noise) occurs as a normal part of the manufacturing process.</li>
<li>A special cause of variation is not part of the normal manufacturing process, but represents a problem.</li>
</ul>
<p>One of the goals of statistical quality control is to distinguish between these two types of variation and to quickly identify special causes of variation.</p>
<p>Shewhart introduced control charts as a tool for distinguishing between common and special causes of variation. A control chart has a central line and upper and lower control limits. When a measurement exceeds either the upper or lower control limit, it is considered a potential special cause of variation and investigated. Usually, the upper and lower control limits are three standard deviations above and below the mean.</p>
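<p>A minimal sketch of the three-sigma rule follows. It is a simplification: it estimates the standard deviation directly from a baseline sample of individual measurements, whereas control charts in practice often estimate it from subgroup ranges. The function names and the sample measurements are made up for the example.</p>

```python
import statistics


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits from in-control baseline data."""
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)  # sample standard deviation
    return center - k * sigma, center, center + k * sigma


def out_of_control(measurements, lower, upper):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, x) for i, x in enumerate(measurements) if x < lower or x > upper]


# Hypothetical part weights recorded while the process was behaving normally.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
lower, center, upper = control_limits(baseline)

# New measurements; the third one is far outside the limits.
flagged = out_of_control([10.0, 10.1, 11.5, 9.9], lower, upper)
print(lower, center, upper)
print(flagged)
```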
<p>As anyone who has investigated potential data quality problems knows, identifying root causes of potential problems is not easy. Shewhart also introduced a four step approach to these types of investigations that became known as the Shewhart Cycle, the Deming Cycle or the Plan-Do-Check-Act Cycle:</p>
<ul>
<li><b>Plan.</b> Identify an opportunity or potential problem and make a plan for improving it or changing it.</li>
<li><b>Do.</b> Implement the change on a small scale and collect the appropriate data.</li>
<li><b>Check. </b>Use data to analyze statistically the results of the change and determine whether it made a difference.</li>
<li><b>Act. </b> If the change was successful, implement it on a wider scale and continuously monitor and improve your results. If the change did not work, begin the cycle again. </li>
</ul>
<p>These same ideas are still used today as the basis for health and status monitoring systems. Well designed digital systems these days are designed from the ground up so that appropriate log data is produced. Instead of a single assembly line producing physical items, there are thousands or millions of digital processes producing (nearly) continuous digital data. Often this data is available through an HTTP interface and is continually collected.</p>
<p>Below is a dashboard from the open source <a href="http://code.google.com/p/augustus/">Augustus system</a>.  Instead of a control chart, a change detection model is used, such as a CUSUM or GLR statistical model [Poor]. Instead of building a single model, a model for each cell in a multi-dimensional cube of models is built [Bugajski]. Instead of looking at the charts each day, an online dashboard is used that is at the hub of an operations center.</p>
<p><a href="http://opendatagroup.com/files/2012/04/generic-dashboard0.png"><img src="http://opendatagroup.com/files/2012/04/generic-dashboard0.png" alt="" title="Augustus Health and Status Monitoring Dashboard" width="300" height="297" class="alignleft size-full wp-image-156" /></a></p>
<p>Baseline and change detection models for each cell in a multi-dimensional data cube of models can be built easily using the open source Augustus system.</p>
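<p>To give a feel for how a change detection model differs from a control chart, here is a sketch of a textbook one-sided CUSUM detector in plain Python. It is not the Augustus implementation, and the target mean, slack, and threshold values are assumptions picked for the example. The detector accumulates deviations above the baseline and raises an alarm when the running sum crosses a threshold, so it can catch a small sustained shift that never exceeds a three-sigma limit on any single observation.</p>

```python
def cusum_upper(values, target_mean, slack, threshold):
    """Return the index of the first alarm for an upward shift, or None.

    The statistic is S_t = max(0, S_{t-1} + x_t - target_mean - slack),
    and an alarm is raised the first time S_t exceeds the threshold.
    """
    s = 0.0
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            return i
    return None


# Hypothetical hourly metric: stable near 10, then shifts up to about 11.
values = [10.0, 10.2, 9.8, 10.1, 11.0, 11.2, 10.9, 11.1]
alarm = cusum_upper(values, target_mean=10.0, slack=0.5, threshold=1.0)
print(alarm)

# A cube of models can be pictured as one detector configuration per cell,
# for example keyed by (region, hour of day); these keys are made up.
cube = {("east", 9): (10.0, 0.5, 1.0), ("west", 9): (12.0, 0.5, 1.0)}
```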
<p>Open Data Group worked with the <a href="http://www.dmg.org">Data Mining Group</a> to develop Predictive Model Markup Language (PMML) versions of change detection models and cubes of models.  Both of these contributions were included in the most recent release of the PMML standard, version 4.1, which was finished in December of 2011.  </p>
<p>This is an updated version of a post that first appeared on November 29, 2009.</p>
<p><b>References</b></p>
<p>[ASQ] ASQ, The History of Quality – Overview, retrieved from www.asq.org.</p>
<p>[Bugajski] Joseph Bugajski, Chris Curry, Robert L. Grossman, David Locke and Steve Vejcik, Data Quality Models for High Volume Transaction Streams: A Case Study, Proceedings of the Second Workshop on Data Mining Case Studies and Success Stories, ACM 2007</p>
<p>[Poor] H. Vincent Poor and Olympia Hadjiliadis, Quickest Detection, Cambridge University Press, 2008.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2012/01/16/health-and-status-monitoring/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Building and Deploying Statistical Models over Big Data</title>
		<link>http://opendatagroup.com/2011/11/18/sc11-tutoria/</link>
		<comments>http://opendatagroup.com/2011/11/18/sc11-tutoria/#comments</comments>
		<pubDate>Fri, 18 Nov 2011 11:49:53 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=140</guid>
		<description><![CDATA[On Monday, November 14, 2011, Bob Grossman and Collin Bennett from Open Data Group gave a three hour tutorial at the SC 11 Conference in Seattle on managing big data and building statistical models over it. Best practices for managing &#8230; <a href="http://opendatagroup.com/2011/11/18/sc11-tutoria/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>On Monday, November 14, 2011, Bob Grossman and Collin Bennett from Open Data Group gave a three hour tutorial at the <a href="http://sc11.supercomputing.org/">SC 11</a> Conference in Seattle on managing big data and building statistical models over it.  </p>
<p>Best practices for managing big data and building empirically derived and statistically valid models over it are now emerging and the tutorial described some of these.  </p>
<p>It is often a challenge once statistical models are built over big data to deploy these in operational systems.  The open source <a href="http://code.google.com/p/augustus/">Augustus</a> system developed by Open Data Group provides an efficient means to deploy statistical and data mining models in operational systems and to quickly refresh models once they are deployed.  Models deployed with Augustus can be updated simply by reading an XML file that uses the <a href="http://www.dmg.org">Predictive Model Markup Language</a> or PMML.  Augustus can both produce and read PMML files.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2011/11/18/sc11-tutoria/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Some FAQ on Predictive Analytics</title>
		<link>http://opendatagroup.com/2011/09/12/predictive-analytics-faq/</link>
		<comments>http://opendatagroup.com/2011/09/12/predictive-analytics-faq/#comments</comments>
		<pubDate>Mon, 12 Sep 2011 14:51:08 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=111</guid>
		<description><![CDATA[Robert Grossman, a Partner at Open Data Group, was interviewed about predictive analytics, data mining, and related topics recently. You can find the video interview here. He also updated a FAQ about predictive analytics and data mining that you can &#8230; <a href="http://opendatagroup.com/2011/09/12/predictive-analytics-faq/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Robert Grossman, a Partner at Open Data Group, was interviewed about predictive analytics, data mining, and related topics recently.  You can find the video interview <a href="http://nationalsecurityzone.org/datamining/data-mining-and-link-analysis-basics/the-science/">here</a>.</p>
<p>He also updated a FAQ about predictive analytics and data mining that you can find <a href="http://opendatagroup.com/predictive-analytics-faq/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2011/09/12/predictive-analytics-faq/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>PMML Workshop</title>
		<link>http://opendatagroup.com/2011/08/22/pmml-workshop/</link>
		<comments>http://opendatagroup.com/2011/08/22/pmml-workshop/#comments</comments>
		<pubDate>Mon, 22 Aug 2011 15:09:09 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://opendatagroup.opendatagroup.net/?p=118</guid>
		<description><![CDATA[A Workshop on the Predictive Model Markup Language (PMML) took place on August 21, 2011 at the KDD 2011 Conference in San Diego. The essential idea of PMML is that a predictive model, and more generally a statistical or data &#8230; <a href="http://opendatagroup.com/2011/08/22/pmml-workshop/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A Workshop on the Predictive Model Markup Language (PMML) took place on August 21, 2011 at the <a href="http://www.kdd.org/kdd2011/">KDD 2011</a> Conference in San Diego.</p>
<p>The essential idea of PMML is that a predictive model, and more generally a statistical or data mining model, should not be thought of as <b>code</b>, but rather abstracted and described as <b>metadata</b> about the underlying data that it models.   The PMML standard specifies an XML format for this metadata.  </p>
<p>The reason that this point of view is important is so that one application (the <em>model producer</em>) can produce the model, while another application (the <em>model consumer</em>) can use the model for scoring data.   The model consumer can be integrated into production and operational systems and models can then be updated simply by reading new PMML files.   </p>
<p>With this approach predictive models in operational systems can be updated quickly and easily.   In contrast, when predictive models are viewed as code and new code is added to operational systems, a careful QA process is required before any new code can be deployed.</p>
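<p>The producer and consumer split can be illustrated with a toy model consumer written with Python's standard XML parser. This is a sketch only: it reads a single hand-written regression fragment, the field names and coefficients are invented, and a real consumer such as Augustus handles far more of the specification.</p>

```python
import xml.etree.ElementTree as ET

PMML_NS = {"pmml": "http://www.dmg.org/PMML-4_1"}

# A tiny hand-written PMML document, standing in for a model producer's output.
MODEL_XML = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="toy linear model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="visits" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="visits"/>
      <MiningField name="score" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="visits" coefficient="0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_linear_model(pmml_text):
    """Read the intercept and numeric coefficients from the regression table."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", PMML_NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", PMML_NS)
    }
    return intercept, coefficients


def score(model, record):
    """Score one record with the loaded linear model."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_linear_model(MODEL_XML)
print(score(model, {"visits": 4.0}))

# Refreshing the model means reading new XML; the consumer code is unchanged.
refreshed_xml = MODEL_XML.replace('coefficient="0.25"', 'coefficient="0.5"')
refreshed = load_linear_model(refreshed_xml)
print(score(refreshed, {"visits": 4.0}))
```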
<p>The upcoming version of PMML (PMML version 4.1, which should be released in the Fall of 2011) supports multiple models.   This PMML feature was championed by Open Data Group over the past few years based upon its experience building predictive models over big data.</p>
<p>As the amount of data increases, building predictive models using multiple models (segmented models, hierarchical models, and related techniques) is absolutely critical.   For big data, there is really no alternative.   </p>
<p>I expect that with the explosion of big data and big data analytics, and with PMML&#8217;s support for multiple models, PMML will begin to be an essential component of any analytic infrastructure that supports big data.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2011/08/22/pmml-workshop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

