<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Open Data Group &#187; data quality</title>
	<atom:link href="http://opendatagroup.com/tag/data-quality/feed/" rel="self" type="application/rss+xml" />
	<link>http://opendatagroup.com</link>
	<description>Open Data Group&#039;s Home Page and Blog</description>
	<lastBuildDate>Sat, 04 Sep 2010 00:51:55 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Health and Status Monitoring</title>
		<link>http://opendatagroup.com/2009/11/29/health-and-status-monitoring-2/</link>
		<comments>http://opendatagroup.com/2009/11/29/health-and-status-monitoring-2/#comments</comments>
		<pubDate>Sun, 29 Nov 2009 01:31:58 +0000</pubDate>
		<dc:creator>Robert Grossman</dc:creator>
				<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[Augustus]]></category>
		<category><![CDATA[baseline models]]></category>
		<category><![CDATA[change detection models]]></category>
		<category><![CDATA[CUSUM]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[GLR]]></category>
		<category><![CDATA[health and status monitoring]]></category>
		<category><![CDATA[Shewhart]]></category>
		<category><![CDATA[statistical quality control]]></category>

		<guid isPermaLink="false">http://blog.opendatagroup.com/?p=219</guid>
		<description><![CDATA[Service interruptions of digital systems can inconvenience millions of people and have a significant financial impact on the provider.  If the Amazon web site, or Google&#8217;s Gmail, or the Visa payments network goes down even for a few minutes, it can make front page news.
As digital systems grow larger and more complex, it can [...]]]></description>
			<content:encoded><![CDATA[<p>Service interruptions of digital systems can inconvenience millions of people and have a significant financial impact on the provider.  If the Amazon web site, or Google&#8217;s Gmail, or the Visa payments network goes down even for a few minutes, it can make front page news.</p>
<p>As digital systems grow larger and more complex, it can become very challenging to monitor their health and status, which is the first step in detecting potential problems, identifying their root causes, and taking appropriate preventive actions.  These types of systems can contain thousands of different data feeds, data flows and processes.  A problem with just one of them can interrupt payments, ads, or status updates.  Often there are hourly, daily, weekly and seasonal variations in the data that complicate the detection of problems.</p>
<p><span id="more-219"></span></p>
<p>One way to gain some insight into this problem is to look at the origins of statistical quality control in the 1920s.   Walter Andrew Shewhart (1891 &#8211; 1967) was an engineer at the Western Electric Company, which manufactured hardware for the Bell Telephone Company, from 1918 to 1924.  From 1925 to 1956 he was a member of the Technical Staff of Bell Telephone Laboratories [ASQ].</p>
<p><a href="http://opendatagroup.files.wordpress.com/2009/11/shewhart-cover.png"><img class="alignleft size-medium wp-image-227" title="Shewhart - Statistical Method from the Viewpoint of Quality Control" src="http://opendatagroup.files.wordpress.com/2009/11/shewhart-cover.png?w=181" alt="" width="181" height="300" /></a></p>
<p>One of the problems that concerned him was identifying potential problems in factory assembly lines.  For example, the dimensions and weight of metal parts that are sampled from an assembly line can be recorded.  He distinguished between two types of variation in these measurements:</p>
<ul>
<li>A common cause of variation (or noise) occurs as a normal part of the manufacturing process.</li>
<li>A special cause of variation is not part of the normal manufacturing process, but represents a problem.</li>
</ul>
<p>One of the goals of <em>statistical quality control</em> is to distinguish between these two types of variation and to quickly identify special causes of variation.</p>
<p>Shewhart introduced control charts as a tool for distinguishing between common and special causes of variation.  A control chart has a central line and upper and lower control limits.  When a measurement falls above the upper control limit or below the lower control limit, it is considered a potential special cause of variation and is investigated.  Usually, the upper and lower control limits are three standard deviations above and below the mean.</p>
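<p>The control chart rule described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the function names, the part weights, and the choice to estimate the limits from a small in-control baseline sample are all hypothetical and are not taken from Shewhart or from Augustus.</p>

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits estimated from in-control baseline data."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation of the baseline
    return center - k * spread, center, center + k * spread


def out_of_control(measurements, lower, upper):
    """Return the (index, value) pairs that fall outside the control limits."""
    return [(i, x) for i, x in enumerate(measurements) if x < lower or x > upper]


# Hypothetical part weights in grams, sampled while the process ran normally.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
lower, center, upper = control_limits(baseline)

# Hypothetical new samples; the third one is far from the baseline mean.
new_samples = [10.1, 9.9, 11.5, 10.0]
flagged = out_of_control(new_samples, lower, upper)
print(flagged)
```

<p>Each flagged point is only a candidate special cause of variation; as with a paper control chart, it still has to be investigated.</p>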
<p>As anyone who has investigated potential data quality problems knows, identifying the root causes of potential problems is not easy.  Shewhart also introduced a four-step approach to these types of investigations that became known as the Shewhart Cycle, the Deming Cycle or the Plan-Do-Check-Act Cycle:</p>
<ul>
<li><strong>Plan.</strong> Identify an opportunity or potential problem and make a plan for improving it or changing it.</li>
<li><strong>Do.</strong> Implement the change on a small scale and collect the appropriate data.</li>
<li><strong>Check.</strong> Use data to analyze statistically the results of the change and determine whether it made a difference.</li>
<li><strong>Act.</strong> If the change was successful, implement it on a wider scale and continuously monitor and improve your results. If the change did not work, begin the cycle again.</li>
</ul>
<p>These same ideas are still used today as the basis for <strong>health and status monitoring systems</strong>.  Well-designed digital systems are now built from the ground up to produce appropriate log data.  Instead of a single assembly line producing physical items, there are thousands or millions of digital processes producing (nearly) continuous digital data.  Often this data is available through an HTTP interface and is continually collected.</p>
<div id="attachment_226" class="wp-caption alignleft" style="width: 310px"><a href="http://opendatagroup.files.wordpress.com/2009/11/generic-dashboard0.png"><img class="size-medium wp-image-226" title="Augustus Baseline Dashboard" src="http://opendatagroup.files.wordpress.com/2009/11/generic-dashboard0.png?w=300" alt="" width="300" height="297" /></a><p class="wp-caption-text">This is a dashboard from the open source Augustus system for health and status monitoring.</p></div>
<p>Instead of a control chart, a change detection model is used, such as a CUSUM or GLR statistical model [Poor].  Instead of building a single model, a separate model is built for each cell in a multi-dimensional cube of models [Bugajski].   Instead of looking at the charts each day, operators use an online dashboard that is at the hub of an operations center.</p>
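<p>To make the idea of a change detection model concrete, here is a minimal sketch of a two-sided tabular CUSUM in Python.  It is a textbook-style illustration, not the Augustus implementation; the function name, the <code>target</code>, <code>slack</code> and <code>threshold</code> parameters, and the sample feed are all assumptions chosen for the example.</p>

```python
def cusum_alarms(values, target, slack, threshold):
    """Two-sided tabular CUSUM: return indices where a cumulative sum exceeds threshold."""
    high = 0.0  # accumulated evidence of an upward shift
    low = 0.0   # accumulated evidence of a downward shift
    alarms = []
    for i, x in enumerate(values):
        # Deviations smaller than the slack are treated as noise and decay to zero.
        high = max(0.0, high + (x - target - slack))
        low = max(0.0, low + (target - x - slack))
        if high > threshold or low > threshold:
            alarms.append(i)
            high = low = 0.0  # restart both sums after signalling
    return alarms


# Hypothetical feed: requests per minute that shift from about 100 to 103.
values = [100, 101, 99, 100, 103, 103, 103, 103, 103]
print(cusum_alarms(values, target=100, slack=1, threshold=5))
```

<p>Because the sums accumulate evidence across observations, a small but sustained shift can raise an alarm even when no single measurement looks extreme on its own.</p>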
<p>Baseline and change detection models for each cell in a multi-dimensional data cube of models can be built easily using the open source <a href="http://augustus.googlecode.com">Augustus</a> system.</p>
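<p>The idea of one baseline per cell can be sketched without any particular tool.  The Python below is a hypothetical illustration and does not use the Augustus API: the cube dimensions (region and hour of day), the record layout, and the three standard deviation rule are assumptions made for the example.</p>

```python
from collections import defaultdict
from statistics import mean, stdev


def build_cell_baselines(records):
    """Group values by cell key and return {cell: (mean, standard deviation)}."""
    by_cell = defaultdict(list)
    for cell, value in records:
        by_cell[cell].append(value)
    # A cell needs at least two observations to estimate a standard deviation.
    return {cell: (mean(v), stdev(v)) for cell, v in by_cell.items() if len(v) >= 2}


def is_anomalous(baselines, cell, value, k=3.0):
    """True if value is more than k standard deviations from its own cell's mean."""
    center, spread = baselines[cell]
    return abs(value - center) > k * spread


# Hypothetical cube dimensions: (region, hour of day); values are transaction counts.
records = [
    (("east", 9), 100), (("east", 9), 104), (("east", 9), 96),
    (("west", 9), 20), (("west", 9), 22), (("west", 9), 18),
]
baselines = build_cell_baselines(records)
print(is_anomalous(baselines, ("east", 9), 60))  # judged against the east baseline
print(is_anomalous(baselines, ("west", 9), 21))  # judged against the west baseline
```

<p>Keeping a separate baseline per cell means that a count that is routine in one cell can still be flagged in another, which is the point of segmenting the models.</p>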
<p><strong>References</strong></p>
<p>[ASQ] ASQ, The History of Quality &#8211; Overview, retrieved from www.asq.org.</p>
<p>[Bugajski] Joseph Bugajski, Chris Curry, Robert L. Grossman, David Locke and Steve Vejcik, Data Quality Models for High Volume Transaction Streams: A Case Study, Proceedings of the Second Workshop on Data Mining Case Studies and Success Stories, ACM 2007.</p>
<p>[Poor] H. Vincent Poor and Olympia Hadjiliadis, Quickest Detection, Cambridge University Press, 2008.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2009/11/29/health-and-status-monitoring-2/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Announcing PMML 4.0 compliant Augustus</title>
		<link>http://opendatagroup.com/2009/09/15/announcing-pmml-4-0-compliant-augustus/</link>
		<comments>http://opendatagroup.com/2009/09/15/announcing-pmml-4-0-compliant-augustus/#comments</comments>
		<pubDate>Tue, 15 Sep 2009 17:43:36 +0000</pubDate>
		<dc:creator>jennarussell</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[analytic projects]]></category>
		<category><![CDATA[Augustus]]></category>
		<category><![CDATA[data quality]]></category>
		<category><![CDATA[open source analytics]]></category>
		<category><![CDATA[PMML]]></category>

		<guid isPermaLink="false">http://odg.opendatagroup.net/?p=292</guid>
		<description><![CDATA[September 2009
Augustus, an open source analytic scoring engine that works with segmented models, is compliant with the recently adopted PMML 4 standard.  Augustus is designed for use with statistical and data mining models. The new release provides Baseline, Tree and Naive-Bayes producers and consumers.  The new release, training, documentation and support can be found at [...]]]></description>
			<content:encoded><![CDATA[<h3>September 2009</h3>
<p>Augustus, an open source analytic scoring engine that works with segmented models, is compliant with the recently adopted PMML 4 standard.  Augustus is designed for use with statistical and data mining models. The new release provides Baseline, Tree and Naive-Bayes producers and consumers.  The new release, training, documentation and support can be found at our Google Code project, <a href="http://code.google.com/p/augustus/" target="_blank">http://code.google.com/p/augustus/</a></p>
<p>Augustus is typically used to construct models and score data with models. Augustus includes a dedicated application for creating, or producing, predictive models rendered as PMML-compliant files. Scoring is accomplished by <em>consuming</em> PMML-compliant files describing an appropriate model. The typical model development and use cycle with Augustus is as follows:</p>
<ol>
<li>Identify suitable data with which to construct a new model.</li>
<li>Provide a model schema which prescribes the requirements for the model.</li>
<li>Run the Augustus producer to obtain a new model.</li>
<li>Run the Augustus consumer on new data to effect scoring.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2009/09/15/announcing-pmml-4-0-compliant-augustus/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
