<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Open Data Group &#187; R</title>
	<atom:link href="http://opendatagroup.com/category/blog/r/feed/" rel="self" type="application/rss+xml" />
	<link>http://opendatagroup.com</link>
	<description>Open Data Group&#039;s Home Page and Blog</description>
	<lastBuildDate>Sat, 04 Sep 2010 00:51:55 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>hash-2.0.0</title>
		<link>http://opendatagroup.com/2010/04/30/hash-2-0-0/</link>
		<comments>http://opendatagroup.com/2010/04/30/hash-2-0-0/#comments</comments>
		<pubDate>Fri, 30 Apr 2010 15:34:58 +0000</pubDate>
		<dc:creator>Christopher Brown</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[CRAN]]></category>
		<category><![CDATA[hash package for R]]></category>
		<category><![CDATA[open source analytics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[R packages]]></category>

		<guid isPermaLink="false">http://opendatagroup.com/?p=346</guid>
		<description><![CDATA[Come see my talk on hashes in R at useR! 2010.  (http://user2010.org/)
July 20-23
National Institute of Standards and Technology (NIST),
Gaithersburg, Maryland, USA]]></description>
			<content:encoded><![CDATA[<p><img class="size-full wp-image-433 alignleft" title="hash" src="http://opendatagroup.com/files/2010/04/hash.png" alt="hash" width="221" height="128" />The <strong>hash-2.0.0</strong> package has been uploaded to <strong><a href="http://cran.r-project.org">CRAN</a></strong>.  This version was developed in conjunction with R-2.11.0 and was refactored for performance.   <strong>hash-2.0.0 </strong>requires R-2.10.0 or later and will <strong>not </strong>be supported on earlier versions of R.  This is a result of recent changes to the language itself.</p>
<p><span id="more-346"></span><span style="color: #ff0000"><span style="color: #000000">Important: </span><strong>hash-2.0.0 breaks backward compatibility</strong><span style="color: #000000">;</span> <span style="color: #000000">code written with previous versions of the hash package is not guaranteed to work with this or future versions. </span></span>This is due to changes made to achieve much higher performance.  Assignments and look-ups are faster thanks to direct inheritance from environments, the stripping of non-essential customizations and reliance on core and primitive functions.</p>
<p>Here is a summary of major changes:</p>
<ul>
<li>Coercion of keys to valid R names ( i.e. non-blank character values ) is now the responsibility of the user.  The four accessor functions, [, [[, $ and values, no longer do this automatically.  An error results if a proper R name is not provided.</li>
</ul>
<ul>
<li>The default for missing keys has changed from <span style="color: #333333"><strong><span style="color: #808080">NA</span> </strong></span>to <span style="color: #808080"><strong>NULL</strong></span><span style="color: #000000">. This matches the behavior of lists when trying to access non-existing elements in R.  ( For a more complete discussion, see my previous blog post on the <a href="http://opendatagroup.com/2010/04/25/r-na-v-null/">differences between NA and NULL</a>. )<br />
</span></p>
<ul>
<li><span style="color: #000000">Custom behavior for accessing non-existent keys has been removed.  Access to non-existing keys will always yield NULL.  Consistency is often better than customization.</span></li>
</ul>
</li>
</ul>
<p><em>ChangeLog</em> and <em>TODO</em> track many technical details; here I will discuss only the more  important changes:</p>
<h2>Performance</h2>
<p>Included in this version is a demo script that runs benchmarks ( demo(hash-benchmarks) ).  One of the questions that has been repeatedly posed, often in the context of look-up, is:  <em>how does this compare to native R named lists and vectors?</em> In other words, how much quicker is accessing a value on a hash / environment as opposed to a list (or vector)?  This is a difficult question, and the answer generally depends on the size of the hash or list.  My rule of thumb is that look-ups are quicker on lists and vectors with fewer than about 500 elements.  Beyond ~500 elements, hashes and environments greatly outperform lists, and the difference grows with the size of the object.  However, look-ups on all of these objects are very fast when the objects are small  ( &gt;120,000 / sec ).  So unless you are doing many serial look-ups on small objects, hashes are likely the better option.</p>
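<p>The demo that ships with the package is the place to look for real numbers.  As a rough, package-independent illustration of the comparison described above, here is a minimal sketch in base R that times repeated look-ups by name on a named list and on a hashed environment.  The size <code>n</code>, the key names and the number of look-ups are arbitrary assumptions, not values from the benchmark demo, and timings will vary by machine and R version:</p>
<pre>
n    &lt;- 5000                                 # assumed container size
keys &lt;- paste0("key", seq_len(n))            # hypothetical key names

# A named list, looked up by name
li &lt;- as.list(seq_len(n))
names(li) &lt;- keys

# A hashed environment holding the same key-value pairs
env &lt;- new.env(hash = TRUE)
for (i in seq_len(n)) assign(keys[i], i, envir = env)

# Look up the same random keys in both containers
set.seed(1)
probe &lt;- sample(keys, 10000, replace = TRUE)

t_list &lt;- system.time(for (k in probe) li[[k]])
t_env  &lt;- system.time(for (k in probe) env[[k]])

# Both containers return the same values; only the timings differ
same &lt;- identical(li[[probe[1]]], env[[probe[1]]])
c(list = t_list[["elapsed"]], environment = t_env[["elapsed"]])
</pre>
<p>Re-running the sketch with smaller and larger values of <code>n</code> is a simple way to see where the crossover falls on your own machine.</p>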
<p>I have written previously about hashes in R [<a href="../2009/07/26/hash-package-for-r/">1</a>]  [<a href="../2010/02/17/hash-1-99-x/" target="_self">2</a>], and will continue to discuss the evolution of R hashes on this blog.  Additionally, I will be speaking on this and related work at <a href="http://user2010.org/" target="_blank">useR!2010</a> (July 20-23).</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2010/04/30/hash-2-0-0/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>R : NA vs. NULL</title>
		<link>http://opendatagroup.com/2010/04/25/r-na-v-null/</link>
		<comments>http://opendatagroup.com/2010/04/25/r-na-v-null/#comments</comments>
		<pubDate>Sun, 25 Apr 2010 12:51:02 +0000</pubDate>
		<dc:creator>Christopher Brown</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://opendatagroup.com/?p=348</guid>
		<description><![CDATA[The R language has two closely related NULL-like values, NA  and NULL ... Both are used to represent missing or undefined values.  This has led to much confusion. ]]></description>
			<content:encoded><![CDATA[<p><img class="size-full wp-image-350 alignleft" title="na-null" src="http://opendatagroup.com/files/2010/04/na-null.png" alt="na-null" width="253" height="50" /></p>
<p>It is common for programming languages to have a <a href="http://en.wikipedia.org/wiki/NULL">NULL</a> value.  What often leads to confusion is the fact that NULL can have two distinct meanings.  In the first, NULL is used to represent missing or undefined values.  This is well appreciated in SQL. In the second, NULL is the logical representation of a statement that is neither TRUE nor FALSE.  This indeterminacy is the basis for <a href="http://en.wikipedia.org/wiki/Ternary_logic">ternary logic</a>.  While these meanings are distinct, they are very often related.  When missing values (the first meaning) are evaluated, the desired result is often an ambiguous result (the second).  That is, the former implies the latter.  In programming, the distinction is often unnecessary and glossed over, and the concepts become confounded.</p>
<p><span id="more-348"></span></p>
<p>The <strong>R</strong> language has two closely related NULL-like values<strong>, <span style="color: #888888">NA</span></strong> and <span style="color: #888888"><strong>NULL</strong></span>.  Both are fully supported in the language by core functions (e.g., <span style="color: #888888"><strong>is.na</strong>, <strong>is.null</strong>, <strong>as.null</strong><span style="color: #000000">, etc.)</span>. </span>And, while <strong><span style="color: #808080">NA</span></strong> is used exclusively in the logical sense, both are used to represent missing or undefined values.  This has led to much confusion.  Here&#8217;s what the R documentation has to say:</p>
<blockquote><p><strong>NULL</strong> represents the null object in R: it is a reserved word.<br />
NULL is often returned by expressions and functions whose values are<br />
undefined.</p></blockquote>
<blockquote><p><strong>NA</strong> is a logical constant of length 1 which contains a missing<br />
value indicator. NA can be freely coerced to any other vector<br />
type except raw.  There are also constants NA_integer_,<br />
NA_real_, NA_complex_ and NA_character_ of the other atomic<br />
vector types which support missing values: all of these are<br />
reserved words in the R language.</p></blockquote>
<p>There is a lot of subtlety in the treatment of these values.  A good way to understand the distinction between  <span style="color: #888888">NA</span> and <span style="color: #888888">NULL</span> is through some examples:</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="312" valign="top"><strong>NA</strong></td>
<td width="293" valign="top"><strong>NULL</strong></td>
</tr>
<tr>
<td width="312" valign="top">
<pre><span style="color: #ff0000">
 &gt; NA</span><span style="color: #0000ff">
 [1] NA

</span><span style="color: #ff0000"> &gt; class(NA)</span><span style="color: #0000ff">
 [1]   "logical"

</span><span style="color: #ff0000"> &gt; NA &gt; 1</span>
<span style="color: #0000ff"> [1] NA</span></pre>
</td>
<td width="293" valign="top">
<pre><span style="color: #ff0000">
 &gt; NULL</span><span style="color: #0000ff">
 NULL</span><span style="color: #ff0000"> 

 &gt; class(NULL)</span><span style="color: #0000ff">
 [1]   "NULL"<span style="color: #ff0000">

 &gt; NULL &gt; 1</span>
 logical(0)
</span></pre>
</td>
</tr>
</tbody>
</table>
<p>The important distinction is that <span style="color: #808080">NA</span> is a &#8216;logical&#8217; value that, when evaluated in an expression, yields NA.  This is the expected behavior of a value that handles logical indeterminacy.   <span style="color: #808080">NULL</span> is its own thing: evaluated in an expression, it yields an empty result ( logical(0) above ) rather than a value, which is not how we would want or expect <span style="color: #808080">NA</span> to work.</p>
<p>To delve deeper into the behavior, we must look at how R&#8217;s basic data structures, vectors (including matrices and arrays) and lists (including data.frames), behave.  Vectors and lists are similar structures; both allow for multiple values and have similar <a href="http://opendatagroup.com/2009/10/21/r-accessors-explained">accessors</a>.  There are subtle differences in the treatment of <span style="color: #808080">NA </span>and<span style="color: #808080"> NULL</span>.  Let&#8217;s take a look at how they compare:</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr style="text-align: left">
<td width="361" valign="top">
<p style="text-align: left"><strong> Vectors ( inc. Matrices and Arrays )</strong></p>
</td>
<td width="330" valign="top"><strong> List ( inc. data frames )</strong></td>
</tr>
<tr>
<td width="361" valign="top">
<pre> <span style="color: #ff0000">&gt; v &lt;-  c( 1, NA, NULL)
 &gt; v</span><span style="color: #0000ff"> 
 [1]  1 NA

 </span></pre>
</td>
<td width="330" valign="top">
<pre><span style="color: #ff0000">
 &gt; list(1, NA, NULL)
</span><span style="color: #0000ff"> [[1]]
 [1] 1</span><span style="color: #0000ff">

 [[2]]</span><span style="color: #0000ff">
 [1] NA
</span>
<span style="color: #0000ff"> [[3]]
 NULL</span></pre>
</td>
</tr>
</tbody>
</table>
<p>What happened?  <span style="color: #808080">NULL </span>is not allowed in a vector.  When you attempt to set it as a value in a vector, it is quietly ignored.  This is because <span style="color: #808080">NULL </span>is an object and type of its own.  <span style="color: #808080">NULL </span>does not have typed variants such as NULL_integer_.  There is just <span style="color: #808080">NULL</span>. By contrast, <span style="color: #808080">NA</span> has NA_integer_, etc. and happily coexists with any of the basic vector types.  <em>So for any vector (matrix or array), <span style="color: #808080">NA </span>represents a missing value.  <span style="color: #808080">NULL does not</span></em>.</p>
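<p>A few more lines at the console make the same point.  This is only a small sketch using base R functions; the variable name <code>v</code> is the one from the table above:</p>
<pre>
v &lt;- c(1, NA, NULL)
length(v)              # 2: NULL contributed nothing to the vector

typeof(NA)             # "logical"
typeof(c(1L, NA))      # "integer": NA was coerced to NA_integer_
typeof(c("a", NA))     # "character": NA was coerced to NA_character_

is.na(c(1L, NA))       # FALSE TRUE: the missing value is still detectable
length(NULL)           # 0: NULL has no elements at all
</pre>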
<p>Now, let&#8217;s look at the list example. This is interesting! Unlike the vector, the list can hold objects and values other than the basic types.  This includes the <strong><span style="color: #808080">NULL</span> </strong>value/object.  Perhaps a little inconsistent and not what we would expect.  But from here, things get a little quirky.  Let&#8217;s try value assignment:</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td width="361" valign="top"><strong>Vectors ( inc. Matrices and Arrays )</strong></td>
<td width="330" valign="top"><strong>List ( inc. data frames )</strong></td>
</tr>
<tr>
<td width="361" valign="top">
<pre><span style="color: #ff0000">
  &gt; v[[1]] &lt;- NULL</span><span style="color: #0000ff">
   Error in v[[1]] &lt;- NULL :</span><span style="color: #0000ff">
    more elements supplied than there are to replace

 </span></pre>
</td>
<td width="330" valign="top">
<pre><span style="color: #ff0000">
 &gt; li &lt;- list( 1, 2, 3 )
 &gt; li[[1]] &lt;- NULL
 &gt; li</span>
<span style="color: #0000ff"> [[1]]
 [1] 2
</span>
<span style="color: #0000ff"> [[2]]
 [1] 3</span></pre>
</td>
</tr>
</tbody>
</table>
<p>Sure enough, <span style="color: #808080">NULL </span>cannot be assigned to a vector.  So for all purposes, <span style="color: #666699">NA </span>with respect to the basic vector behaves like <span style="color: #666699">NULL</span> in other languages.  <span style="color: #666699">NULL</span> is almost never what you want.  On the list side, however, we see an idiom of <span style="color: #666699">NULL</span>: <em>assigning <span style="color: #808080">NULL </span>to list items removes them</em>.  This behavior is a bit unexpected, but it is the idiom.</p>
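<p>Because assignment of NULL removes an item, storing an actual NULL inside an existing list takes a slightly different spelling.  The sketch below uses only base R; the names <code>li</code> and <code>li2</code> are just for illustration:</p>
<pre>
li &lt;- list( 1, 2, 3 )
li[[1]] &lt;- NULL           # removes the first element
length(li)                   # 2

li2 &lt;- list( 1, 2, 3 )
li2[1] &lt;- list(NULL)      # single brackets, with a list on the right-hand side
length(li2)                  # 3: the element is kept
is.null( li2[[1]] )          # TRUE: and it now holds NULL
</pre>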
<p>There is one final idiom to know about <span style="color: #888888">NULL </span>and lists. Namely, that trying to access a list element by a non-existing name yields a <span style="color: #888888">NULL </span>value.</p>
<pre style="padding-left: 30px"><span style="color: #ff0000">&gt; li$aa</span>
<span style="color: #0000ff">NULL</span>
<span style="color: #ff0000">&gt; li[['aa']]</span>
<span style="color: #0000ff">NULL

</span><span style="color: #0000ff"> </span></pre>
<p>( Note: the same is true for trying to access non-existing objects on an environment )</p>
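<p>To see the three containers side by side, here is a short base R sketch.  The names <code>v</code>, <code>li</code>, <code>e</code> and the key <code>zz</code> are made up for the example.  Since NULL cannot distinguish "absent" from "present but NULL", the last two lines test for existence explicitly:</p>
<pre>
v  &lt;- c( a = 1, b = 2 )
li &lt;- list( a = 1, b = 2 )
e  &lt;- new.env()

v["zz"]          # NA (named &lt;NA&gt;): vectors answer with a missing value
li[["zz"]]       # NULL
li$zz            # NULL
e$zz             # NULL
e[["zz"]]        # NULL

"zz" %in% names(li)                          # FALSE
exists( "zz", envir = e, inherits = FALSE )  # FALSE
</pre>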
<p>R does not have a consistent or intuitive way of dealing with missing and logically ambiguous values, i.e. addressing the two meanings from the beginning of this post.  For vectors and basic variables, R mimics other languages and uses <span style="color: #888888">NA</span>.  For lists however, the syntax is more idiomatic.  It is this latter case that presents difficulty.  R has other quirks <a href="http://www.r-statistics.com/2010/04/the-difference-between-lettersc1na-and-letterscnana/">too</a>.  But all languages have quirks, and given R&#8217;s strength for statistical analysis, I have found no better tool for this.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2010/04/25/r-na-v-null/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>hash-1.99.x</title>
		<link>http://opendatagroup.com/2010/02/17/hash-1-99-x/</link>
		<comments>http://opendatagroup.com/2010/02/17/hash-1-99-x/#comments</comments>
		<pubDate>Wed, 17 Feb 2010 15:26:45 +0000</pubDate>
		<dc:creator>Christopher Brown</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[hashes]]></category>
		<category><![CDATA[R packages]]></category>
		<category><![CDATA[R programming]]></category>

		<guid isPermaLink="false">http://opendatagroup.com/?p=322</guid>
		<description><![CDATA[hash-2.0.0 has been released please read about it here: 
Earlier today, hash-1.99.x was released to CRAN.  This is a stable release and adds some more functions to an already full-featured hash implementation.  This version fixes some bugs, adds some features, improves performance and stability.  You can read about the hash package in [...]]]></description>
			<content:encoded><![CDATA[<p><span style="color: #ff0000"><strong><span>hash-2.0.0 has been released; please read about it <a href="http://opendatagroup.com/2010/04/30/hash-2-0-0/">here</a>.</span></strong></span></p>
<p>Earlier today, hash-1.99.x was released to CRAN.  This is a stable release and adds some more functions to an already full-featured hash implementation.  This version fixes some bugs, adds some features, and improves performance and stability.  You can read about the hash package in my previous blog post,<a href="http://opendatagroup.com/2009/07/26/hash-package-for-r/"> The hash package: hashes come to R</a>.  All changes were prompted by users who wrote in and contributed thoughts, ideas and use cases.  Keep the good ideas coming.  Two of the major changes are summarized below.</p>
<p><span id="more-322"></span></p>
<p>Matthias Buch-Kromann of the Copenhagen Business School recommended the ability to access multiple keys from a single call and even access the same key multiple times.  This was previously allowed using the <code>[[</code> method, but was deprecated.  By convention, the <code>[[</code> method returns only one value.  ( You can read about the conventions of this and other R accessors in my previous blog post, <a href="http://opendatagroup.com/2009/10/21/r-accessors-explained/">R Accessors Explained</a>. ) This behavior has returned in hash-1.99.x through the <code>values</code> method and its optional <code>keys</code> argument:</p>
<p style="padding-left: 30px"><code><span style="color: #333333"><br />
h &lt;- hash( c('a','b','c'), 1:3 )<br />
values(h)<br />
values(h, keys=c('a','b','c','a','b','c' ) )</span><br />
</code></p>
<p>Matthias suggested calling the method <code>mget</code>, but there was some disparity with the <code>mget</code> function in base.  The generic function that I needed just wouldn't play nice with base::mget.</p>
<p>Another change in the behavior was prompted by Mohammad Fahim of the Department of Computer Engineering and Computer Science at the University of Louisville.  He wrote me to ask if there is a way to suppress warnings when trying to access non-existent keys.  When accessing  hashes hundreds of thousands of times, it becomes a drag to continually see:</p>
<p style="padding-left: 30px"><code>key: xxxx not found in the hash : hash_table_name</code></p>
<p>I have refactored the behavior to be more R-like by following <span style="color: #333333"><code>na.action</code>-</span>type conventions.  Now the default behavior is to return <span style="color: #333333"><code>NA</code></span> when trying to access non-existing keys.</p>
<p style="padding-left: 30px"><code><br />
<span style="color: #333333">&gt; library(hash)<br />
&gt; h &lt;- hash( c('a','b','c'), 1:3 )<br />
&gt; h[ letters[1:5] ]<br />
containing 5 key-value pair(s).<br />
a : 1<br />
b : 2<br />
c : 3<br />
d : NA<br />
e : NA</span><br />
</code></p>
<p>The behavior is also controllable through the <code>na.hash.action</code> option.  Functions are provided for most use cases:</p>
<ul>
<li><code>na.default.hash</code> (default) returns <code>NA</code> silently,</li>
<li><code>na.fail.hash</code> (old default) errors on non-existing keys,</li>
<li><code>na.warn.hash</code> returns <code>NA</code> but issues a warning.</li>
</ul>
<p>Set the <code>na.hash.action</code> option to choose a behavior.  For example, to get the old default behavior:</p>
<p style="padding-left: 30px"><code><br />
&gt; <span style="color: #333333">options( na.hash.action = na.fail.hash )<br />
&gt; h$d<br />
Error: key, d, not found in hash.<br />
&gt; h[[ 'd' ]]<br />
Error: key, d, not found in hash.</span><br />
</code></p>
<p>And, for the <span style="color: #333333"><code>[</code></span> and <span style="color: #333333"><code>[[</code> </span>methods, this behavior can be declared at access time:</p>
<p style="padding-left: 30px"><code><br />
&gt; h[[ 'd', na.action=na.warn.hash ]]<br />
Warning: key, d, not found in hash.<br />
d<br />
NA<br />
&gt; h[[ 'd', na.action=na.fail.hash ]]<br />
Error: key, d, not found in hash.<br />
&gt; h[[ 'd', na.action=na.default.hash ]]<br />
d<br />
NA<br />
</code></p>
<p>If you don&#8217;t like these hash-key-miss behaviors, you are free to write your own.  Functions should minimally accept arguments of the hash and the key.</p>
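<p>To illustrate the shape of such a function without depending on the package, here is a minimal sketch in base R.  Everything in it is hypothetical: <code>lookup</code> and <code>na.zero</code> are stand-ins written for this example and are not part of the hash package, and a plain environment plays the role of the hash.  The point is only that a miss handler receives the container and the key and decides what to return:</p>
<pre>
# Hypothetical miss handler: receives the container and the missing key
na.zero &lt;- function( h, key ) {
  message( "key, ", key, ", not found; using 0" )
  0
}

# Hypothetical accessor that delegates misses to a handler
lookup &lt;- function( h, key, na.action = function( h, key ) NA ) {
  if ( exists( key, envir = h, inherits = FALSE ) ) {
    get( key, envir = h, inherits = FALSE )
  } else {
    na.action( h, key )
  }
}

h &lt;- new.env( hash = TRUE )     # a plain environment standing in for a hash
assign( "a", 1, envir = h )

lookup( h, "a" )                        # 1
lookup( h, "d" )                        # NA, silently
lookup( h, "d", na.action = na.zero )   # 0, with a message
</pre>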
<p>Thanks to both Matthias and Mohammad for your feedback.</p>
<p>New features are on their way, notably the ability to use any object as a key and to preserve the order of the hash.  These are sometimes called Indexed Hashes.  Look for that in the hash-2.00.x release.  If you would like to see features added, contact me at cbrown -at- opendatagroup.com</p>
<p>References:</p>
<ul>
<li><a href="http://opendatagroup.com/2009/07/26/hash-package-for-r/">The hash package: hashes come to R</a></li>
<li><a href="http://opendatagroup.com/2009/10/21/r-accessors-explained/">R Accessors Explained</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2010/02/17/hash-1-99-x/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>[, [[, $: R accessors explained</title>
		<link>http://opendatagroup.com/2009/10/21/r-accessors-explained/</link>
		<comments>http://opendatagroup.com/2009/10/21/r-accessors-explained/#comments</comments>
		<pubDate>Wed, 21 Oct 2009 23:00:19 +0000</pubDate>
		<dc:creator>Christopher Brown</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://blog.opendatagroup.com/?p=137</guid>
		<description><![CDATA[For more than ten years, I have been teaching R both formally and informally.  One thing that I find often trips up students is the use of R&#8217;s accessors and mutators.  ( For those readers not from a formal computer science background, an accessor is a method for accessing data in an object [...]]]></description>
			<content:encoded><![CDATA[<div id="attachment_197" class="wp-caption alignleft" style="width: 197px"><img class="size-full wp-image-197" title="R Accessors" src="http://opendatagroup.files.wordpress.com/2009/10/accessors.png" alt="R Accessors" width="187" height="132" /><p class="wp-caption-text">R Accessors</p></div>
<p>For more than ten years, I have been teaching R both formally and informally.  One thing that I find often trips up students is the use of R&#8217;s accessors and mutators.  ( For those readers not from a formal computer science background, an accessor is a method for accessing data in an object, usually an attribute of that object. )  A simple example is taking a subset of a vector:</p>
<p><code><br />
letters[1:3]<br />
[1] "a" "b" "c"<br />
</code></p>
<p>As you can see, the result is a <em>character vector</em> containing the first three letters of the letters vector.</p>
<p>Good programming languages have a standard pattern for accessors and mutators.  For R, there are three: <strong>[</strong>, <strong>[[</strong>, and <strong>$</strong>.  This confuses beginners coming from other programming languages.  Java and Python have one: '.'.  Why does R need three?</p>
<p>The reason derives from R's data-centric view of the world.  R natively provides vectors, lists, data frames, matrices, etc.  In truth, one can get by using only <strong>[</strong> to extract information from these structures, but the others are handy in certain scenarios.  So much so that after a while, they feel indispensable.  I will explain each, and hopefully by the end of this article you will understand why each exists, what to remember and, more importantly, when each should be used.</p>
<p><strong>Subset with [</strong></p>
<p>When you want a subset of an object, use <strong>[</strong>. Remember that when you take a subset of an object you get the same type of thing.  Thus, the subset of a vector will be a vector, the subset of a list will be a list and the subset of a data.frame will be a data.frame.</p>
<p>There is one inconsistency, however.  The default in R is to reduce the result to the lowest dimension, so if your subset contains only one row or column, you will get just that one item, which may be something of a different type.  Thus, taking a subset of the iris data frame with only one column</p>
<p><code><br />
class( iris[ , "Petal.Length" ] )<br />
[1] "numeric"<br />
</code></p>
<p>returns a numeric vector and not a data frame.  You can override this behavior with the little-publicized drop parameter, which tells R not to reduce the result.  Taking the subset of iris with drop = FALSE</p>
<p><code><br />
iris[ , "Petal.Length", drop=FALSE ]<br />
</code></p>
<p>is a proper data frame.</p>
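<p>Checking the class and dimensions of each result makes the effect of drop easy to see.  This small sketch uses the built-in iris data plus a throwaway matrix <code>m</code> that is made up for the example:</p>
<pre>
class( iris[ , "Petal.Length" ] )                 # "numeric"
class( iris[ , "Petal.Length", drop = FALSE ] )   # "data.frame"
dim( iris[ , "Petal.Length", drop = FALSE ] )     # 150 1

# The same rule applies to matrices
m &lt;- matrix( 1:6, nrow = 2 )
dim( m[ 1, ] )                   # NULL: reduced to a plain vector
dim( m[ 1, , drop = FALSE ] )    # 1 3: still a matrix
</pre>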
<p>Things to Remember:</p>
<ul>
<li> Most often, a subset is the same type as the original object.</li>
<li> Both indices and names can be used to extract the subset. ( In order to use names, object must have a name type attribute such as names, rownames, colnames, etc. )</li>
<li> You can use negative integers to indicate exclusion.</li>
<li>Unquoted variables are interpolated within the brackets.</li>
</ul>
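<p>The points above can be tried out on a small named vector.  The vector <code>x</code> and the variable <code>i</code> are invented for this sketch:</p>
<pre>
x &lt;- c( a = 10, b = 20, c = 30, d = 40 )

x[ 2:3 ]             # by index: b and c
x[ c("a", "d") ]     # by name: a and d
x[ -1 ]              # exclusion: everything but the first element

i &lt;- c( "b", "c" )
x[ i ]               # the variable i is evaluated first: b and c

class( list( 1, "a", TRUE )[ 1:2 ] )   # "list": a subset of a list is a list
</pre>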
<p><strong>Extract one item with [[</strong></p>
<p>The double square brackets are used to extract <em>one</em> element from potentially many.  Vectors yield a vector with a single value; data frames give a column vector; lists give one element:</p>
<p><code><br />
letters[[3]]<br />
iris[["Petal.Length"]]<br />
</code></p>
<p>The mnemonic device here is that the double square brackets look as if you are asking for something deep within a container.  You are not taking a slice but reaching to get at the <em>one</em> thing at the core.</p>
<p>Important things to remember:</p>
<ul>
<li> You can return only <em>one</em> item.</li>
<li> The result is <em>not</em> (necessarily) the same type of object as the container.</li>
<li> The dimension will be the dimension of the one item which is not necessarily 1.</li>
<li> And, as before:
<ul>
<li> Names or indices can both be used.</li>
<li> Variables are interpolated.</li>
</ul>
</li>
</ul>
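<p>A list shows the difference between the single and double brackets most clearly.  The list <code>li</code> and the variable <code>key</code> below are made up for this sketch:</p>
<pre>
li &lt;- list( a = 1:3, b = "x", m = matrix( 1:4, nrow = 2 ) )

class( li["a"] )       # "list": a subset of the list
class( li[["a"]] )     # "integer": the element itself
dim( li[["m"]] )       # 2 2: the dimension of the one item, not 1

key &lt;- "b"
li[[ key ]]            # "x": the variable key is evaluated first
</pre>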
<p><strong>Interact with $</strong></p>
<p>Interestingly enough, the accessor that provides the least unique utility is also probably the most often used.  <strong>$</strong> is a special case of <strong>[[</strong> in which you access a single item by actual name.   The following are equivalent:</p>
<p><code><br />
iris$Petal.Length<br />
iris[["Petal.Length"]]<br />
</code></p>
<p>The appeal of this accessor is nothing more than brevity.  One character, $, replaces six, [[""]].  This accessor is handiest when doing interactive programming but should be discouraged for more production-oriented code because of its limitations, namely the inability to interpolate names or use integer indices.</p>
<p>Things to Remember:</p>
<ul>
<li> You cannot use integer indices</li>
<li> The name will not be interpolated.</li>
<li> Returns only one item.</li>
<li>If the name contains special characters, the name must be enclosed in backticks: ``</li>
</ul>
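<p>These limitations are easy to demonstrate with the built-in iris data.  The variable <code>col</code> and the data frame <code>df</code> are invented for this sketch:</p>
<pre>
col &lt;- "Petal.Length"

identical( iris$Petal.Length, iris[["Petal.Length"]] )   # TRUE
identical( iris[[ col ]], iris$Petal.Length )            # TRUE: [[ evaluates col
is.null( iris$col )                                      # TRUE: $ looks for a column called "col"

# A name with special characters needs backticks
df &lt;- data.frame( `my col` = 1:2, check.names = FALSE )
df$`my col`                                              # 1 2
</pre>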
<p>That is really all there is to it.  [ - for subsets, [[ - for extracting items, and $ - for extracting by name.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2009/10/21/r-accessors-explained/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>R: The Dummies Package</title>
		<link>http://opendatagroup.com/2009/09/30/r-the-dummies-package/</link>
		<comments>http://opendatagroup.com/2009/09/30/r-the-dummies-package/#comments</comments>
		<pubDate>Wed, 30 Sep 2009 18:20:47 +0000</pubDate>
		<dc:creator>Christopher Brown</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[dummies package]]></category>
		<category><![CDATA[dummy variables]]></category>

		<guid isPermaLink="false">http://blog.opendatagroup.com/?p=117</guid>
		<description><![CDATA[R-2.9.2 was released in August.   While R can be considered stable and battle-ready, it is also far from stagnation.  It is humbling to see such an intelligent and vibrant community helping CRAN grow faster than ever.   Every day I see a new package or read a new comment on R-Help [...]]]></description>
			<content:encoded><![CDATA[<p>R-2.9.2 was released in August.   While R can be considered stable and battle-ready, it is also far from stagnation.  It is humbling to see such an intelligent and vibrant community helping CRAN grow faster than ever.   Every day I see a new package or read a new comment on R-Help that gives me pause to think.</p>
<p>As much as I like R, on occasion I will find myself lost in some dark corner.  Sometimes, I find light.  Sometimes I am gnashing teeth and wringing hands.  Frustrated.  In a recent foray, I found myself trying to do something that I thought exceedingly trivial: expanding character and factor vectors to dummy variables.  There must be some function, but what?   Trying ?dummy didn&#8217;t turn up anything.  Surely someone else must have encountered this and provided a package.   I went to the Internet and sure enough the <a href="http://wiki.r-project.org/rwiki/doku.php?id=tips:data-manip:create_indicator">R-wiki</a> was here to save me.  And looking even harder, I found some who had trodden this path before me in the R-Help archives.  It turns out, it&#8217;s simple.  Expanding a variable as a dummy variable can be done like so:</p>
<p><code><br />
x &lt;- c(2, 2, 5, 3, 6, 5, NA)<br />
xf &lt;- factor(x, levels = 2:6)<br />
model.matrix( ~ xf - 1)<br />
</code></p>
<p>Two problems.  The first problem is that without an external source (Google), I would have never stumbled upon what I wanted.  ( Thanks Google!)  I understand it now, but for what I wanted to do, I would never have thought, &#8220;oh, model.matrix.&#8221;</p>
<p>The second problem is the arcane syntax, <code>wtf &lt;- ~ xf - 1</code>.  I get it now, but it took me some time to figure out what was going on.  I get it, but why not just <code>dummy(var)</code>?  This is what I want to do.</p>
<p>The solution on the wiki wasn&#8217;t quite what I was looking for.  For instance, you can&#8217;t say:</p>
<p><code>model.matrix( ~ xf1 + xf2 + xf3 - 1)</code></p>
<p>It turns out, you can only expand one variable at a time.  Well, this is not good.  I know that you could solve this with some sapply&#8217;s and some tests, but next time I might forget how to do it.  So with a couple of spare hours, I decided that the next guy wouldn&#8217;t have to think about it.  He could just use my <a href="http://cran.r-project.org/web/packages/dummies">dummies package</a>.</p>
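<p>For the curious, the do-it-yourself route might look roughly like the sketch below.  It is base R only and is not how the dummies package is implemented; the data frame <code>df</code>, its columns and the helper <code>expand_one</code> are all made up for the example.  It also assumes there are no missing values, since model.matrix drops rows with NA by default and the pieces would then no longer line up:</p>
<pre>
# Hypothetical data: two factor columns to expand
df &lt;- data.frame(
  color = factor( c("red", "blue", "red") ),
  size  = factor( c("S", "L", "M") )
)

# Expand one column at a time, exactly as in the single-variable case
expand_one &lt;- function( name, data ) {
  f &lt;- data[[ name ]]
  m &lt;- model.matrix( ~ f - 1 )
  colnames(m) &lt;- paste0( name, levels(f) )   # e.g. colorblue, colorred
  m
}

# Apply to every column and glue the pieces together
dummies &lt;- do.call( cbind, lapply( names(df), expand_one, data = df ) )
dummies
</pre>
<p>It works, but it is exactly the kind of thing one does not want to rewrite every time.</p>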
<p>Like the R-wiki solution, the dummies package provides a nice interface for encoding a single variable.  You can pass a variable -or- a variable name with a data frame.  These are equivalent:</p>
<p><code><br />
dummy( df$var )<br />
dummy( "var", df )<br />
</code></p>
<p>Moreover, you can choose the style of the dummy names, whether to include unused factor level, to have verbose output, etc.</p>
<p>But more than the R-wiki solution, dummy.data.frame does something similar for data frames.  You can specify which columns to expand by name or class and whether to return non-expanded columns.</p>
<p>The package dummies-1.04 is available on CRAN.  Comments and questions are always appreciated.</p>
]]></content:encoded>
			<wfw:commentRss>http://opendatagroup.com/2009/09/30/r-the-dummies-package/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
