Blog
Prototyping Cloud Analytic Applications
Posted by Robert Grossman in Blog on July 27, 2010
Cloud computing is changing the way that companies build and deploy their analytic solutions. With cloud computing, computing is available on demand, scales elastically, and can be self-provisioned. This flexibility sometimes requires developing new analytic infrastructure and new analytic algorithms, which, in turn, requires some experimenting. This process can usually benefit from an external perspective.
The fastest way forward is to use a public cloud, bring in external experts, and run some quick experiments and prototypes. At this point, many companies run into a problem. It is now common for companies to have policies that prohibit placing proprietary data, or data that can identify customers, on public clouds. Giving third parties access to this data is also usually quite difficult.
One practical approach is to replace actual data with simulated data and to use private clouds operated by third parties instead of public clouds. This requires data simulators that produce realistic data. For example, large data sets are rarely normally distributed; more often they follow power laws or similar types of distributions.
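To illustrate why the choice of distribution matters, here is a minimal Python sketch that draws simulated values from a power-law (Pareto) distribution and from a normal distribution, then compares how heavy their tails are. The function names, parameter defaults, and the "tail ratio" measure are assumptions made for this example; they are not taken from any particular data simulator.

```python
import random
import statistics


def simulate_power_law(n, alpha=1.5, x_min=1.0, seed=0):
    """Draw n values from a Pareto (power-law) distribution.

    Uses inverse-transform sampling: if U is uniform on (0, 1],
    then x_min * U ** (-1 / alpha) is Pareto with exponent alpha.
    """
    rng = random.Random(seed)
    # 1.0 - rng.random() lies in (0, 1], so we never raise 0 to a negative power.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


def simulate_normal(n, mean=10.0, stdev=2.0, seed=0):
    """Draw n values from a normal distribution, for comparison."""
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]


def tail_ratio(values):
    """Largest value divided by the median; large for heavy-tailed data."""
    return max(values) / statistics.median(values)


if __name__ == "__main__":
    power = simulate_power_law(10_000)
    normal = simulate_normal(10_000)
    print("power-law tail ratio:", tail_ratio(power))
    print("normal tail ratio:   ", tail_ratio(normal))
```

In the power-law sample a few values are far larger than the typical one, while the normal sample stays close to its median. A simulator that only produced normally distributed values would miss the extremes that often dominate real workloads.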
As a reminder, a private cloud is a cloud that is used exclusively by a single organization. It may be managed by the organization or by a third party, and it may exist on premises (an in-house private cloud) or off premises (a third-party private cloud). In contrast, in a public cloud, the cloud infrastructure is made available to the general public, or a large group, and is owned by an organization selling cloud services (a cloud service provider). In this post, we assume that third-party private clouds are also single-tenant clouds; that is, only one client’s data is on the cloud at a time and the cloud is sanitized between uses by different clients.
In more detail, one approach for moving your analytics to clouds is:
- use simulated data following realistic simulations, instead of actual data;
- supplement in-house expertise with third-party experts who specialize in analytics and cloud computing;
- use third-party private clouds instead of public clouds to decrease risk or perceived risk;
- experiment with different analytic approaches and different analytic infrastructures;
- agree on APIs up front and transfer technology by transferring code that uses these APIs.
We have found this approach works well. We would be interested in hearing your experiences.
Full disclosure: Open Data operates private clouds, has developed software that provides simulated data for a variety of industries, including financial services, and provides consulting services using simulated data on private clouds. These services let companies rapidly explore the use of cloud computing to develop innovative applications, especially analytic applications.
hash-2.0.0
Posted by Christopher Brown in Blog, R on April 30, 2010
The hash-2.0.0 package has been uploaded to CRAN. This version was developed in conjunction with R-2.11.0 and was refactored for performance. hash-2.0.0 requires R-2.10.0 or later and will not be supported on earlier versions of R. This is a result of recent changes to the language itself.
R : NA vs. NULL
Posted by Christopher Brown in Blog, R on April 25, 2010

It is common for programming languages to have a NULL value. What often leads to confusion is the fact that NULL can have two distinct meanings. In the first, NULL is used to represent missing or undefined values. This is well appreciated in SQL. In the second, NULL is the logical representation of a statement that is neither TRUE nor FALSE. This indeterminacy is the basis for ternary logic. While these meanings are distinct, they are very often related. When missing values (the first meaning) are evaluated, the desired result is often an ambiguous result (the second). That is, the former implies the latter. In programming, the distinction is often unnecessary and glossed over, and the two concepts become conflated.
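To make the two meanings concrete, here is a minimal sketch of ternary logic, written in Python rather than R, with `None` standing in for the indeterminate value. The function names are hypothetical and chosen only for this illustration; this is not how R itself implements NA or NULL.

```python
def and3(a, b):
    """Ternary AND: False dominates; otherwise unknown (None) propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a, b):
    """Ternary OR: True dominates; otherwise unknown (None) propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a):
    """Ternary NOT: the negation of unknown is still unknown."""
    return None if a is None else (not a)


def greater_than(x, threshold):
    """Evaluating a missing value (first meaning) yields an
    indeterminate truth value (second meaning)."""
    return None if x is None else x > threshold


if __name__ == "__main__":
    missing = None
    print(greater_than(missing, 3))   # missing input, indeterminate answer
    print(and3(greater_than(missing, 3), False))
    print(or3(greater_than(missing, 3), True))
```

Note that an indeterminate operand does not always give an indeterminate result: `and3(None, False)` is `False` because the answer cannot depend on the unknown value.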
GPU Computing’s Next Decade
Posted by Christopher Brown in Blog on March 3, 2010
Alan Dang at Tom’s Hardware has posted an article prognosticating what the next decade holds for GPUs. Even if you don’t usually find pundit predictions useful, Alan’s is worth the read. Alan has been there since the beginning, and he takes his readers through the history and the motivating economics toward a coherent vision of the GPU’s future. The article compares and contrasts the product mixes, technologies and strategies of the three existing competitors: Nvidia, AMD and Intel.
Despite an emphasis on gaming and video (the primary market and impetus for the technology), the high-performance computing enthusiast should come away with a better understanding of the technology and of the hardware and software toys on the horizon.
Posted by: Christopher Brown, Principal, Open Data Partners
hash-1.99.x
Posted by Christopher Brown in Blog, R on February 17, 2010
Update: hash-2.0.0 has been released; please read about it here:
Earlier today, hash-1.99.x was released to CRAN. This is a stable release that adds more functions to an already full-featured hash implementation. This version fixes some bugs, adds some features, and improves performance and stability. You can read about the hash package in my previous blog post, The hash package: hashes come to R. All changes were made in response to users who wrote in and contributed thoughts, ideas and use cases. Keep the good ideas coming. Two of the major changes are summarized below.