Software for Data Analysis

John Chambers


John Chambers turns his attention to R, the enormously successful open-source system based on the S language. His book guides the reader through programming with R, beginning with simple interactive use and progressing by gradual stages, starting with simple functions. More advanced programming techniques can be added as needed, allowing users to grow into software contributors, benefiting their careers and the community. R packages provide a powerful mechanism for contributions to be organized and communicated. This is the only advanced programming book on R, written by the author of the S language from which R evolved.


Mentioned in questions and answers.

How do people learn about giving an R package a namespace? I find the documentation in "R Extensions" fine, but I don't really get what is happening when a variable is imported or exported - I need a dummy's guide to these directives.

How do you decide what is exported? Is it just everything that really shouldn't require the pkg:::var syntax? What about imports?

Do imports make it easier to ensure that your use of other package functions doesn't get confused when function names overlap?

Are there special considerations for S4 classes?

Packages that I'm familiar with that use namespaces such as sp and rgdal are quite complicated - are there simple examples that could make things clearer?

The clearest explanation I've read is in John Chambers' Software for Data Analysis: Programming with R, page 103. I don't know of any free online explanations that are better than what you've already found in the R Extensions manual.
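As a rough illustration of the directives the question asks about, here is a minimal sketch of a NAMESPACE file. The package name (mypkg), the function names (fit_model, summarise_fit) and the S4 class name (Track) are hypothetical, made up for this example; only the directives themselves come from the "R Extensions" manual.

```r
# NAMESPACE file for a hypothetical package "mypkg"

# Names users should be able to call after library(mypkg),
# without resorting to mypkg:::name
export(fit_model, summarise_fit)

# Bind only the functions we actually use from another package, so that
# a same-named function elsewhere on the search path cannot shadow them
importFrom(stats, lm, predict)

# S4 code needs the methods package, and S4 classes and methods
# are exported with their own directives
import(methods)
exportClasses(Track)
exportMethods(show)
```

Anything not listed in an export directive stays internal to the package and is reachable from outside only through the ::: operator.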

Statistical analysis is, in practice, writing code. Whether the analysis is descriptive or inferential, you write code to import data, clean it, analyse it, and compile a report.

Analyzing the data can involve many twists and turns of statistical procedures, and many angles from which you look at your data. At the end, you have many files, with many lines of code, performing tasks on your data. Some of that code is reusable, and you encapsulate it as a "good to have" function.

This process of "statistical analysis" feels to me like "programming", but I am not sure it feels the same to everyone.

From the Wikipedia article on Software development:

The term software development is often used to refer to the activity of computer programming, which is the process of writing and maintaining the source code, whereas the broader sense of the term includes all that is involved between the conception of the desired software through to the final manifestation of the software. Therefore, software development may include research, new development, modification, reuse, re-engineering, maintenance, or any other activities that result in software products. For larger software systems, usually developed by a team of people, some form of process is typically followed to guide the stages of production of the software.

According to this simplistic definition (and my humble opinion), this sounds very much like building a statistical analysis. But I imagine it is not that simple.

Which leads me to my question: what differences can you outline between the two activities?

It can be in terms of the technical aspects, the different strategies or work styles, and whatever else you think is relevant.

This question came to me from the following threads:

As I said in my response to your other question, what you're describing is programming. So the short answer is: there is no difference. The slightly longer answer is that statistical and scientific computing should require even more controls around development than other programming.

A certain percentage of statistical analysis can be done using Excel, or in a point-and-click approach using SPSS, SAS, Matlab, or S-Plus (for instance). A more sophisticated analysis done using one of those programs (or R) that involves programming is clearly a form of software development. And this kind of statistical computing can benefit immensely from following all the best practices from software development: source control, documentation, a project plan, scope document, bug tracking/change control, etc.

Moreover, there are different kinds of statistical analyses that can follow different approaches, as with any programming project:

  • Exploratory data analysis should follow an iterative methodology, like the Agile methodology. In this case, when you don't know explicitly the steps involved up front, it's critical to use a development methodology that is adaptive and self-reflective.
  • A more routine kind of analysis (e.g. a government annual survey such as the Census) could follow a more traditional methodology such as the waterfall approach, since it would be following a very clear set of steps that are mostly known in advance.

I would suggest that any statistician would benefit from reading a book like "Code Complete" (look at the other top books in this post): the more organized you are with your analysis, the greater the likelihood of success.

Statistical analysis in some sense requires even more good practices around version control and documentation than other programming. If your program is just serving some business need, then the algorithm or software used is really of secondary importance so long as the program functions the way the specifications require. On the other hand, with scientific and statistical computing, accuracy and reproducibility are paramount. This is one of John Chambers' (the creator of the S language) major emphases in "Software for Data Analysis". That is another reason to add literate programming (e.g. with Sweave) as an important tool in the statistician's toolkit.
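A small sketch of what reproducibility can look like in code, under the assumption that the analysis uses random resampling. The function name bootstrap_mean and the data are hypothetical, chosen only for illustration; the point is that a reusable step lives in a function, fixes its random seed, and gives the same answer on every run.

```r
# Hypothetical helper: one reusable analysis step wrapped in a function.
bootstrap_mean <- function(x, n_boot = 1000, seed = 1) {
  set.seed(seed)  # fix the RNG state so reruns give the same answer
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  c(estimate = mean(x), se = sd(boot_means))
}

# Made-up data for the example
x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.8)

run1 <- bootstrap_mean(x)
run2 <- bootstrap_mean(x)

# Same input and same seed must give bit-for-bit identical output
stopifnot(identical(run1, run2))

# Record the R version and loaded packages alongside the results
print(sessionInfo())
```

Keeping a script like this under source control, together with the sessionInfo() output, lets someone else rerun the analysis and check that they get the same numbers.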

I am interested in (functional) vector manipulation in R. Specifically, what are R's equivalents to Perl's map and grep?

The following Perl script greps the even array elements and multiplies them by 2:

@a1=(1..8); 
@a2 = map {$_ * 2} grep {$_ % 2 == 0} @a1;
print join(" ", @a2)
# 4 8 12 16

How can I do that in R? I got this far, using sapply for Perl's map:

> a1 <- c(1:8)
> sapply(a1, function(x){x * 2})
[1]  2  4  6  8 10 12 14 16

Where can I read more about such functional array manipulations in R?

Also, is there a Perl to R phrase book, similar to the Perl Python Phrasebook?

Quick ones:

  • Besides sapply(), there are also lapply(), tapply(), by(), aggregate() and more in base R. There are also loads of add-on packages on CRAN, such as plyr.

  • For basic functional programming as in other languages: Reduce(), Map(), Filter(), ... all of which are on the same help page; try help(Reduce) to get started.

  • As noted in the earlier answer, vectorisation is even more appropriate here.

  • As for grep, R actually has three regexp engines built-in, including a Perl-based version from libpcre.

  • You seem to be missing a few things that R already provides. I'd suggest a good recent book on R and the S language; my recommendation would be Chambers (2008) "Software for Data Analysis".
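To make the points above concrete, here is a short sketch of the Perl example from the question, written three ways in base R. The variable names are illustrative; Filter(), Map(), sapply() and grep() are all base R functions.

```r
a1 <- 1:8

# Functional style: Filter() plays the role of Perl's grep on values,
# and sapply() (or Map()) plays the role of Perl's map
evens <- Filter(function(x) x %% 2 == 0, a1)
a2_functional <- sapply(evens, function(x) x * 2)

# Map() always returns a list, so flatten it to get a vector back
a2_map <- unlist(Map(function(x) x * 2, evens))

# Vectorised style: logical indexing selects, arithmetic applies elementwise
a2_vectorised <- a1[a1 %% 2 == 0] * 2

cat(a2_vectorised, "\n")
# 4 8 12 16

# R's own grep() matches regular expressions against character vectors;
# perl = TRUE selects the PCRE-based engine
words <- c("apple", "banana", "cherry")
matched <- grep("an", words, value = TRUE, perl = TRUE)
```

The vectorised form is usually the most idiomatic of the three in R, since it avoids calling a function once per element.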

What good resources are there for R idioms, in the same line as there are for Java and Python?

Easy: 2200+ packages and counting on CRAN :)

Actually, jokes aside, the best description I have read was in Chambers (2008).