## Data Manipulation with R

Phil Spector

Mentioned 1

This book presents a wide array of methods applicable for reading data into R, and efficiently manipulating that data. In addition to the built-in functions, a number of readily available packages from CRAN (the Comprehensive R Archive Network) are also covered. All of the methods presented take advantage of the core features of R: vectorization, efficient use of subscripting, and the proper use of the varied functions in R that are provided for common data management tasks. Most experienced R users discover that, especially when working with large data sets, it may be helpful to use other programs, notably databases, in conjunction with R. Accordingly, the use of databases in R is covered in detail, along with methods for extracting data from spreadsheets and datasets created by other programs. Character manipulation, while sometimes overlooked within R, is also covered in detail, allowing problems that are traditionally solved by scripting languages to be carried out entirely within R. For users with experience in other languages, guidelines for the effective use of programming constructs like loops are provided. Since many statistical modeling and graphics functions need their data presented in a data frame, techniques for converting the output of commonly used functions to data frames are provided throughout the book.

In order to share some more tips and tricks for R, what is your single-most useful feature or trick? Clever vectorization? Data input/output? Visualization and graphics? Statistical analysis? Special functions? The interactive environment itself?

One item per post, and we will see if we get a winner by means of votes.

[Edit 25-Aug 2008]: So after one week, it seems that the simple `str()` won the poll. As I like to recommend that one myself, it is an easy answer to accept.

I do a lot of basic manipulation of data, so here are two built-in functions ( transform , subset ) and one library ( sqldf ) that I use daily.

## create sample sales data

``````sales <- expand.grid(country = c('USA', 'UK', 'FR'),
product = c(1, 2, 3))
sales\$revenue <- rnorm(dim(sales)[1], mean=100, sd=10)

> sales
country product   revenue
1     USA       1 108.45965
2      UK       1  97.07981
3      FR       1  99.66225
4     USA       2 100.34754
5      UK       2  87.12262
6      FR       2 112.86084
7     USA       3  95.87880
8      UK       3  96.43581
9      FR       3  94.59259
``````

## use transform() to add a column

``````## transform currency to euros
usd2eur <- 1.434
transform(sales, euro = revenue * usd2eur)

>
country product   revenue     euro
1     USA       1 108.45965 155.5311
2      UK       1  97.07981 139.2125
3      FR       1  99.66225 142.9157
...
``````

## use subset() to slice the data

``````subset(sales,
country == 'USA' & product %in% c(1, 2),
select = c('product', 'revenue'))

>
product  revenue
1       1 108.4597
4       2 100.3475
``````

## use sqldf() to slice and aggregate with SQL

The sqldf package provides an SQL interface to R data frames

``````##  recast the previous subset() expression in SQL
sqldf('SELECT product, revenue FROM sales \
WHERE country = "USA" \
AND product IN (1,2)')

>
product  revenue
1       1 108.4597
2       2 100.3475
``````

Perform an aggregation or GROUP BY

``````sqldf('select country, sum(revenue) revenue \
FROM sales \
GROUP BY country')

>
country  revenue
1      FR 307.1157
2      UK 280.6382
3     USA 304.6860
``````

For more sophisticated map-reduce-like functionality on data frames, check out the plyr package. And if find yourself wanting to pull your hair out, I recommend checking out Data Manipulation with R.

Realated tags