Software for Data Analysis

John Chambers

Mentioned 2

John Chambers turns his attention to R, the enormously successful open-source system based on the S language. His book guides the reader through programming with R, beginning with simple interactive use and progressing by gradual stages, starting with simple functions. More advanced programming techniques can be added as needed, allowing users to grow into software contributors, benefiting their careers and the community. R packages provide a powerful mechanism for contributions to be organized and communicated. This is the only advanced programming book on R, written by the author of the S language from which R evolved.

More on Amazon.com

Mentioned in questions and answers.

This is related to Looping over a Date object result in a numeric iterator

> dates <- as.Date(c("2013-01-01", "2013-01-02"))
> class(dates)
[1] "Date"
> for(d in dates) print(class(d))
[1] "numeric"
[1] "numeric"

I have two questions:

  1. What is the preferred way to iterate over a list of Date objects?
  2. I don't understand Joshua's answer (accepted answer from the question linked above), I'll quote it here: "So your Date vector is being coerced to numeric because Date objects aren't strictly vectors". So how is it determined that Date should be coerced to numeric?

There are two issues here. One is whether the input gets coerced from Date to numeric. The other is whether the output gets coerced to numeric.

Input

For loops coerce Date inputs to numeric, because as @DWin and @JoshuaUlrich point out, for loops take vectors, and Dates are technically not vectors.

> for(d in dates) print(class(d))
[1] "numeric"
[1] "numeric"

On the other hand, lapply and its simplifier offspring sapply have no such restrictions.

> sapply( dates, function(day) class(day) )
[1] "Date" "Date"

Output

However! The output of class() above is a character. If you try actually returning a date object, sapply is not what you want.

lapply does not coerce to a vector, but sapply does:

> lapply( dates, identity )
[[1]]
[1] "2013-01-01"

[[2]]
[1] "2013-01-02"

> sapply( dates, identity )
[1] 15706 15707

That's because sapply's simplification function coerces output to a vector.

Summary

So: If you have a Date object and want to return a non-Date object, you can use lapply or sapply. If you have a non-Date object, and want to return a Date object, you can use a for loop or lapply. If you have a Date object and want to return a Date object, use lapply.

Resources for learning more

If you want to dig deeper into vectors, you can start with John Cook's notes, continue with the R Inferno, and continue with SDA.

Context of application
I have a model with random slopes and intercepts. There are numerous levels of the random effects. The new data (to be predicted) may or may not have all of these levels.

To make this more concrete, I am working with music revenue at the album level (title). Each album may come in multiple types format2 (CD, vinyl, e-audio, etc). I have measurements for revenue for each album at each type of album. The model is specified as:

lmer(physical~ format2+ (0+format2|title))

The problem is that future data may not have all the levels of either title or format2. For random intercepts, this is easily resolved with predict(..., allow.new.levels= TRUE). But it is problematic for the fixed effects and random slopes. I am therefore trying to write a function to do flexible predictions of merMod objects, similar to lme4::predict.merMod; but that will handle the differences between the training data and the prediction data. This is a question asked as much out of ignorance to the exact details of lme4::predict.merMod as anything else.

Description of problem
The crux of the problem is getting the correct model.matrix() with fixed and random effects to calculate both predictions and SE's. The S3 method for class merMod returns only the fixed effects.

The base stats::model.matrix() function has very limited documentation. Unfortunately, I do not own either Statistical Models in S or Software for Data Analysis, which appear to have the details behind these functions.

model.matrix() is supposed to take a model formula and new data frame and produce a design matrix. But I'm getting an error. Any help you can provide would be much appreciated.

Example Data

dat1 <- structure(list(dt_scale = c(16, 16, 16, 16, 16, 16, 16, 16, 16, 
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16), title = c("Bahia", 
"Jazz Moods: Brazilian Romance", "Quintessence", "Amadeus: The Complete Soundtrack Recording (Bicentennial Edition)", 
"Live In Europe", "We'll Play The Blues For You", "The Complete Village Vanguard Recordings, 1961", 
"The Isaac Hayes Movement", "Jazz Moods: Jazz At Week's End", 
"Blue In Green: The Concert In Canada", "The English Patient - Original Motion Picture Soundtrack", 
"The Unique Thelonious Monk", "Since We Met", "You're Gonna Hear From Me", 
"The Colors Of Latin Jazz: Cubop!", "The Colors Of Latin Jazz: Samba!", 
"Homecoming", "Consecration: The Final Recordings Part 2 - Live At Keystone Korner, September 1980", "More Creedence Gold", "The Stardust Session"), format2 = c("CD", "CD", 
"CD", "CD", "CD", "CD", "CD", "SuperAudio", "SuperAudio", "CD", "E Audio", "CD", 
"Vinyl", "CD", "E Audio", "CD", "CD", "CD", "CD", "CD"), mf_day = c(TRUE, 
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, 
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), xmas = c(FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE), vday = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, 
FALSE, FALSE, FALSE, FALSE), yr_since_rel = c(16.9050969937038, 
8.41815617876864, 9.2991404674865, 25.0870296783559, 39.1267038232812, 
27.9156764326061, 9.11596751812513, 23.3052837112449, 14.3123922258974, 
30.5208152866414, 5.83025071417496, 21.3090003877291, 7.75022155568392, 
11.3601605287827, 0.849006673421519, 31.9918631305662, 13.8861905547041, 
12.8342695062012, 29.6916671402534, 13.5912612705038), physical = c(1327.17849171096, 
-110.2265302258, -795.37376268564, 355.06192702004, -1357.3492884345, 
-1254.93442612023, -816.713683621225, 881.201935773452, -3092.02845691036, 
-2268.6304275652, 907.347941142021, -699.130275178185, 377.867849132077, 
-1047.50531157311, 1460.25978951805, 1376.84579069304, 3619.03629114089, 
962.888173535704, 2514.77880599199, 2539.14958588771)), .Names = c("dt_scale", 
"title", "format2", "mf_day", "xmas", "vday", "yr_since_rel", 
"physical"), row.names = c(1L, 2L, 5L, 6L, 7L, 8L, 9L, 11L, 12L, 
13L, 14L, 15L, 20L, 22L, 23L, 25L, 27L, 32L, 35L, 36L), class = "data.frame")

formula:

f1 <- as.formula(~1 + dt_scale + yr_since_rel + format2 + (0 + format2 + mf_day + 
xmas + vday | title))

execution / error

library(lme4)
model.matrix(f1, data= dat1)
Error in 0 + format2 : non-numeric argument to binary operator

Note I've also tried this with the Orthodont data; but, I get a different error.

library(lme4)
data("Orthodont",package="MEMSS")
fm1 <- lmer(formula = distance ~ age*Sex + (1+age|Subject), data = Orthodont)
newdat <- expand.grid(
  age=c(8,10,12,14)
  , Sex=c("Male","Female")
  , distance = 0
  , Subject= c("F01", "F02")
)


f1 <- formula(fm1)[-2] # simpler code via Ben Bolker below
mm <- model.matrix(f1, newdat) # attempt to use model.matrix
Warning message
In Ops.factor(1 + age, Subject) : | not meaningful for factors

# use lme4:::mkNewReTrms as suggested in comments
mm <- lme4:::mkNewReTrms(f1, newdat) 
Error in lme4:::mkNewReTrms(f1, newdat) : object 'ReTrms' not found
In addition: Warning message:
In Ops.factor(1 + age, Subject) : | not meaningful for factors

# check if different syntax would fix this
mm <- lme4::mkNewReTrms(f1, newdat)
Error: 'mkNewReTrms' is not an exported object from 'namespace:lme4'
mm <- mkNewReTrms(f1, newdat)
Error: could not find function "mkNewReTrms"

Editted 8/12/15: see changes on Github and GitHub Repo

Editted, 10/15/2014: This answer isn't yet perfect. There are still a couple of use-cases with errors (see comment chain below). But it works in most cases. I'll get around to finalizing it at some point.

I believe this function will solve the more important problem, accurate predictions for merMod objects. Dr Bolker, there are still some issues here (such as sparsity and efficiency); but I believe the method works:

data("Orthodont",package="MEMSS")
fm1 <- lmer(formula = distance ~ age*Sex + (1+age|Subject), data = Orthodont)
newdat <- expand.grid(
  age=c(8,10,12,14)
  , Sex=c("Male","Female")
  , distance = 0
  , Subject= c("F01", "F02")
)

predict.merMod2 <- function(object, newdat=NULL) {
# 01. get formula and build model matrix
  # current problem--model matrix is not sparse, as would be ideal
  f1 <- formula(object)[-2]
  z.fe <- model.matrix(terms(object), newdat)
  z.re <- t(lme4:::mkReTrms(findbars(f1), newdat)$Zt)
  mm <- cbind(z.fe, 
              matrix(z.re, nrow= dim(z.re)[1], ncol= dim(z.re)[2],
                     dimnames= dimnames(z.re)))

  # 02. extract random effect coefficients needed for the new data
  # (a) - determine number of coef
  len <- length(ranef(object)) 
  re.grp.len <- vector(mode= "integer", length= len) 
  for (i in 1:len) { # for each random group
    re.grp.len[i] <- dim(ranef(object)[[i]])[2] # number of columns (slope and intercept terms)
  }

  # (b) - create beta vector
  fe.names <- unique(colnames(mm)[1:length(fixef(object)) - 1])
  re.names <- unique(colnames(mm)[-c(1:length(fixef(object)) - 1)]) 
  beta.re <- as.vector(rep(NA, length= sum(re.grp.len) * length(re.names)), mode= "numeric")
  for (i in 1:len) {
    re.beta  <- ranef(object)[[i]][rownames(ranef(object)[[i]]) %in% re.names,] 
    ind.i <- sum(!is.na(beta.re)) + 1; ind.j <- length(as.vector(t(re.beta))) 
    beta.re[ind.i:ind.j]  <- as.vector(t(re.beta)) 
  }
  beta <- c(fixef(object)[names(fixef(object)) %in% fe.names], beta.re)
  # 03. execute prediction
  return(mm %*% beta)
}

predict.merMod2(fm1, newdat)
Realated tags

datelistlme4rs