Applied Econometrics with R

Christian Kleiber, Achim Zeileis


R is a language and environment for data analysis and graphics. It may be considered an implementation of S, an award-winning language initially developed at Bell Laboratories since the late 1970s. The R project was initiated by Robert Gentleman and Ross Ihaka at the University of Auckland, New Zealand, in the early 1990s, and has been developed by an international team since mid-1997. Historically, econometricians have favored other computing environments, some of which have fallen by the wayside, and also a variety of packages with canned routines. We believe that R has great potential in econometrics, both for research and for teaching. There are at least three reasons for this: (1) R is mostly platform independent and runs on Microsoft Windows, the Mac family of operating systems, and various flavors of Unix/Linux, and also on some more exotic platforms. (2) R is free software that can be downloaded and installed at no cost from a family of mirror sites around the globe, the Comprehensive R Archive Network (CRAN); hence students can easily install it on their own machines. (3) R is open-source software, so that the full source code is available and can be inspected to understand what it really does, learn from it, and modify and extend it. We also like to think that platform independence and the open-source philosophy make R an ideal environment for reproducible econometric research.


Mentioned in questions and answers.

I want to fit some sort of multivariate time series model using R.

Here is a sample of my data:

   u     cci     bci     cpi     gdp    dum1 dum2 dum3    dx  
 16.50   14.00   53.00   45.70   80.63  0   0    1     6.39 
 17.45   16.00   64.00   46.30   80.90  0   0    0     6.00 
 18.40   12.00   51.00   47.30   82.40  1   0    0     6.57 
 19.35   7.00    42.00   48.40   83.38  0   1    0     5.84 
 20.30   9.00    34.00   49.50   84.38  0   0    1     6.36 
 20.72   10.00   42.00   50.60   85.17  0   0    0     5.78 
 21.14   6.00    45.00   51.90   85.60  1   0    0     5.16 
 21.56   9.00    38.00   52.60   86.14  0   1    0     5.62 
 21.98   2.00    32.00   53.50   86.23  0   0    1     4.94 
 22.78   8.00    29.00   53.80   86.24  0   0    0     6.25 

The data are quarterly; the dummy variables are for seasonality.

What I would like to do is to predict dx with reference to some of the others, while (possibly) allowing for seasonality. For argument's sake, let's say I want to use "u", "cci", and "gdp".

How would I go about doing this?
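For a concrete starting point, here is a minimal base-R sketch of the kind of regression described, using the sample rows from the question (a plain lm with the seasonal dummies as extra regressors; the dynamic and VAR models discussed below may serve better for genuine time series structure):

```r
## sample rows from the question, entered by hand
d <- data.frame(
  u    = c(16.50, 17.45, 18.40, 19.35, 20.30, 20.72, 21.14, 21.56, 21.98, 22.78),
  cci  = c(14, 16, 12, 7, 9, 10, 6, 9, 2, 8),
  gdp  = c(80.63, 80.90, 82.40, 83.38, 84.38, 85.17, 85.60, 86.14, 86.23, 86.24),
  dum1 = c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0),
  dum2 = c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0),
  dum3 = c(1, 0, 0, 0, 1, 0, 0, 0, 1, 0),
  dx   = c(6.39, 6.00, 6.57, 5.84, 6.36, 5.78, 5.16, 5.62, 4.94, 6.25)
)
## regress dx on the chosen predictors plus the seasonal dummies
## (the all-zero dummy rows serve as the base quarter)
fit <- lm(dx ~ u + cci + gdp + dum1 + dum2 + dum3, data = d)
summary(fit)
## point prediction for a hypothetical new quarter (made-up values)
predict(fit, newdata = data.frame(u = 23, cci = 8, gdp = 86.5,
                                  dum1 = 0, dum2 = 0, dum3 = 0))
```

With only ten observations and seven coefficients this is purely illustrative; the point is the formula interface, not the fit.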

If you haven't done so already, have a look at the time series view on CRAN, especially the section on multivariate time series.

In finance, one traditional way of doing this is with a factor model, frequently with either a BARRA or Fama-French type model. Eric Zivot's "Modeling Financial Time Series with S-PLUS" gives a good overview of these topics, but it isn't immediately transferable into R. Ruey Tsay's "Analysis of Financial Time Series" (with a companion FinTS package on CRAN) also has a nice discussion of factor models and principal component analysis in chapter 9.

R also has a number of packages that cover vector autoregression (VAR) models. In particular, I would recommend looking at Bernhard Pfaff's VAR Modelling (vars) package and the related vignette.

I strongly recommend looking at Ruey Tsay's homepage because it covers all these topics, and provides the necessary R code. In particular, look at the "Applied Multivariate Analysis", "Analysis of Financial Time Series", and "Multivariate Time Series Analysis" courses.

This is a very large subject and there are many good books that cover it, including both multivariate time series forecasting and seasonality. Here are a few more:

  1. Kleiber and Zeileis. "Applied Econometrics with R" doesn't address this specifically, but it covers the overall subject very well (see also the AER package on CRAN).
  2. Shumway and Stoffer. "Time Series Analysis and Its Applications: With R Examples" has examples of multivariate ARIMA models.
  3. Cryer. "Time Series Analysis: With Applications in R" is a classic on the subject, updated to include R code.

I found a site which explains exactly what I need to do for my data; however, it isn't in R. Can anyone suggest how I could create this in R?

http://people.duke.edu/~rnau/three.htm

I need to find the MSE, MAE, MAPE, ME, MPE, and SSE to test the accuracy of the forecasts, and this page is the closest I have found that explains how to do it.

data<-c(79160.56266,91759.73029,91186.47551,106353.8192,70346.46525,80279.15139,82611.60076,131392.7209,93798.99391,105944.7752,103913.1296,154530.6937,110157.4025,117416.0942,127423.4206,156751.9979,120097.8068,121307.7534,115021.1187,150657.8258,113711.5282,115353.1395,112701.9846,154319.1785,116803.545,118352.535)
forecasts<-c(118082.3,157303.8,117938.7,122329.8) # found using arima
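If the four forecasts are meant to line up with the last four observations of the series (an assumption on my part; adjust the indexing to match your actual hold-out period), the measures on that Duke page are one-liners in base R:

```r
## last four observations of the series above (assumed to be the
## hold-out values the four forecasts correspond to)
actual    <- c(112701.9846, 154319.1785, 116803.545, 118352.535)
forecasts <- c(118082.3, 157303.8, 117938.7, 122329.8)

e <- actual - forecasts              # forecast errors
ME   <- mean(e)                      # mean error
MSE  <- mean(e^2)                    # mean squared error
MAE  <- mean(abs(e))                 # mean absolute error
MPE  <- 100 * mean(e / actual)       # mean percentage error
MAPE <- 100 * mean(abs(e / actual))  # mean absolute percentage error
SSE  <- sum(e^2)                     # sum of squared errors
c(ME = ME, MSE = MSE, MAE = MAE, MPE = MPE, MAPE = MAPE, SSE = SSE)
```

Each measure is just a different aggregation of the same error vector, which is why they are so quick to compute by hand.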

(if you vote this question down, please explain specifically why)

Here are a few examples to get you started, using the data set UKNonDurables from the package AER. This package accompanies the book Applied Econometrics with R, which is a pretty good introductory applied econometrics book, especially for people without a solid background in programming.

library(forecast)
library(AER) 
##
data("UKNonDurables")
## alias for convenience
Data <- UKNonDurables
## split data into testing and training
train <- window(
  Data,
  end=c(1975,4))
test <- window(
  Data,
  start=c(1976,1))
## fit a model on training data
aaFit <- auto.arima(
  train)
## forecast the training model over
## the testing period
aaPred <- forecast(
  aaFit,
  h=length(test))
##
plot(aaPred)


## extract point forecasts
yHat <- aaPred$mean
## a few functions:
## mean squared (prediction) error
MSE <- function(y,yhat)
{
  mean((y-yhat)**2)
}
## mean absolute (prediction) error
MAE <- function(y,yhat)
{
  mean(abs(y-yhat))
}
## mean absolute percentage (prediction) error
MAPE <- function(y,yhat,percent=TRUE)
{
  if(percent){
    100*mean(abs( (y-yhat)/y ))
  } else {
    mean(abs( (y-yhat)/y ))
  }
}
##
> MSE(test,yHat)
[1] 9646434
> MAE(test,yHat)
[1] 1948.803
> MAPE(test,yHat)
[1] 3.769978

So, as I said, some or all of the above functions probably exist in base R or in external packages, but they are typically simple formulas that are trivial to implement. Try to work off these and/or adapt them to better suit your needs.

Edit: As Mr. Hyndman pointed out below, his package forecast includes the function accuracy, which provides a very convenient way of summarizing GOF measures of time series models. Using the same data from above, you can easily assess the fit of a forecast object over the training and testing periods:

> round(accuracy(aaPred,Data),3)
                   ME     RMSE      MAE   MPE  MAPE  MASE  ACF1 Theil's U
Training set    2.961  372.104  277.728 0.001 0.809 0.337 0.053        NA
Test set     1761.016 3105.871 1948.803 3.312 3.770 2.364 0.849     1.004

(where round(...,3) was used just so that the output would fit nicely in this post). Or, if you want to examine these measures for only the forecast period, you can call something like this:

> accuracy(yHat,test)
               ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
Test set 1761.016 3105.871 1948.803 3.312358 3.769978 0.8485389  1.004442

I learned to fit a linear model to some points using lm in my R script. So I did that (which worked nicely) and printed out the fit:

Call:
lm(formula = y2 ~ x2)

Residuals:
         1          2          3          4 
 5.000e+00 -1.000e+01  5.000e+00  7.327e-15 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   70.000     17.958   3.898  0.05996 . 
x2            85.000      3.873  21.947  0.00207 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 8.66 on 2 degrees of freedom
Multiple R-squared: 0.9959, Adjusted R-squared: 0.9938 
F-statistic: 481.7 on 1 and 2 DF,  p-value: 0.00207 

I'm trying to determine the best way to judge how good this fit is. I need to compare this fit with a few others (also linear, using the lm() function). What value from this summary would be the best way to judge how good this fit is? I was thinking of using the residual standard error. Any suggestions? Also, how do I extract that value from the fit variable?

There are some nice regression diagnostic plots you can look at with

plot(YourRegression, which=1:6)

where which=1:6 gives you all six plots. The RESET test and bptest will test for misspecification and heteroskedasticity:

resettest(...)
bptest(...)
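Both of these come from the lmtest package (loaded automatically with AER). If you want to see what bptest is doing under the hood, its default (studentized) Breusch-Pagan statistic can be computed by hand in base R; this sketch uses made-up heteroskedastic data:

```r
set.seed(42)
x <- rnorm(100)
## error variance grows with |x|, so heteroskedasticity is present
y <- 1 + 2 * x + rnorm(100, sd = 1 + 0.5 * abs(x))
fit <- lm(y ~ x)

## regress the squared residuals on the regressors;
## LM statistic = n * R^2, chi-squared with df = number of regressors
u2   <- residuals(fit)^2
aux  <- lm(u2 ~ x)
stat <- length(u2) * summary(aux)$r.squared
pval <- pchisq(stat, df = 1, lower.tail = FALSE)
c(statistic = stat, p.value = pval)
```

A small p-value is evidence against constant error variance; in practice you would just call bptest(fit) and get the same studentized statistic.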

There are a lot of resources out there to think about this sort of thing. Fitting Distributions in R is one of them, and Faraway's "Practical Regression and Anova" is an R classic. I basically learned econometrics in R from Farnsworth's paper/book, although I don't recall if he has anything about goodness of fit.

If you are going to do a lot of econometrics in R, Applied Econometrics with R is a great pay-for book. And I've used the R for Economists webpage a lot.

Those are the first ones that pop to mind. I will mull a little more.
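On the extraction part of the question above: the printed numbers live inside the summary object. Reconstructing the asker's data from the printed output (intercept 70, slope 85, residuals 5, -10, 5, 0 at x2 = 1..4), a base-R sketch:

```r
## data reconstructed from the printed coefficients and residuals
x2 <- c(1, 2, 3, 4)
y2 <- c(160, 230, 330, 410)
fit <- lm(y2 ~ x2)

s <- summary(fit)
s$sigma          # residual standard error: 8.660254, as printed
s$r.squared      # multiple R-squared: 0.9959
s$adj.r.squared  # adjusted R-squared: 0.9938
## for comparing several models of the same response,
## information criteria are often more useful than sigma alone:
AIC(fit)
BIC(fit)
```

str(summary(fit)) lists everything else that can be pulled out the same way.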