Applied Predictive Modeling

Max Kuhn, Kjell Johnson


This text is intended for a broad audience, both as an introduction to predictive models and as a guide to applying them. Non-mathematical readers will appreciate the intuitive explanations of the techniques, while an emphasis on problem solving with real data across a wide variety of applications will aid practitioners who wish to extend their expertise. Readers should have knowledge of basic statistical ideas, such as correlation and linear regression analysis. While the text is biased against complex equations, a mathematical background is needed for advanced topics.

Dr. Kuhn is a Director of Non-Clinical Statistics at Pfizer Global R&D in Groton, Connecticut. He has been applying predictive models in the pharmaceutical and diagnostic industries for over 15 years and is the author of a number of R packages. Dr. Johnson has more than a decade of statistical consulting and predictive modeling experience in pharmaceutical research and development. He is a co-founder of Arbor Analytics, a firm specializing in predictive modeling, and is a former Director of Statistics at Pfizer Global R&D. His scholarly work centers on the application and development of statistical methodology and learning algorithms.

Applied Predictive Modeling covers the overall predictive modeling process, beginning with the crucial steps of data preprocessing, data splitting, and foundations of model tuning. The text then provides intuitive explanations of numerous common and modern regression and classification techniques, always with an emphasis on illustrating and solving real data problems. The treatment of practical concerns extends beyond model fitting to topics such as handling class imbalance, selecting predictors, and pinpointing causes of poor model performance—all problems that occur frequently in practice. The text illustrates all parts of the modeling process through many hands-on, real-life examples, and every chapter contains extensive R code for each step of the process.
The data sets and corresponding code are available in the book’s companion AppliedPredictiveModeling R package, which is freely available on CRAN. This multi-purpose text can be used as an introduction to predictive models and the overall modeling process, as a practitioner’s reference handbook, or as a text for advanced undergraduate or graduate level predictive modeling courses. To that end, each chapter contains problem sets to help solidify the covered concepts, using data available in the book’s R package. Readers and students interested in implementing the methods should have some basic knowledge of R, and a handful of the more advanced topics require some mathematical knowledge.


Mentioned in questions and answers.

I have a matrix as below:

Real_Values Predicted_Values
5.5         5.67
6.9         7.01
9.8         9.2
6.5         6.1
10          9.7
1.5         1.0
7.7         7.01

I wish to compute the error rate of my model between the predicted and real values and ideally do a plot. I was wondering if R already has a package that neatly does this, so that I will avoid any for loops?

You can calculate regression error metrics such as the root mean squared error (RMSE) or the sum of squared errors (SSE) by hand, as pointed out by @nathan-day. That said, most implementations compute these metrics for you automatically, so you usually don't need to.
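For the data in your question, the by-hand calculation needs no loops at all, since R's arithmetic is vectorized. A minimal sketch (the values are copied from your matrix):

```r
# Values from the question
real      <- c(5.5, 6.9, 9.8, 6.5, 10, 1.5, 7.7)
predicted <- c(5.67, 7.01, 9.2, 6.1, 9.7, 1.0, 7.01)

sse  <- sum((predicted - real)^2)         # sum of squared errors
rmse <- sqrt(mean((predicted - real)^2))  # root mean squared error

sse   # 1.3771
rmse  # about 0.444
```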

For the purpose of plotting I'll use a slightly bigger example now, with more samples, as it will be easier to understand (the iris dataset shipped with R). First we train a linear model with caret to predict the 4th feature from the first 3 features, which already computes some metrics:

> library(caret) # provides train() and trainControl()
> model <- train(iris[,1:3], iris[,4], method = 'lm', metric = 'RMSE', trControl = trainControl(method = 'repeatedcv', number = 10, repeats = 10))
> print(model)
Linear Regression 

150 samples
3 predictors

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 

Summary of sample sizes: 134, 135, 135, 136, 134, 135, ... 

Resampling results

RMSE  Rsquared  RMSE SD  Rsquared SD
0.19  0.942     0.0399   0.0253   

The RMSE, SSE, etc. could now be calculated from the predicted and actual values of the target variable by hand too:

predicted <- predict(model, iris[,1:3]) # perform the prediction 
actual <- iris[,4]
sqrt(mean((predicted-actual)**2)) # RMSE
sum((predicted-actual)**2) # SSE

The slight differences from the results of the model training above result from the use of repeated cross-validation (hence the metrics are listed under "Resampling results" there).

For the plotting part: regression error can easily be visualized by plotting the predicted against the actual target variable, and/or by plotting the error against the actual value. The perfect fit is represented by the additional line in those plots. This too can easily be achieved with standard tools:
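A minimal sketch of both plots with base R graphics. To keep the snippet self-contained it fits a plain lm() rather than reusing the caret model from above; the variable names mirror the earlier snippet:

```r
# Self-contained: fit a plain linear model on the same features as above
model     <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
                data = iris)
predicted <- predict(model, iris)
actual    <- iris$Petal.Width

# Plot 1: predicted vs. actual target variable
plot(actual, predicted, xlab = "Actual", ylab = "Predicted",
     main = "Predicted vs. actual")
abline(0, 1, col = "red")   # the perfect fit: predicted == actual

# Plot 2: error vs. actual target variable
plot(actual, predicted - actual, xlab = "Actual",
     ylab = "Error (predicted - actual)", main = "Error vs. actual")
abline(h = 0, col = "red")  # the perfect fit: zero error
```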



PS: if you are not familiar with regression/classification error measures and robust ML procedures, I would strongly recommend spending some time reading up on those topics - it will likely save you lots of time later. I personally would recommend Applied Predictive Modeling by Max Kuhn (maintainer of the caret package in R) and Kjell Johnson, as it's easy to read and very practical.

I want to tune the parameter C in ksvm. Now I'm wondering how this C is defined. The definition of C is

cost of constraints violation (default: 1) this is the `C'-constant of the regularization term in the Lagrange formulation.

Does this mean that the larger C is, the more misclassifications are allowed?

The cost parameter penalizes margin violations, i.e., training points that are misclassified or fall inside the margin. So a larger cost will result in a more flexible model with fewer training misclassifications. In effect the cost parameter allows you to adjust the bias/variance trade-off: the greater the cost parameter, the more variance in the model and the less bias.

So the answer to your question is no. The greater the cost, the fewer misclassifications are allowed on the training data.

This is explained in Chapter 7 Section 3 of the book Applied Predictive Modeling by Kuhn.

Note how this is the opposite of regularization, which penalizes large coefficients and results in higher bias and lower variance. Here we penalize the residuals, resulting in higher variance and lower bias.
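You can see this direction of the effect empirically with kernlab's ksvm. A small sketch on a two-class subset of iris (assumes the kernlab package is installed; the comparison of training errors is purely illustrative, not a recommendation for either C value):

```r
library(kernlab)
set.seed(1)  # ksvm's automatic sigma estimation involves sampling

# Two-class subset of iris
d <- droplevels(iris[iris$Species != "setosa", ])

# Small C: violations are cheap, so more misclassifications are tolerated
fit_small <- ksvm(Species ~ ., data = d, C = 0.01)
# Large C: violations are expensive, so the model bends to the training data
fit_large <- ksvm(Species ~ ., data = d, C = 100)

err <- function(fit) mean(predict(fit, d) != d$Species)
err(fit_small)  # training error with small C
err(fit_large)  # training error with large C (typically lower)
```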