Hadley Wickham

Mentioned 5

Provides both rich theory and powerful applications. Figures are accompanied by the code required to produce them. Full color figures.


Mentioned in questions and answers.

I'm a programmer with a decent background in math and computer science. I've studied computability, graph theory, linear algebra, abstract algebra, algorithms, and a little probability and statistics (through a few CS classes) at an undergraduate level.

I feel, however, that I don't know enough about statistics. Statistics are increasingly useful in computing, with statistical natural language processing helping fuel some of Google's algorithms for search and machine translation, with performance analysis of hardware, software, and networks needing proper statistical grounding to be at all believable, and with fields like bioinformatics becoming more prevalent every day.

I've read about how "Google uses Bayesian filtering the way Microsoft uses the if statement", and I know the power of even fairly naïve, simple statistical approaches to problems from Paul Graham's A Plan for Spam and Better Bayesian Filtering, but I'd like to go beyond that.

I've tried to look into learning more statistics, but I've gotten a bit lost. The Wikipedia article has a long list of related topics, but I'm not sure which I should look into. I feel like from what I've seen, a lot of statistics makes the assumption that everything is a combination of factors that linearly combine, plus some random noise in a Gaussian distribution; I'm wondering what I should learn beyond linear regression, or if I should spend the time to really understand that before I move on to other techniques. I've found a few long lists of books to look at; where should I start?

So I'm wondering where to go from here; what to learn, and where to learn it. In particular, I'd like to know:

  1. What kind of problems in programming, software engineering, and computer science are statistical methods well suited for? Where am I going to get the biggest payoffs?
  2. What kind of statistical methods should I spend my time learning?
  3. What resources should I use to learn this? Books, papers, web sites. I'd appreciate a discussion of what each book (or other resource) is about, and why it's relevant.

To clarify what I am looking for, I am interested in what problems that programmers typically need to deal with can benefit from a statistical approach, and what kind of statistical tools can be useful. For instance:

  • Programmers frequently need to deal with large databases of text in natural languages, and help to categorize, classify, search, and otherwise process it. What statistical techniques are useful here?
  • More generally, artificial intelligence has been moving away from discrete, symbolic approaches and towards statistical techniques. What statistical AI approaches have the most to offer now, to the working programmer (as opposed to ongoing research that may or may not provide concrete results)?
  • Programmers are frequently asked to produce high-performance systems, that scale well under load. But you can't really talk about performance unless you can measure it. What kind of experimental design and statistical tools do you need to use to be able to say with confidence that the results are meaningful?
  • Simulation of physical systems, such as in computer graphics, frequently involves a stochastic approach.
  • Are there other problems commonly encountered by programmers that would benefit from a statistical approach?

I'm surprised no one has mentioned a keen understanding of graphics as essential to good statistical practice. Machine learning and Bayesian analysis are great (try Gelman's book if you want a formal but approachable and applied introduction to Bayes), but you can get amazingly far at understanding a problem with really good visualizations. Tufte's classic is a good place to start, and the classic semiology and grammar of graphics books are worth a read. Finally, take a look at the R ggplot2 package for a simple way to begin implementing complex graphical ideas.

Interesting question. As a statistician whose interest is more and more aligned with computer science perhaps I could provide a few thoughts...

  1. Don't learn frequentist hypothesis testing. While the bulk of my work is done in this paradigm, it doesn't match the needs of business or data mining. Scientists generally have specific hypotheses in mind, and might wish to gauge the probability that, given their hypothesis isn't true, the data would be as extreme as it is. This is rarely the type of answer a computer scientist wants.

  2. Bayesian is useful, even if you don't know why you are assuming the priors that you are using. A Bayesian analysis can give you a precise probability estimate for various contingencies, but it is important to realize that the only reason you have this precise estimate is that you made a fuzzy decision regarding the prior probability. (For those not in the know, with Bayesian inference you can specify an arbitrary prior probability, and update it based on the data collected to get a better estimate.)

Machine learning and classification might be a good place to get started. The machine learning literature is more focused on computer science problems, though its mission is almost identical to that of statistics.

Since you spoke of large databases with large numbers of variables, here are a few algorithms that come in handy in this domain.

  • AdaBoost: If you have a large number of crappy classifiers, and want to make one good classifier. (See also LogitBoost.)
  • Support Vector Machines: A powerful and flexible classifier. Can learn non-linear patterns (okay linear in the non-linear kernel space if you want to be picky about it).
  • k-nearest neighbor: A simple but powerful algorithm. It does not scale well, but there are approximate nearest neighbor alternatives that are not quite so pathological.
  • CART: This algorithm partitions the data based on a number of predictor variables. It is particularly good if there are variable interactions, or there exists a very good predictor that only works on a subset of the data.
  • Least angle regression: if the value that you are trying to predict is continuous and you have a lot of data and a lot of predictors.

This is by no means complete, but should give you a good jumping off point. A very good and accessible book on the subject is Duda, Hart, Stork: Pattern Classification
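To make the k-nearest neighbor entry from the list above a little more concrete, here is a minimal sketch in base R. The helper name `knn_predict`, the choice of Euclidean distance, and the use of the built-in `iris` data are assumptions made for illustration only; they do not come from the answer.

```r
# Minimal k-nearest-neighbor classifier (illustrative sketch, base R only).
# knn_predict is a hypothetical helper name, not a library function.
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  train_x <- as.matrix(train_x)
  new_x <- as.numeric(unlist(new_x))
  # Euclidean distance from the new point to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training points
  nearest <- train_y[order(d)[seq_len(k)]]
  # Majority vote (ties go to the first label in table order)
  names(which.max(table(nearest)))
}

# Hold out one flower from the built-in iris data and classify it
held_out <- 1
pred <- knn_predict(iris[-held_out, 1:4], iris$Species[-held_out],
                    iris[held_out, 1:4], k = 5)
print(pred)
```

Every prediction scans the whole training set, which is the scaling problem the answer mentions; approximate nearest neighbor methods exist to avoid that full scan.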

Also, a big part of statistics is descriptive visualizations and analysis. These are of particular interest to the programmer because they allow him/her to convey information back to the user. In R, ggplot2 is my package of choice for creating visualizations. On the descriptive analysis side (and useful in text analysis) is multi-dimensional scaling, which can give a spatial interpretation of non-spatial data (for example the ideologies of senators).
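As a small illustration of multi-dimensional scaling, the sketch below uses `cmdscale` from base R's stats package on the built-in `eurodist` road-distance data. The dataset is an assumption chosen for convenience; the same call applies to any distance matrix, such as distances between documents or voting records.

```r
# Classical multidimensional scaling with base R (stats::cmdscale).
# eurodist is a built-in dataset of road distances between European cities.
coords <- cmdscale(eurodist, k = 2)   # one row per city, two derived axes
print(head(coords))

# Plot the recovered 2-D layout and label each point with its city name
plot(coords, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(coords, labels = rownames(coords), cex = 0.7)
```

The input contains only pairwise distances, yet the output places each city at a point in the plane, which is the "spatial interpretation" referred to above.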

Boy, some of these answers are good. I came from much the same background and have had to get into biostatistics largely by books and by osmosis from colleagues. Here are my recommendations:

  • Start with a solid grounding in probability, including conditional probability, Bayes' theorem, Markov models, and some of the basic statistical distributions.

  • If you don't have it, get some linear algebra, so you don't get scared off by matrices. If you are faced with tricky algebra and calculus, knuckle down and work through it. It's worth it.

  • Statistics theory falls into two camps, frequentist and Bayesian. Frequentist is older and solid. Bayesian is newer, more flexible, and more exciting. In particular, there are the exciting things that can be done with Markov Chain Monte Carlo and related techniques.

In my area, pharmacometrics, there is high payoff in being able to extract meaningful results from sparse and expensive data, so an ability in statistics is very important.

Added: Here are some favorite books (not a complete list):

More probability than statistics, but Bayesian probability can be very useful (it underpins spam filters) and IMO more software should use it to infer a user's habits.

Head First Statistics is an excellent book to learn statistics (a mathematician/statistician informs me that it has not so much a few errors as a few simplifications of the theoretical stuff).

I almost forgot to mention: How to Lie with Statistics

Just as a point, not as a criticism, but your question should be formulated in a different way: "what statistics should any person know?".

Fact is, unfortunately we all deal with statistics. It's a fact of life. Polls, weather forecasts, drug effectiveness, insurance, and of course some parts of computer science. Being able to critically analyze the presented data draws the line between reaching the right understanding and being scammed, whatever that means.

That said, I think the following points are important to understand:

  • mean, median, standard deviation of a sample, and the difference between sample and population (this is very important)
  • the distributions, and why the gaussian distribution is so important (the central limit theorem)
  • What is meant by null hypothesis testing.
  • What variable transformation, correlation, regression, and multivariate analysis are.
  • What Bayesian statistics is.
  • Plotting methods.

All these points are critical not only to you as a computer scientist, but also as a human being. I will give you some examples.

  • The evaluation of the null hypothesis is critical for testing the effectiveness of a method: for example, whether a drug works, or whether a fix to your hardware had a concrete result or the difference is just a matter of chance. Say you want to improve the speed of a machine, and you change the hard drive. Does this change matter? You could sample the performance with the old and the new hard disk and check for differences. Even if you find that the average with the new disk is lower, that does not by itself mean the hard disk has any effect at all. Here null hypothesis testing enters. It will not give you a definitive answer, but a p-value: the probability of seeing a difference at least as large as the one you measured if changing the hard drive actually had no effect. A small p-value (say, below 0.05) is evidence that the change has a real effect on the performance of your machine.

  • Correlation is important to find out if two entities "change alike". As the internet mantra "correlation is not causation" teaches, it should be taken with care. The fact that two random variables show correlation does not mean that one causes the other, nor that they are related by a third variable (which you are not measuring). They could just behave in the same way. Look for pirates and global warming to understand the point. A correlation reports a possible signal, it does not report a finding.

  • Bayesian. We all know the spam filter, but there's more. Suppose you go to a medical checkup and the result tells you that you have cancer (I seriously hope not, but it's to illustrate a point). Fact is: most people at this point would think "I have cancer". That's not necessarily true. A positive test for cancer moves your probability of having cancer from the baseline for the population (say, 8 per thousand people have cancer, a number picked out of thin air) to a higher value, which is not 100 %. How high this number is depends on the accuracy of the test. If the test is lousy, you could just be a false positive. The more accurate the method, the larger the shift, but still not 100 %. Of course, if multiple independent tests all confirm that you have cancer, then it's very probable you actually have it, but still it's not 100 %. Maybe it's 99.999 %. This is a point many people don't understand about Bayesian statistics.

  • Plotting methods. That's another thing that is always left unattended. Analysis of data does not mean anything if you cannot convey effectively what the data mean via a simple plot. Depending on what information you want to put into focus, or the kind of data you have, you will prefer an xy plot, a histogram, a violin plot, or a pie chart.
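The medical test example in the Bayesian bullet above can be worked through with Bayes' theorem. In this sketch the prior of 8 per thousand is the answer's own made-up baseline; the sensitivity (0.90) and false positive rate (0.05) are further assumptions added here purely for illustration, and `posterior` is a hypothetical helper name.

```r
# Probability of disease given a positive test, via Bayes' theorem.
# posterior() is a hypothetical helper; all rates below are illustrative assumptions.
posterior <- function(prior, sensitivity, false_positive_rate) {
  true_pos  <- sensitivity * prior                  # P(positive and sick)
  false_pos <- false_positive_rate * (1 - prior)    # P(positive and healthy)
  true_pos / (true_pos + false_pos)
}

prior <- 8 / 1000                                   # baseline rate from the text
after_one <- posterior(prior, 0.90, 0.05)           # one positive test
after_two <- posterior(after_one, 0.90, 0.05)       # a second, independent positive test
print(c(after_one = after_one, after_two = after_two))
```

Under these assumed rates a single positive result leaves the probability at roughly 13 %, far from certainty, and each further independent positive test pushes it higher without ever reaching 100 %.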

Now, let's go to your questions. I think I overindulged in just a quick note, but since my answer was voted up quite a lot, I feel it's better if I answer properly to your questions as much as my knowledge allows (and here is vacation, so I can indulge as much as I want over it)

What kind of problems in programming, software engineering, and computer science are statistical methods well suited for? Where am I going to get the biggest payoffs?

Normally, everything that has to do with data comparison which involves numerical (or reduced to numerical) input from unreliable sources. A signal from an instrument, a bunch of pages and the number of words they contain. When you get these data, and have to find a distilled answer out of the bunch, then you need statistics. Think, for example, of the algorithm that performs click detection on the iPhone. You are using a trembling, fat stylus to refer to an icon which is much smaller than the stylus itself. Clearly, the hardware (capacitive screen) will send you a bunch of data about the finger, plus a bunch of data about random noise (air? don't know how it works). The driver must make sense out of this mess and give you an x,y coordinate on the screen. That needs (a lot of) statistics.

What kind of statistical methods should I spend my time learning?

The ones I listed are more than enough, partly because to understand them you have to work through other material along the way.

What resources should I use to learn this? Books, papers, web sites. I'd appreciate a discussion of what each book (or other resource) is about, and why it's relevant.

I learned statistics mostly from standard university courses. My first book was the "train wreck book", and it's very good. I also tried this one, which focuses on R but it did not satisfy me particularly. You have to know things and R to get through it.

Programmers frequently need to deal with large databases of text in natural languages, and help to categorize, classify, search, and otherwise process it. What statistical techniques are useful here?

That depends on the question you need to answer using your dataset.

Programmers are frequently asked to produce high-performance systems, that scale well under load. But you can't really talk about performance unless you can measure it. What kind of experimental design and statistical tools do you need to use to be able to say with confidence that the results are meaningful?

There are a lot of issues with measuring. Measuring is a fine and delicate art. Proper measuring is almost beyond human. The fact is that sampling introduces bias, either from the sampler, or from the method, or from the nature of the sample, or from the nature of nature. A good sampler knows these things and tries to reduce unwanted bias as much as possible, leaving only random error.

The examples from the blog you posted are relevant. Say you have a startup time for a database. If you take performance measures within that time, all your measures will be biased. There's no statistical method that can tell you this. Only your knowledge of the system can.
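A minimal sketch of that workflow in base R: discard the samples taken during the warm-up period using your knowledge of the system, then compare what remains. The timings are simulated, and the warm-up length, means, and spreads are all assumptions invented for illustration.

```r
set.seed(42)  # simulated data, seeded only for reproducibility

# Hypothetical latencies in ms; the first 20 samples of each run are inflated by start-up
warmup   <- 20
old_disk <- c(rnorm(warmup, mean = 30, sd = 3), rnorm(200, mean = 12, sd = 2))
new_disk <- c(rnorm(warmup, mean = 30, sd = 3), rnorm(200, mean = 11, sd = 2))

# Drop the warm-up samples: no test can detect this bias for you
old_steady <- old_disk[-seq_len(warmup)]
new_steady <- new_disk[-seq_len(warmup)]

# Welch two-sample t-test on the steady-state measurements
result <- t.test(old_steady, new_steady)
print(result$p.value)    # chance of a gap this large if the disks performed the same
print(result$conf.int)   # 95 % confidence interval for the difference in means
```

The statistical test only answers the question it is given; deciding which samples are valid measurements of steady-state behavior remains a judgment about the system itself.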

Are there other problems commonly encountered by programmers that would benefit from a statistical approach?

Every time you have an ensemble of data producers, you have statistics, so scientific computing and data analysis is obviously one place. Folksonomy and social networking are pretty much all statistics. Even Stack Overflow is, in some sense, statistical. The fact that an answer is highly voted does not mean that it's the right one. It means that there's a high probability that it is right, according to the evaluation of a statistical ensemble of independent evaluators. How these evaluators behave makes the difference between Stack Overflow, reddit and digg.

What a great thread. There's plenty of good information in the question itself and in the answers, but I am really surprised nobody has mentioned the book Programming Collective Intelligence yet.

It's the best book I know if you are a novice in this subject (like me) and want to put machine learning and statistics theory into practice.

This book explains:

  • Collaborative filtering techniques that enable online retailers to recommend products or media
  • Methods of clustering to detect groups of similar items in a large dataset
  • Search engine features--crawlers, indexers, query engines, and the PageRank algorithm
  • Optimization algorithms that search millions of possible solutions to a problem and choose the best one
  • Bayesian filtering, used in spam filters for classifying documents based on word types and other features
  • Using decision trees not only to make predictions, but to model the way decisions are made
  • Predicting numerical values rather than classifications to build price models
  • Support vector machines to match people in online dating sites
  • Non-negative matrix factorization to find the independent features in a dataset
  • Evolving intelligence for problem solving--how a computer develops its skill by improving its own code the more it plays a game

Apart from that, there's a great talk on TED on why everybody should learn Statistics.

I'm wondering how I can manipulate the size of strip text in faceted plots. My question is similar to a question on plot titles, but I'm specifically concerned with manipulating not the plot title but the text that appears in facet titles (strip_h).

As an example, consider the mpg dataset.

    qplot(hwy, cty, data = mpg) + facet_grid( . ~ manufacturer)

The resulting output produces some facet titles that don't fit in the strip.

I'm thinking there must be a way to use grid to deal with the strip text. But I'm still a novice and wasn't sure from the grid appendix in Hadley's book how, precisely, to do it. Also, I was afraid if I did it wrong it would break my washing machine, since I believe all technology is connected through The Force :-(

Many thanks in advance.

You can modify strip.text.x (or strip.text.y) using theme_text(), for instance

    qplot(hwy, cty, data = mpg) +
      facet_grid(. ~ manufacturer) +
      opts(strip.text.x = theme_text(size = 8, colour = "red", angle = 90))

Update: for ggplot2 version > 0.9.1

    qplot(hwy, cty, data = mpg) +
      facet_grid(. ~ manufacturer) +
      theme(strip.text.x = element_text(size = 8, colour = "red", angle = 90))

This is just an extension of an old question, "ggplot2 polar plot arrows" (compass plot from MATLAB).

You will find that the x axis lies outside the outermost circle. In ggplot2, I use "panel.grid.major = theme_line(colour = "black", size = 0.2, linetype=2)" to get the dashed circle. So my question is how to place the axis labels (180, 135, 90, ...) outside the circle, because the text currently merges with the circular lines.

I tried to use "hjust" or "vjust" to adjust the distance between the text and the axis, but it does not work. Do you have any ideas about this problem? Thanks in advance!

You have not provided code to reproduce the problem so this will be just a guess.

I've used whitespace, \n in particular, to move text "away" in the past. Perhaps a custom formatter might work here. Here is how you can write a custom tick mark label formatter.

If this fails, you can always hide the axis labels and paint them yourself using geom_text by adding another layer.
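As a rough sketch of the newline idea, a label formatter can be an ordinary function that takes the break values and returns strings. The name `pad_label`, the toy data frame, and the break positions below are all assumptions; whether a trailing newline moves a given label in the desired direction depends on where it sits around the circle, so treat this as a starting point rather than a fix.

```r
# Hypothetical formatter: pads each tick label with a trailing newline
pad_label <- function(breaks) paste0(breaks, "\n")
print(pad_label(c(0, 45, 90)))

# Assumed usage with ggplot2, if it is installed; the data here are made up
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  df <- data.frame(angle = seq(0, 315, by = 45), r = 1)
  p <- ggplot(df, aes(angle, r)) +
    geom_point() +
    coord_polar() +
    scale_x_continuous(breaks = seq(0, 315, by = 45),
                       labels = pad_label, limits = c(0, 360))
  print(p)
}
```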

Hope this helps. @hadley's book on ggplot2 is very good, by the way.

A perhaps simple question: I tried to make an error-bar graph like the one shown on page 532 of Field's "Discovering Statistics Using R".

The code can be found here:

    line <- ggplot(gogglesData, aes(alcohol, attractiveness, colour = gender))
    line + stat_summary(fun.y = mean, geom = "point") +
      stat_summary(fun.y = mean, geom = "line", aes(group = gender)) +
      stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
      labs(x = "Alcohol Consumption", y = "Mean Attractiveness of Date (%)", colour = "Gender")

I produced the same graph; my y-axis variable has only 4-points (it is a discrete scale, 1-4), now the y-axis has the points 1.5, 2, 2.5 in which the lines vary.

And the question is: what do these points and graphs describe? I assume that the important part is stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2). Are they counts of observations for that group and that level (x-axis)? Are they frequencies? Or are they proportions?

I found this but it did not help me

Thank you

Here is what the ggplot2 book on page 83 says about mean_cl_boot()

Function         Hmisc original    Middle   Range
mean_cl_boot()   smean.cl.boot()   Mean     Standard error from bootstrap

I think that it is the smean.cl.boot() function from the Hmisc package, renamed in ggplot2.

And here is the definition of the original function from the Hmisc package: smean.cl.boot is a very fast implementation of the basic nonparametric bootstrap for obtaining confidence limits for the population mean without assuming normality.
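To make the bootstrap idea concrete, here is a minimal base R sketch of what a summary like mean_cl_boot() reports for one group: the sample mean as the point, and a bootstrap confidence interval as the error bar. The ratings are made up, and the percentile approach with 1000 resamples is written in the same spirit as the Hmisc function rather than being a copy of its implementation.

```r
set.seed(1)  # made-up data, seeded only for reproducibility

# Hypothetical ratings on a discrete 1-4 scale for one group at one x-axis level
ratings <- sample(1:4, size = 30, replace = TRUE)

# Resample the data with replacement many times, taking the mean each time
boot_means <- replicate(1000, mean(sample(ratings, replace = TRUE)))

# Point = sample mean; error bar = central 95 % of the bootstrap means
s <- c(y    = mean(ratings),
       ymin = unname(quantile(boot_means, 0.025)),
       ymax = unname(quantile(boot_means, 0.975)))
print(s)
```

So the plotted values are neither counts nor proportions: each point is a group mean, which is why it can fall between the discrete scale values, and each bar shows the uncertainty of that mean.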

I am trying to replicate figure 6.11 from Hadley Wickham's ggplot2 book, which plots R colors in Luv space; the colors of points represent themselves, and no legend is necessary.

Here are two attempts:

    library(colorspace)  # provides LUV() and hex()
    library(ggplot2)
    myColors <- data.frame("L"=runif(10000, 0,100),"a"=runif(10000, -100, 100),"b"=runif(10000, -100, 100))
    myColors <- within(myColors, Luv <- hex(LUV(L, a, b)))
    myColors <- na.omit(myColors)
    g <- ggplot(myColors, aes(a, b, color=Luv), size=2)
    g + geom_point() + ggtitle("mycolors")


Second attempt:

    other <- data.frame("L"=runif(10000),"a"=runif(10000),"b"=runif(10000))
    other <- within(other, Luv <- hex(LUV(L, a, b)))
    other <- na.omit(other)
    g <- ggplot(other, aes(a, b, color=Luv), size=2)
    g + geom_point() + ggtitle("other")


There are a couple of obvious problems:

  1. These graphs don't look anything like the figure. Any suggestions on the code needed?
  2. The first attempt generates a lot of NA fields in the Luv column (only ~3100 named colors out of 10,000 runs, versus ~9950 in the second run). If L is supposed to be between 0-100 and u and v between -100 and 100, why do I have so many NAs in the first run? I have tried rounding, it doesn't help.
  3. Why do I have a legend?

Many thanks.

You're getting strange colors because aes(color = Luv) says "assign a color to each unique string in column Luv". If you assign color outside of aes, as below, it means "use these explicit colors". I think something like this should be close to the figure you presented.

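    library(colorspace)
    library(ggplot2)
    # Note: colorspace documents sRGB() inputs on a 0-1 scale, while col2rgb()
    # returns 0-255; dividing by 255 may be needed if the result looks off.
    x <- sRGB(t(col2rgb(colors())))
    storage.mode(x@coords) <- "numeric" # as(..., "LUV") doesn't like integers for some reason
    y <- as(x, "LUV")
    DF <- as.data.frame(y@coords)       # columns L, U, V
    DF$col <- colors()
    ggplot(DF, aes(x = U, y = V)) + geom_point(colour = DF$col)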
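An alternative that keeps the colour inside aes() is an identity scale, which tells ggplot2 to treat the strings as literal colours instead of categories. The sketch below uses a tiny made-up data frame; `scale_colour_identity()` is a ggplot2 function, and the guard only skips the plot if ggplot2 is not installed.

```r
# Small made-up data frame whose 'col' column already holds colour names
df <- data.frame(x = 1:3, y = c(2, 1, 3),
                 col = c("red", "steelblue", "darkgreen"))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x, y, colour = col)) +
    geom_point(size = 4) +
    scale_colour_identity()   # use the strings as literal colours; no legend by default
  print(p)
}
```

This also addresses the legend question: a legend appears whenever a variable is mapped through a regular colour scale, and the identity scale draws none unless asked to.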