Data Mining

Ian H. Witten, Eibe Frank, Mark Andrew Hall

Mentioned 4

Provides information on the tools and techniques of data mining, covering such topics as data transformation, ensemble learning, and datasets, and presents instructions on the Weka machine learning software.

Mentioned in questions and answers.

I'm looking for a way to learn to be comfortable with large data sets. I'm a university student, so everything I do is of "nice" size and complexity. I'm working on a research project with a professor this semester, and I've had to visualize relationships within a somewhat large (in my experience) data set: a 15 MB CSV file.

I wrote most of my data wrangling in Python, visualized using GNUPlot.

Are there any accessible books or websites on the subject out there? Bonus points for using Python, more bonus points for a more "basic" visualization system than relying on gnuplot. Cairo or something, I suppose.

Looking for something that takes me from data mining, to processing, to visualization.

EDIT: I'm more looking for something that will teach me the "big ideas". I can write the code myself, but looking for techniques people use to deal with large data sets. I mean, my 15 MB is small enough where I can put everything I would ever need into memory and just start crunching. What do people do to visualize 5 GB data sets?

I'd say the most basic skill is a good grounding in math and statistics. This can help you assess and pick from the variety of techniques for filtering data, and reducing its volume and dimensionality while keeping its integrity. The last thing you'd want to do is make something pretty that shows patterns or relationships which aren't really there.
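As one concrete illustration of reducing volume while keeping integrity: plotting a uniform random sample of the rows, rather than the first N rows or every row, keeps the overall distribution intact. Below is a minimal sketch using reservoir sampling, which is my own choice of technique for illustration and only one of many options. It assumes the data arrives as a CSV stream too large to load at once; the in-memory "file" and the column names x and y are made up.

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Hypothetical stand-in for a huge file; with a real one, use open(path, newline="").
fake_csv = io.StringIO("x,y\n" + "\n".join(f"{i},{i * i}" for i in range(10000)))
reader = csv.DictReader(fake_csv)

# Only k rows are ever held in memory, no matter how long the stream is.
subset = reservoir_sample(reader, k=100)
print(len(subset), "rows kept for plotting")
```

The sample can then be handed to whatever plotting tool you like. For a 5 GB file the loop is the same; only the source of rows changes.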

Specialized math

To tackle some types of problems you'll need to learn some math to understand how particular algorithms work and what effect they'll have on your data. There are various algorithms for clustering data, dimensionality reduction, natural language processing, etc. You may never use many of these, depending on the type of data you wish to analyze, but there are abundant resources on the Internet (and Stack Exchange sites) should you need help.

For an introductory overview of data mining techniques, Witten's Data Mining is good. I have the 1st edition, and it explains concepts in plain language with a bit of math thrown in. I recommend it because it provides a good overview and it's not too expensive -- as you read more into the field you'll notice many of the books are quite expensive. The only drawback is a number of pages dedicated to using WEKA, a Java data mining package, which might not be too helpful as you're using Python (but it is open source, so you may be able to glean some ideas from the source code). I also found Introduction to Machine Learning to provide a good overview; it is also reasonably priced, with a bit more math.


For creating visualizations of your own invention, on a single machine, I think the basics should get you started: Python, Numpy, Scipy, Matplotlib, and a good graphics library you have experience with, like PIL or Pycairo. With these you can crunch numbers, plot them on graphs, and pretty things up via custom drawing routines.

When you want to create moving, interactive visualizations, tools like the Java-based Processing library make this easy. There are even ways of writing Processing sketches in Python via Jython, in case you don't want to write Java.

There are many more tools out there, should you need them, like OpenCV (computer vision, machine learning), Orange (data mining, analysis, viz), and NLTK (natural language, text analysis).

Presentation principles and techniques

Books by folks in the field like Edward Tufte and references like Information Graphics can help you get a good overview of the ways of creating visualizations and presenting them effectively.

Resources to find Viz examples

Websites like Flowing Data, Infosthetics, Visual Complexity and Information is Beautiful show recent, interesting visualizations from across the web. You can also look through the many compiled lists of visualization sites out there on the Internet. Start with these as a seed and start navigating around; I'm sure you'll find a lot of useful sites and inspiring examples.

(This was originally going to be a comment, but grew too long)

I'm looking to port our home-grown platform of various machine learning algorithms from C# to a more robust data mining platform such as R. While it's obvious R is great at many types of data mining tasks, it is not clear to me if it can be used for text classification.

Specifically, we extract a list of bigrams from the text and then classify it into one of 15 different categories, eg:

Bigram list: jewelry, books, watches, shoes, department store -> Category: Shopping

We'd want to both train the models in R as well as hook up to a database to perform this on a larger scale.

Can it be done in R?

Hmm, I am only just starting to look into Machine Learning, but I might have a suggestion: have you considered Weka? It comes with a bunch of various algorithms and there is some documentation. Plus, there is an R package, RWeka, that makes use of the Weka jars.

EDIT: There is also a nice, comprehensive read by Witten et al., Data Mining, that contains an extensive description of Weka among other interesting things. Look into the API opportunities.

I have a situation where I have collected mouse movement points from a website. I have a series of (x, y)-points, and I need to detect different repeated patterns of mouse movement from this data. For example, mouse moving very slowly, mouse moving very fast toward a direction and then stopping for a while, mouse scrolling etc... I need to detect such patterns from my data.

Is there a way to do that with OpenCV ...or maybe some other Library?

P.S. Please keep in mind, that I am a beginner in this kind of stuff.

Thanks in advance!

Although OpenCV does have some good data analysis and machine learning algorithms, it is really a library geared toward computer vision (thus the CV name). It sounds like you have already done the data capture, and now you want to perform what is called data mining.

Data mining toolkits have many more tools and algorithms for this type of analysis than OpenCV does, so I would point you toward those. A good open-source toolkit to get started with is Weka (see Weka Sourceforge and Weka Home). It is written in Java, so it will run on just about anything. Here is the manual for Weka 3.6.0. There is also a good book to help get you started using Weka, available here.

Since you are a beginner, do understand that the learning curve for data-mining can seem a bit steep at first, but just take it slowly :) Maybe as a first project, just try to cluster the different (x, y) positions, then use some of Weka's visualization tools to see where the users are placing the mouse on the screen.
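To give a feel for what "clustering the (x, y) positions" means before you open Weka, here is a minimal sketch of k-means in plain Python. This is not Weka's implementation, just the textbook algorithm written out for illustration; the sample points, the choice of k = 2, and the function names are all made up.

```python
import random


def assign(points, centers):
    """Group each point with its nearest center (squared Euclidean distance)."""
    clusters = [[] for _ in centers]
    for x, y in points:
        nearest = min(
            range(len(centers)),
            key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
        )
        clusters[nearest].append((x, y))
    return clusters


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: alternate assignment and center updates until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct input points
    for _ in range(iterations):
        clusters = assign(points, centers)
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                mean_x = sum(p[0] for p in members) / len(members)
                mean_y = sum(p[1] for p in members) / len(members)
                new_centers.append((mean_x, mean_y))
            else:
                # Keep the old center if no point was assigned to it.
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical mouse positions: some near the top-left, some near a button further down.
mouse_points = [(10, 10), (12, 11), (11, 9), (200, 300), (202, 298), (199, 301)]
centers, clusters = kmeans(mouse_points, k=2)
for center, members in zip(centers, clusters):
    print(center, len(members))
```

Each resulting center is roughly "a place on the screen where the mouse tends to be". Detecting patterns such as fast or slow movement would need extra features (for example, distance between consecutive points per unit of time), which this sketch does not cover.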

Once you are comfortable enough to perform basic clustering, then come back with more questions. Also, Cross Validated (a Stack Exchange site dedicated to statistics and data mining) is where you'll want to direct future questions on this subject.

Hope you find this information helpful!

How does the KEA algorithm for keyphrase extraction use WEKA to find keyphrases in given text documents? I understand the basic logic, i.e., it first cleans the input, then generates n-grams, removes stop words, and does stemming, and then it generates feature values. I want to know what KEA does next. For what purpose does it use WEKA?

KEA is based on a machine-learned model built with supervised learning techniques. Basically, this means we first need training data: example documents, each with a list of key phrases selected by humans. You use this training data to calculate features, and then you feed these features to a machine learning algorithm, which builds a model. The model essentially tells you what the output should be for a given set of input features. One of the best resources for going deeper into how machine learning works is the book by the creators of WEKA themselves.

KEA uses the Naive Bayes algorithm to calculate the output from a given set of features. WEKA has a nice, efficient implementation of Naive Bayes, but I wouldn't say WEKA is special in that. Nearly any machine learning toolkit worth its salt has a good implementation of Naive Bayes, so any toolkit could have been used, really. But that's what happens next: you give these features to the Naive Bayes implementation in WEKA and get the output, which is ranking scores for key phrases.
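To make the "features in, ranking scores out" step concrete, here is a rough sketch of Naive Bayes ranking in plain Python. It is not KEA's or WEKA's code. It assumes each candidate phrase has already been reduced to two discretized features (for illustration, a TF×IDF bin and a first-occurrence bin, each 0, 1 or 2), and the training examples, phrases, and function names are all invented.

```python
from collections import Counter, defaultdict


def train_naive_bayes(examples, n_bins=3):
    """examples: list of (feature_tuple, is_key) pairs with discretized features."""
    class_counts = Counter(label for _, label in examples)
    # feature_counts[label][feature_index][bin_value] -> number of examples
    feature_counts = {label: defaultdict(Counter) for label in class_counts}
    for features, label in examples:
        for i, value in enumerate(features):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts, n_bins


def key_probability(model, features):
    """P(is_key | features) under the naive independence assumption."""
    class_counts, feature_counts, n_bins = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # class prior
        for i, value in enumerate(features):
            # Laplace smoothing so unseen bins do not zero out the product.
            score *= (feature_counts[label][i][value] + 1) / (count + n_bins)
        scores[label] = score
    return scores[True] / sum(scores.values())


# Hypothetical training data: (tfidf_bin, first_occurrence_bin) -> chosen by a human?
training = [
    ((2, 0), True), ((2, 1), True), ((1, 0), True),
    ((0, 2), False), ((0, 1), False), ((1, 2), False), ((0, 2), False),
]
model = train_naive_bayes(training)

# Hypothetical candidates from a new document, already turned into feature bins.
candidates = {"data mining": (2, 0), "the results": (0, 2), "machine learning": (1, 1)}
ranked = sorted(candidates, key=lambda p: key_probability(model, candidates[p]), reverse=True)
print(ranked)
```

The probability itself serves as the ranking score: the top few phrases would be reported as keyphrases. The training data must contain both positive and negative examples for this sketch to work.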

Please see here for how KEA works, starting at page 5. Page 6 describes what you have already found, i.e., stemming and n-grams. Pages 7 and 8 describe the features it calculates. The following pages describe training and the ML algorithm.