Data mining

Ian H. Witten, Eibe Frank

Mentioned in questions and answers.

A developer I am working with is developing a program that analyzes images of pavement to find cracks in the pavement. For every crack his program finds, it produces an entry in a file that tells me which pixels make up that particular crack. There are two problems with his software though:

1) It produces several false positives

2) When it finds a crack, it only finds small sections of it and reports those sections as separate cracks.

My job is to write software that will read this data, analyze it, and tell the difference between false-positives and actual cracks. I also need to determine how to group together all the small sections of a crack as one.

I have tried various ways of filtering the data to eliminate false positives, and have been using neural networks, with limited success, to group cracks together. I understand there will be some error, but right now there is just too much. Does anyone have any insight for a non-AI expert as to the best way to accomplish my task or learn more about it? What kinds of books should I read, or what kinds of classes should I take?

EDIT My question is more about how to notice patterns in my coworker's data and identify those patterns as actual cracks. It's the higher-level logic that I'm concerned with, not so much the low-level logic.

EDIT In all actuality, it would take AT LEAST 20 sample images to give an accurate representation of the data I'm working with. It varies a lot. But I do have a sample here, here, and here. These images have already been processed by my coworker's process. The red, blue, and green data is what I have to classify (red stands for dark crack, blue stands for light crack, and green stands for a wide/sealed crack).

You should read about data mining, especially pattern mining.

Data mining is the process of extracting patterns from data. As more data are gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool to transform these data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

A good book on the subject is Data Mining: Practical Machine Learning Tools and Techniques, by Ian H. Witten and Eibe Frank.

Data Mining can be bought on Amazon.

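Basically, what you have to do is apply statistical tools and methodologies to your datasets. Two of the most commonly used tests are Student's t-test, which checks whether the means of two groups differ, and the chi-squared test, which checks whether two categorical variables are independent. Both tell you, at a chosen confidence level, whether an apparent relationship between variables is likely to be real.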
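As a rough illustration of the chi-squared test, here is a minimal sketch in plain Python. The counts are made up, and the "elongated" feature is only an assumed example of something you might compute for each detected segment; the idea is to check whether that feature is related to a segment being a real crack.

```python
def chi_squared_2x2(table):
    """Return the Pearson chi-squared statistic for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Critical value of the chi-squared distribution, 1 degree of freedom, alpha = 0.05.
CRITICAL_005_DF1 = 3.841

# Hypothetical counts from manually reviewed segments.
# Rows: segment is elongated (yes, no). Columns: real crack, false positive.
observed = [[40, 10],
            [15, 35]]

stat = chi_squared_2x2(observed)
related = stat > CRITICAL_005_DF1
print(f"chi-squared = {stat:.2f}, related at the 5% level: {related}")
```

If the statistic exceeds the critical value, the feature is probably worth keeping as an input when separating cracks from false positives.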

What’s the best approach to recognize patterns in data, and what’s the best way to learn more on the topic?

The best approach is to study pattern recognition and machine learning. I would start with Duda's Pattern Classification and use Bishop's Pattern Recognition and Machine Learning as a reference. It will take a good while for the material to sink in, but getting a basic sense of pattern recognition and the major approaches to classification problems should give you a direction. I could sit here and make assumptions about your data, but honestly you probably have the best idea about the data set, since you've been dealing with it more than anyone. Useful techniques could include, for instance, support vector machines and boosting.
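To give a feel for what boosting does, here is a minimal sketch of AdaBoost with one-feature threshold rules ("decision stumps") in plain Python. The data is a toy set I made up: the single feature is a hypothetical orientation bin for each detected segment, and the labels (+1 for crack, -1 for false positive) are invented. No single threshold separates them, but a weighted vote of three stumps does.

```python
import math

def stump_predict(row, feature, threshold, sign):
    """Predict `sign` when the feature is at or above the threshold, else `-sign`."""
    return sign if row[feature] >= threshold else -sign

def best_stump(X, y, weights):
    """Search all features and thresholds for the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for sign in (1, -1):
                err = sum(w for w, row, label in zip(weights, X, y)
                          if stump_predict(row, feature, threshold, sign) != label)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, sign)
    return best

def adaboost(X, y, rounds):
    """Train a list of (alpha, feature, threshold, sign) weighted stumps."""
    n = len(X)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, feature, threshold, sign = best_stump(X, y, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid division by zero and log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, feature, threshold, sign))
        # Increase the weight of misclassified samples, decrease the rest.
        weights = [w * math.exp(-alpha * label * stump_predict(row, feature, threshold, sign))
                   for w, row, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, row):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(row, f, t, s) for alpha, f, t, s in model)
    return 1 if score >= 0 else -1

# Toy data: one hypothetical feature (orientation bin 0..9) per segment.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = adaboost(X, y, rounds=3)
errors = sum(predict(model, row) != label for row, label in zip(X, y))
print(f"training errors after 3 rounds: {errors}")
```

On real data you would use several features per segment and measure the error on held-out images rather than on the training set.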

Edit: An interesting application of boosting is real-time face detection. See Viola and Jones's Rapid Object Detection using a Boosted Cascade of Simple Features (pdf). Also, looking at the sample images, I'd say you should try improving the edge detection a bit. Smoothing the image with a Gaussian filter and then running more aggressive edge detection might pick up more of the smaller cracks.
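Here is a minimal sketch of the smooth-then-detect idea in plain Python, assuming a grayscale image stored as a list of rows. I'm using a Sobel gradient as the edge detector, which is my choice for illustration and not necessarily what your coworker's process uses; the image, sigma, and radius values are made up.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    vals = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(vals)
    return [v / total for v in vals]

def convolve_rows(img, kernel):
    """Convolve every row with the kernel, clamping coordinates at the border."""
    radius = len(kernel) // 2
    width = len(img[0])
    out = []
    for row in img:
        out.append([
            sum(kv * row[min(max(x + k - radius, 0), width - 1)]
                for k, kv in enumerate(kernel))
            for x in range(width)
        ])
    return out

def transpose(img):
    return [list(col) for col in zip(*img)]

def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma, radius)
    blurred_rows = convolve_rows(img, kernel)
    return transpose(convolve_rows(transpose(blurred_rows), kernel))

def sobel_magnitude(img):
    """Gradient magnitude from 3x3 Sobel filters (border pixels are left at 0)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = math.hypot(gx, gy)
    return out

# Toy image: bright pavement (200) with a dark one-pixel vertical crack (50) at column 5.
image = [[50 if x == 5 else 200 for x in range(11)] for _ in range(7)]

edges = sobel_magnitude(gaussian_blur(image, sigma=1.0, radius=2))
# Lowering this threshold is one way to make the detection "more aggressive".
threshold = 100
edge_pixels = [(y, x) for y in range(7) for x in range(11) if edges[y][x] > threshold]
print(edge_pixels)
```

The blur suppresses single-pixel noise before the gradient is taken, so you can afford a lower threshold without flooding the output with false positives.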