Webbots, Spiders, and Screen Scrapers

Michael Schrenk


Provides information on ways to automate online tasks using webbots and spiders, covering such topics as parsing data from Web pages, managing cookies, sending and receiving email, and decoding encrypted files.


Mentioned in questions and answers.

What built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?

There is a book on this topic, "Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL" - see a review here

php|architect covered it in a well-written article by Matthew Turland in the December 2007 issue.
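As for built-in functions, a minimal sketch using only PHP's bundled DOM extension (DOMDocument plus DOMXPath) might look like the following; the markup is illustrative and the fetch is left as a comment with a placeholder URL:

```php
<?php
// Minimal scraping sketch using only built-in PHP features (the DOM
// extension ships with standard PHP builds). The HTML here is a small
// inline sample; in a real bot you would fetch the page first, e.g.
// $html = file_get_contents('http://example.com/events');
$html = '<html><body><h1>Events</h1><p class="event">Concert at 8pm</p></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings from imperfect real-world markup

// DOMXPath lets you query for elements by tag, class, or id
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//p[@class="event"]') as $node) {
    echo $node->textContent, "\n"; // prints: Concert at 8pm
}
```

For fetching, `file_get_contents` works for simple GETs (when `allow_url_fopen` is enabled), while the cURL extension gives finer control over headers, cookies, and POSTs.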

I want to crawl for specific things: events that are taking place, like concerts, movies, art gallery openings, and so on. Anything that one might spend time going to.

How do I implement a crawler?

I have heard of Grub (grub.org -> Wikia) and Heritrix (http://crawler.archive.org/)

Are there others?

What opinions does everyone have?


There's a good book on the subject I can recommend called Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL.

Does anyone out there have a good tutorial or introduction to the various forms of botting? This includes bots for video games and for web pages (spiders, crawlers, and web scrapers). I am not looking to do anything malicious; this is purely educational, so please do not link to anything that teaches something harmful.

Not a tutorial, but I can recommend the book Webbots, Spiders, and Screen Scrapers.

I need to do a fairly extensive project involving web scraping and am considering using Hpricot or Beautiful Soup (i.e. Ruby or Python). Has anyone come across a tutorial that they thought was particularly good on this subject that would help me start the project off on the right foot?

Not a tool, really, but a good discussion is Michael Schrenk's book, Webbots, Spiders, and Screen Scrapers.

The book succeeds very well in its stated mission: explaining how to build simple web bots and operate them in accordance with community standards. It’s not everything you need to know, but it’s the best introduction I’ve seen. The focus is on simple, single-threaded bots. There’s brief mention of using multiple bots that store data in a central repository, but no discussion of the issues involved in writing multi-threaded or distributed bots that can process hundreds of pages per second.

I recommend that you read this book if you’re at all interested in writing Web bots, even if you’re not familiar with or intending to use PHP. But be sure not to expect more than the book offers.

I'm curious about website scraping (i.e., how it's done); specifically, I'd like to write a script to perform the task for the site Hype Machine. I'm a fourth-year Software Engineering undergraduate, but we don't really cover any web programming, so my understanding of JavaScript, RESTful APIs, and all things web is pretty limited; we're mainly focused on theory and client-side applications. Any help or direction is greatly appreciated.

You may want to check the following books:

"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL" http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204

"HTTP Programming Recipes for C# Bots" http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677

"HTTP Programming Recipes for Java Bots" http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669

Possible Duplicate:
How to parse and process HTML with PHP?

How do I go about pulling specific content from a given live online HTML page?

For example: http://www.gumtree.com/p/for-sale/ovation-semi-acoustic-guitar/93991967

I want to retrieve the text description, the path to the main image, and the price only. So basically, I want to retrieve content that sits inside specific divs, perhaps with specific IDs or classes, in an HTML page.

Pseudocode:

$page = load_html_contents('http://www.gumtr..');
$price = getPrice($page);
$description = getDescription($page);
$title = getTitle($page);

Please note I do not intend to steal any content from gumtree, or anywhere else for that matter, I am just providing an example.

First of all, what you want to do is called web scraping. Basically, you load the HTML content into a variable and then search it for the specific IDs or classes you need, for example with regular expressions. Search for "web scraping" to learn more.

Here is a basic tutorial.

This book should be useful too.
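Regular expressions work for simple pages, but a more robust sketch of the asker's pseudocode uses PHP's DOM extension. The element ids used here ("price", "title") are hypothetical; inspect the real page to find the actual ids or classes:

```php
<?php
// Sketch of the pseudocode above using DOMDocument + DOMXPath instead
// of regular expressions. The ids below are hypothetical examples.
function load_html_contents($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);          // @ silences warnings on messy markup
    return new DOMXPath($doc);
}

function get_text_by_id(DOMXPath $xpath, $id) {
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    return $nodes->length ? trim($nodes->item(0)->textContent) : null;
}

// Inline HTML for illustration; in practice, fetch the live page first
// (e.g. with cURL) and pass the response into load_html_contents().
$page  = load_html_contents('<div id="price">250</div><div id="title">Ovation guitar</div>');
$price = get_text_by_id($page, 'price');   // "250"
$title = get_text_by_id($page, 'title');   // "Ovation guitar"
```

The advantage over regexes is that XPath queries keep working when attribute order or whitespace changes, and they fail cleanly (returning null) when an element is missing.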

Hello StackOverflow fellows,

I am trying to build a PHP-based forum for CAs, and I want the forum to be exclusive to CAs. On the sign-up page, users will be asked for their membership number, which needs to be validated against another URL. The problem is that the other site doesn't provide an API for this; one has to enter the membership number manually and submit a form to get the membership details.

The URL of the site where one can check the status is:

Sample membership number: 406691

The site uses POST data, so no argument can be passed via the URL.

Is there any way this can be automated, or do I need to manually approve all registrations?

You can create a script that scrapes the content of that link. The catch is that you will have to maintain the script every time the website gets updated.

As the form doesn't have a CAPTCHA or any mechanism to prevent automated queries, you can set up something simple.

You can make the POST request using cURL:

//set POST variables
$url = ''; // the membership-check endpoint
$fields = array(
    'mrn' => '406691',
);

//url-ify the data for the POST
$fields_string = '';
foreach ($fields as $key => $value) {
    $fields_string .= $key . '=' . urlencode($value) . '&';
}
$fields_string = rtrim($fields_string, '&');

//open connection
$ch = curl_init();

//set the url, number of POST vars, POST data
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, count($fields));
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_string);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); //return the response as a string

//execute post
$result = curl_exec($ch);

//close connection
curl_close($ch);
Take a look at the following links: