Webbots, Spiders, and Screen Scrapers, 2nd Edition

Michael Schrenk

Mentioned 5

Provides information on ways to automate online tasks using webbots and spiders, covering such topics as parsing data from Web pages, managing cookies, sending and receiving email, and decoding encrypted files.

Mentioned in questions and answers.

I want to screen scrape several Ajax-based websites, simulate clicks that refresh part of the page, and then read the updated HTML. Is there a Java library that can do this?

These books should help you (although only the first one is intended for Java developers):

Could anyone share good resources or tutorials about PHP cURL?

You can read this book: Michael Schrenk, Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL (No Starch Press, 2012). See the reviews at http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593273975/ref=sr_1_1?ie=UTF8&qid=1392107672&sr=8-1&keywords=curl+php

I am trying to scrape data from ASP and JSP sites that use Ajax, sessions, and POST variables for data navigation and display. I have gone through various articles and SO questions on data scraping, but they have not helped much. I parsed some sites by modifying headers, but most sites respond to my custom headers with a redirect. What is the proper way to parse data from sites that rely on JavaScript, and from ASP sites that use a viewstate variable?

If you want to learn about web scrapers, I recommend you read this book:

Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL

Really, this is the only book dedicated to web scrapers. It's written for PHP developers, but I think the basics the book teaches will help any developer understand how webbots work.

I also emailed the author with some questions and he got back to me in a few minutes. I highly recommend this book to anyone who wants to learn about web scraping.
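
On the viewstate part of the question: ASP.NET WebForms pages usually keep their state in hidden form fields such as __VIEWSTATE and __EVENTVALIDATION, and the server generally expects those values to be sent back with the next POST. Here is a minimal sketch of collecting them from a page you have already fetched. It is in Python rather than the book's PHP, it is not code from the book, and the names `HiddenFieldParser` and `extract_hidden_fields` are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        name = attributes.get("name")
        if is_hidden and name:
            self.fields[name] = attributes.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of every hidden form field found in the HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

The returned dictionary can be merged with your own form values before you send the POST, so the server sees the same hidden fields a browser would have submitted.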

I'm trying to put together a webbot/scraper using Google Apps Script.

Here is a book that teaches how to do it, but it is in PHP.


Now, I know you can use PHP and link it to Google Spreadsheets, but I don't want to do that. I want everything to be in Google Apps Script. Even running PHP in Google Apps Script is OK. I just want to keep everything in the cloud.

Does anybody know the best way to approach this? Which libraries are best to use?


You should look at the UrlFetch Services provided in Apps Script. There are numerous questions already here about using this service, so you should be able to find many relevant examples.

I need to do some screen scraping on a web page where the content I need is generated by AJAX. The initial page has a table with 4 tabs. When you click any of the tabs, the content of the table changes. I need the content from the 3rd tab only. I used the Google Chrome 'Inspect Element' tool to see what the requests and POST data were, and I can get the information I need when I copy that information (session id and a lot of other cookie data, as well as the POST data) from the Inspect Element result into a PHP cURL request. But this only works for the 30 minutes that the session lasts. Does anyone know of a way I can get to this information?

I won't reproduce the code here, but I will point you to the answer. It's in this book:


A must-buy for someone doing what you're doing.
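
Since the book's code is not reproduced here, the following is only a rough sketch of one common approach, not the author's method. Instead of pasting a session id copied from the browser, the script loads the initial page itself so the server can issue a fresh session cookie on every run, then sends the AJAX POST through the same cookie jar. It is written in Python rather than PHP. The function names, URLs, and form fields are hypothetical placeholders, and it assumes the site hands out its session cookie on the initial page without a separate login step.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session():
    """Create an opener that stores and resends cookies automatically."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch_tab(opener, landing_url, ajax_url, form_fields):
    """Load the landing page for fresh cookies, then POST for the tab data."""
    # Step 1: visit the initial page so the server can set a new session
    # cookie (assumption: no login form is needed to get one).
    with opener.open(landing_url) as response:
        response.read()

    # Step 2: repeat the AJAX request seen in the browser's network tools.
    # Passing data makes urllib send a POST; cookies are attached by the opener.
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    with opener.open(ajax_url, data=body) as response:
        return response.read().decode("utf-8", errors="replace")
```

You would call `build_session()` once per run and pass the opener to `fetch_tab` together with the real landing URL, the AJAX URL, and the POST fields you saw in Inspect Element. If the site requires a login, a POST to the login form would have to go between the two steps.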