Provides information on ways to automate online tasks using webbots and spiders, covering such topics as parsing data from Web pages, managing cookies, sending and receiving email, and decoding encrypted files.
I want to screen scrape several Ajax-based websites, simulate clicks that refresh part of the page, and then read the updated HTML. Is there a Java library that can do this?
These books should help you (although only the first one is intended for Java developers):
Could anyone share good resources or tutorials about PHP cURL?
You can read this book: Michael Schrenk, Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL, No Starch Press (2012). View the reviews at http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593273975/ref=sr_1_1?ie=UTF8&qid=1392107672&sr=8-1&keywords=curl+php
If you want to learn about web scrapers, I recommend you read that book.
As far as I know, it is the only book dedicated to web scrapers. It is written for PHP developers, but I think the basics it teaches will help any developer understand how webbots work.
I also emailed the author with some questions and he got back to me within a few minutes. I highly recommend the book to anyone who wants to learn about web scraping.
I'm trying to put together a webbot/scraper using Google Script.
Here is a book teaching how to do it but it is in PHP.
Now, I know you can use PHP and link it to Google Spreadsheets, but I don't want to do that. I want everything to be in Google Script. Even running PHP in Google Script is OK. I just want to keep everything in the cloud.
Does anybody know the best way to approach this? Which libraries are best to use?
I need to do some screen scraping on a web page where the content I need is generated by AJAX. The initial page has a table with 4 tabs. Clicking any of the tabs changes the content of the table. I need the content from the 3rd tab only. I used the Google Chrome 'Inspect Element' tool to see what the requests and POST data were, and I can get the information I need when I copy those values (the session ID and a lot of other cookie data, as well as the POST data) into a PHP cURL request. But this only works for the 30 minutes that the session lasts. Does anyone know of a way to get this information without depending on a session copied from the browser?
I won't reproduce the code here, but I will point you to the answer. It's in this book:
A must-buy for someone doing what you're doing.
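For what it's worth, the usual way around the 30-minute limit is to stop pasting a browser session into the script and let the script obtain its own: load the initial page first so the server issues a fresh cookie, then replay the AJAX POST with that cookie. Below is a minimal sketch of that idea. It is not the book's code, and it is in Python (standard library only) rather than PHP; in PHP cURL the same role is played by the `CURLOPT_COOKIEJAR` and `CURLOPT_COOKIEFILE` options. The function names, URLs, and form fields are hypothetical placeholders. It also assumes the session lives only in cookies; if the site embeds a token in the page HTML as well, you would have to parse it out of the first response.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar). The opener stores cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Hypothetical User-Agent string; some sites reject the library default.
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (compatible; example-bot)")]
    return opener, jar


def build_tab_request(ajax_url, form_fields, referer):
    """Build the POST that a tab click sends.

    form_fields holds the non-session POST data seen in the browser's
    network inspector; do not put the copied session ID in it.
    """
    body = urllib.parse.urlencode(form_fields).encode("utf-8")
    req = urllib.request.Request(ajax_url, data=body, method="POST")
    req.add_header("X-Requested-With", "XMLHttpRequest")
    req.add_header("Referer", referer)
    return req


def fetch_tab(page_url, ajax_url, form_fields):
    """Get a fresh session from page_url, then fetch the tab content."""
    opener, _jar = make_session()
    # Step 1: load the initial page so the server sets a new session cookie.
    opener.open(page_url, timeout=30).read()
    # Step 2: replay the AJAX call; the opener attaches the cookie itself.
    req = build_tab_request(ajax_url, form_fields, referer=page_url)
    with opener.open(req, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

You would call it as `fetch_tab(page_url, ajax_url, {"tab": "3"})`, with your own URLs and field names. Because every run starts by requesting a new session, nothing in the script expires.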