Code in the Cloud

Mark C. Chu-Carroll

Provides information on building scalable Web applications using Google App Engine.

Mentioned in questions and answers.

I'm looking for a way to parse a product detail page such as http://www.amazon.com/Code-Cloud-Pragmatic-Programmers-Chu-Carroll/dp/1934356638/ref=sr_1_1?ie=UTF8&qid=1359231803&sr=8-1&keywords=code+in+the+cloud but I'm unable to get the expected content out of the page.

I've inspected the elements and found an id named "btAsinTitle" that should give me the title from that Amazon.com product details page, but nothing comes up in PHP. I have also checked that the title is not loaded from an external resource, say JavaScript pulling it in on the Amazon.com side (though I'm not 100% sure). I looked at the documents that were loaded, and the document at the exact URL given above does contain the "btAsinTitle" id that I am looking for.

This is really the first step of my little assignment to parse the page for details. There are a few other fields I need as well, including the author, price, and availability (whether the product is in stock). The snippet I am currently running is the AmazonBook class at the end of this post.

As an additional curiosity, what techniques can be used to prevent scraping, and is it possible that Amazon is preventing its product pages from being scraped? I also know that I could use the API, but I'm trying to stick to the assignment's rules, which means not using the API and not registering for an API key just for an assignment. Thanks in advance!
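On the blocking question, one rough check is to look at the HTTP status code before blaming the parser. When cURL is asked to include headers (CURLOPT_HEADER), the returned string starts with the status line, so it can be split off. This is only a sketch, separate from the AmazonBook class: the function name split_raw_response and the sample response are made up for illustration, and it only handles a single header block (no redirects or interim responses).

```php
<?php
// Split a raw HTTP response (headers + body in one string) into its
// status code and body. Hypothetical helper, single header block only.
function split_raw_response(string $raw): array {
    // Headers end at the first blank line (CRLF CRLF).
    $parts = explode("\r\n\r\n", $raw, 2);
    $head = $parts[0];
    $body = $parts[1] ?? '';

    // The status line looks like "HTTP/1.1 200 OK".
    $status = null;
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $head, $m)) {
        $status = (int) $m[1];
    }

    return ['status' => $status, 'body' => $body];
}

// Made-up response for illustration; not captured from Amazon.
$raw = "HTTP/1.1 503 Service Unavailable\r\n"
     . "Content-Type: text/html\r\n"
     . "\r\n"
     . "<html><body>Sorry</body></html>";

$result = split_raw_response($raw);
var_dump($result['status']);
var_dump($result['body']);
```

A status other than 200, or a body that is much shorter than the page seen in the browser, would suggest the request is being refused or redirected rather than mis-parsed.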

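class AmazonBook {
    protected $doc;

    public $url;
    public $title;
    public $author;
    public $price;
    public $availability;

    public function __construct($url) {
        $this->url = $url;

        $this->set_dom();
        // $this->set_availability();
        // $this->set_price();
        // $this->set_author();
        $this->set_title();
    }

    // Sets the title (for now it only dumps the element with id "btAsinTitle")
    protected function set_title() {
        // Debug output: this is the lookup that currently comes back empty
        var_dump($this->doc->getElementById('btAsinTitle'));
        die();

        // Alternative attempt: dump every <span> to look for the title
        // foreach ($this->doc->getElementsByTagName('span') as $span) {
        //     var_dump($span->nodeValue);
        // }
        // die();
    }

    // Fetches the page with cURL and loads it into a DOMDocument
    protected function set_dom() {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $this->url);
        // Leave the HTTP headers out so only the HTML body reaches loadHTML()
        curl_setopt($ch, CURLOPT_HEADER, false);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1');

        // curl_exec() returns false on failure, so keep the result to check it
        $html = curl_exec($ch);
        curl_close($ch);

        $this->doc = new DOMDocument();
        if ($html === false || $html === '') {
            // Nothing to parse; $this->doc stays an empty document
            return;
        }

        // @ suppresses the warnings that real-world HTML usually triggers
        @$this->doc->loadHTML($html);
    }
}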

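// Test code: the product detail page quoted above, where the btAsinTitle id was observed
$url = 'http://www.amazon.com/Code-Cloud-Pragmatic-Programmers-Chu-Carroll/dp/1934356638/ref=sr_1_1?ie=UTF8&qid=1359231803&sr=8-1&keywords=code+in+the+cloud';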
$code_in_cloud = new AmazonBook($url);
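
To separate the parsing problem from the download problem, the lookup can be tried on a small hard-coded fragment first. The following is a minimal sketch, not the real Amazon markup: the HTML string and the helper name find_text_by_id are made up for illustration. It uses DOMXPath rather than getElementById, so it does not rely on the parser having registered id attributes as IDs. The id is pasted into the XPath expression as-is, so this only suits simple ids without quote characters.

```php
<?php
// Hypothetical stand-in for a fetched product page; real markup may differ.
$html = '<html><body><h1><span id="btAsinTitle">Code in the Cloud</span></h1></body></html>';

// Return the trimmed text of the first element with the given id, or null.
function find_text_by_id(string $html, string $id): ?string {
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match any element whose id attribute equals $id.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}

var_dump(find_text_by_id($html, 'btAsinTitle'));
var_dump(find_text_by_id($html, 'missing'));
```

If the helper finds the title in the fragment but not in the downloaded page, the difference is most likely in what the server returned rather than in the DOM lookup itself.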