PHP cURL POST - Web Scraper

I'm creating a web scraper at work for a database we don't have direct access to, although we can get the data as HTML.

The problem is that it's behind a login. I'm assuming I can send prepared POST data (username and password) to trick the ASP script into giving me a cookie.

Then PHP would have to present the cookie, request a certain page and scrape the data into a MySQL table.

I've never really looked into cURL and, if I'm honest, I'm only a PHP novice. I've looked at Wez Furlong's guide, but I think I need something more basic ;)

Any pointers? (Sorry if this is really simple, it's late on a Friday and I'm on a coffee high.)
 
I guess I don't have to use PHP, but it's the language I know.
I'll have a look at that Python library when I get on a desktop.
 
The place I used to work for took scraping to a high art, and I've handled a lot of different login scenarios.

I'd suggest you get a copy of HTTP Analyzer, do some simulated logins and work out the methods, headers and variables sent. Most logins are easy to simulate, though some sites have a network of redirects and the like.
 
You can also download curl.exe, or the equivalent for your operating system, and play around with it on the command line before you start integrating it into PHP or another language. This way you can see the exact parameters you need to get to the content.

As for the scraping, I've used XPath expressions in the past and parsed the DOM objects with PHP.
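
As a rough illustration of the XPath approach, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath. The sample table, its `quotes` id and the `extractRows()` helper are made up for the example; the real report page will have its own structure, so the XPath query would need adjusting.

```php
<?php
// Hypothetical sample markup; the real report page's structure is unknown.
$html = <<<HTML
<html><body>
<table id="quotes">
  <tr><th>Ref</th><th>Customer</th><th>Total</th></tr>
  <tr><td>Q-1001</td><td>Acme Ltd</td><td>250.00</td></tr>
  <tr><td>Q-1002</td><td>Bloggs &amp; Co</td><td>99.50</td></tr>
</table>
</body></html>
HTML;

// Return one array of trimmed cell texts per row matched by $rowQuery.
function extractRows(string $html, string $rowQuery): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep parse warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query($rowQuery) as $tr) {
        $cells = [];
        // A relative query: only the <td> children of this row.
        foreach ($xpath->query('td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Select only rows that contain data cells, which skips the header row.
$rows = extractRows($html, '//table[@id="quotes"]//tr[td]');
print_r($rows);
```

Each inner array could then be bound to an INSERT statement for the MySQL table.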
 
Thanks RobH, I am having a play with cURL in Terminal =)

Here is what I've stolen so far.

Code:
<?php
// INIT CURL
$ch = curl_init();

// SET URL FOR THE POST FORM LOGIN
curl_setopt($ch, CURLOPT_URL, 'http://quote.ashwyk.com/pricing/login.asp');

// ENABLE HTTP POST
curl_setopt ($ch, CURLOPT_POST, 1);

// SET POST PARAMETERS : FORM VALUES FOR EACH FIELD
curl_setopt ($ch, CURLOPT_POSTFIELDS, 'Username=NotOn&Password=YourNelly');

// IMITATE CLASSIC BROWSER'S BEHAVIOUR : HANDLE COOKIES
curl_setopt ($ch, CURLOPT_COOKIEJAR, 'cookie.txt');

# Setting CURLOPT_RETURNTRANSFER variable to 1 will force cURL
# not to print out the results of its query.
# Instead, it will return the results as a string return value
# from curl_exec() instead of the usual true/false.
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);

// EXECUTE 1st REQUEST (FORM LOGIN)
$store = curl_exec ($ch);

// SET FILE TO DOWNLOAD
curl_setopt($ch, CURLOPT_URL, 'http://quote.ashwyk.com/pricing/admin/quotes_reporting.asp');

// EXECUTE 2nd REQUEST (FILE DOWNLOAD)
$content = curl_exec ($ch);

// CLOSE CURL
curl_close ($ch); 

echo $content; 
?>

Which I get back:

Code:
<head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found <a HREF="./">here</a>.</body>

This happens regardless of whether the login and password are correct.

What I want it to do:

Log in and get the cookie, go to /admin/quotes_reporting.asp, give me the HTML. :D

Any ideas? I think it's not posting correctly.

Edit: thinking about it, is it not following a redirect?
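
One way to check the redirect theory is to look at what `curl_getinfo()` reports after each `curl_exec()`. The sketch below is only an illustration: `describeResponse()` is a made-up helper, and the sample array is hand-built (with an example.com URL) rather than captured from the real site.

```php
<?php
// Summarise the array returned by curl_getinfo($ch) after curl_exec($ch).
function describeResponse(array $info): string
{
    $code   = $info['http_code'] ?? 0;
    // When redirects are not followed, redirect_url holds the URL the
    // server asked the client to go to next.
    $target = $info['redirect_url'] ?? '';
    if ($code >= 300 && $code < 400) {
        return "HTTP $code redirect" . ($target !== '' ? " to $target" : '');
    }
    return "HTTP $code";
}

// Typical use against a live handle (not executed here):
//   $body = curl_exec($ch);
//   echo describeResponse(curl_getinfo($ch)), "\n";

// Hand-built example of what a redirect response might look like.
$sample = ['http_code' => 302, 'redirect_url' => 'http://example.com/pricing/'];
echo describeResponse($sample), "\n";
```

If the login request comes back as a 3xx, the "Object moved" body is just the redirect page, and the real content is at the target URL.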
 
Code:
<?php
// INIT CURL
$ch = curl_init();

// SET URL FOR THE POST FORM LOGIN
curl_setopt($ch, CURLOPT_URL, 'http://quote.ashwyk.com/pricing/login.asp');

// ENABLE HTTP POST
curl_setopt ($ch, CURLOPT_POST, 1);

// ENABLE REDIRECT FOLLOW
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);

// SET POST PARAMETERS : FORM VALUES FOR EACH FIELD
curl_setopt ($ch, CURLOPT_POSTFIELDS, 'Username=NotOn&Password=YourNelly');

// IMITATE CLASSIC BROWSER'S BEHAVIOUR : HANDLE COOKIES
curl_setopt ($ch, CURLOPT_COOKIEJAR, 'cookie.txt');

# Setting CURLOPT_RETURNTRANSFER variable to 1 will force cURL
# not to print out the results of its query.
# Instead, it will return the results as a string return value
# from curl_exec() instead of the usual true/false.
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);

// EXECUTE 1st REQUEST (FORM LOGIN)
$store = curl_exec ($ch);

// SET FILE TO DOWNLOAD
curl_setopt($ch, CURLOPT_URL, 'http://quote.ashwyk.com/pricing/admin/quotes_reporting.asp');

// EXECUTE 2nd REQUEST (FILE DOWNLOAD)
$content = curl_exec ($ch);

// CLOSE CURL
curl_close ($ch); 

echo $content; 
?>


Strange.
 
Blargh, the cookie also seems to hold a plain-text username and password, which isn't being saved by cURL. Would this be the issue?
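
Nothing in the thread confirms what the actual cause is, but two things are worth ruling out. Options set on a cURL handle persist between requests, so a reused handle will POST the login fields again unless it is switched back to GET. The cookie jar file is also only written when the handle is released, and `CURLOPT_COOKIEFILE` is the option that tells cURL which file to read stored cookies from. A minimal sketch, in which `loginOptions()`, `reportOptions()` and `fetchReport()` are hypothetical helper names:

```php
<?php
// Options for the login POST. COOKIEFILE tells cURL where to read stored
// cookies from; COOKIEJAR tells it where to write them on release.
function loginOptions(string $loginUrl, array $fields, string $cookiePath): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // URL-encodes values
        CURLOPT_COOKIEJAR      => $cookiePath,
        CURLOPT_COOKIEFILE     => $cookiePath,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
    ];
}

// Options for the follow-up page: switch the handle back to GET so the
// login form data is not posted a second time.
function reportOptions(string $reportUrl): array
{
    return [
        CURLOPT_URL     => $reportUrl,
        CURLOPT_HTTPGET => true,
    ];
}

// Log in, then fetch the report page on the same handle.
// Returns the page body as a string, or false if the request failed.
function fetchReport(string $loginUrl, array $fields, string $reportUrl, string $cookiePath)
{
    $ch = curl_init();
    curl_setopt_array($ch, loginOptions($loginUrl, $fields, $cookiePath));
    curl_exec($ch);                 // 1st request: log in
    curl_setopt_array($ch, reportOptions($reportUrl));
    $html = curl_exec($ch);         // 2nd request: same handle, same cookies
    unset($ch);                     // releasing the handle writes the jar file
    return $html;
}
```

Calling `fetchReport()` with the login URL, an array such as `['Username' => '...', 'Password' => '...']`, the report URL and a writable cookie path would run both requests. The field names must match the names in the site's login form.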
 
That's the same as Chrome's Inspect Element and Firebug though, isn't it?

Also, I am on OS X.

Not really, it's more like Wireshark optimised for HTTP/HTTPS. It generates a timeline of every HTTP request and response, and from that you can work out what gets sent when and reverse engineer a login. Borrow a Windows box to run it on; they have a 30-day free trial.
 