[PHP] Scraping webpages

Hi there,

Does anyone know of a decent PHP function to scrape the HTML from a webpage? file_get_contents() doesn't always work for me.

I'm trying this at the moment, but it just returns "Bad Request":

Code:
function GetPageHTML($URL)
{
    // Pretend to be a browser; some servers refuse requests with no user agent.
    $userAgent = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)';

    $curl = curl_init($URL);
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);    // set the Referer header on redirects
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow 3xx redirects
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body rather than echoing it
    curl_setopt($curl, CURLOPT_TIMEOUT, 2);           // give up after 2 seconds

    $html = curl_exec($curl);

    // Convert UTF-8 characters to HTML entities (@ suppresses conversion warnings)
    $html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');

    curl_close($curl);

    return $html;
}
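
For reference, I'm calling it along these lines (the epguides URL being one of the pages I'm fetching):

Code:
echo GetPageHTML('http://epguides.com/Scrubs/');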

Thanks.
 
The code you posted in the opening post works fine for me.

Are you hosting this locally or on a webhost? I wonder if accessing external URLs has been blocked?
 
What is the output of:
Code:
print_r(curl_getinfo($curl));
(Place it before your curl_close() line).
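
If you only want the status code, curl_getinfo() can also take a second argument to return a single field:

Code:
// e.g. 200 on success, 400 for the Bad Request you're seeing
$status = curl_getinfo($curl, CURLINFO_HTTP_CODE);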

Returns:

Code:
Array
(
    [url] => http://epguides.com/Scrubs/
    [content_type] => text/html
    [http_code] => 400
    [header_size] => 129
    [request_size] => 140
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 0.685881
    [namelookup_time] => 0.342556
    [connect_time] => 0.502051
    [pretransfer_time] => 0.502106
    [size_upload] => 0
    [size_download] => 20
    [speed_download] => 29
    [speed_upload] => 0
    [download_content_length] => 20
    [upload_content_length] => 0
    [starttransfer_time] => 0.685844
    [redirect_time] => 0
)
1
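
(So the server itself is sending back a 400/Bad Request rather than the request timing out - total_time is well under the 2-second limit.)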
 
The code you posted in the opening post works fine for me.

Are you hosting this locally or on a webhost? I wonder if accessing external URLs has been blocked?

It sometimes works, but other times it doesn't.

I'm downloading about 7 pages sequentially and it usually doesn't fetch them all; even when I try to download just one page it sometimes works and sometimes doesn't. It's an intermittent problem.
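
In the meantime I've been thinking of wrapping each fetch in a retry, something like this sketch (GetPageHTMLRetry is just a made-up name around the GetPageHTML() from the opening post):

Code:
// Sketch only: retry a flaky fetch a few times before giving up.
function GetPageHTMLRetry($URL, $attempts = 3)
{
    for ($i = 0; $i < $attempts; $i++) {
        $html = GetPageHTML($URL);   // function from the opening post
        if ($html !== false && $html !== '') {
            return $html;            // got a non-empty response
        }
        sleep(1);                    // short pause before the next attempt
    }
    return false;                    // every attempt failed
}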
 
LOL...turns out there was some whitespace on the end of the URLs I was trying to fetch, which is what was triggering the Bad Request...and obviously when I printed the string that contained the URL it looked fine.
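
For anyone else hitting this, a trim() before the request should be all it takes (sketch, using the GetPageHTML() from the opening post):

Code:
$URL = trim($URL);          // strip the stray trailing whitespace
$html = GetPageHTML($URL);  // no more 400 Bad Request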

The reason I didn't figure this out sooner is that for some reason my PHP setup won't print any errors from my scripts (even when I manually set the error reporting)...anyone know why that is?
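
From some digging, it looks like display_errors might be switched off in php.ini on my host - error_reporting() on its own doesn't turn the output back on. Forcing both at the top of the script:

Code:
error_reporting(E_ALL);          // report everything
ini_set('display_errors', '1');  // and actually print it to the output
// (note: a parse error in this same file still won't display,
// since the script never runs far enough to reach the ini_set)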
 
Ah, whitespace - been where you are so many times before.
Now if I print anything out to debug my (dodgy) code I also put an asterisk on each side so any whitespace shows up!!
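
i.e. something like this (using $url as a stand-in for whatever string you're checking):

Code:
echo '*' . $url . '*';   // any trailing whitespace shows up before the closing asterisk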

So many hours lost to that problem :D
 