Web Scraping.

Does anyone have any experience in scraping data off a website? It currently provides a feed, but it's incomplete, and after speaking to them they're not going to do anything about it!

I'd ideally like to scrape it, cache it, then serve it back up as JSON. I've limited knowledge of PHP, but I'm always looking to learn!

Thanks
 
I've done it in Python and it's very easy to do. Would that be an option for you?

You can do it using the Beautiful Soup module; here's a scraping tutorial that uses it in combination with the httplib2 module.
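
To give a feel for the parsing step, here's a minimal sketch. It uses Python's standard-library html.parser rather than Beautiful Soup, so it runs without installing anything. The sample HTML, the "price" class name and the PriceParser helper are all made up for illustration; a real page will need its own selectors.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "span" and dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


def extract_prices(html):
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


# Made-up page content; in practice this would be the body of an HTTP response.
SAMPLE_HTML = """
<ul>
  <li>Widget <span class="price">4.99</span></li>
  <li>Gadget <span class="price">12.50</span></li>
</ul>
"""

print(extract_prices(SAMPLE_HTML))
```

Beautiful Soup wraps the same idea in a friendlier interface and copes better with badly formed markup.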


If you have to use PHP, probably the easiest way is to use cURL. Something like this or this will obtain the raw page data; you would then extract the specific data you're after using, for example, regular expressions.
 
Thanks for the suggestions. I'll need to run it on my web hosting, which at the moment is a PHP5 setup, so PHP seems the logical choice. I could run the Python on Google App Engine though, which would be nice as it's not my server load then! I don't have anywhere to run the Java stuff, and GAE doesn't support it apparently.

I think I'll give the Python route a go; I've always wanted to try it.
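
For the "scrape it, cache it, serve it as JSON" part, a rough sketch of the caching step might look like the following. It assumes a simple file cache with a maximum age; load_cached, fake_scrape and the 300-second lifetime are hypothetical names and values, and fake_scrape stands in for the real fetch-and-parse code.

```python
import json
import os
import tempfile
import time


def load_cached(cache_path, max_age_seconds, fetch):
    """Return data from the cache file if it is fresh, else call fetch() and store the result."""
    if os.path.exists(cache_path):
        age = time.time() - os.path.getmtime(cache_path)
        if age < max_age_seconds:
            with open(cache_path, encoding="utf-8") as fh:
                return json.load(fh)
    data = fetch()
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    return data


def fake_scrape():
    # Stand-in for the real scrape; counts its calls and returns made-up records.
    fake_scrape.calls += 1
    return [{"name": "Widget", "price": "4.99"}]


fake_scrape.calls = 0

cache_file = os.path.join(tempfile.mkdtemp(), "scrape_cache.json")
first = load_cached(cache_file, 300, fake_scrape)   # cache miss: runs the scrape
second = load_cached(cache_file, 300, fake_scrape)  # cache hit: reads the file
print(json.dumps(second))  # this string is what a web handler would send back
```

Passing the scrape in as a function keeps the cache logic separate, so the source site is only hit when the cached copy has gone stale.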
 
IIRC it does support Java (it launched with Python and added a Java runtime later):

Google App Engine supports apps written in several programming languages. With App Engine's Java runtime environment, you can build your app using standard Java technologies, including the JVM, Java servlets, and the Java programming language—or any other language using a JVM-based interpreter or compiler, such as JavaScript or Ruby. App Engine also features a dedicated Python runtime environment, which includes a fast Python interpreter and the Python standard library. The Java and Python runtime environments are built to ensure that your application runs quickly, securely, and without interference from other apps on the system.

Regardless, good luck anyway :)
 
I've scraped pages before for product information, which was then stored in a database for later access. If you want to do it yourself using just PHP, then all you need to learn is how to use cURL and the functions strpos() and substr(). Doing it this way will allow you to customize it exactly to your needs, but it requires more effort.
 
You need to make an HTTP request to your URL, pulling the response (the web page) into a string. You can then manipulate the string, using distinctive points in it to extract the data you need. In PHP, I'd use cURL (as Rozzy85 has already mentioned).
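
The "distinctive points" idea can be sketched like this in Python, where str.find and slicing play the role that strpos() and substr() play in PHP. The between helper and the sample page string are made up for illustration.

```python
def between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


# Made-up page; in practice this is the response body pulled into a string.
page = '<html><body><div id="stock">In stock: <b>17</b> units</div></body></html>'
print(between(page, "<b>", "</b>"))
```

Returning None when a marker is missing makes it obvious when the site has changed its layout, which is the usual way this kind of scraper breaks.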
 
Here's an example I cobbled together, years ago. Excuse the bloated code, I was learning as I went along. :p

It would fetch at regular intervals, in this case 60 seconds.

Don't know about JSON; I just loaded the generated CSV files into a MySQL database.

Code:
<?php
 $date=date('Y-m-d'); $startTime='19:45'; $parentdir=$date.' 2000 hrs Fulham v Blackburn';
$i=1; $interval=60; 
if (!is_dir($parentdir)) {
  mkdir($parentdir, 0700);
}
$timeStamp=date('H:i');
echo $startTime; echo " ";
echo $parentdir; echo "\n\n";

while (strtotime($startTime)- strtotime("now")>0) {
  sleep(1); // 1 second
  //$timeStamp=date('H:i');
}

while ($i<=140) {
  $timeStamp=date('H:i:s');
  $sec=date('s');
  while ($sec<>0) {
    $sec=date('s');
    usleep(100000); // 100 milliseconds
  }
  $timeStamp=date('H:i:s');

  $handle1 = fopen("http://uk.site.sports.betfair.com/betting/LoadMarketDataAction.do?mi=100410408", "rb");//matchodds
  //sleep(2);
  $handle2 = fopen("http://uk.site.sports.betfair.com/betting/LoadMarketDataAction.do?mi=100410411", "rb");//uo1.5goals
  //sleep(2);
  $handle3 = fopen("http://uk.site.sports.betfair.com/betting/LoadMarketDataAction.do?mi=100410403", "rb");//uo2.5goals
  //sleep(2);
  $handle4 = fopen("http://uk.site.sports.betfair.com/betting/LoadMarketDataAction.do?mi=100410412", "rb");//uo3.5goals
  //sleep(2);
  $handle5 = fopen("http://uk.site.sports.betfair.com/betting/LoadMarketDataAction.do?mi=100410413", "rb");//uo4.5goals
  //sleep(2);
  $handle6 = fopen("http://uk.site.sports.betfair.com/betting/LoadMarketDataAction.do?mi=100410406", "rb");//Correct Score
	
  //$handle1 = fopen("g:\\php\\match_odds.htm", "rb");
 //*****************************************************************************************************************************************  
  $contents = ""; $myFile = "$parentdir\\match_odds.txt";
  if($handle1) {
    while (!feof($handle1)) {
      $contents .= fread($handle1, 8192);
    }
    fclose($handle1);
	$timeStamp=date('H:i:s'); // remove later
  }
  else
    echo "Failed to read page - match odds!\n\n";
    /*************************** [ Money Matched] ***************************/
  if (preg_match("/p.m_ReM\(.*\)/U", $contents, $tmp)) {
    preg_match("/\'.*\'/U", $tmp[0], $temp);
    $totalMoneyMatched = preg_replace("/[^0-9]/", '', $temp[0]);
    /*************************** [/Money Matched] ***************************/
  
    /*************************** [ Match Odds] ***************************/
    if(preg_match_all("/p.m_rRM\(.*\)/U", $contents, $odds)) {
    /*************************** [/Match Odds] ***************************/
      echo $timeStamp;
      echo ",$totalMoneyMatched,";
      echo $odds[0][0];
	  echo $odds[0][1];
	  echo $odds[0][2]."\n\n";
      
      
      $fh = fopen($myFile, 'a') or die("can't open file");
      $stringData =  $timeStamp . "," . $totalMoneyMatched . "," . $odds[0][0] . "," . $odds[0][1] . "," .  $odds[0][2] . "\n";
      fwrite($fh, $stringData);
      fclose($fh);
    }
  }

  //*****************************************************************************************************************************************  
  $contents = ""; $myFile = "$parentdir\\uo1_5.txt";
  if($handle2) {
    while (!feof($handle2)) {
      $contents .= fread($handle2, 8192);
    }
    fclose($handle2);
	$timeStamp=date('H:i:s'); // remove later
  } 
  else
    echo "Failed to read page - under 1.5!\n\n";
    /*************************** [ Money Matched] ***************************/
  if (preg_match("/p.m_ReM\(.*\)/U", $contents, $tmp)) {
    preg_match("/\'.*\'/U", $tmp[0], $temp);
    $totalMoneyMatched = preg_replace("/[^0-9]/", '', $temp[0]);
    /*************************** [/Money Matched] ***************************/
  
    /*************************** [ Under 1.5 Goals] ***************************/
    preg_match("/p.m_rRM\(.*Under 1\.5.*\)/U", $contents, $U15);
    $contents = preg_replace("/p.m_rRM\(.*Under 1\.5.*\)/U", '', $contents);
    /*************************** [/Under 1.5 Goals] ***************************/

    /*************************** [ Over 1.5 Goals] ***************************/
    if(preg_match("/p.m_rRM\(.*Over 1\.5.*\)/U", $contents, $O15)) {
    /*************************** [/Over 1.5 Goals] ***************************/
      echo $timeStamp;
      echo ",$totalMoneyMatched,";
      echo "$U15[0],";
      echo "$O15[0]\n\n";
      
      
      $fh = fopen($myFile, 'a') or die("can't open file");
      $stringData =  $timeStamp . ",$totalMoneyMatched,$U15[0],$O15[0]\n";
      fwrite($fh, $stringData);
      fclose($fh);
    }
  }
   
  //*****************************************************************************************************************************************  
  $contents = ""; $myFile = "$parentdir\\uo2_5.txt";
  if($handle3) {
    while (!feof($handle3)) {
      $contents .= fread($handle3, 8192);
    }
    fclose($handle3);
	$timeStamp=date('H:i:s'); // remove later
  }
  else
    echo "Failed to read page - under 2.5!\n\n";
    /*************************** [ Money Matched] ***************************/
  if (preg_match("/p.m_ReM\(.*\)/U", $contents, $tmp)) {
    preg_match("/\'.*\'/U", $tmp[0], $temp);
    $totalMoneyMatched = preg_replace("/[^0-9]/", '', $temp[0]);
    /*************************** [/Money Matched] ***************************/
  
    /*************************** [ Under 2.5 Goals] ***************************/
    preg_match("/p.m_rRM\(.*Under 2\.5.*\)/U", $contents, $U25);
    $contents = preg_replace("/p.m_rRM\(.*Under 2\.5.*\)/U", '', $contents);
    /*************************** [/Under 2.5 Goals] ***************************/

    /*************************** [ Over 2.5 Goals] ***************************/
    if(preg_match("/p.m_rRM\(.*Over 2\.5.*\)/U", $contents, $O25)) {
    /*************************** [/Over 2.5 Goals] ***************************/
      echo $timeStamp;
      echo ",$totalMoneyMatched,";
      echo "$U25[0],";
      echo "$O25[0]\n\n";
      
      
      $fh = fopen($myFile, 'a') or die("can't open file");
      $stringData =  $timeStamp . ",$totalMoneyMatched,$U25[0],$O25[0]\n";
      fwrite($fh, $stringData);
      fclose($fh);
    }
  }
  
  //*****************************************************************************************************************************************  
  $contents = ""; $myFile = "$parentdir\\uo3_5.txt";
  if($handle4) {
    while (!feof($handle4)) {
      $contents .= fread($handle4, 8192);
    }
    fclose($handle4);
	$timeStamp=date('H:i:s'); // remove later
  }
  else
    echo "Failed to read page - under 3.5!\n\n";
    /*************************** [ Money Matched] ***************************/
  if (preg_match("/p.m_ReM\(.*\)/U", $contents, $tmp)) {
    preg_match("/\'.*\'/U", $tmp[0], $temp);
    $totalMoneyMatched = preg_replace("/[^0-9]/", '', $temp[0]);
    /*************************** [/Money Matched] ***************************/
  
    /*************************** [ Under 3.5 Goals] ***************************/
    preg_match("/p.m_rRM\(.*Under 3\.5.*\)/U", $contents, $U35);
    $contents = preg_replace("/p.m_rRM\(.*Under 3\.5.*\)/U", '', $contents);
    /*************************** [/Under 3.5 Goals] ***************************/

    /*************************** [ Over 3.5 Goals] ***************************/
    if(preg_match("/p.m_rRM\(.*Over 3\.5.*\)/U", $contents, $O35)) {
    /*************************** [/Over 3.5 Goals] ***************************/
      echo $timeStamp;
      echo ",$totalMoneyMatched,";
      echo "$U35[0],";
      echo "$O35[0]\n\n";
      
      
      $fh = fopen($myFile, 'a') or die("can't open file");
      $stringData =  $timeStamp . ",$totalMoneyMatched,$U35[0],$O35[0]\n";
      fwrite($fh, $stringData);
      fclose($fh);
    }
  }
  //*****************************************************************************************************************************************  
  $contents = ""; $myFile = "$parentdir\\uo4_5.txt";
  if($handle5) {
    while (!feof($handle5)) {
      $contents .= fread($handle5, 8192);
    }
    fclose($handle5);
	$timeStamp=date('H:i:s'); // remove later
  }
  else
    echo "Failed to read page - under 4.5!\n\n";
      
    /*************************** [ Money Matched] ***************************/
  if (preg_match("/p.m_ReM\(.*\)/U", $contents, $tmp)) {
    preg_match("/\'.*\'/U", $tmp[0], $temp);
    $totalMoneyMatched = preg_replace("/[^0-9]/", '', $temp[0]);
    /*************************** [/Money Matched] ***************************/
  
    /*************************** [ Under 4.5 Goals] ***************************/
    preg_match("/p.m_rRM\(.*Under 4\.5.*\)/U", $contents, $U45);
    $contents = preg_replace("/p.m_rRM\(.*Under 4\.5.*\)/U", '', $contents);
    /*************************** [/Under 4.5 Goals] ***************************/

    /*************************** [ Over 4.5 Goals] ***************************/
    if(preg_match("/p.m_rRM\(.*Over 4\.5.*\)/U", $contents, $O45)) {
    /*************************** [/Over 4.5 Goals] ***************************/
      echo $timeStamp;
      echo ",$totalMoneyMatched,";
      echo "$U45[0],";
      echo "$O45[0]\n\n";
      
      
      $fh = fopen($myFile, 'a') or die("can't open file");
      $stringData =  $timeStamp . ",$totalMoneyMatched,$U45[0],$O45[0]\n";
      fwrite($fh, $stringData);
      fclose($fh);
    }
  }
  //*****************************************************************************************************************************************  
  $contents = ""; $myFile = "$parentdir\\correct_score.txt";
  
  $contents00 = ""; $myFile00 = "$parentdir\\correct_score_0-0.txt";
  $contents01 = ""; $myFile01 = "$parentdir\\correct_score_0-1.txt";
  $contents02 = ""; $myFile02 = "$parentdir\\correct_score_0-2.txt";
  $contents03 = ""; $myFile03 = "$parentdir\\correct_score_0-3.txt";
  $contents04 = ""; $myFile04 = "$parentdir\\correct_score_1-0.txt";
  $contents05 = ""; $myFile05 = "$parentdir\\correct_score_1-1.txt";
  $contents06 = ""; $myFile06 = "$parentdir\\correct_score_1-2.txt";
  $contents07 = ""; $myFile07 = "$parentdir\\correct_score_1-3.txt";
  $contents08 = ""; $myFile08 = "$parentdir\\correct_score_2-0.txt";
  $contents09 = ""; $myFile09 = "$parentdir\\correct_score_2-1.txt";
  $contents10 = ""; $myFile10 = "$parentdir\\correct_score_2-2.txt";
  $contents11 = ""; $myFile11 = "$parentdir\\correct_score_2-3.txt";
  $contents12 = ""; $myFile12 = "$parentdir\\correct_score_3-0.txt";
  $contents13 = ""; $myFile13 = "$parentdir\\correct_score_3-1.txt";
  $contents14 = ""; $myFile14 = "$parentdir\\correct_score_3-2.txt";
  $contents15 = ""; $myFile15 = "$parentdir\\correct_score_3-3.txt";
  $contents16 = ""; $myFile16 = "$parentdir\\correct_score_AUQ.txt";
  
  
  
  if($handle6) {
    while (!feof($handle6)) {
      $contents .= fread($handle6, 8192);
    }
    fclose($handle6);
	$timeStamp=date('H:i:s'); // remove later
  }
  else
    echo "Failed to read page - correct score!\n\n";
    /*************************** [ Money Matched] ***************************/
  if (preg_match("/p.m_ReM\(.*\)/U", $contents, $tmp)) {
    preg_match("/\'.*\'/U", $tmp[0], $temp);
    $totalMoneyMatched = preg_replace("/[^0-9]/", '', $temp[0]);
    /*************************** [/Money Matched] ***************************/
  
    /*************************** [ correct_score] ***************************/
    if(preg_match_all("/p.m_rRM\(.*\)/U", $contents, $odds)) {
    /*************************** [/correct_score] ***************************/
      echo $timeStamp;
      echo ",$totalMoneyMatched,";
      // Print all 17 correct-score selections (indexes 0 to 16).
      for ($n = 0; $n <= 16; $n++) {
        echo $odds[0][$n];
      }
      echo "\n\n";

      // Append every selection to the combined file as one CSV row.
      $stringData = $timeStamp . "," . $totalMoneyMatched;
      for ($n = 0; $n <= 16; $n++) {
        $stringData .= "," . $odds[0][$n];
      }
      $stringData .= "\n";
      $fh = fopen($myFile, 'a') or die("can't open file");
      fwrite($fh, $stringData);
      fclose($fh);

      // Append each selection to its own per-score file, in the same order as above.
      $scoreFiles = array(
        $myFile00, $myFile01, $myFile02, $myFile03, $myFile04, $myFile05,
        $myFile06, $myFile07, $myFile08, $myFile09, $myFile10, $myFile11,
        $myFile12, $myFile13, $myFile14, $myFile15, $myFile16
      );
      foreach ($scoreFiles as $n => $scoreFile) {
        $fh = fopen($scoreFile, 'a') or die("can't open file");
        $stringData = $timeStamp . "," . $totalMoneyMatched . "," . $odds[0][$n] . "\n";
        fwrite($fh, $stringData);
        fclose($fh);
      }
    }
  }

  //*****************************************************************************************************************************************  
  sleep(2);
  $i++;
}
?>
 
Use cURL to fetch the page, Tidy to munge it into XHTML and XPath to extract the bits you want. Regex is faster, but a lot more fragile. I work for a company that builds software to do precisely this, so I have spent a fair amount of time profiling the performance of the various approaches.
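
As a rough illustration of the XPath step, here's a sketch in Python using the standard-library xml.etree.ElementTree, which supports a limited subset of XPath. It assumes the page has already been fetched and tidied into well-formed XHTML; the table markup, the "odds" id, the class names and the values are all made up.

```python
import xml.etree.ElementTree as ET

# Made-up, already-tidied XHTML standing in for a real page.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="odds">
      <tr><td class="team">Home</td><td class="price">2.4</td></tr>
      <tr><td class="team">Away</td><td class="price">3.1</td></tr>
    </table>
  </body>
</html>"""

# XHTML elements live in this namespace, so the paths need a prefix for it.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_rows(xhtml):
    """Return a (team, price) tuple for each row of the hypothetical odds table."""
    root = ET.fromstring(xhtml)
    rows = []
    for tr in root.findall(".//x:table[@id='odds']/x:tr", NS):
        team = tr.find("x:td[@class='team']", NS).text
        price = tr.find("x:td[@class='price']", NS).text
        rows.append((team, price))
    return rows


print(extract_rows(XHTML))
```

Because the paths describe where the data sits in the document tree, cosmetic changes to whitespace or attribute order don't break them the way they can break a regex.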
 