Stripping data from a website

Soldato
There's a directory website with publicly accessible information that we just need in an Excel sheet at our end.

An external site lists the entries under consecutive IDs, each linking to the entry's individual page, whose own ID follows no obvious order. Each individual page doesn't hold a huge amount of information, and each piece of info is labelled via a div or li. For example, the address is listed in li elements whose ids end in "Town", "Country", "Postcode" etc.

I'm only really familiar with PHP but the script would effectively need to:

- Start at www.directory.com/0001 and go to 3000
- For each, find the link that starts "www.directory2.com" (only one link per page) and go there
- Dump the contents of id="Town" etc into a database

Is this possible? I know PHP may not be clean or neat but for a one-off thing I'm not fussed about that. Hell it doesn't even need to dump it to a database, just echo it with commas for saving as a comma-delimited CSV.
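
In outline I'm imagining something like this (untested; the hostnames, the zero-padding and the field labels are all placeholders taken from the description above):

PHP:
<?php
// Rough outline, untested; hostnames, zero-padded IDs and the
// field labels are placeholders from the description above.
for ($i = 1; $i <= 3000; $i++) {
    $id = str_pad($i, 4, '0', STR_PAD_LEFT);           // 0001 ... 3000
    $page = @file_get_contents("http://www.directory.com/$id");
    if ($page === false) {
        continue;                                      // dead or missing ID
    }

    // the only link on the page that points at the second site
    if (preg_match('#href="(http://www\.directory2\.com/[^"]+)"#', $page, $m)) {
        $detail = @file_get_contents($m[1]);
        // ...then pull id="Town", id="Country", id="Postcode" etc. out of
        // $detail and echo the values joined with commas, one line per entry
    }
}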
 
Soldato
OP
Made a quick start whilst I had 10 minutes before 5pm. Is this on the right track? Not used preg_match much before.

PHP:
<?php
$number = 6171;

// Double quotes so $number actually gets interpolated (single quotes won't expand it)
$content = file_get_contents("http://www.xxx.net/$number/");

// Escaped dots and a non-greedy capture so the match stops at the first .htm
preg_match('#href="http://www\.xxx2\.co\.uk/dir_(.*?)\.htm#', $content, $match);
$id = $match[1];

$newcontent = file_get_contents("http://www.xxx2.co.uk/dir_$id.htm");

preg_match('#Name">(.*?)</h1>#', $newcontent, $match);
$name = $match[1];

preg_match('#LocationDetail">(.*?)</h2>#', $newcontent, $match);
$address = $match[1];

preg_match('#contactTelephone" class="infoDetail">(.*?)</span>#', $newcontent, $match);
$telephone = $match[1];

preg_match('#contactEmail" class="infoDetail">(.*?)</a>#', $newcontent, $match);
$email = $match[1];

etc.

And FYI there are 1,500 records, so not a massive amount.
 
Soldato
OP
Yeah, managed it; running the final bit of it now as it happens :). The basics were pretty simple, but I then spent a while checking both pages' headers for a 404 and looping the file_get_contents until it succeeds (the second site in particular is terribly slow, so it just kept failing without that).
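
The retry bit was roughly this shape (not the exact code; the attempt cap and the sleep are just numbers that worked, nothing scientific):

PHP:
<?php
// Fetch a URL: give up straight away on a real 404, but retry plain
// failures, since the slow site often just times out.
function fetch_with_retry($url, $maxAttempts = 5)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $content = @file_get_contents($url);

        // file_get_contents populates $http_response_header for http:// URLs
        if (isset($http_response_header[0]) && strpos($http_response_header[0], '404') !== false) {
            return false;              // genuine 404: retrying won't help
        }
        if ($content !== false) {
            return $content;
        }
        sleep(1);                      // give the slow site a breather
    }
    return false;                      // still failing: log the ID for a re-run
}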

Got it to just echo the results with commas (after replacing commas in the source with //); a quick copy and paste into Notepad++, save as a .csv, replace // with , and job's a goodun!
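
If I did it again I'd probably skip the // swap and let PHP's fputcsv do the quoting instead, something like:

PHP:
<?php
// fputcsv quotes any field that contains a comma itself, so no // swap needed
$out = fopen('results.csv', 'w');
fputcsv($out, array('Some Name', '1 High Street, Bristol', '0117 000 0000'));
fclose($out);

But the Notepad++ route got the job done.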

Like you say, quite fun working it out, especially knowing that maybe an hour or two of coding has saved days of human time.
 
Soldato
OP
One last bit. I've got the IDs of 70 pages that failed. I don't usually do much with arrays, so how would one list them all in an array and then just run through the array until they're all done?

Never mind, on it: foreach etc.
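
For anyone following along, it's just something like this (the IDs here are made up, apart from the one from my earlier post):

PHP:
<?php
// The page IDs that failed on the first pass; these ones are examples
$failed = array(6171, 6384, 7012 /* ...and the other 67... */);

foreach ($failed as $number) {
    // same fetch-and-parse code as before, just for this $number
}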
 