Stripping data from a website

Soldato · Joined 27 Dec 2005 · Posts 17,296 · Location Bristol
There's a directory website with publicly accessible information that we just need in an Excel sheet at our end.

An external site lists the entries under consecutive IDs, each of which links to the entry's individual page; those individual page IDs follow no obvious order. Each individual page doesn't hold much information, and every piece of it is labelled with a div or li. For example, the address sits in li elements whose ids end in "Town", "Country", "Postcode" etc.

I'm only really familiar with PHP but the script would effectively need to:

- Start at www.directory.com/0001 and go to 3000
- For each, find the link that starts "www.directory2.com" (only one link per page) and go there
- Dump the contents of id="Town" etc into a database

Is this possible? I know PHP may not be clean or neat, but for a one-off thing I'm not fussed about that. Hell, it doesn't even need to dump it to a database, just echo it with commas for saving as a comma-delimited CSV. Something like the rough sketch below, I imagine.
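Roughly what I have in mind, as a skeleton; the URLs, zero-padding and field names below are just placeholders:

PHP:
// Rough sketch only: URLs, ID padding and field names are placeholders
for ($i = 1; $i <= 3000; $i++) {
    $id = str_pad($i, 4, '0', STR_PAD_LEFT);              // 0001 ... 3000
    $page = file_get_contents("http://www.directory.com/$id");

    // 1. find the single link to www.directory2.com in $page and fetch it
    // 2. pull out id="Town", id="Country", id="Postcode" etc. from that page
    // 3. echo "$town,$country,$postcode\n";  // paste the output into a .csv
}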
 
Soldato (OP) · Joined 27 Dec 2005 · Posts 17,296 · Location Bristol
Made a quick start whilst I had 10 minutes before 5pm. Am I on the right track? I've not used preg_match much before.

PHP:
<?php
$number = 6171;

// Double quotes so $number is actually interpolated into the URL
$content = file_get_contents("http://www.xxx.net/$number/");

// Pull the ID out of the single link to the second site
preg_match('#href="http://www\.xxx2\.co\.uk/dir_(.*?)\.htm#', $content, $match);
$id = $match[1];

$newcontent = file_get_contents("http://www.xxx2.co.uk/dir_$id.htm");

preg_match('#Name">(.*?)</h1>#', $newcontent, $match);
$name = $match[1];

preg_match('#LocationDetail">(.*?)</h2>#', $newcontent, $match);
$address = $match[1];

preg_match('#contactTelephone" class="infoDetail">(.*?)</span>#', $newcontent, $match);
$telephone = $match[1];

preg_match('#contactEmail" class="infoDetail">(.*?)</a>#', $newcontent, $match);
$email = $match[1];

// etc.

And FYI there are about 1,500 records, so not a massive amount.
 
Soldato · Joined 18 Oct 2002 · Posts 15,411 · Location The land of milk & beans
I don't know about the specifics of PHP, but I have done this before in C# using XPath. The main thing to be careful of is flooding the remote server with requests, especially if it belongs to a third party. You're probably best doing this in batches of ~100 requests at a time, spread over a couple of minutes, depending on how hot they are on DDoS.
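Whatever language it ends up in, the throttling is just a short pause between requests and a longer one between batches; something like this (URL and timings purely illustrative):

PHP:
// Hypothetical throttling sketch: short pause per request, longer pause per batch
for ($number = 1; $number <= 3000; $number++) {
    $id = str_pad($number, 4, '0', STR_PAD_LEFT);
    $content = file_get_contents("http://www.directory.com/$id");
    // ... parse $content here ...

    usleep(500000);                 // half a second between requests
    if ($number % 100 == 0) {
        sleep(60);                  // breather after every batch of ~100
    }
}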
 
Soldato · Joined 6 Mar 2008 · Posts 10,079 · Location Stoke area
Extremely easy to knock this up in Python; web scrapers are simple enough to do.

Writing to Excel can be a pain, but a plain text file would be simple enough, or a CSV file (with varied results). I recently had to do this with a football website with some 25k entries and it was actually fun working it out.

Did you manage to get it done?
 
Soldato (OP) · Joined 27 Dec 2005 · Posts 17,296 · Location Bristol
Yeah, managed it, running the final bit of it now as it happens :). The basics were pretty simple, but I then spent a bit of time checking both pages' headers for a 404 and looping the file_get_contents until it succeeds (the second site in particular is terribly slow, so it just kept failing without that).
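The retry bit ended up as something like this (helper name and limits made up):

PHP:
// Rough shape of the retry wrapper (function name and limits made up)
function fetch_with_retry($url, $tries = 5) {
    // Check the headers first and skip pages that 404
    $headers = @get_headers($url);
    if ($headers && strpos($headers[0], '404') !== false) {
        return false;
    }
    // Keep retrying until the slow site finally answers
    for ($i = 0; $i < $tries; $i++) {
        $content = @file_get_contents($url);
        if ($content !== false) {
            return $content;
        }
        sleep(2);
    }
    return false;
}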

Got it to just echo the results with commas (after replacing commas in the source with //); a quick copy and paste into Notepad++, save as a .csv, replace // with , and job's a goodun!
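The output line itself is just something along these lines (same variables as the earlier snippet):

PHP:
// Swap real commas for // so the columns stay intact, then echo one row per entry
echo str_replace(',', '//', $name) . ',' .
     str_replace(',', '//', $address) . ',' .
     str_replace(',', '//', $telephone) . ',' .
     str_replace(',', '//', $email) . "\n";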

Like you say, quite fun working it out, especially knowing that maybe an hour or two of coding has saved days of human time.
 
Soldato (OP) · Joined 27 Dec 2005 · Posts 17,296 · Location Bristol
One last bit: I've got the IDs of 70 pages that failed. I don't usually do much with arrays, so how would one list them all in an array and then just run through the array until they're all done?

Never mind, on it: foreach etc.
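For anyone curious, it boils down to something like this (IDs and URL as per the earlier snippet, just for illustration):

PHP:
// Failed IDs collected from the first run (values made up)
$failed = array(6171, 6284, 6302 /* ... */);

foreach ($failed as $id) {
    $newcontent = file_get_contents("http://www.xxx2.co.uk/dir_$id.htm");
    // ... same parsing and echoing as before ...
}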
 
Associate · Joined 7 Aug 2011 · Posts 726 · Location Planet Earth
Also, instead of preg_matching, use XPath as has been mentioned before.

And just a tip for testing: save one of the documents locally and test on that, so as not to hit the slow third-party site.
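In PHP that would be DOMDocument plus DOMXPath; a minimal sketch against a locally saved copy, assuming the li ids really do end in "Town" etc. (file name made up):

PHP:
// Minimal sketch: load a locally saved copy and query it with XPath instead of regexes
$doc = new DOMDocument();
@$doc->loadHTMLFile('entry_local_copy.htm');   // @ hides warnings about untidy HTML
$xpath = new DOMXPath($doc);

// XPath 1.0 has no ends-with(), so compare the last four characters of the id
$nodes = $xpath->query('//li[substring(@id, string-length(@id) - 3) = "Town"]');
if ($nodes->length > 0) {
    echo trim($nodes->item(0)->nodeValue) . "\n";
}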
 