Stripping data from a website

Soldato
There's a directory website with publicly accessible information that we just need in an Excel sheet at our end.

An external site lists the entries under consecutive IDs, each linking to the entry's individual page, whose own ID follows no obvious order. Each individual page doesn't hold a huge amount of information, and each piece of info is labelled via a div or li. For example, the address is listed in li elements whose ids end in "Town", "Country", "Postcode" etc.

I'm only really familiar with PHP but the script would effectively need to:

- Start at www.directory.com/0001 and go to 3000
- For each, find the link that starts "www.directory2.com" (only one link per page) and go there
- Dump the contents of id="Town" etc into a database

Is this possible? I know PHP may not be clean or neat but for a one-off thing I'm not fussed about that. Hell it doesn't even need to dump it to a database, just echo it with commas for saving as a comma-delimited CSV.
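
In outline I'm imagining something like this (untested; the hostnames, the zero-padding and the field labels are all placeholders taken from the description above):

PHP:
<?php
// Rough outline, untested; hostnames, zero-padded IDs and the
// field labels are placeholders from the description above.
for ($i = 1; $i <= 3000; $i++) {
    $id = str_pad($i, 4, '0', STR_PAD_LEFT);           // 0001 ... 3000
    $page = @file_get_contents("http://www.directory.com/$id");
    if ($page === false) {
        continue;                                      // dead or missing ID
    }

    // the only link on the page that points at the second site
    if (preg_match('#href="(http://www\.directory2\.com/[^"]+)"#', $page, $m)) {
        $detail = @file_get_contents($m[1]);
        // ...then pull id="Town", id="Country", id="Postcode" etc. out of
        // $detail and echo the values joined with commas, one line per entry
    }
}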
 
Soldato
OP
Made a quick start whilst I had 10 minutes before 5pm. Is this on the right track? Not used preg_match much before.

PHP:
<?php
$number = 6171;

// Double quotes so $number actually gets interpolated (single quotes won't expand it)
$content = file_get_contents("http://www.xxx.net/$number/");

// Escaped dots and a non-greedy capture so the match stops at the first .htm
preg_match('#href="http://www\.xxx2\.co\.uk/dir_(.*?)\.htm#', $content, $match);
$id = $match[1];

$newcontent = file_get_contents("http://www.xxx2.co.uk/dir_$id.htm");

preg_match('#Name">(.*?)</h1>#', $newcontent, $match);
$name = $match[1];

preg_match('#LocationDetail">(.*?)</h2>#', $newcontent, $match);
$address = $match[1];

preg_match('#contactTelephone" class="infoDetail">(.*?)</span>#', $newcontent, $match);
$telephone = $match[1];

preg_match('#contactEmail" class="infoDetail">(.*?)</a>#', $newcontent, $match);
$email = $match[1];

etc.

And FYI there are 1,500 records, so not a massive amount.
 
Soldato
OP
Yeah, managed it; running the final bit of it now as it happens :). The basics were pretty simple, but I then spent a while checking both pages' headers for a 404 and looping the file_get_contents until it succeeds (the second site in particular is terribly slow, so it just kept failing without that).
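
The retry bit was roughly this shape (not the exact code; the attempt cap and the sleep are just numbers that worked, nothing scientific):

PHP:
<?php
// Fetch a URL: give up straight away on a real 404, but retry plain
// failures, since the slow site often just times out.
function fetch_with_retry($url, $maxAttempts = 5)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $content = @file_get_contents($url);

        // file_get_contents populates $http_response_header for http:// URLs
        if (isset($http_response_header[0]) && strpos($http_response_header[0], '404') !== false) {
            return false;              // genuine 404: retrying won't help
        }
        if ($content !== false) {
            return $content;
        }
        sleep(1);                      // give the slow site a breather
    }
    return false;                      // still failing: log the ID for a re-run
}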

Got it to just echo the results with commas (after replacing commas in the source with //); a quick copy and paste into Notepad++, save as a .csv, replace // with , and job's a goodun!
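
If I did it again I'd probably skip the // swap and let PHP's fputcsv do the quoting instead, something like:

PHP:
<?php
// fputcsv quotes any field that contains a comma itself, so no // swap needed
$out = fopen('results.csv', 'w');
fputcsv($out, array('Some Name', '1 High Street, Bristol', '0117 000 0000'));
fclose($out);

But the Notepad++ route got the job done.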

Like you say, quite fun working it out, especially knowing that maybe an hour or two of coding has saved days of human time.
 
Soldato
OP
One last bit. I've got the IDs of 70 pages that failed. I don't usually do much with arrays, so how would one list them all in an array and then just run through the array until they're all done?

Never mind, on it: foreach etc.
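
For anyone following along, it's just something like this (the IDs here are made up, apart from the one from my earlier post):

PHP:
<?php
// The page IDs that failed on the first pass; these ones are examples
$failed = array(6171, 6384, 7012 /* ...and the other 67... */);

foreach ($failed as $number) {
    // same fetch-and-parse code as before, just for this $number
}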
 