Stripping data from a website

Soldato · Joined 27 Dec 2005 · Posts 17,296 · Location Bristol
There's a directory website with publicly accessible information that we just need in an Excel sheet at our end.

An external site lists the entries under consecutive IDs, each of which links to the entry's individual page; those individual page IDs follow no obvious order. Each individual page doesn't hold much information, and every piece of it is labelled with a div or li. For example, the address sits in li elements whose ids end in "Town", "Country", "Postcode" etc.

I'm only really familiar with PHP but the script would effectively need to:

- Start at www.directory.com/0001 and go to 3000
- For each, find the link that starts "www.directory2.com" (only one link per page) and go there
- Dump the contents of id="Town" etc into a database

Is this possible? I know PHP may not be clean or neat, but for a one-off thing I'm not fussed about that. Hell, it doesn't even need to dump it to a database, just echo it with commas for saving as a comma-delimited CSV. Something like the rough sketch below, I imagine.
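Roughly what I have in mind, as a skeleton; the URLs, zero-padding and field names below are just placeholders:

PHP:
// Rough sketch only: URLs, ID padding and field names are placeholders
for ($i = 1; $i <= 3000; $i++) {
    $id = str_pad($i, 4, '0', STR_PAD_LEFT);              // 0001 ... 3000
    $page = file_get_contents("http://www.directory.com/$id");

    // 1. find the single link to www.directory2.com in $page and fetch it
    // 2. pull out id="Town", id="Country", id="Postcode" etc. from that page
    // 3. echo "$town,$country,$postcode\n";  // paste the output into a .csv
}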
 
Soldato (OP) · Joined 27 Dec 2005 · Posts 17,296 · Location Bristol
Made a quick start whilst I had 10 minutes before 5pm. Am I on the right track? I've not used preg_match much before.

PHP:
<?php
$number = 6171;

// Double quotes so $number is actually interpolated into the URL
$content = file_get_contents("http://www.xxx.net/$number/");

// Pull the ID out of the single link to the second site
preg_match('#href="http://www\.xxx2\.co\.uk/dir_(.*?)\.htm#', $content, $match);
$id = $match[1];

$newcontent = file_get_contents("http://www.xxx2.co.uk/dir_$id.htm");

preg_match('#Name">(.*?)</h1>#', $newcontent, $match);
$name = $match[1];

preg_match('#LocationDetail">(.*?)</h2>#', $newcontent, $match);
$address = $match[1];

preg_match('#contactTelephone" class="infoDetail">(.*?)</span>#', $newcontent, $match);
$telephone = $match[1];

preg_match('#contactEmail" class="infoDetail">(.*?)</a>#', $newcontent, $match);
$email = $match[1];

// etc.

And FYI there are about 1,500 records, so not a massive amount.
 
Soldato · Joined 18 Oct 2002 · Posts 15,411 · Location The land of milk & beans
I don't know about the specifics of PHP, but I have done this before in C# using XPath. The main thing to be careful of is flooding the remote server with requests, especially if it belongs to a third party. You're probably best doing this in batches of ~100 requests at a time, spread over a couple of minutes, depending on how hot they are on DDoS.
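Whatever language it ends up in, the throttling is just a short pause between requests and a longer one between batches; something like this (URL and timings purely illustrative):

PHP:
// Hypothetical throttling sketch: short pause per request, longer pause per batch
for ($number = 1; $number <= 3000; $number++) {
    $id = str_pad($number, 4, '0', STR_PAD_LEFT);
    $content = file_get_contents("http://www.directory.com/$id");
    // ... parse $content here ...

    usleep(500000);                 // half a second between requests
    if ($number % 100 == 0) {
        sleep(60);                  // breather after every batch of ~100
    }
}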
 
Soldato · Joined 6 Mar 2008 · Posts 10,079 · Location Stoke area
Extremely easy to knock this up in Python; web scrapers are simple enough to do.

Writing to Excel can be a pain, but a plain text file would be simple enough, or a CSV file (with varied results). I recently had to do this with a football website with some 25k entries and it was actually fun working it out.

Did you manage to get it done?
 
Soldato (OP) · Joined 27 Dec 2005 · Posts 17,296 · Location Bristol
Yeah, managed it, running the final bit of it now as it happens :). The basics were pretty simple, but I then spent a bit of time checking both pages' headers for a 404 and looping the file_get_contents until it succeeds (the second site in particular is terribly slow, so it just kept failing without that).
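The retry bit ended up as something like this (helper name and limits made up):

PHP:
// Rough shape of the retry wrapper (function name and limits made up)
function fetch_with_retry($url, $tries = 5) {
    // Check the headers first and skip pages that 404
    $headers = @get_headers($url);
    if ($headers && strpos($headers[0], '404') !== false) {
        return false;
    }
    // Keep retrying until the slow site finally answers
    for ($i = 0; $i < $tries; $i++) {
        $content = @file_get_contents($url);
        if ($content !== false) {
            return $content;
        }
        sleep(2);
    }
    return false;
}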

Got it to just echo the results with commas (after replacing commas in the source with //); a quick copy and paste into Notepad++, save as a .csv, replace // with , and job's a goodun!
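The output line itself is just something along these lines (same variables as the earlier snippet):

PHP:
// Swap real commas for // so the columns stay intact, then echo one row per entry
echo str_replace(',', '//', $name) . ',' .
     str_replace(',', '//', $address) . ',' .
     str_replace(',', '//', $telephone) . ',' .
     str_replace(',', '//', $email) . "\n";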

Like you say, quite fun working it out, especially knowing that maybe an hour or two of coding has saved days of human time.
 
Soldato (OP) · Joined 27 Dec 2005 · Posts 17,296 · Location Bristol
One last bit: I've got the IDs of 70 pages that failed. I don't usually do much with arrays, so how would one list them all in an array and then just run through the array until they're all done?

Never mind, on it: foreach etc.
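For anyone curious, it boils down to something like this (IDs and URL as per the earlier snippet, just for illustration):

PHP:
// Failed IDs collected from the first run (values made up)
$failed = array(6171, 6284, 6302 /* ... */);

foreach ($failed as $id) {
    $newcontent = file_get_contents("http://www.xxx2.co.uk/dir_$id.htm");
    // ... same parsing and echoing as before ...
}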
 
Associate · Joined 7 Aug 2011 · Posts 726 · Location Planet Earth
Also, instead of preg_matching, use XPath as has been mentioned before.

And just a tip for testing: save one of the documents locally and test on that, so as not to hit the slow third-party site.
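In PHP that would be DOMDocument plus DOMXPath; a minimal sketch against a locally saved copy, assuming the li ids really do end in "Town" etc. (file name made up):

PHP:
// Minimal sketch: load a locally saved copy and query it with XPath instead of regexes
$doc = new DOMDocument();
@$doc->loadHTMLFile('entry_local_copy.htm');   // @ hides warnings about untidy HTML
$xpath = new DOMXPath($doc);

// XPath 1.0 has no ends-with(), so compare the last four characters of the id
$nodes = $xpath->query('//li[substring(@id, string-length(@id) - 3) = "Town"]');
if ($nodes->length > 0) {
    echo trim($nodes->item(0)->nodeValue) . "\n";
}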
 