Web Scrape Help!

wonder_lander · 4 Feb 2010 at 13:06

I'm looking for some assistance in scraping the content off a particular website so that I can create a MySQL database of the information.

Has anyone got any good guides, or does anyone fancy offering some assistance with this?

markeh · 4 Feb 2010 at 18:07

What I'd do in PHP is use file_get_contents() to get the page, then strpos() to locate what you want to chop out, then substr() to plonk it into a variable for you to put in the db.

There are probably better ways...
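
For anyone who wants to see the shape of that find-and-chop approach, here is a rough sketch. It is written in Python rather than PHP, with str.find() standing in for strpos() and slicing standing in for substr(). The markers, the sample markup and the extract_between() helper are all made up for illustration and are not taken from any real site.

```python
def extract_between(page, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = page.find(start_marker)        # plays the role of strpos()
    if start == -1:
        return None
    start += len(start_marker)             # skip past the opening marker
    end = page.find(end_marker, start)     # search only after the start
    if end == -1:
        return None
    return page[start:end].strip()         # plays the role of substr()


# Hypothetical page fragment. A real page would be downloaded first,
# for example with urllib.request.urlopen(url).read().decode().
sample_page = (
    '<h1 class="title">Example Camera</h1>'
    '<span class="price">199.00</span>'
)

title = extract_between(sample_page, '<h1 class="title">', '</h1>')
price = extract_between(sample_page, '<span class="price">', '</span>')
print(title, price)
```

This only works while the markers stay exactly the same in the page source, so it tends to break when the site changes its layout.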

wonder_lander · 4 Feb 2010 at 18:15

That vaguely sounds like a plan but I have no experience at all on this.

Are there any guides or tutorials that you can recommend?

Maybe the easiest thing will be to pay someone to knock this out for me?

markeh · 4 Feb 2010 at 18:31

Yeah, you can try to pick it up. You'd need to learn how to access and write to a db (and how to create a db, tables etc. as well), how to deal with variables, and how to use the functions I've mentioned. I doubt there's one tutorial that would cover all the bases; you'd need to combine a few, I think.

If you have no PHP (or other language) experience, it would probably not be worth the hassle versus paying someone to do it. Depending on the page being scraped and what you want to get out of it, I doubt it would take very long for someone who knows what they're doing.
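
To give a feel for the database side (creating a table and writing scraped values into it), here is a minimal sketch. It assumes Python and uses the built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table layout, column names, sample row and save_products() helper are all hypothetical.

```python
import sqlite3


def save_products(conn, products):
    """Create the products table if needed and insert the given rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "id INTEGER PRIMARY KEY, "
        "name TEXT NOT NULL, "
        "price TEXT, "
        "image_url TEXT)"
    )
    # Placeholders (?) keep scraped text from being treated as SQL.
    conn.executemany(
        "INSERT INTO products (name, price, image_url) VALUES (?, ?, ?)",
        products,
    )
    conn.commit()


# In-memory database for the example; a MySQL setup would connect to a
# server through a separate driver instead.
conn = sqlite3.connect(":memory:")
scraped = [("Example Camera", "199.00", "http://example.com/img/camera.jpg")]
save_products(conn, scraped)

rows = conn.execute("SELECT name, price FROM products").fetchall()
print(rows)
```

The SQL itself would look much the same against MySQL; the connection code and the column types are what would differ.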

wonder_lander · 4 Feb 2010 at 18:44

This is the kind of thing that I am looking to scrape.

I'd like to grab each individual product page from, say, this category http://www.warehouseexpress.com/category/basecategory.aspx?cat03=3036 and then read and write to a db all of the details, including the tabs for stuff like images / reviews / specification etc. I am especially interested in grabbing the individual specification elements.

When scraping images do you just grab the URL or could you grab the image itself?

Does that sound like a difficult job?

RobH · 5 Feb 2010 at 14:23

You might want to look at scraping with XPath, as it makes it much easier to access specific page elements. It treats the page as an XML document, and you can then use XML queries to get specific data. You can even use the Firebug add-on in Firefox to assist you with writing the XPath queries.

Just google "XPATH scraping" and the programming language you are using (eg: PHP XPATH scraping).
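
As a rough illustration of the idea, here is a small sketch in Python using the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath and needs well-formed markup. Real-world HTML is usually messier and would need a more forgiving parser. The sample markup, the id="spec" table and the extract_specs() helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product page fragment.
sample_page = """
<html><body>
  <h1 class="title">Example Camera</h1>
  <table id="spec">
    <tr><th>Sensor</th><td>12 megapixels</td></tr>
    <tr><th>Weight</th><td>500 g</td></tr>
  </table>
</body></html>
"""


def extract_specs(markup):
    """Return a dict of specification name -> value from the spec table."""
    root = ET.fromstring(markup)
    specs = {}
    # XPath-style query: every row directly inside the table with id="spec".
    for row in root.findall(".//table[@id='spec']/tr"):
        specs[row.findtext("th")] = row.findtext("td")
    return specs


specs = extract_specs(sample_page)
print(specs)
```

The query string is the part Firebug can help with: inspect the element you want, work out its path, then adapt that path in code.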

suarve · 5 Feb 2010 at 14:32

There's a class file for Snoopy (class-snoopy.php) in the WordPress wp-includes folder that has a lot of useful stuff for content scraping.

wonder_lander · 5 Feb 2010 at 15:04

Looking at the example pages I wish to scrape does it look complicated?

wonder_lander · 8 Feb 2010 at 08:20

wonder_lander said:
Looking at the example pages I wish to scrape does it look complicated?

Anyone care to comment?

wonder_lander · 18 Feb 2010 at 21:06

Just wanted to reply to my own post to say that I've found this product and it appears to be superb! http://www.visualwebripper.com/Default.aspx