Website Crawlers

Hi all,

Just looking at creating another web portal, only this time I want it automated, so I'm looking into web crawlers. Checked out a few such as DataparkSearch, etc., but they're all a little heavy. I was hoping to find a simple open source option that I could modify slightly.

It will be a targeted crawl of specific URLs. I'd only need simple details found on each page, stored in a local DB for querying. Without giving too much away, think along the lines of property... yes, it has been done, but there is a niche that I want to cover.

Anyone got any ideas on best way to tackle this?
 
Can you not do what globrix does and just scrape the data from the URLs you need? Or as spunkey says, get a feed, nearly all property management/inventory software has the facility to export a rightmove feed and they're pish easy to import.
 
Feeds are a no-no, as there are a lot of sites that just have their own basic database and no feeds. I want to cover the whole lot, so screen scraping will be the way forward.

Will look into that idea. I was hoping there would be some sort of automated software that could analyze sites, etc., as otherwise I will need to adapt a screen scraper to each site (est. around 5000 sites I will be crawling).

globrix and ononemap are pretty much the direction I'm going in.
 
got me going on this today as well.

so this afternoon i got bored at work and started on one.

i had the idea of scraping existing scrapers but wanted a faster response time.

so custom scraper r us.

kinda got the prototype working but need to add other sites.

thanks for the motivation :)
 
This is the problem: I'm looking at thousands of different sites with different layouts, so it will be very tricky to work out a screen scraper that will work across every platform.
 
yes i found that also, but having said that, no pain no gain.

if it was easy there would be a billion scrapers doing whatever you're planning.

i'm lucky as my scraper will work on about 100 different items on approximately 30 different sites easy enough.

i'm just creating custom regular expressions for dealing with each site and pulling out the data i want from each. seems to work, though it's very crude.
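
For anyone following along, the per-site regular expression idea could look roughly like the Python sketch below. This is not the poster's actual code: the site names, field names, HTML shapes, and the `extract_fields` helper are all made up for illustration, and real pages would need their own patterns.

```python
import re

# Hypothetical per-site patterns. The site names and the HTML they match
# are assumptions for illustration only.
SITE_PATTERNS = {
    "site-a.example": {
        "price": re.compile(r'<span class="price">\s*£([\d,]+)\s*</span>'),
        "bedrooms": re.compile(r"(\d+)\s+bed(?:room)?s?", re.IGNORECASE),
    },
    "site-b.example": {
        "price": re.compile(r"Price:\s*£([\d,]+)"),
        "bedrooms": re.compile(r"Bedrooms:\s*(\d+)"),
    },
}


def extract_fields(site, html):
    """Apply each of the site's patterns to the page text.

    Returns a dict of field name -> first captured group, or None
    when a pattern does not match (e.g. the site changed its layout).
    """
    fields = {}
    for name, pattern in SITE_PATTERNS[site].items():
        match = pattern.search(html)
        fields[name] = match.group(1) if match else None
    return fields
```

Adding a site then means adding one more entry to the table rather than writing new code, which is about as far as the crude approach stretches. A `None` value is a cheap signal that a site's layout has changed and its patterns need another look.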

not bad for 30 minutes' work though. will stick some db access on and see what i can knock up tomorrow. it's the last day before i go on holiday, so no work is going on.
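
The "db access" step might be as small as the sketch below, which uses Python's built-in `sqlite3` module as the local DB. The `listings` table, its columns, and the `open_db` / `save_listing` helpers are assumptions for illustration; the thread does not say which database or schema is actually used.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a local SQLite database with a listings table.

    The schema is hypothetical: one row per scraped URL.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS listings (
               url      TEXT PRIMARY KEY,
               site     TEXT NOT NULL,
               price    TEXT,
               bedrooms TEXT
           )"""
    )
    return conn


def save_listing(conn, url, site, fields):
    """Store one scraped page; re-crawling the same URL overwrites its row."""
    conn.execute(
        "INSERT OR REPLACE INTO listings (url, site, price, bedrooms) "
        "VALUES (?, ?, ?, ?)",
        (url, site, fields.get("price"), fields.get("bedrooms")),
    )
    conn.commit()
```

Keying the table on the URL means repeated crawls update existing rows instead of piling up duplicates, and the data can then be queried with plain SQL.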
 