Screenscraper

Hi all,

I'm looking for a screenscraper, preferably in ASP.NET, which can grab live data from different websites.

What would people recommend as the easiest way to acquire such a piece of software? Would the best way be to write one from scratch?

If so, has anyone ever achieved this? I'm keen to hear about any experiences or difficulties with such a project.

Thanks.
 
Take a look at the WebClient class. I have used it before for collecting and caching files from other URLs.

You may need to put together a bit of cleverness to go through the HTML file you cache and search for images, stylesheets, scripts etc., but it's not too difficult, just time-consuming!
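
As a rough illustration of the "go through the HTML and search for images/stylesheets/scripts" step, here is a minimal sketch in Python rather than ASP.NET, using only the standard library. The names `AssetCollector` and `find_assets` are made up for this example, and it assumes the page has already been downloaded into a string (in .NET that could be done with `WebClient.DownloadString`).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collects image, script and stylesheet URLs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative URLs
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            # Only keep <link> tags that actually point at a stylesheet.
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.assets.append(urljoin(self.base_url, attrs["href"]))


def find_assets(html, base_url):
    """Return absolute asset URLs in document order (hypothetical helper)."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.assets
```

Using a real HTML parser here rather than string searching means sloppy markup (unquoted attributes, missing closing tags) is less likely to trip it up, though it will not see anything the page adds later via script.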
 
It depends entirely on what you want to scrape.

Naturally, XML is stupidly easy to scrape (e.g. RSS feeds). Some things, such as news sites, are more difficult because they don't define clear start and end points.

So, what is it you're wanting to scrape?
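
To show why a well-formed feed is the easy case, here is a small sketch in Python (not ASP.NET) that pulls titles and links out of an RSS 2.0 document with the standard library XML parser. The function name `rss_items` is made up for this example, and it assumes the feed text has already been fetched and is well-formed.

```python
import xml.etree.ElementTree as ET


def rss_items(xml_text):
    """Return a list of {'title', 'link'} dicts, one per <item> element."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items
```

Every item has an explicit start and end tag, so there is nothing to guess at; an ordinary news page offers no such markers, which is where the real scraping work starts.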
 
jdickerson said:
Naturally, XML is stupidly easy to scrape (e.g. RSS feeds). Some things, such as news sites, are more difficult because they don't define clear start and end points.
I'll go out on a limb and say that dealing with remote XML formats such as RSS, Atom or plain XML is not screen-scraping. These formats were designed for information exchange, compatibility and ease of interpretation: you're not scraping your way through tag soup to get what you need, you're simply parsing semantic data that was intended to be parsed.
 
psyr33n said:
I'll go out on a limb and say that dealing with remote XML formats such as RSS, Atom or plain XML is not screen-scraping. These formats were designed for information exchange, compatibility and ease of interpretation: you're not scraping your way through tag soup to get what you need, you're simply parsing semantic data that was intended to be parsed.
:p
Quite right... but the XML I deal with usually doesn't stick to its own defined structure. I only call this scraping because I need quite a bit of regular expression work to sort it out.

Alas, I inherently find XML NOT to be semantic. Even the BBC's RSS feeds have occasionally had a messed-up structure.
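
For what it's worth, the "regular expression stuff" can be kept as a fallback rather than the main route. The sketch below is in Python (not ASP.NET) and is only an assumption about how one might do it, not a description of any particular setup: `feed_titles` and `TITLE_RE` are made-up names. It tries a proper XML parse first and only drops to a regex when the feed is broken.

```python
import re
import xml.etree.ElementTree as ET

# Crude pattern for <title>...</title>; it does not unwrap CDATA or entities.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def feed_titles(xml_text):
    """Return all <title> texts, falling back to a regex for malformed feeds."""
    try:
        root = ET.fromstring(xml_text)
        return [(t.text or "").strip() for t in root.iter("title")]
    except ET.ParseError:
        # The feed is not well-formed XML, so scrape what we can.
        return [m.strip() for m in TITLE_RE.findall(xml_text)]
```

The regex branch is deliberately limited to one tag; anything more elaborate tends to break the next time the feed's structure slips.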
 