Screenscraper

steinooo · 4 Jun 2007 at 10:20

Hi all,

I'm looking for a screenscraper, preferably in ASP.NET, which can grab live data from different websites.

What would people recommend as the easiest way to acquire such a piece of software? Would the best way be to write one from scratch?

If so, has anyone ever achieved this? I'm keen to hear about any experiences/difficulties with such a project.

Thanks.

huppy · 4 Jun 2007 at 15:34

Take a look at the WebClient class - I have used it before for collecting and caching files at other URLs.

You may need to put together a bit of cleverness to go through the HTML file you cache and search for images/stylesheets/scripts etc., but it's not too difficult, just time consuming!
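
To give a rough idea of that "go through the HTML" step, here is a minimal sketch. It is in Python rather than .NET purely for brevity, it uses only the standard library, and it works on a string you have already downloaded. The names `AssetCollector` and `find_assets` are made up for this example.

```python
from html.parser import HTMLParser


class AssetCollector(HTMLParser):
    """Collects the images, stylesheets and scripts a page refers to."""

    def __init__(self):
        super().__init__()
        self.assets = []  # list of (kind, url) tuples, in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.assets.append((tag, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            # rel can hold several space-separated tokens
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.assets.append(("stylesheet", attrs["href"]))


def find_assets(html):
    """Return every (kind, url) pair found in the given HTML text."""
    collector = AssetCollector()
    collector.feed(html)
    return collector.assets


page = (
    '<html><head><link rel="stylesheet" href="site.css">'
    '<script src="app.js"></script></head>'
    '<body><img src="logo.png"></body></html>'
)
for kind, url in find_assets(page):
    print(kind, url)
```

The URLs come back exactly as written in the page, so relative ones still need resolving against the page address before you fetch them.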

meghatronic · 4 Jun 2007 at 16:09

It depends entirely on what you want to scrape.

Naturally, XML is stupidly easy to scrape (e.g. RSS feeds)... some things are more difficult (news sites), as they don't define clear start and end points.

So, what is it you're wanting to scrape?

steinooo · 4 Jun 2007 at 17:07

Thanks for the responses. Prices on live sites, such as auction sites.

meghatronic · 4 Jun 2007 at 17:14

stokefan said:
Thanks for the responses. Prices on live sites, such as auction sites.

Should be easy then.

Most sites use CSS classes to mark up things like the bid price.
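
As a rough sketch of what that looks like (Python standard library only, not .NET, and the class name `bid-price` is invented for the example; you'd have to look at the real page source to find the actual one):

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collects the text inside every element carrying a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while we are inside a matching element
        self.results = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside a match
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth:
            self._buffer.append(data)


def extract_by_class(html, class_name):
    """Return the text of each element whose class list includes class_name."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    return parser.results


listing = (
    '<div class="item"><span class="bid-price current">GBP 12.50</span><br>'
    '<span class="bid-price">GBP 3.00</span></div>'
)
print(extract_by_class(listing, "bid-price"))
```

This assumes the markup is reasonably well nested. If the site renames the class or restructures the page, the scraper quietly returns nothing, so it's worth checking for an empty result.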

psyr33n · 4 Jun 2007 at 17:26

jdickerson said:
Naturally, XML is stupidly easy to scrape (e.g. RSS feeds)... some things are more difficult (news sites), as they don't define clear start and end points.

I'll go out on a limb and say that dealing with remote XML formats such as RSS, Atom or plain XML is not considered screen-scraping. The formats were designed for information exchange, compatibility and ease of interpretation: you're not scraping your way through tag soup to get what you need, you're simply parsing semantic data that was intended to be parsed.
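
To illustrate the difference, here is a minimal sketch of reading a well-formed RSS feed with an ordinary XML parser, no pattern matching involved. It is Python standard library only; `rss_items` and the sample feed are made up for the example.

```python
import xml.etree.ElementTree as ET


def rss_items(xml_text):
    """Return (title, link) for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


feed = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

for title, link in rss_items(feed):
    print(title, link)
```

The parser does the structural work, which is the whole point of the format. It only holds up as long as the feed really is well-formed XML.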

meghatronic · 4 Jun 2007 at 17:32

psyr33n said:
I'll go out on a limb and say that dealing with remote XML formats such as RSS, Atom or plain XML is not considered screen-scraping. The formats were designed for information exchange, compatibility and ease of interpretation: you're not scraping your way through tag soup to get what you need, you're simply parsing semantic data that was intended to be parsed.

Quite right... but the XML I deal with usually doesn't stick to its own defined structure. I only call this scraping because I need to do quite a bit of regular expression work to sort it out.

I inherently find XML NOT to be semantic, alas. Even the BBC's RSS feeds have occasionally been messed up structure-wise.
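
A rough sketch of the sort of thing I mean, in Python for brevity (standard library only; `feed_titles` is a made-up name): try a proper parse first and only drop back to a regular expression when the feed turns out to be broken.

```python
import re
import xml.etree.ElementTree as ET

# Crude fallback pattern: assumes plain <title>...</title> with no attributes.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def feed_titles(xml_text):
    """Return every title in a feed, even if the XML is not well-formed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # The feed broke its own structure, so scrape it like tag soup.
        return [t.strip() for t in TITLE_RE.findall(xml_text)]
    return [(el.text or "").strip() for el in root.iter("title")]


# The second <item> is never closed, so a strict parser rejects this feed.
broken = (
    "<rss><channel><title>Feed</title>"
    "<item><title>Story one</title></item>"
    "<item><title>Story two</title>"
    "</channel></rss>"
)
print(feed_titles(broken))
```

The regex path is deliberately simple and will miss titles wrapped in CDATA or carrying attributes, so treat it as a last resort rather than a replacement for the parser.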

psyr33n · 5 Jun 2007 at 20:18

Yeah, it's irritating that people can't write good XML for ****.

meghatronic · 6 Jun 2007 at 20:03

psyr33n said:
Yeah, it's irritating that people can't write good XML for ****.

The thing that annoys me is when the structure changes for no apparent reason... why?!