web scraping - general advice, tips etc.

Say I want to scrape some data from a company website once a day, a few thousand pages. Am I likely to run into trouble if I spread the requests out over perhaps a few hours and change IPs every so often?

Do I need to be more cautious and/or are there any tips anyone could share?

(This isn't anything illegal or breaching copyright or anything along those lines; I'm just aware that companies sometimes don't like this sort of thing.)
 
Well, if you are just scraping text there are probably no issues with doing that.

Things like email addresses, phone numbers, passwords etc. would be a bit of a problem, as those are people's personal details.

What is it you are using it for exactly?
 
No, there is no API. There are no legal issues per se: this isn't personal data/information, it is simply data that is openly available by browsing the website, so indeed fair game as you've said.

I'm just not sure how happy they'd be with someone scraping it daily, and whether I need to take precautions - hints and tips for doing so etc.
 
No it is not fair game, as it infringes copyright. You cannot just take content from another site without permission/if their license prohibits it. If no such license is on display, full copyright is assumed and thus scraping is prohibited.
 
Surely your browser is web scraping too? If they're publishing it openly what is the big deal? It is factual information I'm interested in not copyright stuff.
 
Depends on what you do with the scraped content really.

As for avoiding detection, a slow crawl rate and changing IPs will defeat the simplest anti-scraping plugins (assuming they even run one).

Not sure how many IPs you'd have access to though. Scrapers are normally pretty easy to spot in logs, even with a couple of different IPs.
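
To illustrate why scrapers tend to stand out in logs, here is a minimal sketch of what a site owner might do: count requests per client IP and flag the heavy hitters. The log lines, the `threshold` value, and the function names are all made up for illustration, and the code assumes the IP is the first field on each line, as in common access-log layouts.

```python
from collections import Counter


def requests_per_ip(log_lines):
    """Count requests per client IP.

    Assumes the IP is the first whitespace-separated field of each line.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def suspicious_ips(log_lines, threshold):
    """Return the IPs whose request count exceeds the (hypothetical) threshold."""
    counts = requests_per_ip(log_lines)
    return sorted(ip for ip, n in counts.items() if n > threshold)


# Made-up sample lines, for illustration only.
sample = [
    '203.0.113.5 - - [01/Jan/2024:10:00:00 +0000] "GET /a HTTP/1.1" 200 512',
    '203.0.113.5 - - [01/Jan/2024:10:00:05 +0000] "GET /b HTTP/1.1" 200 512',
    '203.0.113.5 - - [01/Jan/2024:10:00:10 +0000] "GET /c HTTP/1.1" 200 512',
    '198.51.100.7 - - [01/Jan/2024:10:01:00 +0000] "GET /a HTTP/1.1" 200 512',
]

print(suspicious_ips(sample, threshold=2))
```

A few thousand page views a day from a handful of addresses would show up quickly in a count like this, which is the point being made above.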
 
I'm not going to do anything illegal with the scraped content, it isn't going to be published, just used to make decisions.

That last bit is what I'm worried about - that it is easily spotted. I guess I'd need to subscribe to a few proxy services. Presumably if I'm putting in random delays and spreading the entire thing over a few hours, it isn't going to be as noticeable for the site owners.
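
The "random delays spread over a few hours" idea could be sketched roughly like this. It only builds the timing plan and does no fetching; the page count (3000), the three-hour window, and the function names are assumptions picked for illustration.

```python
import random


def jittered_schedule(n_pages, window_seconds, rng=None):
    """Return n_pages sorted offsets (in seconds) drawn uniformly from the window."""
    if rng is None:
        rng = random.Random()
    return sorted(rng.uniform(0, window_seconds) for _ in range(n_pages))


def delays_from_schedule(offsets):
    """Convert absolute offsets into the wait before each successive request."""
    delays = []
    previous = 0.0
    for t in offsets:
        delays.append(t - previous)
        previous = t
    return delays


# Hypothetical numbers: 3000 pages over a 3-hour window, seeded for repeatability.
offsets = jittered_schedule(3000, 3 * 60 * 60, random.Random(0))
delays = delays_from_schedule(offsets)
print(f"average gap: {sum(delays) / len(delays):.1f}s")
```

In a real loop you would `time.sleep(delay)` before each request. Note that uniformly random offsets produce some very short gaps, so a minimum delay on top of this is probably kinder to the site.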
 
Ahh if it's not published you don't really have to worry about copyright then.

Do you know what the site's based on? Is it likely that others are scraping it?

Things like WordPress have an anti-scraper plugin, but that can be defeated with very few IPs and slow requests. If it's a site that's likely to have anti-scraping software, then it might be worth passing links between bots rather than progressively following links and swapping IPs midway. I.e. with 3 links on the homepage, have IP1 follow link 1, IP2 follow link 2, and so on. If you add in your random timing it should appear natural enough.
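
The "IP1 follows link 1, IP2 follows link 2" split is essentially a round-robin assignment. A minimal sketch, where the worker labels and link paths are placeholders and nothing is actually fetched:

```python
from itertools import cycle


def assign_links(links, workers):
    """Assign links to workers in round-robin order.

    Returns a dict mapping each worker label to the list of links it gets.
    """
    assignment = {worker: [] for worker in workers}
    for link, worker in zip(links, cycle(workers)):
        assignment[worker].append(link)
    return assignment


# Placeholder labels and paths, for illustration only.
links = ["/page1", "/page2", "/page3", "/page4"]
workers = ["IP1", "IP2", "IP3"]
print(assign_links(links, workers))
```

With four links and three workers, the fourth link wraps around to the first worker again.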
 
Jesus.. so much blatant misinformation in such a small space. Robots.txt is licensing for search engines to index a site only. Not for anyone to scrape it for other means. You browsing a site is not scraping, no. Publishing (or not) is inconsequential to copyright infringement. You think pirating a game and/or film is legal as long as you don't "publish" it?

Think for once.
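
Whatever robots.txt does or doesn't license (the thread clearly disagrees on that), a crawler can at least read it. Here is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `my-bot` user agent, and the example.com URLs are invented for illustration, and the file is parsed from a string so nothing touches the network.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "my-bot" is a hypothetical user-agent string.
allowed = parser.can_fetch("my-bot", "https://example.com/products/1")
blocked = parser.can_fetch("my-bot", "https://example.com/private/x")
delay = parser.crawl_delay("my-bot")

print(allowed, blocked, delay)
```

This only reports what the site operator has asked crawlers to do; it says nothing about copyright or the site's terms of use.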
 
Thanks for the tips - there are actually a handful of sites that could be useful for this particular idea. They don't appear to use WordPress, though I only had a very quick look. I'm not particularly experienced on the web side of things, more the data side... but this could be an interesting thing to attempt/play around with in my spare time. :)
 
If you're going to be rude then don't bother replying, thanks. As has already been said, I'm not doing anything with copyrighted material - this isn't too different to me using a browser and jotting down factual information, it is simply at a larger scale.
 
Right... so Google and other search engines are breaching copyright because they use web crawlers, and price comparison sites too... Reading factual information from the web and making use of facts (not copyright material) is not going to breach copyright laws. Please stop posting nonsense in my thread, especially if you've got nothing useful to add to the actual question.
 