web scraping - general advice, tips etc..

dowie · 6 Apr 2017 at 00:57

Say I want to scrape some data from a company website once a day, a few thousand pages. Am I likely to run into trouble if I spread the requests out over perhaps a few hours and change IPs every so often?

Do I need to be more cautious and/or are there any tips anyone could share?

(this isn't anything illegal or breaching copyright or anything along those lines, it's just that I'm aware companies sometimes don't like this sort of thing)

Deleted member 77746 · 6 Apr 2017 at 08:45

What data are you harvesting?

dowie · 6 Apr 2017 at 12:57

Just text

dowie · 8 Apr 2017 at 16:55

do you know something about the subject @mrbell1984 or were you just asking out of curiosity?

Deleted member 77746 · 8 Apr 2017 at 17:28

Well, if you are just scraping text there are probably no issues with doing that.

Things like email addresses, phone numbers, passwords etc. would be a bit of a problem as they're people's personal details.

What is it you are using it for exactly?

chroniclard · 8 Apr 2017 at 17:38

If it's on the web it's fair game.

1000 pages of data though, doesn't this site have some kind of API to request data?

dowie · 8 Apr 2017 at 18:12

no, there is no API - there are no legal issues per se, this isn't personal data/information, it is simply data that is openly available by browsing the website, so indeed fair game as you've said

I'm just not sure how happy they'd be with someone scraping it daily and whether I need to take precautions - hints and tips for doing so etc..

Dj_Jestar · 8 Apr 2017 at 19:45

No it is not fair game, as it infringes copyright. You cannot just take content from another site without permission/if their license prohibits it. If no such license is on display, full copyright is assumed and thus scraping is prohibited.

kwerk · 8 Apr 2017 at 20:38

It should be fine IMO as long as you abide by their robots.txt
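For anyone wondering what abiding by robots.txt might look like in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules, the user-agent name and the example.com URLs are all made up for illustration; a real run would load the site's own file with `set_url()` and `read()` instead of the inlined text.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(EXAMPLE_ROBOTS)
AGENT = "my-scraper"  # hypothetical user-agent name

# Ask whether each URL may be fetched under the rules above.
allowed = parser.can_fetch(AGENT, "https://example.com/products/1")
blocked = parser.can_fetch(AGENT, "https://example.com/private/report")
# Crawl-delay is a non-standard directive; None if the site does not set it.
delay = parser.crawl_delay(AGENT)

print(allowed, blocked, delay)
```

Checking `can_fetch()` before each request and waiting at least `crawl_delay()` seconds between requests is one simple way to stay inside whatever the site has asked for.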

dowie · 8 Apr 2017 at 21:01

Surely your browser is web scraping too? If they're publishing it openly what is the big deal? It is factual information I'm interested in not copyright stuff.

Mynight · 8 Apr 2017 at 21:08

Depends on what you do with the scraped content really.

As for avoiding detection, a slow crawl rate and changing IPs will defeat the simplest anti-scraping plugins (assuming they even run one).

Not sure how many IPs you'd have access to though. Scrapers are normally pretty easy to spot in logs even with a couple different IPs.

dowie · 8 Apr 2017 at 21:15

I'm not going to do anything illegal with the scraped content, it isn't going to be published, just used to make decisions.

That last bit is what I'm worried about - that it is easily spotted. I guess I'd need to subscribe to a few proxy services. Presumably if I'm putting in random delays and spreading the entire thing over a few hours it isn't going to be as noticeable for the site owners.
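A rough sketch of the random-delay idea in Python, with no network calls. The figures (3000 pages over 3 hours) are only assumptions standing in for "a few thousand pages" and "a few hours", and `build_schedule` and `gaps` are hypothetical helper names.

```python
import random

def build_schedule(n_pages, window_seconds, seed=None):
    """Return sorted start offsets in seconds, one per page,
    placed at random within the time window."""
    rng = random.Random(seed)
    return sorted(rng.uniform(0, window_seconds) for _ in range(n_pages))

def gaps(schedule):
    """Waiting time between each pair of consecutive requests."""
    return [later - earlier for earlier, later in zip(schedule, schedule[1:])]

# Assumed figures: 3000 pages spread over 3 hours.
schedule = build_schedule(n_pages=3000, window_seconds=3 * 60 * 60, seed=42)
waits = gaps(schedule)

print(f"{len(schedule)} requests, mean gap {sum(waits) / len(waits):.1f}s")
```

A real loop would `time.sleep()` for each gap before the next request. Purely uniform offsets can still produce short bursts, so enforcing a minimum gap on top of this may be worth considering.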

Mynight · 8 Apr 2017 at 23:17

Ahh if it's not published you don't really have to worry about copyright then.

Do you know what the site's based on? Is it likely that others are scraping it?

Things like WordPress have an anti-scraper plugin, but that can be defeated with very few IPs and slow requests. If it's a site that's likely to have anti-scraping software then it might be worth passing links between bots rather than progressively following links and swapping IPs midway. I.e. with 3 links on the homepage, have IP1 follow link 1, IP2 follow link 2 and so on. If you add in your random timing it should appear natural enough.
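The "passing links between bots" idea boils down to a round-robin assignment. A minimal Python sketch of just that bookkeeping, with no networking; the link paths, the worker labels and the `assign_links` helper are hypothetical.

```python
from itertools import cycle

def assign_links(links, workers):
    """Hand out links to workers in round-robin order."""
    assignment = {worker: [] for worker in workers}
    for link, worker in zip(links, cycle(workers)):
        assignment[worker].append(link)
    return assignment

# Hypothetical links found on a homepage and three workers,
# each of which would use its own IP.
links = ["/page-1", "/page-2", "/page-3", "/page-4"]
workers = ["IP1", "IP2", "IP3"]

plan = assign_links(links, workers)
for worker, assigned in plan.items():
    print(worker, assigned)
```

Each worker would then fetch only its own list, with its own random timing.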

Dj_Jestar · 9 Apr 2017 at 02:28

Jesus.. so much blatant misinformation in such a small space. Robots.txt is licensing for search engines to index a site only. Not for anyone to scrape it for other means. You browsing a site is not scraping, no. Publishing (or not) is inconsequential to copyright infringement. You think pirating a game and/or film is legal as long as you don't "publish" it?

Think for once.

rexehuk · 9 Apr 2017 at 07:10

PowerBI can do this. Depends what you mean by decision making.

If it's storing the data each day, then might not be right solution.

dowie · 9 Apr 2017 at 11:50

Mynight said:
Ahh if it's not published you don't really have to worry about copyright then.

Do you know what the site's based on? Is it likely that others are scraping it?

Things like WordPress have an anti-scraper plugin, but that can be defeated with very few IPs and slow requests. If it's a site that's likely to have anti-scraping software then it might be worth passing links between bots rather than progressively following links and swapping IPs midway. I.e. with 3 links on the homepage, have IP1 follow link 1, IP2 follow link 2 and so on. If you add in your random timing it should appear natural enough.

thanks for the tips - there are actually a handful of sites that could be useful for this particular idea. They don't appear to use WordPress, though I've only had a very quick look - I'm not particularly experienced on the web side of things, more the data side... but this could be an interesting thing to attempt/play around with in my spare time.

dowie · 9 Apr 2017 at 11:50

rexehuk said:
PowerBI can do this. Depends what you mean by decision making.

If it's storing the data each day, then might not be right solution.

the data/analytics side I'm ok with, thanks for the suggestion though

dowie · 9 Apr 2017 at 11:52

Dj_Jestar said:
Jesus.. so much blatant misinformation in such a small space. Robots.txt is licensing for search engines to index a site only. Not for anyone to scrape it for other means. You browsing a site is not scraping, no. Publishing (or not) is inconsequential to copyright infringement. You think pirating a game and/or film is legal as long as you don't "publish" it?

Think for once.

If you're going to be rude then don't bother replying, thanks. As has already been said, I'm not doing anything with copyrighted material - this isn't too different to me using a browser and jotting down factual information, it is simply at a larger scale.

Dj_Jestar · 9 Apr 2017 at 11:53

Yes, it is.

dowie · 9 Apr 2017 at 12:13

right... so Google and other search engines are breaching copyright because they use web crawlers, and price comparison sites too... Reading factual information from the web and making use of facts (not copyrighted material) is not going to breach copyright laws. Please stop posting nonsense in my thread, especially if you've got nothing useful to add to the actual question.
