web scraping - general advice, tips etc..

dowie · 6 Apr 2017 at 00:57

Say I want to scrape some data from a company website once a day, few thousand pages, am I likely to run into trouble if I say spread the requests out over perhaps a few hours and change IPs every so often?

Do I need to be more cautious and/or are there any tips anyone could share?

(this isn't anything illegal or breaching copyright or anything along those lines just I'm aware companies sometimes don't like this sort of thing)

dowie · 6 Apr 2017 at 12:57

Just text

dowie · 8 Apr 2017 at 16:55

do you know something about the subject @mrbell1984 or were you just asking out of curiosity?

dowie · 8 Apr 2017 at 18:12

no there is no API - there are no legal issues per se, this isn't personal data/information, it is simply data that is openly available by browsing the website so indeed fair game as you've said

I'm just not sure how happy they'd be with someone scraping it daily and whether I need to take precautions - hints and tips for doing so etc..

dowie · 8 Apr 2017 at 21:01

Surely your browser is web scraping too? If they're publishing it openly what is the big deal? It is factual information I'm interested in not copyright stuff.

dowie · 8 Apr 2017 at 21:15

I'm not going to do anything illegal with the scraped content, it isn't going to be published, just used to make decisions.

That last bit is what I'm worried about - that it is easily spotted, I guess I'd need to subscribe to a few proxy services. Presumably if I put in random delays and spread the entire thing over a few hours it isn't going to be as noticeable for the site owners.
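As a rough illustration of the "random delays spread over a few hours" idea, here is a minimal Python sketch that only computes when each request would be sent; it does not fetch anything. The function name `jittered_schedule`, the 3000-page count and the 3-hour window are assumptions chosen to match "a few thousand pages" over "a few hours", not figures from any real site.

```python
import random


def jittered_schedule(n_pages, window_seconds, seed=None):
    """Return request times (seconds from start) spread across the window.

    The window is split into n_pages equal slots and one request is placed
    at a random point inside each slot, so the overall rate stays even
    while the gaps between requests vary.
    """
    rng = random.Random(seed)
    slot = window_seconds / n_pages
    return [i * slot + rng.uniform(0, slot) for i in range(n_pages)]


# Hypothetical figures: 3000 pages over 3 hours.
schedule = jittered_schedule(n_pages=3000, window_seconds=3 * 3600, seed=1)
gaps = [later - earlier for earlier, later in zip(schedule, schedule[1:])]
print(f"mean gap: {sum(gaps) / len(gaps):.1f}s, max gap: {max(gaps):.1f}s")
```

Placing one request per slot keeps any gap below two slot lengths, which a plain "sleep a random amount" loop does not guarantee.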

dowie · 9 Apr 2017 at 11:50

Mynight said:
Ahh if it's not published you don't really have to worry about copyright then.

Do you know what the site's based on? Is it likely that others are scraping it?

Things like Wordpress have an anti-scraper plugin but that can be defeated with very few IPs and slow requests. If it's a site that's likely to have anti-scraping software then it might be worth passing links between bots rather than progressively following links and swapping IPs midway. I.e. with 3 links on the homepage, have IP1 follow link 1, IP2 follow link 2 and so on. If you add in your random timing it should appear natural enough.
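The "IP1 follows link 1, IP2 follows link 2" suggestion amounts to dealing links out round-robin. A minimal sketch of just that assignment step, assuming the links have already been collected; `assign_links`, the `IP1`/`IP2`/`IP3` labels and the example paths are placeholders invented for illustration, and no network code is shown.

```python
from itertools import cycle


def assign_links(links, workers):
    """Deal links out to workers in round-robin order.

    Returns a dict mapping each worker label to the links it should fetch.
    """
    plan = {worker: [] for worker in workers}
    for link, worker in zip(links, cycle(workers)):
        plan[worker].append(link)
    return plan


# Placeholder links and worker labels, purely for illustration.
plan = assign_links(["/a", "/b", "/c", "/d"], ["IP1", "IP2", "IP3"])
for worker, links in plan.items():
    print(worker, links)
```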

thanks for the tips - there are actually a handful of sites that could be useful for this particular idea, they don't appear to use Wordpress, only had a very quick look - I'm not particularly experienced on the web side of things, more the data side... but this could be an interesting thing to attempt/play around with in my spare time.

dowie · 9 Apr 2017 at 11:50

rexehuk said:
PowerBI can do this. Depends what you mean by decision making.

If it's storing the data each day, then might not be right solution.

the data/analytics side I'm ok with, thanks for the suggestion though

dowie · 9 Apr 2017 at 11:52

Dj_Jestar said:
Jesus.. so much blatant misinformation in such a small space. Robots.txt is licensing for search engines to index a site only. Not for anyone to scrape it for other means. You browsing a site is not scraping, no. Publishing (or not) is inconsequential to copyright infringement. You think pirating a game and/or film is legal as long as you don't "publish" it?

Think for once.

If you're going to be rude then don't bother replying, thanks. As has already been said I'm not doing anything with copyrighted material - this isn't too different to me using a browser and jotting down factual information, it is simply at a larger scale.

dowie · 9 Apr 2017 at 12:13

right... so google and other search engines are breaching copyright because they use web crawlers and price comparison sites too... reading factual information from the web and making use of facts (not copyright material) is not going to breach copyright laws, please stop posting nonsense in my thread, especially if you've got nothing useful to add to the actual question

dowie · 9 Apr 2017 at 15:04

oh great another nonsense reply

you're of course missing the important part:

To gain access to the flight information, PR Aviation had to agree to Ryanair's terms and conditions which prohibited the use of an automated system or software to extract data from the website for commercial purposes, unless Ryanair consented to the activity.

The CJEU ruled that the flight data on Ryanair's website did not qualify for either database rights or copyright protection, upholding previous findings of a Dutch court.
[...]
Copyright law alone cannot offer protection to database creators where the database contains facts, as only the expression of facts and not the facts themselves can be copyrighted.

the issue here is you needed to agree to Ryanair's terms and conditions before using the flight data, this doesn't apply in my case. As is clear from your own link the data itself isn't copyright protected, it is simply factual information, but don't let actual details get in the way of you cherry picking a case that doesn't really apply here. You've actually managed to completely misunderstand (or perhaps didn't bother reading) your own link as it doesn't support your assertion re: copyrighted material at all - the case is related to a breach of contract.

so again, as I've already stated previously - I'm not doing anything illegal and this doesn't concern copyright protected content just data/facts, if you've got nothing useful to add re: the actual thread topic/query other than ill informed legal opinion then please don't bother replying

dowie · 9 Apr 2017 at 16:32

Dj_Jestar said:
Given you are seeking means to avoid being "detected" it's pretty damn clear they don't want you scraping. Thus, you don't have permission, thus you don't have any legal grounds. The only person talking nonsense is yourself, @dowie.

You seem to have a chip on your shoulder regarding this. Yes, what you've posted previously is nonsense as already explained - your cited case concerned breach of contract not copyright issues. Now that has been shot down you've come up with a handwaving argument: because I don't want to be detected it is illegal. You've got nothing helpful to add here and your previous argument was flawed. Your current argument doesn't even rely on facts and you seem to be pursuing it now out of frustration.

dowie · 9 Apr 2017 at 16:35

peterwalkley said:
The copyright issue would be a lot easier to argue if you were to say what sites you are going to scrape, what data you want and what you are going to use it for.

What you have said so far is too vague to make a fair and impartial judgement.

Well I'm not too interested in debating the copyright issue much other than pointing out that it is factual data and really not likely to be an issue. For some reason another poster has taken it upon himself to offer nothing useful but try to present some flawed arguments.

I'll give an unrelated example - suppose I was an electronics retailer competing with OCUK and I wanted to grab the prices of their latest graphics cards to keep an eye on the competition - am I breaching copyright if, instead of browsing through the OCUK website and writing down the prices manually, I instead scrape the pages and extract the prices automatically?

dowie · 9 Apr 2017 at 16:56

Dj_Jestar said:
Flawed? Ha, k. How about your lack of argument at all?

And yes, you would be in breach of copyright if you scraped OcUK's prices without their permission.

How? Did you even read your own link - see my previous post, in particular the bold part. You're replying with nonsense as a result of your own stupidity.

dowie · 9 Apr 2017 at 17:29

Dj_Jestar said:
Have you had permission from OcUK to scrape their site? No? Breach of copyright. Iirc full copyright ("all rights reserved") is declared by OcUK (and is assumed anywhere it isn't expressly stated as otherwise, as per earlier which your sieve of a brain has "forgotten" already).

For somebody to browse a site is reasonable use. Automated scraping is not.

Jeez you're going round in circles, read the previous post as explained before. Factual information like that isn't protected by copyright, it is there in your own linked court case, I even highlighted it in bold for you. You've made several nonsense posts in my thread now offering nothing helpful other than your own flawed opinions on what is or isn't legal.

dowie · 9 Apr 2017 at 17:30

AHarvey said:
Can you have copyright over a price?

Nope of course you can't, the other poster is clueless.

dowie · 9 Apr 2017 at 18:10

Dj_Jestar said:
You do have copyright over the content presenting the price. Abuse of a service and/or content is the breach.

More hand waving... you're even wilfully ignoring the case you cited - just scroll up, it is highlighted in bold in my other post... Either present something factual or stop disrupting my thread with your nonsense.

dowie · 9 Apr 2017 at 19:30

Dj_Jestar said:
Factual.. like both links pointing out scraping is illegal without permission. Kk.

see the previous post and try applying some basic reading comprehension, you've chucked in a few rude comments yourself about goldfish brain, brain like a sieve etc.. yet ironically you've already had it pointed out to you where that Ryanair comparison falls down - that was unrelated to copyright as has already been pointed out to you

if you don't have any factual, sensible comments to add then please don't carry on posting in this thread as you're just creating unnecessary noise

dowie · 9 Apr 2017 at 19:34

andshrew said:
Your question is like asking how long is a piece of string.

How much traffic does the site process, does the owner have any reason to want to prevent people viewing thousands of pages per day? If the answer is millions and no, then as long as you're not flooding them with requests you're more than likely to go unnoticed. If they do care, or you suddenly account for 80% of their traffic, they're more than likely going to do something to prevent your access. You've already answered how you may get around this.

thanks for the reply, definitely a high traffic website. I don't think scraping this data from their website will involve flooding them with requests or harming their service in any way
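To put the "millions of page views versus a few thousand requests" comparison into numbers, here is a small arithmetic sketch in Python. The helper name `traffic_share` and the figures (3000 scraper requests a day against 1,000,000 other requests) are assumptions for illustration only; the real traffic of the site in question is not known.

```python
def traffic_share(my_requests_per_day, other_requests_per_day):
    """Fraction of the site's total daily requests that come from the scraper."""
    total = my_requests_per_day + other_requests_per_day
    return my_requests_per_day / total


# Hypothetical figures: a few thousand pages against a million other requests.
share = traffic_share(3000, 1_000_000)
print(f"scraper share of daily traffic: {share:.2%}")
```

On those assumed figures the scraper is well under 1% of traffic, far from the 80% scenario described above. The same requests packed into a few minutes would still be a large share of the traffic in that short window, which is why the spacing matters as well as the daily total.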

Have you approached the company to ask if they can provide the data you want on a regular basis, rather than just resorting to brute forcing it off their web site?

Absolutely not. What I'd use it for has no impact on their business, but it also has value if used in a certain way so it isn't something I'd want known, hence why I'm not really able to give full details on here

dowie · 9 Apr 2017 at 20:47

Dj_Jestar said:
reading comprehension is something you lack. Not I. So why do you want to mask your scraping, if you don't think you have any reason to hide it? Do carry on pretending you think there is nothing wrong, by all means, but this proves you know you have no permission to do it.

The funny thing about scraping that you are woefully trying to dodge (though round of applause for not tripping yourself up): you aren't scraping just the "data/facts" (still a pathetically weak point). By fact of scraping a site you are scraping the entirety of the site's content. Markup etc. that are all proprietary. Copyright infractions ahoy.

All that is still adjacent to the point that you need permission to use a site like this anyway, if you want to be free of copyright infringement.

Of course you could just ask them, maybe even broker a deal for the data so you don't need to scrape at all. But hey, why do that when you can do it illegally for free?

from your own link:

The CJEU ruled that the flight data on Ryanair's website did not qualify for either database rights or copyright protection, upholding previous findings of a Dutch court.

Is that so hard to understand?

Now can you please stop posting off topic drivel in my thread - I'm asking about web scraping, I'm not asking about doing anything illegal nor breaking any copyright laws. If you've got something constructive to add re: web scraping then please do contribute - if you're going to carry on trying to flog a dead horse re: copyright infringement then please go and start your own thread - I'd also suggest that if you do so then pay a bit more attention to the link that you yourself posted as it doesn't back your position.