Web scraping - general advice, tips etc.

Scotty123 · 9 Apr 2017 at 23:21

What are you using to scrape? I used Python last year to scrape off hotel rooms and prices for a site to book them. The original website was very javascript/ajax heavy, but I managed to get around it with Selenium to wait for certain things to appear and click certain buttons. Was a fun thing to do.

peterwalkley · 10 Apr 2017 at 08:50

Copyright/legal/moral/ethical issues aside:

- Run the scrape during a period when traffic on the site would be quiet, e.g. 2am if it's a UK site.
- Scrape slow. Pulling a page every two seconds is 30 a minute, 1800 in an hour. For the volume you indicated, it's 2-3 hours to get the lot.
- Scrape smart. How often are the pages likely to change? Do you need a daily run? Will they also change at the weekend? Will they change during public holidays?
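
As a rough sketch of the "scrape slow" point (not anyone's actual scraper from this thread): the Python below pauses a fixed two seconds between requests and works out how long a run would take. The names `scrape_slowly`, `estimated_hours` and `fetch_with_urllib` are made up for illustration, and the page counts in the note underneath are placeholders, not the volume the OP indicated.

```python
import time
import urllib.request
from typing import Callable, Iterable, List


def estimated_hours(page_count: int, delay_seconds: float = 2.0) -> float:
    """Rough run time from the pauses alone (download time is ignored)."""
    return page_count * delay_seconds / 3600


def fetch_with_urllib(url: str) -> str:
    """Hypothetical fetcher: download one page with the standard library."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_slowly(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_with_urllib,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        pages.append(fetch(url))
    return pages
```

At one page every two seconds, `estimated_hours(1800)` comes out at 1.0 and `estimated_hours(5400)` at 3.0, which lines up with the 1800-an-hour arithmetic. Passing `fetch` and `sleep` in as parameters is only there so the loop can be tried out without hitting a real site.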

There is no 'best' language or tool; it's whatever you feel comfortable working with. You can google code for scrapers in every language going; it's a classic tutorial programming assignment.

Dj_Jestar · 10 Apr 2017 at 14:29

dowie said:
please stop posting off topic drivel in my thread - I'm asking about web scraping

Web scraping without permission is illegal. Hardly off topic, and certainly not drivel.

I was waiting for you to repeat that line about the CJEU's ruling, because you've failed to read the very next paragraph:

As a result, PR Aviation could not rely on those provisions to defeat the contractual breach claims brought by Ryanair, it said.

dowie · 10 Apr 2017 at 15:01

Dj_Jestar said:
Web scraping without permission is illegal. Hardly off topic, and certainly not drivel.

I was waiting for you to repeat that line about the CJEU's ruling, because you've failed to read the very next paragraph:

The reading failure isn't mine; you're going round in circles. You've made an assertion re: illegality that you can't back up, and you've made claims re: copyright that your own case shows to be untrue. The case related to breach of contract and was won by Ryanair because, to use their website, the company concerned had to agree to terms and conditions which they then broke. I've already replied to this - please note the last part of my reply. If you still don't understand then I'd rather you started another thread instead of carrying on repeating the same nonsense in here.

dowie said:
the issue here is you needed to agree to the terms and conditions on the Ryanair site before using it; this doesn't apply in my case. As is clear from your own link, the data itself isn't copyright protected, it is simply factual information, but don't let actual details get in the way of you cherry picking a case that doesn't really apply here. You've actually managed to completely misunderstand (or perhaps didn't bother reading) your own link, as it doesn't support your assertion re: copyrighted material at all - the case is related to a breach of contract.

so again, as I've already stated previously - I'm not doing anything illegal and this doesn't concern copyright protected content, just data/facts. If you've got nothing useful to add re: the actual thread topic/query other than ill-informed legal opinion then please don't bother replying

dowie · 10 Apr 2017 at 15:06

Scotty123 said:
What are you using to scrape? I used Python last year to scrape off hotel rooms and prices for a site to book them. The original website was very javascript/ajax heavy, but I managed to get around it with Selenium to wait for certain things to appear and click certain buttons. Was a fun thing to do.

thanks for the on topic reply

I was looking at making use of BeautifulSoup with python - any libraries you'd recommend?

dowie · 10 Apr 2017 at 15:13

peterwalkley said:
Copyright/legal/moral/ethical issues aside:

- Run the scrape during a period when traffic on the site would be quiet, e.g. 2am if it's a UK site.
- Scrape slow. Pulling a page every two seconds is 30 a minute, 1800 in an hour. For the volume you indicated, it's 2-3 hours to get the lot.
- Scrape smart. How often are the pages likely to change? Do you need a daily run? Will they also change at the weekend? Will they change during public holidays?

There is no 'best' language or tool; it's whatever you feel comfortable working with. You can google code for scrapers in every language going; it's a classic tutorial programming assignment.

thanks, there do seem to be quite a few libraries out there - it might be the case that scraping every 2-3 days is workable instead. I'm unsure about weekends/public holidays - possibly not updated then but I'll need to test that.

The site is likely quite busy and able to handle a lot of traffic, so I'm unlikely to cause them much of an issue traffic-wise. Would that change your 2am advice - would scraping at a time with less traffic mean I'm more likely to be detected in logs etc.?

beh · 10 Apr 2017 at 16:09

Is the site behind a CDN or some form of protection? If so you might be stopped by a captcha.

Why not just ask permission?

dowie · 10 Apr 2017 at 16:38

doesn't seem to be

I don't really want people to know what this is for, so I'd like to do it discreetly

peterwalkley · 10 Apr 2017 at 16:41

dowie said:
The site is likely quite busy and able to handle a lot of traffic, so I'm unlikely to cause them much of an issue traffic-wise. Would that change your 2am advice - would scraping at a time with less traffic mean I'm more likely to be detected in logs etc.?

The nail that sticks up is the one that gets hit. Do anything you can to minimise being that nail.

dowie · 10 Apr 2017 at 16:48

that's what I was getting at - so perhaps not doing it at 2am and instead at busy times when there is plenty of other traffic - daytime, early evening etc.

beh · 10 Apr 2017 at 16:59

dowie said:
I don't really want people to know what this is for, so I'd like to do it discreetly

Why not just send a quick email asking if they mind you scraping some information from their site? You don't necessarily have to say up front what you'd be using it for. If you're going to be doing it in the long term it seems likely they'd notice eventually. All this talk of being discreet or that they might not be happy suggests you perhaps shouldn't be doing it otherwise.

What do the T&Cs say?

dowie · 10 Apr 2017 at 17:05

I think in general most websites aren't necessarily open to scraping (at least I'd assume so?) - though like I said before I'd want to be discreet about this. I can't really expand much more on that other than to say approaching them isn't going to be useful in that respect - so even if they'd be welcoming it doesn't really make much difference in this instance

nothing in the T&Cs about web scraping so the issue previously highlighted re: the Ryanair case doesn't apply

Scotty123 · 10 Apr 2017 at 20:58

dowie said:
thanks for the on topic reply

I was looking at making use of BeautifulSoup with python - any libraries you'd recommend?

Possibly, if the site is not ajax heavy and doesn't require automated clicking. My web scraper had to enter lots of stuff into fields, click buttons, navigate the site etc. to get data, so Selenium was best for that as it has a lot of nice features.

dowie · 11 Apr 2017 at 06:12

Ah, fortunately this site is a fair bit simpler than the sort of hotel reservation system you've worked on. I am however completely new to this, as most of my programming experience is stats/machine learning stuff.

NutterzUK · 22 Apr 2017 at 21:29

Is there any common structure to the data on the pages you are wanting to scrape, and where do you intend to deposit (and in what format) the data once you have it?

E.g:
- "I want to scrape all headings", or "I want to scrape a paragraph below a particular heading". These are easy if they are consistent. If the question is more like "I want to run some kind of logic to understand which part of the page to scrape, then I need to format it based on some rules", then it's more difficult.
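
For the easy, consistent case ("all headings"), here is a minimal sketch using only Python's standard-library `html.parser`, swapped in for BeautifulSoup or JSoup so it runs without installing anything. `HeadingCollector` and `extract_headings` are hypothetical names, and it assumes the headings are plain `h1` to `h6` tags already present in the HTML (i.e. not filled in later by JavaScript).

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collects (tag, text) pairs for every heading element it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.headings: List[Tuple[str, str]] = []
        self._current_tag: Optional[str] = None
        self._text_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current_tag = tag  # start buffering text for this heading
            self._text_parts = []

    def handle_data(self, data):
        if self._current_tag is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            # Join the buffered pieces and collapse runs of whitespace.
            text = " ".join("".join(self._text_parts).split())
            self.headings.append((tag, text))
            self._current_tag = None


def extract_headings(html: str) -> List[Tuple[str, str]]:
    """Return all headings in document order as (tag, text) pairs."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings
```

Text inside nested tags (an `<em>` within an `<h2>`, say) is kept as part of the heading. The harder "paragraph below a particular heading" case would need extra state to track which heading was seen last.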

You want to run it once per day - do you have your own server, or are you leaving a cron job (or similar) running locally?
For the amount it would cost (peanuts), I'd recommend running an AWS Lambda function daily. This can be written in Node.js, Java, C# or Python. If it was me, as I work mainly with Java, I'd do it in Java... but all of those languages can easily request web pages and parse them. Using Java, I'd recommend JSoup. I don't know about the others. From there it's easy to store in S3, or in a database.
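
To make the "store it in a database" step concrete, here is a small sketch assuming SQLite via Python's built-in `sqlite3` module rather than S3 or a hosted database. The table name `scraped_values`, the function `save_snapshot` and the (name, value) row shape are all assumptions for illustration, not anything the OP has described.

```python
import sqlite3
from datetime import date
from typing import Iterable, Tuple


def save_snapshot(
    connection: sqlite3.Connection,
    rows: Iterable[Tuple[str, str]],
    scraped_on: date,
) -> int:
    """Store one run's (name, value) pairs; return the row count for that day."""
    day = scraped_on.isoformat()
    with connection:  # commits on success, rolls back if an error is raised
        connection.execute(
            """
            CREATE TABLE IF NOT EXISTS scraped_values (
                scraped_on TEXT NOT NULL,
                name       TEXT NOT NULL,
                value      TEXT NOT NULL,
                PRIMARY KEY (scraped_on, name)
            )
            """
        )
        # Re-running on the same day overwrites that day's values
        # instead of adding duplicate rows.
        connection.executemany(
            "INSERT OR REPLACE INTO scraped_values (scraped_on, name, value) "
            "VALUES (?, ?, ?)",
            [(day, name, value) for name, value in rows],
        )
    cursor = connection.execute(
        "SELECT COUNT(*) FROM scraped_values WHERE scraped_on = ?", (day,)
    )
    return cursor.fetchone()[0]
```

Keying on the date plus the item name means a daily run builds up a history, one row per item per day, which is probably the shape wanted for later stats work.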

Regarding the legality, I'd be very careful. I suspected it may not be legal (I'm no legal expert), and some quick googling comes up with things like:

"Most courts are in agreement that web scraping is unlawful, but that does not mean that all web scrapers are identified and punished."

"In 2001, the legality question was brought up again when a travel agency sued another company for scraping information. The rival company made use of the pricing information taken, without consent and undercut the competition – and resulted in less customers and income for that agency. This brought to attention the importance of authorized and unauthorized access of information on websites and how to ensure that no unauthorized users could scrape information."
https://www.scrapesentry.com/scraping-wiki/web-scraping-legal-or-illegal/

So yes, I would imagine it is illegal to scrape OcUK's prices and then use them to undercut them - but I'm not an expert in the matter. I just know it seems to be a contested subject.

DarrenM343 · 28 Apr 2017 at 22:03

Personally I think some of you guys are taking this too seriously without knowing what the info/data is. Sounds to me like OP wants something like stock prices. Some sites actually allow you to download such information as CSVs. If OP is using the information for himself only I don't see a problem with it; they're not going to be passing it off as their own work, or giving it to others. And it sounds like the data is freely available anyway, i.e. not like music where you have to pay for it, and not a subscription service.

Of course the company might not like the artificial load being placed on their servers, but that's a different matter.

It's actually pretty easy to scrape info.
Worth noting that there are engines out there that will search the net for text to help with decisions, such as the buying of stocks. The engines will find news articles about companies. I have no idea if they help, but it's in use and is automation of what a person could do, just much quicker. I suppose these engines parse data rather than scrape it though.

I'm not saying it's right though but really depends what it is they're intending to scrape.

dowie · 6 May 2022 at 13:08

Just to update this: US courts have affirmed recently that web scraping is legal.

Web scraping is legal, US appeals court reaffirms | TechCrunch

Good news for archivists, academics, researchers and journalists: Scraping publicly accessible data is legal, according to a U.S. appeals court ruling.

The landmark ruling by the U.S. Ninth Circuit of Appeals is the latest in a long-running legal battle brought by LinkedIn aimed at stopping a rival company from web scraping personal information from users’ public profiles. The case reached the U.S. Supreme Court last year but was sent back to the Ninth Circuit for the original appeals court to re-review the case.

In its second ruling on Monday, the Ninth Circuit reaffirmed its original decision and found that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act, or CFAA, which governs what constitutes computer hacking under U.S. law.

This wasn't (directly) about stock prices (that data is easily available from a range of providers and doesn't require scraping) but rather was data tangentially related to stock prices. Incidentally, over the past 5 years the use of "alternative data" has become a bit more widespread.

Blackjack Davy · 7 May 2022 at 20:11

It might be legal (and that's a US ruling) but that doesn't mean the site owner has to like it - I've known people get IP banned for scraping sites without permission. Always ask first.