Web scraping - general advice, tips etc.

Web scraping without permission is illegal. Hardly off topic, and certainly not drivel.

I was waiting for you to repeat that line about the CJEU's ruling, because on the very next paragraph you've failed to read:

The reading failure isn't mine; you're going round in circles. You've made an assertion about illegality that you can't back up, and you've made claims about copyright that your own case shows to be untrue. The case related to breach of contract and was won by Ryanair because, to use their website, the company concerned had to agree to terms and conditions, which they then broke. I've already replied to this - please note the last part of my reply. If you still don't understand, then I'd rather you started another thread instead of repeating the same nonsense in here.

The issue here is that you needed to agree to the terms and conditions on the Ryanair website before using it, and this doesn't apply in my case. As is clear from your own link, the data itself isn't copyright protected; it is simply factual information. But don't let actual details get in the way of you cherry-picking a case that doesn't really apply here. You've actually managed to completely misunderstand (or perhaps didn't bother reading) your own link, as it doesn't support your assertion about copyrighted material at all - the case is related to a breach of contract.

So again, as I've already stated, I'm not doing anything illegal and this doesn't concern copyright-protected content, just data/facts. If you've got nothing useful to add on the actual thread topic/query other than ill-informed legal opinion, then please don't bother replying.
 
What are you using to scrape? I used Python last year to scrape hotel rooms and prices for a site that books them. The original website was very JavaScript/AJAX heavy, but I managed to get around it by using Selenium to wait for certain things to appear and click certain buttons. Was a fun thing to do.

Thanks for the on-topic reply.

I was looking at making use of BeautifulSoup with Python - any libraries you'd recommend?
 
Copyright/legal/moral/ethical issues aside:

- Run the scrape during a period when traffic on the site would be quiet, e.g. 2am if it's a UK site.
- Scrape slow. Pulling a page every two seconds is 30 a minute, 1800 an hour. For the volume you indicated, it's 2-3 hours to get the lot.
- Scrape smart. How often are the pages likely to change? Do you need a daily run? Will they also change at the weekend? Will they change during public holidays?
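To put the "scrape slow" point into code, here is a minimal sketch using only the Python standard library. The function names, the User-Agent string and the 4500-page figure in the usage note are made up for illustration; only the two-second gap comes from the advice above.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable


def estimated_hours(page_count: int, delay_seconds: float = 2.0) -> float:
    """Rough run time if one page is requested every delay_seconds."""
    return page_count * delay_seconds / 3600


def fetch(url: str) -> bytes:
    """Download one page. The User-Agent value is a placeholder."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def scrape_slowly(
    urls: Iterable[str],
    delay_seconds: float = 2.0,
    fetch_page: Callable[[str], bytes] = fetch,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch each URL in turn, pausing delay_seconds between requests."""
    pages: Dict[str, bytes] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause needed before the first request
        pages[url] = fetch_page(url)
    return pages
```

As a rough check, `estimated_hours(4500)` works out at 2.5 hours, which is in the 2-3 hour range mentioned above for a hypothetical 4500 pages. `fetch_page` and `sleep` are parameters so the loop can be tried out with fakes before pointing it at a real site.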

There is no 'best' language or tool; it's whatever you feel comfortable working with. You can google code for scrapers in every language going; it's a classic tutorial programming assignment.

Thanks, there do seem to be quite a few libraries out there. It might be the case that scraping every 2-3 days is workable instead. I'm unsure about weekends/public holidays - possibly the pages aren't updated then, but I'll need to test that.
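One way to test whether the pages actually change at weekends or on public holidays is to keep a fingerprint of each page from the previous run and compare it on the next one. This is only a sketch with hypothetical names (`content_fingerprint`, `changed_urls`), and it assumes you already have the page bodies as bytes.

```python
import hashlib
from typing import Dict, List


def content_fingerprint(page_bytes: bytes) -> str:
    """SHA-256 hex digest of the raw page body."""
    return hashlib.sha256(page_bytes).hexdigest()


def changed_urls(
    previous: Dict[str, str], current_pages: Dict[str, bytes]
) -> List[str]:
    """Return URLs whose content is new or differs from the stored fingerprint."""
    changed = []
    for url, body in current_pages.items():
        if previous.get(url) != content_fingerprint(body):
            changed.append(url)
    return changed
```

If a page embeds something that differs on every request (a timestamp, a session token), the raw bytes will always look changed, so it may be better to fingerprint only the extracted data rather than the whole page.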

The site is likely quite busy and able to handle a lot of traffic, so I'm unlikely to be causing them much of an issue traffic-wise. Would that change your 2am suggestion? Would scraping at a time with less traffic mean I'm more likely to be detected in logs etc.?
 
That's what I was getting at - so perhaps not doing it at 2am and instead at busy times when there is plenty of other traffic: daytime, early evening etc.
 
I think in general most websites aren't necessarily open to scraping (at least I'd assume so?), though like I said before I'd want to be discreet about this. I can't really expand much more on that other than to say approaching them isn't going to be useful in that respect, so even if they'd be welcoming it doesn't really make much difference in this instance.

There's nothing in the T&Cs about web scraping, so the issue previously highlighted with the Ryanair case doesn't apply.
 
Ah fortunately this site is a fair bit simpler than the sort of hotel reservation system you've worked on, I am however completely new to this as most of my programming experience is stats/machine learning stuff.
 
Just to update this: a US appeals court has recently affirmed that scraping publicly accessible data is legal.

Good news for archivists, academics, researchers and journalists: Scraping publicly accessible data is legal, according to a U.S. appeals court ruling.

The landmark ruling by the U.S. Ninth Circuit Court of Appeals is the latest in a long-running legal battle brought by LinkedIn aimed at stopping a rival company from web scraping personal information from users’ public profiles. The case reached the U.S. Supreme Court last year but was sent back to the Ninth Circuit for the original appeals court to re-review the case.

In its second ruling on Monday, the Ninth Circuit reaffirmed its original decision and found that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act, or CFAA, which governs what constitutes computer hacking under U.S. law.

This wasn't (directly) about stock prices (that data is easily available from a range of providers and doesn't require scraping) but rather data tangentially related to stock prices. Incidentally, over the past 5 years the use of "alternative data" has become a bit more widespread.
 