TL;DR: In the UK, it is mostly legal to scrape data for non-commercial research. Great for doing research from home.
Much research data nowadays is sourced directly from the Web, either from traditional websites or from social media platforms. Economists, sociologists, and geographers often rely on web scraping to collect large datasets about human behaviour: for example, getting flight prices from Expedia to model transport market dynamics, collecting Facebook messages to analyse hate speech, or scraping Airbnb listings to study the housing crisis in London. I have done some web scraping for my research, and I always assumed that the approach was technically illegal, since it usually infringes the Terms of Service of data owners.
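For readers unfamiliar with what "scraping" means in practice, it usually amounts to downloading a page and pulling structured fields out of its HTML. A minimal sketch using only Python's standard library is below; the HTML snippet and the `price` class are made up for illustration (a real scraper would fetch the page over HTTP, e.g. with `urllib.request` or the `requests` library, and match the target site's actual markup):

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of elements marked with class="price"."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

# In a real scraper this string would be the body of an HTTP response;
# a hypothetical static snippet stands in for the downloaded page here.
html = '<ul><li class="price">£120</li><li class="price">£95</li></ul>'
parser = PriceParser()
parser.feed(html)
print(parser.prices)  # ['£120', '£95']
```

In practice most researchers reach for third-party libraries such as BeautifulSoup or Scrapy, but the principle is the same: copies of the pages are made locally so that they can be analysed, which is exactly the act the copyright exceptions below are concerned with.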
It had escaped my attention that, in 2014, the UK introduced clear legislation to define exceptions to copyright for non-profit research. In summary, scraping copyright-protected material from the web is allowed if one has access to it and if the research is non-commercial. The text from the British Intellectual Property Office (Exceptions to Copyright section) unambiguously states:
Non-commercial research and private study
You are allowed to copy limited extracts of works when the use is non-commercial research or private study, but you must be genuinely studying (like you would if you were taking a college course). Such use is only permitted when it is ‘fair dealing’ and copying the whole work would not generally be considered fair dealing.
The purpose of this exception is to allow students and researchers to make limited copies of all types of copyright works for non-commercial research or private study. In assessing whether your use of the work is permitted or not you must assess if there is any financial impact on the copyright owner because of your use. Where the impact is not significant, the use may be acceptable.
If your use is for non-commercial research you must ensure that the work you reproduce is supported by a sufficient acknowledgement.
Text and data mining for non-commercial research
Text and data mining is the use of automated analytical techniques to analyse text and data for patterns, trends and other useful information. Text and data mining usually requires copying of the work to be analysed.
An exception to copyright exists which allows researchers to make copies of any copyright material for the purpose of computational analysis if they already have the right to read the work (that is, they have ‘lawful access’ to the work). This exception only permits the making of copies for the purpose of text and data mining for non-commercial research. Researchers will still have to buy subscriptions to access material; this could be from many sources including academic publishers.
Publishers and content providers will be able to apply reasonable measures to maintain their network security or stability but these measures should not prevent or unreasonably restrict researchers’ ability to text and data mine. Contract terms that stop researchers making copies to carry out text and data mining will be unenforceable.
Obviously, one should bear in mind that some scraping practices can remain unethical even when legal, and extreme caution is always advisable, particularly when scraping personal data (see this in-depth discussion on the subject).
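One concrete courtesy that ethical scrapers extend, independent of the legal position, is honouring a site's `robots.txt` and any crawl-delay it requests. A small sketch using Python's standard `urllib.robotparser` is below; the `robots.txt` content and `example.org` URLs are hypothetical, and in practice you would fetch the file from the target site before crawling it:

```python
from urllib import robotparser

# Hypothetical robots.txt content; in a real run you would set
# rp.set_url("https://example.org/robots.txt") and call rp.read().
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check which paths the site asks crawlers to avoid, and how fast to go.
print(rp.can_fetch("*", "https://example.org/listings"))   # True
print(rp.can_fetch("*", "https://example.org/private/x"))  # False
print(rp.crawl_delay("*"))                                 # 10
```

Respecting the advertised crawl delay (sleeping between requests) also keeps you within the "reasonable measures to maintain network security or stability" that the guidance allows providers to apply.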
It may be good practice to add a note to research articles that use scraped data, such as this one (found in Scarborough et al., 2020):
The use of data collected by the researchers for this study is covered by the Intellectual Property Office’s Exceptions to Copyright for Non-Commercial Research and Private Study (https://www.gov.uk/guidance/exceptions-to-copyright).
As most platforms currently do not allow access to their data through APIs, this is fantastic news for web researchers like myself, and for anybody who was planning to do fieldwork and now, sadly, has to rely more on secondary data that can be safely collected from home.