Text and Data Mining

Text and data mining (TDM) is the process of extracting information from existing files, usually using computational methods. The files being mined can range from a single document, to a database, to entire social media platforms. There are several steps to most TDM activity, including data cleansing and indexing, but the first step is to gather the sources to be used. A temporary copy of the files is often made to enable the content to be processed, and this has implications for copyright.
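
As a rough illustration of the cleansing and indexing steps mentioned above, the following Python sketch lowercases some text, splits it into words, and records which documents each word appears in. It is only a minimal example under stated assumptions: the function names (clean, build_index) and the two sample documents are hypothetical, and real TDM projects typically use far more sophisticated tools.

```python
import re
from collections import defaultdict


def clean(text):
    """Cleansing step: lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Indexing step: map each word to the sorted names of documents containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in clean(text):
            index[word].add(name)
    return {word: sorted(names) for word, names in index.items()}


# Hypothetical sources, standing in for copies of files you have lawful access to.
documents = {
    "doc1": "Copyright law and data mining.",
    "doc2": "Mining text: data, data, DATA!",
}

index = build_index(documents)
print(index["data"])       # ['doc1', 'doc2']
print(index["copyright"])  # ['doc1']
```

Note that the sketch works on copies of the source text held in memory, which is the kind of temporary copy that raises the copyright questions discussed below.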

Copyright and Text & Data Mining

An exception to copyright, introduced into UK law in 2014, allows you to make copies of whole works to which you have ‘lawful access’ (such as via a library subscription) for the purpose of text and data mining. This exception does not permit commercial use, nor does it allow the copies to be transferred to anyone who does not already have access to the original materials.

Open licences such as Creative Commons can pose a technical barrier to this kind of reuse for mining, because they generally require attribution for each work used. The law allows some relaxation of attribution, but you will still need to make reasonable efforts to give credit to the creators of the work you use. Releasing data under a licence such as CC0 (no attribution required) can facilitate TDM activity.

Support for Text & Data Mining from the University Library

Many of the databases the Library subscribes to explicitly permit text & data mining. We have collated a list of publishers and resources which students and staff at The University of Sheffield can mine.

If the resource you would like to use is not on this list, or you would like assistance in mining one of our databases, please contact us at copyright@sheffield.ac.uk. We are particularly keen to hear from researchers at the planning stage of their projects, so that we can see how best to support them.

Text and data mining is enabled by default due to the exception for research provided in Section 29A of the Copyright, Designs and Patents Act 1988. This document aims to ensure that we do not sign contracts that would undermine these rights, and that our contracts instead include technical measures allowing University of Sheffield (TUoS) researchers to easily use computational techniques on the content.

Been blocked from Text & Data Mining?

We need researchers to tell us about times when they have been blocked from mining content due to licensing or technical barriers. European copyright law is changing in 2021. To maintain parity, and to secure the strongest possible rights for UK-based researchers, we require examples to help us advocate on your behalf. Fill out LIBER’s short survey here:

Complete the survey