Mine or Not Mine?
Text and Data Mining (TDM) “is the process of deriving information from machine-read material” (Intellectual Property Office).
By analysing large quantities of text and data using automated techniques, it is possible to spot new patterns and trends which would not otherwise be visible.
TDM techniques have been adopted across many disciplines, and locally we have been working with the Digital Humanities Institute, who use these methods in many of their projects.
It is widely accepted that the use of TDM in academic research (a key methodology in what is being called ‘Digital Scholarship’) is growing, and we here in the Library have seen an increase in enquiries for support from our academic community.
We need to ensure that researchers can use our collections computationally as easily as they can using traditional close reading.
Legally this shouldn’t be a problem, as an exception to UK copyright law allows individuals to make copies of whole works to which they have ‘lawful access’ (such as via a library subscription) for this purpose.
In practice it is not always easy to get a full copy of the archives and collections we subscribe to; manual downloading is time-consuming and quickly leads to accounts being suspended due to concerns about fraud, and publishers vary hugely in their awareness of TDM and the costs they apply to provide copies of datasets.
The difficulty is amplified by the fact that publishers who are more engaged with TDM are also often seeking to own the infrastructure on which the mining happens, so they restrict other forms of access to encourage researchers onto platforms they control.
In response to these challenges, we have developed a set of criteria we will take into future negotiations in order to support TDM.
Before outlining these criteria it is worth briefly expanding on the challenges of supporting TDM, and particularly providing computational access to Library collections.
Whilst the Library has held datasets for a long time, this project has forced us to consider in more detail what our criteria should be for ‘collecting’ and hosting a dataset, and the processes we need to build around this.
TDM also provides specific challenges due to the scale of the datasets involved and the often complex licences which accompany them – this project has been a true cross-library collaboration requiring the expertise of a range of staff!
At the beginning of the project, we contacted various publishers to explore how we might gain computational access to the content that we already have a subscription for.
The most common methods are an API, a publisher platform or sending (yes, physically sending) a hard drive with the content loaded onto it.
The publishers we contacted were concentrated in the arts and humanities and provided access to primary sources; they were selected because this is the area with the clearest demand for Library support in TDM.
The responses we received varied considerably, from publishers with no process in place and publishers offering access for free, to others who would charge tens of thousands for hard drives and potentially more for access to their platform if more than a handful of researchers were using it.
For example, Adam Matthew offered a free API whereas ProQuest and Gale, who provide similar content, charge for all of the TDM options they provide.
To reiterate, this is for the content for which the university already pays a subscription fee. Whilst recognising that there is a cost to providing content in several formats, some of the fees we were quoted seemed unreasonable and an example of a supplier taking advantage of having an effective monopoly, often based on content digitised from publicly owned archives.
In addition to these financial and technical barriers, we also found that some contracts with suppliers are not clear on whether TDM is permissible, and if it is, how their content can be mined.
Whilst the exception enshrined in copyright law cannot be overridden by contractual agreement, there is a risk involved in ignoring contracts and our researchers should feel confident when mining datasets we provide.
We also have many suppliers who state that they permit TDM but do not provide a sensible access mechanism or any way of finding out how researchers can do this.
Given the number of resources the Library subscribes to, chasing each supplier individually to find out how we can undertake an activity which is enshrined in law would be time-consuming and frustrating.
Given the technical, financial and contractual barriers which we encountered during this project, we are clear that effective content negotiations will be key to ensuring that we can support the Digital Scholarship activities of our researchers in the future.
Our position for future content negotiations
The position statement we have compiled for future negotiations with publishers is designed to ensure that researchers at the University of Sheffield get the best value from content the Library purchases, a particular concern given the budgetary pressures due to COVID-19.
We are also keen to avoid the situation of paying twice for the same content, whilst recognising that there are costs involved in providing computational access.
This is an opportunity for libraries, with our established relationships with publishers and role in supporting research across the university, to improve the resources available to researchers.
At a minimum, we will be making sure that contracts do not actively block TDM, as this would conflict with the legal right to mine, and that there are no technical measures in place which would prevent TDM.
We are also looking for more realistic retention periods, as some publishers we contacted stated that data derived from their content could not be retained.
As a Library, we support researchers to make their research open and reproducible, which is not possible if we sign contracts with content providers which don’t even allow researchers to comply with funders’ data retention requirements.
Research projects vary in length, but to support research integrity and reproducibility we are working on the assumption that projects last 5 years and that researchers are then required to retain their data for 10 years, so our researchers need to be able to hold derived data for a minimum of 15 years.
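As a rough illustration, the arithmetic behind the 15-year figure can be written out as a small Python sketch. This is only a sketch of the working assumption described above: the function and constant names are hypothetical, and the 5-year and 10-year values are the assumptions stated in the text, not requirements of any particular funder or licence.

```python
# Working assumptions taken from the text above (not funder-specific rules).
PROJECT_YEARS = 5      # assumed length of a research project
RETENTION_YEARS = 10   # assumed data retention period after the project ends


def minimum_holding_years(project_years: int = PROJECT_YEARS,
                          retention_years: int = RETENTION_YEARS) -> int:
    """Years derived data may need to be held, counted from project start."""
    return project_years + retention_years


def licence_covers_retention(licence_years: int) -> bool:
    """Return True if a licence's permitted retention period (in years)
    is at least as long as the minimum holding period."""
    return licence_years >= minimum_holding_years()
```

Under these assumptions, a licence term permitting derived data to be held for 15 years or more would pass the check, while a shorter term would not. A project with a different length or retention requirement would simply pass different values to `minimum_holding_years`.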
In terms of technical access mechanisms, we have to recognise that researchers doing TDM come from different backgrounds and have different technical abilities, so we would ideally like to be able to access datasets via both a user interface and an API.
This variation in the abilities of those undertaking TDM is a problem which was recognised by the suppliers we met, who have generally had to choose one audience to cater for when developing their TDM platforms (the technically capable or the beginner/domain expert).
We also recognise that for many suppliers supporting TDM means enhancing platforms or building entirely new ones and this comes with a cost.
However, the pricing of this needs to be proportionate and transparent, not an add-on at a later date where costs appear to be arbitrary and have little relation to what is actually being provided.
Finally, we want to be able to share the outputs derived from TDM. Researchers are expected by many journals and funders to share the data underlying their articles and monographs, and as we support researchers to share their data it seems hypocritical to supply data under terms which do not permit this.
During this project, we have improved our online advice, worked with the DHI to better understand their needs, established our options for future service and purchased a dataset which we are working on making available to our researchers.
We are currently taking stock and considering what support we can feasibly provide in the future to our academic community in their Digital Scholarship endeavours.
However, we are clear that we will be looking carefully at TDM clauses in future content negotiations with publishers in order that we can mine that which we are allowed to mine.
If you have any questions or suggestions about how the Library can support your TDM and Digital Scholarship ambitions, please contact Gavin Boyce, Head of Faculty & Research Partnership at the University of Sheffield Library.
With thanks to Angus Taggart, Ruth Mallalieu and Steve McIndoe for their work on the TDM project.
Rosie Higman – Research Data Manager
Peter Barr – Head of Content & Collections
Gavin Boyce – Head of Faculty & Research Partnership