Information Extraction and Entity Linkage in Historical Crime Records
Applications are invited for the above EPSRC project studentship commencing on 1 October 2018. This project will develop and refine information extraction techniques by working with one of the most intractable, largely unstructured sources in the humanities: historical newspapers. Addressing a challenge identified during the recently completed project, the Digital Panopticon: Tracing London Convicts in Britain & Australia, 1780-1925, this project will develop methods of extracting information about crimes and police court trials from English newspapers for linkage to the existing 'life archives' of convicts in the Digital Panopticon.
This is an interdisciplinary project which involves both humanities and computer science perspectives on data analysis. Applicants from a humanities or engineering background are welcome. They should have either a background in computer science together with an interest in the humanities, or a background in a humanities discipline together with numeracy skills and an aptitude for programming.
The student will become part of the flourishing research communities of both the History Department and the Department of Computer Science, Natural Language Processing Research Group. In addition, they will join a growing body of researchers at our Digital Humanities Institute, one of the UK’s leading Digital Humanities centres.
Application deadline: 5pm, Wednesday 6 June 2018
This project addresses a major technical challenge which arose during work on the recently completed Digital Panopticon project: extracting relevant information from unstructured texts with poor-quality OCR (Optical Character Recognition) for linking to structured text. This is a problem scholars frequently encounter in the Digital Humanities: the messiness and complexity of large bodies of historical text prevent the automated extraction of relevant data.
The Digital Panopticon (DP) is an AHRC-funded Digital Transformations project which linked together information about 90,000 criminals who were convicted at the Old Bailey, London's Central Criminal Court, between 1780 and 1868. Using advanced record-linkage techniques, the project linked records from fifty datasets of criminal justice and civil records to enable 'life archives' to be compiled, mapping the life story of a convict from cradle to grave.
Nineteenth-century newspapers, which contain tens of thousands of reports of offences, trials, and punishments for crimes, were intended to be one of the major datasets included. These reports, particularly of meetings of the police courts, document a much larger and wider range of criminal activity than that which reached the Old Bailey, and form a potentially vital part of many criminal life stories. However, the newspaper texts are voluminous, largely unstructured and frequently contain OCR errors, making information extraction a real challenge. Since the DP lacked the technical expertise to address these challenges, this element was dropped.
Concurrently with the Digital Panopticon project, researchers in the Natural Language Processing Group in Computer Science at Sheffield have been making advances in information extraction (IE) techniques. IE, sometimes referred to as text mining, aims to develop algorithms to identify mentions in unstructured text of a specified set of entity types (e.g. persons, locations, dates), their attributes (e.g. age, gender) and specified relations between entities of these types (e.g. person lives in location, person was born on date). The aim is to extract information about particular types of entities in order to create and maintain a structured knowledge base from unstructured text sources. Challenges include ambiguous names (e.g. John Smith), which occur frequently in newspaper records.
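As a rough illustration of what extracting an entity and its attributes means in practice, the sketch below pulls a person's name, age and occupation out of a single sentence. It is a toy rule-based example, not the machine learning approach the project envisages: the sentence layout ("Name, age, occupation,"), the pattern, and the names PersonMention, PATTERN and extract_mentions are all assumptions made for illustration, and real newspaper text with OCR errors would be far less regular.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class PersonMention:
    """One extracted person entity with two attributes (hypothetical schema)."""
    name: str
    age: int
    occupation: str


# Assumed layout: "John Smith, 34, labourer," or "Mary Ann Jones, aged 19, servant,"
PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+), "   # two or more capitalised words
    r"(?:aged )?(?P<age>\d{1,3}), "               # age, optionally preceded by "aged"
    r"(?P<occupation>[a-z]+(?: [a-z]+)?),"        # one or two lower-case words
)


def extract_mentions(text: str) -> List[PersonMention]:
    """Return every person mention in the text that fits the assumed layout."""
    return [
        PersonMention(m["name"], int(m["age"]), m["occupation"])
        for m in PATTERN.finditer(text)
    ]


report = "At Bow Street, John Smith, 34, labourer, was charged with stealing a watch."
mentions = extract_mentions(report)
```

Each PersonMention is a small structured record of the kind that could populate a knowledge base; the hard part, which this sketch ignores, is coping with variation, OCR noise and ambiguity.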
The proposed project will refine and develop existing techniques of information extraction in order to identify relevant information in newspaper reports of crimes and police court proceedings and link the names in these records to their life archives in the DP, where these exist. The main challenges of the project are to 1) identify relevant information in the digitised newspapers; 2) extract as much relevant personal information (such as age, dwelling place, and occupation) about named individuals as possible from those texts; and 3) automatically link such individuals to relevant life archives in the DP. Taken together, this is a tremendously exciting challenge: the project offers an opportunity for genuine cross-disciplinary work between historians and computer scientists that will be a world first in terms of applying state-of-the-art IE techniques to historical data of real interest to historians.
Methods and outcomes:
The project will proceed in four stages:
First, the student will conduct a user survey of scholars who are currently conducting research in the DP and newspapers, to identify their research questions and search strategies, and the information retrieval techniques they currently use. A diverse group of professional historians, post-docs, and PGR students in history, English and criminology will be interviewed.
Second, using these and other applicable methods, a sample of newspaper reports will be read and relevant content manually identified, in order to provide a foundation for machine learning of the linguistic patterns of the information that needs to be extracted. These machine learning techniques will be refined through an iterative process of testing and development.
Third, algorithms will be developed to allow the information identified to be linked to the Life Archives in the DP. Again, these methods will be refined in an iterative process. Of particular interest here are questions of how to merge and reconcile redundant and potentially conflicting information from different sources and how to deal with information that has been extracted with differing levels of confidence.
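To make the reconciliation question concrete, here is a minimal sketch of one possible strategy: confidence-weighted voting over conflicting attribute values. This is only one of many options and is not a method specified by the project; the Claim tuple layout, the reconcile function and the example values are hypothetical, and the sketch assumes every confidence is a positive number.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (attribute, value, confidence); confidences are assumed to be positive
Claim = Tuple[str, str, float]


def reconcile(claims: List[Claim]) -> Dict[str, Tuple[str, float]]:
    """Pick, per attribute, the value with the largest summed confidence.

    Returns attribute -> (winning value, its share of the total confidence).
    """
    scores: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for attribute, value, confidence in claims:
        scores[attribute][value] += confidence

    merged = {}
    for attribute, by_value in scores.items():
        best = max(by_value, key=by_value.get)
        total = sum(by_value.values())
        merged[attribute] = (best, by_value[best] / total)
    return merged


# Invented example: two reports agree on the age, a third (perhaps an OCR misread) does not
claims = [
    ("age", "34", 0.9),
    ("age", "84", 0.4),
    ("age", "34", 0.5),
    ("occupation", "labourer", 0.7),
]
merged = reconcile(claims)
```

Summing confidences rewards values that several sources agree on, while the returned share gives a crude indication of how contested each merged value is. A real system would also need to decide when the evidence is too weak to merge at all.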
Finally, the results of this process will be evaluated by re-interviewing the user group, in order to assess the accuracy and relevance of the evidence generated. A wider sample of scholars in the humanities and social sciences will also be interviewed to consider the potential for the methods developed to be applied to other research questions involving the use of large bodies of messy, semi-structured texts.
Project outcomes will be:
- Greater depth and comprehensiveness of content in the DP, which will enhance scholarship in history and criminology, as well as its public impact.
- Advances in information extraction and entity linkage techniques which can be applied to a much wider range of datasets in other subject domains.
The studentship will commence on 1 October 2018. It will cover tuition fees at the EU/UK rate and provide an annual maintenance stipend at standard Research Council rates (£14,777 in 2018/19) for 3.5 years.
The general eligibility requirements are:
- Applicants should normally have studied in a relevant field to a very good standard at MA level, or have equivalent experience.
- Applicants should also have a 2.1 in a BA degree, or equivalent qualification, in a related discipline.
- EPSRC studentships are only available to students from the UK or European Union. Applications cannot be accepted from students liable to pay fees at the Overseas rate. Normally, UK students who meet the residency criteria will be eligible for a full award, which pays fees and a maintenance grant. EU students will be eligible for a fees-only award, unless they have been resident in the UK for 3 years immediately prior to taking up the award.
How to apply
To apply for the studentship, applicants need to apply directly to the University of Sheffield for entrance into the doctoral programme:
- Complete an application for admission to the standard PhD programme here.
- Supporting documents can be uploaded to your application.
Any academic enquiries for the Sheffield studentship can be directed to Professor Bob Shoemaker (firstname.lastname@example.org) or Professor Rob Gaizauskas (email@example.com). Any questions about the application process should be directed to Beky Hasnip (firstname.lastname@example.org).