Natural Language Processing
The Natural Language Processing Research Group, established in 1993, is one of the largest and most successful language processing groups in the UK and has a strong global reputation.
Natural Language Processing (NLP) is an interdisciplinary field that uses computational methods to analyse, understand and generate human language.
The group's research interests fall into the broad areas of:
Information Access: Building applications to improve access to information in massive text collections, such as the web, newswires and the scientific literature. Subtopics include: information extraction, text mining and semantic annotation, question answering, summarization.
Language Resources and Architectures for NLP: Providing resources - both data and processing resources - for research and development in NLP. Includes platforms for developing and deploying real world language processing applications, most notably GATE, the General Architecture for Text Engineering.
Machine Translation: Building applications to translate automatically between human languages, allowing access to the vast amount of information written in foreign languages and easier communication between speakers of different languages.
Human-Computer Dialogue Systems: Building systems to allow spoken language interaction with computers or embodied conversational agents, with applications in areas such as keyboard-free access to information, games and entertainment, and artificial companions.
Detection of Reuse and Anomaly: Investigating techniques for determining when texts or portions of texts have been reused or where portions of text do not fit with surrounding text. These techniques have applications in areas such as plagiarism and authorship detection and in discovery of hidden content.
Foundational Topics: Developing applications with human-like capabilities for processing language requires progress in foundational topics in language processing. Areas of interest include: word sense disambiguation, semantics of time and events.
The NLP group's research has received support from: the EU's Framework Programmes (Frameworks 4, 5, 6 and 7) as well as Horizon 2020 and the European Research Council, the UK Research Councils (EPSRC, BBSRC, MRC and AHRC) and various governmental and industrial sponsors, including GlaxoSmithKline and IBM.
These are the current members of the NLP group. Click on a name to see their home page.
2018 - 2019
4 July 2019 - Daniel Beck (University of Melbourne) - Natural Language Generation in the Wild
Traditional research in NLG focuses on building better models and assessing their performance using clean, preprocessed and curated datasets, as well as standard automatic evaluation metrics. From a scientific point of view, this provides a controlled environment where different models can be compared and robust conclusions can be drawn.
Bio: Daniel is a Lecturer at The University of Melbourne. His main research topic is Natural Language Generation, with a focus on Machine Translation. He is particularly interested in using tools from Machine Learning, Theoretical Computer Science and Statistics to address challenges in NLG that go beyond the usual input-output pipeline. He obtained a PhD from The University of Sheffield, United Kingdom, and his thesis on using Gaussian Processes for NLP applications received a Best Thesis Award from the European Association for Machine Translation. Daniel is also an advocate for queer and LGBT+ visibility in STEM, in particular within NLP and Machine Learning. He is currently a board member of the Widening NLP initiative (www.winlp.org), which fosters inclusivity for underrepresented groups in NLP. His personal webpage can be found at https://beckdaniel.wordpress.com and he tweets at https://twitter.com/beck_daniel
23 May 2019 - Karin Verspoor (University of Melbourne) - Natural Language Processing (NLP) for structuring complex biomedical texts: progress and remaining challenges
The NLP community has been focused on methods for identifying and extracting key concepts and relations from highly specialised and terminology-rich texts; these texts have posed a challenge to general NLP tools as well as providing an opportunity to explore the robustness of relation extraction methods to domain-specific applications. In this talk I will present our recent studies with graph kernels and neural methods for relation extraction from the biomedical literature, present empirical work on core supporting tasks such as syntactic analysis of these texts, and discuss open challenges for work in this direction and beyond.
Bio: Karin Verspoor is a Professor in the School of Computing and Information Systems and Deputy Director of the Health and Biomedical Informatics Centre at the University of Melbourne. Trained as a computational linguist, Karin’s research primarily focuses on extracting information from clinical texts and the biomedical literature using machine learning methods to enable biological discovery and clinical decision support. Karin held previous posts as the Scientific Director of Health and Life Sciences at NICTA Victoria Research Laboratory, at the University of Colorado School of Medicine, and at Los Alamos National Laboratory. She also spent 5 years in start-ups during the US tech bubble, where she helped design an early artificial intelligence system.
11 April 2019 - Ryan Cotterell (University of Cambridge) - Probabilistic Typology: Deep Generative Models of Vowel Inventories
Linguistic typology studies the range of structures present in human language. The main goal of the field is to discover which sets of possible phenomena are universal, and which are merely frequent. For example, all languages have vowels, while most—but not all—languages have an [u] sound. In this paper we present the first probabilistic treatment of a basic question in phonological typology: What makes a natural vowel inventory? We introduce a series of deep stochastic point processes, and contrast them with previous computational, simulation-based approaches. We provide a comprehensive suite of experiments on over 200 distinct languages.
Bio: Ryan is a lecturer (≈assistant professor) of computer science at the University of Cambridge. He specializes in natural language processing, computational linguistics and machine learning, focusing on deep learning and statistical approaches to phonology, morphology, linguistic typology and low-resource languages. He will receive his Ph.D. in Spring 2019 from the computer science department of the Johns Hopkins University, where he was affiliated with the Center for Language and Speech Processing; he was co-advised there by Jason Eisner and David Yarowsky. He has received best paper awards at ACL 2017 and EACL 2017 and two honorable mentions for best paper at EMNLP 2015 and NAACL 2016. Previously, he was a visiting Ph.D. student at the Center for Information and Language Processing at LMU Munich supported by a Fulbright Fellowship and a DAAD Research Grant under the supervision of Hinrich Schütze. His PhD was supported by an NDSEG graduate fellowship, the Fredrick Jelinek Fellowship, and a Facebook Fellowship.
4 April 2019 - Walid Magdy (University of Edinburgh) - Online Users' Behaviour Understanding and Prediction with Data Science
Considerable public concern has emerged recently about what social media data can reveal about users. In this talk, I present examples of how “public” social media data can be explored with data science to predict users’ behaviour and societal trends, including public interest, individual preferences, and personal information. Example studies on the US election, hate speech, opinion change, and fake accounts are covered.
Bio: Walid Magdy is an assistant professor at the School of Informatics, the University of Edinburgh (UoE) and a faculty fellow at the Alan Turing Institute. His main research interests include computational social science, information retrieval, and data mining. He holds a PhD from the School of Computing at Dublin City University (DCU), Ireland. He has an extensive industrial background, having worked earlier for IBM, Microsoft, and QCRI. Walid has over 60 peer-reviewed articles published in top-tier conferences and journals, and has 9 patents filed under his name. Some of his work has been featured in the popular press, including CNN, the BBC, the Washington Post, National Geographic, and MIT Technology Review.
28 March 2019 - Arpit Mittal (Amazon Research Cambridge) - Learning when not to answer
I will talk about our recent work where we investigate the challenges of using reinforcement learning agents for question-answering over knowledge graphs for real-world applications. We examine the performance metrics used by state-of-the-art systems and determine that they are inadequate for such settings. More specifically, they do not evaluate the systems correctly for situations when there is no answer available and thus agents optimized for these metrics are poor at modelling confidence. We introduce a simple new performance metric for evaluating question-answering agents that is more representative of practical usage conditions, and optimize for this metric by extending the binary reward structure used in prior work to a ternary reward structure which also rewards an agent for not answering a question rather than giving an incorrect answer. We show that this can drastically improve the precision of answered questions while declining to answer only a limited number of questions that were previously answered correctly. Employing a supervised learning strategy using depth-first-search paths to bootstrap the reinforcement learning algorithm further improves performance.
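To make the ternary reward concrete, here is a minimal sketch; the abstention signal and the specific values are illustrative assumptions, not the paper's exact formulation.

```python
def ternary_reward(predicted_answer, gold_answers):
    """Illustrative ternary reward for a KG question-answering agent.

    Correct answers are rewarded, wrong answers penalised, and abstaining
    receives a small positive reward. Values are assumptions for illustration.
    """
    if predicted_answer is None:        # the agent chose not to answer
        return 0.2                      # hypothetical abstention reward
    if predicted_answer in gold_answers:
        return 1.0                      # correct answer
    return -1.0                         # incorrect answer is the worst case
```

Under such a scheme, abstaining strictly dominates answering incorrectly, so an under-confident agent learns to stay silent rather than guess.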
Bio: Dr Arpit Mittal is a Senior Machine Learning Scientist at Amazon Research Cambridge. He is currently working on projects involving knowledge extraction, information retrieval and question answering. Before joining Amazon, Arpit worked on augmented reality (AR) and made fundamental contributions to an industrial AR SDK: Vuforia. He received his PhD from the University of Oxford in Computer Vision and Machine Learning. Within Amazon, Arpit manages the research internship program for their Cambridge UK office.
21 March 2019 - Vlad Niculae (Instituto de Telecomunicações, Lisbon, Portugal) - Learning with Sparse Latent Structure
Structured representations are a powerful tool in machine learning, and in particular in natural language processing: The discrete, compositional nature of words and sentences leads to natural combinatorial representations such as trees, sequences, segments, or alignments, among others. At the same time, deep, hierarchical neural networks with latent representations are increasingly widely and successfully applied to language tasks. Deep networks conventionally perform smooth, soft computations resulting in dense hidden representations.
We study deep models with structured and sparse latent representations, without sacrificing differentiability. This allows for fully deterministic models which can be trained with familiar end-to-end gradient-based methods. We demonstrate sparse and structured attention mechanisms, as well as latent computation graph structure learning, with successful empirical results on large scale problems including sentiment analysis, natural language inference, and neural machine translation.
Joint work with Claire Cardie, Mathieu Blondel, and André Martins.
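One canonical building block for sparse attention of this kind is the sparsemax transformation (Martins & Astudillo, 2016), a Euclidean projection onto the probability simplex that, unlike softmax, can assign exactly zero weight; a minimal NumPy sketch:

```python
import numpy as np

def sparsemax(z):
    """Project scores z onto the probability simplex.

    Unlike softmax, the output can contain exact zeros,
    yielding a sparse attention distribution.
    """
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum      # coordinates that stay positive
    k_z = k[support][-1]                     # size of the support
    tau = (cumsum[support][-1] - 1) / k_z    # threshold to subtract
    return np.maximum(z - tau, 0.0)

print(sparsemax(np.array([2.0, 1.0, 0.1])))  # -> [1. 0. 0.]: fully sparse
```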
Bio: Vlad is a postdoc in the DeepSpin project at the Instituto de Telecomunicações in Lisbon, Portugal. His research aims to bring structure and sparsity to neural network hidden layers and latent variables, using ideas from convex optimization, and motivations from natural language processing. He earned a PhD in Computer Science from Cornell University in 2018, advised by Claire Cardie. He is co-organizing the NAACL 2019 Workshop on Structured Prediction for NLP (http://structuredprediction.github.io/SPNLP19), and the ACL 2019 Tutorial on Latent Structure Models for NLP.
12 March 2019 - Alfredo Kalaitzis (Element AI) - Enabling human rights experts through data-science and machine learning
I will present our lab's joint work with Amnesty International, leveraging crowd-sourcing to study online abuse against women on Twitter. This is the first hand-in-hand collaboration between human rights activists and machine learning researchers. On the technical front, we carefully curate an unbiased yet low-variance dataset of labeled tweets, analyze it to account for the variability of abuse perception, and establish baselines, preparing it for release to community research efforts. On the social impact front, this study provides the technical backbone for a media campaign aimed at raising public and decision-makers’ awareness and elevating the standards expected from social media companies.
Bio: Alfredo is a Research Engineer in the AI for Good lab in London, working on applications that enable NGOs.
I argue that both human and machine actions are more opaque than is generally realized and will require the kind of explanation that an ethical orthosis might provide in both cases, as aspects of artificial Companions for both human and machine actors.
24 January 2019 - Loïc Barrault (LIUM, University of Le Mans) - Some recent work on neural machine translation
Neural Machine Translation systems are increasingly effective. However, they are still far from reaching human-level quality.
29 November 2018 - Adam Tsakalidis (University of Warwick) - Nowcasting User Behaviour with Social Media and Smart Devices
The adoption of social media and smart devices by millions of users worldwide over the last decade has resulted in an unprecedented opportunity for natural language processing and social sciences. Users publish their thoughts and opinions on everyday issues through social media platforms, while they record their digital traces through their smart devices. Mining these rich resources offers new opportunities in sensing real-world events and indices in a longitudinal fashion. This talk will focus on how to utilise such user-generated content in order to "nowcast" (i.e., predict the current state of) user-specific (a) political and (b) mental health indices, under a real-world and longitudinal setting. The talk will be divided into two parts. In the first part, we will focus on mining social media to infer user voting intention. We model social media users based on the content they share and their network structure over time, aiming to nowcast their political stance under a time-constrained setting (i.e., the Greek bailout referendum of 2015). In the second part, we will also account for heterogeneous information sources about the user (e.g., information derived from users' smart phones, SMS and social media messages), aiming this time to nowcast time-varying and user-specific mental health indices on a longitudinal basis. We will emphasise the importance of sticking to a real-world evaluation setting and present the challenges that current state-of-the-art approaches face when tested under such an evaluation framework. Finally, we will outline open challenges in both domains and provide directions for future research.
Bio: Adam Tsakalidis is a final stage PhD candidate at the University of Warwick (Supervisors: A. I. Cristea and M. Liakata) and is currently working as a Research Associate at The Alan Turing Institute. He holds a PG Diploma in Computer and Communications Engineering (University of Thessaly, Greece) and a MSc in Computer Science and Applications (University of Warwick). Before his PhD, he had worked as a Research Assistant in the SocialSensor project (CERTH/ITI, Greece). His research interests lie in the area of natural language processing, with a particular focus on the longitudinal modelling of user-generated information as a step towards real-time monitoring of real-world indices.
7 November 2018 - Yanai Elazar (Bar-Ilan University) - Adversarial Removal of Demographic Attributes from Text Data
Recent advances in Representation Learning and Adversarial Training seem to succeed in removing unwanted features from the learned representation. We show that demographic information of authors is encoded in -- and can be recovered from -- the intermediate representations learned by text-based neural classifiers. The implication is that decisions of classifiers trained on textual data are not agnostic to -- and likely condition on -- demographic attributes. When attempting to remove such demographic information using adversarial training, we find that while the adversarial component achieves chance-level development-set accuracy during training, a post-hoc classifier, trained on the encoded sentences from the first part, still manages to reach substantially higher classification accuracies on the same data. This behavior is consistent across several tasks, demographic properties and datasets. We explore several techniques to improve the effectiveness of the adversarial component. Our main conclusion is a cautionary one: do not rely on the adversarial training to achieve invariant representation to sensitive features.
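A common way to implement such an adversarial component is a gradient reversal layer (Ganin & Lempitsky, 2015): an adversary tries to predict the protected attribute from the encoder's representation, while reversed gradients push the encoder to erase it. A minimal PyTorch sketch, with the wiring and module sizes assumed for illustration:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Hypothetical wiring: the encoder feeds the main task head with normal
# gradients, and an adversary that predicts the demographic attribute
# through the reversal layer, pushing the encoder towards
# attribute-invariant representations.
encoder = torch.nn.Linear(300, 128)        # stand-in text encoder
adversary = torch.nn.Linear(128, 2)        # predicts e.g. a binary attribute

h = encoder(torch.randn(8, 300))           # batch of sentence encodings
attr_logits = adversary(grad_reverse(h))   # adversarial branch
```

The talk's cautionary finding is precisely that this component can reach chance-level accuracy during training while the attribute remains recoverable post hoc.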
18 October 2018 - Shadrock Roberts (Ushahidi) - Natural Language Processing for Humanitarian Response: a view from the field
Drawing on real-life case studies from Nepal, Indonesia, and Kenya, I will provide an overview of how crowdsourced and social media data are used or ignored in humanitarian response and the challenges they pose for practitioners. I will then present early-stage software prototypes, designed in response to these challenges, that use the GATE open source NLP toolkit to identify context, actionability, and veracity in social media and crowdsourced data, in order to speed up and prioritize the delivery of humanitarian aid. Speaking as a practitioner, I will also propose avenues for impactful research and design to help increase the adoption of new tools and methods.
Bio: Shadrock Roberts is a humanitarian geographer and the Director of resilience and research programs at the Kenyan non-profit, Ushahidi, which builds open source software to crowdsource information for humanitarian response. He has worked for a variety of humanitarian and development organizations in multiple countries and holds a Ph.D. in Geography from the University of Georgia. His career has focused on the intersection of geographic information systems, information and communication technologies, and community engagement to improve the availability of data for humanitarian and development assistance. He has only recently learned what a “chip butty” is, and remains unclear on the concept.
2017 - 2018
7 June 2018 - Peter Cochrane (University of Suffolk) - Self Awareness: The Next BIG Breakthrough in NLP
For more than 50 years, the dream of talking to a machine at a (human) conversational level has always been 30 years in the future. However, recent advances in computer, sensor, network, robotic, and mobile device hardware have brought that horizon much closer. In short: transistor density and connectivity per chip, along with network complexity, have crossed a critical threshold and accelerated the abilities of AI.
22 March 2018 - Marco Damonte (University of Edinburgh) - Natural Language Understanding with Abstract Meaning Representation
Abstract Meaning Representation (Banarescu et al., 2013), or AMR for short, is a semantic representation that provides sentences with a deep semantic interpretation. AMR subsumes most of the shallow-semantic NLP tasks that are usually addressed separately, such as named entity recognition, semantic role labeling and coreference resolution. AMR is not an interlingua, but AMR graphs can be exploited for a number of NLP tasks such as machine translation, summarisation and paraphrasing. Text-to-AMR parsing and AMR-to-text generation are, however, still far from producing and consuming sufficiently accurate graphs for downstream applications. Moreover, not much work has been carried out on AMR for languages other than English. In this talk I’ll present my work on addressing these issues.
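For readers unfamiliar with the formalism, the classic example from Banarescu et al. (2013) annotates "The boy wants to go" as follows; the re-entrant variable b is what makes this a graph rather than a tree:

```python
# The classic AMR example for "The boy wants to go" (Banarescu et al., 2013),
# written in PENMAN notation. The variable b is reused as the :ARG0 of go-01,
# capturing that the boy is both the wanter and the goer.
amr = """
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
"""
```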
1 March 2018 - Wang Ling (Google DeepMind)
1 February 2018 - Johannes Welbl (University College London) - Constructing Datasets for Multi-hop Reading Comprehension Across Documents
Contemporary Reading Comprehension (RC) datasets, such as SQuAD and TriviaQA, are dominated by queries that can be answered with a single paragraph or document. However, enabling models to combine pieces of textual information from different sources would drastically extend the scope of RC. In this talk, I will introduce a novel multi-hop RC task, where a model has to learn how to find and combine disjoint pieces of textual evidence, effectively performing multi-step (i.e., multi-hop) inference.
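As a toy illustration of the setting (the example below is assumed here, not taken from the dataset): no single document answers the query, so the model must chain evidence across both.

```python
# Toy multi-hop example: answering the query requires hopping from the
# first document ("Hanging Gardens" -> "Mumbai") to the second
# ("Mumbai" -> "India").
docs = [
    "The Hanging Gardens are terraced gardens in Mumbai.",
    "Mumbai is a city in the state of Maharashtra, India.",
]
query = ("Hanging Gardens", "country", "?")  # entity, relation, unknown
answer = "India"
```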
18 January 2018 - Horacio Saggion (Universitat Pompeu Fabra) - Mining and Enriching Scientific Text Collections
In the current online Open Science context, scientific datasets and tools for deep text analysis, visualization and exploitation play a major role. I will present a system developed over the past three years for “deep” analysis and annotation of scientific text collections. After a brief overview of the system and its main components, I will present our current work on the development of a bilingual (Spanish and English) fully annotated text resource in the field of natural language processing that we have created with our system. Moreover, a faceted-search and visualization system to explore the created resource will also be discussed.
I will take the opportunity to present further areas of research carried out in our Natural Language Processing group.
7 December 2017 - Miquel Espla-Gomis (Universitat d'Alacant) - Identifying insertion positions in word-level machine translation quality estimation
Machine translation (MT) quality estimation (QE) is the task of predicting the quality of a translation produced by an MT system without having a reference translation. At the level of sentences, quality is usually estimated in terms of the effort required to fix the translation, trying to predict metrics such as translation error rate (TER) or post-editing time. When it comes to word level, QE is usually tackled as the task of identifying which words in the translation need to be replaced or deleted. The main advantage of word-level MT QE over sentence- or document-level MT QE is that it can be used to help post-editors focus their attention on those parts of the translation that need to be fixed. However, with the current approach of only identifying the words that need to be fixed, post-editors using word-level MT QE could be overlooking missing words. In order to improve the performance of such systems, we propose an approach capable of identifying both the words that need to be deleted and the positions where one or more words need to be inserted. The work presented compares different types of simple neural network architectures that build on different sources of bilingual information in order to provide such predictions. The results obtained not only confirm the feasibility of the proposed approach, but also show that a reasonably high performance on both tasks can be obtained using relatively simple architectures.
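Concretely, this amounts to predicting one tag per token of the MT output plus one tag per gap between tokens; a small illustration with tag names assumed for this sketch:

```python
# Illustrative word-level QE labelling with insertion positions.
# Each token gets OK/BAD; each gap (including one before the first token
# and one after the last) gets OK/INSERT. Tag names are assumptions.
mt_output  = ["the", "cat", "sat", "mat"]          # hypothetical MT hypothesis
token_tags = ["OK", "OK", "OK", "OK"]              # no token to delete/replace
gap_tags   = ["OK", "OK", "OK", "INSERT", "OK"]    # words missing before "mat"

# Invariant: one gap per token boundary plus the two sentence edges.
assert len(gap_tags) == len(mt_output) + 1
```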
16 November 2017 - Zeerak Waseem (The University of Sheffield) - Why the F*ck do You Talk Like That?
Over the past year, abusive language detection has received a surge of interest from the NLP community. In spite of this surge of interest, very little work grounds itself in social scientific theories of abusive language. In addition, little work deals with the social contexts surrounding abusive statements or with bridging the gaps introduced by switching between different social contexts.
The paper examines Bostrom’s notion of Superintelligence and argues that, although we should not be sanguine about the future of AI or its potential for harm, superintelligent AI is highly unlikely to come about in the way Bostrom imagines.
2 November 2017 - Emem Rita Usanga (Bnkability) - Rethinking how investment is raised in Africa using NLP
With a $100bn annual infrastructure funding deficit over the next 10 years and a population anticipated to double by 2045, infrastructure across the African continent is a pressing need. Governments acknowledge this can only be addressed in partnership with private investors. The problem: international private investors often argue there's a lack of bankable projects in Africa.
This is an interactive session where we present our challenges in the application of NLP to our business solution and attendees propose possible solutions.
26 October 2017 - NLP Student Talks
Chiraag Lala - Multimodal Lexical Translation
Inspired by the tasks of Multimodal Machine Translation and Visual Sense Disambiguation we introduce a task called Multimodal Lexical Translation (MLT). The aim of this new task is to correctly translate an ambiguous word given its context - an image and a sentence in the source language. To facilitate the task, we introduce the MLT datasets, where each data point is a 4-tuple consisting of an ambiguous source word, its visual context (an image), its textual context (a source sentence), and its translation that conforms with the visual and textual contexts. The dataset has been created from the Multi30K corpus using word-alignment followed by human inspection for English to German and English to French language directions. These datasets form a very valuable multimodal and multilingual language resource with several potential uses including evaluation of lexical disambiguation within (Multimodal) Machine Translation systems.
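As an illustration (the data point below is invented, not drawn from Multi30K), each MLT instance pairs an ambiguous word with the two contexts that disambiguate it:

```python
# A hypothetical MLT data point for English->German (all values invented):
# (ambiguous source word, visual context, textual context, translation)
example = (
    "seal",                              # ambiguous: the animal vs. a stamp
    "images/seal_on_rocks.jpg",          # hypothetical image path
    "A seal is resting on the rocks.",   # source sentence
    "Robbe",                             # German translation: the animal sense
)
```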
Fernando Alva-Manchego - Sentence Simplification via Sequence Labeling
Text Simplification aims to modify the content and structure of a text in order to make it easier to read and understand. At the sentence level, several rewriting operations can be performed to achieve this goal: replacing complex words or phrases with simpler synonyms, deleting unimportant content, splitting the sentence, etc. Most research treats sentence simplification as machine translation (MT), with complex and simple as source and target languages, respectively. In this talk, we will first present an in-depth analysis of the potential and limitations of end-to-end MT-style models using automatic and manual evaluations. To deal with some of the identified problems, we devise a two-step sequence labeling method: (i) identify the simplification operations that need to be performed (if any) on each token of the sentence, and (ii) execute the operations using transformation-specific strategies. We show that this operation-based approach is able to produce simpler texts than end-to-end models.
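A minimal sketch of this two-step, operation-based pipeline; the tag inventory and the replacement strategy below are assumptions for illustration, not the paper's exact design:

```python
def simplify(tokens, tags):
    """Step 2: execute the per-token operations predicted in step 1."""
    output = []
    for token, tag in zip(tokens, tags):
        if tag == "KEEP":
            output.append(token)
        elif tag == "DELETE":
            continue                            # drop unimportant content
        elif tag == "REPLACE":
            output.append(simpler_synonym(token))
    return " ".join(output)

def simpler_synonym(token):
    # Hypothetical transformation-specific strategy: a tiny lexicon lookup.
    return {"commence": "start", "utilise": "use"}.get(token, token)

print(simplify(["We", "commence", "immediately"],
               ["KEEP", "REPLACE", "KEEP"]))    # -> "We start immediately"
```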
19 October 2017 - Kris Cao (University of Cambridge) - Latent variable models of language
Behind the observed surface form of language exist underlying structures and themes, such as syntax, topic and utterance intent. In this talk, I will present some work which composes graphical models to learn underlying variables with powerful data likelihood functions to model the observed surface form. One such application is in open-domain dialogue modelling, where the latent variables capture the variation in the possible responses to a user utterance. We show that the latent variable approach generates more acceptably diverse output, as measured by human annotators. Another is extending topic models to instead learn topics underlying entire sentences, rather than just words. This lets the model learn topics which capture compositional meaning, which a standard word-level model has difficulty doing.
12 October 2017 - Shashi Narayan (University of Edinburgh) - Text-to-text Generation Beyond Machine Translation
In recent years we have witnessed the achievements of sequence-to-sequence encoder-decoder models for machine translation.
In this talk I will discuss two examples, sentence simplification and document summarization, that explore the hypothesis that tailoring the model with knowledge of the task structure and linguistic requirements leads to better performance. In the first part, I will propose a new sentence simplification task (split-and-rephrase) where the aim is to split a complex sentence into a meaning preserving sequence of shorter sentences. I will show that the semantically-motivated split model is a key factor in generating fluent and meaning preserving rephrasings.
Bio: Shashi Narayan is a postdoctoral researcher in the School of Informatics at the University of Edinburgh. He obtained his PhD in Computer Science from the University of Lorraine (INRIA), under the supervision of Claire Gardent, in 2014. His research focuses on natural language generation and understanding, with an aim to develop general frameworks for generation from underlying meaning representations or for text rewriting such as summarization, text simplification and paraphrase generation. He also has experience with parsing and other structured prediction problems.
4 September 2017 - Thushari Atapattu (University of Adelaide) - Discourse Analysis of Educational Big Data
NLP Reading Group
The target audience is all the members of the NLP group and other possible interested participants.
The meeting will take place weekly, usually on Mondays from 1:00 to 2:30pm.
The meetings of the group will be informal, and no preparation will be required beyond the moderator reading the current paper in full and the rest of the group having at least a brief overview of it.
Full details of the reading group can be found at https://github.com/sheffieldnlp/reading_group/blob/master/reading_group.md
Funded Research Projects