Past Seminars

2016 - 2017

6 July 2017 - Iacer Calixto (Dublin City University) - Doubly-Attentive Decoder for Multi-modal Neural Machine Translation

In this talk, I discuss the Multi-modal Neural Machine Translation model in which a doubly-attentive decoder naturally incorporates spatial visual features obtained using pre-trained convolutional neural networks, bridging the gap between image description and translation. Our decoder learns to attend to source-language words and parts of an image independently by means of two separate attention mechanisms as it generates words in the target language. We find that our model can efficiently exploit not just back-translated in-domain multi-modal data but also large general-domain text-only MT corpora. We also report state-of-the-art results on the Multi30k data set.

3 July 2017 - Lieve Macken (Universiteit Gent) - Product and process in translation

Machine translation (MT) is more and more integrated in the translation workflow and under certain circumstances post-editing will presumably become an integral part of the translation process. In addition, raw (unedited) MT output is increasingly being used "as is", e.g. on support sites.

In this talk I will elaborate on how insights of (translation) process research and translation product research can be combined to gain a better understanding of how translators handle MT output and how human readers process raw MT output.

I will illustrate this by means of 4 projects in which my research team is currently involved:

ROBOT: A comparative study of process and quality of human translation and the post-editing of machine translation
SCATE: Smart computer-aided translation environment
ArisToCAT: Assessing the comprehensibility of automatic translations
PreDicT: Predicting Difficulty in Translation

29 June 2017 - PhD Student Talks

22 June 2017 Nikola Mrksic (University of Cambridge) - Neural Belief Tracker: Data-Driven Dialogue State Tracking using Semantically Specialised Vector Spaces

One of the core components of modern spoken dialogue systems is the belief tracker, which estimates the user's goal at every step of the dialogue. However, most current approaches have difficulty scaling to larger, more complex dialogue domains. This is due to their dependency on either: a) Spoken Language Understanding models that require large amounts of annotated training data; or b) hand-crafted lexicons for capturing some of the linguistic variation in users' language. We propose a novel Neural Belief Tracking (NBT) framework which overcomes these problems by building on recent advances in representation learning. NBT models reason over pre-trained, semantically specialised word vectors, learning to compose them into distributed representations of user utterances and dialogue context. Our evaluation on two datasets shows that this approach surpasses past limitations, matching the performance of state-of-the-art models which rely on hand-crafted semantic lexicons and outperforming them when such lexicons are not provided. Finally, we will discuss how the properties of underlying vector spaces impact model performance, and how the fact that the proposed model operates purely over word vectors allows immediate deployment of belief tracking models for other languages.
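As a toy illustration of the core idea (reasoning over pre-trained word vectors instead of a hand-crafted lexicon - this is not the NBT model itself), the sketch below averages invented three-dimensional "embeddings" into an utterance representation and picks the slot value with the highest cosine similarity; all vectors and values are made up for illustration:

```python
import math

# Toy "pre-trained" word vectors (hypothetical values, for illustration only).
VECTORS = {
    "cheap":       [0.9, 0.1, 0.0],
    "inexpensive": [0.8, 0.2, 0.1],
    "expensive":   [0.1, 0.9, 0.0],
}

def compose(words):
    """Average word vectors into a distributed utterance representation."""
    vecs = [VECTORS[w] for w in words if w in VECTORS]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(3)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def track_value(utterance, candidates):
    """Pick the candidate slot value whose vector is closest to the utterance."""
    rep = compose(utterance.lower().split())
    return max(candidates, key=lambda v: cosine(rep, VECTORS[v]))

# "inexpensive" never appears as a slot value, but its vector lies near
# "cheap", so no hand-crafted lexicon entry is needed to resolve it.
print(track_value("somewhere inexpensive", ["cheap", "expensive"]))
```

Semantic specialisation of the vector space (pulling synonyms together, pushing antonyms apart) strengthens exactly this kind of nearest-value matching.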

8 June 2017 Dirk Hovy (University of Copenhagen) - NLP, the Perfect Social (Media) Science?

1 June 2017 Ondrej Dusek (Charles University Prague) -

11 May 2017 PhD Student Talks

4 May 2017 Julie Weeds (University of Sussex) -

30 March 2017 Yannis Konstas (University of Washington) -

23 March 2017 Sebastian Riedel (University College London) -

16 March 2017 Joachim Bingel (University of Copenhagen) -

9 March 2017 Marek Rei (University of Cambridge) -

16 February 2017 Pranava Madhyastha (The University of Sheffield) -

26 January 2017 Marco Turchi (Fondazione Bruno Kessler) -

8 December 2016 PhD student talks

1 December 2016 Francesca Toni (Imperial College London) - From computational argumentation to relation-based argument mining

In this talk I will overview foundations, tools and (some) applications of computational argumentation, focusing on three popular frameworks, namely Abstract Argumentation, Bipolar Argumentation and Quantitative Argumentation Debates (QuADs). These frameworks can be supported by and support the mining of attack/support relations amongst arguments. Moreover, I will discuss the following question: is the use of quantitative measures of strength of arguments, as proposed e.g. for QuADs, a good way to assess the dialectical strength of arguments mined from text or the goodness of argument mining techniques?

24 November 2016 Mikel Forcada (Universitat d'Alacant (Spain)) - Gap-filling as a method to evaluate the usefulness of raw machine translation

Most machine translation is consumed raw by ordinary people wanting to make sense of text in languages they cannot understand, for a variety of purposes. In contrast, while subjective judgements of machine translation quality (fluency, adequacy) have been commonplace for decades, surprisingly little research has addressed the evaluation of the actual usefulness of raw machine-translated text - and almost none the actual way in which readers make sense of it. Direct evaluation is costly as it has to look into the success of machine-translation-mediated tasks. After a quick review of existing indirect approaches, I describe a possible low-cost method to indirectly evaluate the comprehension of machine-translated text by target-language monolinguals, which may effectively be seen as a simplification - and perhaps a generalization - of reading comprehension tests based on questionnaires. Readers of machine-translated excerpts are asked to fill word gaps in a professional translation of the same excerpt. Word gaps can be ______ anywhere in the reference _______, but preferably at content _______. Results of preliminary gap-filling evaluation work are critically reviewed, and suggestions for future research are outlined.
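The gap-filling protocol can be sketched in a few lines. This is a hypothetical simplification: the stop list, the fixed gap spacing, and exact-match scoring are my own assumptions, not details of the method described in the talk.

```python
import re

STOP = {"the", "a", "of", "in", "is", "to", "and"}  # minimal stop list (assumed)

def make_gaps(reference, every=3):
    """Blank out every `every`-th content word of a reference translation."""
    gapped, answers, seen = [], [], 0
    for tok in reference.split():
        word = re.sub(r"\W", "", tok).lower()
        if word and word not in STOP:
            seen += 1
            if seen % every == 0:            # this word becomes a gap
                answers.append(word)
                gapped.append("_" * len(word))
                continue
        gapped.append(tok)
    return " ".join(gapped), answers

def gap_score(answers, fills):
    """Fraction of gaps filled with the expected content word."""
    hits = sum(a == f.lower() for a, f in zip(answers, fills))
    return hits / len(answers)

gapped, answers = make_gaps(
    "the translated text was easy to read and understand", every=2)
```

A reader who has seen the machine translation first should fill more gaps correctly than one who has not; the difference estimates how much the MT output aided comprehension.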

17 November 2016 Gerasimos Lampouras (University of Sheffield) - Imitation learning for language generation from unaligned data

Natural language generation (NLG) is the task of generating natural language from a meaning representation. Rule-based approaches require domain-specific and manually constructed linguistic resources, while most corpus based approaches rely on aligned training data and/or phrase templates. The latter are needed to restrict the search space for the structured prediction task defined by the unaligned datasets. In this work we propose the use of imitation learning for structured prediction which learns an incremental model that handles the large search space while avoiding explicitly enumerating it. We adapted the Locally Optimal Learning to Search framework which allows us to train against non-decomposable loss functions such as the BLEU or ROUGE scores while not assuming gold standard alignments. We evaluate our approach on three datasets using both automatic measures and human judgements and achieve results comparable to the state-of-the-art approaches developed for each of them. Furthermore, we performed an analysis of the datasets which examines common issues with NLG evaluation.
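BLEU is non-decomposable: it cannot be split into independent per-word losses, which is what makes learning-to-search training attractive here. A simplified sentence-level version (clipped n-gram precision with a brevity penalty; the tiny smoothing constant is my own choice, not part of the original metric) might look like:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions up to max_n, times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(sum(c_counts.values()), 1)
        # 1e-9 avoids log(0) when no n-gram matches (assumed smoothing)
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)
```

Because the score only exists for a complete output, incremental training needs roll-outs to completion before the loss can be evaluated - exactly the regime learning-to-search handles.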

3 November 2016 Mark-Jan Nederhof (University of St Andrews) - Transition-based dependency parsing as latent-variable constituent parsing

We provide a theoretical argument that a common form of projective transition-based dependency parsing is less powerful than constituent parsing using latent variables. The argument is a proof that, under reasonable assumptions, a transition-based dependency parser can be converted to a latent-variable context-free grammar producing equivalent structures.
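To make the object of the proof concrete, here is a minimal sketch of one common projective transition system (arc-standard), replaying a given action sequence into dependency arcs; the indices and action names are illustrative, not the talk's notation:

```python
def arc_standard(words, actions):
    """Run an arc-standard transition sequence; return dependency arcs
    as (head, dependent) pairs over word indices (0 = artificial ROOT)."""
    buffer = list(range(1, len(words) + 1))  # 1-based word indices
    stack, arcs = [0], []
    for act in actions:
        if act == "SHIFT":                   # move next word onto the stack
            stack.append(buffer.pop(0))
        elif act == "LEFT-ARC":              # second-top becomes dependent of top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif act == "RIGHT-ARC":             # top becomes dependent of second-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# "economic news": LEFT-ARC attaches "economic" under "news",
# RIGHT-ARC attaches "news" under ROOT.
arcs = arc_standard(["economic", "news"],
                    ["SHIFT", "SHIFT", "LEFT-ARC", "RIGHT-ARC"])
```

The conversion result says that the set of such derivations can be generated by a latent-variable context-free grammar whose nonterminals encode the relevant parser state.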

26 October 2016 Barbara Plank (University of Groningen) - What to do about non-canonical data in Natural Language Processing

Real world data differs radically from the benchmark corpora we use in natural language processing (NLP). As soon as we apply our technology to the real world, performance drops. The reason for this problem is obvious: NLP models are trained on samples from a limited set of canonical varieties that are considered standard, most prominently English newswire. However, there are many dimensions, e.g., socio-demographics, language, genre, sentence type, etc., on which texts can differ from the standard. The solution is not obvious: we cannot control for all factors, and it is not clear how to best go beyond the current practice of training on homogeneous data from a single domain and language.

In this talk, I review the notion of canonicity, and how it shapes our community's approach to language. I argue for the use of fortuitous data. Fortuitous data is data out there that just waits to be harvested. It might be in plain sight, but is neglected (available but not used), or it is in raw form and first needs to be refined (almost ready). It is the unintended yield of a process, or side benefit. Examples include hyperlinks to improve sequence taggers, or annotator disagreement that contains actual signal informative for a variety of NLP tasks. More distant sources include the side benefit of behavior. For example, keystroke dynamics have been extensively used in psycholinguistics and writing research. But do keystroke logs contain actual signal that can be used to learn better NLP models? In this talk I will present recent (on-going) work on keystroke dynamics to improve shallow syntactic parsing. I will also present recent work on using bi-LSTMs for POS tagging, which combines the POS tagging loss function with an auxiliary loss function that accounts for rare words and achieves state-of-the-art performance across 22 languages.

20 October 2016 Leon Derczynski (University of Sheffield) - Building a Diverse Named Entity Recognition Resource

This talk presents a new benchmark in corpus construction methodology. One of the main obstacles hampering method development and comparative evaluation of named entity recognition in social media is the lack of a sizeable, diverse, high quality annotated corpus, analogous to the CoNLL'2003 news dataset. For instance, the biggest Ritter tweet corpus is only 45,000 tokens - a mere 15% of the size of CoNLL'2003. Another major shortcoming is the lack of temporal, geographic, and author diversity. This paper introduces the Broad Twitter Corpus (BTC), which is not only significantly bigger, but sampled across different regions, temporal periods, and types of Twitter users. The gold-standard named entity annotations are made by a combination of NLP experts and crowd workers, which enables us to harness crowd recall while maintaining high quality. We also measure the entity drift observed in our dataset (i.e. how entity representation varies over time), and compare to newswire. The corpus is released openly, including source text and intermediate annotations.

13 October 2016 Isabelle Augenstein (University College London) - Weakly Supervised Machine Reading

The state of the art in natural language processing for high-level end user tasks has advanced to a point where we are seeing more and more usable commercial applications. These include question answering and dialogue systems such as Google Now or Amazon Echo. One of the things that is crucial for building such applications is to automatically understand text, which is also known as machine reading. In this talk, I will highlight methods for different components of machine reading, namely representation learning, structured prediction and automatically generating training data. I will then present ongoing research applying these techniques to the tasks of sentiment analysis, semantic error correction and question answering.

6 October 2016 Sumithra Velupillai, King's College London and KTH Sweden - Extracting temporal information from clinical narratives: existing models, approaches - and challenges for the mental health domain

Accurately extracting temporal information from clinical documentation is crucial for understanding e.g. disease progression and treatment effects. In addition to time-stamped and other structured information in electronic health records, temporal information is conveyed in narrative form. Although techniques for extracting events such as symptoms ("anxiety") and treatments ("Xanax"), time expressions ("May 1st") and time relations ("anxiety before Xanax") from clinical notes have been developed in the Natural Language Processing community with promising results in the past few years, most studies have been performed on heterogeneous clinical specialties and use-cases. Mental health documentation poses several unique challenges, one of which will be addressed in my project on extracting symptom and treatment onset for psychosis patients, to better understand duration of untreated psychosis. In this talk, I will describe my previous work on automated extraction of temporal expressions from clinical text using the clearTK package, a framework for machine learning and NLP with UIMA. I will also describe other state-of-the-art approaches for temporal reasoning in clinical text, and discuss challenges involved in applying and adapting these for extracting onset information from mental health records.

29 September 2016 Savelie Cornegruta (King's College London)

Timeline extraction using distant supervision and joint inference

In timeline extraction the goal is to order the events in which a target entity is involved in a timeline. Due to the lack of explicitly annotated data, previous work is rule-based and uses temporal linking systems trained on previously annotated data. Instead, we propose a distantly supervised approach by heuristically aligning timelines with documents. The noisy training data created allows us to learn models that anchor events to temporal expressions and entities; during testing, the predictions of these models are combined to produce the timeline. Furthermore, we show how to improve performance using joint inference.

Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks
Motivated by the need to automate medical information extraction from free-text radiological reports, we present a bi-directional long short-term memory (BiLSTM) neural network architecture for modelling radiological language. The model has been used to address two tasks, medical named-entity recognition and negation detection. We further investigate whether learning several types of word embeddings improves the performance of the BiLSTM on these tasks. Using a dataset of chest x-ray reports, we show that the BiLSTM model outperforms a baseline rule-based system on the NER task, while for negation detection it approaches the performance of a hybrid system that leverages the hand-crafted rules of the NegEx algorithm and the grammatical relations obtained from the Stanford Dependency Parser.

16 September 2016 Philip Schulz (University of Amsterdam) - Word Alignment with NULL words

Most existing word alignment models assume that source words that do not have a lexical translation in the target language were generated from a hypothetical target NULL word. This NULL word is assumed to exist in any target sentence. From a modeling perspective this is unsatisfactory since our linguistic knowledge tells us that untranslatable source words (e.g. certain prepositions) are required by the source context in which they are found. Moreover, the NULL word does have a position in the target sentence and thus troubles distortion-based alignment models by influencing their distortion distributions in unexpected ways.

We present a Bayesian word alignment model that does not postulate NULL words. Instead, source words that don't have lexical translations are generated from the source context. In the final alignment step, such source words are left unaligned. This leads to more informed distributions over unaligned words because these distributions are now conditioned on source contexts. Our model is also general enough to incorporate different distortion models. Finally, we have developed a fast auxiliary variable Gibbs sampler that makes our model competitive with existing models in terms of training time.

After having presented our alignment model I will shortly discuss plans to extend it to a probabilistic phrase extraction model for machine translation.

8 September 2016 Mikel Forcada (Universitat d'Alacant (Spain)) - Towards an effort-driven combination of translation technologies in computer-aided translation

The talk puts forward a general framework for the measurement and estimation of professional translation effort in computer-aided translation. It then outlines the application of this framework to optimize and seamlessly combine available translation technologies in a principled manner to reduce professional translation effort.

2015 - 2016

28th July 2016 Genevieve Gorrell, University of Sheffield - Identifying First Episodes of Psychosis in Psychiatric Patient Records using Machine Learning.

Natural language processing is being pressed into use to facilitate the selection of cases for medical research in electronic health record databases, though study inclusion criteria may be complex, and the linguistic cues indicating eligibility may be subtle. Finding cases of first episode psychosis raised a number of problems for automated approaches, providing an opportunity to explore how machine learning technologies might be used to overcome them. A system was delivered that achieved an AUC of 0.85, enabling 95% of relevant cases to be identified whilst halving the work required in manually reviewing cases. The techniques that made this possible are presented.
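For reference, AUC can be computed directly as the probability that a randomly chosen positive case is scored above a randomly chosen negative one; a minimal sketch (the scores below are toy values, not the study's data):

```python
def auc(scores, labels):
    """ROC AUC via the rank statistic: probability that a positive case
    outranks a negative one, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a case-selection setting like this one, a high AUC lets reviewers set a score threshold that keeps recall near 95% while cutting the number of records that must be read manually.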

9th June 2016 Rina Dutta, King's College London - Introduction to E-HOST-IT.

e-HOST-IT (Electronic health records to predict HOspitalised Suicide attempts: Targeting Information Technology solutions) aims to determine whether structured and free-text data in electronic mental health records (MHRs) can be used to quantify changes in symptoms, behaviour patterns and health service utilisation, and to predict serious suicide attempts. The South London and Maudsley NHS Foundation Trust (SLaM) is one of the largest mental health care providers in Europe. All staff use an MHR, a completely electronic system that replaced the previous paper notes and in which daily activities, observations, medication, diagnoses, correspondence and all other information relating to patients are recorded.

George Gkotsis, King's College London - Don't Let Notes Be Misunderstood: A Negation Detection Method for Assessing Risk of Suicide in Mental Health Records.

Mental Health Records (MHRs) contain free-text documentation about patients' suicide and suicidality. In this talk, we address the problem of determining whether grammatical variants (inflections) of the word "suicide" are affirmed or negated. To achieve this, we populate and annotate a dataset with over 6,000 sentences originating from a large repository of MHRs. The resulting dataset has high Inter-Annotator Agreement (k=0.93). Furthermore, we develop and propose a negation detection method that leverages syntactic features of text. Using parse trees, we build a set of basic rules that rely on minimal domain knowledge and cast the problem as binary classification (affirmed vs. negated). Since the overall goal is to identify patients who are expected to be at high risk of suicide, we focus on the evaluation of positive (affirmed) cases as determined by our classifier. Our negation detection approach yields a recall (sensitivity) value of 94.6% for the positive cases and an overall accuracy value of 91.9%. We believe that our approach can be integrated with other clinical Natural Language Processing tools in order to further advance information extraction capabilities.
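As a crude surface approximation of the task (the talk's method walks parse trees rather than flat token windows, and the cue list below is invented for illustration):

```python
import re

NEG_CUES = {"no", "not", "denies", "denied", "without", "never"}  # assumed cues
TARGET = re.compile(r"suicid\w*", re.IGNORECASE)  # inflections of "suicide"

def classify(sentence, window=4):
    """Label each 'suicid*' mention affirmed or negated: negated when a
    negation cue occurs within `window` tokens before the mention.
    (Surface heuristic only; parse-tree rules are far more precise.)"""
    tokens = sentence.lower().split()
    labels = []
    for i, tok in enumerate(tokens):
        if TARGET.match(tok):
            prior = tokens[max(0, i - window):i]
            labels.append("negated" if NEG_CUES & set(prior) else "affirmed")
    return labels
```

Parse trees matter precisely where this heuristic fails, e.g. when the negation cue scopes over a different clause than the mention.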

19th May 2016 Gerasimos Lampouras, University of Sheffield - Using imitation learning for language generation from unaligned data.

"Natural language generation (NLG) is the task of generating natural language from a meaning representation. Current rule-based approaches require domain-specific and manually constructed linguistic resources, while most machine-learning based approaches rely on aligned training data and/or phrase templates. The latter are needed in order to restrict the search space for the structured prediction task defined by the unaligned NLG training datasets. I will talk about how we can use an imitation learning algorithm for structured prediction to learn an incremental model and handle the large search space by avoiding explicit enumeration of the outputs. The model focuses on the Locally Optimal Learning to Search framework which allows the model to be trained against non-decomposable loss functions such as the BLEU score while not assuming gold standard alignments.

12th May 2016 Udo Kruschwitz, University of Essex - Steps towards Profile-Based Web Site Search and Navigation.

Web search in all its flavours has been the focus of research for decades with thousands of highly paid researchers competing for fame. Web site search has however attracted much less attention but is equally challenging. In fact, what makes site search (as well as intranet and enterprise search) even more interesting is that it shares some common problems with general Web search but also offers a good number of additional problems that need to be addressed in order to make search on a Web site no longer a waste of time. At previous visits to Sheffield I talked about turning the log files collected on a Web site into usable, adaptive data structures that can be used in search applications (and which we call user or cohort profiles). This time I will focus on applying these profiles to a navigation scenario and illustrate how the automatically acquired profiles provide a practical use case for combining natural language processing and information retrieval techniques. I will touch on the different types of evaluations we have performed which appear to provide a suitable framework for evaluations within the SENSEI project that both Sheffield and Essex are part of.

5th May 2016 James O'Sullivan, University of Sheffield, HRI - How Literary Scholars are using Computers

"Quantitative approaches to literature represent elements or characteristics of literary texts numerically, applying the powerful, accurate, and widely accepted methods of mathematics to measurement, classification, and analysis" (Hoover 517). Digital Literary Studies seeks to equip literary and cultural scholars with the instruments necessary to isolate specific literary elements, and use these to conduct some experiment or calculation in an effort to provide additional insight. It is clear why literary scholars avail of computer-assisted methods: quantitative approaches to literary criticism can lend new forms of empirical evidence to interpretations. These methods are not better and do not seek to subvert the established practices of literary critics, rather, they simply provide new ways of exploring texts, and new types of quantitative evidence through qualitative arguments might be founded or reinforced. This talk will outline how literary scholars are using computation to re-engage with old debates in new ways, validate existing claims, and encounter new discoveries.
Hoover, David L. "Quantitative Analysis and Literary Studies." A Companion to Digital Literary Studies. Ed. Ray Siemens and Susan Schreibman. Malden: Blackwell Publishing, 2013. 517-533. Print.

28th April 2016 Daniel Kershaw, University of Lancaster - Language of online Social Networks Detection, Diffusion, Prediction

Language exists in a constant flux, existing in the duality of social structure and the actions of the agents within the system. As a result, new words come and go from people's vocabularies - e.g. the recent introduction of 'e-cig' and 'vape'. Traditionally, the work required to identify these 'innovations' is both time-consuming and subjective. This talk therefore explores how to identify variation in language, in particular language innovations that occur in and across multiple online social networks (innovations range from a change of spelling to the introduction of new words), and how such investigations can be performed at scale.

21st April 2016 Chris Reed, University of Dundee - Argument Technology and Argument Mining

Argument Technology is that part of the overlap between theories of argumentation and reasoning and those of AI where an engineering focus leads to applications and tools that are deployed. One significant step in the past decade has been the development of the Argument Web -- the idea that many of these tools can interact using common infrastructure, with benefits to academic, commercial and public user groups. More recently, there has been a move towards linguistic aspects of argument, with NLP techniques facilitating the development of the field of Argument Mining. Drawing on the academic success and commercial uptake of techniques such as opinion mining and sentiment analysis, argument mining seeks to build on systems which use data mining to summarise *what* people think by explaining also *why* they hold the opinions they do. In 2013 there were just a handful of papers on the topic; by 2016 there are hundreds, with dozens of research groups worldwide gearing up to tackle the problem. The task is enormously demanding, but the commercial appetite is strong, and as a result Argument Mining is currently emerging as an extremely lively and creative area of NLP research.

14th April 2016 
Christoph Meili, WWF Switzerland - Natural Language Processing to protect Wildlife and Humanity?

WWF's ultimate goal has always been "people living in harmony with nature" - so WWF is about respecting and valuing the natural world and finding ways to share the Earth's resources fairly. To achieve that, WWF spends a lot of time working with communities, with politicians and with businesses too. Earth Hour is a global environmental movement, parented by WWF, which motivates millions of people around the world to use their power to change climate change.
This talk will show how, in collaboration with Sheffield and other research partners, WWF learns how to increase the social and environmental impact of our Earth Hour campaigns.

Miriam Fernandez, The Open University - Talking Climate Change via Social Media: Communication, Engagement and Behaviour.

While individual behaviour change is considered a central strategy to mitigate climate change, public engagement is still limited. Aiming to raise awareness and to promote behaviour change, governments and organisations are conducting multiple pro-environmental campaigns, particularly via social media. However, these campaigns are neither based on, nor do they take advantage of, the existing theories and studies of behaviour change, which could help them better target and inform users.
In this talk we discuss our approach for analysing user behaviour towards climate change based on the 5 Doors Theory of behaviour change. Our approach automatically identifies the behavioural stage that users are in, based on their social media contributions. To do so, NLP tools provided by GATE have been used to automatically identify the linguistic patterns that emerge at different behavioural stages.
In this talk we will describe this linguistic preprocessing performed with GATE, and how it has been applied to analyse the online behaviour of participants of the Earth Hour 2015 and the COP21 Twitter movements. Results of our analysis are used to provide guidelines on how to improve communication via these campaigns.

7th April 2016 Pascal Vaillant, University of Paris - XML annotation of language contact phenomena in plurilingual corpora

Methods in corpus processing have until recently been more focused on multilingual corpora (texts in different languages about the same domain) than on plurilingual corpora (corpora with an internal linguistic heterogeneity). This may be due to the fact that they have emerged in natural language processing contexts, mostly in practical applications to written texts, and not in the field of applied linguistics, where the focus is rather on spontaneous, genuine utterances of non-standard speech, and where phenomena of combined use of different languages are not rare. However, observing -and understanding- language contact phenomena has a growing appeal not only to linguistic specialists, but also to all those who have an interest in mining corpora of spoken language, or non-standard written language. Within the frame of the (French ANR-funded) CLAPOTY project, we have developed an annotation schema in compliance with the latest standards with respect to transcription (Unicode) and markup (XML). This schema follows the inspiration of the TEI (Text Encoding Initiative), extending it where needed (namely, for the annotation of language plurality). In this model, linguistic units (at all levels) may be described as pertaining to one language or another, and even to many languages at the same time. The model is able to represent the richness and versatility of spontaneous linguistic utterances, where speakers actually often "float" between two languages.

17th March 2016 Dirk Hovy, University of Copenhagen - Texts Come from People - How Demographic Factors Influence NLP Models

The way we express ourselves is heavily influenced by our demographic background: we don't expect teenagers to talk the same way as retirees. Natural Language Processing (NLP) models, however, are based on a small demographic sample and approach all language as uniform. As a result, NLP models perform worse on language from demographic groups that differ from the training data, i.e., they encode a demographic bias. This bias harms performance and can disadvantage entire user groups. Sociolinguistics has long investigated the interplay of demographic factors and language use, and it seems likely that the same factors are also present in the data we use to train NLP systems. In this talk, I will show how we can combine statistical NLP methods and sociolinguistic theories to the benefit of both fields. I present ongoing research into large-scale statistical analysis of demographic language variation to detect factors that influence the performance (and fairness) of NLP systems, and how we can incorporate demographic information into statistical models to address both problems.

10th March 2016 Ted Briscoe, University of Cambridge - Creating Automated English Language Learning Tools Using Machine Learning

Massive open online courses (MOOCs) and adaptive learning management systems have been much hyped, but their impact outside of a few subjects like computer science has been modest so far. Andrew Ng's machine learning course (Stanford, Udacity) had over 100,000 students in its first iteration and was life-changing for some of its successful students. FutureLearn, the UK HE MOOC platform developed by the Open University, has several English as a Foreign Language (EFL) MOOCs, but take-up and impact have been modest. Why?
In this talk, I will argue that the missing ingredient and essential difference between on-line computer science and EFL courses is objective automated and meaningful learning-oriented assessment (LOA). LOA to be useful must include both a summative and formative component, must address the appropriate linguistic tasks and skills, must be interpretable and actionable by the learner, and must be integrated into an adaptive and therefore personalised learning platform.
I'll describe how we have exploited supervised and semi-supervised, sometimes deep, machine learning techniques applied to automatically analysed non-native English text to develop LOA for EFL writing. I'll show that discriminative ranking algorithms combined with multitask learning allow us to create state-of-the-art systems for grading text and detecting errors. Finally, I'll discuss some of the difficulties of assessing and providing useful feedback on discourse organisation.
Some of our writing assessment technology is already in use in courseware such as CUP's Empower, in an IELTS practice examination recently launched in China, and in Cambridge English Write and Improve. Much work remains to be done on the integration of full LOA into EFL courseware, but wide-scale deployment of our systems opens up exciting possibilities to incrementally improve their performance and utility.

25th February 2016 
Carla Parra Escartin, Hermes Traducciones y Servicios Lingüísticos - Establishing productivity thresholds for Machine Translation Post-Editing Tasks

Machine Translation (MT) has become a reality in the translation industry. Over the past few years, translators have experienced the introduction of Machine Translation Post-Editing tasks in their workflows. However, the question of whether MT output has a positive impact on productivity is still open. In this talk, I will present an experiment involving 10 professional translators in which we measured their productivity when translating from scratch, post-editing Translation Memory fuzzy matches, and post-editing Machine Translation output. I will discuss the results of our experiment and the productivity thresholds that we identified.
Short bio: Carla Parra Escartin is a post-doctoral researcher at Hermes Traducciones y Servicios Lingüísticos, a Spanish translation company. She works as an Experienced Researcher within the EXPERT ITN. To see her work within the EXPERT project, visit:

Rohit Gupta, University of Wolverhampton - ReVal: A Simple and Effective Machine Translation Evaluation Metric Based on Recurrent Neural Networks

I will discuss the recently proposed machine translation evaluation metric ReVal ("A Simple and Effective Machine Translation Evaluation Metric Based on Recurrent Neural Networks"). The metric is based on dense vector spaces and recurrent neural networks (RNNs) and evaluates translations into English. In the WMT15 metrics task, our metric was the best performing metric overall according to Spearman system-level correlation and second best according to Pearson system-level correlation.
Short bio: Rohit Gupta is an Early Stage Researcher (PhD student) at the University of Wolverhampton under the EXPERT project.
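Metrics such as ReVal are ranked at WMT by how well their scores correlate with human judgements at the system level. A minimal sketch of the two correlations mentioned above, in pure Python, using invented scores for five hypothetical systems (not WMT15 data):

```python
# System-level correlation between a metric and human judgements, as used
# to rank metrics at WMT. All scores below are illustrative.
import math

def pearson(xs, ys):
    """Pearson correlation: linear agreement between the score values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman correlation: Pearson computed over ranks (no ties assumed)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))

metric_scores = [0.42, 0.55, 0.38, 0.61, 0.47]   # hypothetical metric output
human_scores  = [0.40, 0.58, 0.35, 0.65, 0.44]   # hypothetical human scores

print(round(pearson(metric_scores, human_scores), 3))   # close to 1.0
print(round(spearman(metric_scores, human_scores), 3))  # identical ranking
```

Here the metric orders the systems exactly as the humans do, so Spearman is perfect even though the score values differ slightly.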

18th February 2016 Elaine Toms, University of Sheffield (Information School) - Making Sense of Search Results: Getting Beyond 1 to 10 of 10 million

Searching for almost anything using a typical search engine will net a large number of results, presented 10 at a time in an abbreviated form. It may take the search engine a quarter of a second to produce them, but a person may spend hours to days iteratively and laboriously trawling through the results, clicking on relevant items based on that abbreviated form, and then scanning and reading to extract useful information. This is particularly a problem when the work-based scenario that brought the user to the search engine in the first place is complex and multi-dimensional. In essence, search systems understand neither the complexity of the problem facing the user nor how (and how much) the user learns while working out a solution. The next generation of interfaces will need a suite of tools that augment human cognition, assisting the user in extracting useful nuggets and then in making sense of that information, which may require, metaphorically, the Swiss army knife of information tools. This talk will discuss some of these cognitive prostheses, which clearly need NLP for their development.

28th January 2016 Hoang Cuong, University of Sheffield (Visiting Researcher) - Latent Domain Models for Statistical Machine Translation

Mismatch in translation distributions between test data (target domain) and training data is known to harm the performance of statistical translation systems. In this talk I will briefly survey recent work on domain adaptation that aims to address this challenge. I will also present our new, simple yet effective latent-variable framework, showing how it can be applied to a broad range of domain adaptation tasks with minor modifications, including data selection, word-alignment adaptation, phrase-based adaptation, and translation adaptation with rewarding hidden domain invariance. In principle, the induction of hidden domains can be learned from scratch, but I will also show how to improve the model with partial supervision from domain-annotated data.

21st January 2016 Danushka Bollegala, University of Liverpool - Joint Word Representation Learning using Corpora and Ontologies

Learning semantic representations for words is an important step for numerous natural language processing tasks. Classical distributional methods that count the contexts of a word and represent the co-occurrence statistics using vectors have been employed by the NLP community for several decades. Recently, distributed word representation methods that learn representations that can accurately predict the occurrence of a word in a given context have received much attention. Complementary to such data-driven approaches for word representations, much manual effort has been invested in creating high-quality ontologies such as the WordNet. A natural question that arises is whether it is possible to combine these two approaches for semantic representations to construct more accurate word representations. In this talk, I will introduce some recent work we have conducted in this direction.

14th January 2016 Tim Rocktaeschel, University College London - Reasoning about Entailment with Neural Attention

Automatically recognizing entailment relations between pairs of natural language sentences has so far been the dominion of classifiers employing hand engineered features derived from natural language processing pipelines. End-to-end differentiable neural architectures have failed to approach state-of-the-art performance until very recently. In this paper, we propose a neural model that reads two sentences to determine entailment using long short-term memory units. We extend this model with a word-by-word neural attention mechanism that encourages reasoning over entailments of pairs of words and phrases. Furthermore, we present a qualitative analysis of attention weights produced by this model, demonstrating such reasoning capabilities. On a large entailment dataset this model outperforms the previous best neural model and a classifier with engineered features by a substantial margin.
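The word-by-word attention step can be sketched as follows: for each hypothesis word, compute alignment scores against every premise word, normalise them with a softmax, and take the weighted sum of premise states as a context vector. This sketch uses random vectors in place of LSTM hidden states and a plain dot-product score, a simplification of the learned scoring function in the paper:

```python
# A minimal sketch of word-by-word attention over a premise (numpy, random
# vectors standing in for LSTM states; not the authors' implementation).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # hidden size
premise = rng.normal(size=(5, d))      # one hidden state per premise word
hypothesis = rng.normal(size=(3, d))   # one hidden state per hypothesis word

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# For each hypothesis word, attend over all premise words and build a
# context vector as the attention-weighted sum of premise states.
contexts = []
for h_t in hypothesis:
    scores = premise @ h_t             # dot-product alignment scores
    alpha = softmax(scores)            # attention weights over premise words
    contexts.append(alpha @ premise)   # weighted sum: context vector
contexts = np.stack(contexts)

print(contexts.shape)  # (3, 8): one context vector per hypothesis word
```

The context vectors are what lets the model reason about entailment at the level of word and phrase pairs rather than whole-sentence encodings.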

10th December 2015 Marina Fomicheva Institut Universitari de Lingüística Aplicada - Using Contextual Information for Machine Translation Evaluation

Automatic Machine Translation (MT) evaluation is based on the idea that the closer a system's output is to a human translation, the higher its quality. Thus, the task is typically approached by measuring some sort of similarity between machine and human translations. In spite of important progress in the field, the correlation between automatic and manual evaluation scores is still not satisfactory. We suggest that this is primarily due to the following reason: when comparing candidate and reference translations, most existing metrics cannot distinguish acceptable linguistic variation from true divergences that are indicative of MT errors.
In our view, the key for discriminating between acceptable and non-acceptable differences is to use contextual information. Variation between two translation options can be considered meaning-preserving if semantically similar words in the corresponding sentences occur in equivalent syntactic environments. In case of translation errors either the lexical choice is inappropriate or the syntactic contexts of the otherwise equivalent words are divergent (word order errors, wrong choice of function words, etc.).
In this talk, I will present a new evaluation system, UPF-Cobalt, that exploits this idea. I will describe the results obtained in the well known WMT evaluation task where the metric has shown highly competitive performance.

3rd December 2015 Andreas Vlachos The University of Sheffield - Natural language understanding with imitation learning

In the first part of this talk I will give an overview of my recent research in natural language understanding focusing on applications in information extraction, semantic parsing, biomedical text mining, language modelling and fact-checking. In the second part, I will focus on the machine learning approach used in most of these applications, namely imitation learning. In the latter paradigm, structured prediction is converted into a sequence of classification actions which are learnt by appropriate generation of cost-sensitive training examples, so that the impact of the actions is assessed globally while maintaining flexibility in feature extraction. Experiments in a variety of tasks demonstrate the broad applicability and the competitive performance of imitation learning.

25th November 2015 Isabelle Augenstein The University of Sheffield - Distant Supervision with Imitation Learning

Distantly supervised approaches have become popular in recent years as they allow training relation extractors without text-bound annotation, using instead known relations from a knowledge base and a large textual corpus from an appropriate domain. While state of the art distant supervision approaches use off-the-shelf named entity recognition and classification (NERC) systems to identify relation arguments, discrepancies in domain or genre between the data used for NERC training and the intended domain for the relation extractor can lead to low performance. This is particularly problematic for ``non-standard'' named entities such as ``album'' which would fall into the MISC category. We propose to ameliorate this issue by jointly training the named entity classifier and the relation extractor using imitation learning which reduces structured prediction learning to classification learning. The talk will give an introduction to distant supervision and imitation learning and present experiments from our EMNLP 2015 paper [1]. 
[1] Isabelle Augenstein, Andreas Vlachos, Diana Maynard (2015). Extracting Relations between Non-Standard Entities using Distant Supervision and Imitation Learning. Proceedings of EMNLP.
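The distant supervision assumption itself is simple: any sentence mentioning both arguments of a known knowledge-base relation is taken as a positive training example for that relation. A minimal sketch with an invented knowledge base and sentences:

```python
# Distant supervision in miniature: label any sentence containing a known
# (entity, entity) pair from the knowledge base as a positive training
# example for that relation. Entities and sentences are invented examples.
kb = {("Nevermind", "Nirvana"): "album_by"}   # tiny stand-in knowledge base

sentences = [
    "Nevermind was recorded by Nirvana in 1991.",
    "Nirvana formed in Aberdeen, Washington.",
]

training_data = []
for sent in sentences:
    for (e1, e2), relation in kb.items():
        if e1 in sent and e2 in sent:
            training_data.append((sent, e1, e2, relation))

print(training_data)
# Only the first sentence mentions both entities, so it becomes the single
# (noisily labelled) positive example.
```

Note the noise this introduces: a sentence can mention both entities without expressing the relation, which is exactly why combining it with joint NERC training or imitation learning helps.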

19th November 2015 Shawn Wen University of Cambridge - Neural Language Generation for Spoken Dialogue Systems

Natural language generation (NLG) is a critical component of spoken dialogue and it has a significant impact both on usability and perceived quality. Most NLG systems in common use employ rules and heuristics and tend to generate rigid and stylised responses without the natural variation of human language. They are also not easily scaled to systems covering multiple domains and languages. 

In this talk I am going to introduce our recently proposed language generator based on a Recurrent Neural Network architecture, dubbed the Semantically Conditioned Long Short-Term Memory (SC-LSTM) generator. The SC-LSTM generator can learn from unaligned data by jointly optimising sentence planning and surface realisation using a simple cross-entropy training criterion, and language variation can be easily achieved by sampling from output candidates. Despite using fewer heuristics, an objective evaluation in two differing test domains showed that the proposed method improved performance compared to previous methods. Human judges scored the LSTM system higher on informativeness and naturalness and overall preferred it to the other systems. More recently, we have also compared it with an encoder-decoder, attention-based generator in terms of performance and domain extensibility.

12th November 2015 Matthew Purver Queen Mary University London - Language Processing for Diagnosis and Prediction in Mental Health

Conditions which affect our mental health often affect the way we use language; and treatment often involves linguistic interaction. This talk will present work on three related projects investigating the use of NLP techniques to help improve diagnosis and treatment for such conditions. We will look at clinical dialogue between patient and doctor or therapist, in cases involving schizophrenia, depression and dementia; in each case, we find that diagnostic information and/or important treatment outcomes are related to observable features of a patient's language and interaction with their conversational partner. We discuss the nature of these phenomena and the suitability and accuracy of NLP techniques for detection and prediction.

5th November 2015 Abdulaziz Alamri, The University of Sheffield - Automatic Identification of Potentially Contradictory Claims to Support Systematic Reviews

Medical literature suffers from the existence of contradictory studies that make incompatible claims about the same research question. This research introduces an automatic system that detects contradictions between research claims using their assertion value with respect to a question. The system uses a machine learning algorithm (an SVM) to construct a classifier that uses multiple linguistic features to recognise a claim's assertion value. The classifier is developed using a dataset consisting of 258 claims distributed in 24 groups, where each group answers a single research question. The classifier achieved a ROC score of 89% and precision/recall of 87.3%, compared against a baseline of 68%. The system enables researchers carrying out systematic reviews to visually identify potentially contradictory research claims.

22nd October 2015 Goran Nenadic University of Manchester - Adding Context to Biomedical Text Mining

Most biomedical text mining efforts focus on the extraction and linkage of facts retrieved from various resources. In addition to such "raw data", text mining can be used to contextualise the extracted facts. For example, information extracted about the molecular basis of a disease needs to be contextualised with the patient's disease status, the severity and type of the disease, associated anatomical location(s), related biological pathways, known drug targets, etc. Another important contextual dimension is the temporal one, capturing the dynamics of facts, particularly in clinical practice. In this talk I will review our recent work on context-aware text mining, with examples from biology, bioinformatics and medicine.

15th October 2015 Andres Duque, The University of Sheffield (Visiting Researcher) - Co-occurrence graphs for multilingual Word Sense Disambiguation

Word Sense Disambiguation has been frequently treated as a supervised learning problem, based on techniques that depend on scarce and expensive resources such as semantically tagged corpora or lexical databases like WordNet. Unsupervised techniques, in contrast, do not require such resources and make use of information provided by unannotated corpora to discriminate among word meanings. In this talk, I will present an unsupervised approach based on co-occurrence graphs which has been successfully applied to solve general WSD problems, such as Cross-Lingual Word Sense Disambiguation. The current work is focused on the application of the developed techniques in the biomedical domain, under a multilingual perspective. This includes the analysis of both monolingual and multilingual corpora for addressing tasks such as abbreviation and acronym disambiguation.
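The core data structure here is a graph whose nodes are words and whose edges count co-occurrences within a context; well-connected hub words can then serve as sense indicators. A minimal sketch on a toy corpus (not the system described in the talk):

```python
# A co-occurrence graph in miniature: nodes are words, edge weights count
# how often two words appear in the same context window. Words with many
# distinct neighbours are candidate hubs. Toy contexts, invented for
# illustration: "cold" appears in both an illness and a weather context.
from collections import defaultdict
from itertools import combinations

contexts = [
    ["cold", "virus", "infection"],
    ["cold", "weather", "winter"],
    ["virus", "infection", "flu"],
]

graph = defaultdict(lambda: defaultdict(int))
for ctx in contexts:
    for a, b in combinations(sorted(set(ctx)), 2):
        graph[a][b] += 1
        graph[b][a] += 1

# Degree (number of distinct neighbours) as a crude hub score:
hubs = sorted(graph, key=lambda w: len(graph[w]), reverse=True)
print(hubs[:3])
```

In graph-based WSD, the neighbourhoods around such hubs are what get partitioned into sense clusters; here the ambiguous "cold" is the best-connected node precisely because it straddles two contexts.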

1st October 2015 Chris Huyck, Middlesex University - Natural Language Processing with Neurons

It's clear that all of the Natural Language Processing that people do is done with neurons. We take several years to learn how to recognise and produce language, and then spend the rest of our lives using it. It's really quite a mystery how that's done. While there is some psycholinguistic and neuro-psycholinguistic evidence on how we process language, neural models of language processing are needed to further our understanding. Moreover, those neural models can be used for proper language engineering tasks. This talk will cover a range of work we've been doing using systems of simulated neurons. This work includes:

    • a neuro-cognitive model of parsing; this applies grammar rules implemented in neurons, and uses short-term potentiation for binding with the output being the semantics of a sentence.
    • machine learning tasks; this uses Hebbian learning rules to solve standard categorisation tasks, a neurally implemented reinforcement learning to implement a cognitive model, and a cognitive model of categorisation.
    • agents in virtual environments; in addition to language, these systems have vision, planning, and cognitive mapping.
    • the Human Brain Project; agents including language processing have been implemented in standard neural simulators and in neuromorphic hardware. It is hoped that this mechanism will incorporate work by other researchers, in addition to our own extensions, to build increasingly sophisticated and biologically accurate agents.
    • and the Telluride Neuromorphic Cognition Engineering Workshop; we quickly developed parsers integrated with memory systems. The group also worked on bag-of-words techniques.

Clearly, we're a long way from Turing test passing systems, but I will argue that continuing along this path is the best way to get there.

29th September 2015 Sean Chester, Aarhus University, Denmark & NTNU, Norway - |C|=1000 and other Brown clustering fallacies 

Brown clustering has recently re-emerged as a competitive, unsupervised method for learning distributional word representations from an input corpus. It applies a greedy heuristic based on mutual information to group words into clusters, thereby reducing the sparsity of bigram information. Using the clusters as features has repeatedly been shown to yield excellent performance on downstream NLP tasks. In this talk, however, I expose the naivety in how features are currently generated from Brown clusters. With a look into hyperparameter selection, the reality of Brown clustering output, and the algorithm itself, I will show that the space for improving the resultant word representations is predominantly unexplored.
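For context, the standard recipe the talk questions turns each word's Brown cluster bit-string path into fixed-length prefix features. A minimal sketch with invented paths:

```python
# Brown clustering assigns each word a bit-string path in the merge tree;
# the common recipe takes fixed-length prefixes of that path as features,
# so words merged early share long prefixes. Paths below are invented.
paths = {
    "monday":  "0110110",
    "tuesday": "0110111",
    "london":  "1011010",
}

def prefix_features(word, lengths=(2, 4, 6)):
    """Fixed-length bit-prefix features: the standard (and, per the talk,
    questionable) way of turning Brown clusters into classifier features."""
    path = paths[word]
    return [f"brown_{n}={path[:n]}" for n in lengths if len(path) >= n]

print(prefix_features("monday"))
print(prefix_features("tuesday"))
# "monday" and "tuesday" differ only in the final bit, so they share every
# prefix feature up to length 6; "london" diverges at the very first bit.
```

The choice of prefix lengths is exactly the kind of under-examined hyperparameter the talk points at.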

10th September 2015 Michal Lukasik The University of Sheffield

Classifying Tweet Level Judgements of Rumours in Social Media

Social media is a rich source of rumours and corresponding community reactions. Rumours reflect different characteristics, some shared and some individual. We formulate the problem of classifying tweet level judgements of rumours as a supervised learning task. Both supervised and unsupervised domain adaptation are considered, in which tweets from a rumour are classified on the basis of other annotated rumours. We demonstrate how multi-task learning helps achieve good results on rumours from the 2011 England riots.

Modeling Tweet Arrival Times using Log-Gaussian Cox Processes

Research on modeling time series text corpora has typically focused on predicting what text will come next, but less well studied is predicting when the next text event will occur. In this paper we address the latter case, framed as modeling continuous inter-arrival times under a log-Gaussian Cox process, a form of inhomogeneous Poisson process which captures the varying rate at which the tweets arrive over time. In an application to rumour modeling of tweets surrounding the 2014 Ferguson riots, we show how inter-arrival times between tweets can be accurately predicted, and that incorporating textual features further improves predictions.
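An inhomogeneous Poisson process of the kind described above can be sampled by thinning (Lewis and Shedler): draw candidate points from a homogeneous process at an upper-bound rate and keep each with probability proportional to the intensity. In an LGCP the intensity would be the exponential of a Gaussian process draw; the sketch below substitutes a fixed, bursty intensity for illustration:

```python
# Sampling arrival times from an inhomogeneous Poisson process by thinning.
# The intensity function is an invented stand-in for the exp-GP intensity
# of a log-Gaussian Cox process: an early burst of tweets that decays.
import math
import random

random.seed(42)

def intensity(t):
    """Hypothetical tweet rate: an initial burst decaying over time."""
    return 5.0 * math.exp(-t / 2.0) + 0.5

def sample_arrivals(T, lam_max):
    """Thinning: draw from a homogeneous process at rate lam_max and keep
    each candidate point with probability intensity(t) / lam_max."""
    t, arrivals = 0.0, []
    while True:
        t += random.expovariate(lam_max)   # homogeneous inter-arrival gap
        if t > T:
            return arrivals
        if random.random() < intensity(t) / lam_max:
            arrivals.append(t)

# lam_max must bound the intensity everywhere; its maximum here is 5.5 at t=0.
arrivals = sample_arrivals(T=10.0, lam_max=5.5)
print(len(arrivals), "arrivals; first few:", [round(a, 2) for a in arrivals[:3]])
```

The simulated arrivals cluster near t=0 where the intensity is high, which is the "varying rate" behaviour the model captures for tweet streams.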

7th September 2015 Ikechukwu Onyenwe The University of Sheffield - Use of Transformation-Based Learning in Annotation Pipeline of Igbo, an African Language

The accuracy of an annotated corpus can be increased through evaluation and revision of the annotation scheme, and through adjudication of the disagreements found. In this paper, we describe a novel process that has been applied to improve a part-of-speech (POS) tagged corpus for the African language Igbo. An inter-annotation agreement (IAA) exercise was undertaken to iteratively revise the tagset used in the creation of the initial tagged corpus, with the aim of refining the tagset and maximizing annotator performance. The tagset revisions and other corrections were efficiently propagated to the overall corpus in a semi-automated manner using transformation-based learning (TBL) to identify candidates for correction and to propose possible tag corrections. The affected word-tag pairs in the corpus were inspected to ensure a high quality end-product with an accuracy that would not be achieved through a purely automated process. The results show that the tagging accuracy increases from 88% to 94%. The tagged corpus is potentially re-usable for other dialects of the language.
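Transformation-based learning works by proposing context-triggered retagging rules, scoring each by its net corrections against the reference tags, and applying the best one. A toy sketch (invented words and tags, not the actual Igbo tagset):

```python
# Transformation-based learning in miniature: propose retagging rules of
# the form (from_tag, to_tag, previous_tag) and keep the one with the best
# net gain against the adjudicated reference tags. Toy data, invented.
initial = [("o", "PRN"), ("ga", "AUX"), ("eje", "PRN")]   # current tags
gold    = [("o", "PRN"), ("ga", "AUX"), ("eje", "V")]     # reference tags

def score(rule, tagged, gold):
    """Net corrections made by applying rule everywhere it triggers:
    +1 for each error fixed, -1 for each correct tag broken."""
    from_tag, to_tag, prev = rule
    gain = 0
    for i, (word, tag) in enumerate(tagged):
        if tag == from_tag and i > 0 and tagged[i - 1][1] == prev:
            gain += 1 if gold[i][1] == to_tag else -1
    return gain

# Candidate rules generated from observed tagging errors:
candidates = [("PRN", "V", "AUX"), ("AUX", "V", "PRN")]
best = max(candidates, key=lambda r: score(r, initial, gold))
print(best)  # ('PRN', 'V', 'AUX'): retag PRN as V after an auxiliary
```

In the corpus-correction setting of the talk, such learned rules flag candidate word-tag pairs for human inspection rather than being applied blindly, which is what keeps the end product high quality.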

2014 - 2015

13th August 2015 Gustavo Paetzold The University of Sheffield - Improved Models for Lexical Simplification

Lexical Simplification (LS) consists of replacing complex words in a text with simpler alternatives, while preserving the text's grammaticality and meaning. Most LS approaches in the literature follow a pipeline of steps: Complex Word Identification, Substitution Generation, Substitution Selection and Substitution Ranking. In order to improve on the state of the art for the task, we have conducted new user studies, gathered new resources and conceived new strategies for each step of the pipeline.

30th July 2015 Karin Sim Smith, University of Sheffield - Modelling Coherence in Machine Translation

We introduce the problem of measuring coherence in Machine Translation (MT). Local coherence has previously been assessed in monolingual contexts using coherent texts that are then artificially modified to create corresponding incoherent versions. We investigate how well previous approaches to coherence assessment work for the task of measuring coherence of MT output. We also extend one of these models - based on syntax - to learn better probability distributions of syntactic patterns. This extension outperforms the state of the art in a traditional monolingual task, and performs very well on MT output.

9th July 2015 Diana Maynard, University of Sheffield - Tools for (Almost) Real-Time Social Media Analysis: The Political Futures Tracker

Social media is fast becoming a crucial part of our everyday lives, not only as a fun and practical way to share our interests and activities with geographically distributed networks of friends, but also as an important part of our business activities. Tools for social media analytics are critical to understand how people engage with topics such as environmental issues or political elections, and to target campaigns appropriately. This talk will provide an overview of the social media analysis tools recently developed in GATE, which enable us to annotate tweets and other social media in almost real-time, recognising key entities, topics, events, sentiment and so on.
This enables us to then search these huge datasets and gain new insights by means of complex queries involving semantics and information from external knowledge sources. The talk will focus on the Political Futures Tracker, a project which investigated key insights into hotly debated issues leading up to the UK election, and the polarisation of political opinion around them, plus the search and visualisation tools which sit on top of the analysis. The toolkit can be used to answer questions such as which topics were most hotly debated in which regions of the country, how engaged the public became about different topics, how people reacted to different politicians throughout the TV debates, how a politician's age affects their tweeting frequency, and many more.

2nd July 2015 Roland Roller The University of Sheffield

ACL short paper: Improving distant supervision using inference learning

Distant supervision is a widely applied approach to automatic training of relation extraction systems and has the advantage that it can generate large amounts of labelled data with minimal effort. However, this data may contain errors and consequently systems trained using distant supervision tend not to perform as well as those based on manually labelled data. This work proposes a novel method for detecting potential false negative training examples using a knowledge inference method. Results show that our approach improves the performance of relation extraction systems trained using distantly supervised data.

BioNLP workshop paper: Making the most of limited training data using distant supervision

Automatic recognition of relationships between key entities in text is an important problem which has many applications. Supervised machine learning techniques have proved to be the most effective approach to this problem. However, they require labelled training data which may not be available in sufficient quantity (or at all) and is expensive to produce. This paper proposes a technique that can be applied when only limited training data is available. The approach uses a form of distant supervision but does not require an external knowledge base. Instead, it uses information from the training set to acquire new labelled data and combines it with manually labelled data. The approach was tested on an adverse drug data set using a limited amount of manually labelled training data and shown to outperform a supervised approach.

29th June 2015 Serge Sharoff, University of Leeds - Extending Quality Estimation for a large number of pairs 

The process of localising products such as software, video games or films usually involves translation from English into a number of other languages. In our PALODIEM project we try to harness the similarities between related languages, so that a product can first be localised into Spanish and then translated into Portuguese, Italian or Romanian using Machine Translation. A problem in evaluating MT output in this case is that the reference human translation has been done from English, while the test translation comes from a related language. Therefore, we need a Quality Estimation framework that covers a large number of language pairs: Es-Pt, Es-It, Es-Fr, Es-Ro, Fr-It, etc. We will present our current work on this topic, focusing on two main ideas: (1) developing cheap (even if dirty) resources that exploit the similarity between the languages, such as lists of cognates (instead of bilingual dictionaries), POS language models (instead of parsing) or distributional similarity lists (instead of thesauri); and (2) applying transfer learning, where texts in cognate languages are treated as unlabelled out-of-domain data.
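One of the "cheap" resources mentioned, a cognate list, can be approximated with normalised edit distance between word pairs from related languages. A minimal sketch with invented Spanish/Italian pairs:

```python
# A cheap cognate detector via normalised Levenshtein distance, the kind
# of "dirty" resource the talk mentions for related languages. Word pairs
# and the 0.4 threshold are illustrative choices, not the project's.
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_cognate(a, b, threshold=0.4):
    """Treat words as cognates when edits are a small fraction of length."""
    return edit_distance(a, b) / max(len(a), len(b)) <= threshold

pairs = [("noche", "notte"), ("libro", "libro"), ("perro", "cane")]
for es, it in pairs:
    print(es, it, is_cognate(es, it))
```

Such a list is noisy (false friends pass, irregular cognates fail), but for closely related pairs like Es-It it is far cheaper than building a bilingual dictionary.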

18th June 2015 Aitor Gonzalez Agirre, Universidad del País Vasco - Why are these similar? Investigating Semantic Textual Similarity

Semantic Textual Similarity (STS) measures the degree of semantic equivalence between two sentences: given two snippets of text, it captures the notion that some texts are more similar than others. Textual similarity can range from complete unrelatedness to exact semantic equivalence, and a graded similarity score intuitively captures the intermediate shades of similarity, as pairs of text may differ only in minor nuances of meaning, exhibit relatively important semantic differences, share only some details, or be simply unrelated in meaning. In this talk I will describe a novel approach to computing similarity scores between two sentences using a cube, where each layer contains token-to-token and phrase-to-phrase similarity scores from a different method and/or resource. In addition, I will also present several approaches to automatically identifying the type of similarity between items in a large digital library (Typed STS), and a novel task on adding an explanatory layer to STS (Interpretable STS).

28th May 2015 Frederic Blain, The University of Sheffield - Machine Translation with humans in the loop: how to improve translation quality over time

I will give a talk about my research during my PhD as well as my one-year postdoc at LIUM (France). In the first part, I'll talk about my work on post-editing analysis, introducing the so-called "Post-Editing Actions". Then, I will present results on continuous and fast integration of users' feedback into an SMT system over time. This work was done using a very simple alignment algorithm which uses the raw translation as a pivot. In the second part of my talk, I will present my involvement in the European project MateCAT, which aimed at providing professional translators with a new web-based, enriched, open-source CAT tool. I will conclude the talk by introducing the QT21 project and, more precisely, the research topic on which I will focus in the next few months.

21st May 2015 Genevieve Gorrell, The University of Sheffield - Using @Twitter Conventions to Improve #LOD-based Named Entity Disambiguation

State-of-the-art named entity disambiguation approaches tend to perform poorly on social media content, and microblogs in particular. Tweets are processed individually and the richer, microblog-specific context is largely ignored. This paper focuses specifically on quantifying the impact on entity disambiguation performance when readily available contextual information is included from URL content, hashtag definitions, and Twitter user profiles. In particular, including URL content significantly improves performance. Similarly, user profile information for @mentions improves recall by over 10% with no adverse impact on precision. We also share a new corpus of tweets, which have been hand-annotated with DBpedia URIs, with high inter-annotator agreement.

16th April 2015 Roland Roller, The University of Sheffield - Improving the quality of distantly labelled training data for relation extraction

Automatic recognition of relationships between entities in text is an important task that has been used to identify useful relations such as interactions between drugs and potential drug side-effects. Supervised machine learning techniques have proved to be the most effective approach to this problem. However, they require labelled training data which may not be available in sufficient quantity (or at all) and is expensive to produce. Distant supervision instead labels its own training data using a knowledge base. This data is then used to train a relational classifier. Unfortunately, automatically labelled data can be noisy, which decreases the quality of the classification results. In my talk I will present ongoing work focussing on removing falsely annotated instances from automatically labelled data. First I will present a method to detect possible false negatives using an inference learning method. Second I will report first results using negations to detect and remove false positives from this data.

9th April 2015 Hegler Tissot, The University of Sheffield (Visitor) - UFPRSheffield: Contrasting Rule-based and SVM Approaches to Time Expression Identification in Clinical TempEval

We present two approaches to time expression identification, as entered into SemEval-2015 Task 6, Clinical TempEval. The first is a comprehensive rule-based approach that favoured recall, and which achieved the best recall for time expression identification in Clinical TempEval. The second is an SVM-based system built using readily available components, which was able to achieve a competitive F1 in a short development time. We discuss how the two approaches perform relative to each other, and how characteristics of the corpus affect the suitability of different approaches and their outcomes.

8th April 2015 Hanna Bechara, The University of Sheffield (Visitor) - Semantic Similarity to aid Machine Translation Evaluation

The importance of evaluation in machine translation (MT) increases with the improvement of data-driven MT. However, most automatic metrics still fail to correlate with human judgement. In an attempt to focus further on the adequacy and informativeness of translations, we integrate features of semantic similarity into the evaluation process. Using methods previously employed in STS tasks, we exploit semantically similar sentences and their quality scores to estimate the quality of machine-translated sentences. Our preliminary results show that this method can improve the prediction of machine translation quality for semantically similar sentences.

26th February 2015 Mussa Omer, University of Huddersfield - Implementing a Relational Database from Requirement Specifications

Creating a database schema is essentially a manual process. From a requirement specification, the information contained within the specification has to be analysed and reduced to a set of entities, tables, attributes and relationships before the relational database can be created. This is a time-consuming process and has to go through several stages before an acceptable database schema is achieved. The purpose of this research is to attempt to implement a relational database from requirement specifications. Stanford CoreNLP version 3.3.1 and a knowledge-based system were used to implement the proposed model. The outcome of the current progress indicates that a first cut of a relational database schema can be extracted from a requirement specification by applying Natural Language Processing tools and techniques with minimum user intervention. Therefore this method is a step forward in finding a solution that requires little or no user intervention.

19th February, 2015 Dominic Rout, The University of Sheffield - Beyond retweets: textual and social approaches to ranking content in Twitter timelines

Timelines on social media services such as Twitter and Facebook are subject to information overload. There is clear personal value in prioritising posts to meet a user's limited time or attention span, and Facebook, in particular, has also demonstrated that there is commercial value in the ranking of social media content. In our work, we consider content ranking using information other than retweets or likes, both for a specialised use case of professional social media analysis, and a generalised use case involving hundreds of unknown Twitter users. We discuss the challenges we face and results for a variety of approaches as we attempt to supplement and surpass the practice of recommending content that has the most retweets.

12th February, 2015 Josiah Wang, The University of Sheffield - When Text and Images Meet: Combining Natural Language Processing and Computer Vision

This talk will introduce you to my past and present research work at the intersection of Natural Language Processing and Computer Vision. The talk will be divided into two parts. The first part will feature a subset of my PhD work, which involves exploring the task of learning to recognise fine-grained object categories in images (e.g. butterfly species) using a textual description describing the category, rather than training with many example images. The performance of the automated system will also be compared to human performance on the same task. The second part of the talk will give you an overview of the CHIST-ERA funded VisualSense project on which I'm currently working, where one of the main goals is to develop methods to automatically generate textual descriptions describing a given image. We'll discuss the project, the problems encountered, some preliminary results and progress on the project so far.

22nd January 2015 Isabelle Augenstein, OAK group, Department of Computer Science, The University of Sheffield - Web Information Extraction using Distant Supervision

The Web is a heterogeneous knowledge source containing information about various domains. In order to adapt to different domains and to non-standard language containing spelling and grammatical mistakes or jargon, distant supervision is used. Distant supervision allows classifiers for relation extraction to be trained in an unsupervised way, using a knowledge base to automatically annotate training data. While this approach has been used successfully for relation extraction, it has so far mostly been applied to the news domain.
This talk will present challenges involved in training relation extractors for the Web with distant supervision, including training data selection, named entity recognition for relation extraction, and evaluation of such approaches. Experiments on a Web corpus using the knowledge base Freebase indicate that distant supervision is a useful approach for unsupervised information extraction from heterogeneous Web pages on various topics.

11th December 2014 Frank Keller, School Of Informatics, University of Edinburgh - Joint work with Moreno Coco and Des Elliott

When humans process text or speech, this often happens in a visual context, e.g., when listening to a lecture, reading a map, or describing an image. Here, we focus on image description as an example of language/vision integration. Previous research has shown that objects in a visual scene are fixated before they are mentioned, leading us to hypothesize that the scan pattern of a participant can be used to predict what they will say. We test this hypothesis using a data set of cued scene descriptions of photo-realistic scenes. We demonstrate that similar scan patterns are correlated with similar sentences and that this correlation holds for three phases of language production (target identification, sentence planning, and speaking). We go on to show how insights from human language/vision integration can be used to build systems that automatically describe images. We propose a novel way of representing images as visual dependency graphs, where arcs between image regions are labeled with spatial relationships. The task of relating image regions to each other can then be viewed as a parsing task. We show how image parsing can be automated and how the output of an image parser can be used to generate image descriptions. The resulting system outperforms standard approaches that rely on object proximity or corpus information to generate descriptions.

20th November 2014 Gianluca Demartini, i-School, The University of Sheffield, - Entity-Centric Information Access

Over the last few years we have observed an evolution towards richer Search Engine Result Pages, which now include pictures, news, videos, factual data, and more.
This has become possible thanks in part to a deeper understanding of user queries and of Web content: entities are being extracted from Web pages and uniquely identified. This enables entity-centric information access.
In this talk I will give an overview of the different research challenges that need to be tackled to enable such novel search result pages. First, I will talk about entity linking and disambiguation using crowdsourcing and a graph of linked entities as background corpus. Next, I will show how the types of such identified entities can be effectively ranked to support user browsing. Finally, I will describe how keyword query understanding can be crowdsourced to build search engines that can answer rare complex queries.

13th November, 2014 Tamara Polajnar, University of Cambridge, - Reducing Dimensions of Tensors in Type-Driven Distributional Semantics

I will talk about work where we examined the performance of lower-dimensional approximations of transitive verb tensors on sentence plausibility and similarity tasks. We found that the matrices perform as well as, and sometimes even better than, full tensors, allowing a reduction in the number of parameters needed to model the framework. I will also talk about numerical techniques that allowed us to train matrices and tensors using low-dimensional vectors.

23rd October, 2014 Internal Paper Presentations

Daniel Beck, University of Sheffield - Joint Emotion Analysis via Multi-task Gaussian Processes

We propose a model for jointly predicting multiple emotions in natural language sentences. Our model is based on a low-rank coregionalisation approach, which combines a vector-valued Gaussian Process with a rich parameterisation scheme. We show that our approach is able to learn correlations and anti-correlations between emotions on a news headlines dataset. The proposed model outperforms both single-task baselines and other multi-task approaches.

Wilker Aziz, University of Sheffield - Exact Decoding for Phrase-Based Statistical Machine Translation

The combinatorial space of translation derivations in phrase-based statistical machine translation is given by the intersection between a translation lattice and a target language model. We replace this intractable intersection by a tractable relaxation which incorporates a low-order upper bound on the language model. Exact optimisation is achieved through a coarse-to-fine strategy with connections to adaptive rejection sampling. We perform exact optimisation with unpruned language models of order 3 to 5 and show search error curves for beam search and cube pruning on standard test sets. This is the first work to tractably tackle exact optimisation with language models of orders higher than 3.

7 October, 2014 Nikos Aletras The University of Sheffield - Interpreting Document Collections with Topic Models

This thesis concerns topic models, a set of statistical methods for interpreting the contents of document collections. These models automatically learn sets of topics from words frequently co-occurring in documents. Topics learned often represent abstract thematic subjects, e.g. Sports or Politics. Topics are also associated with relevant documents.

These characteristics make topic models a useful tool for organising large digital libraries. Hence, these methods have been used to develop browsing systems allowing users to navigate through and identify relevant information in document collections, by providing users with sets of topics from which they can select relevant documents.

The main aim of this thesis is to post-process the output of topic models, making them more comprehensible and useful to humans.

First, we look at the problem of identifying incoherent topics. We show that our methods work better than previously proposed approaches. Next, we propose novel methods for efficiently identifying semantically related topics which can be used for topic recommendation. Finally, we look at the problem of alternative topic representations to topic keywords. We propose approaches that provide textual or image labels which aid topic interpretability. We also compare different topic representations within a document browsing system.

2 October, 2014 Jey Han Lau King's College London - A probabilistic approach to grammaticality

A foundational question in cognitive science is whether linguistic knowledge is fundamentally categorical or probabilistic in nature. Grammaticality judgments present a problem for probabilistic models in that probabilities cannot be mapped directly to grammaticality, because of the influence of sentence length and lexical frequency. In this talk we look at the problem of predicting grammaticality judgments using probabilistic models. We tested a set of enriched models on a data set of crowd-sourced grammaticality judgments for sentences that have had errors introduced through round-trip machine translation. Using various normalisation methods, applied to a variety of largely unsupervised learning models, we show encouraging correlations between the predictions of our models and mean native speaker judgments. These results suggest that probabilistic models are, in principle, capable of accounting for observed grammaticality judgments.
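One family of normalisations of the kind discussed, dividing out sentence length and lexical frequency from a language-model score (a SLOR-style ratio), can be sketched as follows. The probabilities are toy values for illustration, not figures from the talk.

```python
import math

def slor(logprob_sentence, unigram_logprobs):
    """SLOR-style normalisation: subtract the unigram (lexical-frequency)
    log-probability and divide by sentence length, so long sentences and
    rare words are not penalised as ungrammatical per se."""
    n = len(unigram_logprobs)
    return (logprob_sentence - sum(unigram_logprobs)) / n

# Toy example: two 4-word sentences with the same LM log-probability.
# The second contains rarer words, so after normalisation it scores higher,
# i.e. its low raw probability is attributed to lexis, not ungrammaticality.
lp = math.log(1e-8)
common = [math.log(0.01)] * 4   # frequent words
rare = [math.log(0.001)] * 4    # rarer words
print(slor(lp, common) < slor(lp, rare))  # True
```

The same raw probability thus maps to different grammaticality estimates depending on the lexical material, which is the point of the normalisation step.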

24th September, 2014 Nigel Collier European Bioinformatics Institute in Cambridge - Exploiting NLP for Digital Disease Informatics

Accurate and timely collection of facts from a range of text sources is crucial for supporting the work of experts in detecting and understanding highly complex diseases. In this talk I illustrate several applications using techniques that baseline Natural Language Processing (NLP) pipelines against human-curated biomedical gold standards. (1) In the BioCaster project, high throughput text mining on multilingual news was employed to map infectious disease outbreaks. In order to detect norm violations, we show the effectiveness of a range of time series analysis algorithms evaluated against ProMED-mail; (2) In the EAR project, a quantitative and qualitative evaluation of BioCaster was conducted on epidemic intelligence data specific to highly pathogenic avian influenza (A/H5N1) showing the effects of combining six systems to achieve improved detection rates; (3) In the PhenoMiner project, using an ensemble approach together with SVM learn-to-rank, we show how existing medical concept recognition systems (NCBO Annotator, cTAKES etc.) can achieve improved levels of performance for disorder name recognition on the ShARE/CLEF electronic patient record collection; (4) Time permitting, I will discuss seasonal influenza tracking against CDC data using text mining of health seeking behaviour signals in Twitter.
This talk includes work done on the Marie Curie Phenominer project with Tudor Groza (Queensland University), Anika Oellrich (Wellcome Trust Sanger Institute), Damian Smedley (Wellcome Trust Sanger Institute), Dietrich Rebholz-Schuhmann (University of Zurich), Vu Tran Mai (University of Vietnam, Hanoi), and on the JST Sakigake BioCaster project with Son Doan (University of California at San Diego), Mike Conway (University of California at San Diego), Philippe Barboza (WHO and French Institute for Public Health), Laetitia Vaillant (French Institute for Public Health), Abla Mawudeku (Public Health Agency of Canada), Noele Nelson (Georgetown University Medical Center), David Hartley (Georgetown University Medical Center), Lawrence Madoff (ProMED-mail), Jens Linge (JRC), John Brownstein (Boston Children's Hospital, Harvard University), Roman Yangarber (University of Helsinki), Pascal Astagneau (Pierre et Marie Curie University School of Medicine).


    • [1] Collier, N. (2011). Towards cross-lingual alerting for bursty epidemic events. J. Biomedical Semantics, 2(S-5), S10.
    • [2] Barboza, P., Vaillant, L., Mawudeku, A., Nelson, N. P., Hartley, D. M., Madoff, L. C., ... & Astagneau, P. (2013). Evaluation of epidemic intelligence systems integrated in the early alerting and reporting project for the detection of A/H5N1 influenza events. PloS one, 8(3), e57252.
    • [3] Collier, N., Oellrich, A. and Groza, T. (2014), Concept selection for phenotypes and disease-related annotations using support vector machines. In Proc. Phenotype Day at ISMB 2014. Available from
    • [4] Collier, N., Son, N. T., & Nguyen, N. M. (2011). OMG U got flu? Analysis of shared health messages for bio-surveillance. J. Biomedical Semantics, 2(S-5), S9.

11th September, 2014 Roland Roller The University of Sheffield - Self-Supervised Relation Extraction using UMLS

Self-supervised relation extraction uses a knowledge base to automatically annotate a training corpus which is then used to train a classifier. This approach has been successfully applied to different domains using a range of knowledge bases. This talk will present an approach which applies self-supervised relation extraction to the biomedical domain using UMLS, a large biomedical knowledge base containing millions of concepts and relations among them. The approach is evaluated using two different techniques. The presented results are promising and indicate that UMLS is a useful resource for self-supervised relation extraction.

5th September, 2014 Wilker Aziz The University of Sheffield - Decoding algorithms for statistical translation

In SMT, decoding typically refers to the task of finding an optimum translation derivation under a linear model. In fact, that is just one possible decision rule. Different decision rules come with their own advantages and challenges. In this lecture, I am going to discuss the common challenges of decoding for SMT. I will start by drawing a connection between different models of translational equivalence and tools from formal languages and automata theory. This will lead to a formal characterisation of the unweighted set of solutions to be parameterised. I will discuss linear parameterisation and show how it typically requires unpacking compact representations. I will move on to defining search problems under a given parameterisation (e.g. MAP, Viterbi approximation, MBR). Finally, I will survey a number of decoding techniques, always drawing connections to the more fundamental formal tools. Amongst the techniques covered: A* search, beam search, cube pruning, greedy search, coarse-to-fine search, ILP, Lagrangian relaxation and sampling.

4th September, 2014 Marine Carpuat National Research Council Canada - Second Languages as Noisy Supervision for Natural Language Processing

Translations in a second language provide an attractive source of supervision for learning algorithms in Statistical Natural Language Processing (NLP): translated texts are available in large quantities in many languages and can replace costly annotation by experts. In this talk, I will discuss two scenarios where NLP can be guided by second language supervision.

First, translations can be directly used as labels, to disambiguate the meaning of words in context. However, unlike in conventional supervised learning tasks, test examples might require new labels that have not been previously observed. I will show that context models from lexical semantics can be used to detect words that gain new senses in new domains.

Second, translations can be used to project linguistic annotation from one language to another. I will show how translations can be used to automatically analyze discourse relations in Chinese given English annotations, despite important divergences in discourse organization between the two languages.


Marine Carpuat is a Research Scientist at the National Research Council Canada, where she works on Natural Language Processing and Statistical Machine Translation. Before joining the NRC, Marine was a postdoctoral researcher at Columbia University in New York. She received a PhD in Computer Science from the Hong Kong University of Science & Technology.

2013 - 2014

21st August, 2014 Tim Dawborn Sydney University - docrep: A lightweight and efficient document representation framework

Modelling linguistic phenomena requires highly structured and complex data representations. Document representation frameworks (DRFs) provide an interface to store and retrieve multiple annotation layers over a document. Researchers face a difficult choice: use a heavy-weight DRF or implement a custom one. This talk introduces docrep, a lightweight and efficient DRF, and compares it against existing DRFs. docrep has been heavily used within our research lab for the past two years, but this is the first publicly presented work on it. The talk will discuss the design goals and implementations of docrep.

Users of annotated corpora frequently perform basic operations such as inspecting the available annotations, filtering documents, formatting data, and aggregating basic statistics over a corpus. While these may be easily performed over flat text files with stream-processing UNIX tools, similar tools for structured annotation require custom design. docrep provides a declarative description and storage of structured annotations plus a rich set of generic command-line utilities. This talk will also describe the most useful utilities - some for quick data exploration, others for high-level corpus management - with reference to comparable UNIX utilities.

Relevant publications:

COLING 2014:
OIAF4HLT 2014:

21st August, 2014 Andrea Varga The University of Sheffield - Exploiting Domain Knowledge for Adaptive Text Classification in Large Heterogeneous Data Sources

With the growing amount of data generated in large heterogeneous repositories (such as the World Wide Web, corporate repositories and citation databases), there is an increased need for end users to locate relevant information efficiently. Text Classification (TC) techniques provide automated means for classifying fragments of text (phrases, paragraphs or documents) into predefined semantic types, allowing an efficient way of organising and analysing such large document collections. Current approaches to TC rely on supervised learning, which performs well on domains related to the source domain (on which the TC system is built) but tends to adapt poorly to even slightly different domains.

This thesis presents a body of work exploring adaptive TC techniques across heterogeneous corpora in large repositories, with the goal of finding novel ways of bridging the gap across domains. The proposed approaches rely on the exploitation of domain knowledge for the derivation of stable cross-domain features. This thesis also investigates novel ways of estimating the performance of a text classifier by means of domain similarity measures. For this purpose, two novel knowledge-based similarity measures are proposed that capture the usefulness of the selected cross-domain features for cross-domain TC. The evaluation of these approaches and measures is presented on real-world datasets against various strong baseline methods and content-based measures used in transfer learning.

This thesis explores how domain knowledge can be used to enhance the representation of documents to address the lexical gap across the domains. Given that the effectiveness of a text classifier largely depends on the availability of annotated data, this thesis explores techniques which can leverage the data from social knowledge sources (such as DBpedia and Freebase). Techniques are further presented which explore the feasibility of exploiting different semantic graph structures from knowledge sources in order to create novel cross-domain features and domain similarity metrics. The methodologies presented provide a novel representation of documents, and exploit four wide-coverage knowledge sources: DBpedia, Freebase, SNOMED-CT and MeSH.

The contribution of this thesis demonstrates the feasibility of exploiting domain knowledge for adaptive TC and domain similarity, providing an enhanced representation of documents with semantic information about entities, that can indeed reduce the lexical differences between domains.

14th August, 2014 Nikos Aletras The University of Sheffield - Representing Topics Labels for Exploring Digital Libraries

Topic models have been shown to be a useful way of representing the content of large document collections, for example via visualisation interfaces (topic browsers). These systems enable users to explore collections by way of latent topics. A standard way to represent a topic is using a set of keywords, i.e. the top-n words with highest marginal probability within the topic. However, alternative topic representations have been proposed, including textual and image labels. In this paper, we compare different topic representations, i.e. sets of topic words, textual phrases and images, in a document retrieval task. We asked participants to retrieve relevant documents based on pre-defined queries within a fixed time limit, presenting topics in one of the following modalities: (1) sets of keywords, (2) textual labels, and (3) image labels. Our results show that textual labels are easier for users to interpret than keywords and image labels. Moreover, the precision of retrieved documents for textual and image labels is comparable to the precision achieved by representing topics using sets of keywords, demonstrating that labelling methods are an effective alternative topic representation.
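The standard keyword representation described above (the top-n words by marginal probability within a topic) reduces to a simple sort over the topic-word distribution; the toy distribution below is invented for illustration.

```python
def top_n_keywords(topic_word_probs, n=3):
    """Represent a topic by its n highest-probability words."""
    return [w for w, _ in sorted(topic_word_probs.items(),
                                 key=lambda kv: kv[1], reverse=True)[:n]]

# A toy topic-word distribution (probabilities over the vocabulary);
# the three content words dominate, so they become the topic's keywords.
topic = {"match": 0.05, "team": 0.04, "goal": 0.03, "the": 0.01}
print(top_n_keywords(topic))  # ['match', 'team', 'goal']
```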

17th July, 2014 Ayman Alhelbawy The University of Sheffield - Collective Approaches to Named Entity Disambiguation

Named Entity Disambiguation refers to the task of mapping different named entity mentions in running text to their correct interpretations in a specific knowledge base, such as Wikipedia. The main goal of this research is to develop new methods for named entity (NE) disambiguation, emphasising the importance of the interdependency of candidate interpretations of different NE textual mentions in the document. First we present a named entity based text similarity function to help in generating a short NE candidate list containing the correct disambiguation of an NE textual mention from Wikipedia. Then, three novel approaches to collectively disambiguate textual mentions of named entities against Wikipedia are developed and tested. The first approach is based on HMMs, and novel approximations are introduced to enable the Viterbi algorithm to be used to decode the named entity textual mention sequence. Two other collective approaches based on graphical models are presented: the first uses graph ranking and the other uses clique partitioning. Our results show that the interdependence between named entity textual mentions in a document can be used reliably to jointly disambiguate all of them. A comparison with the state-of-the-art approaches shows that collective approaches are more accurate for named entity disambiguation than individual approaches, which consider each named entity textual mention for disambiguation independently of the others.
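A minimal sketch of the graph-ranking idea: put all candidate entities for all mentions into one graph whose edges encode coherence between candidates, then rank nodes jointly. This uses a plain power-iteration PageRank over an invented toy graph, not the actual system or its features.

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=50):
    """Rank nodes of a candidate graph by power iteration."""
    n = adj.shape[0]
    # Column-normalise so each node distributes its score over its neighbours.
    out = adj.sum(axis=0)
    M = adj / np.where(out == 0, 1, out)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * M @ r
    return r

# Nodes 0,1 are candidates for mention A; nodes 2,3 for mention B.
# An edge links candidates 0 and 2, which are semantically coherent;
# candidates 1 and 3 are unsupported by any other mention's candidates.
adj = np.array([[0, 0, 1, 0],
                [0, 0, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 0]], dtype=float)
scores = pagerank(adj)
print(scores[0] > scores[1] and scores[2] > scores[3])  # True
```

The mutually supporting candidates win for both mentions at once, which is the sense in which the disambiguation is collective rather than mention-by-mention.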

16th July, 2014 Micha Elsner The University of Edinburgh - Visual complexity and referring expression generation

An observer's perception of a visual scene influences the language they use to describe it --- which objects they choose to mention and how they characterize the relationships between them. In this talk, I will discuss some experiments on the relationship between visual perception and reference. Using a corpus of descriptions of cartoon people from the children's book "Where's Wally", we show that speakers' choice of which objects to mention as landmarks is guided by the objects' visual salience and proximity to the target. Moreover, perceptual salience acts like discourse salience in determining which objects are mentioned first and what kinds of determiners are used for them. In a second experiment using arrays of shapes, we look at the generation of expressions in real time. We analyze the ways in which speakers buy themselves time with filled and unfilled pauses and repeated words, and demonstrate that different scene types and descriptive elements have varied processing demands which are reflected in these time-wasting strategies.

Joint work with Hannah Rohde, Alasdair Clarke, Manjuan Duan and Marie-Catherine de Marneffe.

12th June, 2014 Daniel Beck The University of Sheffield - Bayesian Kernel Methods for Natural Language Processing

Kernel methods are heavily used in Natural Language Processing (NLP). Frequentist approaches like Support Vector Machines are the state-of-the-art in many NLP tasks. However, these approaches lack efficient procedures for model selection, which hinders the usage of more advanced kernels.

In this talk, I propose the use of Gaussian Processes (GPs), a Bayesian approach to kernel methods which allows easy model fitting even for complex kernel combinations. They have already been used successfully in a number of NLP tasks. The focus of this talk will be on combining GPs with kernels for structured data, including string and tree kernels. The goal is to employ this approach to improve results in a number of regression and classification tasks in NLP, like Sentiment Analysis and Translation Quality Estimation.
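The "complex kernel combinations" point rests on a standard fact: a sum of valid kernels is itself a valid kernel, so structured kernels can be composed freely inside a GP. A minimal NumPy sketch of GP regression with a summed kernel (toy 1-D data, not string/tree kernels):

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel."""
    d = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d / lengthscale**2)

def linear(X1, X2):
    """Linear kernel."""
    return X1 @ X2.T

def gp_posterior_mean(X, y, Xstar, noise=0.1):
    # Kernel combination: summing kernels yields another valid kernel,
    # which is what makes composing e.g. string and tree kernels easy.
    k = lambda A, B: rbf(A, B) + linear(A, B)
    K = k(X, X) + noise * np.eye(len(X))
    return k(Xstar, X) @ np.linalg.solve(K, y)

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])
print(gp_posterior_mean(X, y, np.array([[1.5]])))
```

Swapping `rbf` or `linear` for a structured-data kernel leaves the rest of the computation unchanged, which is the modularity the talk exploits.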

8th May, 2014 Varvara Logacheva - The University of Sheffield - A Quality-based Active Sample Selection Strategy for Statistical Machine Translation

Active learning is a data selection technique which aims at choosing the data instances that are most useful for the training of a system. It can save time spent on manual annotation of the data by requesting annotation only for the most useful objects. Machine translation may benefit from active learning, because acquisition of new training corpora for MT requires manual translation which takes much time and effort.

We present a new method of active learning for machine translation based on quality estimation of automatically translated sentences. It uses an error-driven strategy, i.e., it assumes that the more errors an automatically translated sentence contains, the more informative it is for the translation system. Our approach is based on a quality estimation technique which involves a wider range of features of the source text, automatic translation, and machine translation system compared to previous works. We conducted some experiments showing the effectiveness of our selection strategy compared to random selection and some other strategies.

In our experiments we enhanced the machine translation system training data with post-edited machine translations of the sentences selected, instead of simulating this using previously created reference translations. We found that re-training systems with additional post-edited data yields higher quality compared with the use of the same amount of free references. Moreover, the origin of post-editions does not matter: even if post-editions were acquired by post-editing output of a third-party MT system, they improve the quality.

Additionally, we discovered that oracle selection techniques that use real quality scores lead to poor results, making the effectiveness of confidence-driven methods of active learning for machine translation questionable.
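The error-driven selection strategy described above (prefer the sentences whose automatic translations are predicted to be worst, and send those for post-editing) can be sketched as follows. The quality scores are hypothetical stand-ins for the output of a quality estimation model.

```python
def select_for_post_editing(candidates, predicted_quality, k=2):
    """Pick the k sentences with the lowest predicted translation quality:
    under the error-driven assumption, these are the most informative ones
    to have post-edited and added to the training data."""
    ranked = sorted(zip(candidates, predicted_quality), key=lambda p: p[1])
    return [sent for sent, _ in ranked[:k]]

pool = ["s1", "s2", "s3", "s4"]
quality = [0.9, 0.2, 0.6, 0.4]  # hypothetical QE scores, higher = better
print(select_for_post_editing(pool, quality))  # ['s2', 's4']
```

The oracle result at the end of the abstract amounts to running this same selection with true rather than predicted quality scores, and finding that it performs poorly.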

1st May, 2014 Ekaterina Lapshinova-Koltunski - Saarland University - Multilingual Analysis of Discourse Variation

One of the components of effectively organised and meaningful discourse is textual cohesion, as the message being communicated in discourse is not just a set of clauses, but forms a unified, coherent whole. While coherence concerns the cognitive aspects of establishing meaning relations during text processing, cohesion involves explicit linguistic means that signal how clauses and sentences are linked together to function as a whole.

Classifications of lexico-grammatical markers and their relational potentials are quite often language-specific. For multilingual analysis, e.g. in contrastive linguistics or translation studies (both human and machine), it is important to establish categories which enable the comparison of inventories across languages in order to identify similarities or contrasts. Moreover, languages vary according to their contextual settings, which results in registers (or genres). In a multilingual analysis, the settings of one register in a language do not necessarily correspond to the settings of the same register in another language. This means that the typical distribution of lexico-grammatical features expressing cohesive (discourse) relations will vary across languages and also registers. We need to analyse this variation to be aware of it, e.g. while translating.

To analyse this variation, we semi-automatically annotate a multilingual (English-German) corpus on the level of cohesion. The annotation categories we operate with are based on lexico-grammatical and semantic aspects of different cohesive types. This annotation scheme allows us to compare and differentiate cohesive features across languages and registers, including written and spoken dimensions.

In my talk, I will present the categories of cohesion we work with, the procedures to annotate them in a corpus, as well as some of the results of our analysis, which provide explanatory backgrounds for contrastive and translation-relevant discourse properties. Moreover, this allows us to interpret semantic relations and different degrees of strength and breadth of cohesive encoding for the two languages which may serve as contrastive background for translation strategies and evaluation.

10th April, 2014 Nikos Aletras - The University of Sheffield - Measuring the Similarity between Automatically Generated Topics

Previous approaches to the problem of measuring similarity between automatically generated topics have been based on comparison of the topics' word probability distributions. This paper presents alternative approaches, including ones based on distributional semantics and knowledge-based measures, evaluated by comparison with human judgements.

The best performing methods provide reliable estimates of topic similarity comparable with human performance and should be used in preference to the word probability distribution measures used previously.
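The word-probability-distribution baseline referred to above is typically a symmetric divergence between two topics' distributions over a shared vocabulary; a common choice is Jensen-Shannon divergence. The toy topics below are invented for illustration.

```python
import math

def jensen_shannon(p, q):
    """Symmetric divergence between two topic-word distributions --
    the kind of word-probability comparison used as a baseline."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy topics over a shared 3-word vocabulary: two sport-like topics
# and one politics-like topic.
sports = [0.7, 0.2, 0.1]
football = [0.6, 0.3, 0.1]
politics = [0.1, 0.2, 0.7]
print(jensen_shannon(sports, football) < jensen_shannon(sports, politics))  # True
```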

3rd April, 2014 Monica Paramita - The University of Sheffield - Bootstrapping Term Extractors for Multiple Languages

Terminology extraction resources are needed for a wide range of human language technology applications, including knowledge management, information extraction, semantic search, cross-language information retrieval and automatic and assisted translation. However, their availability is limited to highly-resourced languages. In this study, we report a low-cost method for creating terminology extraction resources for 21 non-English EU languages. Using parallel corpora and a projection method, we create a General POS Tagger for these languages. We also investigate the use of EuroVoc terms and Wikipedia to automatically create a term grammar for each language. Our results show that these automatically generated resources can assist the term extraction process, achieving performance similar to manually generated resources.

6th March, 2014 Adam Kilgarriff - Lexical Computing Ltd - Getting to know your corpus (+ term-finding)

Corpora are not easy to get a handle on. The usual way of getting to grips with text is to read it, but corpora are mostly too big to read (and not designed to be read). We show, with examples, how keyword lists (of one corpus vs. another) are a direct, practical and fascinating way to explore the characteristics of corpora, and of text types. Our method is to classify the top one hundred keywords of corpus1 vs. corpus2, and corpus2 vs. corpus1. This promptly reveals a range of contrasts between all the pairs of corpora we apply it to. We also present improved maths for keywords, briefly discuss quantitative comparisons between corpora, and show how we extend the method to find the terms in a domain. All the methods discussed (and almost all of the corpora) are available in the Sketch Engine, a leading corpus query tool.
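For illustration, the "simple maths" style of keyword statistic associated with this line of work can be sketched as a smoothed ratio of normalised frequencies (the smoothing constant and function name below are my own choices, not from the talk):

```python
def keyword_score(freq1, size1, freq2, size2, k=100.0):
    """Smoothed frequency-ratio keyword score: how much more typical a
    word is of corpus 1 than of corpus 2. freq1/freq2 are raw counts,
    size1/size2 are corpus sizes in tokens; k damps the effect of rare
    words (larger k favours common words)."""
    fpm1 = freq1 * 1_000_000 / size1  # frequency per million words
    fpm2 = freq2 * 1_000_000 / size2
    return (fpm1 + k) / (fpm2 + k)
```

Ranking the shared vocabulary by this score in each direction gives the two top-100 keyword lists the method classifies.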

13th February, 2014 Hieu Hoang - The University of Edinburgh - Updating the Feature Function Framework in the Moses Decoder

Best practices for developing an SMT decoder have progressed enormously since the initial release of the Moses decoder in 2006. The emergence of syntax-inspired models, sparse features, and ever larger feature sets has been the driver for major changes in the framework of the decoder.

In this talk, we describe the changes in the Moses decoder and supporting programs to support such changes, while preserving the toolkit's existing functionality. Particular emphasis will be placed on describing the feature function framework.

We compare the implementation of the Moses SCFG decoder to that of cdec and Joshua and find that there are some surprising differences in implementation of even very fundamental features in each decoder.

Bio: Hieu Hoang is a postgraduate researcher at the University of Edinburgh, where he also completed his PhD in 2011. His main responsibility is maintaining and enhancing the Moses SMT toolkit as part of the MosesCore project, funded by the EU.

He started the Moses project when he wrote a slow, buggy phrase-based decoder as part of his PhD. Things quickly got out of hand when it was picked up and improved upon at the JHU Summer Workshop on SMT. He has been dining on it ever since.

Before returning to Edinburgh as a researcher, Hieu spent a year working at Asia Online, getting first-hand experience of SMT in a commercial setting. He studied Computer Science and Machine Learning as a student.

23rd January, 2014 Shay Cohen - The University of Edinburgh - Spectral Learning and Decoding for Natural Language Parsing

Spectral methods have received considerable attention in the past in the machine learning and NLP communities. Most recently, they have been applied to latent-variable modelling. In this case, their two clearest advantages over the use of EM are their computational efficiency and the sound theory behind them (they are not prone to local maxima like EM).

In this talk, I will present two distinct uses of the spectral method for natural language parsing. I will describe a learning algorithm for latent-variable PCFGs, a very useful model for constituent parsing. I will also describe our use of tensor decomposition for speeding up parsing inference. Here, we approximate the underlying model by using a tensor decomposition algorithm, and this approximation permits us to use fast inference with dynamic programming.

If time permits, I will also touch on the use of spectral decomposition algorithms for unsupervised learning.

Joint work with Michael Collins, Dean Foster, Ankur Parikh, Giorgio Satta, Karl Stratos, Lyle Ungar, Eric Xing.

18th December, 2013 Guillaume Bouchard - Xerox - Probabilistic modeling of relational databases: why and how?

Statistical modeling of multi-relational data is a generic approach to performing predictive tasks in a database. We illustrate it by showing that most common machine learning tasks, including supervised classification, collaborative filtering, forecasting, learning to rank and preference learning, can be viewed as special cases of predicting entries in a relational database. We give a broad overview of the research in this domain, together with some applications in domains such as bioinformatics, social media monitoring, knowledge base optimization and learning analytics. Focusing on multi-linear algebra, i.e. tensor and matrix factorization, we show that the key advantages of such approaches are their scalability, enabling "big data" applications, as well as their simplicity: the only core operation to know is the dot product. Numerical experiments illustrate the efficiency of such methods. Depending on the time, some recent work will be presented on 1) convex relaxation of multi-relational data factorization, 2) the handling of problems with large amounts of zeros (i.e. negative examples) and 3) the optimization of regularization parameters by using type-II maximum likelihood instead of heavy cross-validation.
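The "dot product as the only core operation" point can be illustrated with a minimal SGD matrix factorization that predicts missing entries of a relation table (a sketch with invented data, not the speaker's implementation):

```python
import random

def factorize(entries, n_rows, n_cols, rank=5, lr=0.05, epochs=2000, seed=0):
    """Plain SGD matrix factorization: approximate X[i, j] by the dot
    product u_i . v_j, training only on the observed (i, j, value) cells,
    so unobserved cells of the relation table can then be predicted."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for i, j, x in entries:
            err = x - sum(a * b for a, b in zip(U[i], V[j]))  # dot product
            for r in range(rank):
                U[i][r], V[j][r] = (U[i][r] + lr * err * V[j][r],
                                    V[j][r] + lr * err * U[i][r])
    return U, V
```

Collaborative filtering, forecasting and the other tasks mentioned differ mainly in which rows, columns and observed cells the table contains.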

Guillaume Bouchard is a senior research scientist at Xerox Research Centre Europe. The talk includes work done in collaboration with Arto Klami (University of Helsinki), Beyza Ermis (Boğaziçi University), Behrouz Behmardi (Samsung), Abhishek Tripathi (Xerox), Shengbo Guo (Samsung), Dawei Yin (Yahoo) and Cedric Archambeau (Amazon).

6th December, 2013 Georgios Paltoglou - Wolverhampton University - Sentiment Analysis: An introduction

Sentiment analysis deals with the detection and analysis of affective content in written text. It has known particular popularity recently, both in academia and industry, especially due to the unprecedented increase of user-generated content in social media. In this talk, Dr. Georgios Paltoglou will provide a brief introduction to the field and focus on the techniques that have been developed within the Wolverhampton research group for dealing with the issues and challenges that it presents. Outputs of the research were used during the London Olympic Games to light the London Eye and are also currently utilized by Yahoo! Answers, amongst other companies.

21st November, 2013 Viviana Cotik - University of Buenos Aires - Research on BioNLP

The growth of digital literature drives the need for automatic techniques to analyze texts.

The extraction of medical terms and of relationships in medical reports, such as relations between medications and adverse effects, enables the detection of previously unknown patterns. Acronyms are often used in many kinds of text, particularly in biological and medical ones. Acronym identification is therefore an important component for recognizing entities and disambiguating word senses.

I will make a short presentation of the work that I have been doing in acronym-expansion/detection and medical terms and relationship detection as part of my PhD thesis in this area.

7th November, 2013 Serge Sharoff - Leeds University - Approaching genre classification

I will present an approach to studying large text collections with respect to the genre of their constituent texts, approaching genre classification from the viewpoint of document similarity. Given that such concepts are not well defined, existing annotated corpora differ in the set of labels used. We can also often observe considerable dissimilarity between texts annotated with the same label. This can be remedied by having a text-external similarity space in which to compare the documents. This allows us to train classifiers based on text-internal features to represent the contents of text collections and compare them within and across languages. For example, we will be able to answer questions such as to what extent a crawl of the UK Web is similar to the genre categories in the Brown corpus, or which portions of a Reuters corpus in English are similar to a Xinhua corpus in Mandarin Chinese.

3rd October, 2013 Barry Haddow - University of Edinburgh - Domain Adaptation in Statistical Machine Translation

To create a high quality statistical machine translation (SMT) system, we require large quantities of training data, both monolingual and parallel.

We obtain the best results when the training data is similar to the data we wish to test on; however, there is rarely enough of the right type of data, and we typically have to combine data from disparate sources.

This mismatch between the training and test sets is known as the domain adaptation problem. In this talk we will analyse the domain adaptation problem and show how the train/test mismatch affects the SMT training pipeline. We will discuss various techniques for domain adaptation in SMT: data selection and weighting, model combination, and transforming the test data to make it more like the training data.
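One widely used data-selection technique of the kind mentioned is cross-entropy-difference scoring in the style of Moore and Lewis. The toy below is my own illustrative sketch, using unigram language models rather than the n-gram models used in practice: it keeps the pool sentences that the in-domain model likes most relative to the general-domain model.

```python
import math
from collections import Counter

def avg_logprob(sentence, counts, total, vocab_size):
    """Add-one-smoothed unigram log-probability, averaged per token."""
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in sentence) / len(sentence)

def select_in_domain(pool, in_domain, general, keep=2):
    """Keep the `keep` pool sentences whose in-domain LM score exceeds
    their general-domain LM score by the largest margin."""
    vocab_size = len({w for s in in_domain + general for w in s})
    c_in = Counter(w for s in in_domain for w in s)
    c_gen = Counter(w for s in general for w in s)
    t_in, t_gen = sum(c_in.values()), sum(c_gen.values())
    return sorted(pool, key=lambda s:
                  avg_logprob(s, c_gen, t_gen, vocab_size)
                  - avg_logprob(s, c_in, t_in, vocab_size))[:keep]
```

The selected subset can then be used to train or weight the translation and language models toward the target domain.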

2012 - 2013

29th July, 2013 Will Radford - Named Entity Linking, Apposition and Media Demos

Named Entity Linking attempts to ground textual entity mentions to an external knowledge base (e.g., Wikipedia). Mentions are assigned a KB entry or NIL if they are absent from the KB. The task requires resolving name polysemy and synonymy to disambiguate their references. I'll discuss the NEL task, datasets and evaluation as well as a comparison of some systems from the literature.

More recently, we've investigated how to better extract entity descriptions for disambiguation. While apposition is used as a component in several tasks (e.g., Coreference Resolution, Textual Entailment), apposition extraction performance is not often directly evaluated. We propose systems exploiting syntactic and semantic constraints to extract appositions from OntoNotes 4. Our joint log-linear model outperforms the state-of-the-art model (Favre and Hakkani-Tür in Interspeech 2009) by around 10% on Broadcast News, and achieves 54.3% F-score on multiple genres. I'll talk about our apposition system and some more general work on sentence-local description and its application to NEL.

Finally, I'll demonstrate an application developed with our industry partner. Fairfax Media operates some of the main metropolitan newspapers and news websites in Australia and has recently launched "zoom", showing how NEL can provide a compelling view of 25 years of stories.

6th June, 2013 Internal Paper Presentations

Douwe Gelling - University of Sheffield - Combined method for Gold Standard and Model in Source-Side Reordering

Daniel Preotiuc - University of Sheffield - Modeling temporal periodicities in NLP

Temporal periodicities naturally occur in many time series. From Social Media usage data we can derive many such series, such as word frequencies, word co-occurrence counts, Twitter hashtag frequencies, topic frequencies or Foursquare check-ins over time.

I will present preliminary work in modeling periodicities using Gaussian Processes where we automatically identify the period and the shape of the periodic signal. This is applied to the regression task of forecasting Twitter hashtag volume.
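The periodic structure such a model captures can be illustrated with the standard periodic covariance function used in Gaussian Process regression (a sketch only; the work described also learns the period and the shape of the signal from data):

```python
import math

def periodic_kernel(t1, t2, period=7.0, lengthscale=1.0, variance=1.0):
    """Standard periodic covariance function for GP models of seasonal
    signals: two time points exactly one period apart are maximally
    correlated, points half a period apart least so."""
    s = math.sin(math.pi * abs(t1 - t2) / period)
    return variance * math.exp(-2.0 * s * s / lengthscale ** 2)
```

A GP with this kernel, with the period treated as a hyperparameter, can forecast quantities like daily hashtag volume by extrapolating the learned cycle.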

Roland Roller - BioNLP 2013 Shared Task - Identification of Genia Events using Multiple Classifiers

In my talk I will describe our participation in the BioNLP 2013 Shared Task, a competition for information extraction in the biomedical domain. We focussed on the Genia Event (GE) Extraction task, which aims at the detection of gene events and their regulation. For our participation we set up a supervised IE platform using an SVM as relation classifier. For each event type we optimised the classification process by adjusting the SVM parameters and the selection of the features involved. Our system achieved the highest precision of all participating systems in the task and ranked 6th in terms of F-score.

30th May, 2013 Internal Paper Presentations

Nikolaos Aletras - Representing Topics Using Images

Topics generated automatically, e.g. using LDA, are now widely used in Computational Linguistics. Topics are normally represented as a set of keywords, often the n terms in a topic with the highest marginal probabilities. We introduce an alternative approach in which topics are represented using images. Candidate images for each topic are retrieved from the web by querying a search engine using the top n terms. The most suitable image is selected from this set using a graph-based algorithm which makes use of textual information from the metadata associated with each image and features extracted from the images themselves. We show that the proposed approach significantly outperforms several baselines and can provide images that are useful to represent a topic.

Abdulaziz Alamri

Abdulaziz has worked in the IT field for some time, and part of his experience has been in NLP. In this talk, he will introduce himself and briefly describe his previous work experience in NLP.

16th May, 2013 Ayman AlHelbawy - The University of Sheffield - Collective Named Entity Disambiguation Using HMMs

In this paper we present a novel approach to disambiguate textual mentions of named entities against the Wikipedia knowledge base.

The conditional dependencies between different named entities across Wikipedia are represented as a Markov network. In our approach, named entities are treated as hidden variables and textual mentions as observations. The number of states and observations is huge, and naively using the Viterbi algorithm to find the hidden state sequence that emits the query observation sequence is computationally infeasible, given a state space of this size. Based on an observation specific to the disambiguation problem, namely that each textual mention comes with a disambiguation list of candidate knowledge base entities, we propose an approach that uses a tailored approximation to reduce the size of the state space, making the Viterbi algorithm feasible.

Results show good improvement in disambiguation accuracy relative to the baseline approach and to some state-of-the-art approaches. Also, our approach shows how, with suitable approximations, HMMs can be used in such large-scale state space problems.
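To illustrate the pruning idea (a sketch, not the authors' exact model), Viterbi decoding becomes tractable once each mention's hidden states are restricted to its disambiguation-list candidates rather than the whole KB:

```python
def viterbi_disambiguate(candidates, emit, trans):
    """Viterbi over a pruned state space: for each textual mention we
    only consider its own candidate entities. candidates is a list of
    candidate-entity lists, one per mention; emit(entity, position) and
    trans(prev_entity, cur_entity) return log-scores."""
    best = {e: emit(e, 0) for e in candidates[0]}
    back = []
    for i in range(1, len(candidates)):
        new, ptr = {}, {}
        for e in candidates[i]:
            prev = max(best, key=lambda p: best[p] + trans(p, e))
            new[e] = best[prev] + trans(prev, e) + emit(e, i)
            ptr[e] = prev
        best, back = new, back + [ptr]
    # follow back-pointers from the best final state
    path = [max(best, key=best.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The cost per step is now quadratic in the candidate-list length instead of in the number of all KB entities.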

9th May, 2013 Daniel Beck - The University of Sheffield - Minimizing Annotation Costs in Quality Estimation

Quality Estimation (QE) models provide a quality feedback for new, unseen machine translated (MT) texts without relying on reference translations. These models are usually built by applying supervised machine learning techniques on datasets composed of human-evaluated machine translations. Since QE is a task-specific problem, QE models should, ideally, be specifically tailored to their end task, taking into account the annotators, language pairs and MT systems, among other features. However, building task-specific models leads to a large increase in annotation costs. In this talk, I will show some approaches to tackle this issue, including strategies to reduce the annotation effort and to reuse different datasets by applying domain adaptation techniques.

2nd May, 2013 Wilker Aziz - University of Wolverhampton - Exact Optimisation and Sampling for Statistical Machine Translation

In this talk I will present the OS* algorithm (Dymetman et al, 2012) and how this algorithm can be used to perform exact optimisation and sampling for SMT.

OS* is a tractable form of adaptive rejection sampling that can also be used for optimisation.

The contributions of this research go beyond the exactness aspect of OS*. Sampling has many applications in SMT: it enables one to better explore the space of likely solutions, it is less prone to outliers than optimisation, and it is relevant to topics such as minimum error rate training, minimum Bayes risk decoding, and consensus decoding.

Topics relevant to this talk are: SMT, SMT decoding, automata theory, complexity theory, optimisation and sampling.

25th April, 2013 Eric Atwell - Leeds University - Natural Language Processing Working Together with Arabic and Islamic Studies

You may ask: is the Qur'an a suitable dataset for Computing research? Text Analytics touches many domains and applications; but in general NLP research involves Machine Learning from a domain-specific corpus of text documents enriched with linguistic and semantic tags. Ideally, we want a domain where: a source text corpus is freely available, with no IPR or privacy restrictions; a large expert community exists, which has already developed standard "tagging schemes" or linguistic analyses and ontologies for the domain, such as Ibn Kathir's Tafsir ("gold standard" commentary); and a large user group exists, to assist with linguistic and semantic tagging, and to evaluate our systems and results, and also to use the text analytics tools we deliver, so that our research has Impact. The corpus which best meets these research criteria is the Qur'an: the source Classical Arabic text is freely available; Qur'anic scholars over the past thousand years have developed a rich tradition of Arabic linguistics to formally describe the language and meaning of the Quran; and billions of Muslims worldwide constitute the largest user-group ever for a single text corpus.

Our Arabic Language Computing research group at Leeds University has developed a range of Qur'anic corpus resources with linguistic annotation at several levels.

We have just begun an EPSRC-funded project, Natural Language Processing Working Together with Arabic and Islamic Studies. We will bring together these different levels of linguistic annotation in a single multi-layer corpus, and add phonetic and prosodic annotations to capture Tajweed or Qur'anic recitation. We will also target research communities in: NLP and Artificial Intelligence; Arabic Language and Literature; Qur'anic and Islamic Studies; Religious Studies and Theology; Corpus Linguistics and Digital Humanities; Lexicography; and Linguistics and Phonetics.

Join us at the WACL'2 Workshop on Arabic Corpus Linguistics, 22 July 2013, Lancaster University.

18th April, 2013 Sebastian Riedel - University College London - Relation Extraction with Matrix Factorization and Universal Schemas

The ambiguity and variability of language makes it difficult for computers to analyse, mine, and base decisions on. This has motivated machine reading: automatically converting text into semantic representations. At the heart of machine reading is relation extraction: predicting relations between entities, such as employeeOf(Person,Company). Machine learning approaches to this task require either manual annotation or, for distant supervision, existing databases of the same schema (= set of relations). Yet, for many interesting questions (who criticised whom?) pre-existing databases and schemas are insufficient. For example, there is no criticized(Person,Person) relation in Freebase. Moreover, the incomplete nature of any schema severely limits any global reasoning we could use to improve our extractions.

In this talk I will first present some earlier work we have done in distantly supervised extraction. Then I will show that the need for pre-existing datasets can be avoided by using, what we call, a "universal schema": the union of all involved schemas (surface form predicates such as "X-was-criticized-by-Y", and relations in the schemas of pre-existing databases). This extended schema allows us to answer new questions not yet supported by any structured schema, and to answer old questions more accurately. For example, if we learn to accurately predict the surface form relation "X-is-scientist-at-Y", this can help us to better predict the Freebase employee(X,Y) relation.

To populate a database of such schema we present a family of matrix factorization models that predict affinity between database tuples and relations. We show that this achieves substantially higher accuracy than the traditional classification approach. More importantly, by operating simultaneously on relations observed in text and in pre-existing structured DBs, we are able to reason about unstructured and structured data in mutually-supporting ways. By doing so our approach outperforms state-of-the-art distant supervision.
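The core affinity computation described is, as I read it, a dot product between learned tuple and relation embeddings passed through a logistic function. The sketch below (hypothetical data and names, logistic loss with randomly sampled negative tuples) illustrates the idea:

```python
import math, random

def train_universal_schema(facts, tuples, relations, dim=8, lr=0.1,
                           epochs=300, seed=0):
    """Sketch of universal-schema factorization: learn one embedding per
    entity tuple and one per relation (surface pattern or DB relation),
    so that sigmoid(tuple . relation) is high for observed facts and low
    for the same relation paired with a randomly sampled other tuple."""
    rng = random.Random(seed)
    T = {t: [rng.gauss(0, 0.1) for _ in range(dim)] for t in tuples}
    R = {r: [rng.gauss(0, 0.1) for _ in range(dim)] for r in relations}
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    for _ in range(epochs):
        for t, r in facts:
            neg = rng.choice([x for x in tuples if x != t])
            for tt, label in ((t, 1.0), (neg, 0.0)):
                p = sig(sum(a * b for a, b in zip(T[tt], R[r])))
                g = lr * (label - p)
                for k in range(dim):
                    T[tt][k], R[r][k] = (T[tt][k] + g * R[r][k],
                                         R[r][k] + g * T[tt][k])
    return lambda t, r: sig(sum(a * b for a, b in zip(T[t], R[r])))
```

Because surface-form predicates and DB relations share the tuple embeddings, evidence for one (e.g. "X-is-scientist-at-Y") raises the score of correlated others (e.g. employee(X,Y)).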

21st March, 2013 Yorick Wilks - Florida Institute for Human and Machine Cognition - Can metaphor processing move to a large and empirical scale?

The talk describes part of the current US effort on metaphor recognition and interpretation, and in particular the CMU/IHMC project METAL. It also presents an experimental algorithm to detect conventionalised metaphors implicit in the lexical data of a resource like WordNet, where metaphors are coded into the senses and so would never be detected by any algorithm based on the violation of preferences, since there would always be a constraint satisfied by such senses. We report an implementation of this algorithm, first with WordNet and the (limited) preference constraints in VerbNet. We then transformed WordNet in a systematic way so as to produce far more extensive constraints based on its content; with this data we reimplemented the detection algorithm and obtained a substantial improvement in recall. We suggest that this algorithm could contribute to the core detection pipeline of the METAL project at CMU. The new WordNet data is of wider significance because it also produces adjective constraints, unlike any existing lexical resource, and can be applied to any language with a WordNet.

14th March, 2013 Internal Paper Presentations

Nikolaos Aletras - Evaluating Topic Coherence Using Distributional Semantics

This paper introduces distributional semantic similarity methods for automatically measuring the coherence of a set of words generated by a topic model. We construct a semantic space to represent each topic word by making use of Wikipedia as a reference corpus to identify context features and collect frequencies. Relatedness between topic words and context features is measured using variants of Pointwise Mutual Information (PMI). Topic coherence is determined by measuring the distance between these vectors computed using a variety of metrics. Evaluation on three data sets shows that the distributional-based measures outperform the state-of-the-art approach for this task.
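A minimal sketch of the kind of measure described, assuming toy co-occurrence counts from a reference corpus (all names and data below are invented for illustration):

```python
import math

def pmi_vector(word, cooc, count, total, context):
    """Vector of PMI(word, c) over the context features c, with negative
    PMI clipped to 0; counts come from a reference corpus."""
    vec = []
    for c in context:
        joint = cooc.get((word, c), 0)
        if joint == 0:
            vec.append(0.0)
        else:
            vec.append(max(math.log(joint * total /
                                    (count[word] * count[c])), 0.0))
    return vec

def topic_coherence(topic_words, cooc, count, total, context):
    """Mean pairwise cosine similarity of the topic words' PMI vectors:
    words sharing contexts yield a coherent (high-scoring) topic."""
    def cos(u, v):
        den = (math.sqrt(sum(a * a for a in u)) *
               math.sqrt(sum(b * b for b in v))) or 1.0
        return sum(a * b for a, b in zip(u, v)) / den
    vecs = [pmi_vector(w, cooc, count, total, context) for w in topic_words]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(cos(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)
```

In the paper the context features and counts are collected from Wikipedia, and several distance metrics besides cosine are compared.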

Dominic Rout - Reliably Evaluating Summaries of Twitter Timelines

The primary view of the Twitter social network service for most users is the timeline, a heterogeneous, reverse-chronological list of posts from all connected authors.
Previous tweet ranking and summarisation work has heavily relied on retweets as a gold standard for evaluation. The author argues that this is unsatisfactory, since retweets only account for certain kinds of post relevance. The focus of the talk is work-in-progress on designing a user study, through which to create a gold standard for evaluating automatically generated summaries of personal timelines.

7th March, 2013 Samia Touileb - University of Bergen - Inducing local grammars from n-grams

With the increase of information in blogs, there is a pressing need to develop tools for extracting statements that characterize the content of blog posts (e.g. to highlight different opinions). In this talk we will present our ideas for using grammar induction to create statement extraction templates that capture the typical expressions around a concept. We have evaluated two algorithms (ADIOS [Solan et al., 2005], ABL [Van Zaanen, 2001]) on input data comprising n-grams around the concept "climate change" (generated from a blog search engine).

28th February, 2013 Andrew Salway Uni Computing, Bergen - Key Statement Extraction in the NTAP project

Language technologies have an important role to play in making social media a more accessible information source, and in enabling social scientists to better understand how, through social media, organisations and individuals influence public opinion on important and complex issues. The first part of this talk will give an overview of the NTAP project (2012-15), which is synthesising language processing and network visualization in order to map the distribution, flow and development of information/opinions in the blogosphere. A distinctive feature of our approach is the treatment of text content as key statements, rather than as keywords, which elucidates the diverse aspects and viewpoints on an issue. By associating key statements with blogs and time-stamps, we hope to be able to track the diffusion of a statement (e.g. "climate change is caused by humans") along with statements related to it. The second part of the talk will present and discuss early results for extracting key statements from the blogosphere using relatively portable methods, with the example of statements about the causes and effects of climate change.

21st February, 2013 Marcelo Amancio - The University of Sheffield - Automatic Text Adaptation

Text Adaptation is one of the activities that writers use to improve text comprehension and text readability for certain audiences. Two main techniques are usually used. One is Text Elaboration, which brings complementary information into the text, and the other is Text Simplification, which rewrites the text using simpler grammar and vocabulary. My talk will present my former work in Text Elaboration and my initial approach to Text Simplification within the context of my PhD work.

29th November, 2012 Samuel Fernando - The University of Sheffield - Comparing taxonomies for organising collections of documents

There is a demand for taxonomies to organise large collections of documents into categories for browsing and exploration. This paper examines four existing taxonomies that have been manually created, along with two methods for deriving taxonomies automatically from data items. We use these taxonomies to organise items from a large online cultural heritage collection. We then present two human evaluations of the taxonomies. The first measures the cohesion of the taxonomies to determine how well they group together similar items under the same concept node. The second analyses the concept relations in the taxonomies. The results show that the manual taxonomies have high-quality, well-defined relations. However, the novel automatic method is found to generate very high cohesion.

15th November, 2012 Roland Roller - The University of Sheffield - Presentation of my former work

In my talk I would like to present my former work, in particular my work at NTT Communication Science Laboratories and DFKI. First I will introduce the influence model and my extension for user turn segmentation. Both models exploit the effect of speech entrainment to improve the language model in polylogue. Furthermore, I will present the SpeechEval project, a corpus-based user simulation to evaluate spoken dialogue systems.

8th November, 2012 Mark Steedman - The University of Edinburgh - The Future of Semantic Parser Induction

There has recently been some interest in the task of inducing grammar-based "semantic parsers" from sets of paired strings and meaning representations, following pioneering work by Zettlemoyer and Collins (2005). Work of this kind is currently limited by the paucity of datasets for training. The talk reviews the state of the art in this field, then proposes a way to semi-automatically generate much larger datasets, of the same order of magnitude as syntactic treebanks, using linguistic knowledge that has only recently begun to become available. Such datasets could be used to induce semantic parsers for under-resourced languages, with possible applications of semantic parsing in statistical machine translation.

1st November, 2012 Dominic Rout - The University of Sheffield - Drowning in Tweets: Automatic Summarisation of Twitter's home timelines

Social networks such as Twitter present vast oceans of information in which it's easy for the average user to drown. Where content is generated by absolutely anyone in no time, it's easy to see why the number of incoming tweets can quickly become too much to handle. This talk discusses the problem of 'information overload' on social network services. We present a study that helped to demonstrate how twitterers at The University of Sheffield are interested in only a fraction of the content to which they are exposed. We also provide a background and describe the state of the art in personalised timeline summarisation for Twitter users.

This presentation is given as part of the speaker's PhD research programme and discusses his ongoing work.

25th October, 2012 Vasileios Lampos - The University of Sheffield - Detecting Events and Patterns in the Social Web with Statistical Learning

A vast number of textual web streams are influenced by events or phenomena emerging in the real world. The Social Web forms an excellent modern paradigm, where unstructured user-generated content is published on a regular basis and on most occasions is freely distributed. The main purpose of this talk is to present methods that enable us to automatically extract useful conclusions from this raw information in both supervised and unsupervised learning scenarios. Our input data stream will be the micro-blogging service Twitter, and presented applications will include the 'nowcasting' of influenza-like illness rates as well as collective mood analysis for the UK.

Selected Publications

[1] V. Lampos and N. Cristianini. Nowcasting Events from the Social Web with Statistical Learning. ACM TIST 3(4), no. 72, 2012.

[2] V. Lampos. Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning Methods. PhD Thesis, University of Bristol, 2012.

18th October, 2012 Oier Lopez De Lacalle Lekuona - University of Cambridge Visiting Scholar - Domain Specific Word Sense Disambiguation

Word Sense Disambiguation (WSD), in its broadest sense, can be considered the task of determining the sense of every word occurring in a context. Computationally, it can be seen as a classification problem, where the senses are the classes, the context provides the evidence, and each occurrence of a word is assigned to one or more possible classes based on that evidence. WSD is often described as an "AI-complete" problem, whose solution presupposes a solution to complete Natural Language Understanding (NLU).

State-of-the-art methods which acquire linguistic knowledge from hand-tagged text mainly suffer from two drawbacks: the data-sparseness problem and the domain-shift problem. This is especially noticeable in WSD, where there is a lack of training examples. The domain-shift problem involves potential changes in word sense distribution and context distribution. These make it more difficult to estimate robust, high-performance models, and cause a degradation in performance when porting from one domain to another.

This work explores domain adaptation issues for WSD systems based on features induced with Singular Value Decomposition (SVD) and the use of unlabeled data. SVD and unlabeled data can help mitigate the data-sparseness problem and make it possible to port WSD systems across domains. SVD finds a condensed representation and significantly reduces the dimensionality of the feature space. This representation captures indirect, higher-order associations by finding linear combinations over features and occurrences of target words. This work presents how to induce the reduced feature space, and shows how it can help adapt a generic WSD system to specific domains.
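The reduced representation described can be sketched with a plain truncated SVD (illustrative only; the actual feature matrices and training setup are those of the thesis work):

```python
import numpy as np

def reduce_features(X, k):
    """Truncated SVD: project an occurrence-by-feature matrix X onto its
    top-k singular directions, giving a condensed representation that
    captures indirect, higher-order feature associations."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def project_new(X_new, X_train, k):
    """Map unseen occurrences (e.g. from a new domain) into the reduced
    space learned on X_train, via its top-k right singular vectors."""
    _, _, Vt = np.linalg.svd(X_train, full_matrices=False)
    return X_new @ Vt[:k].T
```

Training the matrix on a mix of labeled and unlabeled occurrences is what lets the reduced space bridge sparse features across domains.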

11th October, 2012 Diana McCarthy University of Cambridge Visiting Scholar - Compositionality modelling and non-compositionality detection with distributional semantics

Distributional similarity has been used as a proxy for modelling lexical semantics for nearly two decades. There is now a significant and growing interest in moving these models from lexical to phrasal semantics. For just under one decade, many computational linguistics researchers have applied distributional semantics to the task of detecting non-compositionality of candidate multiwords. In this talk, I will give an overview of my work in this area. I will focus on the more recent work I have collaborated on, with Siva Reddy and colleagues, which borrows techniques from the state-of-the-art phrasal compositional models for non-compositionality detection. Ultimately, these distributional models of phrasal semantics will need to be extended to incorporate non-compositionality.

4th October, 2012 Rob Gaizauskas The University of Sheffield - Applying ISO-Space to Healthcare Facility Design Evaluation Reports

This paper describes preliminary work on the spatial annotation of textual reports about healthcare facility design, in support of the long-term goal of linking report content to a three-dimensional building model. Emerging semantic annotation standards enable the formal description of multiple types of discourse information. In this instance, we investigate the application of a spatial semantic annotation standard at the building-interior level, where most prior applications have been at the inter-city or street level. Working with a small corpus of design evaluation documents, we have begun to apply the ISO-Space specification to annotate spatial information in healthcare facility design evaluation reports. These reports present an opportunity to explore semantic annotation of spatial language in a novel situation. We describe our application scenario, report on the sorts of spatial language found in design evaluation reports, discuss issues arising when applying ISO-Space to building-level entities, and propose possible extensions to ISO-Space to address the issues encountered.

27th September, 2012 Kashif Shah The University of Sheffield - Weighting parallel data for model adaptation in SMT

Statistical Machine Translation (SMT) systems use parallel texts as training material for creating the translation model, and monolingual corpora for target language modeling. The performance of an SMT system depends heavily upon the quality and quantity of available data. In order to train the translation model, parallel texts are collected from various sources and domains. These corpora are usually concatenated, word alignments are calculated, phrases are extracted and their translation probabilities are estimated. This means that the corpora are not weighted according to their importance to the domain of the translation task; instead, it is the domain of the training resources that influences which translations are selected among several choices. This is in contrast to the training of the language model, for which well-known techniques are used to weight the various sources of text. We have proposed novel methods to automatically weight heterogeneous data to adapt the translation model. I will present the underlying architecture of the proposed techniques, along with experiments and results.

2011 - 2012

28th June, 2012 Chris Daniels The University of Sheffield (CICS) - To talk to a person, press one: An insider's view of the Automated University Switchboard

The Automated Switchboard was the first general use of Speech Self Service at the University. Naturally the new service would be met with both interest and resistance. This talk will provide an anecdotal account of the design, development and evaluation processes used in its implementation. It will discuss the development tools and grammar design in addition to questions beyond the technical surrounding the social and political elements of replacing a human operator with an automated system at the University.

21st June, 2012 Yang Feng The University of Sheffield - Left-to-Right Tree-to-String Decoding with Prediction

Decoding algorithms for syntax-based machine translation suffer from high computational complexity, a consequence of intersecting a language model with a context-free grammar. Left-to-right decoding, which generates the target string in order, can improve decoding efficiency by simplifying the language model evaluation. This paper presents a novel left-to-right decoding algorithm for tree-to-string translation, using a bottom-up parsing strategy and dynamic future cost estimation for each partial translation. Our method outperforms previously published tree-to-string decoders, including a competing left-to-right method.

30th May, 2012 Douwe Gelling The University of Sheffield - Using Senses in HMM Word Alignment

Some of the most widely used models for statistical word alignment are the IBM models. Although these models generate acceptable alignments, they do not exploit the rich information found in lexical resources, and as such have no reasonable means to choose better translations for specific senses.

We try to address this issue by extending the IBM HMM model with an extra hidden layer which represents the senses a word can take, allowing similar words to share similar output distributions. We test a preliminary version of this model on English-French data. We compare different ways of generating senses and assess the quality of the alignments relative to the IBM HMM model, as well as the generated sense probabilities, in order to gauge the usefulness in Word Sense Disambiguation.

28th May, 2012 Judita Preiss The University of Sheffield - Identifying Comparable Corpora Using LDA

Parallel corpora have applications in many areas of Natural Language Processing, but are very expensive to produce. Much information can be gained from comparable texts, and we present an algorithm which, given any bodies of text in multiple languages, uses existing named entity recognition software and a topic detection algorithm to generate pairs of comparable texts without requiring a parallel corpus training phase. We evaluate the system's performance, firstly on data from the online newspaper domain, and secondly on Wikipedia cross-language links.
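As a rough sketch of the pairing step, documents in two languages can be matched by the similarity of their topic distributions (the vectors below are purely illustrative; the actual system also exploits named entity recognition):

```python
import numpy as np

# Hypothetical topic distributions for documents in two languages,
# as produced by a topic detection algorithm.
en_docs = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.7, 0.2]])
fr_docs = np.array([[0.2, 0.6, 0.2],
                    [0.7, 0.2, 0.1]])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Greedily pair each English document with its most similar
# French document to propose candidate comparable-text pairs.
pairs = [(i, int(np.argmax([cosine(e, f) for f in fr_docs])))
         for i, e in enumerate(en_docs)]
print(pairs)  # [(0, 1), (1, 0)]
```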

10th May, 2012 Federico Sangati (University of Edinburgh) - Accurate Parsing with Compact Tree-Substitution Grammars: Double-DOP

I will mainly present my EMNLP 2011 paper describing a novel approach to Data-Oriented Parsing (DOP). Like other DOP models, the parser utilizes syntactic fragments of arbitrary size from a treebank to analyse new sentences, but, crucially, it uses only those which are encountered at least twice in the training data. This follows the general assumption of considering a syntactic construction linguistically relevant if there is some empirical evidence about its reusability in a representative treebank. This criterion allows us to work with a relatively small but representative set of fragments, which can be employed as the symbolic backbone of several probabilistic generative models. For parsing we define a transform-backtransform approach that allows us to use standard PCFG technology, making our results easily replicable. According to standard Parseval metrics, our best model is on par with other state-of-the-art parsers, while offering some complementary benefits: a simple generative probability model, and an explicit representation of the larger units of grammar.

In the final part of the talk I will introduce my current parsing framework: an efficient and accurate incremental Double-DOP parser which only utilizes lexicalized recurring fragments.

3rd May, 2012 Internal Paper Presentations

Daniel Preotiuc Real Time Analysis of Social Media Text

The emergence of online social networks (OSNs) and the accompanying availability of large amounts of data pose a number of new natural language processing (NLP) and computational challenges. Data from OSNs differs from data from traditional sources (e.g. newswire): the texts are short, noisy and conversational. Another important issue is that the data arrives in real-time streams, requiring immediate analysis that is grounded in time and context.

I will describe a new open-source framework for efficient text processing of streaming OSN data. I will present the current state of its development as well as some novel contributions to tackle two important issues: social network user location and recall-oriented information retrieval.

Jing Li Biologically-inspired Building Recognition

Building recognition has attracted much attention in computer vision research. However, existing building recognition systems have the following problems: 1) extracted features are not biologically related to human visual perception; 2) features are usually of high dimensionality, leading to the curse of dimensionality; 3) there is a semantic gap between low-level visual features and high-level image concepts; and 4) published databases set only limited challenges. To this end, we propose a biologically-inspired building recognition scheme and create a new building image database to address the aforementioned problems. The scheme is based on biologically-inspired features that model the process of human visual perception. To deal with the curse of dimensionality, the dimensionality of the extracted features is reduced by linear discriminant analysis (LDA). To bridge the semantic gap, a relevance feedback-based support vector machine (SVM) is applied for classification.

29th March, 2012 Massimo Poesio (The University of Essex) - Rethinking anaphora

Current models of the anaphora resolution task achieve mediocre results for all but the simpler aspects of the task, such as coreference proper (i.e. linking proper names into coreference chains). One of the reasons for this state of affairs is the drastically simplified picture of the task underlying existing annotated resources and models, e.g. the assumption that human subjects by and large agree on anaphoric judgments. In this talk I will present the current state of our efforts to collect more realistic judgments about anaphora through the Phrase Detectives online game, and to develop models of anaphora resolution that do not rely on the total agreement assumption.

Joint work with Jon Chamberlain and Udo Kruschwitz

15th March, 2012 Internal Paper Presentations

Nikos Aletras - Computing Similarity between Cultural Heritage Items using Multimodal Features

A significant amount of information about Cultural Heritage artefacts is now available in digital format and has been made available in digital libraries. Being able to identify items that are similar would be useful for search and navigation through these data sets. Information about items in these repositories is often multimodal, such as pictures of the artefact and an accompanying textual description. This paper explores the use of information from these various media for computing similarity between Cultural Heritage artefacts. Results show that combining information from images and text produces better estimates of similarity than when only a single medium is considered.

Mark Hall - Enabling the Discovery of Digital Cultural Heritage Objects through Wikipedia

Over recent years, large digital cultural heritage collections have become increasingly available. While these provide adequate search functionality for the expert user, they may not offer the best support for non-expert or novice users. In this paper we propose a novel mechanism for introducing new users to the items in a collection by allowing them to browse Wikipedia articles which are augmented with items from the cultural heritage collection. Using Europeana as a case study, we demonstrate the effectiveness of our approach in encouraging users to spend longer exploring items in Europeana compared with the existing search provision.

8th March, 2012 Robert Villa (The University of Sheffield / Information School) - Can an Intermediary Collection Help Users Search Image Databases Without Annotations?

Developing methods for searching image databases is a challenging and ongoing area of research. A common approach is to use manual annotations, although generating annotations can be expensive in terms of time and money. Content-based search techniques which extract visual features from image data can be used, but users are typically forced to express their information need using example images, or through sketching interfaces. This can be difficult if no visual example of the information need is available, or when the information need cannot be easily drawn.

In this talk an alternative approach is considered, where a final content-based image search is mediated by an intermediate database which contains annotated images. A user can search by conventional text means in the intermediate database, as a way of finding visual examples of their information need. The visual examples can then be used to search a database that lacks annotations. Experiments which investigated this idea, culminating in a small user study, will be discussed in this talk.

19th January, 2012 Maria Liakata (Aberystwyth University / European Bioinformatics Institute (EMBL-EBI)), Cambridge - Towards reasoning with scientific articles: identifying conceptualisation zones and beyond

Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. Here I discuss our approach and results from automatically annotating the scientific discourse at the sentence level in terms of eleven categories, which we call the Core Scientific Concepts. I will present applications of this work in extractive summarisation and its implications in improving our automatic understanding of scientific articles.

8th December, 2011 Sascha Kriewel (Universität Duisburg-Essen) - Introduction to Daffodil / ezDL

Daffodil was created to provide strategic support through high-level search functions to users of Digital Libraries. It is based on ideas by Marcia Bates with the goal of supporting the entire scientific workflow.

The agent-based architecture of the backend can be easily extended to add new services, and the tool-based user client can be configured into different perspectives for specific tasks. Since 2009 the software has been re-implemented as ezDL (easy access to Digital Libraries). ezDL is currently used within several running projects and provides a platform for user-based evaluations, e.g. within the INEX iTrack.

17th November, 2011 Elaine Toms (The University of Sheffield) - Designing the next generation information appliance

Finding information has been all about plugging keywords into a search box and scanning a ranked list of items where the ranking has been based on a mysterious and somewhat magical query-keyword match with a set of documents. This has led to unreasonable expectations about the power of the search box, and disappointment in results particularly in workplace settings where outputs have productivity, profit, and performance implications. How do we move beyond this simple "bag of words" approach? The problem has both algorithmic and interface issues that are tightly inter-related. In this talk, I will discuss two studies, one in which we considered the interface problem, and one in which we started from the beginning -- the requirements for an application, rather than from the source of documents to be used.

10th November, 2011 Ahmet Aker (The University of Sheffield) - Conceptual Modelling for Multi-Document Summarization

I will talk about the paper I presented in ACL 2010 (see abstract of the paper below). However, I will also discuss my current experiments and ask for feedback from your side. I hope with those new experiments I can finalize my PhD.

This paper presents a novel approach to automatic captioning of geo-tagged images by summarizing multiple web-documents that contain information related to an image's location. The summarizer is biased by dependency pattern models towards sentences which contain features typically provided for different scene types such as those of churches, bridges, etc. Our results show that summaries biased by dependency pattern models lead to significantly higher ROUGE scores than both n-gram language models reported in previous work and also Wikipedia baseline summaries. Summaries generated using dependency patterns also lead to more readable summaries than those generated without dependency patterns.

3rd November, 2011 Ayman Alhelbawy (The University of Sheffield) - Disambiguating Named Entities against a Reference Knowledge Base

The task of Named Entity Linking, as defined in the recent NIST knowledge base population evaluation, aims at associating named entities with a corresponding explanatory document - a document that contains information about that entity - in a given document collection. There are two main challenges in this task. The first challenge is the ambiguity of the named entity: the same named entity string can occur in different contexts with different meanings, and a named entity may be denoted using various forms such as acronyms and nicknames. The second challenge is to decide whether the named entity is absent from the document collection and, if so, to link it to the "NIL" entry. A survey of some methodologies that have been used to perform the entity linking task is presented, in addition to the baseline approach. The data sets used for evaluation and training will also be explored. Finally, the evaluation metrics used and some state-of-the-art results will be presented.

20th October, 2011 Udo Kruschwitz (University of Essex) - Exploiting Implicit Feedback: From Search to Adaptive Search

This talk will give an overview of the information retrieval work we conduct in the Language and Computation Group at the University of Essex on building adaptive domain models that can assist in searching or navigating document collections. We are particularly interested in searching local Web sites, digital libraries and other collections. Such collections are different from the Web in that spamming is not an issue, searchers are less heterogeneous and often there is only a single document satisfying an information need. The underlying assumption of our work is that we can use implicit feedback such as queries submitted, documents clicked on etc. to build domain models that assist other users with similar requests in finding the relevant documents quickly. Our ongoing work is about applying different algorithms in the construction and automatic adaptation of domain models but also about finding ways to evaluate these models.

13th October, 2011 Mark Stevenson (The University of Sheffield) - Disambiguation of Medline Abstracts using Topic Models

Topic models are an established technique for generating information about the subjects discussed in collections of documents. Latent Dirichlet Allocation (LDA) is a widely applied topic model. We apply LDA to a corpus of Medline abstracts and compare the topics that are generated against manually curated labels, Medical Subject Headings (MeSH) codes.

The models generated by LDA consist of sets of terms associated with each topic and these are used to provide context for a Word Sense Disambiguation (WSD) system. It is found that using this context leads to a statistically significant improvement in the performance of a graph-based WSD system when applied to a standard evaluation resource in the biomedical domain.

Information about the topic of a document has already been shown to be useful for WSD of Medline abstracts. Previous approaches have relied on using MeSH codes but these have to be added manually. We demonstrate that information about the topic of abstracts can be identified without the need for manual annotation, by using an unsupervised technique, and can also be used to improve WSD performance.

6th October, 2011 Chris Dyer (Carnegie Mellon University) - Unsupervised Word Alignment and Part of Speech Induction with Undirected Models

This talk explores unsupervised learning in undirected graphical models for two problems in natural language processing. Undirected models can incorporate arbitrary, non-independent features computed over random variables, thereby overcoming the inherent limitation of directed models, which require that features factor according to the conditional independencies of an acyclic generative process. Using word alignment (finding lexical correspondences in parallel texts) and bilingual part-of-speech induction (jointly learning syntactic categories for two languages from parallel data) as case studies, we show that relaxing the acyclicity requirement lets us formulate more succinct models that make fewer counterintuitive independence assumptions. Experiments confirm that our undirected alignment model yields consistently better performance than directed model baselines, according to both intrinsic and extrinsic measures. With POS tagging, we find more tentative results. Analysis reveals that our parameter learner tends to get caught in shallow local optima corresponding to poor tagging solutions. Switching to an alternative learning objective (contrastive estimation; Smith and Eisner, 2005) improves the stability and performance, but it suggests that non-convex objectives may be a larger problem in undirected models than with directed models.

Joint work with Noah Smith, Desai Chen, Shay Cohen, Jon Clark, and Alon Lavie

15th September, 2011 Rao Nawab (The University of Sheffield) - External Plagiarism Detection using Information Retrieval and Sequence Alignment

This talk describes the University of Sheffield entry for the 3rd International Competition on Plagiarism Detection which attempted the monolingual external plagiarism detection task. A three stage framework was used: preprocessing and indexing, candidate document selection (using an Information Retrieval based approach) and detailed analysis (using the Running Karp-Rabin Greedy String Tiling algorithm). The submitted system obtained an overall performance of 0.0804, precision of 0.2780, recall of 0.0885 and granularity of 2.18 in the formal evaluation.
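For illustration, a simplified, unoptimised version of Greedy String Tiling over token sequences might look as follows (the submitted system used the Running Karp-Rabin variant, which adds rolling hashes for speed):

```python
def greedy_string_tiling(a, b, min_match=3):
    """Repeatedly take the longest common contiguous run of tokens not
    yet covered by an existing tile, until runs fall below min_match."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        best, best_len = None, 0
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > best_len:
                    best, best_len = (i, j), k
        if best_len < min_match:
            break
        i, j = best
        for off in range(best_len):  # mark the tile as covered
            marked_a[i + off] = marked_b[j + off] = True
        tiles.append((i, j, best_len))
    return tiles

a = "the quick brown fox jumps over the lazy dog".split()
b = "a quick brown fox leaps over the lazy dog".split()
print(greedy_string_tiling(a, b))  # [(5, 5, 4), (1, 1, 3)]
```

The total tile coverage relative to document length gives a simple plagiarism score for a candidate document pair.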

2010 - 2011

6th July, 2011 Paola Velardi (Universita di Roma) - A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch

In this talk I present a novel graph-based approach aimed at learning a lexical taxonomy automatically, starting from a domain corpus and the Web. Unlike many taxonomy learning approaches in the literature, the algorithm learns both concepts and relations entirely from scratch via the automated extraction of terms, definitions and hypernyms. This results in a very dense, cyclic and possibly disconnected hypernym graph. The algorithm then induces a taxonomy from the graph via optimal branching. Experiments show high-quality results, both when building brand-new taxonomies and when reconstructing WordNet sub-hierarchies.

This research is the result of joint work with Roberto Navigli and Stefano Faralli.

R. Navigli, P. Velardi. Learning Word-Class Lattices for Definition and Hypernym Extraction. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), Uppsala, Sweden, July 11-16, 2010.

R. Navigli, P. Velardi, S. Faralli. A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch. To appear in Proc. of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July 19-22, 2011.

30th June, 2011 Ann Copestake (University of Cambridge) - Formal semantics and dependency structures

Logical representations and dependency structures are both used to describe aspects of the meaning of natural language sentences, but are formally very different. In this talk, I will show that one widely used form of logical representation can be transformed into graph structures comparable to dependency representations without loss of information. This has some significant practical advantages for language processing.

23rd June, 2011 Peter Wallis (The University of Sheffield) - Engineering Spoken Dialogue Systems

Having a conversation with a machine has many commercial applications and a certain sex appeal for students. What is more, it is a grand challenge that could provide a unifying theme for much of the departmental research. The dialogue manager is, I believe, where there is the greatest opportunity for improvement in spoken dialogue systems, and in this talk I contrast my approach with POMDPs. Partially Observable Markov Decision Processes are an elegant approach to the problem of structuring conversation, but it is not clear that the work being done on them will lead to useful systems. In this talk I argue for an agent-based approach to dialogue and provide a set of algorithms from the literature.

9th June, 2011 Internal Research Student Presentations

Niraj Aswani - Evolving a General Framework for Text Alignment: Case Studies with Two South Asian Languages

A gold standard is an essential requirement for automatic evaluation of text alignment algorithms and approaches such as semi-automatic or incremental learning can be used to speed up the process of creating one. In this talk, I will describe a general framework for text alignment that supports manual creation of a gold-standard while in the background updating the language resources used to suggest an initial alignment. In particular, the talk will cover a case study of developing language resources for the English-Hindi language pair. Our focus is on the South Asian languages that are similar to the Hindi language for which the resources are scarce. I will demonstrate the generality of the approach by adapting the resources for the English-Gujarati language pair.

Danica Damljanovic - Usability Enhancement Methods in Natural Language Interfaces for Querying Ontologies

Recent years have seen a tremendous increase in structured data on the Web, with the Linked Open Data project encouraging the publication of even more. This massive amount of data requires effective exploitation, which is now a big challenge, largely because of the complexity and syntactic unfamiliarity of the underlying triple models and the query languages built on top of them. Natural Language Interfaces (NLIs) are increasingly relevant for information systems fronting rich structured data stores such as RDF and OWL repositories, largely because they are seen as intuitive for humans. Many NLIs to ontologies have been developed; however, little work has been done on testing the usability of these systems and the usability enhancement methods which can improve their performance. In this paper, we assess the effect of these methods through two user-centric studies of two systems: QuestIO and FREyA. The first study assesses the usability of QuestIO, which is fully automatic, in comparison to traditional ways of searching. The second assesses the usability of FREyA, which involves the user in the loop, with special emphasis on feedback. Our results highlight the expressiveness of the language supported by QuestIO and FREyA, and also the importance of feedback, which is shown to improve overall usability and user experience. In addition, the combination of feedback and clarification dialogs in FREyA is shown to outperform state-of-the-art systems.

2nd June 2011 Piek Vossen (Vrije Universiteit Amsterdam) - The KYOTO project: a cross-lingual platform for open text mining

The European-Asian project KYOTO developed a platform for mining concepts and events from text across different languages. It uses a layered stand-off representation of text that is shared by 7 languages: English, Dutch, Italian, Spanish, Basque, Chinese and Japanese. The KYOTO Annotation Format (KAF) distinguishes separate layers for structural and semantic aspects of the text that can be stacked on top of each other and that can be extended easily. Once a structural representation of the text in KAF is created, semantic layers are added using modules that work the same for all the languages, creating an interoperable semantic interpretation of the text. The semantic layers are based on wordnet concepts linked to a shared ontology and named entities.
From the semantically annotated text, KYOTO derives, on the one hand, terminology databases with concepts anchored to the wordnets and, through these, to the ontology, and, on the other hand, events with participants mentioned in the text, which are instantiations of these concepts. The detection of the latter is aided by the conceptual database. Ultimately, every word and expression in the text is connected to the ontology. Likewise, events and their participants are mined by defining patterns using constraints in the shared ontology, e.g. a physical_object in the object position of a change_of_integrity process. Such patterns can be applied to text in any language, since the structural units in each language are mapped to the same concept structure. Mined events are related to times and places, detected as named entities. This turns events into potential facts: they took place at some point in time in some place. These potential facts support applications that can group all events that took place in the same area in the same period and that may be semantically related or show some conceptual coherence. KYOTO carried out first evaluations of the precision and recall of this open-event mining approach, and developed a semantic search application that exploits the rich data. Such a search system bridges the gap between rich text mining and comprehensive search over text indexes.

19th May 2011 Internal Research Student Presentations

Kumutha Swampillai - Overview of Research Topic

Douwe Gelling - Overview of Research Topic

12th May 2011 Leon Derczynski (The University of Sheffield) - Processing Temporal Relations

Language requires a description of time in order to allow us to describe change, to plan, and to discuss history. Temporal information extraction has been a persistently difficult task over the past decade. I will discuss my PhD research in this area and outline a partially data-driven method to extract temporal relations from natural language text, with good results.

5th May, 2011 David Weir (University of Sussex) - Exploiting Distributional Semantics: exploring asymmetry and non-standard contextual features

The distributional hypothesis asserts that words that occur in similar contexts tend to have similar meanings. A growing body of research has been concerned with exploiting the connection between language use and meaning, and much of this work has involved measuring the distributional similarity of words based on the extent to which they share similar contexts. In this talk I look at two particular aspects of how distributional similarity can be measured: the value of asymmetry and the choice of co-occurrence features. These issues will be considered in the context of various applications, including cross-domain sentiment analysis and the detection of non-compositionality.
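To illustrate why asymmetry matters, here is one simple precision-style measure over toy distributional profiles (a hypothetical example in the spirit of asymmetric measures from the literature, not the talk's exact formulation):

```python
def cover(u_feats, v_feats):
    """Asymmetric similarity: the proportion of u's feature weight
    shared with v. Unlike cosine, cover(u, v) != cover(v, u)."""
    shared = sum(w for f, w in u_feats.items() if f in v_feats)
    total = sum(u_feats.values())
    return shared / total if total else 0.0

# Toy distributional profiles (context feature -> co-occurrence weight).
apple = {"eat": 3, "tree": 2}
fruit = {"eat": 4, "tree": 3, "juice": 2, "ripe": 1}

# A narrow term's contexts are largely covered by its hypernym's,
# but not vice versa, so the direction of the score is informative.
print(cover(apple, fruit))  # 1.0
print(cover(fruit, apple))  # 0.7
```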

14th April, 2011 Paul Rayson (Lancaster University) - Extreme NLP - Co-presenting with Will Simm, Scott Piao and Maria-Angela Ferrario

In this talk, we will describe Natural Language Processing research and applications which can be loosely described as 'Extreme NLP'. At Lancaster, a number of projects apply NLP techniques in extreme or harsh circumstances and to controversial or challenging topics. For example, we will describe the problems faced when applying corpus-based NLP methods and tools to historical data (Early Modern English) and to online varieties of language (social networks, emails, blogs). Short texts, informal messages and high volumes of data cause multiple issues for existing tools trained on modern standard varieties of language. Novel application areas, such as online child protection, crime, environmental issues and serendipity, also mean that it is sometimes difficult to be precise about the exact techniques that are employed.

7th April, 2011 Edward Grefenstette (University of Oxford) - Categorical Compositionality for Distributional Semantics, Without Tears

Coecke, Sadrzadeh, and Clark (arXiv:1003.4394v1 [cs.CL]) developed a compositional model of meaning for distributional semantics, in which each word in a sentence has a meaning vector and the distributional meaning of the sentence is a function of the tensor products of the word vectors. Abstractly speaking, this function is the morphism corresponding to the grammatical structure of the sentence in the category of finite dimensional vector spaces. In this paper, we provide a concrete method for implementing this linear meaning map, by constructing a corpus-based vector space for the type of sentence. Our construction method is based on structured vector spaces whereby meaning vectors of all sentences, regardless of their grammatical structure, live in the same vector space. Our proposed sentence space is the tensor product of two noun spaces, in which the basis vectors are pairs of words each augmented with a grammatical role. This enables us to compare meanings of sentences by simply taking the inner product of their vectors.
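
A minimal sketch of the sentence-space idea follows, with invented 3-dimensional noun vectors. In the actual model the verb contributes the linear map that combines subject and object; here it is folded into the vectors for brevity, so this illustrates only the tensor-product sentence space and the inner-product comparison.

```python
def outer(u, v):
    # Tensor (outer) product: sentence space = noun space (x) noun space,
    # so a 3-d subject and 3-d object give a 9-d sentence vector.
    return [ui * vj for ui in u for vj in v]

def inner(x, y):
    # Sentences of any grammatical structure live in the same space,
    # so their meanings can be compared with a plain inner product.
    return sum(a * b for a, b in zip(x, y))

# Hypothetical (subject, object) noun vectors for two toy sentences:
s1 = outer([1.0, 0.0, 2.0], [0.0, 1.0, 1.0])
s2 = outer([1.0, 0.5, 2.0], [0.0, 1.0, 0.5])

print(inner(s1, s2))  # 7.5
```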

31st March, 2011 Alexander Clark (Royal Holloway University of London) - Distributional Lattice Grammars: a learnable representation for syntax

A central problem for NLP is grammar induction: the development of unsupervised learning algorithms for syntax. In this paper we present a lattice-theoretic representation for natural language syntax, called Distributional Lattice Grammars.

These representations are objective or empiricist, based on a generalisation of distributional learning, and are capable of representing all regular languages, some but not all context-free languages and some non-context-free languages. We present a simple algorithm for learning these grammars together with a complete self-contained proof of the correctness and efficiency of the algorithm, and we discuss the relevance of this work to the problems of theoretical linguistics.

17th March, 2011 Stephen Clark (University of Cambridge) - Practical Linguistic Steganography using Synonym Substitution - joint work with Ching-Yun (Frannie) Chang

Linguistic Steganography is concerned with hiding information in a natural language text, for the purposes of sending secret messages. A related area is natural language watermarking, in which information is added to a text in order to identify it, for example for the purposes of copyright. Linguistic Steganography algorithms hide information by manipulating properties of the text, for example by replacing some words with their synonyms. Unlike image-based steganography, linguistic steganography is in its infancy with little existing work. In this talk I will motivate the problem, in particular as an interesting application for NLP and especially generation. Linguistic steganography is a difficult NLP problem because any change to the cover text must retain the meaning and style of the original, in order to prevent detection by an adversary.

Our method embeds information in the cover text by replacing words in the text with appropriate substitutes, making the task similar to the standard lexical substitution task. We use the Google n-gram data to determine if a substitution is acceptable, obtaining promising results from an evaluation in which human judges are asked to rate the acceptability of sentences.
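
The acceptability check can be sketched as a frequency threshold on background n-gram counts. The counts and threshold below are invented stand-ins for lookups into the Google n-gram data, and the word lists are hypothetical.

```python
# Hypothetical trigram counts standing in for the Google Web1T data;
# a real system would query the full n-gram collection.
TRIGRAM_COUNTS = {
    ("a", "large", "dog"): 950,
    ("a", "big", "dog"): 2100,
    ("a", "sizeable", "dog"): 0,
}

def acceptable(left, candidate, right, threshold=100):
    """Accept a synonym substitution only if the resulting n-gram is
    frequent enough in the background counts to read naturally."""
    return TRIGRAM_COUNTS.get((left, candidate, right), 0) >= threshold

def viable_substitutes(left, word, right, synonyms):
    # The substitutes that survive this filter are the ones available
    # for carrying hidden bits without arousing suspicion.
    return [s for s in synonyms if s != word and acceptable(left, s, right)]

print(viable_substitutes("a", "big", "dog", ["large", "sizeable", "big"]))
```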

10th March, 2011 Internal Research Student Presentations

Xingyi Song - Overview of research topic

Daniel Preotiuc - Overview of research topic

Samuel Fernando - Enriching knowledge bases from Wikipedia

Lexical knowledge bases, such as WordNet, have been shown to be useful in a wide range of language processing applications. Enriching such resources using the usual manual approach is costly. This thesis explores methods for enriching WordNet using information from Wikipedia.

The approach consists of mapping concepts in WordNet to corresponding articles in Wikipedia. This is done in three stages. First, a set of possible candidate articles is retrieved for each WordNet concept. Second, text similarity scores are used to select the best match from the candidate articles. Finally, the mappings are refined using information from Wikipedia links to give a set of high quality matches. Evaluation reveals that this approach generates mappings with accuracy of over 90%.

This information is then used to enrich relations in WordNet using Wikipedia links. The enriched WordNet is then used with a knowledge based Word Sense Disambiguation system, and evaluated on Semeval 2007 test data. Using WordNet alone gives accuracy of 70%, but with the enriched WordNet the performance is boosted to 84% correct disambiguation, rivalling state-of-the-art performance on this data set.

3rd March, 2011 John Carroll (University of Sussex) - Text Mining from User-Generated Content

Over the past five years or so, technology has made it possible for members of the general public to create and publish digital media content, for example in the form of video, audio, or text. Being able to process such content automatically to derive relevant information from it will be of great societal and commercial benefit. In this talk I will present a number of research and commercial applications which I and collaborators are developing, in which we process digital text from sources as diverse as mobile phone text messages, non-native language learner essays, and primary care medical notes. These applications involve a number of language processing challenges, and I will outline how we have overcome them.

24th February, 2011 Leon Derczynski (The University of Sheffield) - ESSLLI course - Word Senses

In an introduction to the tasks of word sense disambiguation and word sense induction, we will discuss a wide range of techniques for the two tasks, from fundamental concepts to state of the art. Further, we survey tools for the development of systems able to participate in past and current evaluation exercises for WSD and WSI (ref: Semeval).

17th February, 2011 Lucia Specia (University of Wolverhampton) - Quality Estimation for Machine Translation

One of the most popular ways to incorporate Machine Translation (MT) into the human translation workflow is to have humans checking and post-editing the output of MT systems. However, the post-editing of a proportion of the translated segments may require more effort than translating those segments from scratch, without the aid of an MT system. In this talk I will introduce some of my work on quality estimation for MT: the task of predicting the quality of sentences produced by machine translation systems, where "quality" is defined in terms of post-editing effort. A quality estimation system can be used to filter out bad quality translations to prevent human translators spending time post-editing them. I will present the outcomes of experiments with different ways of estimating quality which demonstrate that it is possible to predict post-editing effort using standard machine learning techniques with a relatively small number of training examples and a number of shallow features.

10th February, 2011 Rao Nawab (The University of Sheffield) - Automatic Plagiarism Detection

The task of plagiarism detection using automatic methods has attracted the attention of the academic, commercial and publishing communities. The main objective of my PhD thesis is to explore the problem of automatically detecting extrinsic plagiarism (when the plagiarized text is created by paraphrasing) using IR and NLP techniques.

The first part of my talk will give an overview of the two-stage framework of my PhD thesis: 1) a candidate document selection stage and 2) a detailed analysis stage. The aim of the first stage is to reduce the search space, whereas that of the second stage is to identify the suspicious-source sections within the reduced search space. The second part of my talk will present my current work on the candidate document selection stage and a brief summary of the results. Suggestions and feedback from the group will be of great value to me.

3rd February, 2011 Adam Kilgarriff (Lexical Computing Ltd.) - Using Corpora Without the Pain

Corpora are large objects and querying them efficiently is non-trivial. There are substantial costs to building them, storing them, maintaining them, and building and maintaining software to access them. We propose a model where this work is done by a corpus specialist and NLP systems then use corpora via web services or (if there is a local installation) a command-line API. Our corpus tool is fast, even for billion-word corpora, and offers a wide range of queries via its web API. We have large corpora available for twenty-six languages, and are experts in preparing large corpora from the web, with particular expertise in web text cleaning and de-duplication. To increase our coverage of the world's languages, we have a 'corpus factory' programme. For English, we are building corpora that are both bigger and more richly marked up than others available. The 'big corpus' thread is BiWeC (BIg WEb Corpus) for which we currently have 5.5 billion words fully encoded. The 'more richly marked up' thread is the New Model Corpus, which we are setting up as a collaborative project for multiple annotation. The combination of the API model, the corpora, and the tools, will allow many NLP researchers to use bigger and better corpora in more sophisticated ways than would otherwise be possible.

27th January, 2011 Leon Derczynski (The University of Sheffield) - Review of courses from ESSLLI 2010

Last year, I attended the first week of the European Summer School for Logic, Language and Information. In this talk I will briefly recap two of the classes taken there.
Class 1 - Focus. An introduction to the phenomena and theories of focus at the levels of phonetics, phonology, syntax, semantics and pragmatics, and the interfaces between them. Common grammatical and contextual environments that trigger focus are surveyed. We will look in detail at the most prominent accounts of the semantics of focus and consider how they are applied in particular cases. Additional topics include issues of grammatical representation including scope; focus in the pragmatics of the question-answer relation, and the hypothesis that focus and question phrases have a single compositional semantics.

Class 2 - Word Sense Disambiguation and Induction. We introduce the audience to a wide range of techniques for the two tasks; in addition, we provide tools for the development of systems able to participate in past and current evaluation exercises for WSD and WSI (ref: Semeval).

13th January, 2011 Diana Maynard (The University of Sheffield) - The National Archives: The GATE-way to Government Transparency

In this talk I will describe work we are undertaking in a short project for the National Archives, improving access to the huge volumes of information they are making available as part of the initiative to publish government-related material in open and accessible forms as linked data. Together with our partners Ontotext, we have developed tools to import, store and index structured data in a scalable semantic repository holding hundreds of millions of documents, making links into this repository from regularly crawled web archive data, and enabling search via semantic annotation. Document annotation is first carried out using GATE, and then indexed via MIMIR, a new massively scalable multiparadigm index that forms part of the GATE and Ontotext product family.

9th December, 2010 Bill Byrne (University of Cambridge) - Hierarchical Phrase-based Translation with Weighted Finite State Transducers

I will present recent work in statistical machine translation which uses Weighted Finite-State Transducers (WFSTs) to implement a variety of search and estimation algorithms. I will describe HiFST, a lattice-based decoder for hierarchical phrase-based statistical machine translation. The decoder is implemented with standard WFST operations as an alternative to the well-known cube pruning procedure. I will discuss how improved modelling in translation results from the efficient representation of translation hypotheses and their derivations and scores under translation grammars. We find that the use of WFSTs in translation leads to fewer search errors, better parameter optimisation, improved translation performance, and the ability to extract useful confidence measures under the translation grammar.

8th November, 2010 John Tait (Information Retrieval Facility) - Slides

7th October, 2010 Danica Damljanovic (The University of Sheffield) - Natural Language Interfaces to Conceptual Models

Accessing structured data in the form of ontologies currently requires the use of formal query languages (e.g., SPARQL) which pose significant difficulties for non-expert users. One way to lower the learning overhead and make ontology queries more straightforward is through a Natural Language Interface (NLI). While there are existing NLIs to structured data with reasonable performance, they tend to require expensive customisation to each new domain. Additionally, they often require specific adherence to a pre-defined syntax which, in turn, means that users still have to undergo training. We study the usability of NLIs from two perspectives: that of the developer who is customising the NLI system, and that of the end-user who uses it for querying. We investigate whether usability methods such as feedback and clarification dialogs can increase the usability for end users and reduce the customisation effort for the developers. To that end, we have developed FREyA - an interactive NLI to ontologies which will be described and demoed during this talk.

2009 - 2010

5th August, 2010 David Guthrie (The University of Sheffield) - Storing the Web in Memory: Space Efficient Language Models using Minimal Perfect Hashing

The availability of text on the web and of very large text collections, such as the Gigaword corpus of newswire and the Google Web1T 1-5gram corpus, has made it possible to build language models incorporating counts of billions of n-grams. In this talk we present novel methods for efficiently storing these large models. We introduce three novel data structures that take advantage of the distribution of n-grams in corpora and make use of various numbers of minimal perfect hashes to compactly store language models containing full frequency counts of billions of n-grams. Our methods use significantly less space than all known approaches and have retrieval speed faster than current language modelling toolkits.
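
The underlying space-saving idea - store a slot per n-gram holding only a short fingerprint and a count, never the n-gram string itself - can be sketched as follows. This is not the talk's data structure: a plain dict stands in for the minimal perfect hash that assigns slots, and all sizes are illustrative.

```python
import hashlib

class CompactCounts:
    """Sketch of fingerprint-based n-gram count storage. A real system
    uses a minimal perfect hash to map each n-gram to a unique slot; a
    dict keyed on part of the hash stands in for that here. Only a small
    fingerprint plus the count is kept, so lookups of unseen n-grams can
    (rarely) return a wrong count - the price paid for compactness."""

    def __init__(self, fp_bits=12):
        self.fp_bits = fp_bits
        self.slots = {}

    def _key_fp(self, ngram):
        h = int.from_bytes(hashlib.sha1(" ".join(ngram).encode()).digest()[:8], "big")
        return h >> self.fp_bits, h & ((1 << self.fp_bits) - 1)

    def add(self, ngram, count):
        key, fp = self._key_fp(ngram)
        self.slots[key] = (fp, count)

    def get(self, ngram):
        key, fp = self._key_fp(ngram)
        stored = self.slots.get(key)
        # The fingerprint filters out most, though not all, false hits.
        if stored and stored[0] == fp:
            return stored[1]
        return 0

lm = CompactCounts()
lm.add(("new", "york", "city"), 48212)
print(lm.get(("new", "york", "city")))
print(lm.get(("purple", "york", "city")))  # 0 with overwhelming probability
```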

22nd July, 2010 Alberto Diaz (Universidad Complutense de Madrid) -

In the talk I'll give a short introduction to my research group (its members and high-level details of the main research areas), and then I'll explain my own research lines and projects in more detail. In particular, I'll talk about personalization for digital newspapers through user modelling and text classification, and about text processing for biomedical documents, including text summarization and ICD-9-CM indexing.

8th July, 2010 Laura Plaza (The University of Sheffield Visiting Researcher) - Improving Summarization of Biomedical Documents using Word Sense Disambiguation

We describe a concept-based summarization system for biomedical documents and show that its performance can be improved using Word Sense Disambiguation. The system represents the documents as graphs formed from concepts and relations from the UMLS. A degree-based clustering algorithm is applied to these graphs to discover different themes or topics within the document. To create the graphs, the MetaMap program is used to map the text onto concepts in the UMLS Metathesaurus. This paper shows that applying a graph-based Word Sense Disambiguation algorithm to the output of MetaMap improves the quality of the summaries that are generated.

24th June, 2010 Ronald Denaux (University of Leeds) -

Ronald will first present his work on involving domain experts in ontology engineering through the use of the Rabbit controlled natural language, a tailored ontology engineering methodology and a tailored user interface based on Protege (this all in the context of the Confluence project in a collaboration between the Ordnance Survey and the University of Leeds). In the second part, Ronald will present his current work on Multi-perspective Ontology Engineering where he is investigating a mechanism for capturing the perspective of ontology authors in order to enhance tool support for ontology creation and reuse. In particular, Ronald is working on formalising the purpose of ontologies and eliciting the goals of ontology authors through dialogue games (the second part is in the context of Ronald's PhD).

17th June, 2010 Hector Llorens (University of Sheffield Visiting Researcher) - Temporal information extraction using semantic roles and semantic networks

In recent years, there has been intensive research on the temporal elements of natural language text. The TimeML scheme has recently been adopted as the standard for annotating temporal expressions (TIMEX3), events (EVENT), and their relations ([T,A,S]LINK). This research analyzes the advantages of applying semantic information to the automatic annotation of TimeML elements. For that purpose, a system addressing the automatic annotation of TimeML elements is presented. The system implements an approach which uses semantic roles and semantic networks as additional information, extending classic approaches based on morphosyntactic information. A multilingual analysis, carried out by evaluating the system for Spanish, demonstrated that the approach is valid across languages, achieving results of the same quality and the same improvement over classic approaches. In the talk, I will include an "application proposal" which I intend to develop during my stay and which will be the application of my thesis. Your and your group's suggestions and feedback about my current and future work will be of great value to me.

30th April, 2010 Atefeh Farzindar (NLP Technologies Inc) - Successful cooperation between the university and industry

NLP Technologies and RALI (Applied Research in Computational Linguistics, Université de Montréal) have developed an automated monitoring system for the automatic summarization and translation of legal decisions. During this seminar, Atefeh Farzindar will discuss the successful cooperation between university and industry leaders, a milestone in applied research and technology transfer. Experience shows that when industry players combine their strengths and work alongside university experts with the same vision, the result is far greater than what could be achieved separately. She will present her experience with domain-based technologies in the legal and military fields.

22nd April, 2010 Miles Osborne (University of Edinburgh) - What is happening now? Finding events in Massive Message Streams

Social media (e.g. Twitter, blogs, forums, Facebook) has exploded over the last few years. Facebook is now the most visited site on the Web, with Blogger the 7th and Twitter the 13th. These sites contain the aggregated beliefs and opinions of millions of people on an epic range of topics, and in a large number of languages. Twitter in particular is an example of a massive message stream, and finding events embedded in it poses hard engineering challenges. I will explain how we use a variant of Locality Sensitive Hashing to find new stories as they break. The approach scales well, easily dealing with the more than 1 million Tweets a day we process while needing only a single processor. For June 2009, the fastest growing stories all concerned deaths of one kind or another.
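
The first-story detection idea can be sketched with random-hyperplane LSH, which buckets documents by the side of each hyperplane they fall on. This toy version declares a document new whenever its bucket is empty; the actual system additionally compares against near neighbours in the bucket, so treat this as an illustration of the hashing step only.

```python
import random

random.seed(0)
DIM, BITS = 16, 8
# Random hyperplanes define a locality-sensitive hash for cosine distance:
# vectors pointing in similar directions tend to get the same bit pattern.
PLANES = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def signature(vec):
    return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in PLANES)

def is_new_story(vec, buckets):
    """A document is 'new' if its LSH bucket holds no earlier document.
    Each document is hashed once, so the stream is processed in roughly
    constant time per item - the property that makes this scale."""
    sig = signature(vec)
    novel = not buckets.get(sig)
    buckets.setdefault(sig, []).append(vec)
    return novel

buckets = {}
doc = [random.random() for _ in range(DIM)]
print(is_new_story(doc, buckets))  # True: nothing seen yet
print(is_new_story(doc, buckets))  # False: an identical doc is already bucketed
```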

15th April, 2010 Peter Wallis (University of Sheffield) - Conversation in Context: what should a robot companion say?

Language as used by humans is a truly amazing thing with multiple roles in our lives. Academics have tended to focus on the way languages convey meaning, and disciplines that come new to the problem such as computer science tend to start with reference semantics and progress to models of meaning that look mathematical and hence solidly academic. Language as used is however beautifully messy. People sing, they lie and swear, they use metaphor and poetry, play word games and talk to themselves. Is there a better way to look at language? Interdisciplinary research is hard not only because each discipline has its own terminology, but also because they usually have different interests. Those of us interested in spoken language interfaces (computer science) however have a shared interest with applied linguistics in how language works in situ. This paper outlines a theory about how language works from applied linguistics and shows how the theory can be used to guide the design of a robot companion.

25th March, 2010 Adam Funk (University of Sheffield) - Ontology-Based Categorization of Web Services with Machine Learning

We discuss the problem of categorizing web services according to a shallow ontology for presentation on a specialist portal. We treat it as a text classification problem and apply first information extraction techniques (using keywords and rules), then machine learning (ML), and finally a combined approach in which ML has priority over keywords. The techniques are evaluated according to standard measures for flat categorization as well as the Balanced Distance Metric for ontological classification and compared with related work in web service categorization. The ML and combined categorization results are good and the system is designed to take users' contributions through the portal's Web 2.0 features as additional training data.

18th March, 2010 Elena Lloret (University of Alicante) - Text Summarization and its Applications in NLP Tasks

Text Summarization, which aims to condense the information contained in one or more documents and present it in a more concise way, can be very useful for helping users to manage the large amounts of information available due to the rapid growth of the Internet. In this talk, I will present the Natural Language Processing and Information Systems Research Group of the University of Alicante (Spain), and next I will focus on Text Summarization as the research topic of my PhD. I will describe a knowledge-based approach to generate extractive summaries, and how this approach has been successfully applied to neighbouring NLP tasks, such as Question Answering, Sentiment Analysis or Text Classification. Finally, some issues regarding the difficult task of evaluating summaries will also be outlined, along with preliminary ideas for new directions in summary evaluation.

26th February, 2010 René Witte (Concordia University in Montréal) - Software Engineering and Natural Language Processing: Friends or Foes?

This talk will investigate some connections between software engineering (SE) and natural language processing (NLP). It will attempt to answer questions such as "Why do software engineers use natural language artifacts everywhere, but no NLP?" and "Why, after more than 10 years of modern NLP research, do we still not have the most basic NLP functionalities integrated into our desktops?". In the first part, we examine NLP for SE: Documents written in natural languages constitute a major part of the artifacts produced during the software engineering lifecycle. Especially during software maintenance or reverse engineering, semantic information conveyed in these documents can provide important knowledge for the software engineer. However, while source code artifacts are well-managed by today's software development tools, documents are not integrated on a semantic level with their corresponding code artifacts. This results in a number of problems, like the loss of traceability links between code and its documentation (requirements specifications, user guides, design documents). We show how natural language processing approaches can be used to retrieve semantic information from software documents and connect them with source code using ontology alignment techniques. The second part of the talk will investigate the integration of existing NLP techniques (such as summarization or question-answering) into end-user desktop programs (such as email clients or word processors). This work is motivated by the observation that none of the impressive advances in NLP and text mining over the last decade has materialized in the tools and desktop environments in use today. The "Semantic Assistants" project aims to provide effective means for the integration of natural language processing services into existing applications, using an open service-oriented architecture based on OWL ontologies and W3C Web services.

25th February, 2010 Claude Roux (Xerox Research Labs) - TBA

4th February, 2010 Peter Wallis (University of Sheffield) - High Recall Search in Practice

Internet search engines do an amazing thing, but what they can do well has coloured our view of the general problem of search. There are cases where a search engine would be better if the searcher knew he or she had found everything relevant, but how often and how significant these cases are is an open question. One popular notion is that high recall is not that useful, since we can get by without it. Although the reasoning is sound, it does not mean there is not an opportunity to be had - Xerox faced this marketing problem with the photocopier, and jet engines had been in use for quite a while before their advantages were quantifiable. One situation where the need for high recall is acknowledged is defence intelligence. Defence has both the will and the resources to develop bespoke systems for its particular needs, and in this talk I describe, in some detail, the needs of the "Health Intelligence" community. I go on to describe how we addressed these needs using an Information Extraction system based on a library of "Fact Extractors".

17th December, 2009 Jose Iria (University of Sheffield) - Machine Learning Approaches to Text and Multimedia Mining

Today's search engines are able to retrieve and index several billion web pages, but the analysis that they perform on the content of these pages is still very shallow -- as is, consequently, the functionality that they are able to offer the user. What if these search engines could, for example, extract the factual content from the pages they retrieve, classify the pictures that accompany the text, disambiguate namesakes or mine opinions expressed in the pages? Undoubtedly, this would open a world of possibilities in terms of new functionality and enhanced user experience, fueled by richer underlying data models. In this talk, I will describe my research, spanning a number of years, on these topics. The common denominator in the several approaches that I will present is the fact that they rely heavily on machine learning techniques, to train systems to classify and extract target information. The talk will also overview real-world applications of the systems originating from the research -- for instance, in one case we trained one of our systems to extract information from a collection of jet engine reports provided by Rolls-Royce, resulting in a positive impact on the way their engineers search for information in the course of their work.

15th December, 2009 Donia Scott (University of Sussex) - Summarisation and Visualization of Electronic Health Records

10th December, 2009 Roberto Navigli (Universita di Roma "La Sapienza") - Comparing Graph Connectivity Measures for Word Sense Disambiguation

Word sense disambiguation (WSD), the task of identifying the intended meanings (i.e. senses) for words in context, has been a long-standing research objective for Natural Language Processing. While supervised systems typically achieve better performance, they require large amounts of sense-tagged training instances. An alternative solution is that of adopting knowledge-based approaches, that exploit existing knowledge resources to perform WSD and do not need annotated training sets. In this talk, we present an objective comparison of graph-based algorithms for alleviating the data requirements for large-scale WSD. Under this framework, finding the right sense for a given word amounts to identifying the most "important" node among the set of graph nodes representing its senses. We present a variety of measures that exploit the connectivity of graph structures, thereby identifying the most relevant word senses. We assess their performance on standard datasets, and show that the best measures perform comparably to state-of-the-art systems. We also provide interesting insights into the relevance of the underlying knowledge resource on WSD performance.
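
The simplest connectivity measure, degree, can be sketched on a toy sense graph. The nodes and edges below are invented rather than drawn from a real knowledge resource such as WordNet, and degree is only one of the measures the talk compares.

```python
# Toy sense graph: nodes are candidate senses of the context words, and
# edges come from relations in the knowledge resource (invented here).
GRAPH = {
    "bank#finance": {"money#1", "loan#1", "interest#1"},
    "bank#river": {"water#1"},
    "money#1": {"bank#finance", "loan#1"},
    "loan#1": {"bank#finance", "money#1", "interest#1"},
    "interest#1": {"bank#finance", "loan#1"},
    "water#1": {"bank#river"},
}

def degree(node):
    # Degree centrality: how many graph neighbours a sense node has.
    return len(GRAPH.get(node, ()))

def disambiguate(candidate_senses):
    """Pick the sense whose node is best connected in the graph built
    from the context - the 'most important' node under this measure."""
    return max(candidate_senses, key=degree)

print(disambiguate(["bank#finance", "bank#river"]))  # bank#finance
```

In a financial context the financial sense accumulates far more edges, so even this crude measure selects it; the talk's point is that richer connectivity measures push this idea toward state-of-the-art performance.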

26th November, 2009 Serge Sharoff (University of Leeds) - Classifying the Web into Domains and Genres

The jungle metaphor is quite common in corpus studies. The subtitle of David Lee's seminal paper on genre classification is 'navigating a path through the BNC jungle'. According to Adam Kilgarriff, the BNC is a jungle only when compared to smaller Brown-type corpora, while it looks more like an English garden when compared to the Web. At the moment we know little about the domains and genres of webpages. In the seminar I'm going to talk about approaches to understand the composition of the Web as a corpus.

19th November, 2009 Luke Zettlemoyer (University of Edinburgh) - Learning to Follow Orders: Reinforcement Learning for Mapping Instructions to Actions

In this talk, I will address the problem of relating linguistic analysis and control --- specifically, mapping natural language instructions to executable actions. I will present a reinforcement learning algorithm for inducing these mappings by interacting with virtual computer environments and observing the outcome of the executed actions. This technique has enabled automation of tasks that until now have required human participation --- for example, automatically configuring software by consulting how-to guides. Our results demonstrate that this method can rival supervised learning techniques while requiring few or no annotated training examples.

29th October, 2009 Allan Ramsay (University of Manchester) - Using English to Express Commonsense Rules

The talk will discuss some issues arising from an attempt to provide natural language access to a body of simple information about diet and its effect on various common medical conditions. Expressing this knowledge in natural language has a number of advantages. It also raises a number of difficult issues. I will outline the reasons why it seemed like a good idea and the reasons why it is difficult, and sketch our solution to these problems.

15th October, 2009 Diana Maynard (University of Sheffield) - Using Lexico-Syntactic Patterns for Ontology Enrichment: the case of ODd SOFAS

This talk describes the use of information extraction techniques involving lexico-syntactic patterns to generate ontological information from unstructured text and augment an existing ontology with new entities. We refine the patterns using a term extraction tool and some semantic restrictions derived from WordNet and VerbNet, in order to prevent the overgeneration that typically occurs with general patterns. We present two applications developed in GATE and available as plugins for the NeOn Toolkit: one for general use on all kinds of text, and one for specific use in the fisheries domain. Both make use of a new plugin for GATE which generates ontologies on the fly. Furthermore, we integrate support for ontology lifecycle development via a change log mechanism that enables logging of ontology versions and application of changes from one version to another.

1st October, 2009 Trevor Cohn (University of Sheffield) - Bayesian Non-Parametric Models for Parsing and Translation Slides

Many natural language processing tasks require inference over partially observed input data. Traditionally these models are trained using the expectation maximisation (EM) algorithm. However, for many models EM finds poor or degenerate solutions. Bayesian methods provide an elegant and theoretically principled way to address these problems, by including a prior over the model and integrating over uncertain events. In this talk I'll describe how we developed non-parametric Bayesian models for two related tasks: 1) learning a tree substitution grammar (DOP) for syntactic parsing and 2) learning a grammar-based machine translation model. The models learn compact and simple grammars, uncovering latent linguistic structures and in doing so outperform competitive baselines.

2008 - 2009

14th May, 2009 Sivaji Bandyopadhyay (Jadavpur University, India) - Emotion Analysis in Blog texts

Emotion analysis is carried out on blog texts in Bengali, a less-resourced language. A set of six emotion types, namely happy, sad, anger, fear, disgust and surprise, was selected for this emotion detection task to allow reliable, semi-automatic annotation of the blog texts. An automatic classifier recognizes these six basic emotion types for the different words of a sentence. Different scoring strategies are then applied to identify the sentence-level emotion type from the acquired word-level emotion information. Unsupervised techniques applied to the classified test output further improve accuracy. The same method has been applied to the English SemEval 2007 Affect Sensing corpus, where it gave satisfactory performance.
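The abstract does not spell out its scoring strategies, but one simple possibility for aggregating word-level emotion tags into a sentence-level label is majority voting. The sketch below is a hypothetical illustration of that idea, not the talk's actual method:

```python
from collections import Counter

EMOTIONS = {"happy", "sad", "anger", "fear", "disgust", "surprise"}

def sentence_emotion(word_tags, default="neutral"):
    """Aggregate word-level emotion tags into one sentence-level label by
    majority vote; tags outside the six emotion types are ignored.
    (Majority voting is an assumption, not the talk's exact strategy.)"""
    counts = Counter(t for t in word_tags if t in EMOTIONS)
    return counts.most_common(1)[0][0] if counts else default

# Hypothetical word-level classifier output for one sentence:
print(sentence_emotion(["neutral", "happy", "neutral", "happy", "surprise"]))  # -> happy
```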

7th May, 2009 Leon Derczynski (University of Sheffield) - Sequencing of Events and Their Durations Based on Event Descriptions Slides

Temporal Information Extraction is the elicitation of accurate data on events in a discourse. This covers both the tense and aspect of actions, whether explicitly given by the text or implicit from world knowledge. Events can occur at any point along a timeline, and are often only loosely specified in terms of upper and/or lower bounds relative to other events. Being able to identify and annotate times in discourse enables us to build a richer representation of the knowledge present in text. Given a document - for example, a news article - only a subset of facts within that document ever hold true at any one time. For example, we cannot concurrently assert "The silver and black Scott bike was chained to railings" and "An hour later it was gone". Extracting and temporally linking information is the only way to know which sets of facts hold true at the same time. A brief summary of literature and models surrounding tense and temporal location will be presented, followed by a review of recent work in the field. We will look at the normalisation of temporal data (anchoring vague expressions to a fixed interval on an absolute time scale), how events in text relate to each other and ways of reasoning about them, and different representations of temporal data - logical, textual and visual.

30th April, 2009 Marta Sabou (Open University) - Exploiting Semantic Web Ontologies: An Experimental Report Slides

As a side effect of the Semantic Web research activities, a large collection of ontologies is now available online constituting one of the largest and most heterogeneous knowledge sources in the history of AI. In this talk we report on the characteristics of this novel source and on its successful use for relation discovery. Our experiments show that, in the context of an ontology matching task, relations between the concepts of two ontologies can be discovered with a precision of 70% when using online ontologies. We conclude by exploring the potential of this novel knowledge resource for language technology applications.

16th April, 2009 Kumutha Swampillai (University of Sheffield) - Inter-Sentential and Intra-Sentential Relations in IE Corpora

Some information extraction systems are limited to extracting binary relations from single sentences. This constraint means that relations occurring across sentence boundaries cannot possibly be extracted by such systems. We examine the distribution of inter-sentential and intra-sentential relations in the MUC6 and ACE03 corpora. It was found that inter-sentential relations constitute 31.4% and 9.4% of the total number of relations in MUC6 and ACE03 respectively. These results imply recall upper bounds of 68.6% and 90.6% for single-sentence approaches to relation extraction. As such, any comprehensive approach to relation extraction will have to treat linguistic units larger than a sentence.
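The recall ceilings follow by simple subtraction: a single-sentence extractor can at best recover the intra-sentential remainder of the relations. As a trivial sketch:

```python
def recall_upper_bound(inter_sentential_pct):
    """A single-sentence extractor can only ever find intra-sentential
    relations, so its recall ceiling is 100% minus the inter-sentential share."""
    return 100.0 - inter_sentential_pct

print(round(recall_upper_bound(31.4), 1))  # MUC6  -> 68.6
print(round(recall_upper_bound(9.4), 1))   # ACE03 -> 90.6
```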

2nd April, 2009 Danica Damljanovic (University of Sheffield) - Natural Language Interfaces to Conceptual Models: Usability and Performance Slides

Accessing structured data in the form of ontologies currently requires the use of formal query languages (e.g., SeRQL or SPARQL) which pose significant difficulties for non-expert users. One way to lower the learning overhead and make ontology queries more straightforward is through a Natural Language Interface (NLI). While there are existing NLIs to structured data with reasonable performance, they tend to require expensive customisation to each new domain or ontology. Additionally, they often require strict adherence to a pre-defined syntax which, in turn, means that users still have to undergo training. Many methods are under development to reduce this training and increase the usability of NLIs. We have developed the Question-based Interface to Ontologies (QuestIO), which translates natural language text-based queries into SeRQL/SPARQL queries; these are then executed against the given ontology/knowledge base and the results are shown to the user. Customisation of this system is performed automatically from the ontology vocabulary. QuestIO is quite flexible in terms of the complexity and syntax of the supported queries, as both keyword-based searches and full-blown questions are supported. However, in a user-centric evaluation of this system we noticed that performance degraded because users did not receive sufficient help from their interaction with the system. In this talk, we propose a combination of three methods to assist the user while interacting with the system - feedback, a personalised vocabulary, and query refinement - and show how these can be used in combination to improve the usability of NLIs to conceptual models.

19th March, 2009 Peter Wallis (University of Sheffield) - Social Engagement with Robots and Agents (SERA) Slides

Getting people to engage with robotic and virtual artifacts is easy, but keeping them engaged over time is hard: robots and agents lack some fundamental capabilities which can be summarized as sociability. The research community has realized the problem, but approaches, so far, have been dispersed and disjoint. If robots and agents are to become companions in people's lives, they will have to blend into these lives seamlessly. SERA is innovative in that it addresses sociability holistically, by advancing knowledge about what sociability in robots and agents entails, by developing methodology to analyze and evaluate it, and by making available research resources and platforms. SERA will, to this purpose, undertake real-life extended field studies of users' engagement with robotic devices. Sociability also has to be built into robot and agent architectures from scratch, and the goal here is to implement an architecture that caters for both the background (cultural, normative, etc.) and the situational, individual (theory of mind, adaptivity, responsiveness) practices and needs of users, with the guiding principle of pervasive affectivity. Assistive robots and agents that are to become true companions have to be versatile in functionality and identity (style, personality) depending on the service they are required to deliver, such as (reactive) social mediators, (in turn reactive and proactive) information assistants, or (proactive) coaches or monitors, e.g. with health-related tasks. SERA will develop pilots of such intertwined interactive service applications for a robotic device.

12th March, 2009 Chris Huyck (Middlesex University) - A Psycholinguistic Model of Natural Language Parsing Implemented in Simulated Neurons Slides

One of the central activities in natural language processing is parsing. There is a wide range of engineering solutions to parsing, but none performs at human levels. The understanding of how humans process language is far from complete, but there is little doubt that humans use their neurons for all mental activities, including parsing. There are several psychological models of parsing, but this talk will describe the first neuro-psychological model of parsing. That is, the parser is implemented entirely in simulated neurons. It makes use of Hebb's Cell Assembly hypothesis to form the basis of memories, including words, clauses and sentences. Neural parsers require variable binding, and this parser binds via short-term potentiation. The parser produces correct semantic output. As neural cycles have an associated time, time can be measured, and the parser parses in times similar to humans. Prepositional phrase attachment ambiguities are resolved based on the semantics of the sentence. Finally, the parser is embedded in a functioning agent.

5th March, 2009 Monica Schraefel (University of Southampton) - The Path to Joyful Interaction or Why doesn't your computer make you happy?

The common computing interaction paradigm is task-oriented and task-silo'd. We go to a specific application that supports a specific task and do that specific thing. There is some boundary crossing within applications - calendars and address books share data; email is forced into being as flexible as a paper notebook; spreadsheets can be linked into word processing documents. Yet perhaps not too many would say they feel particularly empowered by their computers, or that their quality of life is enhanced by interacting with these machines. There are at least several ways in which we might consider why this lack of joy and delight is the more usual experience of computers in our world. One may be this sense of having to do too many things FOR the computer in order for it to do things for us. Another may be that even when it has the information, it does not DO what we want with it. It is functionally obtuse. Another may be that the cost of trying to explain what to do is simply too high for the benefit that might accrue. In the past year or so, a few of us have been looking at some of these problems that appear to be quite lightweight issues, and yet have been substantial roadblocks towards delightful computing. We have been prototyping some approaches to explore new interactions and new types of services that might be both practically effective in freeing us from serving the computer to get on with our own missions, and may, in so doing, serve to enhance our quality of life along the way. In this talk, I'll go over some of these projects, the motivation behind them and how far we've gotten on the path to joyful computing and the perfect digital assistant.

26th February, 2009 Mark Stevenson (University of Sheffield) - Disambiguation of Biomedical Text Slides

Like text in other domains, biomedical documents contain a range of terms with more than one possible meaning. These ambiguities form a significant obstacle to the automatic processing of these texts. Previous approaches to resolving this problem have made use of a variety of knowledge sources including the context in which the ambiguous term is used and domain-specific resources (such as UMLS). We compare a range of knowledge sources which have been previously used and introduce a novel one: MeSH terms. The best performance is obtained using linguistic features in combination with MeSH terms. Performance exceeds previously reported results on a standard test set. Our approach is supervised and therefore relies on annotated training examples. A novel approach to automatically acquiring additional training data, based on the relevance feedback technique from Information Retrieval, is presented. Applying this method to generate additional training examples is shown to lead to a further increase in performance.

19th February, 2009 Mark A. Greenwood (University of Sheffield) - IR4QA: An Unhappy Marriage Slides

Over a decade of recent question answering (QA) research has relied on using off-the-shelf information retrieval (IR) engines in order to find relevant documents from which exact answers can be extracted. In this talk I will explain why most QA systems follow this approach and summarise the recent research into what has become known as IR4QA. It is becoming increasingly clear, however, that the use of IR within QA systems is nothing more than a marriage of convenience: in general, QA researchers don't want to develop IR engines and IR researchers are not interested in the QA task. I believe that this marriage is doomed and will never lead to the production of high performance QA systems. The second half of the talk will highlight the main problems inherent in modern QA systems which use IR engines and suggest some possible avenues that QA research may take in the future.

12th February, 2009 Ehud Reiter (University of Aberdeen) - BabyTalk: Generating English Summaries of Clinical Data Slides

I will give an overview of the BabyTalk project, whose goal is to generate English summaries of complex clinical data from a neonatal intensive care unit, for doctors, nurses, parents, and other family members. BabyTalk is based on the hypothesis that a textual summary of the most important information in a data set can in some cases be more useful than a visualisation which presents all of the data, or an expert system which explicitly gives advice based on the data. I will primarily focus on NLP challenges in BabyTalk, such as generating good narratives and effectively communicating temporal information. I will also present the results of our first evaluation, which were mixed but overall quite encouraging.

5th February, 2009 Julien Bourdon (Kyoto University) - Language Grid: An Infrastructure for Intercultural Collaboration Slides

The Language Grid is an on-line multilingual service platform which enables easy registration and sharing of language services such as on-line dictionaries, bilingual corpora, and machine translation systems. Unlike existing machine translation systems, the Language Grid allows users to register and combine user-created dictionaries and bilingual corpora with existing machine translation systems to realize user-oriented translation programs with greater accuracy. The main goals of this project are to combine the existing standard language services provided by linguistic professionals and to assist users in creating new language services for their own purposes by permitting them to add their own language resources to the ones made by professionals. Currently, services such as translators, dictionaries, parallel texts, morphological analysers, and concept dictionaries, available in 10 languages, are deployed on the Language Grid. The Language Grid is used for applications such as multilingual collaboration in NPOs and intercultural coexistence in Japanese schools and hospitals.

4th December, 2008 Diana McCarthy (University of Sussex) - Evaluating Lexical Inventories and Disambiguation Systems with Lexical Substitution Slides

There has been a surge of interest within Computational Linguistics over the last decade into methods for word sense disambiguation (WSD). A major catalyst has been the series of SENSEVAL evaluation exercises which have provided standard datasets for the field. Whilst researchers believe that WSD will ultimately prove useful for applications which need some degree of semantic interpretation, the jury is still out on this point. One significant problem is that there is no clear choice of inventory for any given task, other than the use of a parallel corpus for a specific language pair for a machine translation application. Many of the evaluation datasets produced, certainly in English, have used WordNet. Whilst WordNet is a useful resource, it would be beneficial if systems using other inventories could enter the WSD arena without the need for mappings between the inventories, which may mask results. This is particularly important since there is no consensus that WordNet sense distinctions are the right ones to make for any given application. As well as the work in disambiguation, there is growing interest in the automatic acquisition of inventories of word meaning. It would be useful to investigate the merits of predefined inventories themselves, aside from their use for disambiguation, and compare these with inventories which have been acquired automatically. In this talk I will discuss these issues and some results in the context of the English Lexical Substitution Task, organised by myself and Roberto Navigli (University of Rome, "La Sapienza") last year under the auspices of SEMEVAL.

27th November, 2008 David Guthrie (University of Sheffield) - Unsupervised Detection of Anomalous Text Slides PhD Thesis

Situations abound that rely on the ability of computers to detect differences from what is normal or expected. Credit card companies identify possible fraud by detecting spending patterns that differ from what is 'normal' for a given cardholder and network analysts detect possible attacks by spotting network traffic that is out of the ordinary. The focus for this talk is the development of unsupervised technologies to similarly detect anomalies in text. We use the term "anomalous" to refer to text that is irregular, or unusual, with respect to the writing style in the majority of a text. In this talk we show that identifying such abnormalities in text can be viewed as a type of outlier detection because these anomalies will deviate significantly from their surrounding context. We consider segments of text which are anomalous with respect to topic (i.e. about a different subject), author (written by a different person), or genre (written for a different audience or from a different source) and experiment with whether it is possible to identify these anomalous segments automatically. Several different innovative approaches to this problem are introduced and we present results over large document collections, created to contain randomly inserted anomalous segments.
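One simple instantiation of this outlier-detection view (an illustrative sketch under generic assumptions, not the methods presented in the talk) is to represent each segment as a word-frequency vector and rank segments by their similarity to the rest of the document; the least similar segment is the strongest anomaly candidate:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_outliers(segments):
    """Score each segment against the pooled counts of all other segments;
    return (similarity, index) pairs, most anomalous first."""
    vecs = [Counter(s.lower().split()) for s in segments]
    scores = []
    for i, v in enumerate(vecs):
        rest = Counter()
        for j, u in enumerate(vecs):
            if j != i:
                rest.update(u)
        scores.append((cosine(v, rest), i))
    return sorted(scores)  # lowest similarity (most anomalous) first

segments = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stock prices fell sharply today",  # anomalous topic
]
print(rank_outliers(segments)[0][1])  # -> 2
```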

18th November, 2008 Seemab Latif (University of Manchester) - Novel Automatic Technique for Linguistic Quality Assessment of Students' Essays Using Automatic Summarizers Slides

In this seminar, I will be talking about experiments that have addressed the calculation of inter-annotator inconsistency in selecting content in both manual and automatic summarization of sample TOEFL essays. A new finding is that the linguistic quality of the source essay has a very strong positive correlation with the degree of disagreement among human assessors about what should be included in a summary. This leads to a fully automated essay evaluation technique based on the degree of disagreement among automatic summarizers. ROUGE evaluation is used to measure the degree of inconsistency among the participants (human summarizers and automatic summarizers). This automated essay evaluation technique is potentially an important contribution with wider significance.

6 November, 2008 Niraj Aswani (University of Sheffield) - Tools for Alignment Tasks Slides

For some tasks, such as text alignment and cross-document co-reference resolution, one needs to refer to more than one document at the same time. Hence, a need arises for Processing Resources (PRs) which can accept more than one document as parameters. For example, given two documents, a source and a target, a Sentence Alignment PR would need to refer to both of them to identify which sentence of the source document aligns with which sentence of the target document. Similarly, for cross-document co-reference resolution, the respective PR would need to access both documents simultaneously. The standard behaviour of GATE PRs conflicts with these requirements: GATE PRs process one document at a time, and even a corpus pipeline, which accepts a corpus as input, considers only one document at a time. That said, it is not impossible to make PRs that accept more than one document, but this would require a lot of re-engineering. Recently, we have introduced a few new resources in GATE (e.g. CompoundDocument, CompositeDocument, AlignmentEditor, etc.) to address these issues. In this short presentation, I will describe these components and show how to use them.

28 October, 2008 Rob Gaizauskas (University of Sheffield) - Generating Image Captions using Topic Focused Multi-document Summarization Slides

In the near future digital cameras will come equipped with GPS and a compass as standard, and will automatically add global position and direction information to the metadata of every picture taken. Can we use this information, together with information from geographical information systems and the Web more generally, to caption images automatically? This challenge is being pursued in the TRIPOD project, and in this talk I will address one of the sub-challenges this topic raises: given a set of toponyms automatically generated from geo-data associated with an image, can we use these toponyms to retrieve documents from the Web and to generate an appropriate caption for the image?

We begin by assuming the toponyms name the principal objects or scene contents in the image. Using web resources (e.g. Wikipedia) we attempt to determine the types of these things -- is this a picture of a church? a mountain? a city? We have constructed a taxonomy of such image content types using on-line image collections, and for each such type we have constructed several collections of texts describing that type. For example, we have a collection of captions describing churches and a collection of Wiki pages describing churches. The intuition here is that these collections are examples of, e.g., the sorts of things people say in captions of churches. These collections can then be used to derive models of object or scene types which can be used to bias or focus multi-document summaries of new images of things of the same type.

In the talk I report results of work we have carried out to explore the hypothesis underlying this approach, namely that brief multi-document summaries generated as image captions by using models of object/scene types to bias or focus content selection will be superior to generic multi-document summaries generated for this purpose. I describe how we have constructed an image content taxonomy, how we have derived text collections for object/scene types, how we have derived object/scene type models from these collections and how these have been used in multi-document summarization. I also discuss the issue of how to evaluate the resulting captions and present preliminary results from one sort of evaluation.

21 October, 2008
Leon Derczynski (University of Sheffield) - A Data Driven Approach to Query Expansion in Question Answering Slides

Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer-bearing documents are retrieved at low ranks for almost 40% of questions. In this paper, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method. Data-driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.
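For readers unfamiliar with blind (pseudo-) relevance feedback, a minimal sketch of the generic textbook technique follows; this is not the specific configuration evaluated in the paper, and the query, documents, and stopword list are illustrative:

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was"}

def blind_rf_expand(query_terms, top_docs, n_terms=3):
    """Blind relevance feedback: assume the top-ranked documents are
    relevant and append their most frequent non-stopword, non-query
    terms to the query."""
    counts = Counter()
    for doc in top_docs:
        counts.update(
            w for w in doc.lower().split()
            if w not in STOPWORDS and w not in query_terms
        )
    return list(query_terms) + [w for w, _ in counts.most_common(n_terms)]

# Hypothetical query and top-ranked snippets:
print(blind_rf_expand(
    ["eiffel", "tower"],
    ["the eiffel tower is in paris", "paris is the capital of france"],
))
# -> ['eiffel', 'tower', 'paris', 'capital', 'france']
```

Because the added terms come from whatever the IR engine ranked highly, a poor initial retrieval drags in off-topic terms; this is consistent with the low value the talk reports for blind RF in IR for QA.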

Mark A. Greenwood (University of Sheffield) - Evaluation of Automatically Reformulated Questions in Question Series Slides

Having gold standards allows us to evaluate new methods and approaches against a common benchmark. In this paper we describe a set of gold standard question reformulations and associated reformulation guidelines that we have created to support research into automatic interpretation of questions in TREC question series, where questions may refer anaphorically to the target of the series or to answers to previous questions. We also assess various string comparison metrics for their utility as evaluation measures of the proximity of an automated system's reformulations to the gold standard. Finally we show how we have used this approach to assess the question processing capability of our own QA system and to pinpoint areas for improvement.

14 October, 2008 - Jordi Poveda (UPC Catalunya) - A Combination of Machine Learning Methods for the Recognition of Temporal Expressions Slides

Time expression recognition, and representation of the temporal information such expressions convey in a suitable normalized form, is a central part of Information Extraction (IE), for it paves the way for the extraction of events and temporal relations. The most common approach to time expression recognition in the past has been the use of handmade extraction rules (grammars), which also served as the basis for normalization. Our aim is to explore the possibilities afforded by applying machine learning techniques to the recognition of time expressions, in order to see where they stand in relation to grammar-based approaches. We focus on recognizing the occurrences of time expressions in text (not normalization) and transform the problem into one of chunking, where the aim is to correctly assign IOB tags to tokens. We will explain the knowledge representation used and compare the results obtained in our experiments with two different supervised methods, one statistical (support vector machines) and one of rule induction (FOIL), where the superiority of SVMs is revealed. Next, we will present a semi-supervised approach, based on bootstrapping, to the extraction of time expression mentions in large unlabelled corpora. The only supervision is in the form of seed examples, hence it becomes necessary to resort to heuristics to rank and filter out spurious patterns and candidate time expressions. We will summarize our preliminary results with this bootstrapping architecture, which is currently in a testing and improvement stage. The ultimate benefit of developing an end-to-end machine-learning-based framework for information extraction is that it can be carried over to new domains and tasks with little customization.
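The IOB chunking formulation can be made concrete with a small sketch: each token is tagged B (begin), I (inside), or O (outside) with respect to a time-expression chunk. The label name and example tokenisation below are illustrative, not taken from the talk:

```python
def spans_to_iob(tokens, spans, label="TIMEX"):
    """Convert token-index spans (start inclusive, end exclusive) into
    IOB tags, the chunking representation used for recognition."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["He", "arrived", "last", "Friday", "morning", "."]
print(spans_to_iob(tokens, [(2, 5)]))
# -> ['O', 'O', 'B-TIMEX', 'I-TIMEX', 'I-TIMEX', 'O']
```

Once the data is in this form, any sequence or per-token classifier (such as the SVM and FOIL learners compared in the talk) can be trained to predict the tags.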