Natural Language Processing


The Natural Language Processing Research Group, established in 1993, is one of the largest and most successful language processing groups in the UK and has a strong global reputation.

Natural Language Processing (NLP) is an interdisciplinary field that uses computational methods:

  • To investigate the properties of written human language and to model the cognitive mechanisms underlying the understanding and production of written language (scientific focus)
  • To develop novel practical applications involving the intelligent processing of written human language by computer (engineering focus)

Research themes

Contact us
Natural Language Processing Research Group
Department of Computer Science
University of Sheffield
Regent Court
211 Portobello
Sheffield, S1 4DP
+44 (0)114 222 1901

Follow us on Twitter @SheffieldNLP



The group's research interests fall into the broad areas of: 

Information Access: Building applications to improve access to information in massive text collections, such as the web, newswires and the scientific literature. Subtopics include: information extraction, text mining and semantic annotation, question answering, summarization.

Language Resources and Architectures for NLP: Providing resources - both data and processing resources - for research and development in NLP. Includes platforms for developing and deploying real world language processing applications, most notably GATE, the General Architecture for Text Engineering.

Machine Translation: Building applications to translate automatically between human languages, allowing access to the vast amount of information written in foreign languages and easier communication between speakers of different languages.

Human-Computer Dialogue Systems: Building systems to allow spoken language interaction with computers or embodied conversational agents, with applications in areas such as keyboard-free access to information, games and entertainment, and artificial companions.

Detection of Reuse and Anomaly: Investigating techniques for determining when texts or portions of texts have been reused or where portions of text do not fit with surrounding text. These techniques have applications in areas such as plagiarism and authorship detection and in discovery of hidden content.

Foundational Topics: Developing applications with human-like capabilities for processing language requires progress in foundational topics in language processing. Areas of interest include: word sense disambiguation, semantics of time and events.

The NLP group's research has received support from: the EU's Framework Programmes (Frameworks 4, 5, 6 and 7) as well as Horizon 2020 and the European Research Council, the UK Research Councils (EPSRC, BBSRC, MRC and AHRC) and various governmental and industrial sponsors, including GlaxoSmithKline and IBM.

The NLP group has close associations with the Speech and Hearing and Information Retrieval research groups which carry out research into other areas of computational processing of human language.

We also host the ICCL and CLUK websites.



These are the current members of the NLP group.

Administrative Support

Lucy Moffatt

Joanne Suter

Alice Tucker


Prof. Mikel Forcada

Jonathan Foster

Pawandeep Kaur

Luis Mesquita

Former group members





Papers accepted to EACL 2017:

  • Continuous N-gram Representations for Authorship Attribution, Y. Sari, A. Vlachos, M. Stevenson, Proceedings of EACL: Volume 2, Short Papers
  • An Extensible Framework for Verification of Numerical Claims, J. Thorne, A. Vlachos, Proceedings of the Software Demonstrations
  • Book: Natural Language Processing for the Semantic Web, Diana Maynard, Kalina Bontcheva, Isabelle Augenstein. Morgan and Claypool, December 2016. ISBN:97816270590
  • Journal paper: A Framework for Real-time Semantic Social Media Analysis. Diana Maynard, Ian Roberts, Mark A. Greenwood, Dominic Rout and Kalina Bontcheva. Web Semantics: Science, Services and Agents on the World Wide Web, 2017
  • Conference paper: Towards an Infrastructure for Understanding and Interlinking Knowledge Co-Creation in European research, Diana Maynard, Adam Funk and Benedetto Lepori. ESWC 2017 Workshop on Scientometrics, Portoroz, Slovenia, May 2017
  • Diana Maynard taught 2 practical tutorials at the AI Seminar on Social Media Content Analysis, UPC Barcelona, May 2017
  • Diana Maynard gave an invited tutorial at the EU CLARIN-PLUS workshop on "Creation and Use of Social Media Resources", Lithuania, 2017
  • Diana Maynard gave an invited talk at the 2017 Joint EC-OECD workshop on Semantic Technologies and Semantic Web: Structuring Data for STI Policy Analysis, 19 June, Brussels
  • Diana Maynard gave an invited talk at the 2017 EPSRC workshop The Future of Patent Analytics, 3 March, Cambridge, UK
  • The KNOWMAK project has started. A 3 year EC H2020 project from 1 Jan'17 - 31 Dec’20. The University of Sheffield PI is Diana Maynard.
  • Diana Maynard was Programme Chair of the ESWC conference in Portoroz, Slovenia in May.
  • Diana Maynard has won an ESRC-funded award from Understanding Society to access and analyse EU Referendum UK household survey data, for the project "Brexit narratives of place and scale: a media environment analysis of the EU Referendum debate". Co-PIs: Jackie Harrison (Journalism), J. Miguel Kanai (Geography)

Papers accepted for COLING 2016:

  • Representation and Learning of Temporal Relations. L. Derczynski (2016). COLING
  • Broad Twitter Corpus: A Diverse Named Entity Recognition Resource. L. Derczynski, K. Bontcheva, I. Roberts (2016). COLING
  • Stance classification in Rumours as a Sequential Task Exploiting the Tree Structure of Social Media Conversations. A. Zubiaga, E. Kochkina, M. Liakata, R. Procter, M. Lukasik. (2016). COLING
  • Anita: An Intelligent Text Adaptation Tool. G. Paetzold, L. Specia. (2016). COLING
  • Understanding the Lexical Simplification Needs of non-Native Speakers of English. G. Paetzold, L. Specia. (2016). COLING
  • Collecting and Exploring Everyday Language for Predicting Psycholinguistic Properties of Words. G. Paetzold, L. Specia. (2016). COLING
  • Imitation learning for language generation from unaligned data. G. Lampouras, A. Vlachos. (2016). COLING
  • Carolina Scarton, Gustavo Paetzold and Lucia Specia will give a tutorial at COLING 2016, titled "Quality estimation for language output applications"
  • We are pleased to announce that Gustavo Paetzold has passed his PhD viva, having submitted only 2 years after joining as a PhD student.
  • Leon Derczynski will give a course at ESSLLI 2017 with Matteo Magnani, titled "Networks and User-generated Content"
  • Book in press in Springer Studies in Computational Intelligence: Automatically ordering events and times in text - L Derczynski
  • Diana Maynard has had an article on automatic sarcasm detection published in Quartz Magazine
  • Diana Maynard will give tutorials on NLP and Social Media Analysis at the 1st International Deep Learning, Big Data and Big Compute Camp, Rabat, Morocco, 24-28 October 2016.
  • Paper published in European Psychiatry: Novel psychoactive substances: an investigation of temporal trends in social media and electronic health records - A Kolliakou, M Ball, L Derczynski, D Chandran, G Gkotsis, P Deluca, R Jackson, H Shetty, R Stewart
  • Mark Stevenson and Adam Poulson are collaborating with ScHaRR and Human on a project to visualise emotion in social media at the Festival of the Mind. The project was covered in a Guardian article.
  • Paper: An IR-based Approach Utilising Query Expansion for Plagiarism Detection in MEDLINE. R. Nawab, M. Stevenson and P. Clough (2016). IEEE/ACM Transactions on Computational Biology and Bioinformatics.
  • Paper: The Effect of Word Sense Disambiguation Accuracy on Literature Based Discovery. J. Preiss and M. Stevenson (2016). BMC Medical Informatics and Decision Making.
  • Paper: A Corpus of Potentially Contradictory Research Claims from Cardiovascular Research Abstracts. A. Alamri and M. Stevenson (2016). Journal of Biomedical Semantics, 7 (36).

Papers accepted for EMNLP 2016:

  • Stance Detection with Bidirectional Conditional Encoding, Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos and Kalina Bontcheva
  • Leon Derczynski has won an NVIDIA hardware grant for summary generation from collections of text.
  • Prof. Lucia Specia has been awarded an EC H2020 funded ERC Starting Grant. The project on Multimodal Context Modelling for Machine Translation (MultiMT) will start on 1 July 2016 for 5 years.

Papers accepted for ACL 2016:

  • Hawkes Processes for Continuous Time Sequence Classification: an Application to Rumour Stance Classification in Twitter. Michal Lukasik, P. K. Srijith, Duy Vu, Kalina Bontcheva, Arkaitz Zubiaga, Trevor Cohn.
  • Metrics for Evaluation of Word-level Machine Translation Quality Estimation. Varvara Logacheva, Michal Lukasik and Lucia Specia.

Papers accepted for TSD2016: 

  • Automatic Restoration of Diacritics for Igbo Language. Ignatius Ezeani, Mark Hepple and Ikechukwu Onyenwe.
  • Predicting Morphologically-Complex Unknown Words in Igbo. Ikechukwu Onyenwe and Mark Hepple
  • Paper nominated for Best Paper Award at WebSci 2016: Miriam Fernandez, Harith Alani, Lara Piccolo, Christoph Meili, Diana Maynard and Meia Wippoo. Talking Climate Change via Social Media: Communication, Engagement and Behaviour, May 22-25 2016, Hannover, Germany.
  • Diana Maynard taught a 3-hour practical tutorial at the AI Seminar on Social Media Content Analysis, UPC Barcelona, 9-13 May 2016.
  • Leon Derczynski is co-organising a workshop on Noisy User-generated Text (WNUT) at COLING in Osaka, Japan, 10th December 2016.
  • Diana Maynard will teach two 6-hour courses, "Introduction to NLP" and "Practical social media and sentiment analysis" at the University of Essex Big Data and Analytics Summer School in September 2016.
  • Andreas Vlachos will be speaking at the Lisbon Machine Learning Summer School about imitation learning for structured prediction.
  • Andreas Vlachos will be speaking at the Knowledge Representation Workshop at the University of Liverpool on 28th June 2016.
  • Paper: Noise reduction and targeted exploration in imitation learning for Abstract Meaning Representation parsing. James Goodman, Andreas Vlachos and Jason Naradowsky. ACL 2016.
  • Paper: Emergent: A novel data-set for stance classification. William Ferreira and Andreas Vlachos. NAACL 2016.
  • Paper: Large-scale Multitask Learning for Machine Translation Quality Estimation. Kashif Shah and Lucia Specia. NAACL 2016.
  • Paper: Phrase Level Segmentation and Labelling of Machine Translation Errors. Frederic Blain, Varvara Logacheva, and Lucia Specia. In Proc. of Language Resources and Evaluation Conference (LREC), May 2016, Portoroz, Slovenia
  • Paper: Challenges of Evaluating Sentiment Analysis Tools on Social Media. Diana Maynard and Kalina Bontcheva. In Proc. of Language Resources and Evaluation Conference (LREC), May 2016, Portoroz, Slovenia
  • Paper: Complementarity, F-score, and NLP Evaluation. Leon Derczynski. In Proc. of Language Resources and Evaluation Conference (LREC), May 2016, Portoroz, Slovenia
  • Paper: GATE-Time: Extraction of Temporal Expressions and Events. Leon Derczynski, Jannik Strötgen, Diana Maynard, Mark A. Greenwood, Manuel Jung. In Proc. of Language Resources and Evaluation Conference (LREC), May 2016, Portoroz, Slovenia
  • Dr. Diana Maynard has been awarded a grant for a fully-funded 4-year PhD student project by the Grantham Centre for Sustainable Futures, to start in October 2016, on the topic of disaster relief reporting and climate change. The Grantham Scholar will be supervised by Diana Maynard and co-supervised by Prof. Jacqueline Harrison from the Dept of Journalism and Prof. Shaun Quegan from the Centre for Terrestrial Carbon Dynamics.
  • The next annual GATE training course will be held from 6-10 June 2016.
  • Mark Stevenson was awarded a grant from Defence Science and Technology Laboratory: "Hypothesis Generation and Visualisation from Data"
  • Paper: A Graph-based Approach to Topic Clustering for Online News. Ahmet Aker, Emina Kurtic, Balamurali Andiyakkal Rajendran, Monica Paramita, Emma Barker, Mark Hepple and Rob Gaizauskas. ECIR 2016.
  • Paper: Automated Content Analysis: A Sentiment Analysis on Malaysian Government Social Media. Siti Salwa Hasbullah and Diana Maynard. In Proc. of ACM International Conference on Ubiquitous Information Management and Communication (IMCOM), January 2016, Danang, Vietnam.
  • The COMRADES project has started. A 3 year EC H2020 project from 1 Jan'16 - 31 Dec'18. The University of Sheffield PI is Prof. Kalina Bontcheva
  • We are pleased to announce two new NLP Professors: Kalina Bontcheva and Lucia Specia have both been promoted to Personal Chair.
  • A piece was published in the Guardian technology blog on Tuesday 8.12.2015 on our work in the EU-funded SENSEI project.
  • Tutorial given by Diana Maynard at Search Solutions 2015, British Computer Society, London, November 2015: "Text analysis with GATE"
  • Mark Stevenson is co-organising a workshop on Topic Models: Post-processing and Applications at CIKM 2015 with Nikolaos Aletras (UCL), Jey Han Lau (King's College London) and Timothy Baldwin (University of Melbourne).
  • Andrés Duque from UNED in Madrid visited the group for 3 months (October - December 2015)
  • Paper: Understanding climate change tweets: an open source toolkit for social media analysis. D. Maynard and K. Bontcheva. In Proc. of EnviroInfo 2015, Copenhagen, Sep. 2015.
  • Poster: Real-time Social Media Analytics through Semantic Annotation and Linked Open Data. D. Maynard, M. A. Greenwood, I. Roberts, G. Windsor, K. Bontcheva. Proceedings of WebSci 2015, Oxford, UK
  • Paper: "Generalised Brown Clustering and Roll-Up Feature Generation". Leon Derczynski, Sean Chester. AAAI 2016.
  • We are pleased to announce that Dr. Andreas Vlachos has joined the group from 1 September 2015.
  • Paper: Evaluating Topic Representations for Exploring Document Collections. N. Aletras, T. Baldwin, J. Lau and M. Stevenson (to appear), Journal of the Association for Information Science and Technology
  • Paper: Exploring Relation Types for Literature-based Discovery. J. Preiss, M. Stevenson and R. Gaizauskas. (to appear), Journal of the American Medical Informatics Association.
  • Paper: Why are these similar? Investigating item similarity types in a large Digital Library. A. Gonzalez-Agirre, N. Aletras, G. Rigau, M. Stevenson and E. Agirre. (to appear), Journal of the Association for Information Science and Technology
  • Paper: Cognitive Styles within an Exploratory Search System for Digital Libraries. P. Goodale, P. Clough, S. Fernando, N. Ford and M. Stevenson (2014), Journal of Documentation, 70(6):970-996.
  • Paper: Improving Distant Supervision using Inference Learning. R. Roller, E. Agirre, A. Soroa and M. Stevenson (2015). In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), Beijing, China.
  • Paper: A Hybrid Distributional and Knowledge-based Model of Lexical Semantics. N. Aletras and M. Stevenson (2015). In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 20--29, Denver, Colorado
  • Paper: Investigating Continuous Space Language Models for Machine Translation Quality Estimation. Kashif Shah, Raymond W. M. Ng, Fethi Bougares and Lucia Specia. EMNLP, 2015 (To Appear)
  • Paper: SHEF-NN: Translation Quality Estimation with Neural Networks. Kashif Shah, Varvara Logacheva, Gustavo Paetzold, Frédéric Blain, Daniel Beck, Fethi Bougares and Lucia Specia. WMT, 2015 (To Appear)
  • Paper: A study on the stability and effectiveness of features in quality estimation for spoken language translation. Raymond W. M. Ng, Kashif Shah, Lucia Specia and Thomas Hain. Interspeech, 2015.
  • Paper: Quality estimation for ASR K-best list rescoring in spoken language translation. Raymond W. M. Ng, Kashif Shah, Wilker Aziz, Lucia Specia and Thomas Hain. ICASSP, 2015.
  • Article: A Bayesian non-linear method for feature selection in machine translation quality estimation. Kashif Shah, Trevor Cohn and Lucia Specia. Journal of Machine Translation, 2015.
  • The Pheme project is co-supporting Clinical TempEval again in 2016, a shared evaluation task with the NIH THYME project and Harvard Children's Hospital, which will run at SemEval.
  • Special issue on "Time and Information Retrieval" in the Information Processing & Management journal was published, with Leon Derczynski as lead guest editor.
  • Martin Leginus from Aalborg University, co-supervised by Leon Derczynski, won the Best Student Paper award at WEBIST with his work improving tag clouds using entity disambiguation in streams.
  • Sean Chester from Aarhus University will visit and give a seminar in late September.
  • Book deal signed with O'Reilly on Temporal Information Processing for Language, by Leon Derczynski working with James Pustejovsky and Marc Verhagen (both from Brandeis).
  • Our entry in the W-NUT entity recognition challenge in tweets won 3rd place for untyped entity recognition.
  • Paper: Extracting Relations Between Non-Standard Entities using Distant Supervision and Imitation Learning. Isabelle Augenstein, Andreas Vlachos, Diana Maynard. EMNLP 2015.
  • Article: Distantly Supervised Web Relation Extraction for Knowledge Base Population. Isabelle Augenstein, Diana Maynard, Fabio Ciravegna. Semantic Web Journal.
  • Tutorial with Barry Norton at ESWC Summer School 2015: "Information Extraction with Linked Data"
  • Article from the group published in the journal Information Processing and Management: Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, Kalina Bontcheva. 2015. Analysis of Named Entity Recognition and Linking for Tweets.
  • Paper presented at the SemEval workshop: Steven Bethard, Leon Derczynski, Guergana Savova, James Pustejovsky, Marc Verhagen. 2015. SemEval-2015 Task 6: Clinical TempEval.
  • Paper presented at the SemEval workshop: Fatih Uzdilli, Martin Jaggi, Dominic Egger, Pascal Julmy, Leon Derczynski, Mark Cieliebak. 2015. Swiss-Chocolate: Combining Flipout Regularization and Random Forest with Artificially Built Subsystems to Boost Text-Classification for Sentiment.
  • Paper from the group presented at the SemEval workshop: Hegler Tissot, Genevieve Gorrell, Angus Roberts, Leon Derczynski, Marcos Didonet del Fabro. 2015. UFPRSheffield: Contrasting Rule-based and Support Vector Machine Approaches to Time Expression Identification in Clinical TempEval.
  • Book chapter from the group to appear in The Handbook of Linguistic Annotation (edited by Nancy Ide and James Pustejovsky): Kalina Bontcheva, Leon Derczynski, Ian Roberts. 2015. Crowdsourcing Named Entity Recognition and Entity Linking Corpora.
  • Paper from the group presented at the ISA-11 workshop: Hegler Tissot, Angus Roberts, Leon Derczynski, Genevieve Gorrell, Marcos Didonet del Fabro. 2015. Analysis of Temporal Expressions Annotated in Clinical Notes.
  • Paper presented at the WEBIST conference: Martin Leginus, Leon Derczynski, Peter Dolog. 2015. Enhanced Information Access to Social Streams through Word Clouds with Entity Grouping.
  • Paper from the group at the W-NUT workshop: Leon Derczynski, Isabelle Augenstein, Kalina Bontcheva. 2015. USFD: Twitter NER with Drift Compensation and Linked Data.
  • Diana Maynard will give a Tutorial on "Practical Sentiment Analysis" at Essex University Summer School on Big Data and Analytics, 24-28 August 2015
  • Book chapter publication. Diana Maynard and Jonathon Hare. Entity-based Opinion Mining from Text and Multimedia. In "Advances in Social Media Analysis", Mohamed Gaber, Nirmalie Wiratunga, Ayse Goker, and Mihaela Cocea (eds.) 2015, Springer.
  • Diana Maynard gave a keynote speech at 5th International Conference on Web Intelligence, Mining and Semantics (WIMS), July 13-15, 2015, Cyprus. "What you Tweet is What You Get: challenges and opportunities for social media analysis in industry"
  • The annual GATE training course was held in Sheffield from 8-12 June, with 21 participants.
  • Diana Maynard gave a tutorial on "Text Analysis with GATE" at the Reading University Workshop on Big Social Data, 24 April 2015.
  • A paper by Roland Roller and Mark Stevenson (Self-supervised Relation Extraction using UMLS) won the best paper award at CLEF 2014.
  • Paper published in the Journal of Biomedical Informatics: B. McInnes and M. Stevenson (2014) Determining the Difficulty of Word Sense Disambiguation. Journal of Biomedical Informatics, 47:83-90.
  • Paper accepted for the journal Studies in the Digital Humanities: M. Hall, P. Goodale, P. Clough and M. Stevenson (2014) The PATHS System for Exploring Digital Cultural Heritage. Studies in the Digital Humanities.
  • Paper published in the journal Information Retrieval: M. Hall, S. Fernando, P. Clough, A. Soroa, E. Agirre and M. Stevenson (2014) Evaluating hierarchical organisation structures for exploring digital libraries. Information Retrieval 17(4):351-379.
  • Paper accepted for the journal Science of Computer Programming: M. Shahbaz, P. McMinn and M. Stevenson (2014) Automatic generation of valid and invalid test data for string validation routines using web searches and regular expressions. Science of Computer Programming.
  • Paper from the group published at ACL 2014: N. Aletras and M. Stevenson (2014) Labelling Topics using Unsupervised Graph-based Methods. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), pages 631--636, Baltimore, Maryland
  • Paper from the group published at Digital Libraries 2014: N. Aletras, T. Baldwin, J. Lau and M. Stevenson (2014) Representing Topic Labels for Exploring Digital Libraries. In Digital Libraries 2014 (ACM/IEEE Joint Conference on Digital Libraries (JCDL 2014) and International Conference on Theory and Practice of Digital Libraries (TPDL 2014)), London, UK
  • Paper from the group published at EACL 2014: N. Aletras and M. Stevenson (2014) Measuring the Similarity between Automatically Generated Topics. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 22--27, Gothenburg, Sweden

Papers from the group published at EMNLP 2014:

  • Wilker Aziz and Lucia Specia. 2014. Exact Decoding for Phrase-Based Statistical Machine Translation. EMNLP, Doha.
  • Daniel Beck, Trevor Cohn and Lucia Specia. 2014. Joint Emotion Analysis via Multi-task Gaussian Processes. EMNLP, Doha.
  • Kashif Shah, Trevor Cohn and Lucia Specia. 2014. A Bayesian non-Linear Method for Feature Selection in Machine Translation Quality Estimation. Machine Translation.
  • The University of Sheffield (Sheffield NLP Group) was ranked 3rd in the list of institutions that have published the most LREC papers.
  • The Clinical TempEval exercise will run at SemEval 2015, a collaboration between researchers at Brandeis University, U. Alabama Birmingham and Leon Derczynski for the University of Sheffield
  • Leon Derczynski will give two guest lectures at a course on Network Science and online Social Network Analysis at Uppsala Universitet in May

Members of the group have chapters in 2 new books:

  • Documenting Contemporary Society by Preserving Relevant Information from Twitter. In 'Twitter and Society', edited by K. Weller, A. Bruns, J. Burgess, M. Mahrt and C. Puschmann, 2014. T. Risse, W. Peters, P. Senellart, D. Maynard
  • Crowdsourcing Named Entity Recognition and Entity Linking Corpora. In "The Handbook of Linguistic Annotation", edited by Nancy Ide & James Pustejovsky. Kalina Bontcheva, Leon Derczynski, Ian Roberts

  • Matteo Magnani and Leon Derczynski will teach a week-long course at ESSLLI 2014 in Tübingen in August, on "Human Information Networks"

We have 2 demos accepted at EACL 2014:

  • The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy. Kalina Bontcheva, Ian Roberts and Leon Derczynski
  • DKIE: Open Source Information Extraction for Danish. Leon Derczynski, Camilla Vilhelmsen Derczynski Field, Kenneth Sejdenfaden Bøgh

The group have 6 papers accepted at LREC 2014

  • Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines. Marta Sabou, Kalina Bontcheva, Leon Derczynski, Arno Scharl
  • An efficient and user-friendly tool for machine translation quality estimation. Kashif Shah, Marco Turchi, Lucia Specia
  • Who cares about sarcastic tweets? Investigating the impact of sarcasm on sentiment analysis. Diana Maynard
  • Bilingual dictionaries for all EU languages. Ahmet Aker, Monica Paramita, Marcis Pinnis, Robert Gaizauskas
  • Bootstrapping Term Extractors for Multiple Languages. Ahmet Aker, Monica Paramita, Emma Barker, Robert Gaizauskas
  • Spatio-temporal grounding of claims made on the web, in Pheme. Leon Derczynski, Kalina Bontcheva
  • A paper has been accepted by the journal JASIST: Generating Descriptive Multi-Document Summaries of Geo-Located Entities Using Entity Type Models. Ahmet Aker, Robert Gaizauskas
  • The PHEME project has started. A 3 year EC FP7 project from 1 Jan'14 - 31 Dec'16 with 9 partners worth a total of € 4,269,938 with an EC contribution of € 2,916,000. The University of Sheffield PI is Dr Kalina Bontcheva

Three full papers from the group have been accepted at RANLP 2013, to be held in the spa town of Hisarya, Bulgaria

  • "Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data" Derczynski, L., Ritter, A., Clarke, S. & Bontcheva, K.
  • "Recognising and Interpreting Named Temporal Expressions" Brucato, M., Derczynski, L., Llorens, H., Bontcheva, K. & Jensen, C.S.
  • "TwitIE: A Fully-featured Information Extraction Pipeline for Microblog Text" Bontcheva, K., Derczynski, L., Funk, A., Greenwood, M.A., Maynard, D. & Aswani, N.
  • The group has had a discussion paper accepted at the International Conference on the Theory of Information Retrieval: "Information Retrieval for Temporal Bounding" Derczynski, L. & Gaizauskas, R.

2 short papers & 3 demonstrations have been accepted by the group at ACL 2013

Short Papers

  • "Reducing Annotation Effort for Quality Estimation via Active Learning" Beck, D., Specia, L. & Cohn, T.
  • "Temporal Signals Help Label Temporal Relations" Derczynski, L. & Gaizauskas, R.

Demonstrations


  • "QuEst - A translation quality estimation framework" Specia, L., Shah, K., Guilherme Camargo de Souza, J. & Cohn, T.
  • "PATHS: A System for Accessing Cultural Heritage Collections" Agirre, E., Aletras, N., Clough, P., Fernando, S., Goodale, P., Hall, M., Soroa, A. & Stevenson, M.
  • "AnnoMarket: An Open Cloud Platform for NLP" Bontcheva, K., Tablan, V., Roberts, I., Cunningham, H. & Dimitrov, M.
  • Two of the three nominations for the ACM SIGWEB Ted Nelson prize at Hypertext 2013, Paris, are from Sheffield's NLP group.

5 papers by the group accepted at ACL 2013

  • "Extracting bilingual terminologies from comparable corpora" Aker, A., Paramita, M. & Gaizauskas, R.
  • "An Infinite Hierarchical Bayesian Model of Phrasal Translation" Cohn, T. & Haffari, G.
  • "Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation" Cohn, T. & Specia, L.
  • "Markov Translation using Non-parametric Bayesian Inference" Feng, Y. & Cohn, T.
  • "A user-centric model of voting intention from Social Media" Lampos, V., Preotiuc-Pietro, D. & Cohn, T.

3 papers by the group accepted for NAACL 2013

  • "Representing Topics Using Images" Aletras, N. and Stevenson, M.
  • "Unsupervised Domain Tuning to Improve Word Sense Disambiguation" Preiss, J. and Stevenson, M.
  • "DALE: A Word Sense Disambiguation System for Biomedical Documents Trained using Automatically Labeled Examples (demo)" Preiss, J. and Stevenson, M.
  • The ForgetIT: Concise Preservation by combining Managed Forgetting and Contextualized Remembering project has started. A 3 year EC FP7 project from 1 Feb'13 - 31 Jan'16. The project has 11 partners worth a total of € 9,085,190 with an EC contribution of € 6,590,000. The University of Sheffield PI is Prof. Hamish Cunningham
  • The VisualSense: Tagging visual data with semantic descriptions project has started. A 3 year EPSRC project from 1 Jan'13 - 31 Dec'15. The project has 4 partners and is part of the Chist-Era EC funding programme. The University of Sheffield PI is Prof. Rob Gaizauskas

Older news



Scheduled Speakers

2016 - 2017

4 September 2017 - Thushari Atapattu (University of Adelaide) - Discourse Analysis of Educational Big Data

Discourse analysis within the educational context consists of processing natural language data generated from learning and teaching processes, including written assessments, transcripts, discussion forums, and micro blogs. Computational approaches to discourse analysis integrate NLP with psychological theories of social interaction, discourse comprehension, and communication. Discourse analysis is a complex problem, particularly within massive classrooms (e.g. Massive Open Online Courses – MOOCs). In this talk, I will discuss two of our studies on understanding the academic discourse of lecturers as well as learner-generated discourse in MOOCs. Our work aims to detect learners’ video interaction patterns and to assess the influence of the quality of lecturers’ discourse. This work analysed millions of video interactions in two MOOCs and found that transitions in discourse (i.e. lexical diversity, connectivity) affect learners’ video engagement behaviour. Further, I will talk about the association between the quality of learner-generated discourse (i.e. discussion posts) and its impact on learning success. Finally, I will explain how the understanding of discourse enables us to identify interventions for positive student trajectories.

6 July 2017 - Iacer Calixto (Dublin City University) - Doubly-Attentive Decoder for Multi-modal Neural Machine Translation

In this talk, I discuss the Multi-modal Neural Machine Translation model in which a doubly-attentive decoder naturally incorporates spatial visual features obtained using pre-trained convolutional neural networks, bridging the gap between image description and translation. Our decoder learns to attend to source-language words and parts of an image independently by means of two separate attention mechanisms as it generates words in the target language. We find that our model can efficiently exploit not just back-translated in-domain multi-modal data but also large general-domain text-only MT corpora. We also report state-of-the-art results on the Multi30k data set.
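As a rough illustration of the "two separate attention mechanisms" idea in the abstract above, the sketch below computes one attention distribution over source-word vectors and another over image-region vectors, then concatenates the two context vectors. It is not the speaker's model: the dot-product scoring, the toy vectors, and the names `attend` and `doubly_attentive_step` are all assumptions made for illustration, and both modalities are assumed to be already projected to the decoder state's dimensionality.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Dot-product attention (an assumed scoring function).

    Returns the attention weights over `keys` and the weighted
    average of the key vectors (the context vector).
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    context = [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]
    return weights, context


def doubly_attentive_step(decoder_state, source_annotations, image_regions):
    """One decoding step with two independent attention mechanisms.

    The same decoder state queries the source words and the image
    regions separately; the two context vectors are concatenated.
    """
    src_weights, src_context = attend(decoder_state, source_annotations)
    img_weights, img_context = attend(decoder_state, image_regions)
    return src_weights, img_weights, src_context + img_context


# Toy, made-up inputs: three source-word vectors and two image-region vectors.
decoder_state = [1.0, 0.0]
source_annotations = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
image_regions = [[0.0, 1.0], [2.0, 0.0]]

src_w, img_w, combined = doubly_attentive_step(
    decoder_state, source_annotations, image_regions
)
```

In a real decoder the combined context would feed the next recurrent state and the output word distribution; here it is only returned so the two attention distributions can be inspected side by side.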

3 July 2017 - Lieve Macken (Universiteit Gent) - Product and process in translation

Machine translation (MT) is more and more integrated in the translation workflow and under certain circumstances post-editing will presumably become an integral part of the translation process. In addition, raw (unedited) MT output is increasingly being used "as is", e.g. on support sites.

In this talk I will elaborate on how insights of (translation) process research and translation product research can be combined to gain a better understanding of how translators handle MT output and how human readers process raw MT output.

I will illustrate this by means of 4 projects in which my research team is currently involved:

  1. ROBOT: A comparative study of process and quality of human translation and the post-editing of machine translation
  2. SCATE: Smart computer-aided translation environment
  3. ArisToCAT: Assessing the comprehensibility of automatic translations
  4. PreDicT: Predicting Difficulty in Translation

29 June 2017 - PhD Student Talks

22 June 2017 Nikola Mrksic (University of Cambridge) - Neural Belief Tracker: Data-Driven Dialogue State Tracking using Semantically Specialised Vector Spaces

One of the core components of modern spoken dialogue systems is the belief tracker, which estimates the user's goal at every step of the dialogue. However, most current approaches have difficulty scaling to larger, more complex dialogue domains. This is due to their dependency on either: a) Spoken Language Understanding models that require large amounts of annotated training data; or b) hand-crafted lexicons for capturing some of the linguistic variation in users' language. We propose a novel Neural Belief Tracking (NBT) framework which overcomes these problems by building on recent advances in representation learning. NBT models reason over pre-trained, semantically specialised word vectors, learning to compose them into distributed representations of user utterances and dialogue context. Our evaluation on two datasets shows that this approach surpasses past limitations, matching the performance of state-of-the-art models which rely on hand-crafted semantic lexicons and outperforming them when such lexicons are not provided. Finally, we will discuss how the properties of underlying vector spaces impact model performance, and how the fact that the proposed model operates purely over word vectors allows immediate deployment of belief tracking models for other languages.

8 June 2017 Dirk Hovy (University of Copenhagen) -

1 June 2017 Ondrej Dusek (Charles University Prague) -

11 May 2017 PhD Student Talks

4 May 2017 Julie Weeds (University of Sussex) -

30 March 2017 Yannis Konstas (University of Washington) -

23 March 2017 Sebastian Riedel (University College London) -

16 March 2017 Joachim Bingel (University of Copenhagen) -

9 March 2017 Marek Rei (University of Cambridge) -

16 February 2017 Pranava Madhyastha (The University of Sheffield) -

26 January 2017 Marco Turchi (Fondazione Bruno Kessler) -

8 December 2016 PhD student talks

1 December 2016 Francesca Toni (Imperial College London) - From computational argumentation to relation-based argument mining.

  • In this talk I will overview foundations, tools and (some) applications of computational argumentation, focusing on three popular frameworks, namely Abstract Argumentation, Bipolar Argumentation and Quantitative Argumentation Debates (QuADs). These frameworks can be supported by and support the mining of attack/support relations amongst arguments. Moreover, I will discuss the following questions: is the use of quantitative measures of strength of arguments, as proposed e.g. for QuADs, a good way to assess the dialectical strength of arguments mined from text or the goodness of argument mining techniques?

24 November 2016 Mikel Forcada (Universitat d'Alacant (Spain)) - Gap-filling as a method to evaluate the usefulness of raw machine translation.

  • Most machine translation is consumed raw by ordinary people wanting to make sense of text in languages they cannot understand, for a variety of purposes. In contrast, while subjective judgements of machine translation quality (fluency, adequacy) have been commonplace for decades, surprisingly very little research has addressed the evaluation of the actual usefulness of raw machine-translated text - and almost none about the actual way in which readers make sense of it. Direct evaluation is costly as it has to look into the success of machine-translation-mediated tasks. After a quick review of existing indirect approaches, I describe a possible low-cost method to indirectly evaluate the comprehension of machine-translated text by target-language monolinguals, which may effectively be seen as a simplification -and perhaps a generalization- of reading comprehension tests based on questionnaires. Readers of machine-translated excerpts are asked to fill word gaps in a professional translation of the same excerpt. Word gaps can be ______ anywhere in the reference _______, but preferably at content _______. Results of preliminary gap-filling evaluation work are critically reviewed, and suggestions for future research are outlined.

17 November 2016 Gerasimos Lampouras (University of Sheffield) - Imitation learning for language generation from unaligned data.

  • Natural language generation (NLG) is the task of generating natural language from a meaning representation. Rule-based approaches require domain-specific and manually constructed linguistic resources, while most corpus based approaches rely on aligned training data and/or phrase templates. The latter are needed to restrict the search space for the structured prediction task defined by the unaligned datasets. In this work we propose the use of imitation learning for structured prediction which learns an incremental model that handles the large search space while avoiding explicitly enumerating it. We adapted the Locally Optimal Learning to Search framework which allows us to train against non-decomposable loss functions such as the BLEU or ROUGE scores while not assuming gold standard alignments. We evaluate our approach on three datasets using both automatic measures and human judgements and achieve results comparable to the state-of-the-art approaches developed for each of them. Furthermore, we performed an analysis of the datasets which examines common issues with NLG evaluation.

3 November 2016 Mark-Jan Nederhof (University of St Andrews) - Transition-based dependency parsing as latent-variable constituent parsing.

  • We provide a theoretical argument that a common form of projective transition-based dependency parsing is less powerful than constituent parsing using latent variables. The argument is a proof that, under reasonable assumptions, a transition-based dependency parser can be converted to a latent-variable context-free grammar producing equivalent structures.

26 October 2016 Barbara Plank (University of Groningen) - What to do about non-canonical data in Natural Language Processing.

  • Real world data differs radically from the benchmark corpora we use in natural language processing (NLP). As soon as we apply our technology to the real world, performance drops. The reason for this problem is obvious: NLP models are trained on samples from a limited set of canonical varieties that are considered standard, most prominently English newswire. However, there are many dimensions, e.g., socio-demographics, language, genre, sentence type, etc. on which texts can differ from the standard. The solution is not obvious: we cannot control for all factors, and it is not clear how to best go beyond the current practice of training on homogeneous data from a single domain and language.

    In this talk, I review the notion of canonicity, and how it shapes our community's approach to language. I argue for the use of fortuitous data. Fortuitous data is data out there that just waits to be harvested. It might be in plain sight, but is neglected (available but not used), or it is in raw form and first needs to be refined (almost ready). It is the unintended yield of a process, or side benefit. Examples include hyperlinks to improve sequence taggers, or annotator disagreement that contains actual signal informative for a variety of NLP tasks. More distant sources include the side benefit of behavior. For example, keystroke dynamics have been extensively used in psycholinguistics and writing research. But do keystroke logs contain actual signal that can be used to learn better NLP models? In this talk I will present recent (on-going) work on keystroke dynamics to improve shallow syntactic parsing. I will also present recent work on using bi-LSTMs for POS tagging, which combines the POS tagging loss function with an auxiliary loss function that accounts for rare words and achieves state-of-the-art performance across 22 languages.

20 October 2016 Leon Derczynski, University of Sheffield - Building a Diverse Named Entity Recognition Resource.

  • This talk presents a new benchmark in corpus construction methodology. One of the main obstacles, hampering method development and comparative evaluation of named entity recognition in social media, is the lack of a sizeable, diverse, high quality annotated corpus, analogous to the CoNLL'2003 news dataset. For instance, the biggest Ritter tweet corpus is only 45 000 tokens - a mere 15% the size of CoNLL'2003. Another major shortcoming is the lack of temporal, geographic, and author diversity. This paper introduces the Broad Twitter Corpus (BTC), which is not only significantly bigger, but sampled across different regions, temporal periods, and types of Twitter users. The gold-standard named entity annotations are made by a combination of NLP experts and crowd workers, which enables us to harness crowd recall while maintaining high quality. We also measure the entity drift observed in our dataset (i.e. how entity representation varies over time), and compare to newswire. The corpus is released openly, including source text and intermediate annotations.

13 October 2016 Isabelle Augenstein, University College London - Weakly Supervised Machine Reading.

  • The state of the art in natural language processing for high-level end user tasks has advanced to a point where we are seeing more and more usable commercial applications. These include question answering and dialogue systems such as Google Now or Amazon Echo. One of the things that is crucial for building such applications is to automatically understand text, which is also known as machine reading. In this talk, I will highlight methods for different components of machine reading, namely representation learning, structured prediction and automatically generating training data. I will then present ongoing research of applying these techniques to the tasks of sentiment analysis, semantic error correction and question answering.

6 October 2016 Sumithra Velupillai, King's College London and KTH Sweden - Extracting temporal information from clinical narratives: existing models, approaches - and challenges for the mental health domain

  • Accurately extracting temporal information from clinical documentation is crucial for understanding e.g. disease progression and treatment effects. In addition to time-stamped and other structured information in electronic health records, temporal information is conveyed in narrative form. Although techniques for extracting events such as symptoms ("anxiety") and treatments ("Xanax"), time expressions ("May 1st") and time relations ("anxiety before Xanax") from clinical notes have been developed in the Natural Language Processing community with promising results in the past few years, most studies have been performed on heterogeneous clinical specialties and use-cases. Mental health documentation poses several unique challenges, one of which will be addressed in my project on extracting symptom and treatment onset for psychosis patients, to better understand duration of untreated psychosis. In this talk, I will describe my previous work on automated extraction of temporal expressions from clinical text using the clearTK package, a framework for machine learning and NLP with UIMA. I will also describe other state-of-the-art approaches for temporal reasoning in clinical text, and discuss challenges involved in applying and adapting these for extracting onset information from mental health records.

29 September 2016 Savelie Cornegruta, King's College London.

  • Timeline extraction using distant supervision and joint inference.
    In timeline extraction the goal is to order the events in which a target entity is involved in a timeline. Due to the lack of explicitly annotated data, previous work is rule-based and uses temporal linking systems trained on previously annotated data. Instead, we propose a distantly supervised approach by heuristically aligning timelines with documents. The noisy training data created allows us to learn models that anchor events to temporal expressions and entities; during testing, the predictions of these models are combined to produce the timeline. Furthermore, we show how to improve performance using joint inference.
  • Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks
    Motivated by the need to automate medical information extraction from free-text radiological reports, we present a bi-directional long short-term memory (BiLSTM) neural network architecture for modelling radiological language. The model has been used to address two tasks, medical named-entity recognition and negation detection. We further investigate whether learning several types of word embeddings improves the performance of the BiLSTM on those tasks. Using a dataset of chest x-ray reports, we show that the BiLSTM model outperforms a baseline rule-based system on the NER task while for the negation detection it approaches the performance of a hybrid system that leverages the hand-crafted rules of the NegEx algorithm and the grammatical relations obtained from the Stanford Dependency Parser.

16 September 2016 Philip Schulz, University of Amsterdam - Word Alignment without NULL words

  • Most existing word alignment models assume that source words that do not have a lexical translation in the target language were generated from a hypothetical target NULL word. This NULL word is assumed to exist in any target sentence. From a modeling perspective this is unsatisfactory since our linguistic knowledge tells us that untranslatable source words (e.g. certain prepositions) are required by the source context in which they are found. Moreover, the NULL word does have a position in the target sentence and thus troubles distortion-based alignment models by influencing their distortion distributions in unexpected ways.

    We present a Bayesian word alignment model that does not postulate NULL words. Instead, source words that don't have lexical translations are generated from the source context. In the final alignment step, such source words are left unaligned. This leads to more informed distributions over unaligned words because these distributions are now conditioned on source contexts. Our model is also general enough to incorporate different distortion models. Finally, we have developed a fast auxiliary variable Gibbs sampler that makes our model competitive with existing models in terms of training time.

    After having presented our alignment model I will briefly discuss plans to extend it to a probabilistic phrase extraction model for machine translation.

8 September 2016 Mikel Forcada, Universitat d'Alacant (Spain) - Towards an effort-driven combination of translation technologies in computer-aided translation

  • The talk puts forward a general framework for the measurement and estimation of professional translation effort in computer-aided translation. It then outlines the application of this framework to optimize and seamlessly combine available translation technologies in a principled manner to reduce professional translation effort.

Past seminars

Reading Group

NLP Reading Group

The target audience is all members of the NLP group and any other interested participants.

The meeting will take place weekly for one hour, usually on Tuesdays from 11am to 12pm.

The meetings of the group will be informal. No preparation is required, except that the moderator should have read the current paper and everyone else should have at least a brief overview of it.

Next Meeting

Tuesday 19 September 2017

"Men also like shopping: Reducing Gender Bias Amplification Using Corpus Level Constraints"

Past Meetings

Tuesday 29 August 2017

Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction

Wen Sun, Arun Venkatraman, Geoffrey J. Gordon, Byron Boots, J. Andrew Bagnell

Proceedings of the 34th International Conference on Machine Learning, PMLR 70:3309-3318, 2017.

Tuesday 22 August 2017

Split and Rephrase, Accepted for EMNLP 2017

Shashi Narayan, Claire Gardent, Shay B. Cohen and Anastasia Shimorina

Tuesday 15 August 2017

Attention Is All You Need
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely

Tuesday 8 August 2017

Learning to Compute Word Embeddings On the Fly

Dzmitry Bahdanau, Tom Bosc, Stanisław Jastrzębski, Edward Grefenstette, Pascal Vincent, Yoshua Bengio

Tuesday 1 August 2017

Learning to Generate Textual Data, EMNLP 2016
Guillaume Bouchard and Pontus Stenetorp and Sebastian Riedel

Tuesday 11 July 2017

SoundNet: Learning Sound Representations from Unlabeled Video

Yusuf Aytar, Carl Vondrick, Antonio Torralba

Tuesday 4 July 2017

Sentence Simplification with Deep Reinforcement Learning

Xingxing Zhang, Mirella Lapata

Tuesday 27 June 2017

Generation and Comprehension of Unambiguous Object Descriptions

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, Kevin Murphy

Tuesday 20 June 2017

Understanding the BPE algorithm

Tuesday 13 June 2017

Sequence-to-Sequence Models Can Directly Transcribe Foreign Speech

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, Zhifeng Chen

Tuesday 6 June 2017

Convolutional Sequence to Sequence Learning

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin

Tuesday 30 May 2017

Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems

Wang Ling, Dani Yogatama, Chris Dyer, Phil Blunsom

Tuesday 9 May 2017

Chatterjee et al.: Online Automatic Post-editing for MT in a Multi-Domain Translation Environment

Tuesday 6 May 2017

Convolutional Sequence to Sequence Learning

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin

Tuesday 2 May 2017

Coarse-to-Fine Question Answering for Long Documents

Tuesday 25 April 2017

Re-evaluating Automatic Metrics for Image Captioning

Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, Erkut Erdem

Tuesday 18 April 2017

Neural Tree Indexers, EACL2017

Tuesday 11 April 2017

EACL Recap

Tuesday 4 April 2017

Shakir Mohamed's deep learning overview

Tuesday 28 March 2017

Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond

Tuesday 21 March 2017

Unsupervised AMR-Dependency Parse Alignment

Tuesday 14 March 2017

Kim et al. (2016): Examples are not Enough, Learn to Criticize! Criticism for Interpretability, NIPS 2016

Tuesday 7 March 2017

Latent Variable Dialogue Models and their Diversity

Kris Cao and Stephen Clark

Tuesday 28 February 2017

Zhang et al. EACL2017

Tuesday 21 February 2017

Structured Attention Networks

Tuesday 14 February 2017

CORE: Context-Aware Open Relation Extraction with Factorization Machines

by Fabio Petroni, Luciano Del Corro and Rainer Gemulla

Tuesday 7 February 2017

Adversarial Training Methods for Semi-Supervised Text Classification

Takeru Miyato, Andrew M. Dai, Ian Goodfellow

Tuesday 31 January 2017

Learning to Prune: Exploring the Frontier of Fast and Accurate Parsing

Tim Vieira and Jason Eisner

Tuesday 24 January 2017

Matching Networks for One Shot Learning

Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, Daan Wierstra

Tuesday 17 January 2017

Learning Structured Predictors from Bandit Feedback for Interactive NLP. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL). Berlin, Germany

Artem Sokolov, Julia Kreutzer, Christopher Lo, Stefan Riezler

Tuesday 13 December 2016

Optimization and Sampling for NLP from a Unified Viewpoint

Marc Dymetman, Guillaume Bouchard, Simon Carter

Tuesday 6 December 2016

Matrix Completion has No Spurious Local Minimum

Rong Ge, Jason D. Lee, Tengyu Ma

Tuesday 29 November 2016

Compositional Semantic Parsing on Semi-Structured Tables 
Panupong Pasupat and Percy Liang

Tuesday 22 November 2016

Minimum Risk Training for Neural Machine Translation 
Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, Yang Liu

Tuesday 15 November 2016

Generation from Abstract Meaning Representation using Tree Transducers 
Jeffrey Flanigan, Chris Dyer, Noah A. Smith and Jaime Carbonell

Tuesday 1 November 2016

Visual Representations for Topic Understanding and Their Effects on Manually Generated Labels. Transactions of the Association for Computational Linguistics, 2016.
Alison Smith, Tak Yeon Lee, Forough Poursabzi-Sangdeh, Leah Findlater, Jordan Boyd-Graber, and Niklas Elmqvist

Tuesday 25 October 2016

Learning to Search Better than your Teacher

Chang et al. ICML 2015

Tuesday 11 October 2016

A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task 
Danqi Chen, Jason Bolton, Christopher D. Manning

Tuesday 4 October 2016

Ultradense Word Embeddings by Orthogonal Transformation 
Sascha Rothe, Sebastian Ebert, Hinrich Schütze

Tuesday 7 June 2016

Not All Character N-grams Are Created Equal: A Study in Authorship Attribution. 
Upendra Sapkota, Steven Bethard, Manuel Montes-y-Gómez & Thamar Solorio (2015)

Tuesday 31 May 2016

Relation extraction with matrix factorization and universal schemas.

Riedel, S., Yao, L., McCallum, A., & Marlin, B. M. (2013)

Tuesday 10 May 2016

Training Deterministic Parsers with Non-Deterministic Oracles, TACL

Goldberg, Y. and Nivre, J. (2013)

Tuesday 3 May 2016

A New Corpus and Imitation Learning Framework for Context-Dependent Semantic Parsing 
Vlachos, A. and Clark, S.

Tuesday 22 April 2016

Sequence Level Training with Recurrent Neural Networks
Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba

Tuesday 22 March 2016

"Distributed Representations of Sentences and Documents"
Quoc Le and Tomas Mikolov

Tuesday 8 March 2016

AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes 
Sascha Rothe; Hinrich Schütze. ACL2015 (best student paper)

Tuesday 23 February 2016

From Word Embeddings To Document Distances 
Kusner et al.

Tuesday 16 February 2016

"Target-Dependent Twitter Sentiment Classification with Rich Automatic Features"

Tuesday 9 February 2016

"Evaluation methods for unsupervised word embeddings"

Tuesday 25 January 2016

Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks 
Hua He, Kevin Gimpel, and Jimmy Lin. EMNLP2015

Tuesday 19 January 2016

Multilingual Image Description with Neural Sequence Models

Tuesday 12 January 2016

"Improving Distributional Similarity with Lessons Learned from Word Embeddings"

Tuesday 8 December 2015

Using Discourse Structure Improves Machine Translation Evaluation
F Guzmán, S Joty, L Màrquez, P Nakov

And here are the author's slides

Tuesday 1 December 2015

Practical Bayesian Optimization of Machine Learning Algorithms. Advances in Neural Information Processing Systems, 2012.
Snoek, J.; Larochelle, H. & Adams, R. P.

Related presentations/lecture slides:

Related Video

My reading group presentation slides

Tuesday 24 November 2015

Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. ACL 2015.
LSTMs? Kai Sheng Tai, Richard Socher, Christopher D. Manning

Additional resource about LSTM: "Anyone Can Learn To Code an LSTM-RNN in Python"

Tuesday 17 November 2015

RNNs/LSTMs ConvNets

More details on auto encoders for unsupervised pre-training:

Tuesday 10 November 2015

Multi-Metric Optimization Using Ensemble Tuning. NAACL2013. Video 
Baskaran Sankaran, Anoop Sarkar and Kevin Duh

Tuesday 3 November 2015

NN tutorials by Quoc Le

Josiah's slides

Other resources:

Andrej Karpathy's notes

Different objective functions, multiclass problems

Gradient descent


Discussion about different activation functions

Tuesday 27 October 2015

Three blog posts introducing RNNs for language modelling in equations and code

might help to read this NLP primer

Additional material:
a thorough explanation of back propagation

Tuesday 20 October 2015

Teaching Machines to Read and Comprehend. NIPS 2015. 
Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, Phil Blunsom

Slides (presented at LXMLS)

Background reading:

Understanding LSTMs

NAACL 2013 Tutorial "Deep Learning without Magic"

EMNLP 2014 Tutorial "Embedding Methods for NLP"

Related Work:

Entailment with Neural Attention (better description of attention models than in the NIPS paper in my opinion)

Memory Networks

Tuesday 13 October 2015

A large annotated corpus for learning natural language inference. Proceedings of EMNLP 2015. 
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning

Should compare this to work on (multilingual) textual similarity



Funded Research Projects

Current Projects

Currently these group projects are active (in alphabetical order)

  • Career Acceleration Fellowship: Machine Learning Methods for Personalised, Abstractive Summarisation of Consumer-Generated Media
  • Kalina Bontcheva
  • COMRADES: Collective Platform for Community Resilience and Social Innovation during Crises 
  • Kalina Bontcheva
  • Cracker: Cracking the Language Barrier: Coordination, Evaluation and Resources for European MT Research
  • Lucia Specia
  • DILiGENt: Domain-Independent Language Generation
  • Andreas Vlachos
  • GATE: A General Architecture for Text Engineering 
  • Hamish Cunningham
  • GOOGLE Grant: Distinguishing Common and Proper Nouns 
  • Mark Stevenson
  • Healtex: UK Healthcare Text Analytics Research Network 
  • Rob Gaizauskas
  • Investigating Spoken Dialogue to Support Manufacturing Processes
  • Rob Gaizauskas
  • KConnect: Khresmoi Multilingual Medical Text Analysis, Search and Machine Translation Connected in a Thriving Data-Value Chain 
  • Angus Roberts
  • KNOWMAK: Knowledge in the making in the European society
  • Diana Maynard
  • MultiMT: Multimodal Machine Translation 
  • Lucia Specia
  • OpenMinTed: Open Mining INfrastructure for TExt and Data 
  • Angus Roberts
  • Predicting Relevance and Quality of Machine Translation for Product Reviews
  • Lucia Specia
  • QT21: Quality Translation 21 
  • Lucia Specia
  • Recommendation Algorithm
    Mark Stevenson
  • SIMPATICO: SIMplifying the interaction with Public Administration Through Information technology for Citizens and cOmpanies
  • Lucia Specia
  • SoBigData: SoBigData Research Infrastructure 
  • Hamish Cunningham
  • SUMMA: Scalable Understanding of Multilingual MediA 
  • Andreas Vlachos
Previous Projects

Previous projects (in alphabetical order)

  • ACCURAT: Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation 
  • Rob Gaizauskas & Paul Clough (Information School)
  • ABRAXAS: Automating Ontology Learning for the Semantic Web 
  • Yorick Wilks & Fabio Ciravegna
  • AKT: Advanced Knowledge Technologies
  • Yorick Wilks
  • AMILCARE: An adaptive IE system for the Semantic Web 
  • Fabio Ciravegna
  • AMITIES: Automated Multilingual Interaction with Information and Services 
  • Yorick Wilks
  • AnnoMarket: Annotation Resource Marketplace in the Cloud 
  • Hamish Cunningham
  • ARCOMEM: From Collect-All Archives to Community Memories - Leveraging the Wisdom of the Crowds for Intelligent Preservation 
  • Hamish Cunningham
  • AVENTINUS: Advanced Information System for Multinational Drug Enforcement 
  • Yorick Wilks & Hamish Cunningham
  • Barista: Non-Parametric Models of Phrase-based Machine Translation 
  • Trevor Cohn
  • CA4NLP: Engineering Natural Language Interfaces: can CA help? 
  • Mark Hepple & Peter Wallis
  • CASTLE: Computational Adaptive Semantics for Language Engineering 
  • Mark Stevenson
  • CLARIN: Common Language Resources and Technology Infrastructure 
  • Wim Peters
  • CLARITY: Cross Language Information Retrieval and Organisation of Text and Audio Documents 
  • Rob Gaizauskas & Mark Sanderson (Information Studies)
  • CLEF: CLinical E-Science Framework 
  • Rob Gaizauskas & Mark Hepple
  • CLUE II: Contextual Learning for detecting Unexpected Events 
  • Louise Guthrie
  • COMIC: COnversational Multimodal Interaction with Computers 
  • Yorick Wilks
  • COMPANIONS: Intelligent, Persistent, Personalised Multimodal Interfaces to the Internet 
  • Yorick Wilks
  • CRONOPATH: Information Retrieval/Extraction through time 
  • Yorick Wilks
  • CONVERSE: A Conversational Companion 
  • Yorick Wilks
  • CLUE: Contextual Learning for detecting Unexpected Events 
  • Louise Guthrie
  • Cub Reporter: QA and Summarisation for Preparation of Background News Reports 
  • Rob Gaizauskas, Yorick Wilks & Jonathan Foster (Journalism Studies)
  • DALOS: DrAfting Legislation with Ontology-based Support 
  • Wim Peters
  • DAPPER: Natural Language Processing Tools for Discourse Analysis in Psychology 
  • Horacio Saggion
  • DecarboNET: A Decarbonisation Platform for Citizen Empowerment and Translating Collective Awareness into Behavioural Change
  • Kalina Bontcheva
  • DOT KOM: Designing Adaptive Information Extraction from Text for Knowledge Management and the Semantic Web 
  • Fabio Ciravegna
  • DotRural: A Text Analytic Approach to Rural and Urban Legal Histories 
  • Wim Peters
  • Expert: EXPloiting Empirical appRoaches to Translation 
  • Lucia Specia
  • : Extraction of Content: Research at Near Market 
  • Yorick Wilks
  • ELSE: Evaluation in Language and Speech Engineering 
  • Rob Gaizauskas
  • EMILLE: Enabling Minority Language Engineering 
  • Rob Gaizauskas
  • EMPATHIE: Enzyme and Metabolic Path Information Extraction 
  • Rob Gaizauskas
  • EMPIRICAL GRAMMAR: Inducing Adequate Grammars from Electronic Texts 
  • Yorick Wilks & Rob Gaizauskas
  • EnviLOD: 
  • Kalina Bontcheva
  • EWN: EuroWordNet 
  • Yorick Wilks
  • FASiL: Flexible and Adaptive Spoken Language and Multi-Modal Interfaces 
  • Yorick Wilks
  • FLaReNet: Fostering Language Resources Network 
  • Yorick Wilks & Wim Peters
  • ForgetIT: Concise Preservation by combining Managed Forgetting and Contextualized Remembering 
  • Hamish Cunningham
  • GATE Cloud Exploratory: Adapting the General Architecture for Text Engineering to Cloud Computing 
  • Hamish Cunningham
  • GoTag: Real-Time Text Mining for the Biomedical Literature: A Collaboration between Discoverynet & Mygrid 
  • Rob Gaizauskas
  • HUMAINE: Research on Emotions and Human-Machine Interaction 
  • Yorick Wilks & Daniela Romano
  • InPuT: Individual Profiling using Text Analysis 
  • Mark Stevenson
  • KHRESMOI: Knowledge Helper for Medical and Other Information users 
  • Hamish Cunningham
  • KTA PoC Award: Scaling-up WSD for the Life Sciences 
  • Mark Stevenson
  • KnowledgeWeb: Network of excellence on realising the Semantic Web
  • Hamish Cunningham
  • LarKC: Large Scale Semantic Computing Semantic Web Technologies distributed reasoning 
  • Hamish Cunningham
  • LaSIE: Large Scale Information Extraction 
  • Yorick Wilks & Rob Gaizauskas
  • LEXDIS: Lexical Disambiguation for the Biomedical Domain 
  • Mark Stevenson
  • LIRICS: Linguistic Infrastructure for Interoperable Resources and Systems 
  • Kalina Bontcheva
  • LOIS: Lexical Ontologies for Legal Information Sharing 
  • Wim Peters
  • M4L: Memories for Life Network 
  • Yorick Wilks, Christopher Brewster & Mark Sanderson (Information Studies)
  • MALT: Mappings, Agglomerations and Lexical Tuning 
  • Yorick Wilks
  • METER: Measuring Text Reuse 
  • Rob Gaizauskas, Yorick Wilks & Jonathan Foster (Journalism Studies)
  • MiAkt: Grid enabled knowledge services: collaborative problem solving environments in medical informatics 
  • Yorick Wilks & Fabio Ciravegna
  • MediaCampaign: Discovering, inter-relating and navigating cross-media campaign knowledge 
  • Hamish Cunningham
  • Medics: Language Processing for Literature Based Discovery in Medicine 
  • Mark Stevenson
  • MLi: Towards a MultiLingual Data Services infrastructure 
  • Hamish Cunningham
  • MoDiST: Modelling Discourse in Statistical Machine Translation 
  • Lucia Specia
  • MULTIFLORA_II: Combining Information Extraction and Knowledge Representation for Biodiversity Informatics 
  • Yorick Wilks & Hamish Cunningham
  • MultiMatch: Multilingual/Multimedia Access To Cultural Heritage 
  • Paul Clough (Information Studies)
  • MUMIS: Multi-Media Indexing and Searching Environment 
  • Yorick Wilks & Hamish Cunningham
  • MUSE: Multi-Source Entity finder 
  • Yorick Wilks
  • Musing: Multi-Industry, Semantic-based Next Generation Business IntelliGence 
  • Kalina Bontcheva
  • MyGrid: Supporting the Biologist E-Scientist 
  • Rob Gaizauskas
  • NAMIC: News Agencies Multilingual Information Categorisation 
  • Yorick Wilks
  • NEON: Lifecycle support for networked ontologies 
  • Hamish Cunningham
  • PAROLE/SIMPLE: Preparatory Action for Linguistic Resources Organisation for Language Engineering 
  • Yorick Wilks
  • PASTA: Protein Active Site Template Acquisition 
  • Yorick Wilks
  • PATHS: Personalised Access To cultural Heritage Spaces 
  • Mark Stevenson & Paul Clough (Information School)
  • PEEC: Partitioning the Enron Email Corpus 
  • Louise Guthrie
  • PEEC II: Partitioning the Enron Email Corpus 
  • Louise Guthrie
  • PHEME: Computing Veracity Across Media, Languages, and Social Networks
  • Kalina Bontcheva
  • POESIA: Public Open-source Environment for a Safer Internet 
  • Mark Hepple
  • POETIC: The POrtable Extendable Traffic Information Collator 
  • Rob Gaizauskas
  • PrestoSpace: Digital preservation and rich metadata indexing of audio-video collections 
  • Hamish Cunningham
  • QTLaunchpad: Preparation and Launch of a Large-Scale Action for Quality Translation Technology 
  • Lucia Specia
  • RESuLT: Relation Extraction using Semi-Supervised Learning Techniques 
  • Mark Stevenson
  • REVEAL: The Identification of Anomalous Segments in Text on a Large Scale 
  • Louise Guthrie
  • REVEAL II: The Identification of Anomalous Segments in Text on a Large Scale 
  • Louise Guthrie
  • RolTech: Platform for Romanian Language Technology: Resources, Tools and Interfaces 
  • Valentin Tablan
  • SEKT: Semantically-Enabled Knowledge Technologies (central page) 
  • Hamish Cunningham
  • SENSEI: Making Sense of Human-Human Conversation Data
  • Rob Gaizauskas
  • SenseMaking: Information Processing and Sensemaking: An Exploratory Search System for Document Collections 
  • Mark Stevenson
  • SERA: Social Engagement with Robots and Agents 
  • Peter Wallis
  • ServiceFinder: Realizing Web Service Discovery at Web Scale 
  • Kalina Bontcheva
  • SLaTr: A Joint Model of Spoken Language Translation 
  • Trevor Cohn / Thomas Hain
  • Sumerian/ETCSL: Tools for linguistic annotation and Web-based analysis of literary Sumerian 
  • Hamish Cunningham
  • SOCIS: Scene Of Crime Information System 
  • Yorick Wilks
  • SToBS: Structured Transcription of Broadcast Speech 
  • Rob Gaizauskas
  • TaaS: Terminology as a Service 
  • Rob Gaizauskas
  • TAO: Transitioning Applications to Ontologies 
  • Kalina Bontcheva
  • h-Techsight: A Knowledge management platform with intelligence and insight capabilities for technology intensive industries 
  • Hamish Cunningham
  • TEXTvre: Emerging, collective intelligence for personal, organizational and social use 
  • Kalina Bontcheva & Angus Roberts
  • TrendMiner: Large-scale, Cross-lingual Trend Mining and Summarisation of Real-time Media Streams 
  • Kalina Bontcheva & Trevor Cohn
  • TRESTLE: Text Retrieval, Extraction and Summarisation for Large Enterprises 
  • Rob Gaizauskas & Micheline Beaulieu (Information Studies)
  • TRIPOD: TRI-Partite multimedia Object Description 
  • Mark Sanderson (Information Studies) & Rob Gaizauskas
  • uComp: Embedded Human Computation for Knowledge Extraction and Evaluation 
  • Wim Peters
  • VIEWGEN: Belief Modelling and Dialogue Systems 
  • Yorick Wilks
  • VIKEF: Virtual Information and Knowledge Environment Framework 
  • Rob Gaizauskas
  • VisualSense: Tagging visual data with semantic descriptions 
  • Rob Gaizauskas

PhD Projects

Current Research Students

Current students listed in alphabetical order

Awarded PhDs

Awarded PhDs by year


Gustavo Henrique Paetzold
Lexical Simplification for Non-Native Speakers
(Award Date: 24 October 2016)

Xingyi Song
Training Machine Translation for Human Acceptability
(Award Date: 16 October 2016)

Roland Roller
Information Extraction from Documents in the Life Sciences
(Award Date: 26 August 2016)


Dominic Rout
A ranking approach to summarising Twitter home timelines
(Award Date: 24 November 2015)


Nikolaos Aletras
Exploring the Semantics of Topic Models
(Award Date: 11 December 2014)

Ayman Alhelbawy
A new approach to information extraction from natural language texts
(Award Date: 23 September 2014)

Daniel Preotiuc-Pietro
Unsupervised learning for time-based clustering of language
(Award Date: 19 June 2014)

Ahmet Aker
Entity Type Modeling for Multi-Document Summarization of Geo-Located Entity Descriptions
(Award Date: 20 February 2014)


Leon Derczynski
Determining the Types of Temporal Relations in Discourse
(Award Date: 2 October 2013)

Samuel Fernando 
Enriching knowledge bases using relation extraction
(Award Date: 13 June 2013)

Giuseppe Di Fabbrizio
Automatic Summarization of Opinions in Service and Product Reviews
(Award Date: 8 May 2013)


Angus Roberts
Clinical Information Extraction: Lowering the Barrier
(Award Date: 18 December 2012)

Rao Muhammad Adeel Nawab
Mono-lingual Paraphrased Text reuse and Plagiarism detection
(Award Date: 18 September 2012)

Niraj Aswani
Evolving a General Framework for Text Alignment: Case Studies with Two Asian Languages
(Award Date: 7 August 2012)

Kumutha Swampillai
Information Extraction Across Sentences
(Award Date: 7 March 2012)


Angelo Dalli
Timeline Extraction From Hyperlinked Text Corpora
(Award Date: 10 October 2011)

Danica Damljanovic
Natural Language Interfaces to Conceptual Models
(Award Date: 18 August 2011)


Ben Allison
An Improved Hierarchical Bayesian Model of Language for Document Classification
(Award Date: 21 October 2010)

Nick Webb
Cue-based dialogue act classification
(Award Date: 16 March 2010)

Sanaz Jabbari
A Statistical Model of Lexical Context
(Award Date: 23 February 2010)

Valentin Tablan
Toward Portable Information Extraction
(Award Date: 25 January 2010)


David Guthrie
Unsupervised Detection of Anomalous Text
(Award Date: 3 December 2008)

Joe Polifroni
Enabling Browsing in Interactive Systems
(Award Date: 18 November 2008)

Christopher Brewster
Mind the Gap: Bridging from text to ontological Knowledge
(Award Date: 1 October 2008)

Francois Mairesse
Learning to Adapt in Dialogue Systems: Data-driven Models for Personality Recognition and Generation
(Award Date: 30 September 2008)

Hrafn Loftsson
Tagging and Parsing Icelandic Text
(Award Date: 5 February 2008)


Michael Conway
Approaches to Automatic Biographical Sentence Classification: An Empirical Study
(Award Date: 27 July 2007)


Mark Greenwood
Open-Domain Question Answering
(Award Date: 13 March 2006)


Fang Huang
Multi-Document Summarization with Latent Semantic Analysis
(Award Date: 19 May 2005)

Ekaterini Pastra
Vision – Language Integration: a Double-Grounding Case
(Award Date: 5 January 2005)


Alexiei Dingli
Annotating the Semantic Web
(Award Date: 6 December 2004)

Wim Peters
Detection and Characterization of Figurative Language Use in WordNet
(Award Date: 29 November 2004)

Diego Uribe
LEEP: Learning Event Extraction Patterns
(Award Date: 18 October 2004)

Brian Mitchell
Prepositional Phrase Attachment using Machine Learning Algorithms
(Award Date: 5 July 2004)


Paul Clough
Measuring Text Reuse
(Award Date: 11 April 2003)


Tomas By
Tears in the Rain
(Award Date: 15 March 2002)

Andrea Setzer
Temporal information in newswire articles: An annotation scheme and corpus study
(Award Date: 15 March 2002)


Kalina Bontcheva
Generating Adaptive Hypertext
(Award Date: 17 September 2001)

Alexander Krotov
Parsing with a Compacted Treebank Grammar
(Award Date: 17 September 2001)


ChunYu Kit 
Unsupervised Lexical Learning as Inductive Inference
(Award Date: 15 November 2000)

Hamish Cunningham
Software Architecture for Language Engineering
(Award Date: 10 July 2000)

H.M. Harmain
Building Object-Oriented Conceptual Models Using Natural Language Processing Techniques
(Award Date: 2000)

Paul Woods
Cognitive Schemas for Chinese Noun Classifiers: A Corpus-Based Investigation
(Award Date: 25 February 2000)


Ted Dunning
Finding Structure In Text, Genome And Other Symbolic Sequences
(Award Date: 29 November 1999)

Mark Stevenson
Multiple Knowledge Sources for Word Sense Disambiguation
(Award Date: 27 September 1999)

Hammid Khosravi
Extracting Pragmatic Content From Email
(Award Date: 9 August 1999)


Mark Lee
Belief, Rationality and Inference
(Award Date: 14 December 1998)

Rob Collier
Automatic Template Creation for Information Extraction
(Award Date: 10 August 1998)

Resources

Group member resources