Information Access

Building applications to improve access to information in massive text collections, such as the web, newswires and the scientific literature. Subtopics include:


Information Extraction, Text Mining and Semantic Annotation

Information extraction (IE) refers to the activity of automatically recognizing pre-specified sorts of information in short, natural language texts. For instance, one might scan business newswire texts for announcements of management succession events (retirements, appointments, promotions, etc.), identify the names of the participating companies and individuals, the post involved, the vacancy reason, and so on. Or, one might scan biomedical research papers, identify the names of proteins and determine which proteins are engaged in interactions with which other proteins.

Once identified in texts, the specified information may be utilised in various ways. The information may be annotated in the source texts -- so-called semantic annotation -- and used as the basis for semantic search, e.g. for the Semantic Web. Or the information may be extracted from the source texts and stored in a separate structured repository or database, which may then be searched or linked using conventional database queries, or analysed using data-mining techniques -- potentially leading to the discovery of novel associations, i.e. text mining. Or the extracted information can be used to generate summaries focused on the extraction targets.
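The template-filling idea behind IE can be sketched in a few lines. This is a toy, pattern-based illustration only -- the `extract_interactions` function and its single regular expression are invented for this example; real IE systems rely on gazetteer lookup, finite-state grammars and machine-learned models rather than one pattern:

```python
import re

# Toy pattern-based extractor: find "X interacts with Y" statements and
# return them as structured (protein, protein) records -- the kind of
# template filling an IE system performs at much larger scale.
INTERACTION = re.compile(r"\b([A-Z][A-Za-z0-9]+) interacts with ([A-Z][A-Za-z0-9]+)\b")

def extract_interactions(text):
    """Return a list of (protein_a, protein_b) pairs found in the text."""
    return INTERACTION.findall(text)

abstract = ("We show that BRCA1 interacts with BARD1 in vivo, "
            "and that TP53 interacts with MDM2 under stress.")
print(extract_interactions(abstract))
# [('BRCA1', 'BARD1'), ('TP53', 'MDM2')]
```

In a real pipeline the extracted tuples would then be written to a structured repository, or used to annotate the source text in place.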

Our contribution

The NLP group has worked intensively on IE-related topics since its inception. The group has produced a wide range of IE systems and components, some of which are freely available with the GATE platform, and has embedded them in prototype applications. In these systems we have investigated techniques ranging from relatively deep, linguistically-motivated knowledge engineering approaches, including full parsing and discourse interpretation using models of domain and world knowledge, to supervised and semi-supervised machine learning approaches, such as support vector machines, that exploit labelled corpora. The group has worked in a variety of domains and application areas, including newswire analysis for competitor intelligence, biomedical research paper analysis to support scientific research, and clinical records analysis to support clinical research and patient care.

People

Kalina Bontcheva, Hamish Cunningham, Rob Gaizauskas, Mark Hepple, Diana Maynard, Lucia Specia, Mark Stevenson, Yorick Wilks

Projects

  • ARCOMEM: From Collect-All Archives to Community Memories - Leveraging the Wisdom of the Crowds for Intelligent Preservation
  • KDisc: Language Processing for Literature Based Discovery in Medicine
  • PATHS - Personalised Access to Cultural Heritage Collections
  • TrendMiner: Large-scale, Cross-lingual Trend Mining and Summarisation of Real-time Media Streams
  • AMILCARE - Amilcare is an adaptive Information Extraction (IE) system intended to support document annotation in the Semantic Web framework
  • ANNIE - ANNIE is a robust, multi-genre Information Extraction application distributed with Sheffield's GATE.
  • ARMADILLO - Armadillo is a knowledge mining system used to extract information from several sources.
  • CLEF - CLEF, the CLinical E-Science Framework project, aims to extract information such as symptoms, diagnosis and treatment from clinical records of cancer patients.
  • DOT.KOM - Dot.Kom aims at designing adaptive Information Extraction from text for Knowledge Management and the Semantic Web
  • EMPathIE - EMPathIE -- Enzyme and Metabolic Pathways Information Extraction -- aims to extract information about enzymes and enzyme reactions from biomedical journal papers with a view to supporting researchers investigating metabolic pathways
  • LaSIE - LaSIE -- Large Scale Information Extraction -- is an IE system designed for research and benchmarking purposes and tailored for participation in the 6th and 7th US ARPA-sponsored Message Understanding Conference system evaluations (MUC-6 and MUC-7)
  • MELITA - Melita is an ontology-based text annotation tool that implements a methodology for managing the whole annotation process for users.
  • Musing - Musing aims to develop knowledge extraction systems to support business intelligence applications.
  • MyGRID - MyGRID is an E-Science project aiming to produce a virtual laboratory workbench for biological researchers, one component of which is information extraction from the biological research literature
  • PASTA - PASTA -- Protein Active Site Template Acquisition -- aims to extract information about protein active sites from the biomedical research literature to assist research scientists
  • RESuLT - Relation extraction using semi-supervised learning and ontologies
  • SOCIS - SOCIS, also based on GATE, extracts scene-of-crime information. The project is a collaboration between the Universities of Sheffield and Surrey and four police departments.
  • TRESTLE - TRESTLE -- Text Retrieval, Extraction and Summarization Technologies for Large Enterprises -- aims to find, extract and summarize information of relevance to large enterprises in competitive environments, such as the pharmaceutical industry.


Question Answering

Open domain question answering (QA) systems aim to support a user who wishes to ask a specific question in natural language and receive a specific answer to that question, where the answer is to be sought in a (potentially huge) collection of natural language texts. QA has become an important application area of natural language processing technologies in the past few years, stimulated by the TREC QA track and more recently the Text Analysis Conferences (TAC).

Rapid advances have been made in developing systems that can answer specific questions, but it is becoming increasingly clear that the questions of most information seekers are not simply of the pub quiz variety (e.g. When was the telephone invented?), but rather questions where the asker seeks a brief summary or synopsis of facts relating to the question. For example, a question such as Who was Tertullian? cannot be answered in one or two words, but requires a number of related facts, giving, for example, Tertullian's nationality, birth date and place, his major achievements, etc. This information may be distributed across multiple documents, many of which will repeat each other in various ways. Thus, QA quite naturally relates both to information extraction and to multi-document summarization.

Our contribution

The NLP group has developed several QA systems to investigate how useful varying amounts of linguistic knowledge are in QA. We have been regular participants in the TREC QA evaluations and have made contributions to the literature on evaluation of QA systems. We have had a specific interest in investigating the role that information retrieval systems play as a first but critical stage in most QA systems -- retrieving a small set of candidate answer-bearing documents to be intensively analyzed by a second stage answer extraction component -- and have run several workshops on this topic.
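The two-stage architecture described above can be sketched as follows. Everything here is a hypothetical miniature -- the document set, the term-overlap retriever and the year-spotting answer extractor are invented stand-ins for the far richer components of a real QA system:

```python
import re

# A toy two-stage QA pipeline: stage 1 retrieves candidate answer-bearing
# documents; stage 2 extracts a short answer from them.
DOCS = [
    "Alexander Graham Bell patented the telephone in 1876.",
    "The telephone transformed long-distance communication.",
    "Bell Labs was founded decades after the telephone's invention.",
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(question, docs, k=2):
    """Stage 1: rank documents by term overlap with the question."""
    q_terms = set(tokenize(question))
    return sorted(docs, key=lambda d: -len(q_terms & set(tokenize(d))))[:k]

def extract_answer(question, candidates):
    """Stage 2: for a 'When ...' question, pull a year out of the candidates."""
    if question.lower().startswith("when"):
        for doc in candidates:
            match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", doc)
            if match:
                return match.group(1)
    return None

question = "When was the telephone invented?"
print(extract_answer(question, retrieve(question, DOCS)))  # prints 1876
```

Note that even in this miniature the retriever ranks an irrelevant document highly on term overlap alone, and the answer-extraction stage has to recover -- which is exactly why the interaction between the two stages repays careful study.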

People

Rob Gaizauskas, Mark Greenwood, Mark Hepple, Yorick Wilks

Projects

  • Cub Reporter - Cub Reporter aims to support a journalist preparing background on breaking news stories through use of automatic question answering and summarization techniques.

Summarization

Summarization systems aim to take one or more documents and produce from them a reduced document which contains essential information from the source documents. Of course, what constitutes essential information is relative to the goals of the user who wishes to use the summary as a surrogate for the original(s), and varying types of summaries may be produced to meet the needs of various users, whose information needs may be represented in various ways.

Summarization is a natural counterpoint to QA, as both are technologies which aim to support information seekers in finding relevant information as efficiently and effectively as possible.
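A minimal extractive, single-document sketch of this idea: score each sentence by the frequency of its content words across the document, and keep the highest-scoring sentences in their original order. The function, data and stop-word list are invented for illustration and are not the approach of any particular system:

```python
import re
from collections import Counter

def summarize(text, n=1):
    """Extractive summarization sketch: score each sentence by the document
    frequency of its content words and keep the top n, in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z]+", text.lower())
    stop = {"the", "a", "an", "of", "in", "and", "to", "is", "it", "for"}
    freq = Counter(w for w in words if w not in stop)
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))
    top = sorted(sentences, key=score, reverse=True)[:n]
    return " ".join(s for s in sentences if s in top)

doc = ("Summarization systems condense documents. "
       "They select the most informative sentences. "
       "Weather was mild yesterday.")
print(summarize(doc, n=1))
# They select the most informative sentences.
```

Abstractive approaches, by contrast, generate new text rather than selecting existing sentences, and multi-document variants must additionally detect and suppress repetition across sources.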

Our contribution

The NLP group has explored various approaches to single and multidocument summarization, including abstractive as well as extractive approaches. The group has participated in the Document Understanding Conference (DUC) and Text Analysis Conference (TAC) summarization system evaluations. The group distributes the SUMMA toolkit, a robust and customisable toolkit for experimenting with various approaches to single and multidocument summarization. We have contributed to the literature on the evaluation of summarization systems and to the construction of resources for evaluation. We have explored techniques for topic-focussed summarization and pioneered work on multidocument summarization for image captioning.

People

Kalina Bontcheva, Rob Gaizauskas

Projects

  • Cub Reporter - Cub Reporter aims to support a journalist preparing background on breaking news stories through use of automatic question answering and summarization techniques.
  • TRESTLE - TRESTLE -- Text Retrieval, Extraction and Summarization Technologies for Large Enterprises -- aims to find, extract and summarize information of relevance to large enterprises in competitive environments, such as the pharmaceutical industry.
  • TRIPOD - TRIPOD's overall objective is to create captions automatically for images with associated positional (i.e. GPS) metadata. One element of this is to develop multidocument summarization techniques to generate captions from web pages deemed relevant to an image.