Foundational Topics

Developing applications with human-like capabilities for processing language requires progress on foundational topics. Areas of interest include word sense disambiguation and the semantics of time and events.

Word Sense Disambiguation

Words often have more than one meaning. For example, "bat" can refer to a nocturnal mammal, the blink of an eye, or a piece of sports equipment. Knowing which meaning is intended is often vital to understanding a sentence. Humans normally resolve this ambiguity effortlessly, but the process turns out to be very difficult to automate. The task of determining the meanings of words in text, known as Word Sense Disambiguation (WSD), has a long history in NLP research and is regarded as an important stage in text understanding.
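A minimal sketch of how a dictionary-based WSD approach can work, using a simplified Lesk-style overlap measure: the sense whose dictionary gloss shares the most words with the surrounding context wins. The sense inventory for "bat" below is a toy illustration, not drawn from any real lexicon or from the group's systems.

```python
def lesk_disambiguate(context_words, sense_glosses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Toy sense inventory for "bat" (invented glosses for illustration only).
SENSES = {
    "animal": "nocturnal flying mammal that navigates by echolocation",
    "sports": "wooden club used to hit the ball in cricket or baseball",
}

sentence = "the player swung the bat and hit the ball over the boundary".split()
print(lesk_disambiguate(sentence, SENSES))  # prints "sports"
```

Combining multiple knowledge sources, as in the group's work, means adding further evidence (part of speech, collocations, domain codes) alongside this kind of gloss overlap and weighting their votes.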

Our contribution

The NLP group have developed a number of novel approaches to WSD. In particular, members of the group have developed techniques that combine multiple knowledge sources to improve disambiguation accuracy. These approaches are currently being applied to a range of ambiguities in the biomedical domain. WSD systems developed within the group have also been shown to improve results when applied to Machine Translation and Cross-language Information Retrieval systems.


Rob Gaizauskas, Wim Peters, Mark Stevenson, Yorick Wilks


  • Scaling-up WSD for the Life Sciences - aims to show that WSD in the life sciences can be carried out accurately and efficiently enough to be used within practical text mining systems.
  • CASTLE - automatic adaptation of lexicons to specific domains
  • BioWSD - aims to develop tools and algorithms to resolve lexical ambiguity in the biomedical domain
  • MALT - developed techniques for mapping between lexicons
  • ECRAN - integrated WSD into Information Extraction systems

Semantics of Time and Events

The capability to identify times, events and temporal relations in text is a fundamental requirement for text understanding and underlies most language processing applications. Question answering and information extraction, for example, seek to derive factual information from text corpora. However, most factual information needs temporal qualification to be meaningful. Aside from questions with an obvious temporal aspect, such as When was the first ascent of Everest?, answers to questions like Who is the CEO of Microsoft? or How many representatives does the UK have in the European Parliament? do not have single answers that are true for all time, but rather multiple answers each true at a separate time.

One recent effort to advance our understanding of the temporal semantics of language and our ability to extract temporal information from text has been the international collaborative effort to develop TimeML. TimeML is an XML standard for annotating times, events and temporal relations in natural language texts. TimeML effectively embodies a theory of just what we mean by "events" and "times". Furthermore, corpora of data annotated according to the TimeML guidelines provide a data resource both for studying the range and distribution of temporal phenomena in text and for training adaptive algorithms to extract references to times and events and to determine temporal relations between them.
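To make the annotation scheme concrete, here is a hedged illustration of TimeML-style markup: a sentence tagged with EVENT and TIMEX3 elements, parsed with the Python standard library to list the annotated events and times. The tag names and core attributes (eid, class, tid, type, value) follow the TimeML conventions, but the specific ids and the sentence are invented for this example.

```python
import xml.etree.ElementTree as ET

# TimeML-style annotated sentence (illustrative fragment, not from a real corpus).
annotated = (
    '<s>Hillary and Tenzing '
    '<EVENT eid="e1" class="OCCURRENCE">reached</EVENT> the summit on '
    '<TIMEX3 tid="t1" type="DATE" value="1953-05-29">29 May 1953</TIMEX3>.</s>'
)

root = ET.fromstring(annotated)
# Collect annotated events and normalised time expressions.
events = [(e.get("eid"), e.text) for e in root.iter("EVENT")]
times = [(t.get("tid"), t.get("value"), t.text) for t in root.iter("TIMEX3")]
print(events)  # [('e1', 'reached')]
print(times)   # [('t1', '1953-05-29', '29 May 1953')]
```

A full TimeML document would additionally link events and times with TLINK elements expressing temporal relations (e.g. that the reaching event is included in the annotated date).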

Our contribution

The NLP group developed an initial set of temporal annotation guidelines that were taken as the starting point for TimeML and have contributed significantly to its on-going development. The group has also proposed approaches to evaluating temporally annotated texts against a gold standard and contributed to the organization and running of the first international temporal annotation evaluation exercise, TempEval.


Rob Gaizauskas, Mark Hepple