Language Resources and Architectures

To use computational methods in studying language or to develop prototype language processing application systems, both data and processing resources are necessary. Data resources -- corpora, both annotated and unannotated -- are necessary for analysis and for training and testing components and systems. Reusable processing resources -- such as tokenizers, part-of-speech taggers, parsers -- enable new research and development to build on earlier efforts and free researchers from re-implementing components for each new project.

Enabling multiple data and processing resources to be accessible and to interoperate within a single environment is a challenging task and requires a language processing platform or architecture.

Our contribution

The NLP group has developed perhaps the best-known and most widely used architecture for language engineering -- the General Architecture for Text Engineering (GATE). GATE is a powerful open-source, Java-based platform for language engineering with capabilities for processing a wide range of document formats (XML, HTML, PDF, Word, email, plain text, etc.), building modular systems from reusable components, and storing, evaluating and visualising results. It can be used as a research platform or as an integrated development environment for building complex language processing systems, which can then be embedded in larger end-user applications. GATE is delivered with a set of information extraction tools developed at Sheffield; in addition, a wide range of contributed or third-party modules have been integrated within it.
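The idea of building modular systems from reusable components can be illustrated with a minimal sketch. This is not GATE's API: every name below (Annotation, Document, ProcessingResource, WhitespaceTokenizer, CapitalizedWordTagger, Pipeline) is hypothetical and chosen only for illustration. The sketch assumes a shared document model in which each component reads the text and adds stand-off annotations, so that later components can build on the output of earlier ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is an illustration, not the GATE API.

/** A typed span of text, stored as character offsets into the document. */
final class Annotation {
    final String type;
    final int start;
    final int end;

    Annotation(String type, int start, int end) {
        this.type = type;
        this.start = start;
        this.end = end;
    }
}

/** Shared data model: the text plus every annotation added so far. */
class Document {
    final String text;
    final List<Annotation> annotations = new ArrayList<>();

    Document(String text) {
        this.text = text;
    }

    /** Returns the text covered by an annotation. */
    String covered(Annotation a) {
        return text.substring(a.start, a.end);
    }
}

/** A reusable component: reads the document and adds annotations to it. */
interface ProcessingResource {
    void process(Document doc);
}

/** Marks each maximal run of non-whitespace characters as a "Token". */
class WhitespaceTokenizer implements ProcessingResource {
    public void process(Document doc) {
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= doc.text.length(); i++) {
            boolean boundary = i == doc.text.length()
                    || Character.isWhitespace(doc.text.charAt(i));
            if (!boundary && start < 0) {
                start = i;
            }
            if (boundary && start >= 0) {
                doc.annotations.add(new Annotation("Token", start, i));
                start = -1;
            }
        }
    }
}

/** Builds on the tokenizer's output: flags tokens starting with an upper-case letter. */
class CapitalizedWordTagger implements ProcessingResource {
    public void process(Document doc) {
        List<Annotation> found = new ArrayList<>();
        for (Annotation a : doc.annotations) {
            if (a.type.equals("Token")
                    && Character.isUpperCase(doc.text.charAt(a.start))) {
                found.add(new Annotation("Capitalized", a.start, a.end));
            }
        }
        doc.annotations.addAll(found);
    }
}

/** Runs components in order over the same document. */
class Pipeline {
    private final List<ProcessingResource> stages;

    Pipeline(List<ProcessingResource> stages) {
        this.stages = stages;
    }

    Document run(String text) {
        Document doc = new Document(text);
        for (ProcessingResource stage : stages) {
            stage.process(doc);
        }
        return doc;
    }
}
```

In this sketch, a system is assembled by listing components, for example new Pipeline(List.of(new WhitespaceTokenizer(), new CapitalizedWordTagger())), and a component can be swapped or reused in another pipeline without changing the others, because they communicate only through the shared document.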


Kalina Bontcheva, Hamish Cunningham, Rob Gaizauskas, Diana Maynard, Wim Peters, Yorick Wilks


Current and Recent

  • CLARIN - The CLARIN project is a large-scale pan-European collaborative effort to create, coordinate and make language resources and technology available and readily usable.
  • FLaReNet - The European FLaReNet (Fostering Language Resources Network) project is intended to develop a common vision of the area of Language Resources and Language Technologies for the coming years and to foster a European strategy for consolidating the sector and enhancing competitiveness at EU level and worldwide.


  • GATE 2 - The GATE 2 project aimed to redesign and generalize version 1 of GATE in order to deliver a powerful infrastructure for language research and development.