Detection of Reuse and Anomaly

Authors and copyright owners have an interest in knowing when their texts have been reused by others, with or without permission. Educators have an interest in knowing when text has been plagiarised. Scholars and intelligence gatherers may be interested in knowing where anomalous events occur in text, as these may signal changes in authorship or the insertion of "hidden" content.

These application scenarios motivate research into statistical techniques for automatically detecting similarity and difference between and within texts. Similarity or difference may be in topic, genre, style or authorship. An interesting research question is the extent to which common techniques can be developed for detecting similarity or difference along all these dimensions.

Our contribution

The NLP group has pioneered work in studying text reuse in newswires, developing corpus analysis concepts and methods, annotated corpus resources and automated methods for detecting and measuring reuse. The group has also developed novel statistical models and techniques for identifying anomalous segments within larger texts.


Rob Gaizauskas, Mark Stevenson, Dr. Lucia Specia, Yorick Wilks


  • CLUE - Contextual Learning for detecting Unexpected Events
  • CLUE II - Contextual Learning for detecting Unexpected Events
  • PEEC - Partitioning the Enron Email Corpus
  • PEEC II - Partitioning the Enron Email Corpus
  • REVEAL - The identification of anomalous segments in text on a large scale.
  • METER - Measuring TExt Reuse. The project aims to investigate the issue of automatically detecting and measuring text reuse, focusing on the domain of journalism.