Unlocking linguistic DNA

A collaborative venture between digital humanities experts and linguists is developing sophisticated algorithms to discover previously invisible patterns and relationships between concepts and ideas in more than 37 million pages of printed material.

The ultimate aim is nothing less than to unlock the linguistic DNA of early modern western thought. But the route to that goal traverses new and difficult ground. Crossing it involves developing and testing processes and tools that will allow the researchers to make sense of the vast, untapped resource contained in billions of words of digitised printed material.


“What we are doing is groundbreaking,” says Professor Susan Fitzmaurice from the School of English, speaking of this project funded by the Arts and Humanities Research Council (AHRC) and conducted with research colleagues at the universities of Glasgow and Sussex. “We are turning on its head the way research in this field has previously been conducted. In the past, researchers tended to adopt a top-down approach, selecting material that they thought was important. This approach is bottom-up. We are asking the data to speak to us. The tools we are developing are utterly agnostic, so the patterns and associations they reveal will emerge from the data itself.”

An authority on the evolution of the English language, Professor Fitzmaurice is working closely with data specialists in HRI Digital who are using high-performance computing and data visualisation to identify lexical and semantic patterns in texts from the early modern period.

Professor Susan Fitzmaurice commented, “This is a remarkable collaboration between humanities specialists and digital experts at Humanities Research Institute (HRI) Digital, so we have software developers who are responding to and acting upon humanities questions. This marks a significant advance from the previous approach of creating digital libraries to developing digital methods for investigating major research questions in the humanities. The result will be a set of automated processes that will not only mine millions and millions of pages of early modern printed text but also reveal the associations between words and concepts. The data will be in the form of giant matrices which will allow us to discern patterns that will enable us to see for the first time what it was that these early modern writers considered to be of interest.”

HRI Digital Director Michael Pidd adds, “We are working with Susan and her collaborators at two other universities. Our ambition is to help them create a research model that is able to explore the history, linguistic features and characteristics of word formation and vocabulary in the evolution of modern western thinking.”

He added that his team is using information extraction techniques to identify lexical patterns within approximately 37 million pages. “The total dataset comprises over 250,000 texts,” he said. “We also want to be able to demonstrate the wider applicability of the information extraction and concept modelling techniques by developing a demonstrator for a modern body of scholarship such as JSTOR, a digital library of academic journals, books, and primary sources.”

“It’s a major undertaking involving complex data,” said Professor Fitzmaurice. “The possibility of mapping the linguistic and conceptual changes that quite possibly started modernity has long been a Holy Grail of the arts and humanities. Our research will now make it possible to discern trends, relationships, and anomalies across an enormous amount of linguistic data to identify the often surprising complexities, continuities, and discontinuities of conceptual change.”