Addressing linguistic challenges in information extraction on social media data

Event details

Wednesday 20 January 2021
Online event
Access link will open 30 minutes prior to the seminar


Speaker Bio:

Thamar Solorio is an Associate Professor in the Department of Computer Science at the University of Houston (UH). She holds graduate degrees in Computer Science from the Instituto Nacional de Astrofísica, Óptica y Electrónica, in Puebla, Mexico. Her research interests include information extraction from social media data, enabling technology for code-switched data, stylistic modeling of text and, more recently, multimodal approaches to online content understanding. She is the founder and director of the Research in Text Understanding and Language Analysis Lab at UH. She is the recipient of an NSF CAREER award for her work on authorship attribution, and of the 2014 Emerging Leader ABIE Award in Honor of Denice Denton. She is an elected board member of the North American Chapter of the Association for Computational Linguistics (2020-2021). Her research is currently funded by the National Science Foundation and Adobe, and in the past she has received support from the Office of Naval Research and the Defense Advanced Research Projects Agency (DARPA).

Abstract:

Social media data poses several interesting challenges for information extraction technology. In my group, we have been studying how and why sequence labelling methods perform worse on social media data than the same models do on more heavily edited text, such as newswire. These studies have informed our design choices for models that are more robust to naturalistic data, including data that contains language switching. My goal is to contribute to broadening the language abilities covered by NLP technology.

During this talk, I'll briefly discuss the proposals we have developed, which include enhanced versions of ELMo embeddings and a subword tokenization approach that is more flexible than the byte-pair encoding commonly used in language models. I'll conclude with a discussion of possible research directions for the near future.
