Ari Klein
Toward Using Twitter Data for Tracking COVID-19: A Natural Language Processing Pipeline
Presenter
Ari Z. Klein, PhD is a Research Associate in the Health Language Processing Center, in the Division of Informatics.
Abstract
In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results have presented challenges for actively monitoring its spread based on testing alone. We developed an automatic natural language processing pipeline for using Twitter data to identify potential cases of COVID-19 that are not based on testing and, thus, may not have been reported to the CDC. Beginning January 23, 2020, we collected English tweets from the Twitter Streaming API that mention keywords related to COVID-19. We applied hand-written regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out “reported speech” (e.g., quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing those that self-report potential cases from those that do not. A supervised deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases. We deployed the pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020, identifying 13,714 tweets that self-report potential cases and have United States state-level geolocations. This publicly available data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
Keywords
natural language processing; social media; data mining; COVID-19; coronavirus; pandemics; epidemiologyCommenting is now closed.
About Us
To understand health and disease today, we need new thinking and novel science —the kind we create when multiple disciplines work together from the ground up. That is why this department has put forward a bold vision in population-health science: a single academic home for biostatistics, epidemiology and informatics.
© 2023 Trustees of the University of Pennsylvania. All rights reserved.. | Disclaimer