Decay No More: A Persistent Twitter Dataset for Learning Social Meaning

Congratulations to PhD candidate Chiyu Zhang, Assistant Professor and Canada Research Chair Dr. Muhammad Abdul-Mageed, and Postdoctoral Fellow ElMoatez Billah Nagoudi on receiving the Best Paper Award for their work “Decay No More: A Persistent Twitter Dataset for Learning Social Meaning”. The paper was published at the 1st Workshop on Novel Evaluation Approaches for Text Classification Systems on Social Media (NEATCLasS), held at the International AAAI Conference on Web and Social Media (ICWSM 2022).

This work tackles a serious issue for natural language processing and machine learning research that makes use of social media data: posts become increasingly inaccessible over time, a phenomenon the authors call “data decay”, which makes past research hard to replicate. The work also indirectly ameliorates privacy concerns in social media research by allowing models to be trained on paraphrases of real-world data rather than the original posts. The team leverages a state-of-the-art deep learning model to paraphrase social media data from seventeen social meaning tasks, proposing a new, persistent dataset. Their experimental results demonstrate the promise of learning social meaning from synthetic data alone.

Abstract

With the proliferation of social media, many studies resort to social media to construct datasets for developing social meaning understanding systems. For the popular case of Twitter, most researchers distribute tweet IDs without the actual text contents due to the data distribution policy of the platform. One issue is that the posts become increasingly inaccessible over time, which leads to unfair comparisons and a temporal bias in social media research. To alleviate this challenge of data decay, we leverage a paraphrase model to propose a new persistent English Twitter dataset for social meaning (PTSM). PTSM consists of 17 social meaning datasets in 10 categories of tasks. We experiment with two SOTA pre-trained language models and show that our PTSM can substitute the actual tweets with paraphrases with marginal performance loss.

[ Read the full paper ]