Alejandro Mosquera López is an online safety expert and Kaggle Grandmaster working in cybersecurity. His main research interests are Trustworthy AI and NLP. ORCID: https://orcid.org/0000-0002-6020-3569

Thursday, February 16, 2023

Pretrained Models with Adversarial Training for Online Sexism Detection @ SemEval 2023

Abstract

Adversarial training can provide neural networks with significantly improved resistance to adversarial attacks, thus improving model robustness. However, a major drawback of many existing adversarial training workflows is the computational cost and extra processing time incurred when using data augmentation techniques. This post explores the application of embedding perturbations via the fast gradient method (FGM) when finetuning large language models (LLMs) on short text classification tasks. This adversarial training approach has been evaluated as part of the first sub-task of SemEval 2023-Task 10, focused on explainable detection of sexism in social networks (EDOS). Empirical results show that models adversarially finetuned with FGM had, on average, 25% longer training time and a 0.2% higher F1 score than their respective baselines.


Introduction

Social media is ubiquitous in everyday communication and a place where users freely express their views and thoughts. However, while most users make fair use of social media platforms, the presence of detrimental and undesirable content that can be deemed abusive towards other users or communities has drawn considerable attention lately. The threat this poses to online safety has sparked content moderation strategies different from those traditionally used for fighting spam or phishing, since in this case moderators have to deal with legitimate users occasionally posting messages containing toxic, sexist or abusive language, rather than with financially motivated cybercrime.

The use of Natural Language Processing (NLP) to detect and assess sexist content at scale is widespread; however, this is far from being a solved problem. The lack of fine-grained classification and poor interpretability have been highlighted as current shortcomings for this task. The Explainable Detection of Online Sexism (EDOS) shared task, organized as part of SemEval 2023, aims to bridge the above-mentioned gaps in sexist language detection for the English language.

In addition to the identified challenges, there are other nuances common to many other language-based moderation systems. Social media posts can be short and open to interpretation without enough context. Likewise, the presence of slang, spelling variations and purposely obfuscated content can lower the confidence of NLP tools (Mosquera and Moreda, 2012) and evade detection (Mosquera, 2022a). Finally, labeling large training sets can be expensive since it requires a pool of annotators with expert knowledge (Vidgen and Derczynski, 2021). Therefore, training sets with only a few thousand positive examples and high class imbalance are not uncommon, which can affect model robustness.

Background

The EDOS challenge provided a training dataset (Kirk et al., 2023) of 14,000 examples with fine-grained classifications for sexist posts written in English, extracted from the Gab and Reddit social networks. Of these, only 3,398 entries were labeled as sexist, resulting in a substantial class imbalance for the proposed sub-task A, the binary classification of sexist content.
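The post does not state how this imbalance was mitigated during training; one standard option is to weight the classification loss by inverse class frequency. A minimal PyTorch sketch using the class counts above (the weighting scheme is illustrative, not necessarily what the submitted system used):

```python
import torch

# Class counts from the EDOS sub-task A training set (see above).
n_total, n_sexist = 14_000, 3_398
n_not_sexist = n_total - n_sexist   # 10,602

# Inverse-frequency weights so the minority (sexist) class
# contributes proportionally more to the loss.
weights = torch.tensor([
    n_total / (2 * n_not_sexist),   # class 0: not sexist (~0.66)
    n_total / (2 * n_sexist),       # class 1: sexist (~2.06)
])
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
```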

There is extensive previous work tackling the detection of sexist content in social media using pre-trained models, either as a standalone task (Rodríguez-Sánchez et al., 2020; Abburi et al., 2021) or as a subset of toxic language classification (Park and Fung, 2017; Pitsilis et al., 2018). Some instances of sexist language can be subtle and difficult to identify in the absence of proper context (Swim et al., 2004), especially when working with short texts such as tweets or isolated sentences that are part of a larger conversation. Recent attempts to improve generalization and robustness in these cases usually involve ensembles (Davies et al., 2021), adversarial training (Samory et al., 2020) or data augmentation (Butt et al., 2021). However, these techniques are not foolproof, and in addition to the extra training cost they can introduce unintended bias (Sen et al., 2022).

Approach

For these reasons, in this post we approach the EDOS binary classification task by automatically finetuning and ensembling sets of different large pre-trained models. The Fast Gradient Method (FGM), an intuitive backpropagation-based method for generating adversarial examples (Szegedy et al., 2013; Miyato et al., 2016), is applied to the embedding layer during training in order to improve model robustness.
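A minimal PyTorch sketch of this FGM setup, assuming a Hugging Face-style transformer whose embedding weights contain "word_embeddings" in their parameter name and whose forward pass returns a `.loss`; the `epsilon` value and the `model`/`train_loader`/`optimizer` names are illustrative assumptions, not the exact configuration of the submitted system:

```python
import torch

class FGM:
    """Fast Gradient Method (Miyato et al., 2016): perturb the embedding
    weights along the gradient direction under an L2-norm constraint,
    then restore them after the adversarial backward pass."""

    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model = model
        self.epsilon = epsilon      # perturbation magnitude (illustrative)
        self.emb_name = emb_name    # substring identifying embedding params
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    # r_adv = epsilon * g / ||g||
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Training loop: one clean pass plus one adversarial pass per batch,
# which is where the extra ~25% training time comes from.
fgm = FGM(model, epsilon=1.0)
for batch in train_loader:
    loss = model(**batch).loss
    loss.backward()           # 1) gradients on the clean batch
    fgm.attack()              # 2) perturb embeddings in-place
    adv_loss = model(**batch).loss
    adv_loss.backward()       # 3) accumulate gradients on the perturbed batch
    fgm.restore()             # 4) undo the perturbation
    optimizer.step()          # 5) update with the combined gradients
    optimizer.zero_grad()
```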

The submission produced by the proposed system obtained an 85.6% F1 score and ranked 15th out of 84 competing systems. The main takeaway is that adversarial training can be a viable strategy under domain (text size) or cost (training time) constraints in comparison with other, more resource-intensive techniques such as text augmentation (Mosquera, 2022b).
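The post does not specify how the predictions of the ensembled models were combined; averaging softmax probabilities is one common choice. A minimal sketch under that assumption, where each element of `models` is a finetuned sequence classification checkpoint returning `.logits`:

```python
import torch

@torch.no_grad()
def ensemble_predict(models, batch):
    """Average softmax probabilities across finetuned checkpoints and
    return the highest-probability class per example."""
    probs = torch.stack([
        torch.softmax(model(**batch).logits, dim=-1)
        for model in models
    ])
    return probs.mean(dim=0).argmax(dim=-1)
```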

Results


Official leaderboard for EDOS sub-task A (binary sexism detection), showing each team's macro F1 and ranking on the final test set alongside its ranking on the dev set:

Final Ranking   Team Name                   Macro F1   Dev Ranking
1               PingAnLifeInsurance         0.8746     1
2               stce                        0.8740     2
2               FiRC-NLP                    0.8740     3
4               PALI                        0.8717     4
5               GZHU / UCAS-IIE             0.8692     8
6               Zhegu                       0.8674     9
7               ITNLP2023                   0.8630     13
8               YD                          0.8629     14
9               MilaNLP                     0.8616     17
10              NLP-LTU                     0.8613     19
11              niceNLP                     0.8609     24
12              CIC-SDS.KN                  0.8607     26
13              cl-uzh                      0.8586     30
14              Aston NLP                   0.8583     31
15              Alejandro Mosquera          0.8562     34
16              DCU                         0.8559     35
17              Chride                      0.8548     38
18              SUTNLP                      0.8545     40
19              JUST_ONE                    0.8538     41
19              UIRISC                      0.8538     42
19              Just_Dream                  0.8538     43
22              ABC                         0.8537     44
23              MarSan                      0.8518     46
24              UMUTeam                     0.8495     47
25              A2Z                         0.8479     50
26              AKD                         0.8461     51
27              AutoHome_Xu                 0.8452     53
28              PCJ                         0.8449     54
29              LCT-1                       0.8446     56
30              alexa                       0.8444     57
31              tmn                         0.8434     58
32              DH-FBK                      0.8402     61
33              I2C-HUELVA_G2               0.8396     62
34              SSS                         0.8392     63
35              Attention                   0.8390     64
36              AdamR                       0.8383     65
37              Team Daunting               0.8380     66
38              DAY                         0.8373     67
39              KingsmanTrio                0.8366     71
40              Efrat Luzzon                0.8362     72
41              DUTIR                       0.8352     73
42              UniBoe's                    0.8346     74
43              HULAT                       0.8298     75
44              Group2-RUG                  0.8295     76
45              SKAM                        0.8250     77
46              SINAI                       0.8245     79
47              HHS                         0.8239     80
48              msharma95                   0.8230     81
49              HausaNLP                    0.8228     82
50              UL & UM6P                   0.8225     84
51              CSECU-DSG                   0.8218     85
52              HITSZ-Q                     0.8197     87
53              MaChAmp                     0.8192     88
54              RRGI                        0.8186     89
55              hhuEDOS                     0.8185     90
56              shefnlp                     0.8181     91
57              tsingriver                  0.8180     93
58              JUAGE                       0.8166     94
59              Lexicools                   0.8124     96
60              PA666                       0.8119     97
61              Brainstormers_msec          0.8073     100
62              ACSMKRHR                    0.8009     102
63              TEDOS                       0.8001     103
64              Stanford MLab               0.7975     104
65              AU_NLP                      0.7943     105
66              PoSh                        0.7937     106
67              coco                        0.7895     108
68              PadmaDS                     0.7826     109
69              iREL                        0.7815     110
70              shm2023                     0.7758     111
71              IUST_NLP                    0.7750     112
72              INFOTEC                     0.7584     115
73              PanwarJayant                0.7583     116
74              LT3                         0.7542     117
75              USMBA_NLP                   0.7540     118
76              CNLP-NITS                   0.7478     119
77              YPMS                        0.7350     120
78              we_have_no_idea             0.7304     121
79              danch22                     0.7184     122
80              NLP-CogSci                  0.6325     127
81              OPEN SESAME                 0.6044     128
82              Judith Jeyafreeda Andrew    0.5191     129
83              UTB-NLP                     0.5185     130
84              NLP_CHRISTINE               0.5029     131

References

Harika Abburi, Shradha Sehgal, Himanshu Maheshwari, and Vasudeva Varma. 2021. Knowledge-based neural framework for sexism detection and classification. In IberLEF@SEPLN, pages 402–414.

Sabur Butt, Noman Ashraf, Grigori Sidorov, and Alexander Gelbukh. 2021. Sexism identification using BERT and data augmentation - EXIST 2021. CEUR Workshop Proceedings, 2943:381–389.

Lily Davies, Marta Baldracchi, Carlo Alessandro Borella, and Konstantinos Perifanos. 2021. Transformer ensembles for sexism detection.

Hannah Rose Kirk, Wenjie Yin, Bertie Vidgen, and Paul Röttger. 2023. SemEval-2023 Task 10: Explainable Detection of Online Sexism. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023). Association for Computational Linguistics.

Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. 2016. Adversarial training methods for semi-supervised text classification.

Alejandro Mosquera. 2022a. Alejandro Mosquera at PoliticES 2022: Towards robust Spanish author profiling and lessons learned from adversarial attacks. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2022), co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2022), A Coruña, Spain, September 20, 2022, volume 3202 of CEUR Workshop Proceedings. CEUR-WS.org.

Alejandro Mosquera. 2022b. Amsqr at SemEval-2022 task 4: Towards AutoNLP via meta-learning and adversarial data augmentation for PCL detection. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 485–489, Seattle, United States. Association for Computational Linguistics.

Alejandro Mosquera and Paloma Moreda. 2012. TENOR: A lexical normalisation tool for Spanish Web 2.0 texts. In Text, Speech and Dialogue, pages 535–542, Berlin, Heidelberg. Springer Berlin Heidelberg.

Ji Ho Park and Pascale Fung. 2017. One-step and two-step classification for abusive language detection on Twitter.

Georgios K. Pitsilis, Heri Ramampiaro, and Helge Langseth. 2018. Effective hate-speech detection in Twitter data using recurrent neural networks. Applied Intelligence, 48(12):4730–4742.

Francisco Rodríguez-Sánchez, Jorge Carrillo-de Albornoz, and Laura Plaza. 2020. Automatic classification of sexism in social networks: An empirical study on Twitter data. IEEE Access, 8:219563–219576.

Mattia Samory, Indira Sen, Julian Kohne, Fabian Floeck, and Claudia Wagner. 2020. "Call me sexist, but...": Revisiting sexism detection using psychological scales and adversarial samples.

Indira Sen, Mattia Samory, Claudia Wagner, and Isabelle Augenstein. 2022. Counterfactually augmented data and unintended bias: The case of sexism and hate speech detection.

Janet K. Swim, Robyn Mallett, and Charles Stangor. 2004. Understanding subtle sexism: Detection and use of sexist language. Sex Roles, 51:117–128.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks.

Bertie Vidgen and Leon Derczynski. 2021. Directions in abusive language training data, a systematic review: Garbage in, garbage out. PLOS ONE, 15(12):1–32.
