Alejandro Mosquera López is an online safety expert and Kaggle Grandmaster working in cybersecurity. His main research interests are Trustworthy AI and NLP. ORCID iD icon

Saturday, February 17, 2024

Detecting LLM hallucinations and overgeneration mistakes @ SemEval 2024

  The modern NLG landscape is plagued by two interlinked problems: On the one hand, our current neural models have a propensity to produce inaccurate but fluent outputs; on the other hand, our metrics are most apt at describing fluency, rather than correctness. This leads neural networks to “hallucinate”, e.g., produce fluent but incorrect outputs that we currently struggle to detect automatically. For many NLG applications, the correctness of an output is however mission-critical. For instance, producing a plausible-sounding translation that is inconsistent with the source text puts in jeopardy the usefulness of a machine translation pipeline. For this reason, SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes aims to foster the growing interest in this topic in the community.

In this competition participants were asked to perform binary classification to identify cases of fluent overgeneration hallucinations in two different setups: model-aware and model-agnostic tracks. In order to do this, they had to detect grammatically sound outputs which contain incorrect or unsupported semantic information, inconsistent with the source input, with or without having access to the model that produced the output.

The evaluated approach using a simple linear combination of reference models ranked 3rd in the model-agnostic track with a 0.826 accuracy.

Monday, June 5, 2023

Hackaprompt-2023 @ AICrowd write up

HackAPrompt was a prompt hacking competition aimed at enhancing AI safety and education by challenging participants to outsmart large language models (e.g. ChatGPT, GPT-3). In particular, encouraged participants to attempt to hack through many prompt hacking defenses as possible.

The task organizers provided a set of 10 challenges in which the participants should bypass the expected model behavior and force the LLM to reply "I have been PWNED" instead. In order to do that, users could make use of 3 different LLMs: GPT-3 (text-davinci-003), ChatGPT (gpt-3.5-turbo), or FlanT5 -XXL.

Saturday, May 13, 2023

Living off the land: Solving ML problems without training a single model


The concept of living off the land is related to surviving on what you can forage, hunt, or grow in nature.

Considering the current Machine Learning landscape, we can draw a parallelism between living off the land and "shopping around" for ready-made models for a given task.  While this has been partially true for some time thanks to model repositories such as HuggingFace, it still required some degree of involvement by applying finetuning or retraining for most advanced use cases. 

However, the appearance of large language models (LLMs) with instruction-following capabilities beyond next-word prediction has opened the doors to many applications that require little supervision, and in some cases, true 100% no-code solutions.

In this post I will be describing a recent "living off the land" approach in order to solve an NLP competitive ML challenge: WASSA 2023, An ACL shared Task on Empathy Emotion and Personality Detection in Interactions

Thursday, February 16, 2023

Pretrained Models with Adversarial Training for Online Sexism Detection @ SemEval 2023


Adversarial training can provide neural networks with significantly improved resistance to adversarial attacks, thus improving model robustness. However, a major drawback of many existing adversarial training workflows is the computational cost and extra processing time when using data augmentation techniques. This post explores the application of embedding perturbations via the fast gradient method (FGM) when finetuning large language models (LLMs) to short text classification tasks. This adversarial training approach has been evaluated as part of the first sub-task of SemEval 2023-Task 10, focused on explainable detection of sexism in social networks (EDOS). Empirical results show that adversarially finetuned models with FGM had on average a 25% longer training time and 0.2% higher F1 than their respective baselines. 

Tuesday, January 24, 2023

The string similarity problem

For two strings A and B (in the ASCII [a-z] range), we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.

The reader is asked to calculate the sum of similarities of a string S with each of its suffixes. Reference (

Wednesday, December 7, 2022

Revisiting the Microsoft Malware Classification Challenge (BIG 2015) in 2022

 In 2015, Microsoft provided the data science community with an unprecedented malware dataset and encouraging open-source progress on effective techniques for grouping variants of malware files into their respective families. Formatted as a Kaggle Competition, it featured a very large (for that time) dataset comprising of almost 40GB of compressed files containing disarmed malware samples and their corresponding disassembled ASM code.

Tuesday, December 6, 2022

On the Intriguing Properties of Backdoored Neural Networks


Malicious actors can alter the expected behavior of a neural network in order to respond to data containing certain triggers only known to the attacker, without disrupting model performance when presented with normal inputs. An adversary will commonly force these misclassifications by either performing trigger injection [19] or dataset poisoning [6]. Less popular techniques that operate at hardware level such as manipulating the binary code of a neural network or the tainting the physical circuitry [26, 8] can be equally effective.

Tuesday, September 27, 2022

AI Village Capture the Flag @ DEFCON write up

 In August 2022 I had the chance to participate in an AI-themed CTF collocated with the DEF CON 30 security (hacking) conference. This was particularly interesting since it was presented in a novel format as a Kaggle competition where the leaderboard was ranked based on the points that each of the discovered flags was providing. Despite entering the competition in its latest stage I did manage to solve all the challenges but two, therefore achieving the second best score (although my final ranking was lower due to submission times being used as tie-breakers). No one was able to find the last flag corresponding to the Crop-2 challenge until after the CTF ended.

Thursday, March 17, 2022

Defending and attacking ML Malware Classifiers for Fun and Profit: 2x prize winner at MLSEC-2021

MLSEC (Machine Learning Security Evasion Competition) is an initiative sponsored by Microsoft and partners CUJO AI, NVIDIA, VMRay, and MRG Effitas with the purpose of raising awareness of the expanding attack surface which is now also affecting AI-powered systems. 

In its 3rd edition the competition allowed defenders and attackers to exercise their security and machine learning skills under a plausible threat model: evading antimalware and anti-phishing filters. In the competition, defenders aimed to detect evasive submissions by using machine learning (ML), and attackers attempted to circumvent those detections.

Towards Machines that Capture and Reason with Science Knowledge

 In 2015 I took part on a machine learning competition hosted on Kaggle aiming to solve a multiple-question 8th grade science test. At that time there weren't large pretrained models to leverage and (unsurprisingly) best performing models were IR-based that would barely achieve a GPA of 1.0 in the US grading system: