Alejandro Mosquera López is an online safety expert and Kaggle Grandmaster working in cybersecurity. His main research interests are Trustworthy AI and NLP. ORCID: https://orcid.org/0000-0002-6020-3569

Saturday, May 13, 2023

Living off the land: Solving ML problems without training a single model

Introduction

The concept of living off the land is related to surviving on what you can forage, hunt, or grow in nature.

Considering the current machine learning landscape, we can draw a parallel between living off the land and "shopping around" for ready-made models for a given task. While this has been partially true for some time thanks to model repositories such as HuggingFace, most advanced use cases still required some degree of involvement in the form of fine-tuning or retraining.

However, the emergence of large language models (LLMs) with instruction-following capabilities beyond next-word prediction has opened the door to many applications that require little supervision and, in some cases, allow true 100% no-code solutions.

In this post I will describe a recent "living off the land" approach to solving a competitive NLP challenge: WASSA 2023, an ACL shared task on Empathy, Emotion, and Personality Detection in Interactions.

The task

Emotion is a concept that is challenging to describe. Yet, as human beings, we understand the emotional effect situations have, or could have, on us and other people. How can we transfer this knowledge to machines? Is it possible to learn the link between situations and the emotions they trigger in an automatic way? And what about empathy: how does it link to emotion and personality?

We know that some LLMs have good built-in knowledge of many human-made concepts, such as sentiment or polarity, among others. Therefore, it would be reasonable to attempt to solve an emotion classification task in a zero-shot fashion. On the other hand, NLP research on emotion detection is a well-worn path, and I would expect pre-trained resources to perform reasonably well in this area.

Approach

In order to validate the above hypothesis, I decided to classify the emotion of a given text without training any model. With that self-imposed limitation in mind, the first step was investigating potential approaches:

  • Next word prediction: If we transform emotion classification into a next-word prediction task, we can rewrite the original sentence using prompt templates such as:

    'This article produces ' + mask + '. {}'
    'The following text causes ' + mask + '. {}'
    'The emotion in this article is ' + mask + ': {}'

    where the mask is a special, model-dependent token.

  • Prompt engineering: We can simply "ask" a text completion model for the solution, e.g.

    prompt = "Classify the emotion of a text between the following options: sadness, neutral, anger, disgust, fear, hope, joy and surprise.\n\nText: \"" + text + "\"\nEmotion:"
  • Pre-trained emotion models: Finally, since this is a relatively common NLP task, we can just leverage whatever text classification models are available in the wild.
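The two prompt-based options above boil down to string building. Below is a minimal sketch of how those prompts might be assembled (the helper function names are mine, and the actual model calls are omitted; the templates follow the post):

```python
# Candidate labels used throughout the post.
EMOTIONS = ["sadness", "neutral", "anger", "disgust", "fear", "hope", "joy", "surprise"]

# Templates for the next-word (masked-token) prediction approach.
MLM_TEMPLATES = [
    "This article produces {mask}. {text}",
    "The following text causes {mask}. {text}",
    "The emotion in this article is {mask}: {text}",
]

def mlm_prompts(text, mask_token="[MASK]"):
    """Rewrite a sentence as masked-token prediction inputs.

    The mask token is model-dependent, e.g. "[MASK]" for BERT
    or "<mask>" for RoBERTa.
    """
    return [t.format(mask=mask_token, text=text) for t in MLM_TEMPLATES]

def completion_prompt(text):
    """Ask a text completion model directly for the label."""
    options = ", ".join(EMOTIONS[:-1]) + " and " + EMOTIONS[-1]
    return ("Classify the emotion of a text between the following options: "
            + options + ".\n\nText: \"" + text + "\"\nEmotion:")
```

The masked prompts would then be scored by comparing the model's probability for each candidate emotion at the mask position, while the completion prompt is sent as-is to a model such as text-davinci-003.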

Evaluation

Using 4-fold cross-validation:

| Model | Approach | F1 (macro) |
|---|---|---|
| gpt-3.5-turbo | Prompt engineering | 0.2893 |
| AdapterHub/roberta-base-pf-emotion | Pretrained emotion model | 0.2842 |
| text-davinci-003 | Prompt engineering | 0.2828 |
| j-hartmann/emotion-english-roberta-large | Pretrained emotion model | 0.2674 |
| flan-t5-base | Prompt engineering | 0.2396 |
| bert-base-uncased | Next word prediction | 0.2164 |
| alpaca-lora-7b | Prompt engineering | 0.21 |
| j-hartmann/emotion-english-distilroberta-base | Pretrained emotion model | 0.1912 |
| opt-iml-max-1.3b | Prompt engineering | 0.1837 |
| HuggingChat | Prompt engineering | 0.1634 |
| roberta-base | Next word prediction | 0.1628 |
| text-davinci-002 | Prompt engineering | 0.1426 |
| bart-large-mnli | Next word prediction | 0.1412 |
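For reference, the macro-averaged F1 used above is the unweighted mean of the per-class F1 scores, so rare emotions count as much as frequent ones. A quick sketch with scikit-learn (the label lists are illustrative):

```python
from sklearn.metrics import f1_score

# Toy gold labels and predictions over a few of the emotion classes.
y_true = ["joy", "anger", "sadness", "joy", "fear"]
y_pred = ["joy", "anger", "joy", "joy", "fear"]

# average="macro": compute F1 per class, then take the unweighted mean,
# so a class the model never predicts (here, sadness) drags the score down.
macro = f1_score(y_true, y_pred, average="macro")
```

This is why macro F1 scores on an 8-class emotion task look low in absolute terms: a model can have decent accuracy while still scoring poorly if it neglects minority classes.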

Results

Below are the results during the training phase of the competition. The final submission was based on a weighted average of the outputs of the models above.
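The exact ensemble weights are not given here; the following is a minimal sketch of weighted-averaging per-class probabilities across models (the ensemble helper and the weights are illustrative, not the submitted configuration):

```python
import numpy as np

EMOTIONS = ["sadness", "neutral", "anger", "disgust", "fear", "hope", "joy", "surprise"]

def ensemble(prob_matrices, weights):
    """Weighted average of per-model probability matrices.

    prob_matrices: list of (n_samples, n_classes) arrays, one per model.
    weights: one weight per model; normalized to sum to 1.
    Returns the predicted emotion label for each sample.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                            # normalize the weights
    stacked = np.stack(prob_matrices)          # (n_models, n_samples, n_classes)
    avg = np.tensordot(w, stacked, axes=1)     # (n_samples, n_classes)
    return [EMOTIONS[i] for i in avg.argmax(axis=1)]
```

Models that output only a hard label (e.g. a chat LLM) can still participate by contributing a one-hot probability vector.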

| # | User | Entries | Last Entry | Team Name | Macro F1 | Micro F1 | Micro Jaccard | Macro Prec. | Macro Rec. | Micro Prec. | Micro Rec. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | adityapatkar | 19 | 04/17/23 | | 0.579 (1) | 0.736 (1) | 0.583 (1) | 0.571 (3) | 0.625 (1) | 0.729 (2) | 0.744 (1) |
| 2 | gauravk | 8 | 04/26/23 | Team Converge | 0.544 (2) | 0.703 (2) | 0.542 (2) | 0.604 (2) | 0.537 (2) | 0.690 (5) | 0.715 (2) |
| 3 | amsqr | 3 | 04/30/23 | Alejandro Mosquera | 0.527 (3) | 0.670 (4) | 0.503 (4) | 0.622 (1) | 0.500 (5) | 0.720 (4) | 0.626 (6) |
| 4 | Cordyceps | 18 | 05/05/23 | | 0.504 (4) | 0.647 (6) | 0.478 (6) | 0.567 (4) | 0.501 (4) | 0.628 (7) | 0.667 (3) |
| 5 | anedilko | 1 | 04/30/23 | Bias Busters | 0.462 (5) | 0.572 (8) | 0.400 (8) | 0.502 (7) | 0.523 (3) | 0.542 (8) | 0.606 (8) |
| 6 | surajtc | 2 | 04/22/23 | | 0.425 (6) | 0.661 (5) | 0.493 (5) | 0.525 (6) | 0.370 (8) | 0.721 (3) | 0.610 (7) |
| 7 | lazyboy.blk | 12 | 04/22/23 | Team Name | 0.402 (7) | 0.696 (3) | 0.534 (3) | 0.556 (5) | 0.380 (7) | 0.760 (1) | 0.642 (5) |
| 8 | warrior1127 | 5 | 05/02/23 | SAIL | 0.388 (8) | 0.548 (9) | 0.377 (9) | 0.373 (9) | 0.483 (6) | 0.473 (10) | 0.650 (4) |
| 9 | hammadfahim | 4 | 04/19/23 | | 0.248 (9) | 0.476 (10) | 0.312 (10) | 0.399 (8) | 0.292 (9) | 0.519 (9) | 0.439 (10) |
| 10 | kunwarv4 | 5 | 05/01/23 | VISU_UNiCA | 0.180 (10) | 0.595 (7) | 0.423 (7) | 0.158 (10) | 0.213 (10) | 0.649 (6) | 0.549 (9) |

The dev results were quite good in comparison with other (likely supervised) models. In the test phase the final submission scored an even higher macro F1 (0.533) but ranked lower against other teams.

| # | User | Entries | Last Entry | Team Name | Macro F1 | Micro F1 | Micro Jaccard | Macro Prec. | Macro Rec. | Micro Prec. | Micro Rec. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | adityapatkar | 4 | 05/10/23 | | 0.701 (1) | 0.750 (1) | 0.600 (1) | 0.810 (1) | 0.677 (2) | 0.778 (1) | 0.724 (3) |
| 2 | anedilko | 2 | 05/10/23 | Bias Busters | 0.647 (2) | 0.700 (6) | 0.538 (6) | 0.630 (6) | 0.730 (1) | 0.626 (8) | 0.793 (1) |
| 3 | luxinxyz | 4 | 05/09/23 | tRNA | 0.644 (3) | 0.720 (2) | 0.562 (2) | 0.721 (4) | 0.631 (4) | 0.743 (3) | 0.698 (4) |
| 4 | gauravk | 5 | 05/13/23 | Team Converge | 0.628 (4) | 0.707 (4) | 0.547 (4) | 0.700 (5) | 0.622 (5) | 0.717 (6) | 0.698 (4) |
| 5 | lazyboy.blk | 2 | 05/10/23 | Team Name | 0.612 (5) | 0.713 (3) | 0.554 (3) | 0.776 (2) | 0.600 (6) | 0.770 (2) | 0.664 (6) |
| 6 | amsqr | 4 | 05/09/23 | Alejandro Mosquera | 0.533 (6) | 0.673 (7) | 0.507 (7) | 0.752 (3) | 0.479 (8) | 0.723 (5) | 0.629 (7) |
| 7 | surajtc | 11 | 05/10/23 | | 0.522 (7) | 0.622 (8) | 0.451 (8) | 0.463 (8) | 0.668 (3) | 0.527 (10) | 0.759 (2) |
| 8 | alili_wyk | 4 | 05/11/23 | YNU-HPCC | 0.514 (8) | 0.703 (5) | 0.542 (5) | 0.575 (7) | 0.502 (7) | 0.736 (4) | 0.672 (5) |
| 9 | mimmu3302 | 37 | 05/11/23 | | 0.332 (9) | 0.546 (10) | 0.376 (10) | 0.394 (9) | 0.322 (9) | 0.590 (9) | 0.509 (9) |
| 10 | kunwarv4 | 2 | 05/11/23 | VISU_UNiCA | 0.284 (10) | 0.593 (9) | 0.421 (9) | 0.282 (11) | 0.318 (10) | 0.640 (7) | 0.552 (8) |
| 11 | Sidpan2 | 3 | 05/12/23 | SidShank | 0.263 (11) | 0.400 (11) | 0.250 (11) | 0.299 (10) | 0.249 (11) | 0.404 (11) | 0.397 (10) |


Conclusion

Overall, the results above show that it is possible to generate strong baselines for some NLP tasks with very little code and without training a single model. This approach ranked 3rd during the dev phase and outperformed 50% of the competing teams in the final results.

