Zero-Shot Topic Classification

Jed Lee
7 min read · Sep 8, 2022


Using Language Transformers for Natural Language Processing (NLP)

Using Hugging Face’s Transformer as the core of ZSTC

Content of this Article

  1. Introduction
  2. The dilemma of Text Classification
  3. What is Zero-Shot Learning in the first place?
  4. Traditional Supervised Learning VS Zero-Shot Learning
  5. Let us get our hands dirty with code!
  6. Introducing Hugging Face + Demo
  7. Conclusion & Limitations

Introduction

Natural Language Processing (NLP) is a subfield of machine learning that essentially allows computers to understand, analyze, and interpret human language.

Language transformers like BERT have pushed the boundaries of what is possible in NLP. They have been used to solve problems such as topic classification, text summarization, and question answering. They have also spawned variants including RoBERTa (larger), DistilBERT (smaller), and XLM (multilingual).

Dilemma

Text classification is a Natural Language Processing (NLP) task where a model predicts the class of a text document. One of the biggest dilemmas we encounter is getting labelled data: almost all existing text classification models require a large amount of it. To avoid data labelling, we can use zero-shot learning, which aims to build models with very little or even no labelled data. When this kind of learning is applied to text classification, the whole process is called Zero-Shot Topic Classification.

What is Zero-Shot Learning?

How does zero-shot learning work without a large amount of labelled data?

Models based on Zero-Shot Learning rely on labelled data for a set of seen classes, together with knowledge about how unseen classes relate semantically to those seen classes. Such a procedure enables a model to recognise unseen classes that were never labelled at training time. We can say that this type of learning predicts new classes by learning intermediate semantic layers and their attributes.

Sounds pretty confusing? Let me give you an example!

Horse Zebra example to explain Zero-Shot Learning

Suppose a child is asked to recognise a zebra at a zoo. The child has never seen a zebra before but has seen a horse. By telling the child that a zebra looks very much like a horse but has black and white stripes, the child is able to recognise a zebra pretty easily!

Putting Zero-Shot Learning in context, it essentially means learning from one set of known labels and then being evaluated on a different set of labels that the classifier has never seen before.

If you are still unclear about Zero-Shot Learning, you can find out more in this article.

Comparing Traditional Supervised Learning and Zero-Shot Learning.

In Traditional Supervised Learning, we start with some data we want to recognise, and this data lives in a feature space. Annotations are provided for the data, which gives us labels in that feature space. In the screenshots below, Monkey, Cat, and Dog are the labels. From these, the Supervised Classifier can divide up the feature space according to the labels, as shown on the right.

Image from Dr Timothy Hospedales's YouTube video, as linked below.

However, when you introduce new labels that the model has not seen before, they will still live in the same feature space, but the classifier will not know what to do with them because it has not been trained on those labels.

In Zero-Shot Learning, the key idea is to embed categories as vectors. Instead of treating output labels as discrete values, we map each label to a multi-dimensional vector in a shared vector space. At test time, this lets us embed any label (seen or unseen) into the same space as the input and measure the distance between them. Working with a vector space of labels is what allows us to generalise to new labels, which is the crux of the Zero-Shot Classifier.
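To make the "labels as vectors" idea concrete, here is a minimal sketch using the sentence-transformers library (which is not the classifier we build later in this article); the model name and labels are assumptions chosen purely for illustration.

# Illustration only: embed a text and arbitrary labels into one vector space
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

text = "A notorious gangster holds off the police in a six-hour gunfight."
labels = ["Crime", "Romance", "Cooking"]  # labels never seen during training

text_vec = encoder.encode(text, convert_to_tensor=True)
label_vecs = encoder.encode(labels, convert_to_tensor=True)

# The label whose vector is closest to the text vector is the predicted topic
scores = util.cos_sim(text_vec, label_vecs)[0]
for label, score in zip(labels, scores):
    print(f"{label}: {score:.3f}")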

Let us get our hands dirty with code!

I will be using a dataset from Kaggle on Netflix Shows! More specifically, the description data in the dataset.
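For reference, loading the data is straightforward with pandas. The file name netflix_titles.csv is an assumption based on the Kaggle dataset, so adjust it to whatever your download is called.

import pandas as pd

# Assumed file name from the Kaggle Netflix Shows dataset; adjust if yours differs
df = pd.read_csv("netflix_titles.csv")
print(df["description"].head())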

Hugging Face 🤗

Before diving into the code, I have to introduce Hugging Face 🤗. Hugging Face is a community and data science platform that provides tools that enable users to build, train and deploy ML models based on open source code and technologies.

We will be utilising 🤗 Hugging Face’s Transformers to perform Zero-Shot Classification.

Here is a live demo from the Hugging Face team, along with a sample Colab notebook.

Demo Link: https://huggingface.co/zero-shot/

Instantiation

I will be using Hugging Face's pipeline to create our classifier. It takes two main inputs: task and model. A list of potential tasks can be found here.

For our purposes, we will be using the task “zero-shot-classification”. Next, the model parameter specifies which zero-shot model we wish to use. A list of potential models can be found here. We will be using a model called “facebook/bart-large-mnli” which is, as of right now, the most downloaded model.
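As a rough sketch of this instantiation step (assuming the transformers library is installed, e.g. via pip install transformers), the code looks something like this:

# Build a zero-shot classifier backed by the facebook/bart-large-mnli model
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-classification",
    model="facebook/bart-large-mnli",
)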

You can find out way more in-depth about the “zero-shot-classification” classifier task here. They have a blog post for a more expansive introduction to this and other zero-shot methods which I strongly recommend giving a read.

Classifications

The classification task uses three main parameters which are:

  • sequences_to_classify: the text or sequence you want predictions for.
  • candidate_labels: a list of all the candidate labels we want predictions for. You can pass one or multiple labels, and these labels do not need to have been seen during training.
  • multi_label: a boolean value. Set multi_label = True if we want to perform multi-label classification, in which case the prediction probabilities are independent: each value is between 0 and 1 and the sum is not necessarily 1. If multi_label = False, the probability scores sum to 1. This is a very powerful feature in my opinion, as you can get multiple class predictions that are independent of each other. What this means is that if you have two similar candidate labels, say Horror and Fear, the model will give a prediction probability for each of them individually.

Implementation
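The original code gist is not embedded here, but based on the parameters above and the output shown below, the classification call would look roughly like this (the sequence and candidate labels are taken directly from the example output):

# The example description and candidate labels from the output below
sequence_to_classify = (
    "Based on a true story, this action film follows an incident that stunned "
    "a nation in the early 1990s. In Mumbai, India, the notorious gangster Maya "
    "holds off veteran cop Khan and a force of more than 200 policemen in a "
    "six-hour bloody gunfight."
)
candidate_labels = ["Actions", "Violence", "Crime", "Adventure", "Finance", "Food"]

# multi_label=True scores every label independently, so the scores need not sum to 1
result = classifier(sequence_to_classify, candidate_labels, multi_label=True)
print(result)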

Here is the output:

{'sequence': 'Based on a true story, this action film follows an incident that stunned a nation in the early 1990s. In Mumbai, India, the notorious gangster Maya holds off veteran cop Khan and a force of more than 200 policemen in a six-hour bloody gunfight.',
 'labels': ['Actions', 'Violence', 'Crime', 'Adventure', 'Finance', 'Food'],
 'scores': [0.8735317587852478,
  0.8443195819854736,
  0.4216251075267792,
  0.21568214893341064,
  0.004167867824435234,
  0.0024876173119992018]}

The output is a dictionary with three main keys:

  • sequence: the original sequences/text used for the prediction
  • labels: the list of all the candidate labels used for prediction.
  • scores: the list of probability scores corresponding to the labels.

We can see that the text has been predicted as Actions (87% confidence), Violence (84% confidence), Crime (42% confidence), Adventure (22% confidence), Finance (0% confidence), and Food (0% confidence).

As you can observe, the scores the model assigns to each of these new, previously unseen labels are mostly sensible despite the limited information available in the text.

Further Steps

Next, let us create a function that allows us to perform the classification task on each row of the dataset.

Key Note: Remember to change “description” to the name of your text column!
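One possible version of that helper, assuming the pandas DataFrame loaded earlier and an illustrative list of candidate labels, might look like this:

# Illustrative candidate labels; replace them with the topics you care about
candidate_labels = ["Action", "Romance", "Crime", "Comedy", "Documentary"]

def classify_row(text):
    """Run zero-shot classification on one description and return a label-to-score dict."""
    result = classifier(text, candidate_labels, multi_label=True)
    return dict(zip(result["labels"], result["scores"]))

# Apply the classifier to every row; change "description" to the name of your text column
df["topic_scores"] = df["description"].apply(classify_row)
print(df[["description", "topic_scores"]].head())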

Here is the output:

Image by Author.

From this, you can easily set your own threshold and derive actionable insights from this information.
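For example, a simple cut-off could keep only the labels the model is reasonably confident about. The 0.8 threshold below is just an assumption for illustration; tune it to your own use case.

THRESHOLD = 0.8  # assumed cut-off, not a recommendation

# Keep only the labels whose independent probability exceeds the threshold
df["predicted_topics"] = df["topic_scores"].apply(
    lambda scores: [label for label, score in scores.items() if score >= THRESHOLD]
)
print(df[["description", "predicted_topics"]].head())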

Conclusion and Limitations

With pre-trained zero-shot text classification models, you can classify text into an arbitrary list of categories by relying on a large trained model from transformers.

One limitation of Zero-Shot Learning is that when the topic is a more abstract term relative to the text, the prediction probabilities may not be as accurate. Some tasks require a higher level of performance, and for those a trained classifier will remain the preferred option.

For specialized use cases, when text is based on specific words or terms, it is better to go with a supervised classification model. For general topics, the zero-shot model works amazingly well.

Zero-Shot classification is definitely worth exploring and keeping an eye on. A closely related approach is Few-Shot Learning, where the model learns the underlying pattern in the data from just a few training samples.

Thanks so much for reading my article!!! Here is the GitHub File that I used for this article.
