🏷️ 🔫 Faster data annotation with a zero-shot text classifier

TL;DR

  1. This tutorial shows a simple data annotation example with Rubrix: using a zero-shot classification model to pre-annotate data and then hand-label it more efficiently.

  2. We use the new SELECTRA zero-shot classifier and the Spanish part of MLSUM, a multilingual dataset for text summarization.

  3. Two data annotation rounds are performed: (1) labeling random examples, and (2) bulk labeling examples with high-score predictions.

  4. Besides boosting the labeling process, this workflow lets you evaluate the performance of zero-shot classification for a specific use case. In this example use case, we observe the pre-trained zero-shot classifier provides pretty decent results, which might be enough for general news categorization.

Why

  • The availability of pre-trained language models with zero-shot capabilities means you can sometimes accelerate your data annotation tasks by pre-annotating your corpus with a pre-trained zero-shot model.

  • The same workflow can be applied if there is a pre-trained “supervised” model that fits your categories but needs fine-tuning for your own use case. For example, fine-tuning a sentiment classifier for a very specific type of message.

Setup Rubrix

Rubrix is a free and open-source tool to explore, annotate, and monitor data for NLP projects.

If you are new to Rubrix, check out the ⭐ GitHub repository.

If you have not installed and launched Rubrix, check the Setup and Installation guide.

Once installed, you only need to import Rubrix:

[ ]:
import rubrix as rb

Install dependencies

For this tutorial we only need to install a few additional dependencies:

[ ]:
%pip install transformers datasets torch -qqq

1. Load the Spanish zero-shot classifier: Selectra

We will use the recently released SELECTRA zero-shot classifier, a model for zero-shot classification in Spanish.

[ ]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                       model="Recognai/zeroshot_selectra_medium")

2. Loading the MLSUM dataset

MLSUM is a large-scale multilingual text summarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages: French, German, Spanish, Russian and Turkish. To illustrate the labeling process, in this tutorial we will only use the first 500 examples of its Spanish test set.

[ ]:
from datasets import load_dataset

mlsum = load_dataset("mlsum", "es", split="test[0:500]")

3. Making zero-shot predictions

The zero-shot classifier allows you to provide arbitrary candidate labels, which it will use for its predictions. Since this zero-shot classifier is based on natural language inference (NLI) under the hood, we need to convert each candidate label into a “hypothesis”. For this we use a hypothesis_template, in which the {} is replaced by each of our candidate labels. This template can have a big effect on the scores of your predictions and should be adapted to your use case.
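
To make the role of the template concrete, the following sketch expands a template for a couple of labels using plain Python string formatting. It is only an illustration of the substitution step, assuming the classifier fills the {} placeholder in the same way; the shortened label list is chosen for the example.

```python
# Illustrative only: mimic how a hypothesis template is expanded per label.
hypothesis_template = "Esta noticia habla de {}."
candidate_labels = ["política", "deportes"]  # shortened, illustrative list

# One hypothesis per candidate label; an NLI model would then score how well
# each hypothesis is entailed by the input text.
hypotheses = [hypothesis_template.format(label) for label in candidate_labels]

for hypothesis in hypotheses:
    print(hypothesis)
```

Reading the expanded hypotheses aloud is a quick sanity check: if they do not sound like natural statements about your texts, the template probably needs rewording.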

[ ]:
# We adapted the hypothesis to our use case of predicting the topic of news articles
hypothesis_template = "Esta noticia habla de {}."
# The candidate labels for our zero-shot classifier
candidate_labels = ["política", "cultura", "sociedad", "economía", "deportes", "ciencia y tecnología"]

# Make predictions batch-wise
def make_prediction(rows):
    predictions = classifier(
        rows["summary"],
        candidate_labels=candidate_labels,
        hypothesis_template=hypothesis_template
    )
    return {key: [pred[key] for pred in predictions] for key in predictions[0]}

mlsum_with_predictions = mlsum.map(make_prediction, batched=True, batch_size=8)
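
The return statement in make_prediction may look cryptic at first. With batched=True, the mapped function is expected to return a dictionary whose values are lists with one entry per example, while the classifier returns one dictionary per example. The sketch below shows that transposition on made-up data; the texts and scores are hypothetical and only the key names mirror the classifier output.

```python
# Hypothetical classifier output for a batch of two texts (scores are made up).
predictions = [
    {"sequence": "texto 1", "labels": ["deportes", "política"], "scores": [0.9, 0.1]},
    {"sequence": "texto 2", "labels": ["política", "deportes"], "scores": [0.7, 0.3]},
]

# Transpose: list of per-example dicts -> dict of per-column lists.
columns = {key: [pred[key] for pred in predictions] for key in predictions[0]}

print(columns["labels"])
print(columns["scores"])
```

Each resulting key becomes a new column of the dataset, which is why the later cells can read row["labels"] and row["scores"].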

4. Logging predictions in Rubrix

Let us log the examples to Rubrix and start our hand-labeling session, which will hopefully become more efficient with the zero-shot predictions.

[ ]:
records = []

for row in mlsum_with_predictions:
    records.append(
        rb.TextClassificationRecord(
            inputs=row["summary"],
            prediction=list(zip(row['labels'], row['scores'])),
            prediction_agent="zeroshot_selectra_medium",
            metadata={"topic": row["topic"]}
        )
    )

[ ]:
rb.log(records, name="zeroshot_noticias", metadata={"tags": "data-annotation"})

5. Hand-labeling session

Let’s do two data annotation sessions.

Label first 20 random examples

Labeling random or sequential examples is always recommended to get a sense of the data distribution, the usefulness of zero-shot predictions, and the suitability of the labeling scheme (the target labels). Typically, this is how you will build your first test set, which you can then use to validate the downstream supervised model.
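
If you prefer to draw the random sample programmatically rather than picking records by eye, one possible approach is sketched below. It is a minimal example and not part of the Rubrix workflow itself: the constant names and the seed are assumptions made for illustration.

```python
import random

NUM_RECORDS = 500  # size of the subset used in this tutorial
SAMPLE_SIZE = 20   # number of examples to label in the first round

# A fixed (arbitrary) seed makes the sample reproducible across runs.
rng = random.Random(42)

# Draw distinct record positions without replacement.
sample_ids = sorted(rng.sample(range(NUM_RECORDS), SAMPLE_SIZE))

print(sample_ids)
```

Sampling without replacement guarantees that no record is selected twice, so the first test set contains exactly 20 distinct examples.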

Label records with high score predictions

In this case, we will use bulk-labeling (labeling a set of records with a few clicks) after quickly reviewing high score predictions from our zero-shot model. The main idea is that above a certain score, the predictions from this model are more likely to be correct.
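
The selection logic behind this round can be sketched in a few lines of plain Python. The records, the scores, and the 0.9 threshold below are all hypothetical; in practice you would choose the threshold after reviewing how reliable the predictions look on your own data.

```python
# Hypothetical predictions: record id -> list of (label, score) pairs,
# the same shape as the predictions logged to Rubrix in this tutorial.
predictions_by_id = {
    0: [("deportes", 0.95), ("política", 0.03)],
    1: [("política", 0.55), ("economía", 0.40)],
    2: [("cultura", 0.92), ("sociedad", 0.05)],
    3: [("deportes", 0.91), ("cultura", 0.04)],
}

SCORE_THRESHOLD = 0.9  # assumed cut-off, tune it for your own data


def top_prediction(pairs):
    """Return the (label, score) pair with the highest score."""
    return max(pairs, key=lambda pair: pair[1])


# Group the ids of confident records by their predicted label, so that each
# group can be reviewed and bulk-labeled together.
bulk_candidates = {}
for record_id, pairs in predictions_by_id.items():
    label, score = top_prediction(pairs)
    if score >= SCORE_THRESHOLD:
        bulk_candidates.setdefault(label, []).append(record_id)

print(bulk_candidates)
```

Record 1 is left out because its best score is below the threshold; such records are better labeled one by one.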

Next steps

If you are interested in the topic of zero-shot models, check out the tutorial for using Rubrix with Flair’s zero-shot NER.

🙋‍♀️ Join the Rubrix community! A good place to start is the discussion forum.

⭐ Star the Rubrix GitHub repo to stay updated.
