🤗 Using Rubrix to explore NLP data with Hugging Face datasets and transformers

In this tutorial, we will walk through the process of using Rubrix to explore NLP datasets in combination with the amazing datasets and transformers libraries from Hugging Face.

Introduction

Our goal is to show you how to store and explore NLP datasets using Rubrix for use cases like training data management or model evaluation and debugging.

The tutorial is organized into three parts:

  1. Storing and exploring text classification data: We will use the 🤗 datasets library and Rubrix to store text classification datasets.

  2. Storing and exploring token classification data: We will use the 🤗 datasets library and Rubrix to store token classification data.

  3. Exploring predictions: We will use a pretrained 🤗 transformers model and store its predictions into Rubrix to explore and evaluate our pretrained model.

Install tutorial dependencies

In this tutorial we will be using the transformers and datasets libraries. If you do not have them installed, run:

[ ]:
%pip install torch -qqq
%pip install transformers -qqq
%pip install datasets -qqq
%pip install tqdm -qqq # for progress bars

Setup Rubrix

If you have not installed and launched Rubrix, check the Setup and Installation guide.
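
If only the Python client is missing from your environment, it can usually be installed with pip. This is a minimal sketch and assumes the Rubrix server and its Elasticsearch backend are already up and running, as described in that guide:

[ ]:
%pip install rubrix -qqq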

[ ]:
import rubrix as rb

1. Storing and exploring text classification training data

Rubrix allows you to track data for different NLP tasks (such as Token Classification or Text Classification).

With Rubrix you can track both training data and predictions from models. In this part, we will focus only on training data. Typically, training data is data which has been curated or annotated by a human. Other terms for this same concept are: ground-truth data, “gold-standard” data, or even “annotated” data.

In this part of the tutorial, you will learn how to use the 🤗 datasets library for quick exploration of Text Classification and Token Classification training data. This is useful during model development, for getting a sense of the data, identifying potential issues, debugging, etc. Here we will use rather static “research” datasets, but Rubrix really shines when you are collecting and using training data in the wild, in other words in real data science projects.

Let’s get started!

Text classification with the tweet_eval dataset (Emoji classification)

Text classification deals with predicting which categories a text fits into. Just as you could quickly tell whether an image shows a dog or a cat, we build NLP models to distinguish, for example, a Jane Austen novel from a Charlotte Brontë poem. It’s all about feeding models labeled examples and seeing how they start predicting over those very same labels.

In this first case, we are going to play with tweet_eval, a dataset containing a bunch of tweets from different authors and topics, together with the sentiment each one transmits. This is, in fact, a very common NLP task called Sentiment Analysis, but with a cool tweak: we are representing these sentiments with emojis. Each tweet comes with a number between 0 and 19, each representing a different emoji. You can see them all in a cell below or on the tweet_eval page at the 🤗 Hub.

First of all, we are going to load the dataset from 🤗 Hub and visualize its content.

[ ]:
from datasets import load_dataset

dataset = load_dataset("tweet_eval", 'emoji', script_version="master")
[ ]:
labels = dataset['train'].features['label'].names; labels

Usually, datasets are divided into train, validation and test splits, and each one of them is used in a certain part of the training. For now, we can stick to the training split, which usually contains the majority of the instances of a dataset. Let’s see what’s inside!

[ ]:
with dataset['train'].formatted_as("pandas"):
    print(dataset['train'][:5])

Now, we are going to create our records from this dataset and log them into Rubrix. Rubrix comes with TextClassificationRecord and TokenClassificationRecord classes, which can be created from a dictionary. These objects pass information to Rubrix about the input of the model, the predictions obtained and the annotations made, as well as a metadata field for other important details.

In our case, we haven’t predicted anything, so we are only going to include the labels of each instance as annotations, as we know they are the ground truth. We will also include each tweet as the input, and specify in the metadata section that the record belongs to the training split. Once records is populated, we can log it with rb.log(), specifying the name of our dataset.

[ ]:
records = []

for record in dataset['train']:
    records.append(rb.TextClassificationRecord(
        inputs=record["text"],
        annotation=labels[record["label"]],
        annotation_agent="https://huggingface.co/datasets/tweet_eval",
        metadata={"split": "train"},
        )
    )
[ ]:
rb.log(records=records, name="tweet_eval_emojis")
Tweet eval dataset

Thanks to our metadata section in the Text Classification Record, we can log tweets from the validation and test splits in the same dataset to explore them using the Metadata filters.

[ ]:
records_validation = []

for record in dataset['validation']:
    records_validation.append(rb.TextClassificationRecord(
        inputs=record["text"],
        annotation=labels[record["label"]],
        annotation_agent="https://huggingface.co/datasets/tweet_eval",
        metadata={"split": "validation"},
        )
    )

rb.log(records=records_validation, name="tweet_eval_emojis")
[ ]:
records_test = []

for record in dataset['test']:
    records_test.append(rb.TextClassificationRecord(
        inputs=record["text"],
        annotation=labels[record["label"]],
        annotation_agent="https://huggingface.co/datasets/tweet_eval",
        metadata={"split": "test"},
        )
    )

rb.log(records=records_test, name="tweet_eval_emojis")
Tweet eval dataset

Natural language inference with the MRPC dataset

Natural Language Inference (NLI) is also a very common NLP task, but a little bit different from regular Text Classification. In NLI, the model receives a premise and a hypothesis, and it must figure out whether the hypothesis is true or not given the premise. We have three categories: entailment (true), contradiction (false) or neutral (undetermined or unrelated). With the premise “We live on a flat planet called Earth”, the hypothesis “The Earth is flat” must be classified as entailment, as it is stated in the premise. NLI works with a sort of closed-world assumption, in that anything not stated in the premise cannot be assumed from real-world knowledge.

Another key difference from Text Classification is that the input comes in pairs of sentences or texts, not just one. Text Classification treats its input as a single cohesive unit, while NLI treats its input as a pair and tries to find the relation between the two parts.

To play around with NLI we are going to use the 🤗 Hub GLUE benchmark over the MRPC task. GLUE is a well-known benchmark resource for NLP, and it allows us to work directly with the Microsoft Research Paraphrase Corpus, a corpus of sentence pairs drawn from online news sources.

[ ]:
from datasets import load_dataset
dataset = load_dataset('glue', 'mrpc', split='train')
[ ]:
dataset[0]

We can see the two input sentences instead of one. In order to simplify the workflow, let’s just test if they are equivalent or not.

[ ]:
labels = dataset.features['label'].names ; labels

Populating our record list follows the same procedure as in Text Classification, adapting our input to the new scenario of pairs.

[ ]:
records=[]

for record in dataset:
    records.append(rb.TextClassificationRecord(
       inputs={
           "sentence1": record["sentence1"],
           "sentence2": record["sentence2"]
        },
        annotation=labels[record["label"]],
        annotation_agent="https://huggingface.co/datasets/glue#mrpc",
        metadata={"split": "train"},
        )
    )
[ ]:
rb.log(records=records, name="mrpc")

Once your dataset is logged you can explore it using filters, keyword-based search and with Elasticsearch’s query string DSL.

For example, the following query inputs.sentence2:(not or dont) lets you browse all examples containing not or dont inside the sentence2 field, which you can further filter by Annotated as to browse examples belonging to a specific category (e.g., not_equivalent).

MRPC dataset
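
If you prefer to stay in Python, the same query string can be used to pull records back from Rubrix. This is a minimal sketch and assumes that your installed Rubrix version supports a query argument on rb.load (check the client reference for your version); it returns the matching records as a pandas DataFrame:

[ ]:
# Retrieve the logged records that match the query string (assumed rb.load signature)
df = rb.load(name="mrpc", query="inputs.sentence2:(not or dont)")
df.head()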

Multilabel text classification with go_emotions dataset

Another task similar to Text Classification, yet a bit different, is Multilabel Text Classification. There is just one key difference: more than one label may be predicted. While in a regular Text Classification task we may decide that the tweet “I can’t wait to travel to Egypt and visit the pyramids” fits into the hashtag #Travel, which is accurate, in Multilabel Text Classification we can classify it with more than one hashtag, like #Travel #History #Africa #Sightseeing #Desert.

In Text Classification, the category with the highest model score is the one predicted, but in this task we need to establish a threshold, a value between 0 and 1: every label whose score exceeds the threshold counts as a prediction. If we set it to 0.5, only categories with a probability above 0.5 will be considered predictions.
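
Here is a minimal sketch of that thresholding idea, using made-up label names and scores (they are illustrative and not taken from go_emotions or any model output):

[ ]:
# Hypothetical multi-label scores returned by a model (illustrative values only)
scores = {"joy": 0.81, "love": 0.62, "anger": 0.03, "surprise": 0.47}

threshold = 0.5
# Keep every label whose score passes the threshold
predicted_labels = [label for label, score in scores.items() if score >= threshold]
print(predicted_labels)  # ['joy', 'love']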

To get used to this task and see how we can log data to Rubrix, we are going to use the 🤗 Hub go_emotions dataset, with comments from different Reddit forums and the emotions associated with them (this experiment could also be considered Sentiment Analysis).

[ ]:
from datasets import load_dataset

dataset = load_dataset('go_emotions', split='train[0:10]')

Here’s an example of an instance of the dataset, and the different labels, in order. Each label is represented in the dataset as a number, but we will translate it to its name before logging to Rubrix, to see things more clearly.

[ ]:
dataset[0]
[ ]:
labels = dataset.features['labels'].feature.names; labels

Now we can build the records. As these are all ground-truth labels, no confidence value is needed: we simply pass the list of label names of each comment as the annotation and mark the record as multi-label.

[ ]:
records= []

for record in dataset:
    records.append(rb.TextClassificationRecord(
        inputs={"text": record["text"]},
        annotation=[labels[cls] for cls in record['labels']],
        annotation_agent="https://huggingface.co/datasets/go_emotions",
        multi_label=True,
        metadata={
            "split": "train"
            },
        )
    )


And logging is just as easy as before!

[ ]:
rb.log(records=records, name="go_emotions")

2. Storing and exploring token classification training data

In this second part, we will cover Token Classification while still using the 🤗 datasets library. These kinds of NLP tasks split the input text into tokens (words or sub-word pieces) and assign a label to each of them. Think about giving each word in a sentence its grammatical category, or highlighting which parts of a medical report belong to a certain specialty.

We are going to cover a few cases using 🤗 datasets, and see how TokenClassificationRecord allows us to log data in rubrix in a similar fashion.

Named-Entity Recognition with wnut17 dataset

Named-Entity Recognition (NER) seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories. And what’s powerful about NER is that these predefined categories can be whatever we want: maybe grammatical categories, to shine at syntax analysis in our English class; maybe person names, organizations, or even medical codes.

For this case, we are going to use the 🤗 Hub WNUT 17 dataset, which is about rare entities in written text. Take for example the tweet “so.. kktny in 30 mins?” - even human experts find the entity kktny hard to detect and resolve. This task evaluates the ability to detect and classify novel, emerging, singleton named entities in written text.

As always, let’s first dive into the data and see what it looks like.

[ ]:
from datasets import load_dataset

dataset = load_dataset("wnut_17", split="train[0:10]")
[ ]:
dataset[0]

We can see a list of tags and the tokens they are referring to. We have the following rare entities in this example.

[ ]:
for entity, token in zip(dataset[0]["ner_tags"], dataset[0]["tokens"]):
    if entity != 0:
        print(f"""{token}: {dataset.features["ner_tags"].feature.names[entity]}""")

So, it makes a lot of sense to translate these integer tags into NER tag names, which are much more self-explanatory.

[ ]:
dataset = dataset.map(lambda instance: {"ner_tags_translated": [dataset.features["ner_tags"].feature.names[tag] for tag in instance["ner_tags"]]})

What we did is apply a mapping function over the 🤗 dataset, which allows us to make changes to every instance of the dataset. The very same instance that we printed before is much more readable now.

[ ]:
dataset[0]

Info about the meaning of the tags is available here, but to sum up, Empire and ESB have been classified as B-LOC, or beginning of a location name, while State and Building have been classified as I-LOC, or inside/continuation of a location name.

We need to transform this information a bit, providing an entity annotation. Entity annotations are simply tuples, with the following structure

(label, start_position, end_position)
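
For instance, in the text “I visited the Empire State Building”, a span covering Empire State Building would be annotated as shown below. The label name location mirrors the WNUT 17 convention of stripping the B-/I- prefix, and the character offsets start at 0 with an exclusive end position (this small example is illustrative, not taken from the dataset):

[ ]:
text = "I visited the Empire State Building"
entity = ("location", 14, 35)  # label, start character, end character
text[14:35]                    # 'Empire State Building'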

Let’s create a function that transforms our dataset records into entities. It may look a bit convoluted, but don’t worry! All it does internally is extract the entity information shown above.

[ ]:
def parse_entities(record):

    # The text is the space-joined tokens; we need the character offset of each token within it
    entities, text, nr_tokens = [], " ".join(record["tokens"]), len(record["tokens"])
    token_start_indexes = [text.rfind(substr) for substr in [" ".join(record["tokens"][i:]) for i in range(nr_tokens)]]

    entity = None
    for tag, start in zip(record["ner_tags_translated"], token_start_indexes):
        # close the current entity when the tag is no longer a continuation (I-)
        if entity is not None and not tag.startswith("I-"):
            entity += (start - 1,)
            entities.append(entity)
            entity = None
        # start a new entity on a B- tag
        if entity is None and tag.startswith("B-"):
            entity = (tag[2:], start)

    # close an entity that runs until the last token
    if entity is not None:
        entity += (len(text),)
        entities.append(entity)

    return entities

Let’s proceed and create a record list to log it.

[ ]:
records = []

for record in dataset:
    entities = parse_entities(record)
    records.append(rb.TokenClassificationRecord(
        text=" ".join(record["tokens"]),
        tokens=record["tokens"],
        annotation=entities,
        annotation_agent="https://huggingface.co/datasets/wnut_17",
        metadata={
            "split": "train"
            },
        )
    )
[ ]:
records[0]
[ ]:
rb.log(records=records, name="ner_wnut_17")

Part of speech tagging with conll2003 dataset

Another NLP task related to token-level classification is Part-of-Speech (POS) tagging. Here we identify nouns, verbs, adverbs, adjectives… based on the context and the meaning of the words. It is a little trickier than having a huge dictionary where we can look up that drink is a verb and dog is a noun: many words change their grammatical type according to the context of the sentence, and this is where AI comes to save the day.

With just our dictionary and a regular script, dog in The sailor dogs the hatch. would be classified as a noun, because dog is a noun, right? A trained NLP model would step up and say No! That is a very common example to illustrate the ambiguity of words. It is a verb! Or maybe it would just say verb. That’s up to you.

In this dataset from the 🤗 Hub, we will see how each sentence comes with both POS and NER tags, and how we can log this POS tag information into Rubrix.

[ ]:
from datasets import load_dataset

dataset = load_dataset("conll2003", split="train[0:10]")
[ ]:
dataset[0]

Each POS and NER tag is represented by a number. In dataset.features we can see which tag each number refers to (this link may help you look up their meaning).

[ ]:
dataset.features

The following function will help us create the entities.

[ ]:
def parse_entities_POS(record):

    entities = []
    counter = 0  # character offset of the current token in the space-joined text

    for i in range(len(record['pos_tags'])):

        # Each token becomes an entity covering its own characters, labeled with its POS tag name
        entity = (dataset.features["pos_tags"].feature.names[record["pos_tags"][i]], counter, counter + len(record["tokens"][i]))
        entities.append(entity)

        # Advance past the token and the single space separating it from the next one
        counter += len(record["tokens"][i]) + 1

    return entities
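
As a quick sanity check of the offset arithmetic, here is what the function would produce for a tiny hand-made record (the tag indices are chosen arbitrarily for illustration; the resolved tag names depend on dataset.features):

[ ]:
# The tokens joined with spaces give "EU rejects", so the spans cover chars 0-2 and 3-10
example = {"tokens": ["EU", "rejects"], "pos_tags": [22, 42]}
parse_entities_POS(example)
# -> two spans: (<tag name>, 0, 2) for "EU" and (<tag name>, 3, 10) for "rejects"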
[ ]:
records = []

for record in dataset:
    entities = parse_entities_POS(record)
    records.append(rb.TokenClassificationRecord(
        text=" ".join(record["tokens"]),
        tokens=record["tokens"],
        annotation=entities,
        annotation_agent="https://huggingface.co/datasets/conll2003",
        metadata={
            "split": "train"
            },
        )
    )
[ ]:
rb.log(records=records, name="conll2003")

And so it is done! We have logged data from five different types of experiments, which can now be visualized in the Rubrix UI.

3. Exploring predictions

In this third part of the tutorial we are going to focus on loading predictions and annotations into Rubrix and visualizing them from the UI.

Rubrix lets us play with the data in many different ways: visualizing by predicted class, by annotated class, by split, selecting which records were wrongly classified, etc.

AG News and zero-shot classification

To explore some logged data in the Rubrix UI, we are going to predict the topic of some news articles with a zero-shot classifier (which we don’t need to train), and compare the predicted category with the ground truth. The dataset we are going to use in this part is ag_news, a collection of over 1 million news articles written in English.

First of all, as always, we are going to load the dataset from 🤗 Hub and visualize its content.

[ ]:
from datasets import load_dataset

dataset = load_dataset("ag_news", split='test[0:100]') # 20% is over 1500 records
[ ]:
dataset[0]
[ ]:
dataset.features

This dataset has articles from four different classes, so we can define a category list, which may come in handy.

[ ]:
categories = ['World', 'Sports', 'Business', 'Sci/Tech']

Now, it’s time to load our zero-shot classification model. We present two options:

  1. DistilBart-MNLI

  2. squeezebert-mnli

With the first model, the results will probably be better, but it is a larger model, which could take longer to run. We are going to stick with the first one, but feel free to change it, and even to compare them!

[ ]:
from transformers import pipeline

model = "valhalla/distilbart-mnli-12-1"

pl = pipeline('zero-shot-classification', model=model)

Let’s try to make a quick prediction and take a look.

[ ]:
pl(dataset[0]['text'], ['World', 'Sports', 'Business', 'Sci/Tech'], hypothesis_template='This example is {}.',multi_label=False)
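
The zero-shot pipeline returns a dictionary with the candidate labels ranked by score, under the keys sequence, labels and scores. A small usage sketch (the ranking and scores depend on the model and the input text):

[ ]:
output = pl(dataset[0]['text'], categories, hypothesis_template='This example is {}.')

# 'labels' holds the candidate labels sorted from most to least likely,
# 'scores' the corresponding probabilities (summing to ~1 when multi_label=False)
top_label, top_score = output["labels"][0], output["scores"][0]
top_label, top_score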

Knowing how to make a prediction, we can now apply this to the whole selected dataset. Here, we also present you with two options:

  1. Traverse through all records in the dataset, predict each record and log it to Rubrix.

  2. Apply a map function to make the predictions and add that field to each record, and then log it as a whole to Rubrix.

In the following cells, each approach is presented. Choose whichever you like the most, or even try both (but be careful with the running time and with duplicated records!).

First approach

[ ]:
from tqdm import tqdm

for record in tqdm(dataset):

    # Make the prediction
    model_output = pl(record['text'], categories, hypothesis_template='This example is {}.')

    item = rb.TextClassificationRecord(
        inputs={"text": record["text"]},
        prediction=list(zip(model_output['labels'], model_output['scores'])),
        prediction_agent="https://huggingface.co/valhalla/distilbart-mnli-12-1",
        annotation=categories[record["label"]],
        annotation_agent="https://huggingface.co/datasets/ag_news",
        multi_label=True,
        metadata={
            "split": "train"
            },
        )


    # Log to rubrix
    rb.log(records=item, name="ag_news")

Second approach

[ ]:
def add_predictions(records):

    # Run the zero-shot pipeline over a batch of texts
    predictions = pl([record for record in records['text']], categories, hypothesis_template='This example is {}.')

    # With a batch the pipeline returns a list of dicts; with a single text, just one dict
    if isinstance(predictions, list):
        return {"labels_predicted": [pred["labels"] for pred in predictions], "probabilities_predicted": [pred["scores"] for pred in predictions]}
    else:
        return {"labels_predicted": predictions["labels"], "probabilities_predicted": predictions["scores"]}
[ ]:
dataset_predicted = dataset.map(add_predictions, batched=True, batch_size=4)
[ ]:
dataset_predicted[0]
[ ]:
from tqdm import tqdm

for record in tqdm(dataset_predicted):

    item = rb.TextClassificationRecord(
        inputs={"text": record["text"]},
        prediction=list(zip(record['labels_predicted'], record['probabilities_predicted'])),
        prediction_agent="https://huggingface.co/valhalla/distilbart-mnli-12-1",
        annotation=categories[record["label"]],
        annotation_agent="https://huggingface.co/datasets/ag_news",
        multi_label=True,
        metadata={
            "split": "train"
            },
        )

    # Log to rubrix
    rb.log(records=item, name="ag_news")

Summary

In this tutorial, we have learnt:

  • To log and explore NLP training datasets with the 🤗 datasets library.

  • To explore NLP predictions using a zero-shot classifier from the 🤗 model hub.

Next steps

🙋‍♀️ Join the Rubrix community! A good place to start is the discussion forum.

⭐ Rubrix Github repo to stay updated.