🧐 Find label errors with cleanlab¶

This tutorial shows how to find potential labeling errors in your data set with the help of cleanlab and Rubrix.


As Curtis G. Northcutt et al. recently showed, label errors are pervasive even in the most-cited test sets used to benchmark progress in machine learning. In the worst case, these label errors can destabilize benchmarks and tend to favor higher-capacity models over lower-capacity ones.

The authors introduce confident learning, a principled framework to “identify label errors, characterize label noise, and learn with noisy labels”. It is open-sourced as the cleanlab Python package, which supports finding, quantifying, and learning with label errors in data sets.

This tutorial walks you through 5 basic steps to find and correct label errors in your data set:

  1. 💾 Load the data set you want to check, and a model trained on it;

  2. 💻 Make predictions for the test split of your data set;

  3. 🧐 Get label error candidates with cleanlab;

  4. 🔦 Uncover label errors with Rubrix;

  5. 🖍 Correct label errors and load the corrected data set.

Setup Rubrix¶

If you are new to Rubrix, visit and star Rubrix for updates: ⭐ GitHub repository

If you have not installed and launched Rubrix, check the Setup and Installation guide.

Once installed, you only need to import Rubrix:

import rubrix as rb

Install tutorial dependencies¶

Apart from cleanlab, we will also install the Hugging Face libraries transformers and datasets, as well as PyTorch, which provide the model and the data set we are going to investigate.

%pip install cleanlab torch transformers datasets -qqq


Let us import everything we need up front.

import rubrix as rb
from cleanlab.pruning import get_noise_indices

import torch
import datasets
from transformers import AutoTokenizer, AutoModelForSequenceClassification

1. Load model and data set¶

For this tutorial, we will use the well-studied Microsoft Research Paraphrase Corpus (MRPC) data set that forms part of the GLUE benchmark, and a pre-trained model from the Hugging Face Hub that was fine-tuned on this specific data set.

Let us first get the model and its corresponding tokenizer to be able to make predictions. For a detailed guide on how to use the 🤗 transformers library, please refer to their excellent documentation.

model_name = "textattack/roberta-base-MRPC"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

We then get the test split of the MRPC data set, which we will scan for label errors.

dataset = datasets.load_dataset("glue", "mrpc", split="test")

Let us have a quick look at the format of the data set. A label of 1 means that sentence1 and sentence2 are semantically equivalent; a label of 0 means that the sentence pair is not equivalent.

   sentence1                                          sentence2                                          label  idx
0  PCCW 's chief operating officer , Mike Butcher...  Current Chief Operating Officer Mike Butcher a...  1      0
1  The world 's two largest automakers said their...  Domestic sales at both GM and No. 2 Ford Motor...  1      1
2  According to the federal Centers for Disease C...  The Centers for Disease Control and Prevention...  1      2
3  A tropical storm rapidly developed in the Gulf...  A tropical storm rapidly developed in the Gulf...  0      3
4  The company didn 't detail the costs of the re...  But company officials expect the costs of the ...  0      4

2. Make predictions¶

Now let us use the model to get predictions for our data set, and add those to our dataset instance. We will use the .map functionality of the datasets library to process our data batch-wise.

def get_model_predictions(batch):
    # batch is a dictionary of lists
    tokenized_input = tokenizer(
        batch["sentence1"], batch["sentence2"], padding=True, return_tensors="pt"
    )
    # get logits of the model prediction
    logits = model(**tokenized_input).logits
    # convert logits to probabilities
    probabilities = torch.softmax(logits, dim=1).detach().numpy()

    return {"probabilities": probabilities}

# Apply predictions batch-wise
# (batch_size=16 is an assumed value; adjust it to your available memory)
dataset = dataset.map(
    get_model_predictions,
    batched=True,
    batch_size=16,
)
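
To make the batch contract concrete, here is a minimal sketch in plain Python that needs neither torch nor datasets. The helpers `softmax` and `map_batched` are hypothetical names introduced only for illustration; `map_batched` merely imitates the idea of a batched map (a function receives a dictionary of lists and returns a dictionary of lists) and is not how the datasets library is implemented.

```python
import math

def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability before exponentiating
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def map_batched(columns, fn, batch_size):
    """Hypothetical stand-in for a batched map: call fn on slices of each column."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Return the original columns plus the newly computed ones
    return {**columns, **new_columns}

# Toy logits standing in for the output of a two-class model
toy = {"logits": [[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]]}
result = map_batched(
    toy,
    lambda batch: {"probabilities": [softmax(row) for row in batch["logits"]]},
    batch_size=2,
)
```

Each row of `result["probabilities"]` sums to 1, which is the property the probability matrix in the next step relies on.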

3. Get label error candidates¶

To identify label error candidates the cleanlab framework simply needs the probability matrix of our predictions (n x m, where n is the number of examples and m the number of labels), and the potentially noisy labels.

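# Output the data as numpy arrays
dataset.set_format("numpy")

# Get a boolean array of label error candidates
# (s: the potentially noisy labels, psx: the predicted probabilities)
label_error_candidates = get_noise_indices(
    s=dataset["label"],
    psx=dataset["probabilities"],
)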
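
For intuition about how a probability matrix and noisy labels can yield label error candidates, the following is a heavily simplified sketch of the thresholding idea behind confident learning, written in plain Python. It is not cleanlab's implementation and will generally not reproduce the output of `get_noise_indices`; the function name `simple_label_error_candidates` and the toy data are made up for this illustration.

```python
def simple_label_error_candidates(labels, probabilities):
    """Flag examples whose given label disagrees with a confidently predicted class."""
    n_classes = len(probabilities[0])

    # Per-class threshold: average predicted probability of class j
    # over all examples whose given label is j
    thresholds = []
    for j in range(n_classes):
        own = [p[j] for label, p in zip(labels, probabilities) if label == j]
        thresholds.append(sum(own) / len(own))

    candidates = []
    for label, p in zip(labels, probabilities):
        # Classes predicted with at least the class-typical confidence
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if confident:
            best = max(confident, key=lambda j: p[j])
            candidates.append(best != label)
        else:
            candidates.append(False)
    return candidates

toy_labels = [0, 0, 0, 1, 1, 1]
toy_probabilities = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model is confident it is class 1
    [0.15, 0.85],
    [0.3, 0.7],
    [0.1, 0.9],
]
toy_candidates = simple_label_error_candidates(toy_labels, toy_probabilities)
```

Only the third toy example is flagged: its given label is 0, while the model's probability for class 1 exceeds the typical confidence for that class.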

This one line of code provides us with a boolean array of label error candidates that we can investigate further. Out of the 1725 sentence pairs in the test data set, we obtain 129 candidates (7.5%) for possible label errors.

frac = label_error_candidates.sum()/len(dataset)
print(
    f"Total: {len(dataset)}\n"
    f"Candidates: {label_error_candidates.sum()} ({100*frac:0.1f}%)"
)
Total: 1725
Candidates: 129 (7.5%)
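
Because reviewing candidates takes human time, one option is to look first at the candidates for which the model assigns the lowest probability to the given label. The sketch below illustrates this on toy data in plain Python; `rank_candidates` is a hypothetical helper and not part of cleanlab or Rubrix.

```python
def rank_candidates(labels, probabilities, candidate_mask):
    """Return candidate indices, lowest probability of the given label first."""
    indices = [i for i, is_candidate in enumerate(candidate_mask) if is_candidate]
    return sorted(indices, key=lambda i: probabilities[i][labels[i]])

toy_labels = [1, 0, 1, 0]
toy_probabilities = [[0.7, 0.3], [0.05, 0.95], [0.1, 0.9], [0.4, 0.6]]
toy_mask = [True, True, False, True]

ranked = rank_candidates(toy_labels, toy_probabilities, toy_mask)
share = sum(toy_mask) / len(toy_mask)
print(f"Candidates: {sum(toy_mask)} ({100 * share:0.1f}%)")
print("Review order:", ranked)
```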

4. Uncover label errors in Rubrix¶

Now that we have a list of potential candidates, let us log them to Rubrix to uncover and correct the label errors. First we switch to a pandas DataFrame to filter out our candidates.

candidates = dataset.to_pandas()[label_error_candidates]

Then we will turn those candidates into TextClassificationRecords that we will log to Rubrix.

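def make_record(row):
    # pair each label name with its predicted probability
    prediction = list(zip(["Not equivalent", "Equivalent"], row.probabilities))
    # the (potentially noisy) label from the data set becomes the annotation
    annotation = "Not equivalent"
    if row.label == 1:
        annotation = "Equivalent"

    return rb.TextClassificationRecord(
        inputs={"sentence1": row.sentence1, "sentence2": row.sentence2},
        prediction=prediction,
        prediction_agent="textattack/roberta-base-MRPC",
        annotation=annotation,
        annotation_agent="MRPC",
    )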

records = candidates.apply(make_record, axis=1)
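
To see which values go into a record without needing a running Rubrix instance, the sketch below builds the prediction and annotation as plain Python objects. The helper `to_prediction_and_annotation` and the sample probabilities are illustrative assumptions, not part of the Rubrix API.

```python
LABEL_NAMES = ["Not equivalent", "Equivalent"]  # index matches the MRPC label id

def to_prediction_and_annotation(label, probabilities):
    """Pair each label name with its probability and name the given label."""
    prediction = list(zip(LABEL_NAMES, probabilities))
    annotation = LABEL_NAMES[label]
    return prediction, annotation

# A typical candidate: labeled as equivalent, but the model strongly disagrees
prediction, annotation = to_prediction_and_annotation(1, [0.98, 0.02])
```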

Having our records at hand we can now log them to Rubrix and save them in a dataset that we call "mrpc_label_error".

rb.log(records, name="mrpc_label_error")

Scanning through the records in the Explore Mode of Rubrix, we were able to find at least 30 clear cases of label errors. A couple of examples are shown below, in which the noisy labels are shown in the upper right corner of each example. The predictions of the model together with their probabilities are shown below each sentence pair.

Examples of label errors in the test set uncovered with Rubrix

If your model is not severely overfitted, you can also run the candidate search over your training data to find very obvious label errors. Repeating the steps above on the training split of the MRPC data set (3668 examples) yields 9 candidates (this low number is expected, because the model has already seen these labels during fine-tuning), out of which 5 were clear cases of label errors. A couple of examples are shown below.

Examples of label errors in the training set uncovered with Rubrix

5. Correct label errors¶

With Rubrix it is very easy to correct those label errors. Just switch on the Annotation Mode, correct the noisy labels and load the dataset back into your notebook.

# Load the dataset into a pandas DataFrame
dataset_with_corrected_labels = rb.load("mrpc_label_error")

inputs prediction annotation prediction_agent annotation_agent multi_label explanation id metadata status event_timestamp
0 {'sentence1': 'Deaths in rollover crashes acco... [(Equivalent, 0.9751904606819153), (Not equiva... [Not equivalent] textattack/roberta-base-MRPC MRPC False None bad3f616-46e3-43ca-8ba3-f2370d421fd2 {} Validated None
1 {'sentence1': 'Mr. Kozlowski contends that the... [(Not equivalent, 0.9878258109092712), (Equiva... [Equivalent] textattack/roberta-base-MRPC MRPC False None 50ca41c9-a147-411f-8682-1e3880a522f9 {} Validated None
2 {'sentence1': 'Larger rivals , including Tesco... [(Equivalent, 0.986499547958374), (Not equival... [Not equivalent] textattack/roberta-base-MRPC MRPC False None 6c06250f-7953-475a-934f-7eb35fc9dc4d {} Validated None
3 {'sentence1': 'The Standard & Poor 's 500 inde... [(Not equivalent, 0.9457013010978699), (Equiva... [Equivalent] textattack/roberta-base-MRPC MRPC False None 39f37fcc-ac22-4871-90f1-3766cf73f575 {} Validated None
4 {'sentence1': 'Defense lawyers had said a chan... [(Equivalent, 0.9974484443664551), (Not equiva... [Not equivalent] textattack/roberta-base-MRPC MRPC False None 080c6d5c-46de-4670-9e0a-98e0c7592b11 {} Validated None
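
The annotation column in the table above holds label names wrapped in a list. If you need integer labels again, for example to compare against the original MRPC labels, a mapping along the following lines could be used. This is a sketch on hand-written rows that imitate the columns shown above; `annotation_to_label` is a hypothetical helper and the rows are not real output.

```python
NAME_TO_LABEL = {"Not equivalent": 0, "Equivalent": 1}

def annotation_to_label(annotation):
    """Map a single-label annotation such as ['Equivalent'] to its integer id."""
    # The table shows annotations as one-element lists; plain strings are accepted too
    name = annotation[0] if isinstance(annotation, (list, tuple)) else annotation
    return NAME_TO_LABEL[name]

# Toy rows; the "Default" status stands in for a record nobody has reviewed yet
toy_rows = [
    {"annotation": ["Not equivalent"], "status": "Validated"},
    {"annotation": ["Equivalent"], "status": "Validated"},
    {"annotation": ["Equivalent"], "status": "Default"},
]
# Keep only records that a human has validated
corrected = [
    annotation_to_label(row["annotation"])
    for row in toy_rows
    if row["status"] == "Validated"
]
```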

Now you can use the corrected data set to repeat your benchmarks and measure your model’s “real-world performance”, which is what you care about in practice.
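
As a minimal illustration of repeating a benchmark, the sketch below computes accuracy once against the original labels and once against labels in which some entries were corrected. All numbers are toy values, and `accuracy` and `apply_corrections` are hypothetical helpers written for this example.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def apply_corrections(labels, corrections):
    """Return a copy of labels with corrections (index -> new label) applied."""
    corrected = list(labels)
    for index, new_label in corrections.items():
        corrected[index] = new_label
    return corrected

toy_predictions = [1, 0, 1, 1, 0]
noisy_labels = [1, 1, 1, 0, 0]
# Suppose annotators found that examples 1 and 3 were mislabeled
corrected_labels = apply_corrections(noisy_labels, {1: 0, 3: 1})

acc_noisy = accuracy(toy_predictions, noisy_labels)
acc_corrected = accuracy(toy_predictions, corrected_labels)
print(f"Accuracy on noisy labels: {acc_noisy:0.2f}")
print(f"Accuracy on corrected labels: {acc_corrected:0.2f}")
```

In this toy case the model looks worse on the noisy labels than it actually is; with real data the shift can go in either direction.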


In this tutorial, we saw how to leverage cleanlab and Rubrix to uncover label errors in your data set. In just a few steps, you can quickly check whether your test data set is seriously affected by label errors and whether your benchmarks are really meaningful in practice. Maybe your less complex model turns out to beat your resource-hungry super model, and the deployment process just got a little bit easier 😀.

Cleanlab and Rubrix do not care about the model architecture or the framework you are working with. They just care about the underlying data and allow you to put more humans in the loop of your AI Lifecycle.

Next steps¶

🙋‍♀️ Join the Rubrix community! A good place to start is the discussion forum.¶

⭐ Rubrix GitHub repo to stay updated.¶