This guide gives you a brief introduction to Rubrix Metrics. Rubrix Metrics enable you to perform fine-grained analyses of your models and training datasets. Rubrix Metrics are inspired by a number of seminal works such as Explainaboard.

The main goal is to make it easier to build more robust models and training data, going beyond single-number metrics (e.g., F1).

This guide gives a brief overview of currently supported metrics. For the full API documentation, see the Python API reference.

This feature is experimental, so you can expect some changes in the Python API. Please report any issues you encounter on GitHub.

Install dependencies

Verify you have already installed Jupyter Widgets in order to properly visualize the plots. See

To run this guide, you need to install the following dependencies:

[ ]:
%pip install datasets spacy plotly -qqq

and the spaCy model:

[ ]:
!python -m spacy download en_core_web_sm
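
Before moving on, it can help to confirm that the packages are importable from the notebook kernel. The snippet below is a small convenience sketch, not part of Rubrix: the helper name `missing_packages` is hypothetical, and `ipywidgets` is included because it is the Python package that provides Jupyter Widgets.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found by the import system."""
    return [name for name in names if find_spec(name) is None]


# Packages used throughout this guide.
required = ["rubrix", "datasets", "spacy", "plotly", "ipywidgets"]
print("Missing packages:", missing_packages(required))
```

An empty list means everything is installed in the current environment. Note that this only checks the packages, not the `en_core_web_sm` model.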

1. Rubrix Metrics for NER pipeline predictions

Load dataset and spaCy model

We’ll be using spaCy for this guide, but all the metrics we’ll see can be computed for any other framework (Flair, Stanza, Hugging Face, etc.). As an example, we will use the WNUT17 NER dataset.

[ ]:
import rubrix as rb
import spacy
from datasets import load_dataset

nlp = spacy.load("en_core_web_sm")
dataset = load_dataset("wnut_17", split="train")

Log records into a Rubrix dataset

Let’s log spaCy predictions using the built-in rb.monitor method:

[ ]:
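nlp = rb.monitor(nlp, dataset="spacy_sm_wnut17")

def predict_batch(records):
    # In a batched call, records["tokens"] holds one token list per example.
    texts = [" ".join(tokens) for tokens in records["tokens"]]
    docs = list(nlp.pipe(texts))  # consume the generator so every text is processed
    return {"predicted": [True for _ in docs]}

# Run the monitored pipeline over the dataset in batches.
dataset.map(predict_batch, batch_size=32, batched=True)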
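
When a function is applied through the `datasets` library with `map(..., batched=True)`, each batch arrives as a dictionary of columns, so the `tokens` column is a list of token lists rather than a single list. The sketch below uses made-up sentences (not taken from WNUT17) to show that layout and how one text per example is rebuilt. The helper name `join_tokens` is hypothetical.

```python
def join_tokens(batch):
    """Rebuild one whitespace-joined text per example from a column-oriented batch."""
    return [" ".join(tokens) for tokens in batch["tokens"]]


# Made-up batch in the column-oriented layout used for batched mapping.
batch = {
    "tokens": [
        ["Paris", "is", "lovely"],
        ["I", "like", "London"],
    ]
}
texts = join_tokens(batch)
print(texts)
```

Joining with single spaces loses the original spacing of the text, which is acceptable here because the dataset only provides tokens.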

Explore the metrics for this pipeline

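from rubrix.metrics.token_classification import entity_consistency

entity_consistency(name="spacy_sm_wnut17", mentions=5000, threshold=2).visualize()

# The calls below mirror the usage above and rely on each metric's default arguments.
from rubrix.metrics.token_classification import entity_labels

entity_labels(name="spacy_sm_wnut17").visualize()

from rubrix.metrics.token_classification import entity_density

entity_density(name="spacy_sm_wnut17").visualize()

from rubrix.metrics.token_classification import entity_capitalness

entity_capitalness(name="spacy_sm_wnut17").visualize()

from rubrix.metrics.token_classification import mention_length

mention_length(name="spacy_sm_wnut17").visualize()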
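
To give some intuition for what a consistency-style metric looks at, here is a minimal sketch in plain Python. It is not Rubrix's implementation: it assumes that `threshold` means the minimum number of distinct labels a mention must receive to be reported, and the helper name `inconsistent_mentions` and the example pairs are made up for illustration.

```python
from collections import defaultdict


def inconsistent_mentions(pairs, threshold=2):
    """Return {mention: sorted labels} for mentions seen with at least `threshold` distinct labels."""
    labels_by_mention = defaultdict(set)
    for mention, label in pairs:
        labels_by_mention[mention].add(label)
    return {
        mention: sorted(labels)
        for mention, labels in labels_by_mention.items()
        if len(labels) >= threshold
    }


# Made-up (mention, label) pairs, as a NER model might predict them.
pairs = [
    ("Apple", "corporation"),
    ("Apple", "product"),
    ("Berlin", "location"),
    ("Berlin", "location"),
]
result = inconsistent_mentions(pairs, threshold=2)
print(result)
```

Mentions that show up here are good candidates for a closer look, since the same surface form receiving several labels can point to either genuine ambiguity or model errors.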

2. Rubrix Metrics for training sets

Analyzing tags

dataset = load_dataset("conll2002", "es", split="train[0:5000]")

def parse_entities(record):
    # Treat every token as an entity labeled with its tag; character offsets
    # assume the text is the tokens joined by single spaces.
    tag_names = dataset.features["ner_tags"].feature.names
    entities = []
    counter = 0
    for token, tag in zip(record["tokens"], record["ner_tags"]):
        entities.append((tag_names[tag], counter, counter + len(token)))
        counter += len(token) + 1  # +1 for the joining space
    return entities

records = [
    rb.TokenClassificationRecord(
        text=" ".join(example["tokens"]),
        tokens=example["tokens"],
        annotation=parse_entities(example),
    )
    for example in dataset
]
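
The character offsets computed in `parse_entities` only line up if the record text is built by joining the tokens with single spaces. The following sketch checks that arithmetic on a made-up token list. The helper name `token_spans` is hypothetical and leaves out the tag lookup to focus on the offsets.

```python
def token_spans(tokens):
    """Return (start, end) character offsets of each token in ' '.join(tokens)."""
    spans = []
    start = 0
    for token in tokens:
        end = start + len(token)
        spans.append((start, end))
        start = end + 1  # skip the single joining space
    return spans


tokens = ["Melbourne", "(", "Australia", ")"]
text = " ".join(tokens)
spans = token_spans(tokens)

# Every span should slice the original token back out of the text.
for token, (start, end) in zip(tokens, spans):
    assert text[start:end] == token
print(spans)
```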
[ ]:
rb.log(records, "conll2002_es")

from rubrix.metrics.token_classification import entity_consistency
from rubrix.metrics.token_classification.metrics import Annotations

entity_consistency(name="conll2002_es", mentions=30, threshold=4, compute_for=Annotations).visualize()

from rubrix.metrics.token_classification import entity_density

entity_density(name="conll2002_es", compute_for=Annotations).visualize()
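
As a rough illustration of what a density-style number expresses, the sketch below computes the number of mentions divided by the number of tokens for a single text, along with the length of each mention in tokens. This definition is an assumption made for illustration and may not match the one Rubrix uses. The helper name `mention_stats` and the example sentence are made up.

```python
def mention_stats(tokens, mentions):
    """Compute simple per-text statistics.

    `mentions` is a list of (label, start_token, end_token) tuples,
    with `end_token` exclusive.
    """
    lengths = [end - start for _, start, end in mentions]
    density = len(mentions) / len(tokens) if tokens else 0.0
    return {"density": density, "mention_lengths": lengths}


tokens = ["Angela", "Merkel", "visited", "Paris", "yesterday"]
mentions = [("PER", 0, 2), ("LOC", 3, 4)]
stats = mention_stats(tokens, mentions)
print(stats)
```

Looking at how such values are distributed across a training set helps spot texts that are unusually dense or sparse in annotations.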