Datasets#

This guide showcases some features of the Dataset classes in the Rubrix client. The Dataset classes are lightweight containers for Rubrix records. These classes facilitate importing from and exporting to different formats (e.g., pandas.DataFrame, datasets.Dataset) as well as sharing and versioning Rubrix datasets using the Hugging Face Hub.

For each record type there's a corresponding Dataset class called DatasetFor<RecordType>. You can look up their API in the reference section.

Working with a Dataset#

Under the hood, the Dataset classes store the records in a simple Python list. Therefore, working with a Dataset class is not very different from working with a simple list of records:

[ ]:
import rubrix as rb

# Start with a list of Rubrix records
dataset_rb = rb.DatasetForTextClassification(my_records)

# Loop over the dataset
for record in dataset_rb:
    print(record)

# Index into the dataset
dataset_rb[0] = rb.TextClassificationRecord(text="replace record")

# log a dataset to the Rubrix web app
rb.log(dataset_rb, "my_dataset")

The Dataset classes perform some extra checks for you to make sure you do not mix record types when appending to or indexing into a dataset.
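The behavior of these checks can be illustrated with a simplified, self-contained sketch. The `RecordList` class and its names below are hypothetical and not the actual Rubrix implementation; they only show the kind of homogeneity check a typed record container performs:

```python
# Hypothetical sketch of a type-checked record container; this is NOT the
# actual Rubrix implementation, only an illustration of the idea.
class RecordList:
    def __init__(self, record_type, records=None):
        self._record_type = record_type
        self._records = []
        for record in records or []:
            self.append(record)

    def append(self, record):
        # reject records of a different type to keep the dataset homogeneous
        if not isinstance(record, self._record_type):
            raise TypeError(
                f"Expected {self._record_type.__name__}, "
                f"got {type(record).__name__}"
            )
        self._records.append(record)

    def __setitem__(self, index, record):
        self.append(record)                          # run the same check ...
        self._records[index] = self._records.pop()   # ... then place the record

    def __getitem__(self, index):
        return self._records[index]
```

With a container like this, appending or assigning a record of the wrong type raises a `TypeError` immediately instead of failing later during logging.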

Importing from other formats#

When you have your data in a pandas DataFrame or a datasets Dataset, we provide some neat shortcuts to import this data into a Rubrix Dataset. You have to make sure that the data follows the record model of a specific task; otherwise, you will get validation errors. Columns in your DataFrame/Dataset that are not supported or recognized will simply be ignored.

The record models of the tasks are explained in the reference section.

Note

Due to its pyarrow nature, data in a datasets.Dataset has to follow a slightly different model, which you can look up in the examples of the Dataset*.from_datasets docstrings.

[ ]:
import rubrix as rb

# import data from a pandas DataFrame
dataset_rb = rb.read_pandas(my_dataframe, task="TextClassification")
# or
dataset_rb = rb.DatasetForTextClassification.from_pandas(my_dataframe)

# import data from a datasets Dataset
dataset_rb = rb.read_datasets(my_dataset, task="TextClassification")
# or
dataset_rb = rb.DatasetForTextClassification.from_datasets(my_dataset)

We also provide helper arguments you can use to read almost arbitrary datasets for a given task from the Hugging Face Hub. They map certain input arguments of the Rubrix records to columns of the given dataset. Let's have a look at a few examples:

[ ]:
import rubrix as rb
from datasets import load_dataset

# the "poem_sentiment" dataset has columns "verse_text" and "label"
dataset_rb = rb.DatasetForTextClassification.from_datasets(
    dataset=load_dataset("poem_sentiment", split="test"),
    text="verse_text",
    annotation="label",
)

# the "snli" dataset has the columns "premise", "hypothesis" and "label"
dataset_rb = rb.DatasetForTextClassification.from_datasets(
    dataset=load_dataset("snli", split="test"),
    inputs=["premise", "hypothesis"],
    annotation="label",
)

# the "conll2003" dataset has the columns "id", "tokens", "pos_tags", "chunk_tags" and "ner_tags"
rb.DatasetForTokenClassification.from_datasets(
    dataset=load_dataset("conll2003", split="test"),
    tags="ner_tags",
)

# the "xsum" dataset has the columns "id", "document" and "summary"
rb.DatasetForText2Text.from_datasets(
    dataset=load_dataset("xsum", split="test"),
    text="document",
    annotation="summary",
)

You can also use the shortcut rb.read_datasets(dataset=..., task=..., **kwargs) where the keyword arguments are passed on to the corresponding from_datasets() method.

Sharing via the Hugging Face Hub#

You can easily share your Rubrix dataset with your community via the Hugging Face Hub. For this, you just need to export your Rubrix Dataset to a datasets.Dataset and push it to the Hub:

[ ]:
import rubrix as rb

# load your annotated dataset from the Rubrix web app
dataset_rb = rb.load("my_dataset")

# export your Rubrix Dataset to a datasets Dataset
dataset_ds = dataset_rb.to_datasets()

# push the dataset to the Hugging Face Hub
dataset_ds.push_to_hub("my_dataset")

Afterward, your community can easily access your annotated dataset and log it directly to the Rubrix web app:

[ ]:
import rubrix as rb
from datasets import load_dataset

# download the dataset from the Hugging Face Hub
dataset_ds = load_dataset("user/my_dataset", split="train")

# read in the dataset, assuming it's a dataset for text classification
dataset_rb = rb.read_datasets(dataset_ds, task="TextClassification")

# log the dataset to the Rubrix web app
rb.log(dataset_rb, "dataset_by_user")

Prepare dataset for training#

If you want to train a Hugging Face transformer with your dataset, we provide a handy method to prepare your dataset: DatasetFor*.prepare_for_training(). It will return a Hugging Face dataset, optimized for the training process with the Hugging Face Trainer.

For text classification tasks, it flattens the inputs into separate columns of the returned dataset, converts the annotations of your records into integers, and writes them to a label column:

[ ]:
dataset_rb = rb.DatasetForTextClassification([
    rb.TextClassificationRecord(inputs={"title": "My title", "content": "My content"}, annotation="news")
])

dataset_rb.prepare_for_training()[0]
# Output:
# {'title': 'My title', 'content': 'My content', 'label': 0}
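Outside of Rubrix, this conversion boils down to flattening the inputs dict into columns and mapping the label string to an integer. Here is a minimal sketch; the `to_training_row` helper and the explicit `label2id` mapping are hypothetical illustrations, not part of the Rubrix API:

```python
def to_training_row(inputs: dict, annotation: str, label2id: dict) -> dict:
    """Flatten the inputs dict into columns and encode the label as an integer."""
    row = dict(inputs)                  # each input key becomes its own column
    row["label"] = label2id[annotation]  # e.g. "news" -> 0
    return row

to_training_row(
    {"title": "My title", "content": "My content"}, "news", {"news": 0}
)
# {'title': 'My title', 'content': 'My content', 'label': 0}
```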

For token classification tasks, it converts the annotations of a record into integers representing BIO tags and writes them to a ner_tags column:

[ ]:
dataset_rb = rb.DatasetForTokenClassification([
    rb.TokenClassificationRecord(text="I live in Madrid", tokens=["I", "live", "in", "Madrid"], annotation=[("LOC", 10, 15)])
])

dataset_rb.prepare_for_training()[0]
# Output:
# {..., 'tokens': ['I', 'live', 'in', 'Madrid'], 'ner_tags': [0, 0, 0, 1], ...}
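The character-span-to-BIO conversion can be sketched in plain Python. This is a simplified illustration, not Rubrix's actual implementation; it assumes each token appears verbatim in the text and tags any token that overlaps an annotated span:

```python
def char_spans_to_bio(text, tokens, annotations):
    """Map character-level (label, start, end) spans to per-token BIO tags.

    Simplified sketch: finds each token's character offset by scanning the
    text left to right, then tags tokens that overlap an annotated span.
    """
    tags, pos = [], 0
    for token in tokens:
        start = text.index(token, pos)  # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for label, span_start, span_end in annotations:
            if start < span_end and end > span_start:  # token overlaps the span
                tag = ("B-" if start == span_start else "I-") + label
                break
        tags.append(tag)
    return tags

char_spans_to_bio(
    "I live in Madrid", ["I", "live", "in", "Madrid"], [("LOC", 10, 15)]
)
# ['O', 'O', 'O', 'B-LOC']
```

String tags like these then map to the integer ner_tags shown above via a label-to-id lookup.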