Rubrix Cookbook
This guide is a collection of recipes. It shows examples for using Rubrix with some of the most popular NLP Python libraries.
Rubrix is agnostic: it can be used with any library or framework, and there is no need to implement any interface or modify your existing toolbox and workflows.
With these examples you'll be able to start exploring and annotating data with these libraries, or get some inspiration if your library of choice is not in this guide.
If you miss a library in this guide, leave a message at the Rubrix GitHub forum.
Hugging Face Transformers
Hugging Face has made working with NLP easier than ever before. With a few lines of code we can take a pretrained Transformer model from the Hub, start making some predictions and log them into Rubrix.
[ ]:
%pip install torch
%pip install transformers
%pip install datasets
Text Classification
Inference
Let's try a zero-shot classifier using SqueezeBERT for predicting the topic of a sentence.
[ ]:
import rubrix as rb
from transformers import pipeline
input_text = "I love watching rock climbing competitions!"
# We define our HuggingFace Pipeline
classifier = pipeline(
    "zero-shot-classification",
    model="typeform/squeezebert-mnli",
    framework="pt",
)
# Making the prediction
prediction = classifier(
    input_text,
    candidate_labels=[
        "politics",
        "sports",
        "technology",
    ],
    hypothesis_template="This text is about {}.",
)
# Creating the prediction entity as a list of tuples (label, probability)
prediction = list(zip(prediction["labels"], prediction["scores"]))
# Building a TextClassificationRecord
record = rb.TextClassificationRecord(
    inputs=input_text,
    prediction=prediction,
    prediction_agent="https://huggingface.co/typeform/squeezebert-mnli",
)
# Logging into Rubrix
rb.log(records=record, name="zeroshot-topic-classifier")
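The conversion step above can also be wrapped in a small helper. The following is a minimal sketch, not part of Rubrix or Transformers: `to_prediction` and `example_output` are hypothetical names, and the dictionary only imitates the shape (`labels`, `scores`) that the example above reads from the pipeline result.

```python
from typing import Dict, List, Tuple


def to_prediction(output: Dict[str, list]) -> List[Tuple[str, float]]:
    """Pair each candidate label with its score, highest score first."""
    pairs = zip(output["labels"], output["scores"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical output, shaped like the dict used in the example above
example_output = {
    "sequence": "I love watching rock climbing competitions!",
    "labels": ["technology", "sports", "politics"],
    "scores": [0.07, 0.9, 0.03],
}

prediction = to_prediction(example_output)
print(prediction)
```

Sorting is optional here; it only makes the most likely label easy to spot when inspecting the list before logging it.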
Training
Let's read a Rubrix dataset, prepare a training set and use the Trainer API for fine-tuning a distilbert-base-uncased model. Note that a dataset named labelled_dataset is expected to exist in your Rubrix instance.
[ ]:
from datasets import Dataset
import rubrix as rb
# load rubrix dataset
df = rb.load('labelled_dataset')
# inputs can be dicts to support multifield classifiers, we just use the text here.
df['text'] = df.inputs.transform(lambda r: r['text'])
# we flatten the annotations and create a dict for turning labels into numeric ids
df['labels'] = df.annotation.transform(lambda r: r[0])
# sorting the label set keeps the label-to-id mapping stable across runs
label2id = {label: idx for idx, label in enumerate(sorted(set(df.labels.values)))}
# create 🤗 dataset from pandas with labels as numeric ids
dataset = Dataset.from_pandas(df[['text', 'labels']])
dataset = dataset.map(lambda example: {'labels': label2id[example['labels']]})
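Iterating over a set of strings can yield a different order from one Python process to the next, so a label-to-id mapping built directly from a set may change between runs. As a minimal sketch, assuming the labels are plain strings, the mapping and its inverse can be built from a sorted label list. `build_label_maps` is a hypothetical helper, not a Rubrix or Transformers function.

```python
from typing import Dict, Iterable, Tuple


def build_label_maps(labels: Iterable[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Build a reproducible label-to-id mapping and its inverse."""
    label2id = {label: idx for idx, label in enumerate(sorted(set(labels)))}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


# Hypothetical annotations standing in for df.labels.values
labels = ["sports", "politics", "sports", "technology"]
label2id, id2label = build_label_maps(labels)
num_labels = len(label2id)
print(label2id, id2label, num_labels)
```

The inverse mapping is handy for turning predicted ids back into label names, and `num_labels` can be passed to the model instead of a hard-coded number.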
[ ]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import Trainer
# from here, it's just regular fine-tuning with 🤗 transformers
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# num_labels must match the number of labels in label2id (built in the previous cell)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label2id))
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
train_dataset = dataset.map(tokenize_function, batched=True).shuffle(seed=42)
trainer = Trainer(model=model, train_dataset=train_dataset)
trainer.train()
Token Classification
We will explore a DistilBERT classifier fine-tuned for NER on the conll03 English dataset.
[ ]:
import rubrix as rb
from transformers import pipeline
input_text = "My name is Sarah and I live in London"
# We define our HuggingFace Pipeline
classifier = pipeline(
    "ner",
    model="elastic/distilbert-base-cased-finetuned-conll03-english",
    framework="pt",
)
# Making the prediction
predictions = classifier(
    input_text,
)
# Creating the prediction entity as a list of tuples (entity, start_char, end_char)
prediction = [(pred["entity"], pred["start"], pred["end"]) for pred in predictions]
# Building a TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=input_text.split(),
    prediction=prediction,
    prediction_agent="https://huggingface.co/elastic/distilbert-base-cased-finetuned-conll03-english",
)
# Logging into Rubrix
rb.log(records=record, name="zeroshot-ner")
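The example above uses whitespace tokens together with character offsets coming from the model. Before logging, it can help to check that every predicted span starts and ends on a token boundary. The following is a minimal sketch under that assumption; `token_offsets` and `is_aligned` are hypothetical helpers and not part of Rubrix.

```python
from typing import List, Tuple


def token_offsets(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Locate each token in the text, scanning from left to right."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)  # raises ValueError if a token is missing
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


def is_aligned(span: Tuple[int, int], offsets: List[Tuple[int, int]]) -> bool:
    """Check that a (start_char, end_char) span begins and ends on token boundaries."""
    starts = {start for start, _ in offsets}
    ends = {end for _, end in offsets}
    return span[0] in starts and span[1] in ends


text = "My name is Sarah and I live in London"
offsets = token_offsets(text, text.split())
print(offsets)
print(is_aligned((11, 16), offsets))  # the span covering "Sarah"
print(is_aligned((11, 14), offsets))  # a span that cuts "Sarah" in the middle
```

Spans that fail the check usually point to a mismatch between the tokenization passed to the record and the one used by the model.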
spaCy
spaCy offers industrial-strength Natural Language Processing, with support for 64+ languages, trained pipelines, multi-task learning with pretrained Transformers, pretrained word vectors and much more.
[ ]:
%pip install spacy
Token Classification
We will focus our spaCy recipes on Token Classification tasks, showing you how to log data from NER and POS tagging.
NER
For this recipe, we are going to try the French language model to extract NER entities from some sentences.
[ ]:
!python -m spacy download fr_core_news_sm
[ ]:
import rubrix as rb
import spacy
input_text = "Paris a un enfant et la forêt a un oiseau ; l’oiseau s’appelle le moineau ; l’enfant s’appelle le gamin"
# Loading spaCy model
nlp = spacy.load("fr_core_news_sm")
# Creating spaCy doc
doc = nlp(input_text)
# Creating the prediction entity as a list of tuples (entity, start_char, end_char)
prediction = [(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
# Building TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[token.text for token in doc],
    prediction=prediction,
    prediction_agent="spacy.fr_core_news_sm",
)
# Logging into Rubrix
rb.log(records=record, name="lesmiserables-ner")
POS tagging
By changing very few parameters, we can run a POS tagging experiment instead of NER. Let's try it out with the same input sentence.
[ ]:
import rubrix as rb
import spacy
input_text = "Paris a un enfant et la forêt a un oiseau ; l’oiseau s’appelle le moineau ; l’enfant s’appelle le gamin"
# Loading spaCy model
nlp = spacy.load("fr_core_news_sm")
# Creating spaCy doc
doc = nlp(input_text)
# Creating the prediction entity as a list of tuples (tag, start_char, end_char)
prediction = [(token.pos_, token.idx, token.idx + len(token)) for token in doc]
# Building TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[token.text for token in doc],
    prediction=prediction,
    prediction_agent="spacy.fr_core_news_sm",
)
# Logging into Rubrix
rb.log(records=record, name="lesmiserables-pos")
Flair
Flair is a framework that provides a state-of-the-art NLP library, a text embedding library and a PyTorch framework for NLP. It offers sequence tagging language models in English, Spanish, Dutch, German and many more, and they are also hosted on the Hugging Face Model Hub.
[ ]:
%pip install flair
Text Classification
Flair offers some pre-trained classification models ready to be used, which we are going to use to introduce logging TextClassificationRecords with Rubrix. Let's see how to integrate Rubrix with their German offensive language model (we promise not to get very explicit).
[ ]:
import rubrix as rb
from flair.models import TextClassifier
from flair.data import Sentence
input_text = "Du erzählst immer Quatsch."  # something like: "You are always narrating silliness."
# Load the pre-trained German offensive language classifier
classifier = TextClassifier.load("de-offensive-language")
# Creating Sentence object
sentence = Sentence(input_text)
# Make the prediction
classifier.predict(sentence, multi_class_prob=True)
# Creating the prediction entity as a list of tuples (label, probability)
prediction = [(pred.value, pred.score) for pred in sentence.labels]
# Building a TextClassificationRecord
record = rb.TextClassificationRecord(
    inputs=input_text,
    prediction=prediction,
    prediction_agent="de-offensive-language",
)
# Logging into Rubrix
rb.log(records=record, name="german-offensive-language")
Token Classification
Flair offers a lot of tools for Token Classification, supporting tasks like named entity recognition (NER), part-of-speech tagging (POS), special support for biomedical data, etc. with a growing number of supported languages.
Let's see some examples for NER and POS tagging.
NER
In this example, we will try the pretrained Dutch NER model from Flair.
[ ]:
import rubrix as rb
from flair.data import Sentence
from flair.models import SequenceTagger
input_text = "De Nachtwacht is in het Rijksmuseum"
# Loading our NER model from flair
tagger = SequenceTagger.load("flair/ner-dutch")
# Creating Sentence object
sentence = Sentence(input_text)
# run NER over sentence
tagger.predict(sentence)
# Creating the prediction entity as a list of tuples (entity, start_char, end_char)
prediction = [
    (entity.get_labels()[0].value, entity.start_pos, entity.end_pos)
    for entity in sentence.get_spans("ner")
]
# Building a TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[token.text for token in sentence],
    prediction=prediction,
    prediction_agent="flair/ner-dutch",
)
# Logging into Rubrix
rb.log(records=record, name="dutch-flair-ner")
POS tagging
In the following snippet we will use the multilingual POS tagging model from Flair.
[ ]:
import rubrix as rb
from flair.data import Sentence
from flair.models import SequenceTagger
input_text = "George Washington went to Washington. Dort kaufte er einen Hut."
# Loading our POS tagging model from flair
tagger = SequenceTagger.load("flair/upos-multi")
# Creating Sentence object
sentence = Sentence(input_text)
# run POS tagging over sentence
tagger.predict(sentence)
# Creating the prediction entity as a list of tuples (tag, start_char, end_char)
prediction = [
    (entity.get_labels()[0].value, entity.start_pos, entity.end_pos)
    for entity in sentence.get_spans()
]
# Building a TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[token.text for token in sentence],
    prediction=prediction,
    prediction_agent="flair/upos-multi",
)
# Logging into Rubrix
rb.log(records=record, name="flair-pos-tagging")
Stanza
Stanza is a collection of efficient tools for many NLP tasks and processes, all in one library. It's maintained by the Stanford NLP Group. We are going to take a look at a few interactions that can be done with Rubrix.
[ ]:
%pip install stanza
Text Classification
Let's start by using a Sentiment Analysis model to log some TextClassificationRecords.
[ ]:
import rubrix as rb
import stanza
input_text = (
    "There are so many NLP libraries available, I don't know which one to choose!"
)
# Downloading our model, in case we don't have it cached
stanza.download("en")
# Creating the pipeline
nlp = stanza.Pipeline(lang="en", processors="tokenize,sentiment")
# Analyzing the input text
doc = nlp(input_text)
# This model returns 0 for negative, 1 for neutral and 2 for positive outcome.
# We are going to log them into Rubrix using a dictionary to translate numbers to labels.
num_to_labels = {0: "negative", 1: "neutral", 2: "positive"}
# Build a prediction entities list
# Stanza, at the moment, only outputs the most likely label, without a probability.
# So we will assume Stanza predicts the most likely label with probability 1.0, and the rest with 0.
entities = []
for sentence in doc.sentences:
    for key in num_to_labels:
        if key == sentence.sentiment:
            entities.append((num_to_labels[key], 1))
        else:
            entities.append((num_to_labels[key], 0))
# Building a TextClassificationRecord
record = rb.TextClassificationRecord(
    inputs=input_text,
    prediction=entities,
    prediction_agent="stanza/en",
)
# Logging into Rubrix
rb.log(records=record, name="stanza-sentiment")
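The loop in the example appends one group of (label, score) tuples per sentence, so a text with several sentences ends up with repeated labels. One possible alternative, shown here as a minimal sketch, is to report the share of sentences assigned to each label. This is only one convention among others, and `sentiment_distribution` is a hypothetical helper, not part of Stanza or Rubrix.

```python
from collections import Counter
from typing import Dict, List, Tuple


def sentiment_distribution(
    sentence_ids: List[int], id_to_label: Dict[int, str]
) -> List[Tuple[str, float]]:
    """Return, for each label, the fraction of sentences assigned to it."""
    if not sentence_ids:
        raise ValueError("at least one sentence is required")
    counts = Counter(sentence_ids)
    total = len(sentence_ids)
    return [(label, counts[idx] / total) for idx, label in id_to_label.items()]


num_to_labels = {0: "negative", 1: "neutral", 2: "positive"}

# Hypothetical per-sentence outputs, standing in for sentence.sentiment values
single = sentiment_distribution([1], num_to_labels)
several = sentiment_distribution([2, 2, 0, 1], num_to_labels)
print(single)
print(several)
```

For a single sentence this reduces to the same 1.0 and 0 assignment used in the example above.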
Token Classification
Stanza offers many different pretrained language models for Token Classification tasks, and the list keeps growing.
POS tagging
We can use one of the many UD models, used for POS tags, morphological features and syntactic relations. UD stands for Universal Dependencies, the framework on which these models have been trained. For this example, let's try to extract POS tags of some Catalan lyrics.
[ ]:
import rubrix as rb
import stanza
# Loading a cool Obrint Pas lyric
input_text = "Viure mantenint viva la flama a través del temps. La flama de tot un poble en moviment"
# Downloading our model, in case we don't have it cached
stanza.download("ca")
# Creating the pipeline
nlp = stanza.Pipeline(lang="ca", processors="tokenize,mwt,pos")
# Analyzing the input text
doc = nlp(input_text)
# Creating the prediction entity as a list of tuples (tag, start_char, end_char)
prediction = [
    (word.pos, token.start_char, token.end_char)
    for sent in doc.sentences
    for token in sent.tokens
    for word in token.words
]
# Building a TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[word.text for sent in doc.sentences for word in sent.words],
    prediction=prediction,
    prediction_agent="stanza/catalan",
)
# Logging into Rubrix
rb.log(records=record, name="stanza-catalan-pos")
NER
Stanza also offers a list of available pretrained models for NER tasks. So, let's try Russian.
[ ]:
import rubrix as rb
import stanza
input_text = (
    "Герра-и-Пас - одна из моих любимых книг"  # War and Peace is one of my favourite books
)
# Downloading our model, in case we don't have it cached
stanza.download("ru")
# Creating the pipeline
nlp = stanza.Pipeline(lang="ru", processors="tokenize,ner")
# Analyzing the input text
doc = nlp(input_text)
# Creating the prediction entity as a list of tuples (entity, start_char, end_char)
prediction = [
    (token.ner, token.start_char, token.end_char)
    for sent in doc.sentences
    for token in sent.tokens
]
# Building a TokenClassificationRecord
record = rb.TokenClassificationRecord(
    text=input_text,
    tokens=[word.text for sent in doc.sentences for word in sent.words],
    prediction=prediction,
    prediction_agent="stanza/russian",
)
# Logging into Rubrix
rb.log(records=record, name="stanza-russian-ner")
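The example above logs one tag per token, including tokens tagged as non-entities. If whole entities are preferred, token-level tags can be merged into spans first. The following is a minimal sketch, assuming the tags follow the BIOES scheme (B-, I-, E-, S- prefixes and O for non-entities); `bioes_to_spans` is a hypothetical helper and the tagged tokens are made up for illustration.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, int, int]  # (tag, start_char, end_char)


def bioes_to_spans(tagged_tokens: List[TaggedToken]) -> List[Tuple[str, int, int]]:
    """Merge BIOES token tags into (label, start_char, end_char) entity spans."""
    spans = []
    current = None  # (label, start_char) of the entity currently being built
    for tag, start, end in tagged_tokens:
        prefix, _, label = tag.partition("-")
        if prefix == "S":  # single-token entity
            spans.append((label, start, end))
            current = None
        elif prefix == "B":  # beginning of a multi-token entity
            current = (label, start)
        elif prefix == "I" and current is not None and current[0] == label:
            continue  # inside the entity being built, keep going
        elif prefix == "E" and current is not None and current[0] == label:
            spans.append((label, current[1], end))  # entity ends here
            current = None
        else:  # "O" or an inconsistent tag sequence: drop any partial entity
            current = None
    return spans


# Hypothetical tags for the text "Sarah lives in New York City"
tagged = [
    ("S-PER", 0, 5),
    ("O", 6, 11),
    ("O", 12, 14),
    ("B-LOC", 15, 18),
    ("I-LOC", 19, 23),
    ("E-LOC", 24, 28),
]
spans = bioes_to_spans(tagged)
print(spans)
```

The resulting list has the same (entity, start_char, end_char) shape used for the prediction in the example, so it could be passed to the record in its place.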