🔫 Zero-shot Named Entity Recognition with Flair#

In this tutorial you will learn how to analyze and validate NER predictions from the new zero-shot model provided by the Flair NLP library with Rubrix.

  • 🛠 Useful for quickly bootstrapping a training set (using Rubrix Annotation Mode) as well as integrating with weak-supervision workflows.

  • We will use a challenging, exciting dataset: wnut_17 (more info below).

  • 🔮 You will be able to see and work with the obtained predictions.

Introduction#

This tutorial will show you how to work with Named Entity Recognition (NER), Flair, and Rubrix. But what is NER?

According to Analytics Vidhya, “NER is a natural language processing technique that can automatically scan entire articles and pull out some fundamental entities in a text and classify them into predefined categories”. These entities can be names, quantities, dates and times, amounts of money/currencies, and much more.
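As a rough illustration of what "pulling out entities" could look like in practice, the sketch below represents entities as (label, start_char, end_char) tuples over a sentence. The sentence, labels, and offsets are invented for illustration and are not produced by any model.

```python
# Hypothetical example text and entity spans (invented for illustration).
text = "Ada Lovelace visited London in 1842."
entities = [
    ("person", 0, 12),
    ("location", 21, 27),
    ("date", 31, 35),
]


def span_texts(text, entities):
    """Return (label, surface_text) pairs for character-offset spans."""
    return [(label, text[start:end]) for label, start, end in entities]


print(span_texts(text, entities))
```

Each tuple points at a slice of the original text, so the entity string can always be recovered with `text[start:end]`.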

Flair, in turn, is a library that makes it easy to apply NLP models to NER and other NLP tasks in many different languages. It is not only powerful, but also intuitive to use.

Thanks to these resources and the Annotation Mode of Rubrix, we can quickly build up a dataset to train a domain-specific model.

Setup#

Rubrix is a free and open-source tool to explore, annotate, and monitor data for NLP projects.

If you are new to Rubrix, check out the GitHub repository ⭐.

If you have not installed and launched Rubrix yet, check the Setup and Installation guide.

For this tutorial we also need the third-party libraries datasets and flair, which can be installed via pip:

[ ]:
%pip install datasets flair -qqq

1. Load the wnut_17 dataset#

In this example, we’ll use a challenging NER dataset, “WNUT 17: Emerging and Rare entity recognition”, which focuses on unusual, previously unseen entities in the context of emerging discussions. This dataset is useful for getting a sense of the quality of our zero-shot predictions.

Let’s load the test set from the Hugging Face Hub:

[ ]:
from datasets import load_dataset

# download data set
dataset = load_dataset("wnut_17", split="test")
[2]:
# define labels
labels = ['corporation', 'creative-work', 'group', 'location', 'person', 'product']
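The label list above is written by hand. Assuming the dataset's tag names follow the usual BIO scheme (for example `B-person`, `I-person`, `O`), the same list could be derived programmatically. In the sketch below the tag names are hard-coded as an assumption; with the loaded dataset they would presumably come from `dataset.features["ner_tags"].feature.names`. The helper name `entity_labels` is hypothetical.

```python
def entity_labels(tag_names):
    """Strip BIO prefixes and return the sorted, de-duplicated entity labels."""
    found = set()
    for name in tag_names:
        prefix, _, label = name.partition("-")
        # Keep only "B-..." and "I-..." tags; "O" marks tokens outside entities.
        if prefix in ("B", "I") and label:
            found.add(label)
    return sorted(found)


# Assumed BIO tag names (not read from the dataset in this sketch).
bio_tag_names = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

print(entity_labels(bio_tag_names))
```

Deriving the labels this way avoids typos, since TARS uses the label strings themselves as part of the zero-shot task description.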

2. Configure Flair TARSTagger#

Now let’s configure our NER model, following Flair’s documentation:

[ ]:
from flair.models import TARSTagger

# load zero-shot NER tagger
tars = TARSTagger.load('tars-ner')

# define labels for named entities using wnut labels
tars.add_and_switch_to_new_task('task 1', labels, label_type='ner')

Let’s test it with one example!

[ ]:
from flair.data import Sentence

# wrap our tokens in a flair Sentence
sentence = Sentence(" ".join(dataset[0]['tokens']))
[6]:
# add predictions to our sentence
tars.predict(sentence)

# extract predicted entities into a list of tuples (label, start_char, end_char)
[
    (entity.get_labels()[0].value, entity.start_pos, entity.end_pos)
    for entity in sentence.get_spans("ner")
]
[6]:
[('location', 100, 107)]
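The numbers in the output are character offsets into the sentence text. Because that text is built by joining the dataset tokens with single spaces, an offset pair can be mapped back to token positions. The following is a minimal sketch with made-up tokens (not taken from wnut_17); `token_offsets` and `char_span_to_tokens` are hypothetical helper names.

```python
def token_offsets(tokens):
    """Character (start, end) offsets of each token in " ".join(tokens)."""
    offsets, cursor = [], 0
    for token in tokens:
        offsets.append((cursor, cursor + len(token)))
        cursor += len(token) + 1  # +1 for the joining space
    return offsets


def char_span_to_tokens(tokens, start, end):
    """Indices of the tokens that overlap the character span [start, end)."""
    return [
        i
        for i, (tok_start, tok_end) in enumerate(token_offsets(tokens))
        if tok_start < end and tok_end > start
    ]


# Made-up tokens for illustration.
tokens = ["I", "moved", "to", "New", "York", "yesterday"]
text = " ".join(tokens)
start = text.index("New York")
end = start + len("New York")

print(text[start:end], char_span_to_tokens(tokens, start, end))
```

Slicing the text with a predicted span, as in `text[start:end]`, is also a quick sanity check that a prediction covers the string you expect.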

3. Predict over wnut_17 and log into Rubrix#

Now, let’s log the predictions in Rubrix:

[ ]:
import rubrix as rb

# build records for the first 100 examples
records = []
for record in dataset.select(range(100)):
    input_text = " ".join(record["tokens"])

    sentence = Sentence(input_text)
    tars.predict(sentence)
    prediction = [
        (entity.get_labels()[0].value, entity.start_pos, entity.end_pos)
        for entity in sentence.get_spans("ner")
    ]

    # building TokenClassificationRecord
    records.append(
        rb.TokenClassificationRecord(
            text=input_text,
            tokens=[token.text for token in sentence],
            prediction=prediction,
            prediction_agent="tars-ner",
        )
    )

# log the records to Rubrix
rb.log(records, name='tars_ner_wnut_17', metadata={"split": "test"})

Now you can explore the results in Rubrix! With the Annotation Mode, you can change, add, validate, or discard the predictions. Statistics are also available to help you monitor your records.
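To go beyond visual inspection, the predicted spans could also be compared with the gold tags that ship with the dataset. The sketch below converts BIO tags into the same (label, start_char, end_char) format used for the predictions, assuming the text is built by joining tokens with single spaces. The tokens and tags are made up, and `bio_to_char_spans` is a hypothetical helper, not part of Rubrix or Flair.

```python
def bio_to_char_spans(tokens, tags):
    """Convert BIO tags to (label, start_char, end_char) over " ".join(tokens)."""
    spans = []
    current = None  # [label, start_char, end_char] of the entity being built
    cursor = 0
    for token, tag in zip(tokens, tags):
        tok_start, tok_end = cursor, cursor + len(token)
        cursor = tok_end + 1  # skip the joining space
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current[0] == label:
            current[2] = tok_end  # extend the open entity
            continue
        if current is not None:
            spans.append(tuple(current))
            current = None
        # A stray "I-" tag without a matching open entity starts a new one.
        if prefix in ("B", "I"):
            current = [label, tok_start, tok_end]
    if current is not None:
        spans.append(tuple(current))
    return spans


# Made-up tokens and tags for illustration.
tokens = ["I", "moved", "to", "New", "York", "yesterday"]
tags = ["O", "O", "O", "B-location", "I-location", "O"]

print(bio_to_char_spans(tokens, tags))
```

Spans in this format could presumably be passed as the `annotation` argument of `rb.TokenClassificationRecord`, provided they align with the record's tokens; since the records above use Flair's tokenization, that alignment would need to be checked first.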

Summary#

Getting predictions with a zero-shot approach can be very helpful to guide humans in their annotation process. Especially for NER tasks, Rubrix makes it very easy to explore and correct those predictions thanks to its Annotation Mode 😎.

Next steps#

โญ Star Rubrix Github repo to stay updated.

📚 Check the Rubrix documentation for more guides and tutorials.

🙋‍♀️ Join the Rubrix community! A good place to start is the discussion forum.
