Weak supervision

This guide gives you a brief introduction to weak supervision with Rubrix.

Rubrix currently supports weak supervision for text classification use cases, but we’ll be adding support for token classification (e.g., Named Entity Recognition) soon.

This feature is experimental, so you can expect some changes in the Python API. Please report any issues you encounter on GitHub.

Labeling workflow

Rubrix weak supervision in a nutshell

Doing weak supervision with Rubrix should be straightforward. In keeping with the spirit of other parts of the library, you can use virtually any weak supervision library or method, such as Snorkel or Flyingsquid.

Rubrix weak supervision support is built around two basic abstractions:

Rule

A rule encodes a heuristic for labeling a record.

Heuristics can be defined using Elasticsearch’s queries:

plz = Rule(query="plz OR please", label="SPAM")

or with Python functions (similar to Snorkel’s labeling functions, which you can use as well):

def contains_http(record: rb.TextClassificationRecord) -> Optional[str]:
    if "http" in record.inputs["text"]:
        return "SPAM"

Besides textual features, Python labeling functions can exploit metadata features:

def author_channel(record: rb.TextClassificationRecord) -> Optional[str]:
    # the word channel appears in the comment author name
    if "channel" in record.metadata["author"]:
        return "SPAM"

A rule should return either a string value (the weak label) or None to abstain.
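
As a minimal sketch of this contract, the snippet below applies a labeling function to two stand-in records. `SimpleRecord` is a hypothetical placeholder that only mimics the `inputs` and `metadata` attributes of `rb.TextClassificationRecord`, so the example runs without a Rubrix instance.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SimpleRecord:
    """Hypothetical stand-in mimicking the `inputs` and `metadata` attributes of a record."""
    inputs: Dict[str, str]
    metadata: Dict[str, str] = field(default_factory=dict)


def contains_http(record) -> Optional[str]:
    # Vote "SPAM" when the text contains a link; otherwise abstain.
    if "http" in record.inputs["text"]:
        return "SPAM"
    return None  # explicit abstention


spam_vote = contains_http(SimpleRecord(inputs={"text": "visit http://example.com"}))
abstention = contains_http(SimpleRecord(inputs={"text": "great song"}))
print(spam_vote, abstention)
```

Returning None explicitly is optional in Python, but it makes the abstention case easier to spot when reading a rule.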

Weak Labels

Weak Labels objects bundle and apply a set of rules to the records of a Rubrix dataset. Applying a rule to a record means assigning a weak label or abstaining.

This abstraction provides you with the building blocks for training and testing weak supervision “denoising”, “label” or even “end” models:

rules = [contains_http, author_channel]
weak_labels = WeakLabels(
    rules=rules,
    dataset="weak_supervision_yt"
)

# returns a summary of the applied rules
weak_labels.summary()

More information about these abstractions can be found in the Python Labeling module docs.
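
To make the idea of applying rules more concrete, here is a rough, library-free sketch of how a weak label matrix can be built: one row per record, one column per rule, with labels encoded as integers. The `apply_rules` helper, the `label2int` values and the use of plain strings instead of records are assumptions made for illustration, not the Rubrix implementation.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical mapping, assuming abstentions are encoded as -1.
label2int: Dict[Optional[str], int] = {None: -1, "HAM": 0, "SPAM": 1}


def contains_http(text: str) -> Optional[str]:
    return "SPAM" if "http" in text else None


def short_comment(text: str) -> Optional[str]:
    return "HAM" if len(text.split()) < 5 else None


def apply_rules(texts: List[str], rules: List[Callable[[str], Optional[str]]]) -> List[List[int]]:
    # One row per record, one column per rule.
    return [[label2int[rule(text)] for rule in rules] for text in texts]


texts = [
    "check http://example.com",
    "this is a much longer comment about the video",
    "nice",
]
matrix = apply_rules(texts, [contains_http, short_comment])
print(matrix)
```

In this toy matrix the first record receives conflicting votes, the second is not covered by any rule, and the third gets a single HAM vote.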

Workflow

A typical workflow to use weak supervision is:

  1. Create a Rubrix dataset with your raw dataset. If you already have some labelled data, you can log it into the same dataset.

  2. Define a set of rules, exploring and trying out different things directly in the Rubrix web app.

  3. Create a WeakLabels object and apply the rules. Typically, you’ll iterate between this step and step 2.

  4. Once you are satisfied with your weak labels, use the matrix of the WeakLabels instance with your library/method of choice to build a training set or even train a downstream text classification model.

This guide shows you an end-to-end example using Snorkel and Flyingsquid. Let’s get started!

Example dataset

We’ll be using a well-known dataset for weak supervision examples, the YouTube Spam Collection dataset, which is a binary classification task for detecting spam comments in YouTube videos.

[74]:
import pandas as pd

train_df = pd.read_csv('../tutorials/data/yt_comments_train.csv')
test_df = pd.read_csv('../tutorials/data/yt_comments_test.csv')

train_df.head()
[74]:
  | Unnamed: 0 | author | date | text | label | video
0 | 0 | Alessandro leite | 2014-11-05T22:21:36 | pls http://www10.vakinha.com.br/VaquinhaE.aspx?e=313327 help me get vip gun cross fire al | -1.0 | 1
1 | 1 | Salim Tayara | 2014-11-02T14:33:30 | if your like drones, plz subscribe to Kamal Tayara. He takes videos with his drone that are absolutely beautiful. | -1.0 | 1
2 | 2 | Phuc Ly | 2014-01-20T15:27:47 | go here to check the views :3 | -1.0 | 1
3 | 3 | DropShotSk8r | 2014-01-19T04:27:18 | Came here to check the views, goodbye. | -1.0 | 1
4 | 4 | css403 | 2014-11-07T14:25:48 | i am 2,126,492,636 viewer :D | -1.0 | 1

1. Create a Rubrix dataset with unlabelled data and test data

Let’s log the train dataset (unlabelled) and the test dataset (containing labels) into the same Rubrix dataset.

[ ]:
import rubrix as rb

# unlabelled data
records = [
    rb.TextClassificationRecord(
        inputs=row.text,
        metadata={"video":row.video, "author": row.author}
    )
    for i,row in train_df.iterrows()
]
rb.log(records, name="weak_supervision_yt")
[ ]:
labels = ["HAM", "SPAM"]

# labelled data for testing
records = [
    rb.TextClassificationRecord(
        inputs=row.text,
        annotation=labels[int(row.label)],  # cast in case pandas parsed the label column as float
        metadata={"video":row.video, "author": row.author}
    )
    for i,row in test_df.iterrows()
]
rb.log(records, name="weak_supervision_yt")

After this step, you have a fully browsable dataset available at http://localhost:6900/weak_supervision_yt (or the base URL where your Rubrix instance is hosted).

2. Defining rules

Let’s now define some of the rules proposed in the Snorkel Intro Tutorial: Data Labeling.

Remember you can use Elasticsearch’s query string DSL and test your queries directly in the web app. Available fields in the query are described in the Rubrix web app reference.

[2]:
from rubrix.labeling.text_classification import Rule, WeakLabels

# Rules defined as Elasticsearch queries
check_out = Rule(query="check out", label="SPAM")
plz = Rule(query="plz OR please", label="SPAM")
subscribe = Rule(query="subscribe", label="SPAM")
my = Rule(query="my", label="SPAM")
song = Rule(query="song", label="HAM")
love = Rule(query="love", label="HAM")

Besides using the UI, if you want to quickly see the effect of a rule, you can do:

[72]:
# display full length text
pd.set_option('display.max_colwidth', None)

# Get the subset for the rule query
rb.load(name="weak_supervision_yt", query="plz OR please")[['inputs']]
[72]:
inputs
0 {'text': 'Our Beautiful Bella has been diagnosed with Wobbler's Syndrome. There is no way we could afford to do her MRI or surgery. She is not just a dog she is a very special member of our family. Without the surgery we fear we will lose her. Please help! http://www.gofundme.com/f7ekgw'}
1 {'text': 'I KNOW YOU MAY NOT WANT TO READ THIS BUT please do I'm 87 Cypher an 11 year old rapper I have skill people said .my stuff isn't as good as my new stuff but its good please check out my current songs comment and like thank you for reading rap is my life'}
2 {'text': 'Hello everyone my name's Anderson and i'm a singer. not expecting to buy subscribers with words BUT to gain them with my voice. I might not be the best but my voice is different (in a good way) and i'll work harder than anyone out there to get better, 'cuz "yeah" i have a dream a HUGE one, (who doesn't?) so please take 3 minutes of your time to check out my covers. Give me a chance you won't regret it If you feel like subscribing that'd be awesome and it'd mean the world to me THANK YOU SO MUCH'}
3 {'text': 'Please Subscribe In My Channel →'}
4 {'text': 'Hey ! I know most people don't like these kind of comments & see at spam, but I see as free advertising . So please check out my cover of Sparks Fly by Taylor Swift ! It is not the best ever I know, but maybe with some encouraging words of wisdom from many of you I can become better! Please go to my channel and check it out !'}
... ...
181 {'text': '♫I know someone will see this ♥ I have a dream… I don’t have the greatest videos or the best quality Right now I feel like i'm not getting anywhere and I need your help ♫ If you could possibly watch my videos it means the world to me ♥ Please thumbs this up so others can see… I appreciate it so much ♥♫ Please listen before you hate. Honestly i appreciate it so much You don’t have to love me just give this 17 year old a chance'}
182 {'text': 'Hi everyone. We are a duo and we are starting to record freestyles and put them on youtube. If any of you could check it out and like/comment it would mean so much to us because we love doing this. We may not have the best recording equipment but if you listen to our lyrics and rhymes I think you'll like it. If you do then please subscribe and share because we love making these videos and we want you to like them as much as possible so feel free to comment and give us pointers! Thank you!'}
183 {'text': 'http://www.ermail.pl/dolacz/UnNfY2I= Please click on the link'}
184 {'text': 'please suscribe i am bored of 5 subscribers try to get it to 20!'}
185 {'text': 'PLEASE SUBSCRIBE ME!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!'}

186 rows × 1 columns

You can also define plain Python labeling functions:

[ ]:
import re

# Rules defined as Python labeling functions
def contains_http(record: rb.TextClassificationRecord):
    if "http" in record.inputs["text"]:
        return "SPAM"

def short_comment(record: rb.TextClassificationRecord):
    return "HAM" if len(record.inputs["text"].split()) < 5 else None

def regex_check_out(record: rb.TextClassificationRecord):
    return "SPAM" if re.search(r"check.*out", record.inputs["text"], flags=re.I) else None
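
When several rules only differ in their keywords, a small factory can generate the labeling functions. The sketch below is one possible way to do this; `make_keyword_rule` is a hypothetical helper (not part of Rubrix), and `SimpleNamespace` merely stands in for a record with an `inputs` attribute.

```python
import re
from types import SimpleNamespace
from typing import Callable, List, Optional


def make_keyword_rule(keywords: List[str], label: str) -> Callable:
    """Hypothetical helper: build a labeling function that votes `label` on any keyword match."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keywords)) + r")\b", flags=re.I)

    def rule(record) -> Optional[str]:
        return label if pattern.search(record.inputs["text"]) else None

    # A readable function name helps tell the rules apart.
    rule.__name__ = "keyword_" + "_".join(keywords)
    return rule


keyword_plz = make_keyword_rule(["plz", "please"], label="SPAM")

# SimpleNamespace only mimics the `inputs` attribute of a Rubrix record.
record = SimpleNamespace(inputs={"text": "PLEASE subscribe to my channel"})
print(keyword_plz.__name__, keyword_plz(record))
```

The word boundaries in the pattern keep the rule from firing on longer words such as "pleased", which a plain substring check would match.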

3. Building and analyzing weak labels

[4]:
# bundle our rules in a list
rules = [check_out, plz, subscribe, my, song, love, contains_http, short_comment, regex_check_out]

# apply the rules to a dataset to obtain the weak labels
weak_labels = WeakLabels(
    rules=rules,
    dataset="weak_supervision_yt"
)

# show some stats about the rules, see the `summary()` docstring for details
weak_labels.summary()


[4]:
                 polarity     coverage  overlaps  conflicts  correct  incorrect  precision
check out        {SPAM}       0.235379  0.229147  0.028763   90       0          1.000000
plz OR please    {SPAM}       0.089166  0.079099  0.019175   40       0          1.000000
subscribe        {SPAM}       0.108341  0.084372  0.028763   60       0          1.000000
my               {SPAM}       0.190316  0.167306  0.050815   82       12         0.872340
song             {HAM}        0.139981  0.085331  0.034995   78       18         0.812500
love             {HAM}        0.097795  0.075743  0.032119   56       14         0.800000
contains_http    {SPAM}       0.096357  0.066155  0.045062   12       0          1.000000
short_comment    {HAM}        0.259827  0.113135  0.058965   168      16         0.913043
regex_check_out  {SPAM}       0.220997  0.220518  0.026846   90       0          1.000000
total            {HAM, SPAM}  0.764621  0.447267  0.116970   676      60         0.918478
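
To help read these columns, the sketch below computes the same kind of statistics for one rule of a toy matrix. It assumes the usual definitions from labeling-function analysis (coverage, overlaps and conflicts as fractions of all records, precision over annotated records only); `rule_stats` is a hypothetical helper, and the exact computation inside `summary()` may differ in details.

```python
from typing import Dict, List, Optional

ABSTAIN = -1  # assumed encoding for abstentions


def rule_stats(matrix: List[List[int]], annotations: List[Optional[int]], col: int) -> Dict[str, float]:
    n = len(matrix)
    covered = overlaps = conflicts = correct = incorrect = 0
    for row, gold in zip(matrix, annotations):
        vote = row[col]
        if vote == ABSTAIN:
            continue
        covered += 1
        # Votes of the other rules on the same record, ignoring abstentions.
        others = [v for j, v in enumerate(row) if j != col and v != ABSTAIN]
        overlaps += bool(others)
        conflicts += any(v != vote for v in others)
        if gold is not None:  # only annotated records count toward precision
            correct += vote == gold
            incorrect += vote != gold
    judged = correct + incorrect
    return {
        "coverage": covered / n,
        "overlaps": overlaps / n,
        "conflicts": conflicts / n,
        "correct": correct,
        "incorrect": incorrect,
        "precision": correct / judged if judged else float("nan"),
    }


# Toy data: 4 records, 2 rules; labels 0 = HAM, 1 = SPAM; None = no annotation.
matrix = [[1, 1], [1, 0], [-1, 0], [1, -1]]
annotations = [1, 0, None, 1]
stats = rule_stats(matrix, annotations, col=0)
print(stats)
```

Under these definitions, precision is simply correct / (correct + incorrect), which is consistent with the table above (for example, 82 / (82 + 12) ≈ 0.872 for the `my` rule).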

4. Using the weak labels

At this step you have at least two options:

  1. Use the weak labels for training a “denoising” or label model to build a less noisy training set. Highly popular options for this are Snorkel or Flyingsquid. After this step, you can train a downstream model with the “clean” labels.

  2. Use the weak labels directly with recent “end-to-end” (e.g., Weasel) or joint models (e.g., COSINE).

Let’s see some examples:

Label model with Snorkel

Snorkel is by far the most popular option for using weak supervision. Using Snorkel with Rubrix’s WeakLabels is as simple as:

[ ]:
%pip install snorkel -qqq
[ ]:
from snorkel.labeling.model import LabelModel

# train our label model
label_model = LabelModel()
label_model.fit(L_train=weak_labels.matrix(has_annotation=False))

# check its performance
label_model.score(L=weak_labels.matrix(has_annotation=True), Y=weak_labels.annotation())
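
For context when reading the label model score, it can help to compare it against a plain majority vote over the rules. The following is a library-free sketch of such a baseline, not the method Snorkel uses; the -1 abstain encoding, the tie handling and the toy data are assumptions.

```python
from collections import Counter
from typing import List

ABSTAIN = -1  # assumed encoding for abstentions and ties


def majority_vote(row: List[int]) -> int:
    votes = Counter(v for v in row if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    # A tie between the two leading labels counts as an abstention.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def accuracy(predictions: List[int], gold: List[int]) -> float:
    # Abstentions are simply counted as errors here.
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy weak label matrix (rows = records, columns = rules) and gold labels.
matrix = [[1, 1, 0], [0, -1, 0], [1, 0, -1], [-1, -1, -1]]
gold = [1, 0, 1, 0]
baseline = [majority_vote(row) for row in matrix]
print(baseline, accuracy(baseline, gold))
```

A label model is expected to improve on this baseline mainly when rules differ in accuracy or are correlated, since majority voting weighs every rule equally.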

Log Label model predictions into a Rubrix dataset

After fitting your label model, you can quickly explore its predictions, before building a training set for training a downstream text classifier.

This step is useful for validation, manual revision, or defining score thresholds for accepting labels from your label model (for example, only considering labels with a score greater than 0.8).
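
A score threshold of this kind could be applied as sketched below, assuming predictions are expressed as lists of (label, score) pairs. `confident_label` is a hypothetical helper and the scores are made-up toy values.

```python
from typing import List, Optional, Tuple

Prediction = List[Tuple[str, float]]


def confident_label(prediction: Prediction, threshold: float = 0.8) -> Optional[str]:
    """Return the top label if its score exceeds `threshold`, else None."""
    label, score = max(prediction, key=lambda pair: pair[1])
    return label if score > threshold else None


# Toy predictions as lists of (label, score) pairs.
preds = [
    [("SPAM", 0.95), ("HAM", 0.05)],
    [("SPAM", 0.55), ("HAM", 0.45)],
    [("SPAM", 0.10), ("HAM", 0.90)],
]
accepted = [confident_label(p) for p in preds]
print(accepted)
```

Records for which the helper returns None could then be left out of the training set or sent to manual review.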

[ ]:
# Get the part of the weak label matrix that has no corresponding annotation
train_matrix = weak_labels.matrix(has_annotation=False)

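# Get predictions from our label model
predictions = label_model.predict_proba(L=train_matrix)
predicted_labels = label_model.predict(L=train_matrix)

# Pair each probability column with its label name via the label2int mapping,
# instead of assuming a fixed column order
int2label = {i: label for label, i in weak_labels.label2int.items()}
preds = [[(int2label[i], prob) for i, prob in enumerate(pred)] for pred in predictions]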

# Get the records that do not have an annotation
train_records = weak_labels.records(has_annotation=False)
[ ]:
# Add the predictions to the records
def add_prediction(record, prediction):
    record.prediction = prediction
    return record

train_records_with_lm_prediction = [
    add_prediction(rec, pred)
    for rec, pred, label in zip(train_records, preds, predicted_labels)
    if label != weak_labels.label2int[None] # exclude records where the label model abstains
]

# Log a new dataset to Rubrix
rb.log(train_records_with_lm_prediction, name="snorkel_results")

Label model with Flyingsquid

Flyingsquid is a powerful method developed by Hazy Research, a research group from Stanford behind ground-breaking work on programmatic data labeling, including Snorkel. Flyingsquid fits the label model with a closed-form solution, which brings large speed gains with similar performance.

[21]:
%pip install flyingsquid pgmpy -qqq

Flyingsquid expects a different encoding of the votes, with 0 for abstentions. With Rubrix, you can define a custom label2int mapping like this:

[ ]:
weak_labels = WeakLabels(rules=rules, dataset="weak_supervision_yt", label2int={None: 0, 'SPAM': -1, 'HAM': 1})
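
To illustrate what the custom mapping changes, the sketch below re-encodes a toy matrix from a convention where abstentions are -1 into the one passed to `label2int` above (abstain 0, SPAM -1, HAM 1). The source mapping `default_label2int` and the `reencode` helper are assumptions made for the example; with Rubrix you would simply pass `label2int` as shown.

```python
from typing import Dict, List, Optional

# Assumed source encoding (for illustration only) and the target encoding from the text.
default_label2int: Dict[Optional[str], int] = {None: -1, "SPAM": 0, "HAM": 1}
flyingsquid_label2int: Dict[Optional[str], int] = {None: 0, "SPAM": -1, "HAM": 1}


def reencode(
    matrix: List[List[int]],
    source: Dict[Optional[str], int],
    target: Dict[Optional[str], int],
) -> List[List[int]]:
    # Go through the label names: source int -> label -> target int.
    int2label = {value: label for label, value in source.items()}
    return [[target[int2label[v]] for v in row] for row in matrix]


matrix = [[0, -1], [1, 1], [-1, 0]]
converted = reencode(matrix, default_label2int, flyingsquid_label2int)
print(converted)
```

Going through the label names rather than adding an offset avoids mixing up classes, because the two encodings do not differ by a constant shift.
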
[ ]:
from flyingsquid.label_model import LabelModel
import numpy as np

# train our label model
label_model = LabelModel(len(weak_labels.rules))
label_model.fit(L_train=weak_labels.matrix(has_annotation=False), verbose=True)

Log Label model predictions into a Rubrix dataset

[ ]:
# Get the part of the weak label matrix that has no corresponding annotation
train_matrix = weak_labels.matrix(has_annotation=False)

# Get predictions from our label model
predictions = label_model.predict_proba(L_matrix=train_matrix)
predicted_labels = label_model.predict(L_matrix=train_matrix)
preds = [[('SPAM', pred[0]), ('HAM', pred[1])] for pred in predictions]

# Get the records that do not have an annotation
train_records = weak_labels.records(has_annotation=False)
[ ]:
# Add the predictions to the records
def add_prediction(record, prediction):
    record.prediction = prediction
    return record

train_records_with_lm_prediction = [
    add_prediction(rec, pred)
    for rec, pred, label in zip(train_records, preds, predicted_labels)
    if label != weak_labels.label2int[None] # exclude records where the label model abstains
]

# Log a new dataset to Rubrix
rb.log(train_records_with_lm_prediction, name="flyingsquid_results")

Joint Model with Weasel

Weasel lets you train downstream models end-to-end, directly using weak labels.

Coming soon.