Labeling (Experimental)

The rubrix.labeling module aims to provide tools that enhance your labeling workflow.

Text classification

Labeling tools for the text classification task.

class rubrix.labeling.text_classification.rule.Rule(query: str, label: str, name: Optional[str] = None, author: Optional[str] = None)

A rule (labeling function) in the form of an ElasticSearch query.

Parameters
  • query (str) – An ElasticSearch query with the query string syntax.

  • label (str) – The label associated with the query.

  • name (Optional[str]) – An optional name for the rule to be used as an identifier in the rubrix.labeling.text_classification.WeakLabels class. By default, we will use the query string.

  • author (Optional[str]) – An optional author of the rule.

Examples

>>> import rubrix as rb
>>> urgent_rule = Rule(query="inputs.text:(urgent AND immediately)", label="urgent", name="urgent_rule")
>>> not_urgent_rule = Rule(query="inputs.text:(NOT urgent) AND metadata.title_length>20", label="not urgent")
>>> not_urgent_rule.apply("my_dataset")
>>> my_dataset_records = rb.load(name="my_dataset", as_pandas=False)
>>> not_urgent_rule(my_dataset_records[0])
"not urgent"
__call__(record: rubrix.client.models.TextClassificationRecord) Optional[str]

Check if the given record is among the matching ids from the self.apply call.

Parameters

record (rubrix.client.models.TextClassificationRecord) – The record to be labeled.

Returns

A label if the record id is among the matching ids, otherwise None.

Raises

RuleNotAppliedError – If the rule was not applied to the dataset before.

Return type

Optional[str]

apply(dataset: str)

Apply the rule to a dataset and save matching ids of the records.

Parameters

dataset (str) – The name of the dataset.

property author

Who authored the rule.

property label: str

The rule label.

metrics(dataset: str) Dict[str, Union[int, float]]

Compute the rule metrics for a given dataset:

  • coverage: Fraction of the records labeled by the rule.

  • annotated_coverage: Fraction of annotated records labeled by the rule.

  • correct: Number of records the rule labeled correctly (if annotations are available).

  • incorrect: Number of records the rule labeled incorrectly (if annotations are available).

  • precision: Fraction of correct labels given by the rule (if annotations are available). The precision does not penalize the rule for abstains.

Parameters

dataset (str) – Name of the dataset for which to compute the rule metrics.

Returns

The rule metrics.

Return type

Dict[str, Union[int, float]]
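As a quick sketch of the precision computation (hypothetical counts, not taken from any real dataset): precision is computed only over the records the rule actually labeled, so abstentions do not lower it.

```python
# Hypothetical counts for illustration only: the rule labeled 100 records
# and abstained on the rest.
correct, incorrect = 90, 10

# Precision ignores abstentions: only labeled records enter the ratio.
precision = correct / (correct + incorrect)
print(precision)  # 0.9
```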

property name

The name of the rule.

property query: str

The rule query.

rubrix.labeling.text_classification.rule.load_rules(dataset: str) List[rubrix.labeling.text_classification.rule.Rule]

Load the rules defined in a given dataset.

Parameters

dataset (str) – Name of the dataset.

Returns

A list of rules defined in the given dataset.

Return type

List[rubrix.labeling.text_classification.rule.Rule]

class rubrix.labeling.text_classification.weak_labels.WeakLabels(rules: List[Callable], dataset: str, ids: Optional[List[Union[str, int]]] = None, query: Optional[str] = None, label2int: Optional[Dict[Optional[str], int]] = None)

Computes the weak labels of a dataset by applying a given list of rules.

Parameters
  • rules (List[Callable]) – A list of rules (labeling functions). They must return a string, or None in case of abstention.

  • dataset (str) – Name of the dataset to which the rules will be applied.

  • ids (Optional[List[Union[int, str]]]) – An optional list of record ids to filter the dataset before applying the rules.

  • query (Optional[str]) – An optional ElasticSearch query with the query string syntax to filter the dataset before applying the rules.

  • label2int (Optional[Dict[Optional[str], int]]) – An optional dict, mapping the labels to integers. Remember that the return type None means abstention (e.g. {None: -1}). By default, we will build a mapping on the fly when applying the rules.

Raises
  • DuplicatedRuleNameError – When you provided multiple rules with the same name.

  • NoRecordsFoundError – When the filtered dataset is empty.

  • MultiLabelError – When trying to get weak labels for a multi-label text classification task.

  • MissingLabelError – When provided with a label2int dict, and a weak label or annotation label is not present in its keys.

Examples

Get the weak label matrix and a summary of the applied rules:

>>> def awesome_rule(record: TextClassificationRecord) -> str:
...     return "Positive" if "awesome" in record.inputs["text"] else None
>>> another_rule = Rule(query="good OR best", label="Positive")
>>> weak_labels = WeakLabels(rules=[awesome_rule, another_rule], dataset="my_dataset")
>>> weak_labels.matrix()
>>> weak_labels.summary()

Use snorkel’s LabelModel:

>>> from snorkel.labeling.model import LabelModel
>>> label_model = LabelModel()
>>> label_model.fit(L_train=weak_labels.matrix(has_annotation=False))
>>> label_model.score(L=weak_labels.matrix(has_annotation=True), Y=weak_labels.annotation())
>>> label_model.predict(L=weak_labels.matrix(has_annotation=False))
annotation(exclude_missing_annotations: bool = True) numpy.ndarray

Returns the annotation labels as an array of integers.

Parameters

exclude_missing_annotations (bool) – If True, excludes all entries with the self.label2int[None] integer, that is, all records for which an annotation is missing.

Returns

The annotation array of integers.

Return type

numpy.ndarray
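A minimal sketch of what exclude_missing_annotations does, assuming a hypothetical mapping in which self.label2int[None] is -1:

```python
import numpy as np

# Hypothetical annotation array; -1 stands in for label2int[None],
# i.e. a missing annotation.
annotation = np.array([1, -1, 0, 1])

# exclude_missing_annotations=True drops the entries equal to label2int[None].
filtered = annotation[annotation != -1]
print(filtered)  # [1 0 1]
```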

change_mapping(label2int: Dict[str, int])

Allows you to change the mapping between labels and integers.

This will update the self.matrix as well as the self.annotation.

Parameters

label2int (Dict[str, int]) – New label to integer mapping. Must cover all previous labels.

property int2label: Dict[int, Optional[str]]

The dictionary that maps integers to weak/annotation labels.

property label2int: Dict[Optional[str], int]

The dictionary that maps weak/annotation labels to integers.
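The two dictionaries are inverses of each other. A minimal sketch with a hypothetical mapping (by default, WeakLabels builds the mapping on the fly):

```python
# Hypothetical mapping; remember that None means abstention.
label2int = {None: -1, "Negative": 0, "Positive": 1}

# int2label is simply the inverse of label2int.
int2label = {i: label for label, i in label2int.items()}
print(int2label[1])  # Positive
```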

matrix(has_annotation: Optional[bool] = None) numpy.ndarray

Returns the weak label matrix, or optionally just a part of it.

Parameters

has_annotation (Optional[bool]) – If True, return only the part of the matrix that has a corresponding annotation. If False, return only the part of the matrix that does NOT have a corresponding annotation. By default, we return the whole weak label matrix.

Returns

The weak label matrix, or optionally just a part of it.

Return type

numpy.ndarray
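To picture the matrix: one row per record, one column per rule, with the self.label2int[None] integer marking abstentions. A sketch with hypothetical values (using -1 for abstention is an assumption here, not a fixed default):

```python
import numpy as np

# Hypothetical weak label matrix for 4 records and 2 rules; -1 marks abstention.
matrix = np.array([
    [ 1, -1],
    [-1, -1],
    [ 0,  1],
    [ 1,  1],
])

# Coverage of the first rule: fraction of records it did not abstain on.
coverage_rule_0 = float((matrix[:, 0] != -1).mean())
print(coverage_rule_0)  # 0.75
```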

records(has_annotation: Optional[bool] = None) List[rubrix.client.models.TextClassificationRecord]

Returns the records corresponding to the weak label matrix.

Parameters

has_annotation (Optional[bool]) – If True, return only the records that have an annotation. If False, return only the records that have NO annotation. By default, we return all the records.

Returns

A list of records, or optionally just a part of them.

Return type

List[rubrix.client.models.TextClassificationRecord]

property rules: List[Callable]

The rules (labeling functions) that were used to produce the weak labels.

show_records(labels: Optional[List[str]] = None, rules: Optional[List[Union[str, int]]] = None) pandas.core.frame.DataFrame

Shows records in a pandas DataFrame, optionally filtered by weak labels and non-abstaining rules.

If you provide both labels and rules, we take the intersection of both filters.

Parameters
  • labels (Optional[List[str]]) – Return only the records whose weak labels contain all of these labels. If None, do not filter by labels.

  • rules (Optional[List[Union[str, int]]]) – Return only the records for which none of these rules abstained. If None, do not filter by rules. You can refer to the rules by their (function) name or by their index in the self.rules list.

Returns

The optionally filtered records as a pandas DataFrame.

Return type

pandas.core.frame.DataFrame

summary(normalize_by_coverage: bool = False, annotation: Optional[numpy.ndarray] = None) pandas.core.frame.DataFrame

Returns the following summary statistics for each rule:

  • label: Set of unique labels returned by the rule, excluding “None” (abstain).

  • coverage: Fraction of the records labeled by the rule.

  • annotated_coverage: Fraction of annotated records labeled by the rule (if annotations are available).

  • overlaps: Fraction of the records labeled by the rule together with at least one other rule.

  • conflicts: Fraction of the records where the rule disagrees with at least one other rule.

  • correct: Number of records the rule labeled correctly (if annotations are available).

  • incorrect: Number of records the rule labeled incorrectly (if annotations are available).

  • precision: Fraction of correct labels given by the rule (if annotations are available). The precision does not penalize the rule for abstains.

Parameters
  • normalize_by_coverage (bool) – Normalize the overlaps and conflicts by the respective coverage.

  • annotation (Optional[numpy.ndarray]) – An optional array with ints holding the annotations. By default we will use self.annotation(exclude_missing_annotations=False).

Returns

The summary statistics for each rule in a pandas DataFrame.

Return type

pandas.core.frame.DataFrame
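To illustrate normalize_by_coverage with hypothetical figures (a sketch of the intended reading, not real output): dividing by the rule's coverage answers "of the records this rule labeled, what fraction is conflicting?".

```python
# Hypothetical summary figures for a single rule.
coverage = 0.40   # the rule labels 40% of all records
conflicts = 0.10  # it disagrees with another rule on 10% of all records

# With normalize_by_coverage=True, conflicts are reported relative to coverage.
normalized_conflicts = conflicts / coverage
print(normalized_conflicts)  # 0.25
```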

class rubrix.labeling.text_classification.label_models.FlyingSquid(weak_labels: rubrix.labeling.text_classification.weak_labels.WeakLabels, **kwargs)

The label model by FlyingSquid.

Parameters
  • weak_labels (rubrix.labeling.text_classification.weak_labels.WeakLabels) – A WeakLabels object containing the weak labels and records.

  • **kwargs – Passed on to the init of FlyingSquid’s LabelModel.

Examples

>>> from rubrix.labeling.text_classification import Rule, WeakLabels
>>> rule = Rule(query="good OR best", label="Positive")
>>> weak_labels = WeakLabels(rules=[rule], dataset="my_dataset")
>>> label_model = FlyingSquid(weak_labels)
>>> label_model.fit()
>>> records = label_model.predict()
fit(include_annotated_records: bool = False, **kwargs)

Fits the label model.

Parameters
  • include_annotated_records (bool) – Whether or not to include annotated records in the training.

  • **kwargs – Passed on to the FlyingSquid’s LabelModel.fit() method.

predict(include_annotated_records: bool = False, include_abstentions: bool = False, verbose: bool = True, tie_break_policy: str = 'abstain') List[rubrix.client.models.TextClassificationRecord]

Applies the label model.

Parameters
  • include_annotated_records (bool) – Whether or not to include annotated records.

  • include_abstentions (bool) – Whether or not to include records in the output for which the label model abstained.

  • verbose (bool) – If True, print out messages of the progress to stderr.

  • tie_break_policy (str) –

    Policy to break ties. You can choose among two policies:

    • abstain: Do not provide any prediction

    • random: randomly choose among the tied options using a deterministic hash

    The last policy can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all of the labeling functions abstained.

Returns

A list of records that include the predictions of the label model.

Raises

NotFittedError – If the label model has not been fitted yet.

Return type

List[rubrix.client.models.TextClassificationRecord]

score(tie_break_policy: Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str] = 'abstain', verbose: bool = False) Dict[str, float]

Returns some scores of the label model with respect to the annotated records.

Parameters
  • tie_break_policy (Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str]) –

    Policy to break ties. You can choose among two policies:

    • abstain: Do not provide any prediction

    • random: randomly choose among the tied options using a deterministic hash

    The last policy can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all of the labeling functions abstained.

  • verbose (bool) – If True, print out messages of the progress to stderr.

Returns

The scores/metrics as a dictionary.

Raises
  • NotFittedError – If the label model has not been fitted yet.

  • MissingAnnotationError – If the weak_labels do not contain annotated records.

Return type

Dict[str, float]

class rubrix.labeling.text_classification.label_models.LabelModel(weak_labels: rubrix.labeling.text_classification.weak_labels.WeakLabels)

Abstract base class for a label model implementation.

Parameters

weak_labels (rubrix.labeling.text_classification.weak_labels.WeakLabels) – Every label model implementation needs at least a WeakLabels instance.

fit(include_annotated_records: bool = False, *args, **kwargs)

Fits the label model.

Parameters

include_annotated_records (bool) – Whether or not to include annotated records in the training.

predict(include_annotated_records: bool = False, include_abstentions: bool = False, **kwargs) List[rubrix.client.models.TextClassificationRecord]

Applies the label model.

Parameters
  • include_annotated_records (bool) – Whether or not to include annotated records.

  • include_abstentions (bool) – Whether or not to include records in the output for which the label model abstained.

Returns

A list of records that include the predictions of the label model.

Return type

List[rubrix.client.models.TextClassificationRecord]

score(*args, **kwargs) Dict

Evaluates the label model.

Return type

Dict

property weak_labels: rubrix.labeling.text_classification.weak_labels.WeakLabels

The underlying WeakLabels object, containing the weak labels and records.

class rubrix.labeling.text_classification.label_models.Snorkel(weak_labels: rubrix.labeling.text_classification.weak_labels.WeakLabels, verbose: bool = True, device: str = 'cpu')

The label model by Snorkel.

Parameters
  • weak_labels (rubrix.labeling.text_classification.weak_labels.WeakLabels) – A WeakLabels object containing the weak labels and records.

  • verbose (bool) – Whether to show print statements.

  • device (str) – What device to place the model on (‘cpu’ or ‘cuda:0’, for example). Passed on to the torch.Tensor.to() calls.

Examples

>>> from rubrix.labeling.text_classification import Rule, WeakLabels
>>> rule = Rule(query="good OR best", label="Positive")
>>> weak_labels = WeakLabels(rules=[rule], dataset="my_dataset")
>>> label_model = Snorkel(weak_labels)
>>> label_model.fit()
>>> records = label_model.predict()
fit(include_annotated_records: bool = False, **kwargs)

Fits the label model.

Parameters
  • include_annotated_records (bool) – Whether or not to include annotated records in the training.

  • **kwargs – Additional kwargs are passed on to Snorkel’s fit method. They must not contain L_train; the label matrix is provided automatically.

predict(include_annotated_records: bool = False, include_abstentions: bool = False, tie_break_policy: Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str] = 'abstain') List[rubrix.client.models.TextClassificationRecord]

Returns a list of records that contain the predictions of the label model.

Parameters
  • include_annotated_records (bool) – Whether or not to include annotated records.

  • include_abstentions (bool) – Whether or not to include records in the output for which the label model abstained.

  • tie_break_policy (Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str]) –

    Policy to break ties. You can choose among three policies:

    • abstain: Do not provide any prediction

    • random: randomly choose among the tied options using a deterministic hash

    • true-random: randomly choose among the tied options. NOTE: repeated runs may have slightly different results due to differences in broken ties.

    The last two policies can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all of the labeling functions abstained.

Returns

A list of records that include the predictions of the label model.

Return type

List[rubrix.client.models.TextClassificationRecord]

score(tie_break_policy: Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str] = 'abstain') Dict[str, float]

Returns some scores of the label model with respect to the annotated records.

Parameters

tie_break_policy (Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str]) –

Policy to break ties. You can choose among three policies:

  • abstain: Do not provide any prediction

  • random: randomly choose among the tied options using a deterministic hash

  • true-random: randomly choose among the tied options. NOTE: repeated runs may have slightly different results due to differences in broken ties.

The last two policies can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all of the labeling functions abstained.

Returns

The scores/metrics as a dictionary.

Raises

MissingAnnotationError – If the weak_labels do not contain annotated records.

Return type

Dict[str, float]

rubrix.labeling.text_classification.label_errors.find_label_errors(records: List[rubrix.client.models.TextClassificationRecord], sort_by: Union[str, rubrix.labeling.text_classification.label_errors.SortBy] = 'likelihood', metadata_key: str = 'label_error_candidate', **kwargs) List[rubrix.client.models.TextClassificationRecord]

Finds potential annotation/label errors in your records using cleanlab (https://github.com/cleanlab/cleanlab).

We will consider all records for which both a prediction AND an annotation are available. Make sure the predictions were made in a holdout manner, that is, you should only include records that were not used in the training of the predictor.

Parameters
  • records (List[rubrix.client.models.TextClassificationRecord]) – A list of text classification records

  • sort_by (Union[str, rubrix.labeling.text_classification.label_errors.SortBy]) –

    One of three options:

    • “likelihood”: sort the returned records by the likelihood of containing a label error (most likely first)

    • “prediction”: sort the returned records by the probability of the prediction (highest probability first)

    • “none”: do not sort the returned records

  • metadata_key (str) – The key added to the record’s metadata that holds the order, if sort_by is not “none”.

  • **kwargs – Passed on to cleanlab.pruning.get_noise_indices

Returns

A list of records containing potential annotation/label errors

Raises
  • NoRecordsError – If none of the records has a prediction AND annotation.

  • MissingPredictionError – If a prediction is missing for one of the labels.

  • ValueError – If unsupported kwargs are passed, e.g. ‘sorted_index_method’.

Return type

List[rubrix.client.models.TextClassificationRecord]

Examples

>>> import rubrix as rb
>>> records = rb.load("my_dataset", as_pandas=False)
>>> records_with_label_errors = find_label_errors(records)