Labeling (Experimental)

The rubrix.labeling module aims to provide tools that enhance your labeling workflow.

Text classification

Labeling tools for the text classification task.

class rubrix.labeling.text_classification.weak_labels.WeakLabels(rules, dataset, ids=None, query=None, label2int=None)

Computes the weak labels of a dataset by applying a given list of rules.

Parameters
  • rules (List[Callable]) – A list of rules (labeling functions). They must return a string, or None in case of abstention.

  • dataset (str) – Name of the dataset to which the rules will be applied.

  • ids (Optional[List[Union[int, str]]]) – An optional list of record ids to filter the dataset before applying the rules.

  • query (Optional[str]) – An optional ElasticSearch query with the query string syntax to filter the dataset before applying the rules.

  • label2int (Optional[Dict[Optional[str], int]]) – An optional dict that maps the labels to integers. Remember that a rule returning None means abstention, so None needs its own entry (e.g. {None: -1}). By default, we build the mapping on the fly when applying the rules.

Raises
  • MultiLabelError – When trying to get weak labels for a multi-label text classification task.

  • MissingLabelError – When provided with a label2int dict, and a weak label or annotation label is not present in its keys.
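
As a rough illustration of the label2int parameter, the sketch below builds such a mapping on the fly from rule outputs, reserving -1 for abstention, and fails loudly when a label is missing from the mapping. The helper names build_label2int and encode, and the local MissingLabelError class, are hypothetical stand-ins and not the rubrix implementation; the integer order rubrix actually assigns may differ.

```python
from typing import Dict, Iterable, Optional


class MissingLabelError(Exception):
    """Local stand-in for the error named above (not the rubrix class)."""


def build_label2int(labels: Iterable[Optional[str]]) -> Dict[Optional[str], int]:
    """Hypothetical helper: map None (abstention) to -1, each new label to the next int."""
    label2int: Dict[Optional[str], int] = {None: -1}
    for label in labels:
        if label not in label2int:
            # len() also counts the None entry, so the first real label gets 0.
            label2int[label] = len(label2int) - 1
    return label2int


def encode(label: Optional[str], label2int: Dict[Optional[str], int]) -> int:
    """Look up a label and raise if the provided mapping does not contain it."""
    if label not in label2int:
        raise MissingLabelError(f"Label {label!r} is not a key of label2int")
    return label2int[label]


# Example rule outputs; None marks an abstention.
rule_outputs = ["Positive", None, "Negative", "Positive"]
label2int = build_label2int(rule_outputs)
encoded = [encode(label, label2int) for label in rule_outputs]
print(label2int)  # {None: -1, 'Positive': 0, 'Negative': 1}
print(encoded)    # [0, -1, 1, 0]
```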

Examples

Get the weak label matrix and a summary of the applied rules:

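>>> from typing import Optional
>>> from rubrix.client.models import TextClassificationRecord
>>> from rubrix.labeling.text_classification.rule import Rule
>>> from rubrix.labeling.text_classification.weak_labels import WeakLabels
>>> def awesome_rule(record: TextClassificationRecord) -> Optional[str]:
...     return "Positive" if "awesome" in record.inputs["text"] else None
>>> another_rule = Rule(query="good OR best", label="Positive")
>>> weak_labels = WeakLabels(rules=[awesome_rule, another_rule], dataset="my_dataset")
>>> weak_labels.matrix()
>>> weak_labels.summary()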

Use snorkel’s LabelModel:

>>> from snorkel.labeling.model import LabelModel
>>> label_model = LabelModel()
>>> label_model.fit(L_train=weak_labels.matrix(has_annotation=False))
>>> label_model.score(L=weak_labels.matrix(has_annotation=True), Y=weak_labels.annotation())
>>> label_model.predict(L=weak_labels.matrix(has_annotation=False))

annotation(exclude_missing_annotations=True)

Returns the annotation labels as an array of integers.

Parameters

exclude_missing_annotations (bool) – If True, excludes missing annotations, that is, all entries equal to the integer self.label2int[None].

Returns

The annotation array of integers.

Return type

numpy.ndarray

property int2label: Dict[int, Optional[str]]

The dictionary that maps integers to weak/annotation labels.

property label2int: Dict[Optional[str], int]

The dictionary that maps weak/annotation labels to integers.
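
The two mappings are inverses of each other. A minimal sketch, assuming an example label2int dict (the values here are made up for illustration), shows how integer predictions can be decoded back into labels:

```python
from typing import Dict, List, Optional

# Assumed example mapping; None stands for abstention.
label2int: Dict[Optional[str], int] = {None: -1, "Positive": 0, "Negative": 1}

# Invert the mapping to go from integers back to labels.
int2label: Dict[int, Optional[str]] = {i: label for label, i in label2int.items()}

# Example integer predictions, e.g. from a label model.
predictions: List[int] = [0, 1, -1]
decoded = [int2label[i] for i in predictions]
print(decoded)  # ['Positive', 'Negative', None]
```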

matrix(has_annotation=None)

Returns the weak label matrix, or optionally just a part of it.

Parameters

has_annotation (Optional[bool]) – If True, return only the part of the matrix that has a corresponding annotation. If False, return only the part of the matrix that does not have a corresponding annotation. By default, we return the whole weak label matrix.

Returns

The weak label matrix, or optionally just a part of it.

Return type

numpy.ndarray
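
The has_annotation filter can be pictured with plain Python lists. This is a sketch under assumptions, not the rubrix code: the matrix values, the ABSTAIN constant (standing for label2int[None]) and the select_rows helper are all made up for illustration. It also shows why the annotated part of the matrix lines up row by row with the annotation array, as the snorkel example above relies on.

```python
from typing import List, Optional

ABSTAIN = -1  # assumed value of label2int[None]

# One row per record, one column per rule (example values).
weak_label_matrix: List[List[int]] = [
    [0, -1],
    [-1, -1],
    [1, 1],
    [0, 1],
]
# One annotation per record; ABSTAIN marks a record without annotation.
annotations: List[int] = [0, -1, 1, -1]


def select_rows(has_annotation: Optional[bool] = None) -> List[List[int]]:
    """Hypothetical helper: return all rows, or only the (un)annotated ones."""
    if has_annotation is None:
        return weak_label_matrix
    return [
        row
        for row, annotation in zip(weak_label_matrix, annotations)
        if (annotation != ABSTAIN) == has_annotation
    ]


annotated = select_rows(has_annotation=True)
unannotated = select_rows(has_annotation=False)
# Annotations with the missing entries excluded, in the same record order.
annotation_array = [a for a in annotations if a != ABSTAIN]

print(annotated)         # [[0, -1], [1, 1]]
print(unannotated)       # [[-1, -1], [0, 1]]
print(annotation_array)  # [0, 1]
```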

records(has_annotation=None)

Returns the records corresponding to the weak label matrix.

Parameters

has_annotation (Optional[bool]) – If True, return only the records that have an annotation. If False, return only the records that have NO annotation. By default, we return all the records.

Returns

A list of records, or optionally just a part of them.

Return type

List[rubrix.client.models.TextClassificationRecord]

property rules: List[Callable]

The rules (labeling functions) that were used to produce the weak labels.

show_records(labels=None, rules=None)

Shows records in a pandas DataFrame, optionally filtered by weak labels and non-abstaining rules.

If you provide both labels and rules, we take the intersection of both filters.

Parameters
  • labels (Optional[List[str]]) – Only show records whose weak labels include all of these labels. If None, do not filter by labels.

  • rules (Optional[List[Union[int, str]]]) – Only show records for which none of these rules abstained. If None, do not filter by rules. You can refer to the rules by their (function) name or by their index in the self.rules list.

Returns

The optionally filtered records as a pandas DataFrame.

Return type

pandas.core.frame.DataFrame
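
The filter semantics of show_records can be sketched without pandas. The rule names, the example weak labels and the matches helper below are assumptions made for illustration; the sketch only mirrors the description above (all given labels present, none of the given rules abstaining, both filters combined by intersection).

```python
from typing import List, Optional, Sequence, Union

# Assumed rule names, in the order of the (hypothetical) rules list.
rule_names = ["awesome_rule", "good OR best"]

# Weak labels per record, one entry per rule; None means the rule abstained.
weak_labels_per_record: List[List[Optional[str]]] = [
    ["Positive", None],
    ["Positive", "Positive"],
    [None, "Negative"],
    [None, None],
]


def matches(
    row: Sequence[Optional[str]],
    labels: Optional[List[str]] = None,
    rules: Optional[List[Union[int, str]]] = None,
) -> bool:
    """Hypothetical helper: True if the record passes both optional filters."""
    if labels is not None and not all(label in row for label in labels):
        return False
    if rules is not None:
        for rule in rules:
            # Rules can be referenced by index or by name.
            index = rule if isinstance(rule, int) else rule_names.index(rule)
            if row[index] is None:
                return False
    return True


both_filters = [
    i for i, row in enumerate(weak_labels_per_record)
    if matches(row, labels=["Positive"], rules=[1])
]
by_rule_name = [
    i for i, row in enumerate(weak_labels_per_record)
    if matches(row, rules=["awesome_rule"])
]
print(both_filters)  # [1]
print(by_rule_name)  # [0, 1]
```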

summary(normalize_by_coverage=False, annotation=None)

Returns the following summary statistics for each rule:

  • polarity: Set of unique labels returned by the rule, excluding “None” (abstain).

  • coverage: Fraction of the records labeled by the rule.

  • overlaps: Fraction of the records labeled by the rule together with at least one other rule.

  • conflicts: Fraction of the records where the rule disagrees with at least one other rule.

  • correct: Number of records the rule labeled correctly (if annotations are available).

  • incorrect: Number of records the rule labeled incorrectly (if annotations are available).

  • precision: Fraction of correct labels given by the rule (if annotations are available). Abstentions do not count against the precision.

Parameters
  • normalize_by_coverage (bool) – Normalize the overlaps and conflicts by the respective coverage.

  • annotation (Optional[numpy.ndarray]) – An optional array with ints holding the annotations. By default we will use self.annotation(exclude_missing_annotations=False).

Returns

The summary statistics for each rule in a pandas DataFrame.

Return type

pandas.core.frame.DataFrame
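
To make the statistics concrete, here is a small pure-Python sketch that computes them for one rule from an example weak label matrix. It follows the descriptions above as literally as possible, but the data, the ABSTAIN constant and the rule_summary helper are assumptions, and rubrix may compute edge cases differently.

```python
from typing import Dict, List

ABSTAIN = -1  # assumed value of label2int[None]


def rule_summary(
    matrix: List[List[int]],
    annotations: List[int],
    rule: int,
    normalize_by_coverage: bool = False,
) -> Dict[str, object]:
    """Hypothetical helper: summary statistics for the rule in column `rule`."""
    labeled = overlaps = conflicts = correct = incorrect = 0
    polarity = set()
    for row, annotation in zip(matrix, annotations):
        label = row[rule]
        if label == ABSTAIN:
            continue  # abstentions do not contribute to any statistic
        labeled += 1
        polarity.add(label)
        # Labels given by the other rules for the same record.
        others = [x for k, x in enumerate(row) if k != rule and x != ABSTAIN]
        if others:
            overlaps += 1
        if any(other != label for other in others):
            conflicts += 1
        if annotation != ABSTAIN:
            if label == annotation:
                correct += 1
            else:
                incorrect += 1
    # Normalizing by coverage means dividing by the labeled records only.
    denominator = labeled if normalize_by_coverage else len(matrix)
    judged = correct + incorrect
    return {
        "polarity": polarity,
        "coverage": labeled / len(matrix),
        "overlaps": overlaps / denominator if denominator else 0.0,
        "conflicts": conflicts / denominator if denominator else 0.0,
        "correct": correct,
        "incorrect": incorrect,
        "precision": correct / judged if judged else None,
    }


# One row per record, one column per rule (example values).
matrix = [[0, -1], [-1, -1], [1, 1], [0, 1]]
annotations = [0, -1, 1, 1]

first_rule = rule_summary(matrix, annotations, rule=0)
second_rule = rule_summary(matrix, annotations, rule=1)
first_rule_normalized = rule_summary(matrix, annotations, rule=0, normalize_by_coverage=True)
print(first_rule["coverage"], first_rule["overlaps"], first_rule["conflicts"])  # 0.75 0.5 0.25
```

In this example the first rule labels three of four records, overlaps with the second rule on two of them and disagrees on one, so normalizing by coverage turns the overlap of 0.5 into 2/3.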

class rubrix.labeling.text_classification.rule.Rule(query, label, name=None)

A rule (labeling function) in the form of an ElasticSearch query.

Parameters
  • query (str) – An ElasticSearch query with the query string syntax.

  • label (str) – The label associated with the query.

  • name (Optional[str]) – An optional name for the rule to be used as identifier in the rubrix.labeling.text_classification.WeakLabels class. By default, we will use the query string.

Examples

>>> import rubrix as rb
>>> from rubrix.labeling.text_classification.rule import Rule
>>> urgent_rule = Rule(query="inputs.text:(urgent AND immediately)", label="urgent", name="urgent_rule")
>>> not_urgent_rule = Rule(query="inputs.text:(NOT urgent) AND metadata.title_length>20", label="not urgent")
>>> not_urgent_rule.apply("my_dataset")
>>> my_dataset_records = rb.load(name="my_dataset", as_pandas=False)
>>> not_urgent_rule(my_dataset_records[0])
'not urgent'

__call__(record)

Check if the given record is among the matching ids from the self.apply call.

Parameters

record (rubrix.client.models.TextClassificationRecord) – The record to be labeled.

Returns

A label if the record id is among the matching ids, otherwise None.

Raises

RuleNotAppliedError – If the rule was not applied to the dataset before.

Return type

Optional[str]

apply(dataset)

Apply the rule to a dataset and save matching ids of the records.

Parameters

dataset (str) – The name of the dataset.

property name

The name of the rule.
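
The apply-then-call pattern described above can be sketched with an in-memory stand-in. Everything here is hypothetical: SketchRule takes a Python predicate instead of an ElasticSearch query, the dataset is a plain dict of record ids to texts, __call__ receives a record id rather than a record, and RuleNotAppliedError is a local class, not the rubrix one. Only the control flow (store matching ids in apply, look them up in __call__, raise if apply was never run) follows the documentation.

```python
from typing import Callable, Dict, Optional, Set


class RuleNotAppliedError(Exception):
    """Local stand-in for the error named above (not the rubrix class)."""


class SketchRule:
    """Hypothetical stand-in for Rule: a predicate replaces the query."""

    def __init__(self, matcher: Callable[[str], bool], label: str, name: Optional[str] = None):
        self.label = label
        # Fall back to the function name when no explicit name is given.
        self.name = name or getattr(matcher, "__name__", "rule")
        self._matcher = matcher
        self._matching_ids: Optional[Set[int]] = None

    def apply(self, dataset: Dict[int, str]) -> None:
        """Store the ids of all records whose text matches the predicate."""
        self._matching_ids = {
            record_id for record_id, text in dataset.items() if self._matcher(text)
        }

    def __call__(self, record_id: int) -> Optional[str]:
        """Return the label if the id matched during apply, otherwise None."""
        if self._matching_ids is None:
            raise RuleNotAppliedError(f"Rule '{self.name}' was not applied to a dataset")
        return self.label if record_id in self._matching_ids else None


def mentions_urgent(text: str) -> bool:
    return "urgent" in text.lower()


# Example dataset: record id -> text.
dataset = {0: "Urgent: please reply immediately", 1: "Weekly newsletter"}

urgent_rule = SketchRule(mentions_urgent, label="urgent")
urgent_rule.apply(dataset)
print(urgent_rule.name)  # mentions_urgent
print(urgent_rule(0))    # urgent
print(urgent_rule(1))    # None
```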