Labeling (Experimental)
The rubrix.labeling module provides tools to enhance your labeling workflow.
Text classification
Labeling tools for the text classification task.
- class rubrix.labeling.text_classification.weak_labels.WeakLabels(rules, dataset, ids=None, query=None, label2int=None)
Computes the weak labels of a dataset by applying a given list of rules.
- Parameters
rules (List[Callable]) – A list of rules (labeling functions). Each rule must return a string, or None in case of abstention.
dataset (str) – Name of the dataset to which the rules will be applied.
ids (Optional[List[Union[int, str]]]) – An optional list of record ids to filter the dataset before applying the rules.
query (Optional[str]) – An optional ElasticSearch query with the query string syntax to filter the dataset before applying the rules.
label2int (Optional[Dict[Optional[str], int]]) – An optional dict mapping the labels to integers. Remember that a rule returning None means abstention, so the dict needs a None key (e.g. {None: -1}). By default, the mapping is built on the fly when applying the rules.
- Raises
MultiLabelError – When trying to get weak labels for a multi-label text classification task.
MissingLabelError – When provided with a label2int dict, and a weak label or annotation label is not present in its keys.
Examples
Get the weak label matrix and a summary of the applied rules:
>>> from typing import Optional
>>> from rubrix import TextClassificationRecord
>>> from rubrix.labeling.text_classification import Rule, WeakLabels
>>> def awesome_rule(record: TextClassificationRecord) -> Optional[str]:
...     return "Positive" if "awesome" in record.inputs["text"] else None
>>> another_rule = Rule(query="good OR best", label="Positive")
>>> weak_labels = WeakLabels(rules=[awesome_rule, another_rule], dataset="my_dataset")
>>> weak_labels.matrix()
>>> weak_labels.summary()
Use snorkel’s LabelModel:
>>> from snorkel.labeling.model import LabelModel
>>> label_model = LabelModel()
>>> label_model.fit(L_train=weak_labels.matrix(has_annotation=False))
>>> label_model.score(L=weak_labels.matrix(has_annotation=True), Y=weak_labels.annotation())
>>> label_model.predict(L=weak_labels.matrix(has_annotation=False))
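To make the idea of a weak label matrix concrete, here is a minimal, self-contained sketch that does not use rubrix at all. It assumes rules are plain functions over strings (the real rules receive records), that abstention maps to -1 as in the {None: -1} example, and that new labels get the next free integer. The helper name build_weak_label_matrix and the KeyError are illustrative stand-ins, not part of the library.

```python
from typing import Callable, Dict, List, Optional, Tuple


def build_weak_label_matrix(
    texts: List[str],
    rules: List[Callable[[str], Optional[str]]],
    label2int: Optional[Dict[Optional[str], int]] = None,
) -> Tuple[List[List[int]], Dict[Optional[str], int]]:
    """Apply every rule to every text; one row per text, one column per rule."""
    # Assumption: abstention (None) is encoded as -1 when no mapping is given.
    mapping: Dict[Optional[str], int] = dict(label2int) if label2int else {None: -1}
    build_on_the_fly = label2int is None
    matrix: List[List[int]] = []
    for text in texts:
        row = []
        for rule in rules:
            label = rule(text)
            if label not in mapping:
                if not build_on_the_fly:
                    # Stand-in for the MissingLabelError described above.
                    raise KeyError(f"label {label!r} is missing from label2int")
                # Next free non-negative integer (the None entry occupies -1).
                mapping[label] = len(mapping) - 1
            row.append(mapping[label])
        matrix.append(row)
    return matrix, mapping


def awesome_rule(text: str) -> Optional[str]:
    return "Positive" if "awesome" in text else None


def awful_rule(text: str) -> Optional[str]:
    return "Negative" if "awful" in text else None


texts = ["an awesome movie", "awful and boring", "nothing special"]
matrix, label2int = build_weak_label_matrix(texts, [awesome_rule, awful_rule])
# The inverse mapping plays the role of the int2label property.
int2label = {integer: label for label, integer in label2int.items()}
```

Each row of the matrix holds the votes of all rules for one record, which is the layout that label models such as snorkel's LabelModel expect.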
- annotation(exclude_missing_annotations=True)
Returns the annotation labels as an array of integers.
- Parameters
exclude_missing_annotations (bool) – If True, excludes missing annotations, that is, all entries with the self.label2int[None] integer.
- Returns
The annotation array of integers.
- Return type
numpy.ndarray
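The effect of exclude_missing_annotations can be sketched with plain lists. This is only an illustration under the assumption that a missing annotation is stored as None and therefore encoded as label2int[None]; the function name encode_annotations is hypothetical.

```python
from typing import Dict, List, Optional


def encode_annotations(
    annotations: List[Optional[str]],
    label2int: Dict[Optional[str], int],
    exclude_missing_annotations: bool = True,
) -> List[int]:
    """Encode annotation labels as integers, optionally dropping missing ones."""
    encoded = [label2int[label] for label in annotations]
    if exclude_missing_annotations:
        missing = label2int[None]
        encoded = [value for value in encoded if value != missing]
    return encoded


label2int = {None: -1, "Positive": 0, "Negative": 1}
annotations = ["Positive", None, "Negative", None]

only_annotated = encode_annotations(annotations, label2int)
all_entries = encode_annotations(annotations, label2int, exclude_missing_annotations=False)
```

With the default, the result has one entry per annotated record; with False, it has one entry per record and keeps the placeholder integer.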
- property int2label: Dict[int, Optional[str]]
The dictionary that maps integers to weak/annotation labels.
- property label2int: Dict[Optional[str], int]
The dictionary that maps weak/annotation labels to integers.
- matrix(has_annotation=None)
Returns the weak label matrix, or optionally just a part of it.
- Parameters
has_annotation (Optional[bool]) – If True, return only the part of the matrix that has a corresponding annotation. If False, return only the part of the matrix that has no corresponding annotation. By default, we return the whole weak label matrix.
- Returns
The weak label matrix, or optionally just a part of it.
- Return type
numpy.ndarray
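The has_annotation filter amounts to a row selection. The sketch below uses nested lists instead of a numpy array and assumes that missing annotations carry the label2int[None] integer (here -1); select_rows is a hypothetical helper, not a library function.

```python
from typing import List, Optional


def select_rows(
    matrix: List[List[int]],
    annotation: List[int],
    missing: int = -1,
    has_annotation: Optional[bool] = None,
) -> List[List[int]]:
    """Return all rows, or only the rows with (or without) an annotation."""
    if has_annotation is None:
        return matrix
    return [
        row
        for row, value in zip(matrix, annotation)
        if (value != missing) == has_annotation
    ]


matrix = [[0, -1], [-1, 1], [-1, -1], [0, 1]]
annotation = [0, -1, -1, 1]  # one entry per record, -1 means "not annotated"

annotated_part = select_rows(matrix, annotation, has_annotation=True)
unannotated_part = select_rows(matrix, annotation, has_annotation=False)
```

The annotated part has exactly as many rows as there are non-missing annotations, which is why it can be scored against the annotation array with missing entries excluded.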
- records(has_annotation=None)
Returns the records corresponding to the weak label matrix.
- Parameters
has_annotation (Optional[bool]) – If True, return only the records that have an annotation. If False, return only the records that have NO annotation. By default, we return all the records.
- Returns
A list of records, or optionally just a part of them.
- Return type
List[rubrix.client.models.TextClassificationRecord]
- property rules: List[Callable]¶
The rules (labeling functions) that were used to produce the weak labels.
- show_records(labels=None, rules=None)
Shows records in a pandas DataFrame, optionally filtered by weak labels and non-abstaining rules.
If you provide both labels and rules, we take the intersection of both filters.
- Parameters
labels (Optional[List[str]]) – All of these labels are in the record’s weak labels. If None, do not filter by labels.
rules (Optional[List[Union[int, str]]]) – All of these rules did not abstain for the record. If None, do not filter by rules. You can refer to the rules by their (function) name or by their index in the self.rules list.
- Returns
The optionally filtered records as a pandas DataFrame.
- Return type
pandas.core.frame.DataFrame
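The two filters can be read as set conditions on each row of the weak label matrix. The following sketch returns row indices rather than a DataFrame and assumes -1 as the abstain integer; filter_rows and its arguments are hypothetical names used only for illustration.

```python
from typing import Dict, List, Optional, Union


def filter_rows(
    matrix: List[List[int]],
    rule_names: List[str],
    int2label: Dict[int, Optional[str]],
    labels: Optional[List[str]] = None,
    rules: Optional[List[Union[int, str]]] = None,
    abstain: int = -1,
) -> List[int]:
    """Return the indices of the rows that pass both optional filters."""
    selected = []
    for index, row in enumerate(matrix):
        if labels is not None:
            # All requested labels must appear among the row's weak labels.
            row_labels = {int2label[value] for value in row if value != abstain}
            if not set(labels) <= row_labels:
                continue
        if rules is not None:
            # Rules may be given by position or by name.
            columns = [r if isinstance(r, int) else rule_names.index(r) for r in rules]
            # None of the requested rules may abstain.
            if any(row[column] == abstain for column in columns):
                continue
        selected.append(index)
    return selected


matrix = [[0, -1, 0], [-1, 1, 0], [-1, -1, -1], [0, 1, -1]]
rule_names = ["awesome_rule", "awful_rule", "good OR best"]
int2label = {-1: None, 0: "Positive", 1: "Negative"}

by_label = filter_rows(matrix, rule_names, int2label, labels=["Positive"])
by_rule = filter_rows(matrix, rule_names, int2label, rules=["awful_rule"])
both = filter_rows(matrix, rule_names, int2label, labels=["Positive", "Negative"], rules=[0])
```

Passing both arguments keeps only the rows that satisfy the label condition and the rule condition at the same time.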
- summary(normalize_by_coverage=False, annotation=None)
Returns the following summary statistics for each rule:
polarity: Set of unique labels returned by the rule, excluding “None” (abstain).
coverage: Fraction of the records labeled by the rule.
overlaps: Fraction of the records labeled by the rule together with at least one other rule.
conflicts: Fraction of the records where the rule disagrees with at least one other rule.
correct: Number of records the rule labeled correctly (if annotations are available).
incorrect: Number of records the rule labeled incorrectly (if annotations are available).
precision: Fraction of correct labels given by the rule (if annotations are available). The precision does not penalize the rule for abstains.
- Parameters
normalize_by_coverage (bool) – Normalize the overlaps and conflicts by the respective coverage.
annotation (Optional[numpy.ndarray]) – An optional array with ints holding the annotations. By default we will use self.annotation(exclude_missing_annotations=False).
- Returns
The summary statistics for each rule in a pandas DataFrame.
- Return type
pandas.core.frame.DataFrame
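The statistics listed above can be reproduced by hand on a small matrix. This is a rough sketch based only on the definitions in this section, not the library's implementation: it works on nested lists, returns a list of dicts instead of a DataFrame, reports polarity as integers, and assumes -1 marks both abstention and a missing annotation.

```python
from typing import Dict, List, Optional


def summarize(
    matrix: List[List[int]],
    annotation: Optional[List[int]] = None,
    normalize_by_coverage: bool = False,
    abstain: int = -1,
) -> List[Dict[str, object]]:
    """Compute per-rule statistics from a weak label matrix (rows = records)."""
    n_records = len(matrix)
    stats = []
    for j in range(len(matrix[0])):
        labeled = overlaps = conflicts = correct = incorrect = 0
        polarity = set()
        for i, row in enumerate(matrix):
            vote = row[j]
            if vote == abstain:
                continue  # abstentions count for nothing, including precision
            labeled += 1
            polarity.add(vote)
            others = [v for k, v in enumerate(row) if k != j and v != abstain]
            overlaps += bool(others)                     # another rule also voted
            conflicts += any(v != vote for v in others)  # and at least one disagreed
            if annotation is not None and annotation[i] != abstain:
                if vote == annotation[i]:
                    correct += 1
                else:
                    incorrect += 1
        denominator = labeled if normalize_by_coverage else n_records
        judged = correct + incorrect
        stats.append({
            "polarity": polarity,
            "coverage": labeled / n_records,
            "overlaps": overlaps / denominator if denominator else 0.0,
            "conflicts": conflicts / denominator if denominator else 0.0,
            "correct": correct,
            "incorrect": incorrect,
            "precision": correct / judged if judged else None,
        })
    return stats


matrix = [[0, -1, 0], [-1, 1, 0], [-1, -1, -1], [0, 1, -1]]
annotation = [0, 1, -1, 0]
summary = summarize(matrix, annotation)
normalized = summarize(matrix, annotation, normalize_by_coverage=True)
```

In this example the first rule labels two of four records (coverage 0.5), overlaps on both, and conflicts on one of them, so its conflicts value is 0.25 by default and 0.5 when normalized by coverage.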
- class rubrix.labeling.text_classification.rule.Rule(query, label, name=None)
A rule (labeling function) in the form of an ElasticSearch query.
- Parameters
query (str) – An ElasticSearch query with the query string syntax.
label (str) – The label associated with the query.
name (Optional[str]) – An optional name for the rule to be used as identifier in the rubrix.labeling.text_classification.WeakLabels class. By default, we will use the query string.
Examples
>>> import rubrix as rb
>>> from rubrix.labeling.text_classification import Rule
>>> urgent_rule = Rule(query="inputs.text:(urgent AND immediately)", label="urgent", name="urgent_rule")
>>> not_urgent_rule = Rule(query="inputs.text:(NOT urgent) AND metadata.title_length>20", label="not urgent")
>>> not_urgent_rule.apply("my_dataset")
>>> my_dataset_records = rb.load(name="my_dataset", as_pandas=False)
>>> not_urgent_rule(my_dataset_records[0])
"not urgent"
- __call__(record)
Check if the given record is among the matching ids from the self.apply call.
- Parameters
record (rubrix.client.models.TextClassificationRecord) – The record to be labeled.
- Returns
A label if the record id is among the matching ids, otherwise None.
- Raises
RuleNotAppliedError – If the rule was not applied to the dataset before.
- Return type
Optional[str]
- apply(dataset)
Apply the rule to a dataset and save matching ids of the records.
- Parameters
dataset (str) – The name of the dataset.
- property name
The name of the rule.
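The apply-then-call pattern described for Rule can be illustrated without an ElasticSearch backend. The sketch below is a simplified stand-in, not the library class: KeywordRule is a hypothetical name, the "query" is a plain substring match, the dataset is a dict from record id to text, and RuleNotAppliedError is redefined locally for the example.

```python
from typing import Dict, Optional, Set


class RuleNotAppliedError(Exception):
    """Local stand-in for the error raised when a rule is called before apply."""


class KeywordRule:
    def __init__(self, query: str, label: str, name: Optional[str] = None):
        self.query = query
        self.label = label
        self._name = name
        self._matching_ids: Optional[Set[int]] = None  # filled in by apply()

    @property
    def name(self) -> str:
        # Falls back to the query string, mirroring the documented default.
        return self._name if self._name is not None else self.query

    def apply(self, dataset: Dict[int, str]) -> None:
        """Remember the ids of all records whose text matches the query."""
        self._matching_ids = {
            record_id for record_id, text in dataset.items() if self.query in text
        }

    def __call__(self, record_id: int) -> Optional[str]:
        """Return the label for matching ids, None (abstain) otherwise."""
        if self._matching_ids is None:
            raise RuleNotAppliedError("call apply() on a dataset first")
        return self.label if record_id in self._matching_ids else None


dataset = {1: "urgent: please reply now", 2: "see you tomorrow"}
urgent_rule = KeywordRule(query="urgent", label="urgent")
urgent_rule.apply(dataset)
first, second = urgent_rule(1), urgent_rule(2)
```

Separating apply from the call means the query runs once per dataset, and labeling an individual record afterwards is a cheap id lookup.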