Labeling (Experimental)¶
The rubrix.labeling module aims to provide tools that enhance your labeling workflow.
Text classification¶
Labeling tools for the text classification task.
- class rubrix.labeling.text_classification.rule.Rule(query: str, label: str, name: Optional[str] = None, author: Optional[str] = None)¶
A rule (labeling function) in the form of an ElasticSearch query.
- Parameters
query (str) – An ElasticSearch query with the query string syntax.
label (str) – The label associated with the query.
name (Optional[str]) – An optional name for the rule, used as its identifier in the rubrix.labeling.text_classification.WeakLabels class. By default, we will use the query string.
author (Optional[str]) – Who authored the rule.
Examples
>>> import rubrix as rb
>>> urgent_rule = Rule(query="inputs.text:(urgent AND immediately)", label="urgent", name="urgent_rule")
>>> not_urgent_rule = Rule(query="inputs.text:(NOT urgent) AND metadata.title_length>20", label="not urgent")
>>> not_urgent_rule.apply("my_dataset")
>>> my_dataset_records = rb.load(name="my_dataset", as_pandas=False)
>>> not_urgent_rule(my_dataset_records[0])
"not urgent"
- __call__(record: rubrix.client.models.TextClassificationRecord) Optional[str] ¶
Check if the given record is among the matching ids from the self.apply call.
- Parameters
record (rubrix.client.models.TextClassificationRecord) – The record to be labelled.
- Returns
A label if the record id is among the matching ids, otherwise None.
- Raises
RuleNotAppliedError – If the rule was not applied to the dataset before.
- Return type
Optional[str]
- apply(dataset: str)¶
Apply the rule to a dataset and save matching ids of the records.
- Parameters
dataset (str) – The name of the dataset.
- property author¶
Who authored the rule.
- property label: str¶
The rule label.
- metrics(dataset: str) Dict[str, Union[int, float]] ¶
Compute the rule metrics for a given dataset:
coverage: Fraction of the records labeled by the rule.
annotated_coverage: Fraction of annotated records labeled by the rule.
correct: Number of records the rule labeled correctly (if annotations are available).
incorrect: Number of records the rule labeled incorrectly (if annotations are available).
precision: Fraction of correct labels given by the rule (if annotations are available). The precision does not penalize the rule for abstains.
- Parameters
dataset (str) – Name of the dataset for which to compute the rule metrics.
- Returns
The rule metrics.
- Return type
Dict[str, Union[int, float]]
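To build intuition for these metrics, here is a toy sketch of how coverage and precision can be computed for a single rule. This is illustrative only, not the library's implementation; all data below is made up, and None stands for abstention:

```python
import numpy as np

# Toy sketch (made-up data, not the rubrix implementation): one rule's weak
# labels over six records, where None means the rule abstained.
weak_labels = ["urgent", None, "urgent", None, "urgent", None]
annotations = ["urgent", "not urgent", "not urgent", None, "urgent", "urgent"]

labeled = np.array([wl is not None for wl in weak_labels])
annotated = np.array([a is not None for a in annotations])
agree = np.array([wl == a for wl, a in zip(weak_labels, annotations)])

coverage = labeled.mean()                                   # fraction of records labeled
annotated_coverage = (labeled & annotated).sum() / annotated.sum()
correct = int((labeled & annotated & agree).sum())
incorrect = int((labeled & annotated & ~agree).sum())
precision = correct / (correct + incorrect)                 # abstentions are not penalized
```

Note how precision only looks at records the rule actually labeled: abstaining never counts against the rule.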
- property name¶
The name of the rule.
- property query: str¶
The rule query.
- rubrix.labeling.text_classification.rule.load_rules(dataset: str) List[rubrix.labeling.text_classification.rule.Rule] ¶
Load the rules defined in a given dataset.
- Parameters
dataset (str) – Name of the dataset.
- Returns
A list of rules defined in the given dataset.
- Return type
List[rubrix.labeling.text_classification.rule.Rule]
- class rubrix.labeling.text_classification.weak_labels.WeakLabels(rules: List[Callable], dataset: str, ids: Optional[List[Union[str, int]]] = None, query: Optional[str] = None, label2int: Optional[Dict[Optional[str], int]] = None)¶
Computes the weak labels of a dataset by applying a given list of rules.
- Parameters
rules (List[Callable]) – A list of rules (labeling functions). They must return a string, or None in case of abstention.
dataset (str) – Name of the dataset to which the rules will be applied.
ids (Optional[List[Union[int, str]]]) – An optional list of record ids to filter the dataset before applying the rules.
query (Optional[str]) –
An optional ElasticSearch query with the query string syntax to filter the dataset before applying the rules.
label2int (Optional[Dict[Optional[str], int]]) – An optional dict, mapping the labels to integers. Remember that the return type None means abstention (e.g. {None: -1}). By default, we will build a mapping on the fly when applying the rules.
- Raises
DuplicatedRuleNameError – When you provided multiple rules with the same name.
NoRecordsFoundError – When the filtered dataset is empty.
MultiLabelError – When trying to get weak labels for a multi-label text classification task.
MissingLabelError – When provided with a label2int dict, and a weak label or annotation label is not present in its keys.
Examples
Get the weak label matrix and a summary of the applied rules:
>>> def awesome_rule(record: TextClassificationRecord) -> str:
...     return "Positive" if "awesome" in record.inputs["text"] else None
>>> another_rule = Rule(query="good OR best", label="Positive")
>>> weak_labels = WeakLabels(rules=[awesome_rule, another_rule], dataset="my_dataset")
>>> weak_labels.matrix()
>>> weak_labels.summary()
Use snorkel’s LabelModel:
>>> from snorkel.labeling.model import LabelModel
>>> label_model = LabelModel()
>>> label_model.fit(L_train=weak_labels.matrix(has_annotation=False))
>>> label_model.score(L=weak_labels.matrix(has_annotation=True), Y=weak_labels.annotation())
>>> label_model.predict(L=weak_labels.matrix(has_annotation=False))
- annotation(exclude_missing_annotations: bool = True) numpy.ndarray ¶
Returns the annotation labels as an array of integers.
- Parameters
exclude_missing_annotations (bool) – If True, excludes all entries with the self.label2int[None] integer, that is, all records for which the annotation is missing.
- Returns
The annotation array of integers.
- Return type
numpy.ndarray
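As a rough sketch of what exclude_missing_annotations does, assuming (as in the docs' example) that label2int maps None to -1; this is illustrative, not the library's code:

```python
import numpy as np

# Illustrative only: label2int maps None (missing annotation) to -1,
# following the {None: -1} convention from the docs.
label2int = {None: -1, "not urgent": 0, "urgent": 1}

# Toy annotation array; -1 marks records without an annotation.
annotation = np.array([1, -1, 0, 1, -1])

# exclude_missing_annotations=True drops entries equal to label2int[None].
filtered = annotation[annotation != label2int[None]]
```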
- change_mapping(label2int: Dict[str, int])¶
Allows you to change the mapping between labels and integers.
This will update self.matrix as well as self.annotation.
- Parameters
label2int (Dict[str, int]) – New label to integer mapping. Must cover all previous labels.
- property int2label: Dict[int, Optional[str]]¶
The dictionary that maps integers to weak/annotation labels.
- property label2int: Dict[Optional[str], int]¶
The dictionary that maps weak/annotation labels to integers.
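When no label2int is passed to WeakLabels, a mapping is built on the fly. A minimal sketch of that idea (the helper name and the exact integer assignment are assumptions for illustration, not the library's code):

```python
def build_label2int(labels):
    """Toy sketch: fix None (abstention) to -1, then number new labels 0, 1, ..."""
    label2int = {None: -1}
    for label in labels:
        if label not in label2int:
            label2int[label] = len(label2int) - 1  # next free integer
    return label2int

mapping = build_label2int(["urgent", None, "not urgent", "urgent"])
int2label = {i: label for label, i in mapping.items()}  # the inverse mapping
```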
- matrix(has_annotation: Optional[bool] = None) numpy.ndarray ¶
Returns the weak label matrix, or optionally just a part of it.
- Parameters
has_annotation (Optional[bool]) – If True, return only the part of the matrix that has a corresponding annotation. If False, return only the part of the matrix that has NOT a corresponding annotation. By default, we return the whole weak label matrix.
- Returns
The weak label matrix, or optionally just a part of it.
- Return type
numpy.ndarray
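A toy picture of what this matrix looks like, with illustrative values and assuming the {None: -1} abstention convention:

```python
import numpy as np

# Rows are records, columns are rules; -1 encodes abstention ({None: -1}).
# All values here are made up for illustration.
matrix = np.array([
    [1, -1],   # record 0: rule 0 labels "Positive" (1), rule 1 abstains
    [-1, -1],  # record 1: both rules abstain
    [1, 0],    # record 2: the two rules disagree
])

n_records, n_rules = matrix.shape
rule_coverage = (matrix != -1).mean(axis=0)  # per-rule fraction of labeled records
```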
- records(has_annotation: Optional[bool] = None) List[rubrix.client.models.TextClassificationRecord] ¶
Returns the records corresponding to the weak label matrix.
- Parameters
has_annotation (Optional[bool]) – If True, return only the records that have an annotation. If False, return only the records that have NO annotation. By default, we return all the records.
- Returns
A list of records, or optionally just a part of them.
- Return type
List[rubrix.client.models.TextClassificationRecord]
- property rules: List[Callable]¶
The rules (labeling functions) that were used to produce the weak labels.
- show_records(labels: Optional[List[str]] = None, rules: Optional[List[Union[str, int]]] = None) pandas.core.frame.DataFrame ¶
Shows records in a pandas DataFrame, optionally filtered by weak labels and non-abstaining rules.
If you provide both labels and rules, we take the intersection of both filters.
- Parameters
labels (Optional[List[str]]) – All of these labels are in the record’s weak labels. If None, do not filter by labels.
rules (Optional[List[Union[str, int]]]) – All of these rules did not abstain for the record. If None, do not filter by rules. You can refer to the rules by their (function) name or by their index in the self.rules list.
- Returns
The optionally filtered records as a pandas DataFrame.
- Return type
pandas.core.frame.DataFrame
- summary(normalize_by_coverage: bool = False, annotation: Optional[numpy.ndarray] = None) pandas.core.frame.DataFrame ¶
Returns the following summary statistics for each rule:
label: Set of unique labels returned by the rule, excluding “None” (abstain).
coverage: Fraction of the records labeled by the rule.
annotated_coverage: Fraction of annotated records labeled by the rule (if annotations are available).
overlaps: Fraction of the records labeled by the rule together with at least one other rule.
conflicts: Fraction of the records where the rule disagrees with at least one other rule.
correct: Number of records the rule labeled correctly (if annotations are available).
incorrect: Number of records the rule labeled incorrectly (if annotations are available).
precision: Fraction of correct labels given by the rule (if annotations are available). The precision does not penalize the rule for abstains.
- Parameters
normalize_by_coverage (bool) – Normalize the overlaps and conflicts by the respective coverage.
annotation (Optional[numpy.ndarray]) – An optional array with ints holding the annotations. By default, we will use self.annotation(exclude_missing_annotations=False).
- Returns
The summary statistics for each rule in a pandas DataFrame.
- Return type
pandas.core.frame.DataFrame
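The overlaps and conflicts fractions can be illustrated with a toy two-rule weak label matrix. This is a sketch of the idea, not the library's code; -1 encodes abstention:

```python
import numpy as np

# Toy sketch (not the library's code) of the overlaps and conflicts fractions
# for the first rule of a two-rule weak label matrix; -1 encodes abstention.
matrix = np.array([
    [1, 1],    # both rules label and agree     -> overlap, no conflict
    [1, 0],    # both rules label but disagree  -> overlap and conflict
    [1, -1],   # only the first rule labels
    [-1, -1],  # both rules abstain
])

labeled = matrix != -1
both = labeled[:, 0] & labeled[:, 1]
overlaps_rule0 = both.mean()                                      # labeled together with another rule
conflicts_rule0 = (both & (matrix[:, 0] != matrix[:, 1])).mean()  # and the labels disagree
```

Passing normalize_by_coverage=True would divide these fractions by the rule's coverage instead of the total number of records.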
- class rubrix.labeling.text_classification.label_models.FlyingSquid(weak_labels: rubrix.labeling.text_classification.weak_labels.WeakLabels, **kwargs)¶
The label model by FlyingSquid.
- Parameters
weak_labels (rubrix.labeling.text_classification.weak_labels.WeakLabels) – A WeakLabels object containing the weak labels and records.
**kwargs – Passed on to the init of the FlyingSquid’s LabelModel.
Examples
>>> from rubrix.labeling.text_classification import Rule, WeakLabels
>>> rule = Rule(query="good OR best", label="Positive")
>>> weak_labels = WeakLabels(rules=[rule], dataset="my_dataset")
>>> label_model = FlyingSquid(weak_labels)
>>> label_model.fit()
>>> records = label_model.predict()
- fit(include_annotated_records: bool = False, **kwargs)¶
Fits the label model.
- Parameters
include_annotated_records (bool) – Whether or not to include annotated records in the training.
**kwargs – Passed on to the FlyingSquid’s LabelModel.fit() method.
- predict(include_annotated_records: bool = False, include_abstentions: bool = False, verbose: bool = True, tie_break_policy: str = 'abstain') List[rubrix.client.models.TextClassificationRecord] ¶
Applies the label model.
- Parameters
include_annotated_records (bool) – Whether or not to include annotated records.
include_abstentions (bool) – Whether or not to include records in the output, for which the label model abstained.
verbose (bool) – If True, print out messages of the progress to stderr.
tie_break_policy (str) –
Policy to break ties. You can choose among two policies:
abstain: Do not provide any prediction.
random: Randomly choose among the tied options using a deterministic hash.
The last policy can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all of the labeling functions abstained.
- Returns
A list of records that include the predictions of the label model.
- Raises
NotFittedError – If the label model was still not fitted.
- Return type
List[rubrix.client.models.TextClassificationRecord]
- score(tie_break_policy: Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str] = 'abstain', verbose: bool = False) Dict[str, float] ¶
Returns some scores of the label model with respect to the annotated records.
- Parameters
tie_break_policy (Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str]) –
Policy to break ties. You can choose among two policies:
abstain: Do not provide any prediction.
random: Randomly choose among the tied options using a deterministic hash.
The last policy can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all of the labeling functions abstained.
verbose (bool) – If True, print out messages of the progress to stderr.
- Returns
The scores/metrics as a dictionary.
- Raises
NotFittedError – If the label model was still not fitted.
MissingAnnotationError – If the
weak_labels
do not contain annotated records.
- Return type
Dict[str, float]
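The "random" policy described above breaks ties with a deterministic hash so that repeated runs agree. A sketch of that idea (the hashing scheme below is an assumption for illustration, not the library's exact implementation):

```python
import hashlib

def break_tie(record_id, tied_labels):
    """Toy sketch: pick among tied labels via a deterministic hash of the record id."""
    tied_labels = sorted(tied_labels)  # fix the order before indexing
    digest = hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()
    return tied_labels[int(digest, 16) % len(tied_labels)]

choice = break_tie("record-42", ["Positive", "Negative"])
```

Because the choice depends only on the record id and the set of tied labels, the same record always resolves to the same label across runs.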
- class rubrix.labeling.text_classification.label_models.LabelModel(weak_labels: rubrix.labeling.text_classification.weak_labels.WeakLabels)¶
Abstract base class for a label model implementation.
- Parameters
weak_labels (rubrix.labeling.text_classification.weak_labels.WeakLabels) – Every label model implementation needs at least a WeakLabels instance.
- fit(include_annotated_records: bool = False, *args, **kwargs)¶
Fits the label model.
- Parameters
include_annotated_records (bool) – Whether or not to include annotated records in the training.
- predict(include_annotated_records: bool = False, include_abstentions: bool = False, **kwargs) List[rubrix.client.models.TextClassificationRecord] ¶
Applies the label model.
- Parameters
include_annotated_records (bool) – Whether or not to include annotated records.
include_abstentions (bool) – Whether or not to include records in the output, for which the label model abstained.
- Returns
A list of records that include the predictions of the label model.
- Return type
List[rubrix.client.models.TextClassificationRecord]
- score(*args, **kwargs) Dict ¶
Evaluates the label model.
- Return type
Dict
- property weak_labels: rubrix.labeling.text_classification.weak_labels.WeakLabels¶
The underlying WeakLabels object, containing the weak labels and records.
- class rubrix.labeling.text_classification.label_models.Snorkel(weak_labels: rubrix.labeling.text_classification.weak_labels.WeakLabels, verbose: bool = True, device: str = 'cpu')¶
The label model by Snorkel.
- Parameters
weak_labels (rubrix.labeling.text_classification.weak_labels.WeakLabels) – A WeakLabels object containing the weak labels and records.
verbose (bool) – Whether to show print statements.
device (str) – What device to place the model on (‘cpu’ or ‘cuda:0’, for example). Passed on to the torch.Tensor.to() calls.
Examples
>>> from rubrix.labeling.text_classification import Rule, WeakLabels
>>> rule = Rule(query="good OR best", label="Positive")
>>> weak_labels = WeakLabels(rules=[rule], dataset="my_dataset")
>>> label_model = Snorkel(weak_labels)
>>> label_model.fit()
>>> records = label_model.predict()
- fit(include_annotated_records: bool = False, **kwargs)¶
Fits the label model.
- Parameters
include_annotated_records (bool) – Whether or not to include annotated records in the training.
**kwargs – Additional kwargs are passed on to Snorkel’s fit method. They must not contain L_train; the label matrix is provided automatically.
- predict(include_annotated_records: bool = False, include_abstentions: bool = False, tie_break_policy: Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str] = 'abstain') List[rubrix.client.models.TextClassificationRecord] ¶
Returns a list of records that contain the predictions of the label model.
- Parameters
include_annotated_records (bool) – Whether or not to include annotated records.
include_abstentions (bool) – Whether or not to include records in the output, for which the label model abstained.
tie_break_policy (Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str]) –
Policy to break ties. You can choose among three policies:
abstain: Do not provide any prediction.
random: Randomly choose among the tied options using a deterministic hash.
true-random: Truly randomly choose among the tied options. NOTE: repeated runs may yield slightly different results due to differences in broken ties.
The last two policies can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all of the labeling functions abstained.
- Returns
A list of records that include the predictions of the label model.
- Return type
List[rubrix.client.models.TextClassificationRecord]
- score(tie_break_policy: Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str] = 'abstain') Dict[str, float] ¶
Returns some scores of the label model with respect to the annotated records.
- Parameters
tie_break_policy (Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str]) –
Policy to break ties. You can choose among three policies:
abstain: Do not provide any prediction.
random: Randomly choose among the tied options using a deterministic hash.
true-random: Truly randomly choose among the tied options. NOTE: repeated runs may yield slightly different results due to differences in broken ties.
The last two policies can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all of the labeling functions abstained.
- Returns
The scores/metrics as a dictionary.
- Raises
MissingAnnotationError – If the weak_labels do not contain annotated records.
- Return type
Dict[str, float]
- rubrix.labeling.text_classification.label_errors.find_label_errors(records: List[rubrix.client.models.TextClassificationRecord], sort_by: Union[str, rubrix.labeling.text_classification.label_errors.SortBy] = 'likelihood', metadata_key: str = 'label_error_candidate', **kwargs) List[rubrix.client.models.TextClassificationRecord] ¶
Finds potential annotation/label errors in your records using [cleanlab](https://github.com/cleanlab/cleanlab).
We will consider all records for which both a prediction AND an annotation are available. Make sure the predictions were made in a holdout manner, that is, you should only include records that were not used to train the predictor.
- Parameters
records (List[rubrix.client.models.TextClassificationRecord]) – A list of text classification records
sort_by (Union[str, rubrix.labeling.text_classification.label_errors.SortBy]) – One of three options:
“likelihood”: sort the returned records by the likelihood of containing a label error (most likely first)
“prediction”: sort the returned records by the probability of the prediction (highest probability first)
“none”: do not sort the returned records
metadata_key (str) – The key added to the record’s metadata that holds the order, if sort_by is not “none”.
**kwargs – Passed on to cleanlab.pruning.get_noise_indices
- Returns
A list of records containing potential annotation/label errors
- Raises
NoRecordsError – If none of the records has a prediction AND annotation.
MissingPredictionError – If a prediction is missing for one of the labels.
ValueError – If unsupported kwargs are passed on, e.g. ‘sorted_index_method’.
- Return type
List[rubrix.client.models.TextClassificationRecord]
Examples
>>> import rubrix as rb
>>> records = rb.load("my_dataset", as_pandas=False)
>>> records_with_label_errors = find_label_errors(records)