Labeling#

The rubrix.labeling module provides tools to enhance your labeling workflow.

Text classification#

Labeling tools for the text classification task.

class rubrix.labeling.text_classification.rule.Rule(query, label, name=None, author=None)#

A rule (labeling function) in the form of an ElasticSearch query.

Parameters
  • query (str) – An ElasticSearch query with the query string syntax.

  • label (Union[str, List[str]]) – The label associated with the query. Can also be a list of labels.

  • name (Optional[str]) – An optional name for the rule to be used as identifier in the rubrix.labeling.text_classification.WeakLabels class. By default, we will use the query string.

  • author (Optional[str]) – An optional author of the rule.

Examples

>>> import rubrix as rb
>>> from rubrix.labeling.text_classification import Rule
>>> urgent_rule = Rule(query="inputs.text:(urgent AND immediately)", label="urgent", name="urgent_rule")
>>> not_urgent_rule = Rule(query="inputs.text:(NOT urgent) AND metadata.title_length>20", label="not urgent")
>>> # The rule must be applied to a dataset before it can be called on a record.
>>> not_urgent_rule.apply("my_dataset")
>>> my_dataset_records = rb.load(name="my_dataset")
>>> not_urgent_rule(my_dataset_records[0])
'not urgent'

__call__(record)#

Check if the given record is among the matching ids from the self.apply call.

Parameters

record (rubrix.client.models.TextClassificationRecord) – The record to be labelled.

Returns

A label or list of labels if the record id is among the matching ids, otherwise None.

Raises

RuleNotAppliedError – If the rule was not applied to the dataset before.

Return type

Optional[Union[str, List[str]]]
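
The apply/call contract described above can be pictured with a small in-memory imitation. This is only a sketch: ToyRule, its keyword matching and the dict-based dataset are hypothetical stand-ins, and the real Rule sends its query to the Rubrix server instead of scanning texts locally.

```python
from typing import Dict, List, Optional, Set, Union


class RuleNotAppliedError(Exception):
    """Stand-in for the error raised when a rule is called before apply()."""


class ToyRule:
    """Hypothetical, in-memory imitation of the apply/__call__ contract."""

    def __init__(self, keyword: str, label: Union[str, List[str]]):
        self.keyword = keyword
        self.label = label
        # None means "apply() has not been called yet".
        self._matching_ids: Optional[Set[int]] = None

    def apply(self, dataset: Dict[int, str]) -> None:
        # Assumption: a plain substring scan replaces the ElasticSearch query.
        self._matching_ids = {
            record_id for record_id, text in dataset.items() if self.keyword in text
        }

    def __call__(self, record_id: int) -> Optional[Union[str, List[str]]]:
        if self._matching_ids is None:
            raise RuleNotAppliedError("Call apply() on a dataset first.")
        # Return the label for matching ids, otherwise abstain with None.
        return self.label if record_id in self._matching_ids else None


dataset = {0: "this is urgent", 1: "no rush at all"}
rule = ToyRule(keyword="urgent", label="urgent")
rule.apply(dataset)
labels = [rule(0), rule(1)]
```

Calling the toy rule only looks up ids stored by apply(), which is why the order of the two calls matters.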

apply(dataset)#

Apply the rule to a dataset and save matching ids of the records.

Parameters

dataset (str) – The name of the dataset.

property author#

Who authored the rule.

property label: Union[str, List[str]]#

The label of the rule.

metrics(dataset)#

Compute the rule metrics for a given dataset:

  • coverage: Fraction of the records labeled by the rule.

  • annotated_coverage: Fraction of annotated records labeled by the rule.

  • correct: Number of records the rule labeled correctly (if annotations are available).

  • incorrect: Number of records the rule labeled incorrectly (if annotations are available).

  • precision: Fraction of correct labels given by the rule (if annotations are available). The precision does not penalize the rule for abstains.

Parameters

dataset (str) – Name of the dataset for which to compute the rule metrics.

Returns

The rule metrics.

Return type

Dict[str, Union[int, float]]
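
As a rough illustration of the metric definitions listed above, the following sketch computes them from two plain lists. The function name rule_metrics and the handling of empty denominators (returning 0.0) are assumptions made for this example, not the library's behavior.

```python
from typing import Dict, List, Optional, Union


def rule_metrics(
    weak: List[Optional[str]], gold: List[Optional[str]]
) -> Dict[str, Union[int, float]]:
    """Toy metrics for one rule.

    weak[i]: label the rule gave record i (None means the rule abstained).
    gold[i]: annotation of record i (None means the record is not annotated).
    """
    n_labeled = sum(w is not None for w in weak)
    n_annotated = sum(g is not None for g in gold)
    # Records that are both annotated and labeled by the rule.
    scored = [(w, g) for w, g in zip(weak, gold) if w is not None and g is not None]
    correct = sum(w == g for w, g in scored)
    incorrect = len(scored) - correct
    return {
        "coverage": n_labeled / len(weak),
        "annotated_coverage": len(scored) / n_annotated if n_annotated else 0.0,
        "correct": correct,
        "incorrect": incorrect,
        # Abstentions are in neither the numerator nor the denominator.
        "precision": correct / len(scored) if scored else 0.0,
    }


metrics = rule_metrics(
    weak=["spam", None, "spam", "spam", None],
    gold=["spam", "ham", "ham", None, None],
)
```

In this toy input the rule abstains twice, yet its precision is computed only over the two annotated records it actually labeled.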

property name#

The name of the rule.

property query: str#

The query of the rule.

rubrix.labeling.text_classification.rule.load_rules(dataset)#

Load the rules defined in a given dataset.

Parameters

dataset (str) – Name of the dataset.

Returns

A list of rules defined in the given dataset.

Return type

List[rubrix.labeling.text_classification.rule.Rule]

class rubrix.labeling.text_classification.weak_labels.WeakLabels(dataset, rules=None, ids=None, query=None, label2int=None)#

Computes the weak labels of a single-label text classification dataset by applying a given list of rules.

Parameters
  • dataset (str) – Name of the dataset to which the rules will be applied.

  • rules (Optional[List[Callable]]) – A list of rules (labeling functions). They must return a string, or None in case of abstention. If None, we will use the rules of the dataset (Default).

  • ids (Optional[List[Union[str, int]]]) – An optional list of record ids to filter the dataset before applying the rules.

  • query (Optional[str]) – An optional ElasticSearch query with the query string syntax to filter the dataset before applying the rules.

  • label2int (Optional[Dict[Optional[str], int]]) – An optional dict, mapping the labels to integers. Remember that the return type None means abstention (e.g. {None: -1}). By default, we will build a mapping on the fly when applying the rules.

Raises
  • NoRulesFoundError – When you do not provide rules, and the dataset has no rules either.

  • DuplicatedRuleNameError – When you provided multiple rules with the same name.

  • NoRecordsFoundError – When the filtered dataset is empty.

  • MultiLabelError – When trying to get weak labels for a multi-label text classification task.

  • MissingLabelError – When provided with a label2int dict, and a weak label or annotation label is not present in its keys.

Examples

>>> from typing import Optional
>>> from rubrix.client.models import TextClassificationRecord
>>> from rubrix.labeling.text_classification import Rule, WeakLabels
>>>
>>> # Get the weak label matrix from a dataset with rules:
>>> weak_labels = WeakLabels(dataset="my_dataset")
>>> weak_labels.matrix()
>>> weak_labels.summary()
>>>
>>> # Get the weak label matrix from rules defined in Python:
>>> def awesome_rule(record: TextClassificationRecord) -> Optional[str]:
...     return "Positive" if "awesome" in record.text else None
>>> another_rule = Rule(query="good OR best", label="Positive")
>>> weak_labels = WeakLabels(dataset="my_dataset", rules=[awesome_rule, another_rule])
>>> weak_labels.matrix()
>>> weak_labels.summary()
>>>
>>> # Use the WeakLabels object with snorkel's LabelModel:
>>> from snorkel.labeling.model import LabelModel
>>> label_model = LabelModel()
>>> label_model.fit(L_train=weak_labels.matrix(has_annotation=False))
>>> label_model.score(L=weak_labels.matrix(has_annotation=True), Y=weak_labels.annotation())
>>> label_model.predict(L=weak_labels.matrix(has_annotation=False))
>>>
>>> # For a built-in integration with Snorkel, see `rubrix.labeling.text_classification.Snorkel`.

annotation(include_missing=False, exclude_missing_annotations=None)#

Returns the annotation labels as an array of integers.

Parameters
  • include_missing (bool) – If True, returns an array of the length of the record list (self.records()). For this, we will fill the array with the self.label2int[None] integer for records without an annotation.

  • exclude_missing_annotations (Optional[bool]) – DEPRECATED

Returns

The annotation array of integers.

Return type

numpy.ndarray

property cardinality: int#

The number of labels.

change_mapping(label2int)#

Allows you to change the mapping between labels and integers.

This will update the self.matrix as well as the self.annotation.

Parameters

label2int (Dict[str, int]) – New label to integer mapping. Must cover all previous labels.
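
To make the effect of a new mapping concrete, here is a minimal sketch that re-encodes a small weak label matrix. The helper remap and the use of nested lists instead of a numpy array are assumptions for illustration; keeping the None key for abstentions in the new mapping is also an assumption of this example.

```python
from typing import Dict, List, Optional


def remap(
    matrix: List[List[int]],
    old_label2int: Dict[Optional[str], int],
    new_label2int: Dict[Optional[str], int],
) -> List[List[int]]:
    """Re-encode every entry of the matrix with the new mapping."""
    # The new mapping must cover all labels of the old one.
    missing = set(old_label2int) - set(new_label2int)
    if missing:
        raise ValueError(f"New mapping does not cover: {missing}")
    old_int2label = {i: label for label, i in old_label2int.items()}
    return [[new_label2int[old_int2label[v]] for v in row] for row in matrix]


old = {None: -1, "negative": 0, "positive": 1}
new = {None: -1, "positive": 0, "negative": 1}
matrix = [[0, -1], [1, 1], [-1, 0]]
remapped = remap(matrix, old, new)
```

Every entry is translated back to its label first and then encoded again, so abstentions stay abstentions while the two labels swap integers.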

extend_matrix(thresholds, embeddings=None, gpu=False)#

Extends the weak label matrix through embeddings according to the similarity thresholds for each rule.

Implementation based on Epoxy.

Parameters
  • thresholds (Union[List[float], numpy.ndarray]) – An array of thresholds between 0.0 and 1.0, one for each column of the weak labels matrix. Each one stands for the minimum cosine similarity between two sentences for a rule to be extended.

  • embeddings (Optional[numpy.ndarray]) – Embeddings for each row of the weak label matrix. If not provided, we will use the ones from the last WeakLabels.extend_matrix() call.

  • gpu (bool) – If True, perform FAISS similarity queries on GPU.

Examples

>>> import numpy as np
>>> from rubrix.labeling.text_classification import WeakLabels
>>>
>>> # Choose any model to generate the embeddings.
>>> from sentence_transformers import SentenceTransformer
>>> model = SentenceTransformer('all-mpnet-base-v2', device='cuda')
>>>
>>> # Generate the embeddings and set the thresholds.
>>> weak_labels = WeakLabels(dataset="my_dataset")
>>> embeddings = np.array([ model.encode(rec.text) for rec in weak_labels.records() ])
>>> thresholds = [0.6] * len(weak_labels.rules)
>>>
>>> # Extend the weak labels matrix.
>>> weak_labels.extend_matrix(thresholds, embeddings)
>>>
>>> # Calling the method below will now retrieve the extended matrix.
>>> weak_labels.matrix()
>>>
>>> # Subsequent calls without the embeddings parameter will reuse the faiss index built on the first call.
>>> thresholds = [0.75] * len(weak_labels.rules)
>>> weak_labels.extend_matrix(thresholds)
>>> weak_labels.matrix()

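
The idea behind extending a rule through embeddings can be sketched without FAISS. The version below is a simplified reading of the Epoxy approach and not the library's implementation: for each rule, an abstaining record copies the vote of its most similar record that the rule did label, provided their cosine similarity reaches the rule's threshold. The names extend and cosine, the brute-force search and the nested-list matrix are all assumptions of this example.

```python
import math
from typing import List, Sequence

ABSTAIN = -1


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extend(
    matrix: List[List[int]],
    embeddings: List[Sequence[float]],
    thresholds: Sequence[float],
) -> List[List[int]]:
    """Return a copy of the matrix with abstentions filled in where possible."""
    extended = [list(row) for row in matrix]
    for j, threshold in enumerate(thresholds):
        # Records on which rule j did not abstain.
        voters = [i for i, row in enumerate(matrix) if row[j] != ABSTAIN]
        if not voters:
            continue
        for i, row in enumerate(matrix):
            if row[j] != ABSTAIN:
                continue
            # Brute-force nearest neighbour among the voters of rule j.
            nearest = max(voters, key=lambda k: cosine(embeddings[i], embeddings[k]))
            if cosine(embeddings[i], embeddings[nearest]) >= threshold:
                extended[i][j] = matrix[nearest][j]
    return extended


matrix = [[1, -1], [-1, 0], [-1, -1]]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
extended = extend(matrix, embeddings, thresholds=[0.8, 0.8])
```

Here only the third record is close enough to a labeled one, so only its entry for the first rule changes. Raising a threshold makes the corresponding rule extend to fewer records.
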
property int2label: Dict[int, Optional[str]]#

The dictionary that maps integers to weak/annotation labels.

property label2int: Dict[Optional[str], int]#

The dictionary that maps weak/annotation labels to integers.

property labels: List[str]#

The list of labels.

matrix(has_annotation=None)#

Returns the weak label matrix, or optionally just a part of it.

Parameters

has_annotation (Optional[bool]) – If True, return only the part of the matrix that has a corresponding annotation. If False, return only the part of the matrix that does NOT have a corresponding annotation. By default, we return the whole weak label matrix.

Returns

The weak label matrix, or optionally just a part of it.

Return type

numpy.ndarray
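
A small, self-contained sketch may help to picture the weak label matrix and the has_annotation filter. Records are plain dicts here, the matrix is a nested list, and the order in which integers are assigned to labels on the fly is an assumption of this example rather than documented behavior.

```python
from typing import Dict, List, Optional

records = [
    {"text": "awesome movie", "annotation": "Positive"},
    {"text": "awful plot", "annotation": None},
    {"text": "awesome but awful ending", "annotation": "Negative"},
]


def awesome_rule(record: dict) -> Optional[str]:
    return "Positive" if "awesome" in record["text"] else None


def awful_rule(record: dict) -> Optional[str]:
    return "Negative" if "awful" in record["text"] else None


rules = [awesome_rule, awful_rule]
label2int: Dict[Optional[str], int] = {None: -1}  # None (abstain) maps to -1


def to_int(label: Optional[str]) -> int:
    # Build the mapping on the fly: the first new label gets 0, the next 1, ...
    if label not in label2int:
        label2int[label] = len(label2int) - 1
    return label2int[label]


# One row per record, one column per rule.
matrix = [[to_int(rule(record)) for rule in rules] for record in records]


def select(has_annotation: Optional[bool] = None) -> List[List[int]]:
    """Return the whole matrix, or only the rows with/without an annotation."""
    if has_annotation is None:
        return matrix
    return [
        row
        for row, record in zip(matrix, records)
        if (record["annotation"] is not None) == has_annotation
    ]
```

The rows selected with has_annotation=False are the ones a label model would typically be fitted on, while the annotated rows serve for scoring.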

show_records(labels=None, rules=None)#

Shows records in a pandas DataFrame, optionally filtered by weak labels and non-abstaining rules.

If you provide both labels and rules, we take the intersection of both filters.

Parameters
  • labels (Optional[List[str]]) – All of these labels are in the record’s weak labels. If None, do not filter by labels.

  • rules (Optional[List[Union[str, int]]]) – All of these rules did not abstain for the record. If None, do not filter by rules. You can refer to the rules by their (function) name or by their index in the self.rules list.

Returns

The optionally filtered records as a pandas DataFrame.

Return type

pandas.core.frame.DataFrame

summary(normalize_by_coverage=False, annotation=None)#

Returns the following summary statistics for each rule:

  • label: Set of unique labels returned by the rule, excluding “None” (abstain).

  • coverage: Fraction of the records labeled by the rule.

  • annotated_coverage: Fraction of annotated records labeled by the rule (if annotations are available).

  • overlaps: Fraction of the records labeled by the rule together with at least one other rule.

  • conflicts: Fraction of the records where the rule disagrees with at least one other rule.

  • correct: Number of labels the rule predicted correctly (if annotations are available).

  • incorrect: Number of labels the rule predicted incorrectly (if annotations are available).

  • precision: Fraction of correct labels given by the rule (if annotations are available). The precision does not penalize the rule for abstains.

Parameters
  • normalize_by_coverage (bool) – Normalize the overlaps and conflicts by the respective coverage.

  • annotation (Optional[numpy.ndarray]) – An optional array with ints holding the annotations. By default, we will use self.annotation(include_missing=True).

Returns

The summary statistics for each rule in a pandas DataFrame.

Return type

pandas.core.frame.DataFrame
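
The overlaps and conflicts statistics, and the effect of normalize_by_coverage, can be sketched for a small single-label matrix as follows. The function name, the list-of-dicts output and the treatment of zero denominators are assumptions made for illustration.

```python
from typing import Dict, List

ABSTAIN = -1


def overlaps_and_conflicts(
    matrix: List[List[int]], normalize_by_coverage: bool = False
) -> List[Dict[str, float]]:
    """Toy per-rule statistics for a (records x rules) weak label matrix."""
    n_records, n_rules = len(matrix), len(matrix[0])
    stats = []
    for j in range(n_rules):
        covered = overlaps = conflicts = 0
        for row in matrix:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Votes of all other rules that did not abstain on this record.
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
            overlaps += bool(others)
            conflicts += any(v != row[j] for v in others)
        denominator = covered if normalize_by_coverage else n_records
        stats.append(
            {
                "coverage": covered / n_records,
                "overlaps": overlaps / denominator if denominator else 0.0,
                "conflicts": conflicts / denominator if denominator else 0.0,
            }
        )
    return stats


matrix = [[0, 0], [0, 1], [0, -1], [-1, -1]]
stats = overlaps_and_conflicts(matrix)
normalized = overlaps_and_conflicts(matrix, normalize_by_coverage=True)
```

With normalization, the second rule overlaps on all of the records it labels, which is hidden when dividing by the total number of records.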

class rubrix.labeling.text_classification.weak_labels.WeakMultiLabels(dataset, rules=None, ids=None, query=None)#

Computes the weak labels of a multi-label text classification dataset by applying a given list of rules.

Parameters
  • dataset (str) – Name of the dataset to which the rules will be applied.

  • rules (Optional[List[Callable]]) – A list of rules (labeling functions). They must return a string, list of strings, or None in case of abstention. If None, we will use the rules of the dataset (Default).

  • ids (Optional[List[Union[str, int]]]) – An optional list of record ids to filter the dataset before applying the rules.

  • query (Optional[str]) – An optional ElasticSearch query with the query string syntax to filter the dataset before applying the rules.

Raises
  • NoRulesFoundError – When you do not provide rules, and the dataset has no rules either.

  • DuplicatedRuleNameError – When you provided multiple rules with the same name.

  • NoRecordsFoundError – When the filtered dataset is empty.

Examples

>>> from typing import List, Optional
>>> from rubrix.client.models import TextClassificationRecord
>>> from rubrix.labeling.text_classification import Rule, WeakMultiLabels
>>>
>>> # Get the 3 dimensional weak label matrix from a multi-label classification dataset with rules:
>>> weak_labels = WeakMultiLabels(dataset="my_dataset")
>>> weak_labels.matrix()
>>> weak_labels.summary()
>>>
>>> # Get the 3 dimensional weak label matrix from rules defined in Python:
>>> def awesome_rule(record: TextClassificationRecord) -> Optional[List[str]]:
...     return ["Positive", "Slang"] if "next level" in record.text else None
>>> another_rule = Rule(query="amped OR psyched", label=["Positive", "Slang"])
>>> weak_labels = WeakMultiLabels(dataset="my_dataset", rules=[awesome_rule, another_rule])
>>> weak_labels.matrix()
>>> weak_labels.summary()

annotation(include_missing=False)#

Returns the annotation labels as a matrix of integers.

It has the dimensions (“number of records” x “number of labels”). It holds a 1 or 0 to indicate whether the record is annotated with the corresponding label. If there is no annotation for the record, it holds a -1 for each label.

Parameters

include_missing (bool) – If True, returns a matrix of the length of the record list (self.records()). For this, we will fill the matrix with -1 for records without an annotation.

Returns

The annotation labels as a matrix of integers.

Return type

numpy.ndarray
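
The encoding of the annotation matrix can be illustrated with nested lists. This is only a sketch under stated assumptions: the helper annotation_matrix and the representation of annotations as lists of label strings (None for a missing annotation) are made up for this example.

```python
from typing import List, Optional

labels = ["Positive", "Slang"]
# One entry per record: the annotated labels, or None if not annotated.
annotations = [["Positive"], None, ["Positive", "Slang"]]


def annotation_matrix(
    annotations: List[Optional[List[str]]],
    labels: List[str],
    include_missing: bool = False,
) -> List[List[int]]:
    rows = []
    for annotation in annotations:
        if annotation is None:
            # Records without annotation are skipped, or filled with -1.
            if include_missing:
                rows.append([-1] * len(labels))
            continue
        rows.append([int(label in annotation) for label in labels])
    return rows


only_annotated = annotation_matrix(annotations, labels)
with_missing = annotation_matrix(annotations, labels, include_missing=True)
```

With include_missing=True the matrix has one row per record, so it lines up with the rows of the weak label matrix.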

property cardinality: int#

The number of labels.

extend_matrix(thresholds, embeddings=None, gpu=False)#

Extends the weak label matrix through embeddings according to the similarity thresholds for each rule.

Implementation based on Epoxy.

Parameters
  • thresholds (Union[List[float], numpy.ndarray]) – An array of thresholds between 0.0 and 1.0, one for each column of the weak labels matrix. Each one stands for the minimum cosine similarity between two sentences for a rule to be extended.

  • embeddings (Optional[numpy.ndarray]) – Embeddings for each row of the weak label matrix. If not provided, we will use the ones from the last WeakMultiLabels.extend_matrix() call.

  • gpu (bool) – If True, perform FAISS similarity queries on GPU.

Examples

>>> import numpy as np
>>> from rubrix.labeling.text_classification import WeakMultiLabels
>>>
>>> # Choose any model to generate the embeddings.
>>> from sentence_transformers import SentenceTransformer
>>> model = SentenceTransformer('all-mpnet-base-v2', device='cuda')
>>>
>>> # Generate the embeddings and set the thresholds.
>>> weak_labels = WeakMultiLabels(dataset="my_dataset")
>>> embeddings = np.array([ model.encode(rec.text) for rec in weak_labels.records() ])
>>> thresholds = [0.6] * len(weak_labels.rules)
>>>
>>> # Extend the weak labels matrix.
>>> weak_labels.extend_matrix(thresholds, embeddings)
>>>
>>> # Calling the method below will now retrieve the extended matrix.
>>> weak_labels.matrix()
>>>
>>> # Subsequent calls without the embeddings parameter will reuse the faiss index built on the first call.
>>> thresholds = [0.75] * len(weak_labels.rules)
>>> weak_labels.extend_matrix(thresholds)
>>> weak_labels.matrix()

property labels: List[str]#

The labels of the multi-label text classification dataset.

matrix(has_annotation=None)#

Returns the 3 dimensional weak label matrix, or optionally just a part of it.

It has the dimensions (“number of records” x “number of rules” x “number of labels”). It holds a 1 or 0 depending on whether a rule votes for a label or not. If the rule abstains, it holds a -1 for each label.

Parameters

has_annotation (Optional[bool]) – If True, return only the part of the matrix that has a corresponding annotation. If False, return only the part of the matrix that does NOT have a corresponding annotation. By default, we return the whole weak label matrix.

Returns

The 3 dimensional weak label matrix, or optionally just a part of it.

Return type

numpy.ndarray
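
The layout of the 3 dimensional matrix can be sketched with nested lists instead of a numpy array. The helper vote_vector and the hard-coded rule outputs are assumptions for illustration; each rule output is a label, a list of labels, or None for an abstention.

```python
from typing import List, Optional, Union

labels = ["Positive", "Slang"]


def vote_vector(
    rule_output: Optional[Union[str, List[str]]], labels: List[str]
) -> List[int]:
    """Encode one rule output as a vector with one entry per label."""
    if rule_output is None:
        # The rule abstained: -1 for every label.
        return [-1] * len(labels)
    if isinstance(rule_output, str):
        rule_output = [rule_output]
    return [int(label in rule_output) for label in labels]


# Outputs of two rules (inner lists) for two records (outer list).
rule_outputs = [
    [["Positive", "Slang"], None],
    ["Positive", "Slang"],
]

# Dimensions: records x rules x labels.
matrix = [
    [vote_vector(output, labels) for output in per_record]
    for per_record in rule_outputs
]
```

Indexing matrix[i][j] yields the vote vector of rule j for record i.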

show_records(labels=None, rules=None)#

Shows records in a pandas DataFrame, optionally filtered by weak labels and non-abstaining rules.

If you provide both labels and rules, we take the intersection of both filters.

Parameters
  • labels (Optional[List[str]]) – All of these labels are in the record’s weak labels. If None, do not filter by labels.

  • rules (Optional[List[Union[str, int]]]) – All of these rules did not abstain for the record. If None, do not filter by rules. You can refer to the rules by their (function) name or by their index in the self.rules list.

Returns

The optionally filtered records as a pandas DataFrame.

Return type

pandas.core.frame.DataFrame

summary(normalize_by_coverage=False, annotation=None)#

Returns the following summary statistics for each rule:

  • label: Set of unique labels returned by the rule, excluding “None” (abstain).

  • coverage: Fraction of the records labeled by the rule.

  • annotated_coverage: Fraction of annotated records labeled by the rule (if annotations are available).

  • overlaps: Fraction of the records labeled by the rule together with at least one other rule.

  • correct: Number of labels the rule predicted correctly (if annotations are available).

  • incorrect: Number of labels the rule predicted incorrectly or missed (if annotations are available).

  • precision: Fraction of correct labels given by the rule (if annotations are available). The precision does not penalize the rule for abstains.

Parameters
  • normalize_by_coverage (bool) – Normalize the overlaps by the respective coverage.

  • annotation (Optional[numpy.ndarray]) – An optional matrix with ints holding the annotations (see self.annotation). By default, we will use self.annotation(include_missing=True).

Returns

The summary statistics for each rule in a pandas DataFrame.

Return type

pandas.core.frame.DataFrame

class rubrix.labeling.text_classification.label_models.FlyingSquid(weak_labels, **kwargs)#

The label model by FlyingSquid.

Note

It is not suited for multi-label classification and does not support it!

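Parameters

weak_labels (rubrix.labeling.text_classification.weak_labels.WeakLabels) – A WeakLabels object containing the weak labels and records.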

Examples

>>> from rubrix.labeling.text_classification import WeakLabels
>>> from rubrix.labeling.text_classification.label_models import FlyingSquid
>>> weak_labels = WeakLabels(dataset="my_dataset")
>>> label_model = FlyingSquid(weak_labels)
>>> label_model.fit()
>>> records = label_model.predict()

fit(include_annotated_records=False, **kwargs)#

Fits the label model.

Parameters
  • include_annotated_records (bool) – Whether to include annotated records in the fitting.

  • **kwargs – Passed on to FlyingSquid’s LabelModel.fit() method.

predict(include_annotated_records=False, include_abstentions=False, prediction_agent='FlyingSquid', verbose=True, tie_break_policy='abstain')#

Applies the label model.

Parameters
  • include_annotated_records (bool) – Whether to include annotated records.

  • include_abstentions (bool) – Whether to include records for which the label model abstained in the output.

  • prediction_agent (str) – String used for the prediction_agent in the returned records.

  • verbose (bool) – If True, print out messages of the progress to stderr.

  • tie_break_policy (Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str]) –

    Policy to break ties. You can choose among two policies:

    • abstain: Do not provide any prediction

    • random: randomly choose among the tied options using a deterministic hash

    The last policy can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all the labeling functions (rules) abstained.

Returns

A dataset of records that include the predictions of the label model.

Raises

NotFittedError – If the label model has not been fitted yet.

Return type

rubrix.client.datasets.DatasetForTextClassification
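
The difference between the two tie-break policies can be sketched as follows. This is one possible way to obtain a reproducible "random" choice and is not necessarily how the library derives its hash: the function break_tie, the record_key argument and the use of SHA-256 are assumptions of this example.

```python
import hashlib
from typing import List, Optional


def break_tie(
    tied_labels: List[str], record_key: str, policy: str = "abstain"
) -> Optional[str]:
    """Resolve a tie between equally likely labels for one record."""
    if len(tied_labels) == 1:
        return tied_labels[0]
    if policy == "abstain":
        # Do not provide any prediction.
        return None
    if policy == "random":
        # Derive a stable index from the record key, so that repeated runs
        # pick the same label for the same record.
        digest = hashlib.sha256(record_key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(tied_labels)
        return sorted(tied_labels)[index]
    raise ValueError(f"Unknown tie break policy: {policy}")


abstained = break_tie(["negative", "positive"], record_key="record-1")
first = break_tie(["negative", "positive"], record_key="record-1", policy="random")
second = break_tie(["positive", "negative"], record_key="record-1", policy="random")
```

Because the choice depends only on the record key and the set of tied labels, first and second are always equal, but the pick itself carries no information about the true label.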

score(tie_break_policy='abstain', verbose=False, output_str=False)#

Returns some scores/metrics of the label model with respect to the annotated records.

The metrics are:

  • accuracy

  • micro/macro averages for precision, recall and f1

  • precision, recall, f1 and support for each label

For more details about the metrics, check out the sklearn docs.

Note

Metrics are only calculated over non-abstained predictions!

Parameters
  • tie_break_policy (Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str]) –

    Policy to break ties. You can choose among two policies:

    • abstain: Do not provide any prediction

    • random: randomly choose among the tied options using a deterministic hash

    The last policy can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all the labeling functions (rules) abstained.

  • verbose (bool) – If True, print out messages of the progress to stderr.

  • output_str (bool) – If True, return output as nicely formatted string.

Returns

The scores/metrics in a dictionary or as a nicely formatted str.

Raises
  • NotFittedError – If the label model has not been fitted yet.

  • MissingAnnotationError – If the weak_labels do not contain annotated records.

Return type

Union[Dict[str, float], str]
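
To illustrate what "only calculated over non-abstained predictions" means for a metric such as accuracy, consider the following sketch. The function name and the use of None to mark an abstention are assumptions for this example; the library itself reports more metrics than accuracy.

```python
from typing import List, Optional


def accuracy_without_abstentions(
    predictions: List[Optional[str]], annotations: List[str]
) -> float:
    """Accuracy over the records for which a prediction was made."""
    # Drop every record on which the label model abstained (prediction is None).
    pairs = [(p, a) for p, a in zip(predictions, annotations) if p is not None]
    if not pairs:
        raise ValueError("The label model abstained on all annotated records.")
    return sum(p == a for p, a in pairs) / len(pairs)


predictions = ["spam", None, "ham", "spam"]
annotations = ["spam", "ham", "ham", "ham"]
accuracy = accuracy_without_abstentions(predictions, annotations)
```

Two of the three non-abstained predictions are correct here. Counting the abstention as an error would give a lower value, so scores should be read together with the number of abstentions.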

class rubrix.labeling.text_classification.label_models.MajorityVoter(weak_labels)#

A basic label model that computes the majority vote across all rules.

For single-label classification, it will predict the label with the most votes. For multi-label classification, it will predict all labels that got at least one vote by the rules.

Parameters

weak_labels (Union[rubrix.labeling.text_classification.weak_labels.WeakLabels, rubrix.labeling.text_classification.weak_labels.WeakMultiLabels]) – The weak labels object.
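
The two voting schemes described above can be sketched in a few lines. The function names are hypothetical, ties in the single-label case are resolved with the "abstain" policy, and returning an empty list when all rules abstain in the multi-label case is an assumption of this example.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(votes: List[Optional[str]]) -> Optional[str]:
    """Single-label case: the label with the most votes, None on a tie."""
    counts = Counter(vote for vote in votes if vote is not None)
    if not counts:
        return None  # all rules abstained
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie, resolved with the "abstain" policy
    return ranked[0][0]


def multi_label_vote(votes: List[Optional[List[str]]]) -> List[str]:
    """Multi-label case: every label that got at least one vote."""
    voted = set()
    for vote in votes:
        if vote is not None:
            voted.update(vote)
    return sorted(voted)


single = majority_vote(["Positive", "Positive", "Negative", None])
tied = majority_vote(["Positive", "Negative"])
multi = multi_label_vote([["Positive"], None, ["Slang", "Positive"]])
```

Since a single vote is enough in the multi-label case, there is no tie to break there.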

fit(*args, **kwargs)#

Raises a NotImplementedError.

No need to call fit on the MajorityVoter!

predict(include_annotated_records=False, include_abstentions=False, prediction_agent='MajorityVoter', tie_break_policy='abstain')#

Applies the label model.

Parameters
  • include_annotated_records (bool) – Whether to include annotated records.

  • include_abstentions (bool) – Whether to include records for which the label model abstained in the output.

  • prediction_agent (str) – String used for the prediction_agent in the returned records.

  • tie_break_policy (Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str]) –

    Policy to break ties (IGNORED FOR MULTI-LABEL!). You can choose among two policies:

    • abstain: Do not provide any prediction

    • random: randomly choose among the tied options using a deterministic hash

    The last policy can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all the labeling functions (rules) abstained.

Returns

A dataset of records that include the predictions of the label model.

Return type

rubrix.client.datasets.DatasetForTextClassification

score(tie_break_policy='abstain', output_str=False)#

Returns some scores/metrics of the label model with respect to the annotated records.

The metrics are:

  • accuracy

  • micro/macro averages for precision, recall and f1

  • precision, recall, f1 and support for each label

For more details about the metrics, check out the sklearn docs.

Note

Metrics are only calculated over non-abstained predictions!

Parameters
  • tie_break_policy (Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str]) –

    Policy to break ties (IGNORED FOR MULTI-LABEL). You can choose among two policies:

    • abstain: Do not provide any prediction

    • random: randomly choose among the tied options using a deterministic hash

    The last policy can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all the labeling functions (rules) abstained.

  • output_str (bool) – If True, return output as nicely formatted string.

Returns

The scores/metrics in a dictionary or as a nicely formatted str.

Raises

MissingAnnotationError – If the weak_labels do not contain annotated records.

Return type

Union[Dict[str, float], str]

class rubrix.labeling.text_classification.label_models.Snorkel(weak_labels, verbose=True, device='cpu')#

The label model by Snorkel.

Note

It is not suited for multi-label classification and does not support it!

Parameters
  • weak_labels (rubrix.labeling.text_classification.weak_labels.WeakLabels) – A WeakLabels object containing the weak labels and records.

  • verbose (bool) – Whether to show print statements.

  • device (str) – What device to place the model on (‘cpu’ or ‘cuda:0’, for example). Passed on to the torch.Tensor.to() calls.

Examples

>>> from rubrix.labeling.text_classification import WeakLabels
>>> from rubrix.labeling.text_classification.label_models import Snorkel
>>> weak_labels = WeakLabels(dataset="my_dataset")
>>> label_model = Snorkel(weak_labels)
>>> label_model.fit()
>>> records = label_model.predict()

fit(include_annotated_records=False, **kwargs)#

Fits the label model.

Parameters
  • include_annotated_records (bool) – Whether to include annotated records in the fitting.

  • **kwargs – Additional kwargs are passed on to Snorkel’s fit method. They must not contain L_train, the label matrix is provided automatically.

predict(include_annotated_records=False, include_abstentions=False, prediction_agent='Snorkel', tie_break_policy='abstain')#

Returns a dataset of records that contain the predictions of the label model.

Parameters
  • include_annotated_records (bool) – Whether to include annotated records.

  • include_abstentions (bool) – Whether to include records for which the label model abstained in the output.

  • prediction_agent (str) – String used for the prediction_agent in the returned records.

  • tie_break_policy (Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str]) –

    Policy to break ties. You can choose among three policies:

    • abstain: Do not provide any prediction

    • random: randomly choose among the tied options using a deterministic hash

    • true-random: randomly choose among the tied options. NOTE: repeated runs may have slightly different results due to differences in broken ties

    The last two policies can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all the labeling functions (rules) abstained.

Returns

A dataset of records that include the predictions of the label model.

Return type

rubrix.client.datasets.DatasetForTextClassification

score(tie_break_policy='abstain', output_str=False)#

Returns some scores/metrics of the label model with respect to the annotated records.

The metrics are:

  • accuracy

  • micro/macro averages for precision, recall and f1

  • precision, recall, f1 and support for each label

For more details about the metrics, check out the sklearn docs.

Note

Metrics are only calculated over non-abstained predictions!

Parameters
  • tie_break_policy (Union[rubrix.labeling.text_classification.label_models.TieBreakPolicy, str]) –

    Policy to break ties. You can choose among three policies:

    • abstain: Do not provide any prediction

    • random: randomly choose among the tied options using a deterministic hash

    • true-random: randomly choose among the tied options. NOTE: repeated runs may have slightly different results due to differences in broken ties

    The last two policies can introduce quite a bit of noise, especially when the tie is among many labels, as is the case when all the labeling functions (rules) abstained.

  • output_str (bool) – If True, return output as nicely formatted string.

Returns

The scores/metrics in a dictionary or as a nicely formatted str.

Raises

MissingAnnotationError – If the weak_labels do not contain annotated records.

Return type

Union[Dict[str, float], str]

rubrix.labeling.text_classification.label_errors.find_label_errors(records, sort_by='likelihood', metadata_key='label_error_candidate', n_jobs=1, **kwargs)#

Finds potential annotation/label errors in your records using [cleanlab](https://github.com/cleanlab/cleanlab).

We will consider all records for which a prediction AND an annotation are available. Make sure the predictions were made in a holdout manner, that is, you should only include records that were not used in the training of the predictor.

Parameters
  • records (Union[List[rubrix.client.models.TextClassificationRecord], rubrix.client.datasets.DatasetForTextClassification]) – A list of text classification records

  • sort_by (Union[str, rubrix.labeling.text_classification.label_errors.SortBy]) –

    One of the three options:

    • “likelihood”: sort the returned records by likelihood of containing a label error (most likely first)

    • “prediction”: sort the returned records by the probability of the prediction (highest probability first)

    • “none”: do not sort the returned records

  • metadata_key (str) – The key added to the record’s metadata that holds the order, if sort_by is not “none”.

  • n_jobs (Optional[int]) – Number of processing threads used by multiprocessing. If None, uses the number of threads on your CPU. Defaults to 1, which removes parallel processing.

  • **kwargs – Passed on to cleanlab.pruning.get_noise_indices (cleanlab < 2.0) or cleanlab.filter.find_label_issues (cleanlab >= 2.0)

Returns

A list of records containing potential annotation/label errors

Raises
  • NoRecordsError – If none of the records has a prediction AND annotation.

  • MissingPredictionError – If a prediction is missing for one of the labels.

  • ValueError – If unsupported kwargs are passed on, e.g. ‘sorted_index_method’.

Return type

List[rubrix.client.models.TextClassificationRecord]
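
The record filtering and the role of metadata_key can be pictured with a deliberately naive stand-in. Instead of cleanlab's confident learning, the sketch below simply flags records whose most probable prediction disagrees with the annotation and orders them as the “prediction” option would. The function name, the dict-based records and the disagreement heuristic are all assumptions of this example.

```python
from typing import Dict, List


def naive_label_error_candidates(
    records: List[Dict], metadata_key: str = "label_error_candidate"
) -> List[Dict]:
    # Keep only records that have both a prediction and an annotation.
    usable = [r for r in records if r["prediction"] and r["annotation"] is not None]
    if not usable:
        raise ValueError("No record has both a prediction and an annotation.")

    def top(record: Dict):
        # Most probable (label, probability) pair of the prediction.
        return max(record["prediction"], key=lambda pair: pair[1])

    # Naive stand-in for cleanlab: flag records whose top prediction disagrees.
    suspects = [r for r in usable if top(r)[0] != r["annotation"]]
    # "prediction" ordering: highest prediction probability first.
    suspects.sort(key=lambda r: top(r)[1], reverse=True)
    # Store the rank in the metadata, without mutating the input records.
    return [
        {**r, "metadata": {**r.get("metadata", {}), metadata_key: rank}}
        for rank, r in enumerate(suspects)
    ]


records = [
    {"id": 0, "prediction": [("spam", 0.9), ("ham", 0.1)], "annotation": "ham", "metadata": {}},
    {"id": 1, "prediction": [("ham", 0.8), ("spam", 0.2)], "annotation": "ham", "metadata": {}},
    {"id": 2, "prediction": None, "annotation": "spam", "metadata": {}},
    {"id": 3, "prediction": [("ham", 0.6), ("spam", 0.4)], "annotation": "spam", "metadata": {}},
]
candidates = naive_label_error_candidates(records)
```

The record without a prediction is ignored, and the rank stored under the metadata key makes it possible to restore the order later, for example after logging the candidates to a dataset.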

Examples

>>> import rubrix as rb
>>> from rubrix.labeling.text_classification.label_errors import find_label_errors
>>> records = rb.load("my_dataset")
>>> records_with_label_errors = find_label_errors(records)