Metrics (Experimental)

Here we describe the available metrics in Rubrix:

  • Text classification: Metrics for text classification

  • Token classification: Metrics for token classification

Text classification

rubrix.metrics.text_classification.metrics.f1(name)

Computes the single-label F1 metric for a dataset

Parameters

name (str) – The dataset name.

Returns

The f1 metric summary

Return type

rubrix.metrics.models.MetricSummary

Examples

>>> from rubrix.metrics.text_classification import f1
>>> summary = f1(name="example-dataset")
>>> summary.visualize() # will plot a bar chart with results
>>> summary.data # returns the raw result data

rubrix.metrics.text_classification.metrics.f1_multilabel(name)

Computes the multi-label F1 metric for a dataset

Parameters

name (str) – The dataset name.

Returns

The f1 metric summary

Return type

rubrix.metrics.models.MetricSummary

Examples

>>> from rubrix.metrics.text_classification import f1_multilabel
>>> summary = f1_multilabel(name="example-dataset")
>>> summary.visualize() # will plot a bar chart with results
>>> summary.data # returns the raw result data

Token classification

rubrix.metrics.token_classification.metrics.entity_capitalness(name)

Computes the entity capitalness. The entity capitalness splits the entity mention shapes into 4 groups:

UPPER: All characters in the entity mention are upper case

LOWER: All characters in the entity mention are lower case

FIRST: Only the first character of the mention is upper case

MIDDLE: Some character in the mention between the first and last is upper case
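The grouping above can be sketched in a few lines. This is an illustrative stand-in for how the four groups are assumed to be assigned, not the actual Rubrix implementation:

```python
def capitalness(mention: str) -> str:
    # Classify a mention into one of the four capitalness groups
    # described above (sketch; edge cases may differ from Rubrix).
    if mention.isupper():
        return "UPPER"    # all characters upper case
    if mention.islower():
        return "LOWER"    # all characters lower case
    if mention[0].isupper():
        return "FIRST"    # only the first character is upper case
    return "MIDDLE"       # some inner character is upper case


print(capitalness("NASA"))    # UPPER
print(capitalness("peter"))   # LOWER
print(capitalness("Peter"))   # FIRST
print(capitalness("iPhone"))  # MIDDLE
```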

Parameters

name (str) – The dataset name.

Returns

The summary entity capitalness distribution

Return type

rubrix.metrics.models.MetricSummary

Examples

>>> from rubrix.metrics.token_classification import entity_capitalness
>>> summary = entity_capitalness(name="example-dataset")
>>> summary.visualize()

rubrix.metrics.token_classification.metrics.entity_consistency(name, mentions=10, threshold=2)

Computes the consistency for the top entity mentions in the dataset.

Entity consistency describes the label variability of a given mention. For example, a mention "first" identified across the dataset as Cardinal, Person and Time is less consistent than a mention "Peter" identified only as Person.
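The variability idea above can be sketched as follows. The `consistency_summary` helper and its `(mention, label)` input pairs are illustrative assumptions standing in for the dataset records, not the Rubrix API:

```python
from collections import Counter, defaultdict


def consistency_summary(annotations, mentions=10, threshold=2):
    # `annotations`: hypothetical list of (mention, label) pairs.
    labels_per_mention = defaultdict(Counter)
    for mention, label in annotations:
        labels_per_mention[mention][label] += 1

    # Keep only mentions tagged with at least `threshold` distinct
    # labels, i.e. the inconsistent ones worth inspecting.
    variable = {
        mention: counts
        for mention, counts in labels_per_mention.items()
        if len(counts) >= threshold
    }

    # Report the `mentions` most frequent of those, with label counts.
    ranked = sorted(variable.items(), key=lambda kv: -sum(kv[1].values()))
    return dict(ranked[:mentions])


data = [
    ("first", "Cardinal"), ("first", "Person"), ("first", "Time"),
    ("Peter", "Person"), ("Peter", "Person"),
]
summary = consistency_summary(data)
# "first" (3 distinct labels) is reported; "Peter" (1 label) is not
```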

Parameters
  • name (str) – The dataset name.

  • mentions (int) – The number of top mentions to retrieve

  • threshold (int) – The entity variability threshold (must be greater than or equal to 2)

Returns

The summary for the entity consistency distribution

Return type

rubrix.metrics.models.MetricSummary

Examples

>>> from rubrix.metrics.token_classification import entity_consistency
>>> summary = entity_consistency(name="example-dataset")
>>> summary.visualize()

rubrix.metrics.token_classification.metrics.entity_density(name, interval=0.005)

Computes the entity density distribution. The entity density is calculated at the record level for each mention as mention_length/tokens_length
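The per-record ratio above can be sketched directly. The `entity_densities` helper and its arguments are hypothetical names for illustration, not the Rubrix API:

```python
def entity_densities(mention_lengths, tokens_length):
    # `mention_lengths`: token lengths of the mentions in one record;
    # `tokens_length`: total token count of that record.
    # Each density is mention_length / tokens_length, in the range 0-1.
    return [length / tokens_length for length in mention_lengths]


# A 20-token record with mentions of 1 and 3 tokens:
densities = entity_densities([1, 3], tokens_length=20)
print(densities)  # [0.05, 0.15]
```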

Parameters
  • name (str) – The dataset name.

  • interval (float) – The bin interval for the histogram. The entity density is defined in the range 0-1

Returns

The summary entity density distribution

Return type

rubrix.metrics.models.MetricSummary

Examples

>>> from rubrix.metrics.token_classification import entity_density
>>> summary = entity_density(name="example-dataset")
>>> summary.visualize()

rubrix.metrics.token_classification.metrics.entity_labels(name, labels=50)

Computes the entity labels distribution

Parameters
  • name (str) – The dataset name.

  • labels (int) – The number of top entity labels to retrieve. Lower numbers will perform better

Returns

The summary for entity tags distribution

Return type

rubrix.metrics.models.MetricSummary

Examples

>>> from rubrix.metrics.token_classification import entity_labels
>>> summary = entity_labels(name="example-dataset", labels=10)
>>> summary.visualize() # will plot a bar chart with results
>>> summary.data # The top-10 entity tags

rubrix.metrics.token_classification.metrics.mention_length(name, interval=1)

Computes the mention length distribution (in number of tokens)

Parameters
  • name (str) – The dataset name.

  • interval (int) – The bin size for the result histogram

Returns

The summary for mention token distribution

Return type

rubrix.metrics.models.MetricSummary

Examples

>>> from rubrix.metrics.token_classification import mention_length
>>> summary = mention_length(name="example-dataset", interval=2)
>>> summary.visualize() # will plot a histogram chart with results
>>> summary.data # the raw histogram data with bins of size 2

rubrix.metrics.token_classification.metrics.tokens_length(name, interval=1)

Computes the tokens length distribution

Parameters
  • name (str) – The dataset name.

  • interval (int) – The bin size for the result histogram

Returns

The summary for token distribution

Return type

rubrix.metrics.models.MetricSummary

Examples

>>> from rubrix.metrics.token_classification import tokens_length
>>> summary = tokens_length(name="example-dataset", interval=5)
>>> summary.visualize() # will plot a histogram with results
>>> summary.data # the raw histogram data with bins of size 5
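Both mention_length and tokens_length take an interval that controls the histogram bin size. A minimal sketch of that bucketing, assuming plain integer lengths (the actual server-side aggregation may differ):

```python
from collections import Counter


def length_histogram(lengths, interval=1):
    # Bucket each length into the bin [k*interval, (k+1)*interval),
    # keyed by the bin's lower bound, as a histogram would.
    bins = Counter((length // interval) * interval for length in lengths)
    return dict(sorted(bins.items()))


print(length_histogram([3, 4, 7, 12, 13], interval=5))  # {0: 2, 5: 1, 10: 2}
```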