Client

This page describes the Python client of Rubrix, which we divide into two basic modules:

Methods: These methods make up the interface to interact with Rubrix’s REST API.
Models: You need to wrap your data in these data models for Rubrix to understand it.

Methods

This module contains the interface to access Rubrix’s REST API.

rubrix.copy(dataset, name_of_copy, workspace=None)

Creates a copy of a dataset, including its tags and metadata.

Parameters

dataset (str) – Name of the source dataset
name_of_copy (str) – Name of the copied dataset
workspace (Optional[str]) – If provided, dataset will be copied to that workspace

Examples

>>> import rubrix as rb
>>> rb.copy("my_dataset", name_of_copy="new_dataset")
>>> dataframe = rb.load("new_dataset")

rubrix.delete(name)

Deletes a dataset.

Parameters: name (str) – The dataset name.
Return type: None

Examples

>>> import rubrix as rb
>>> rb.delete(name="example-dataset")

rubrix.get_workspace()

Returns the name of the active workspace for the current client session.

Returns: The name of the active workspace as a string.
Return type: str

rubrix.init(api_url=None, api_key=None, workspace=None, timeout=60)

Initializes the Python client.

Passing an api_url disables reading of the environment variables that would otherwise provide default values.

Parameters

api_url (Optional[str]) – Address of the REST API. If None (default) and the env variable RUBRIX_API_URL is not set, it will default to http://localhost:6900.
api_key (Optional[str]) – Authentication key for the REST API. If None (default) and the env variable RUBRIX_API_KEY is not set, it will default to rubrix.apikey.
workspace (Optional[str]) – The workspace to which records will be logged/loaded. If None (default) and the env variable RUBRIX_WORKSPACE is not set, it will default to the private user workspace.
timeout (int) – Number of seconds to wait before the connection times out. Default: 60.

Return type

None

Examples

>>> import rubrix as rb
>>> rb.init(api_url="http://localhost:9090", api_key="4AkeAPIk3Y")
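The fallback order described in the parameter list can be pictured with a minimal sketch. `resolve_settings` is a hypothetical helper, not part of the Rubrix client; the environment variable names and defaults are taken from the descriptions above, and the rule that an explicit `api_url` switches off every environment lookup is only one reading of the note above, so treat it as an assumption.

```python
import os

# Defaults quoted from the parameter descriptions above.
DEFAULT_API_URL = "http://localhost:6900"
DEFAULT_API_KEY = "rubrix.apikey"


def resolve_settings(api_url=None, api_key=None, workspace=None, environ=None):
    """Hypothetical helper (not a rubrix function) mimicking the documented fallbacks."""
    env = os.environ if environ is None else environ
    if api_url is not None:
        # Assumption: an explicit api_url disables every environment lookup.
        env = {}

    def pick(value, variable, default):
        # Explicit argument first, then the environment variable, then the default.
        if value is not None:
            return value
        return env.get(variable, default)

    return {
        "api_url": pick(api_url, "RUBRIX_API_URL", DEFAULT_API_URL),
        "api_key": pick(api_key, "RUBRIX_API_KEY", DEFAULT_API_KEY),
        # None stands for the private user workspace.
        "workspace": pick(workspace, "RUBRIX_WORKSPACE", None),
    }


# A fake environment is passed in so the example does not depend on the shell.
settings = resolve_settings(environ={"RUBRIX_API_KEY": "my-key"})
print(settings)
```

Passing the environment as an argument keeps the sketch easy to test; the real client reads the process environment itself.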

rubrix.load(name, query=None, ids=None, limit=None, as_pandas=True)

Loads a dataset as a pandas DataFrame or a Dataset.

Parameters

name (str) – The dataset name.
query (Optional[str]) – An Elasticsearch query in the query string syntax.
ids (Optional[List[Union[str, int]]]) – If provided, load dataset records with given ids.
limit (Optional[int]) – The number of records to retrieve.
as_pandas (bool) – If True, return a pandas DataFrame. If False, return a Dataset.

Returns

The dataset as a pandas DataFrame or a Dataset.

Return type

Union[pandas.core.frame.DataFrame, rubrix.client.datasets.DatasetForTextClassification, rubrix.client.datasets.DatasetForTokenClassification, rubrix.client.datasets.DatasetForText2Text]

Examples

>>> import rubrix as rb
>>> dataframe = rb.load(name="example-dataset")

rubrix.log(records, name, tags=None, metadata=None, chunk_size=500, verbose=True)

Logs records to Rubrix.

Parameters

records (Union[rubrix.client.models.TextClassificationRecord, rubrix.client.models.TokenClassificationRecord, rubrix.client.models.Text2TextRecord, Iterable[Union[rubrix.client.models.TextClassificationRecord, rubrix.client.models.TokenClassificationRecord, rubrix.client.models.Text2TextRecord]], rubrix.client.datasets.DatasetForTextClassification, rubrix.client.datasets.DatasetForTokenClassification, rubrix.client.datasets.DatasetForText2Text]) – The record or an iterable of records.
name (str) – The dataset name.
tags (Optional[Dict[str, str]]) – A dictionary of tags related to the dataset.
metadata (Optional[Dict[str, Any]]) – A dictionary of extra info for the dataset.
chunk_size (int) – The chunk size for a data bulk.
verbose (bool) – If True, shows a progress bar and prints out a quick summary at the end.

Returns

Summary of the response from the REST API

Return type

rubrix.client.models.BulkResponse

Examples

>>> import rubrix as rb
>>> record = rb.TextClassificationRecord(
...     inputs={"text": "my first rubrix example"},
...     prediction=[('spam', 0.8), ('ham', 0.2)]
... )
>>> response = rb.log(record, name="example-dataset")
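To picture what `chunk_size` controls, here is a minimal sketch of splitting an iterable of records into bulks of at most `chunk_size` items. `iter_chunks` is a hypothetical helper written for illustration, not the client's actual implementation, and plain strings stand in for records.

```python
from itertools import islice


def iter_chunks(records, chunk_size=500):
    """Hypothetical helper: yield lists holding at most chunk_size records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Plain strings stand in for records here.
chunks = list(iter_chunks([f"record-{i}" for i in range(1200)], chunk_size=500))
print([len(chunk) for chunk in chunks])
```

With 1200 records and a chunk size of 500, this produces two full bulks and a final bulk of 200.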

rubrix.set_workspace(ws)

Sets the active workspace for the current client session.

Parameters: ws (str) – The new workspace
Return type: None

Models

This module contains the data models for the interface.

class rubrix.client.models.Text2TextRecord(*, text, prediction=None, prediction_agent=None, annotation=None, annotation_agent=None, id=None, metadata=None, status=None, event_timestamp=None, metrics=None)

Record for a text to text task

Parameters

text (str) – The input of the record
prediction (Optional[List[Union[str, Tuple[str, float]]]]) – A list of strings or tuples containing predictions for the input text. If tuples, the first entry is the predicted text, the second entry is its corresponding score.
prediction_agent (Optional[str]) – Name of the prediction agent. By default, this is set to the hostname of your machine.
annotation (Optional[str]) – A string representing the expected output text for the given input text.
annotation_agent (Optional[str]) – Name of the annotation agent. By default, this is set to the hostname of your machine.
id (Optional[Union[int, str]]) – The id of the record. By default (None), we will generate a unique ID for you.
metadata (Optional[Dict[str, Any]]) – Metadata for the record. Defaults to {}.
status (Optional[str]) – The status of the record. Options: ‘Default’, ‘Edited’, ‘Discarded’, ‘Validated’. If an annotation is provided, this defaults to ‘Validated’, otherwise ‘Default’.
event_timestamp (Optional[datetime.datetime]) – The timestamp of the record.
metrics (Optional[Dict[str, Any]]) – READ ONLY! Metrics at record level provided by the server when using rb.load. This attribute will be ignored when using rb.log.

Return type

None

Examples

>>> import rubrix as rb
>>> record = rb.Text2TextRecord(
...     text="My name is Sarah and I love my dog.",
...     prediction=["Je m'appelle Sarah et j'aime mon chien."]
... )
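The default for `status` described in the parameter list can be written down as a small rule. `default_status` is a hypothetical helper for illustration only; that an explicitly passed status is kept unchanged is an assumption.

```python
def default_status(annotation=None, status=None):
    """Hypothetical helper mirroring the documented default for `status`."""
    if status is not None:
        # Assumption: an explicit status is kept as given.
        return status
    return "Validated" if annotation is not None else "Default"


print(default_status(annotation="Je m'appelle Sarah."))
print(default_status())
```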

classmethod prediction_as_tuples(prediction)

Preprocesses the predictions and wraps them in a tuple if needed.

Parameters: prediction (Optional[List[Union[str, Tuple[str, float]]]]) –
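A minimal sketch of this kind of preprocessing follows. `as_tuples` is a hypothetical stand-in, not the library's code, and the default score of 1.0 is an assumption, since the text above does not state which score is assigned to bare strings.

```python
def as_tuples(prediction, default_score=1.0):
    """Hypothetical stand-in: wrap bare strings as (text, score) tuples."""
    if prediction is None:
        return None
    return [
        # Tuples already carry a score; strings get the assumed default.
        pred if isinstance(pred, tuple) else (pred, default_score)
        for pred in prediction
    ]


wrapped = as_tuples(["mi ejemplo", ("ejemplo mio", 0.4)])
print(wrapped)
```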

class rubrix.client.models.TextClassificationRecord(*, inputs, prediction=None, prediction_agent=None, annotation=None, annotation_agent=None, multi_label=False, explanation=None, id=None, metadata=None, status=None, event_timestamp=None, metrics=None)

Record for text classification

Parameters

inputs (Union[str, List[str], Dict[str, Union[str, List[str]]]]) – The inputs of the record
prediction (Optional[List[Tuple[str, float]]]) – A list of tuples containing the predictions for the record. The first entry of the tuple is the predicted label, the second entry is its corresponding score.
prediction_agent (Optional[str]) – Name of the prediction agent. By default, this is set to the hostname of your machine.
annotation (Optional[Union[str, List[str]]]) – A string or a list of strings (multilabel) corresponding to the annotation (gold label) for the record.
annotation_agent (Optional[str]) – Name of the annotation agent. By default, this is set to the hostname of your machine.
multi_label (bool) – Is the prediction/annotation for a multi label classification task? Defaults to False.
explanation (Optional[Dict[str, List[rubrix.client.models.TokenAttributions]]]) – A dictionary containing the attributions of each token to the prediction. The keys map the input of the record (see inputs) to the TokenAttributions.
id (Optional[Union[int, str]]) – The id of the record. By default (None), we will generate a unique ID for you.
metadata (Optional[Dict[str, Any]]) – Metadata for the record. Defaults to {}.
status (Optional[str]) – The status of the record. Options: ‘Default’, ‘Edited’, ‘Discarded’, ‘Validated’. If an annotation is provided, this defaults to ‘Validated’, otherwise ‘Default’.
event_timestamp (Optional[datetime.datetime]) – The timestamp of the record.
metrics (Optional[Dict[str, Any]]) – READ ONLY! Metrics at record level provided by the server when using rb.load. This attribute will be ignored when using rb.log.

Return type

None

Examples

>>> import rubrix as rb
>>> record = rb.TextClassificationRecord(
...     inputs={"text": "my first rubrix example"},
...     prediction=[('spam', 0.8), ('ham', 0.2)]
... )

classmethod input_as_dict(inputs)

Preprocesses the record inputs and wraps them in a dictionary if needed.
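The wrapping can be sketched as follows. `as_input_dict` is a hypothetical stand-in rather than the library's code; the `"text"` key follows the DatasetForTextClassification example further down this page, where `inputs="example"` is displayed as `{"text": "example"}`, and wrapping lists under the same key is an assumption.

```python
def as_input_dict(inputs, default_key="text"):
    """Hypothetical stand-in: leave dicts alone, wrap anything else under one key."""
    if isinstance(inputs, dict):
        return inputs
    # Assumption: strings and lists of strings are both stored under default_key.
    return {default_key: inputs}


print(as_input_dict("my first rubrix example"))
print(as_input_dict({"subject": "hello", "body": "world"}))
```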

class rubrix.client.models.TokenAttributions(*, token, attributions=None)

Attribution of the token to the predicted label.

In the Rubrix app this is only supported for TextClassificationRecord and the multi_label=False case.

Parameters

token (str) – The input token.
attributions (Dict[str, float]) – A dictionary containing label-attribution pairs.

Return type

None

class rubrix.client.models.TokenClassificationRecord(*, text, tokens, prediction=None, prediction_agent=None, annotation=None, annotation_agent=None, id=None, metadata=None, status=None, event_timestamp=None, metrics=None)

Record for a token classification task

Parameters

text (str) – The input of the record
tokens (List[str]) – The tokenized input of the record. We use this to guide the annotation process and to cross-check the spans of your prediction/annotation.
prediction (Optional[List[Union[Tuple[str, int, int], Tuple[str, int, int, float]]]]) – A list of tuples containing the predictions for the record. The first entry of the tuple is the name of predicted entity, the second and third entry correspond to the start and stop character index of the entity. The fourth entry is optional and corresponds to the score of the entity (a float number between 0 and 1).
prediction_agent (Optional[str]) – Name of the prediction agent. By default, this is set to the hostname of your machine.
annotation (Optional[List[Tuple[str, int, int]]]) – A list of tuples containing annotations (gold labels) for the record. The first entry of the tuple is the name of the entity, the second and third entry correspond to the start and stop char index of the entity.
annotation_agent (Optional[str]) – Name of the annotation agent. By default, this is set to the hostname of your machine.
id (Optional[Union[int, str]]) – The id of the record. By default (None), we will generate a unique ID for you.
metadata (Optional[Dict[str, Any]]) – Metadata for the record. Defaults to {}.
status (Optional[str]) – The status of the record. Options: ‘Default’, ‘Edited’, ‘Discarded’, ‘Validated’. If an annotation is provided, this defaults to ‘Validated’, otherwise ‘Default’.
event_timestamp (Optional[datetime.datetime]) – The timestamp of the record.
metrics (Optional[Dict[str, Any]]) – READ ONLY! Metrics at record level provided by the server when using rb.load. This attribute will be ignored when using rb.log.

Return type

None

Examples

>>> import rubrix as rb
>>> record = rb.TokenClassificationRecord(
...     text = "Michael is a professor at Harvard",
...     tokens = ["Michael", "is", "a", "professor", "at", "Harvard"],
...     prediction = [('NAME', 0, 7), ('LOC', 26, 33)]
... )
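The character indices in the example above can be derived from the tokens instead of being counted by hand. The sketch below uses only plain Python; `token_offsets` is a hypothetical helper, and it assumes every token appears verbatim in the text and in order. In the example, the stop index is exclusive: `text[26:33]` is `"Harvard"`.

```python
text = "Michael is a professor at Harvard"
tokens = ["Michael", "is", "a", "professor", "at", "Harvard"]


def token_offsets(text, tokens):
    """Hypothetical helper: (start, stop) character offsets for each token."""
    offsets, cursor = [], 0
    for token in tokens:
        # Search from the end of the previous token so repeated tokens resolve in order.
        start = text.index(token, cursor)
        stop = start + len(token)
        offsets.append((start, stop))
        cursor = stop
    return offsets


offsets = token_offsets(text, tokens)
# Entity spans built from the first and last token.
prediction = [("NAME", *offsets[0]), ("LOC", *offsets[5])]
print(prediction)
```

Building spans from token offsets keeps them aligned with token boundaries, which is what the `tokens` argument is cross-checked against.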

classmethod add_default_score(prediction)

Adds the default score to the predictions if it is missing

Parameters: prediction (Optional[List[Union[Tuple[str, int, int], Tuple[str, int, int, float]]]]) –
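A minimal sketch of filling in missing scores follows. `with_default_score` is a hypothetical stand-in, not the library's code, and the default value of 1.0 is an assumption, since the text above does not say which score is used.

```python
def with_default_score(prediction, default_score=1.0):
    """Hypothetical stand-in: append a score to 3-tuples, keep 4-tuples as they are."""
    if prediction is None:
        return None
    return [
        tuple(pred) if len(pred) == 4 else (*pred, default_score)
        for pred in prediction
    ]


scored = with_default_score([("NAME", 0, 7), ("LOC", 26, 33, 0.9)])
print(scored)
```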

Datasets

class rubrix.client.datasets.DatasetForText2Text(records=None)

This Dataset contains Text2TextRecord records.

It allows you to export/import records into/from different formats, loop over the records, and access them by index.

Parameters: records (Optional[List[rubrix.client.models.Text2TextRecord]]) – A list of `Text2TextRecord`s.
Raises: WrongRecordTypeError – When the record type in the provided list does not correspond to the dataset type.

Examples

>>> import rubrix as rb
>>> # Import/export records:
>>> dataset = rb.DatasetForText2Text.from_pandas(my_dataframe)
>>> dataset.to_datasets()
>>> # Passing in a list of records:
>>> records = [
...     rb.Text2TextRecord(text="example"),
...     rb.Text2TextRecord(text="another example"),
... ]
>>> dataset = rb.DatasetForText2Text(records)
>>> assert len(dataset) == 2
>>> # Looping over the dataset:
>>> for record in dataset:
...     print(record)
>>> # Indexing into the dataset:
>>> dataset[0]
... rb.Text2TextRecord(text="example")
>>> dataset[0] = rb.Text2TextRecord(text="replaced example")

classmethod from_datasets(dataset)

Imports records from a datasets.Dataset.

Columns that are not supported are ignored.

Parameters: dataset (datasets.Dataset) – A datasets Dataset from which to import the records.
Returns: The imported records in a Rubrix Dataset.
Return type: DatasetForText2Text

Examples

>>> import datasets
>>> ds = datasets.Dataset.from_dict({
...     "text": ["my example"],
...     "prediction": [["mi ejemplo", "ejemplo mio"]]
... })
>>> # or
>>> ds = datasets.Dataset.from_dict({
...     "text": ["my example"],
...     "prediction": [[{"text": "mi ejemplo", "score": 0.9}]]
... })
>>> DatasetForText2Text.from_datasets(ds)

classmethod from_pandas(dataframe)

Imports records from a pandas.DataFrame.

Columns that are not supported are ignored.

Parameters: dataframe (pandas.core.frame.DataFrame) – A pandas DataFrame from which to import the records.
Returns: The imported records in a Rubrix Dataset.
Return type: rubrix.client.datasets.DatasetForText2Text

class rubrix.client.datasets.DatasetForTextClassification(records=None)

This Dataset contains TextClassificationRecord records.

It allows you to export/import records into/from different formats, loop over the records, and access them by index.

Parameters: records (Optional[List[rubrix.client.models.TextClassificationRecord]]) – A list of `TextClassificationRecord`s.
Raises: WrongRecordTypeError – When the record type in the provided list does not correspond to the dataset type.

Examples

>>> import rubrix as rb
>>> # Import/export records:
>>> dataset = rb.DatasetForTextClassification.from_pandas(my_dataframe)
>>> dataset.to_datasets()
>>> # Looping over the dataset:
>>> for record in dataset:
...     print(record)
>>> # Passing in a list of records:
>>> records = [
...     rb.TextClassificationRecord(inputs="example"),
...     rb.TextClassificationRecord(inputs="another example"),
... ]
>>> dataset = rb.DatasetForTextClassification(records)
>>> assert len(dataset) == 2
>>> # Indexing into the dataset:
>>> dataset[0]
... rb.TextClassificationRecord(inputs={"text": "example"})
>>> dataset[0] = rb.TextClassificationRecord(inputs="replaced example")

classmethod from_datasets(dataset)

Imports records from a datasets.Dataset.

Columns that are not supported are ignored.

Parameters: dataset (datasets.Dataset) – A datasets Dataset from which to import the records.
Returns: The imported records in a Rubrix Dataset.
Return type: DatasetForTextClassification

Examples

>>> import datasets
>>> ds = datasets.Dataset.from_dict({
...     "inputs": ["example"],
...     "prediction": [
...         [{"label": "LABEL1", "score": 0.9}, {"label": "LABEL2", "score": 0.1}]
...     ]
... })
>>> DatasetForTextClassification.from_datasets(ds)
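In the example above, each prediction is stored as a dict with `label` and `score` keys, whereas `TextClassificationRecord` takes `(label, score)` tuples. A minimal sketch of converting between the two layouts, using a hypothetical helper `predictions_to_dicts` and plain Python containers only:

```python
def predictions_to_dicts(prediction):
    """Hypothetical helper: turn (label, score) tuples into label/score dicts."""
    return [{"label": label, "score": score} for label, score in prediction]


row_prediction = predictions_to_dicts([("LABEL1", 0.9), ("LABEL2", 0.1)])
# Column layout matching the dict passed to datasets.Dataset.from_dict above.
columns = {"inputs": ["example"], "prediction": [row_prediction]}
print(columns)
```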

classmethod from_pandas(dataframe)

Imports records from a pandas.DataFrame.

Columns that are not supported are ignored.

Parameters: dataframe (pandas.core.frame.DataFrame) – A pandas DataFrame from which to import the records.
Returns: The imported records in a Rubrix Dataset.
Return type: rubrix.client.datasets.DatasetForTextClassification

class rubrix.client.datasets.DatasetForTokenClassification(records=None)

This Dataset contains TokenClassificationRecord records.

It allows you to export/import records into/from different formats, loop over the records, and access them by index.

Parameters: records (Optional[List[rubrix.client.models.TokenClassificationRecord]]) – A list of `TokenClassificationRecord`s.
Raises: WrongRecordTypeError – When the record type in the provided list does not correspond to the dataset type.

Examples

>>> import rubrix as rb
>>> # Import/export records:
>>> dataset = rb.DatasetForTokenClassification.from_pandas(my_dataframe)
>>> dataset.to_datasets()
>>> # Looping over the dataset:
>>> for record in dataset:
...     print(record)
>>> # Passing in a list of records:
>>> records = [
...     rb.TokenClassificationRecord(text="example", tokens=["example"]),
...     rb.TokenClassificationRecord(text="another example", tokens=["another", "example"]),
... ]
>>> dataset = rb.DatasetForTokenClassification(records)
>>> assert len(dataset) == 2
>>> # Indexing into the dataset:
>>> dataset[0]
... rb.TokenClassificationRecord(text="example", tokens=["example"])
>>> dataset[0] = rb.TokenClassificationRecord(text="replace example", tokens=["replace", "example"])

classmethod from_datasets(dataset)

Imports records from a datasets.Dataset.

Columns that are not supported are ignored.

Parameters: dataset (datasets.Dataset) – A datasets Dataset from which to import the records.
Returns: The imported records in a Rubrix Dataset.
Return type: DatasetForTokenClassification

Examples

>>> import datasets
>>> ds = datasets.Dataset.from_dict({
...     "text": ["my example"],
...     "tokens": [["my", "example"]],
...     "prediction": [
...         [{"label": "LABEL1", "start": 3, "end": 10, "score": 1.0}]
...     ]
... })
>>> DatasetForTokenClassification.from_datasets(ds)

classmethod from_pandas(dataframe)

Imports records from a pandas.DataFrame.

Columns that are not supported are ignored.

Parameters: dataframe (pandas.core.frame.DataFrame) – A pandas DataFrame from which to import the records.
Returns: The imported records in a Rubrix Dataset.
Return type: rubrix.client.datasets.DatasetForTokenClassification

rubrix.client.datasets.read_datasets(dataset, task)

Reads a datasets Dataset and returns a Rubrix Dataset.

Parameters

dataset (datasets.Dataset) – Dataset to be read in.
task (Union[str, rubrix.client.sdk.datasets.models.TaskType]) – Task for the dataset, one of: [“TextClassification”, “TokenClassification”, “Text2Text”]

Returns

A Rubrix dataset for the given task.

Return type

Union[rubrix.client.datasets.DatasetForTextClassification, rubrix.client.datasets.DatasetForTokenClassification, rubrix.client.datasets.DatasetForText2Text]

Examples

>>> # Read text classification records from a datasets Dataset
>>> import datasets
>>> ds = datasets.Dataset.from_dict({
...     "inputs": ["example"],
...     "prediction": [
...         [{"label": "LABEL1", "score": 0.9}, {"label": "LABEL2", "score": 0.1}]
...     ]
... })
>>> read_datasets(ds, task="TextClassification")
>>>
>>> # Read token classification records from a datasets Dataset
>>> ds = datasets.Dataset.from_dict({
...     "text": ["my example"],
...     "tokens": [["my", "example"]],
...     "prediction": [
...         [{"label": "LABEL1", "start": 3, "end": 10}]
...     ]
... })
>>> read_datasets(ds, task="TokenClassification")
>>>
>>> # Read text2text records from a datasets Dataset
>>> ds = datasets.Dataset.from_dict({
...     "text": ["my example"],
...     "prediction": [["mi ejemplo", "ejemplo mio"]]
... })
>>> # or
>>> ds = datasets.Dataset.from_dict({
...     "text": ["my example"],
...     "prediction": [[{"text": "mi ejemplo", "score": 0.9}]]
... })
>>> read_datasets(ds, task="Text2Text")

rubrix.client.datasets.read_pandas(dataframe, task)

Reads a pandas DataFrame and returns a Rubrix Dataset.

Parameters

dataframe (pandas.core.frame.DataFrame) – Dataframe to be read in.
task (Union[str, rubrix.client.sdk.datasets.models.TaskType]) – Task for the dataset, one of: [“TextClassification”, “TokenClassification”, “Text2Text”]

Returns

A Rubrix dataset for the given task.

Return type

Union[rubrix.client.datasets.DatasetForTextClassification, rubrix.client.datasets.DatasetForTokenClassification, rubrix.client.datasets.DatasetForText2Text]

Examples

>>> # Read text classification records from a pandas DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "inputs": ["example"],
...     "prediction": [
...         [("LABEL1", 0.9), ("LABEL2", 0.1)]
...     ]
... })
>>> read_pandas(df, task="TextClassification")
>>>
>>> # Read token classification records from a pandas DataFrame
>>> df = pd.DataFrame({
...     "text": ["my example"],
...     "tokens": [["my", "example"]],
...     "prediction": [
...         [("LABEL1", 3, 10)]
...     ]
... })
>>> read_pandas(df, task="TokenClassification")
>>>
>>> # Read text2text records from a pandas DataFrame
>>> df = pd.DataFrame({
...     "text": ["my example"],
...     "prediction": [["mi ejemplo", "ejemplo mio"]]
... })
>>> # or
>>> df = pd.DataFrame({
...     "text": ["my example"],
...     "prediction": [[("mi ejemplo", 0.9)]]
... })
>>> read_pandas(df, task="Text2Text")