API Reference

There is really only one class you should use as your entry point to superintendent:

class superintendent.Superintendent(**kwargs: Any)[source]

Data point labelling.

This is a base class for data point labelling.

Instantiating it creates a widget that allows you to label data points.

Parameters
  • features (np.ndarray, pd.DataFrame, sequence) – This should be either a numpy array, a pandas dataframe, or any other sequence object (e.g. a list). You can also add data later.

  • labels (np.array, pd.Series, sequence) – The labels for your data, if you have some already.

  • queue (BaseLabellingQueue) – A queue object. The interface needs to follow the abstract class superintendent.queueing.BaseLabellingQueue. By default, a SimpleLabellingQueue (an in-memory queue using python’s deque) is used.

  • labelling_widget (Optional[widgets.Widget]) – An input widget. This needs to follow the interface of the class superintendent.controls.base.SubmissionWidgetMixin.

  • model (sklearn.base.BaseEstimator) – An sklearn-interface compliant model (that implements fit, predict, predict_proba and score).

  • eval_method (callable) – A function that accepts three arguments - model, x, and y - and returns the score of the model. If None, sklearn.model_selection.cross_val_score is used.

  • acquisition_function (callable) – A function that re-orders data points during active learning. This can be a function that accepts a numpy array (class probabilities) or a string referring to a function from superintendent.acquisition_functions.

  • shuffle_prop (float) – The proportion of data points that is shuffled when re-ordering during active learning. This is to avoid biasing too much towards the model predictions.

  • model_preprocess (callable) – A function that accepts x and y data and returns x and y data. y can be None (in which case it should return x, None) as this function is used on the un-labelled data too.

  • worker_id (bool | str) – If True, will check for the worker’s ID first - this can be helpful when working in a distributed fashion. If a string, this is used as the worker ID. If False, a UUID is generated for this widget.
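
A minimal usage sketch: the ClassLabeller widget and its import path below are assumptions borrowed from the ipyannotations package (any widget that follows the superintendent.controls.base.SubmissionWidgetMixin interface works), and the data is a random placeholder:

  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from ipyannotations.generic import ClassLabeller  # assumed import path

  from superintendent import Superintendent

  features = np.random.rand(500, 2)  # placeholder data

  data_labeller = Superintendent(
      features=features,
      labelling_widget=ClassLabeller(options=["a", "b"]),  # assumed widget API
      model=LogisticRegression(),
      acquisition_function="entropy",
      shuffle_prop=0.1,
  )
  data_labeller  # display the widget in a Jupyter notebook cell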

add_features(features, labels=None)[source]

Add data to the widget.

This adds the data provided to the queue of data to be labelled. You can optionally provide labels for each data point.

Parameters
  • features (Any) – The data you’d like to add to the labelling widget.

  • labels (Any, optional) – The labels for the data you’re adding, if you already have them.
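
For example, assuming the data_labeller instance sketched above:

  import numpy as np

  # Add new, unlabelled data to the labelling queue:
  data_labeller.add_features(np.random.rand(20, 2))

  # Add data together with labels you already have:
  already_labelled = np.random.rand(10, 2)
  data_labeller.add_features(already_labelled, labels=["a"] * 10)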

orchestrate(interval_seconds: Optional[float] = None, interval_n_labels: int = 0, shuffle_prop: float = 0.1, max_runs: float = inf)[source]

Orchestrate the active learning process.

This method can either re-train the classifier and re-order the data once, or it can run a never-ending loop that re-trains the model at regular intervals, measured both in elapsed time and in the number of newly labelled data points.

Parameters
  • interval_seconds (float, optional) – How often the retraining should occur, in seconds. If this is None (the default), the retraining only happens once and the method returns; this is suitable if you want the retraining schedule to be maintained externally, e.g. by a cron job.

  • interval_n_labels (int, optional) – How many new data points need to have been labelled in between runs in order for the re-training to occur.

  • shuffle_prop (float) – What proportion of the data should be randomly sampled on each re-training run.

  • max_runs (float, int) – How many orchestration runs to do at most. By default infinite.

Return type

None
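
A short sketch of the two modes, again assuming the data_labeller instance from above:

  # Run the retrain / re-order step exactly once, e.g. from a cron job or a
  # separate worker process:
  data_labeller.orchestrate(interval_seconds=None)

  # Or loop: retrain every 30 seconds, but only once at least 10 new labels
  # have been submitted since the last run, for at most 100 runs:
  data_labeller.orchestrate(
      interval_seconds=30,
      interval_n_labels=10,
      max_runs=100,
  )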

retrain(button=None)[source]

Re-train the classifier you passed when creating this widget.

This calls the fit method of your model with the data that you’ve labelled. It will also score the classifier and display the performance.

Parameters

button (widget.Widget, optional) – Optional & ignored; this is passed when invoked by a button.
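
For example, to trigger a manual retrain after labelling a batch of points by hand (assuming the data_labeller instance from above):

  data_labeller.retrain()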

Acquisition functions

During active learning, acquisition functions rank unlabelled data points based on model predictions.

In superintendent, the functions accept a two- or three-dimensional array of shape (n_samples, n_classes) or (n_samples, n_classes, n_outputs).

The third dimension only applies in a multi-output classification setting; in that case, superintendent calculates the score for each output and then averages them for each data point.
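
As an illustration, the sketch below implements a "least confident" ordering as a custom acquisition function; the exact return convention (argsort-style indices, most informative sample first) is an assumption based on the sorting descriptions below, not a guarantee of the library's internals.

  import numpy as np

  def least_confident(probabilities: np.ndarray) -> np.ndarray:
      """Order samples so the lowest maximum class probability comes first."""
      if probabilities.ndim == 3:
          # multi-output case, shape (n_samples, n_classes, n_outputs):
          # score each output, then average the scores per data point
          scores = 1 - probabilities.max(axis=1).mean(axis=-1)
      else:
          scores = 1 - probabilities.max(axis=1)
      return np.argsort(-scores)  # most uncertain samples first

  # Pass it in place of a string name:
  # Superintendent(..., acquisition_function=least_confident)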

superintendent.acquisition_functions.entropy(probabilities: ndarray) ndarray[source]

Sort by the entropy of the probabilities (high to low).

Parameters
  • probabilities (np.ndarray) – An array of probabilities, with shape (n_samples, n_classes).

  • shuffle_prop (float (default=0.1)) – The proportion of data points that should be randomly shuffled. This means the sorting retains some randomness, which avoids biasing your new labels too heavily towards the model’s predictions and helps catch any minority classes the algorithm currently classifies as a different label.
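
A rough NumPy sketch of the score being sorted on (not the library’s implementation):

  import numpy as np

  probabilities = np.array([[0.5, 0.5],    # maximally uncertain -> labelled first
                            [0.9, 0.1]])   # fairly confident    -> labelled later
  entropies = -np.sum(probabilities * np.log(probabilities + 1e-12), axis=1)
  order = np.argsort(-entropies)           # high entropy first
  print(order)                             # [0 1]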

superintendent.acquisition_functions.margin(probabilities: ndarray) ndarray[source]

Sort by the margin between the top two predictions (low to high).

Parameters
  • probabilities (np.ndarray) – An array of probabilities, with shape (n_samples, n_classes).

  • shuffle_prop (float) – The proportion of data points that should be randomly shuffled. This means the sorting retains some randomness, which avoids biasing your new labels too heavily towards the model’s predictions and helps catch any minority classes the algorithm currently classifies as a different label.
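
A rough NumPy sketch of the margin score (not the library’s implementation):

  import numpy as np

  probabilities = np.array([[0.51, 0.49],   # the model is torn -> labelled first
                            [0.95, 0.05]])  # clear favourite   -> labelled later
  top_two = np.sort(probabilities, axis=1)[:, -2:]
  margins = top_two[:, 1] - top_two[:, 0]
  order = np.argsort(margins)               # small margin first
  print(order)                              # [0 1]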

superintendent.acquisition_functions.certainty(probabilities: ndarray)[source]

Sort by the certainty of the maximum prediction.

Parameters
  • probabilities (np.ndarray) – An array of probabilities, with shape (n_samples, n_classes).

  • shuffle_prop (float) – The proportion of data points that should be randomly shuffled. This means the sorting retains some randomness, which avoids biasing your new labels too heavily towards the model’s predictions and helps catch any minority classes the algorithm currently classifies as a different label.
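
A rough NumPy sketch of the certainty score; the assumption here is that the sort puts the least certain samples first, mirroring the other acquisition functions.

  import numpy as np

  probabilities = np.array([[0.4, 0.6],    # low certainty  -> labelled first
                            [0.1, 0.9]])   # high certainty -> labelled later
  certainties = probabilities.max(axis=1)
  order = np.argsort(certainties)          # least certain first
  print(order)                             # [0 1]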