One of the most important activities in a machine learning problem can be the data labelling stage. If you start out with completely unlabelled data, you can either use unsupervised learning techniques, or you can label data and then use semi-supervised or supervised machine learning techniques.
superintendent is a library that helps you label data in Jupyter notebooks. It manages the “annotation loop”: handling a queue of data to be labelled, processing incoming data, and even handling multiple labellers.
superintendent also lets you use machine learning to speed up the labelling process.
The library is designed to work with the large and growing data annotation ecosystem that already exists in Jupyter notebooks. It effectively wraps other UI elements, managing the data input and annotation output.
For example, you can use ipyannotations (its sister library) with it.
When do you need data labelling?
Many machine learning and data science projects start with data, and you want to make some predictions about the data. This is hard, because you likely don’t have labels at the beginning. For example, say you work at a manufacturing plant and want to build a classifier for defective components. You may have a set of pictures of these components, but you may not have labels for which are defective.
A project like this can be sped up immensely by quickly labelling a few pictures. And with the right infrastructure, you can label several hundred in just a few hours. The rest of the project will then become much easier.
Let’s start by taking a classification widget, which lets you annotate data points as belonging to one of a set of classes. First I will show you how quickly you can get going, and then I will discuss in a bit more detail what each component means.
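    from superintendent import Superintendent
    from ipyannotations.images import ClassLabeller
    from sklearn.datasets import load_digits

    # Load the digits dataset and reshape each sample into an 8x8 image.
    input_data = load_digits().data.reshape(-1, 8, 8)

    # One button per digit class: 1 to 9, followed by 0.
    input_widget = ClassLabeller(
        options=list(range(1, 10)) + [0],
        image_size=(100, 100),
    )

    # Superintendent manages the queue of images and stores the labels.
    data_labeller = Superintendent(
        features=input_data,
        labelling_widget=input_widget,
    )

    # Display the widget (as the last expression in a notebook cell).
    data_labeller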
The above interface takes the images from the dataset and displays them. It offers a set of buttons for each class the image could possibly belong to. When one of the buttons is clicked, the label is stored, and the next image is loaded automatically. The progress bar at the top will show how much of the dataset has been labelled already.
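The behaviour described above (store the label, load the next item, update the progress bar) can be sketched in plain Python. This is a minimal illustration and not superintendent's implementation; the `LabellingQueue` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class LabellingQueue:
    """Hypothetical stand-in for the state behind the widget."""

    items: List[Any]
    labels: Dict[int, Any] = field(default_factory=dict)  # item index -> label
    position: int = 0

    def current(self) -> Optional[Any]:
        """Return the item waiting for a label, or None when all are done."""
        if self.position < len(self.items):
            return self.items[self.position]
        return None

    def submit(self, label: Any) -> None:
        """What a button click would do: store the label, then move on."""
        if self.position >= len(self.items):
            raise IndexError("nothing left to label")
        self.labels[self.position] = label
        self.position += 1

    @property
    def progress(self) -> float:
        """Fraction of items labelled so far, as a progress bar would show."""
        return len(self.labels) / len(self.items) if self.items else 1.0


queue = LabellingQueue(items=["img_a", "img_b", "img_c", "img_d"])
queue.submit(3)  # simulate clicking the "3" button for the first image
queue.submit(7)  # simulate clicking the "7" button for the second image
```

After two simulated clicks, `queue.current()` points at the third item and `queue.progress` reports that half of the data has been labelled.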
Throughout the documentation, all widgets will look like they do in a Jupyter notebook. However, clicking on any of the buttons will not trigger anything. This is because there is no running backend. To really get a feel for how it all works, you should install superintendent and try it out in a notebook.
Necessary components to label data
To label your data, you need:
1. The data points to annotate.
2. A way of showing those data points to the user.
3. A way of capturing input.
4. Some method to store the captured annotation.
Superintendent handles the first and last aspects: taking your data, organising it in a way that lets people label it, and then storing the annotations in the same format.
Steps 2 and 3 depend so heavily on the data that they will likely differ between use cases. Users can therefore define their own labelling procedures. Common use cases are implemented by the ipyannotations library, which is written to accompany superintendent.
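To make the division of labour concrete, here is a rough sketch in which the display and capture steps are passed in as functions, while the surrounding loop only iterates over the data and stores the results. The function names (`label_all`, `show_text`, `capture_by_rule`) are invented for this illustration and are not part of superintendent or ipyannotations.

```python
from typing import Any, Callable, Iterable, List, Tuple


def label_all(
    features: Iterable[Any],
    show: Callable[[Any], None],
    capture: Callable[[Any], Any],
) -> List[Tuple[Any, Any]]:
    """Iterate over the data (step 1), show each item (step 2),
    capture an annotation (step 3), and store it (step 4)."""
    annotations = []
    for item in features:
        show(item)
        annotations.append((item, capture(item)))
    return annotations


shown: List[str] = []


def show_text(item: str) -> None:
    # Stand-in for a UI element; it only records what would be displayed.
    shown.append(f"Please label: {item}")


def capture_by_rule(item: str) -> str:
    # Stand-in for a human click, so that the example runs unattended.
    return "long" if len(item) > 4 else "short"


result = label_all(["bolt", "washer", "nut"], show_text, capture_by_rule)
```

Swapping `show_text` and `capture_by_rule` for functions suited to images, audio, or text leaves `label_all` unchanged, which is the point of keeping steps 2 and 3 separate from the rest.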
If your data is large, such as big images, it may be useful to store it as individual files on disk and use the file paths as the features supplied to superintendent. You can then handle loading each file at the point where it is needed.
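A minimal sketch of the file-path approach, using only the standard library: the features are paths, and a loader reads a file only when that item comes up. The `load_for_display` helper and the file names are hypothetical; how such a loader is wired into a particular labelling widget depends on the widget and is not shown here.

```python
import tempfile
from pathlib import Path


def load_for_display(path: Path) -> bytes:
    """Hypothetical loader, called only when an item is about to be shown."""
    return path.read_bytes()


# Create three small stand-in files; real data would be large images on disk.
tmp_dir = Path(tempfile.mkdtemp())
for i in range(3):
    (tmp_dir / f"component_{i}.bin").write_bytes(bytes([i]) * 1024)

# The features are just paths, which are cheap to hold in memory and to queue.
features = sorted(tmp_dir.glob("component_*.bin"))

# Only the item currently being labelled is read from disk.
first_item = load_for_display(features[0])
```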
Elements of a Superintendent Widget
In the example above, the “UI” which provides the input (and the image display) is provided by a wholly separate library: ipyannotations. This is a deliberate choice to uncouple any data labelling tools from the core of superintendent, which is handling the queue of data points to label.
To show which parts of the widget are provided by third-party libraries, and which parts are from superintendent, I am going to highlight them:
    from superintendent import Superintendent
    from ipyannotations.images import ClassLabeller
    from sklearn.datasets import load_digits
    import ipywidgets

    # One button per digit class: 1 to 9, followed by 0.
    input_widget = ClassLabeller(
        options=list(range(1, 10)) + [0],
        image_size=(100, 100),
    )
    input_data = load_digits().data.reshape(-1, 8, 8)
    data_labeller = Superintendent(
        features=input_data,
        labelling_widget=input_widget,
    )

    # Assumes the first child is superintendent's own top bar and the
    # second child is the labelling widget.
    data_labeller.children[0].layout = ipywidgets.Layout(
        border='solid 2px orange',
    )
    data_labeller.children[1].layout = ipywidgets.Layout(
        border='solid 2px green',
    )

    data_labeller
The orange part is provided by Superintendent. It contains a progress bar, but if you are doing active learning, it would also contain a button and performance indicator.
The green part is provided by the specific labelling widget I have chosen: