Using docker-compose to serve a superintendent interface to many users

The documentation already shows how a SQL database can be used to distribute labelling across many users quite effectively (see the page on distributing labelling, ../../distributing-labelling).

However, setting up a database can be difficult; and although even a small database instance would suffice for hundreds of labellers, running one in the cloud comes with some cost.

Fortunately, more and more organisations have docker running on a server; and even if your organisation has neither its own hardware nor docker running in the cloud, all of the popular cloud providers offer docker (and, in particular, docker-compose) as a service.

This means it becomes relatively easy for you to manage a database as a back-end, a jupyter server as a front-end, and a model-training server to support active learning.

docker-compose

Docker-compose allows you to specify “multi-container” applications (each container behaves much like a separate machine) that you can then start and stop all at the same time.

You should make sure you have docker and docker-compose installed before continuing.

Here, we are going to start four containers (called “services” in the configuration), and the configuration file will look like this:

version: '3.1'

services:

  db:
    image: postgres
    restart: always
    environment: &environment
      POSTGRES_USER: superintendent
      POSTGRES_PASSWORD: superintendent
      POSTGRES_DB: labelling
      PGDATA: /data/postgres
    volumes:
      - "postgres-data:/data/postgres"
    ports:
      - 5432:5432

  adminer:
    image: adminer
    restart: always
    ports:
      - 8080:8080

  orchestrator:
    build:
      context: .
      dockerfile: tensorflow.Dockerfile
    restart: always
    depends_on:
      - "db"
    environment: *environment
    entrypoint: python /app/orchestrate.py
    volumes:
      - ./orchestrate.py:/app/orchestrate.py

  notebook:
    build:
      context: .
      dockerfile: voila.Dockerfile
    restart: always
    depends_on:
      - "db"
    environment: *environment
    volumes:
      - ./voila-interface.ipynb:/home/anaconda/app/app.ipynb
    ports:
      - 8866:8866

volumes:
  postgres-data:

Let’s go through each item.

  • db

    The database server. This will use an official PostgreSQL (https://www.postgresql.org/) docker image. You can see that we are providing a “volume”, meaning all the data inside the database is stored in the named docker volume postgres-data, so it survives restarts of the container.

    Note

    The username and password here are only examples; for safety, you should use randomly generated strings instead.

  • adminer

    This is purely to provide a graphical interface to the database.

  • notebook

    This is the server that will actually serve our notebook as a website. It uses an image built from voila.Dockerfile, which doesn’t exist yet; we will create it soon.

    Note that we’re mounting the notebook into the container’s app folder (/home/anaconda/app); this means the container will know what to serve.

    Note also that we’re giving this server the same environment variables as the database server (which we captured using the YAML anchor &environment and reuse with *environment).

  • orchestrator

    This server will run an orchestration script (which we are mounting as a volume) that will re-train the model and re-order the data in the database.
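
The note under db recommends randomly generated credentials. One way to produce them is sketched below with Python's standard secrets module. The helper name generate_credentials and the KEY=value output format are illustrative assumptions and not part of superintendent or docker-compose.

```python
import secrets


def generate_credentials(n_bytes=24):
    """Return a random user name and password (hypothetical helper)."""
    # token_hex and token_urlsafe only emit letters, digits, "-" and "_",
    # so the values can be placed in a connection URL without escaping.
    return {
        "POSTGRES_USER": "superintendent_" + secrets.token_hex(4),
        "POSTGRES_PASSWORD": secrets.token_urlsafe(n_bytes),
    }


if __name__ == "__main__":
    # Print the values so they can be copied into docker-compose.yml.
    for key, value in generate_credentials().items():
        print(f"{key}={value}")
```

The printed values would replace the example POSTGRES_USER and POSTGRES_PASSWORD entries in the environment section of the db service.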

The notebook (our webapp)

To make superintendent read from the database and display the images (we’ll be using MNIST again…), we need one file with the following content:

./voila-interface.ipynb

import os
from superintendent import Superintendent
from ipyannotations.images import ClassLabeller
from IPython import display

user = os.getenv('POSTGRES_USER', "superintendent")
pw = os.getenv('POSTGRES_PASSWORD', "superintendent")
db_name = os.getenv('POSTGRES_DB', "labelling")

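# inside docker-compose, the database is reachable under its service name ("db"),
# not under localhost, which would refer to the notebook container itself
db_string = f"postgresql+psycopg2://{user}:{pw}@db:5432/{db_name}"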

input_widget = ClassLabeller(options=list(range(1, 10)) + [0], image_size=(100, 100))

widget = Superintendent(database_url=db_string, labelling_widget=input_widget)

display.display(widget)
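
If you switch to randomly generated credentials, the password may contain characters such as "@" or "/" that have a special meaning inside a URL. The following is a minimal sketch of assembling the connection string with the credentials escaped first. The function name build_db_url is hypothetical, and the default host db assumes the code runs inside the docker-compose network described above.

```python
import os
from urllib.parse import quote_plus


def build_db_url(host="db", port=5432):
    """Assemble a PostgreSQL URL from the POSTGRES_* environment variables."""
    user = os.getenv("POSTGRES_USER", "superintendent")
    pw = os.getenv("POSTGRES_PASSWORD", "superintendent")
    db_name = os.getenv("POSTGRES_DB", "labelling")
    # Escape the credentials so characters like "@" or "/" do not break the URL.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(pw)}"
        f"@{host}:{port}/{db_name}"
    )
```

The return value could then be passed as database_url in place of a hand-written string.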

The orchestration script (our machine learning model)

This script will look very similar to our notebook, but we will additionally create our machine learning model. This time, we will use a neural network, using keras.

import os
import time

import sqlalchemy
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from tensorflow import keras

from superintendent import Superintendent
from ipyannotations.images import ClassLabeller


def keras_model():
    model = keras.models.Sequential(
        [
            keras.layers.Conv2D(
                filters=8, kernel_size=3, activation="relu", input_shape=(8, 8, 1)
            ),
            keras.layers.MaxPool2D(2),
            keras.layers.Conv2D(filters=16, kernel_size=3, activation="relu"),
            keras.layers.GlobalMaxPooling2D(),
            keras.layers.Flatten(),
            keras.layers.Dense(10, activation="softmax"),
        ]
    )
    model.compile(keras.optimizers.Adam(), keras.losses.CategoricalCrossentropy())
    return model


def evaluate_keras(model, x, y):
    return cross_val_score(model, x, y, scoring="accuracy", cv=3)


def wait_for_db(db_string):
    database_up = False
    connection = sqlalchemy.create_engine(db_string)
    while not database_up:
        time.sleep(2)
        try:
            print("attempting connection...")
            connection.connect()
            database_up = True
            print("connected!")
        except sqlalchemy.exc.OperationalError:
            continue


model = keras.wrappers.scikit_learn.KerasClassifier(keras_model, epochs=5)

user = os.getenv("POSTGRES_USER")
pw = os.getenv("POSTGRES_PASSWORD")
db_name = os.getenv("POSTGRES_DB")

db_string = f"postgresql+psycopg2://{user}:{pw}@db:5432/{db_name}"

# wait some time, so that the DB has time to start up
wait_for_db(db_string)

# create our superintendent widget:
input_widget = ClassLabeller(options=list(range(1, 10)) + [0], image_size=(100, 100))

widget = Superintendent(
    database_url=db_string,
    labelling_widget=input_widget,
    model=model,
    eval_method=evaluate_keras,
    acquisition_function="entropy",
    shuffle_prop=0.1,
    model_preprocess=lambda x, y: (x.reshape(-1, 8, 8, 1), y),
)

# if we've never added any data to this db, load it and add it:
if len(widget.queue) == 0:
    digit_data = load_digits().data
    widget.add_features(digit_data)

if __name__ == "__main__":
    # run orchestration every 30 seconds
    widget.orchestrate(interval_seconds=30, interval_n_labels=10)
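
The script above passes acquisition_function="entropy" to Superintendent. As a rough illustration of what entropy-based prioritisation means in general (this is a sketch of the idea, not superintendent's own implementation), the code below computes the Shannon entropy of predicted class probabilities and ranks the most uncertain predictions first. The function names are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def rank_by_uncertainty(predictions):
    """Return indices of predictions, most uncertain (highest entropy) first."""
    return sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )


if __name__ == "__main__":
    predictions = [
        [0.98, 0.01, 0.01],  # confident
        [0.34, 0.33, 0.33],  # very uncertain
        [0.70, 0.20, 0.10],  # in between
    ]
    print(rank_by_uncertainty(predictions))
```

Data points for which the model is least sure are shown to labellers first, which is why the order of the queue changes as the model is re-trained.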

Note

In this case, we are adding the image data straight into the database. This means the numpy array is serialised using JSON. If your images are large, this can be too much for the database. Instead, it’s recommended that you only place the file paths of the images into the database.
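
To get a feel for the difference, the sketch below compares the JSON-serialised size of an image with that of a file path. It assumes the array is stored roughly as nested lists of numbers; the exact format superintendent uses may differ, and the image sizes and the path are made up for illustration.

```python
import json


def json_size(obj):
    """Number of bytes needed to store obj as JSON text."""
    return len(json.dumps(obj).encode("utf-8"))


# An 8x8 digit, like the ones used in this example, as nested lists.
small_image = [[0.0] * 8 for _ in range(8)]
# A synthetic 1000x1000 greyscale image, standing in for a "large" image.
large_image = [[0.0] * 1000 for _ in range(1000)]
# A hypothetical path to the same image stored on disk.
image_path = "/data/images/000001.png"

print(json_size(small_image), "bytes for the 8x8 image")
print(json_size(large_image), "bytes for the 1000x1000 image")
print(json_size(image_path), "bytes for the file path")
```

The small digits are harmless, but the serialised size grows with the number of pixels, whereas a path stays a few dozen bytes regardless of image size.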

Dockerfiles

Then, we need to actually build two docker images: one that will run the web application, and one that will run the orchestration:

Web application (voila) dockerfile

FROM continuumio/miniconda3:4.6.14-alpine

RUN /opt/conda/bin/pip install --upgrade pip

RUN mkdir /home/anaconda/app
WORKDIR /home/anaconda/app

# install some extra dependencies
# quote the requirement, otherwise the shell treats ">" as an output redirect
RUN /opt/conda/bin/pip install "voila>=0.1.2"
RUN /opt/conda/bin/pip install ipyannotations
RUN /opt/conda/bin/pip install "superintendent>=0.6.0"

ENTRYPOINT ["/opt/conda/bin/voila", "--debug", "--VoilaConfiguration.extension_language_mapping={'.py':'python'}"]
CMD ["app.ipynb"]

Model training dockerfile

FROM tensorflow/tensorflow:2.10.0

RUN pip install "ipyannotations>=0.5.1"
RUN pip install "superintendent>=0.6.0"

RUN mkdir /app
WORKDIR /app

ENTRYPOINT ["python"]
CMD ["app.py"]

Starting

At this point, our folder structure should be:

.
├── docker-compose.yml
├── orchestrate.py
├── voila-interface.ipynb
├── tensorflow.Dockerfile
└── voila.Dockerfile

Now, we can run docker-compose up, which will:

  1. build the docker images specified in docker-compose.yml

  2. start the four containers

If you then visit http://localhost:8866, you will be able to start labelling. And, of course, if you do this on a web server, you’ll be able to point other people to that address, so they can start labelling too.

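As you and your colleagues proceed with the labelling, you can inspect the content of the database at http://localhost:8080, the “adminer” interface (a web interface to inspect databases). When you log in, make sure to set “System” to PostgreSQL and “Server” to db (the name of the database service), and use the username, password and database name from docker-compose.yml.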
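
If one of the pages does not load, it can help to check whether the published ports are reachable at all. The following is a small sketch using only Python's standard library. The helper port_is_open is hypothetical, and the port numbers are the ones published in the docker-compose.yml above.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports published in docker-compose.yml: notebook, adminer, db.
    for name, port in [("notebook", 8866), ("adminer", 8080), ("db", 5432)]:
        state = "up" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} (port {port}): {state}")
```

An open port only shows that something is listening; the output of docker-compose logs remains the better place to look for the actual cause of a failure.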