Notebook as a Service (NaaS)
Over the past few weeks I’ve been asked a couple of times about invoking a Jupyter Notebook created in IBM Watson Studio via an API. Out of the box I am not aware of a way to achieve this, but after a bit of thinking I could see that it should be possible to put a wrapper around the Notebook and have this present the API and invoke the Notebook when called.
With this concept in mind I started to search the internet for similar patterns and I found “Executing Jupyter Notebooks on serverless GCP products”, which was a pretty good match. This introduced me to Papermill, which was the key missing part I was looking for. Papermill is really a pre-processor for Notebooks which can be used to replace content in a Notebook before executing it. This provided me with a way to pass in data from an API call ready to be processed by the Notebook at execution time. As the data will vary from call to call this is a key requirement. To make things as flexible as possible I decided to focus on using JSON payloads as input to and output from the Notebook (and wrapper code). This meant that I could design a single wrapper layer and require the Notebooks to comply with a data passing scheme. The scheme I decided on was as follows:
- JSON passed via the API call to be written to a variable named input in a cell tagged with the tag parameters. The data will be inserted as a JSON object.
- The name of the file where the Notebook writes its results to be passed in a variable named output, again in the cell tagged with the parameters tag. The reason for passing a filename is that the API could be called by multiple clients at the same time, so there is a need to separate the results.
- The results from the Notebook execution will be written to the file specified in output as a JSON string.
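To make the scheme concrete, a Notebook that follows it might look like this. This is a minimal sketch: only the input and output variable names are part of the scheme, while the result shape and the example file path are illustrative.

```python
import json

# Cell tagged "parameters" -- Papermill overwrites these defaults
# with the values supplied by the wrapper at execution time
input = {"firstName": "Mary", "surname": "Jane"}   # JSON payload from the API call
output = "/tmp/output-example.json"                # file to write the results to

# Final cell: write the Notebook's results to the agreed file as a JSON string
result = {"fullName": input["firstName"] + " " + input["surname"]}
with open(output, "w") as f:
    json.dump(result, f)
```

Because the wrapper only reads whatever JSON string lands in the output file, each Notebook is free to return any shape of result it likes.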
So the high-level flow would look as follows:
When I looked at this more closely I realised that if I used the Notebook name to scope the API I could use a single wrapper to cover a collection of Notebooks. This is the approach I took, and I created the following Python wrapper code.
import datetime
import json
import logging
import os

import papermill as pm
from flask import Flask, request

app = Flask(__name__)
logging.getLogger().setLevel(logging.INFO)

@app.route('/api/<notebook>', methods=['POST'])
def add_message(notebook):
    logging.info("starting job")
    logging.info("Running Notebook " + notebook)
    content_type = request.headers.get('Content-Type')
    if content_type == 'application/json':
        logging.info("application/json content type")
        logging.info(request.data)
        content = json.loads(request.data)
        logging.info(content)
    else:
        return {"error": "Content-Type must be application/json"}, 415
    # Use the current timestamp to build per-request filenames so that
    # concurrent callers do not clash
    dtNow = datetime.datetime.now().strftime("%f")
    outputName = "/tmp/output-run-" + dtNow + ".json"
    nbOutput = "/tmp/" + notebook + dtNow + ".ipynb"
    nbInput = "notebooks/" + notebook + '.ipynb'
    parameters = {
        'input': content,
        'output': outputName
    }
    # Execute the Notebook with the injected parameters
    pm.execute_notebook(
        nbInput,
        nbOutput,
        parameters=parameters
    )
    # Load the Notebook's results file as a dictionary
    with open(outputName) as f:
        data = json.load(f)
    # Clean up the generated files
    if os.path.exists(nbOutput):
        os.remove(nbOutput)
    else:
        logging.warning("The Notebook output file does not exist")
    if os.path.exists(outputName):
        os.remove(outputName)
    else:
        logging.warning("The output file does not exist")
    logging.info("job completed")
    return data

if __name__ == '__main__':
    port = int(os.environ.get('PORT', '8080'))
    app.run(port=port)
    logging.info("All running")
This code is similar to the code in the referenced article but I’ve reworked it to support my data passing approach. In the @app.route you can see that I am using a URL parameter to hold the Notebook name. This value is then used to “resolve” the Notebook from the filesystem. I’ve not put any error handling around the “Notebook not found” path, but this is an easy add-on. In addition, the route handler uses the current timestamp for each request to generate filenames for the updated Notebook and the results file. As mentioned above, this means the code can be executed by two callers at the same time without impact. You can also see how I load the Notebook results from the output file into a JSON variable and return this to the caller. Finally, I clean up the generated Notebook and output files to keep the filesystem tidy.
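As a sketch of that easy add-on, a hypothetical helper (the name resolve_notebook and the base directory default are my own) could map the URL parameter onto the filesystem and signal a missing Notebook:

```python
import os

def resolve_notebook(notebook, base_dir="notebooks"):
    # Hypothetical helper: map the URL parameter onto a Notebook file and
    # return None when it does not exist, so the route can answer with a 404
    # instead of letting Papermill fail on a missing file
    path = os.path.join(base_dir, notebook + ".ipynb")
    return path if os.path.exists(path) else None
```

The route would then return a 404 response whenever the helper yields None.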
With the code all together I created a couple of Notebooks to support testing. Both accepted the following JSON payload.
{
"firstName": "Mary",
"surname": "Jane",,
"DOB": "01/03/1968"
}
The first Notebook simply concatenates the first name and surname and returns the result. The second calculates the person’s current age and returns that.
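As an illustration, the core of the second Notebook could look something like this. This is a sketch: the function name and response shape are my own, and the DOB format is assumed to be day/month/year as in the payload above.

```python
from datetime import datetime

def calculate_age(payload):
    # Parse the DOB field (assumed day/month/year) and compute the age in
    # whole years, subtracting one if this year's birthday has not yet passed
    dob = datetime.strptime(payload["DOB"], "%d/%m/%Y")
    today = datetime.today()
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return {"age": age}
```

The returned dictionary would then be written to the file named in the output parameter, as per the data passing scheme.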
After running a few tests locally on my Mac I was confident that the code was working, so I moved on to look at the execution environment. The natural choice was OpenShift and I decided to target the following set-up:
To support this, the first step was to containerise my wrapper code. As my code could be serving a number of consumers, I also took the opportunity to wrap the Python Flask server layer in my code with the gunicorn WSGI HTTP server to make things more robust. I started by creating a requirements.txt file to ensure that all the necessary Python libraries were pulled in during my container build.
notebook==6.5.2
flask==2.2.2
papermill==2.4.0
gunicorn==20.1.0
Next I created the following Dockerfile:
FROM python:3
WORKDIR /usr/src/app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD [ "gunicorn" , "--bind" , "0.0.0.0:9090" , "wsgi:app" ]
Here you can see that I am building on top of the python Docker image and my code is running under gunicorn, which is bound to port 9090. Again I ran a few local tests and confirmed that the build was working as desired.
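A local test of the container build might look like the following. The image tag, port mapping and Notebook name (fullname) are illustrative assumptions, not my actual values.

```shell
# Hypothetical local smoke test of the container build
docker build -t naas-api .
docker run -d -p 9090:9090 naas-api

# Call the wrapper API, passing the test payload
curl -s -X POST http://localhost:9090/api/fullname \
  -H "Content-Type: application/json" \
  -d '{"firstName": "Mary", "surname": "Jane", "DOB": "01/03/1968"}'
```

A successful call returns the JSON written by the Notebook to its output file.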
Next I set up my OpenShift environment. I have a small Single Node Cluster install running on an old ThinkPad, so I used this as my target environment. My OpenShift cluster also has the Red Hat OpenShift Pipelines operator installed. I started by creating a Project / Namespace named naas (Notebooks as a Service). I then logged into my cluster via the oc command line interface and set my target to the naas project. The first object I created was a secret with my Git access token to allow my build process to pull from my Git repositories.
kind: Secret
apiVersion: v1
metadata:
name: git-access
annotations:
tekton.dev/git-0: https://<My git repo address>
data:
password: <My token base64 encoded>
username: <My username>
type: kubernetes.io/basic-auth
Once this was created I updated the pipeline service account with this secret to allow access to my Git repositories.
kind: ServiceAccount
apiVersion: v1
metadata:
name: pipeline
namespace: naas
secrets:
- name: pipeline-token-cgzd9
- name: pipeline-dockercfg-d99xp
- name: git-access
imagePullSecrets:
- name: pipeline-dockercfg-d99xp
Next I created a PVC for the pipeline workspace and the pipeline itself for the build process.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: build-workspace
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: 2tb-drive
volumeMode: Filesystem
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
name: naas-build
spec:
params:
- default: %APP_NAME%
name: APP_NAME
type: string
- default: 'https://<My Git Server>/tony-hickman/naas.git'
name: NAAS_GIT_REPO
type: string
- default: 'https://<My Git Server>/tony-hickman/notebooks.git'
name: NOTEBOOKS_GIT_REPO
type: string
- default: ''
name: NAAS_GIT_REVISION
type: string
- default: ''
name: NOTEBOOKS_GIT_REVISION
type: string
- default: 'image-registry.openshift-image-registry.svc:5000/%PROJECT_NAME%/naas-build-3'
name: IMAGE_NAME
type: string
- default: .
name: PATH_CONTEXT
type: string
tasks:
- name: fetch-naas-repository
params:
- name: url
value: $(params.NAAS_GIT_REPO)
- name: revision
value: $(params.NAAS_GIT_REVISION)
- name: subdirectory
value: ''
- name: deleteExisting
value: 'true'
taskRef:
kind: ClusterTask
name: git-clone
workspaces:
- name: output
workspace: workspace
- name: fetch-notebooks-repository
params:
- name: url
value: $(params.NOTEBOOKS_GIT_REPO)
- name: revision
value: $(params.NOTEBOOKS_GIT_REVISION)
- name: subdirectory
value: notebooks
- name: deleteExisting
value: 'false'
runAfter:
- fetch-naas-repository
taskRef:
kind: ClusterTask
name: git-clone
workspaces:
- name: output
workspace: workspace
- name: build
params:
- name: IMAGE
value: $(params.IMAGE_NAME)
- name: TLSVERIFY
value: 'false'
- name: CONTEXT
value: $(params.PATH_CONTEXT)
runAfter:
- fetch-notebooks-repository
taskRef:
kind: ClusterTask
name: buildah
workspaces:
- name: source
workspace: workspace
- name: deploy
params:
- name: SCRIPT
value: oc rollout status deploy/$(params.APP_NAME)
runAfter:
- build
taskRef:
kind: ClusterTask
name: openshift-client
workspaces:
- name: workspace
I then created the deployment, service and route for my application.
kind: Deployment
apiVersion: apps/v1
metadata:
name: naas-api
spec:
replicas: 0
selector:
matchLabels:
app: %APP_NAME%
template:
metadata:
creationTimestamp: null
labels:
app: %APP_NAME%
deploymentconfig: %APP_NAME%
spec:
containers:
- name: naas-api
image: >-
image-registry.openshift-image-registry.svc:5000/%PROJECT_NAME%/naas-build-3:latest
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: Always
restartPolicy: Always
terminationGracePeriodSeconds: 30
dnsPolicy: ClusterFirst
securityContext: {}
schedulerName: default-scheduler
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 25%
maxSurge: 25%
revisionHistoryLimit: 10
progressDeadlineSeconds: 600
kind: Service
apiVersion: v1
metadata:
name: naas
spec:
ports:
- protocol: TCP
port: 9090
targetPort: 9090
selector:
app: %APP_NAME%
kind: Route
apiVersion: route.openshift.io/v1
metadata:
name: naas-api
spec:
host: %APP_NAME%-%PROJECT_NAME%.<Domain of my cluster>
to:
kind: Service
name: naas
weight: 100
port:
targetPort: 9090
wildcardPolicy: None
You can see that I’ve parameterised these YAMLs so that I can use the same definitions to create multiple instances. To handle the installation process I created the following shell script.
echo "Checking environment variables are set..."
if [[ -z "${APP_NAME}" ]]; then
echo "Environment variable APP_NAME not set"
exit 1
else
if [[ -z "${PROJECT_NAME}" ]]; then
echo "Environment variable PROJECT_NAME not set"
exit 1
fi
fi
cd templates
echo "Update templates"
for file in *.yaml
do
sed -e s/%APP_NAME%/$APP_NAME/g -e s/%PROJECT_NAME%/$PROJECT_NAME/g $file > ../$file
done
cd ..
echo "Apply YAML to OpenShift"
oc apply -f project.yaml
sleep 5
oc project $PROJECT_NAME
oc apply -f git-access-secret.yaml
echo "Update Pipeline ServiceAccount"
oc get sa pipeline -o yaml > pipeline-sa.yaml
if grep -q "git-access" pipleline-sa.yaml
then
echo "Already has git-access added"
else
echo "Adding git-access"
# Link the secret to the service account via oc; appending a line to the
# dumped YAML would land under imagePullSecrets (the file's last list)
oc secrets link pipeline git-access
fi
oc apply -f pipeline.yaml
oc apply -f pvc.yaml
oc apply -f deployment.yaml
oc apply -f service.yaml
oc apply -f route.yaml
echo "Create Pipeline Run"
oc create -f pipeline-run.yaml
rm *.yaml
This script relies on the YAMLs being in a templates subdirectory relative to the script. This is how I set up the git repository which I created to manage these code assets. The script expects two environment variables to be set, APP_NAME and PROJECT_NAME, where APP_NAME is the name to be given to the application instance and PROJECT_NAME is the OpenShift project created for this work.
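The substitution step the script performs can be seen in isolation (a sketch using an inline example string rather than my real template files):

```shell
# Example of the sed substitution applied to each template file
APP_NAME=naas-api
PROJECT_NAME=naas
echo "app: %APP_NAME% in project %PROJECT_NAME%" \
  | sed -e "s/%APP_NAME%/$APP_NAME/g" -e "s/%PROJECT_NAME%/$PROJECT_NAME/g"
# prints: app: naas-api in project naas
```

Each %...% placeholder in the templates is replaced globally, so the same template set can be stamped out for any application and project name.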
Once everything is set up, the next step is to trigger the pipeline. This can be done from the OpenShift Console via the Pipelines menu as shown below. However, I decided to create a PipelineRun YAML to trigger the pipeline.
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
generateName: naas-build-
labels:
tekton.dev/pipeline: naas-build
spec:
params:
- name: APP_NAME
value: naas-api
- name: NAAS_GIT_REPO
value: 'https://<My git repo>/naas.git'
- name: NOTEBOOKS_GIT_REPO
value: 'https://<My git repo>/notebooks.git'
- name: NAAS_GIT_REVISION
value: ''
- name: NOTEBOOKS_GIT_REVISION
value: ''
- name: IMAGE_NAME
value: 'image-registry.openshift-image-registry.svc:5000/%PROJECT_NAME%/naas-build-3'
- name: PATH_CONTEXT
value: .
pipelineRef:
name: naas-build
serviceAccountName: pipeline
timeout: 1h0m0s
workspaces:
- name: workspace
persistentVolumeClaim:
claimName: build-workspace
Again this is parameterised, so running the install script creates a specific instance of this template. To start the run, oc create -f <pipeline-run yaml> is used because the name for the run is “generated” (via generateName).
With everything in place I could trigger the pipeline, build and deploy the code with the Notebooks pulled in from the Notebooks git repository, and have it all accessible via the defined route. So the basic approach is working, but going forward I want to look at the following:
- How to access data assets that are part of the Watson Studio project where the Notebooks are initially built. I have a couple of strategies for this but I need to do some more work.
- Applying Instana to support observability of the execution environment (looking to build on my work done around Instana enabling Notebooks in Watson Studio).
- Using API Connect to provide a way to present and access the APIs.