Notebook as a Service (NaaS)

Tony Hickman
Nov 29, 2022

Over the past few weeks I’ve been asked a couple of times about invoking a Jupyter Notebook created in IBM Watson Studio via an API. “Out of the box” I am not aware of a way to achieve this, but after a bit of thinking I could see that it should be possible to put a wrapper around the Notebook and have this present the API and invoke the Notebook when called.

With this concept in mind I started to search the internet for similar patterns and found “Executing Jupyter Notebooks on serverless GCP products”, which was a pretty good match. This introduced me to Papermill, which was the key missing piece I was looking for. Papermill is essentially a pre-processor for Notebooks which can be used to replace content in a Notebook before executing it. This gave me a way to pass in data from an API call ready to be processed by the Notebook at execution time. As the data will vary from call to call this is a key requirement. To make things as flexible as possible I decided to focus on using JSON payloads as input to and output from the Notebook (and wrapper code). This meant that I could design a single wrapper layer and require the Notebooks to comply with a data passing scheme. The scheme I decided on was as follows:

  1. JSON passed via the API call is written to a variable named input in a cell tagged with the tag parameters. The data is inserted as a JSON object (see the sketch after this list).
  2. The name of the file where the Notebook writes its results is passed in a variable named output, again in the cell tagged with the parameters tag. The reason for passing a filename is that the API could be called by multiple clients at the same time, so there is a need to separate the results.
  3. The results from the Notebook execution are written to the file specified in output as a JSON string.
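To make the contract concrete, a compliant Notebook’s cell tagged parameters might look like the minimal sketch below (the default values are only placeholders; Papermill injects the real values at execution time):

# Cell tagged "parameters" -- Papermill inserts a new cell after this one
# containing the values supplied via the API call, overriding these defaults
input = {}                       # JSON object passed in on the API call
output = "/tmp/output.json"      # file the Notebook writes its JSON results to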

So the high-level flow would look as follows:

High-level flow

When I looked at this more closely I realised that if I used the Notebook name to scope the API I could use a single wrapper to cover a collection of Notebooks. This is the approach I took and I created the following Python Wrapper code.

import datetime
import json
import logging
import os

import papermill as pm
from flask import Flask, request

app = Flask(__name__)

logging.getLogger().setLevel(logging.INFO)


@app.route('/api/<notebook>', methods=['POST'])
def add_message(notebook):
    logging.info("starting job")
    logging.info("Running Notebook " + notebook)

    content_type = request.headers.get('Content-Type')
    if content_type == 'application/json':
        logging.info("application/json content type")
        logging.info(request.data)

        content = json.loads(request.data)
        logging.info(content)
        logging.info(json.dumps(content))

        # Timestamp-based file names so concurrent callers do not clash
        dtNow = datetime.datetime.now().strftime("%f")
        outputName = "/tmp/output-run-" + dtNow + ".json"
        nbOutput = "/tmp/" + notebook + dtNow + ".ipynb"
        nbInput = "notebooks/" + notebook + '.ipynb'
        parameters = {
            'input': content,
            'output': outputName
        }

        # Execute the Notebook with Papermill, injecting the parameters
        pm.execute_notebook(
            nbInput,
            nbOutput,
            parameters=parameters
        )

        # Opening JSON file - returns the JSON object as a dictionary
        f = open(outputName)
        data = json.load(f)
        f.close()

        # Clean up files
        if os.path.exists(nbOutput):
            os.remove(nbOutput)
        else:
            print("The Notebook output file does not exist")

        if os.path.exists(outputName):
            os.remove(outputName)
        else:
            print("The output file does not exist")

        logging.info("job completed")

        return data

    # Fall through if the caller did not send JSON
    return {"error": "Content-Type must be application/json"}, 415


if __name__ == '__main__':
    port = os.environ.get('PORT', '8080')
    app.run(port=port)
    logging.info("All running")

This code is similar to the code in the referenced article but I’ve reworked it to support my data passing approach. In the @app.route you can see that I am using a URL parameter to hold the Notebook name. This value is then used to “resolve” the Notebook from the filesystem. I’ve not put any error handling around the “Notebook not found” path, but this is an easy addition. The @app.route code also uses the current timestamp for each request to generate filenames for the updated Notebook and the results file. As mentioned above, this means the code can be executed by two callers at the same time without impact. Finally, you can see how I load the Notebook results from the output file into a JSON variable and return this to the caller, before cleaning up the generated Notebook and output files to keep the filesystem tidy.
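For a quick local test the API can be exercised with curl. The example below assumes the wrapper is running locally on its default port of 8080 and that a Notebook named concat_name.ipynb (an illustrative name) exists in the notebooks directory:

curl -X POST http://localhost:8080/api/concat_name \
     -H "Content-Type: application/json" \
     -d '{"firstName": "Mary", "surname": "Jane", "DOB": "01/03/1968"}'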

With the wrapper code in place I created a couple of Notebooks to support testing. Both accepted the following JSON payload.

{
  "firstName": "Mary",
  "surname": "Jane",
  "DOB": "01/03/1968"
}

The first Notebook simply concatenates the first name and surname and returns the result. The second calculates the person’s current age and returns that.

Name Concatenation Notebook
Calculate Age Notebook
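As a rough illustration of the contract (this is a sketch, not the exact Notebook contents), the age-calculation Notebook boils down to something like:

import datetime
import json

# Cell tagged "parameters" -- the defaults below are placeholders that
# Papermill overrides with the values supplied via the API call
input = {"firstName": "Mary", "surname": "Jane", "DOB": "01/03/1968"}
output = "/tmp/output.json"

# Calculate the current age, assuming the DOB is supplied as DD/MM/YYYY
dob = datetime.datetime.strptime(input["DOB"], "%d/%m/%Y").date()
today = datetime.date.today()
age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

# Write the result to the file named in "output" as a JSON string
with open(output, "w") as f:
    json.dump({"age": age}, f)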

After running a few tests locally on my Mac I was confident that the code was working, so I moved on to look at the execution environment. The natural choice was to look at using OpenShift, and I decided to target the following setup:

OpenShift Deployment

To support this, the first step was to containerise my wrapper code. As my code could be serving a number of consumers I also took the opportunity to wrap the Flask server layer in my code with the gunicorn WSGI HTTP server to make things more robust. I started by creating a requirements.txt file to ensure that all the necessary Python libraries were pulled in during my container build.

notebook==6.5.2
flask==2.2.2
papermill==2.4.0
gunicorn==20.1.0

Next I created the following Dockerfile:

FROM python:3

WORKDIR /usr/src/app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD [ "gunicorn" , "--bind" , "0.0.0.0:9090" , "wsgi:app" ]

Here you can see that I am building on top of the python Docker image and my code is running under gunicorn which is bound to port 9090. Again I ran a few local tests and confirmed that the build was working as desired.
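The gunicorn command expects a wsgi module that exposes the Flask app object. Assuming the wrapper code above is saved as app.py (the actual filename is up to you), a minimal wsgi.py looks like this:

# wsgi.py -- entry point for gunicorn ("wsgi:app")
# Assumes the wrapper code is saved as app.py alongside this file
from app import app

if __name__ == "__main__":
    app.run()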

Next I set up my OpenShift environment. I have a small Single Node Cluster install running on an old ThinkPad so I used this as my target environment. My OpenShift cluster also has the Red Hat OpenShift Pipelines operator installed. I started by creating a Project / Namespace named naas (Notebooks as a Service), then logged into my cluster via the oc command line interface and set my target to the naas project.
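Setting that up amounts to something like the following (the token and API endpoint are placeholders for your own cluster’s values):

oc login --token=<my token> --server=https://<my cluster API endpoint>:6443
oc new-project naas    # creates the project and switches the current context to it

With the project in place, the first resource I created was a secret holding my Git access token, so that the build process could pull from my Git repositories.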

kind: Secret
apiVersion: v1
metadata:
  name: git-access
  annotations:
    tekton.dev/git-0: https://<My git repo address>
data:
  password: <My token base64 encoded>
  username: <My username>
type: kubernetes.io/basic-auth
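Note that both values under data must be base64 encoded, which can be done with something like:

echo -n '<my username>' | base64
echo -n '<my git access token>' | base64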

Once this was created I updated the pipeline service account with this secret to allow access to my Git repositories.

kind: ServiceAccount
apiVersion: v1
metadata:
  name: pipeline
  namespace: naas
secrets:
  - name: pipeline-token-cgzd9
  - name: pipeline-dockercfg-d99xp
  - name: git-access
imagePullSecrets:
  - name: pipeline-dockercfg-d99xp

Next I created a PVC for the pipeline workspace and the pipeline itself for the build process.

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: build-workspace
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: 2tb-drive
  volumeMode: Filesystem

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: naas-build
spec:
  params:
    - default: %APP_NAME%
      name: APP_NAME
      type: string
    - default: 'https://<My Git Server>/tony-hickman/naas.git'
      name: NAAS_GIT_REPO
      type: string
    - default: 'https://<My Git Server>/tony-hickman/notebooks.git'
      name: NOTEBOOKS_GIT_REPO
      type: string
    - default: ''
      name: NAAS_GIT_REVISION
      type: string
    - default: ''
      name: NOTEBOOKS_GIT_REVISION
      type: string
    - default: 'image-registry.openshift-image-registry.svc:5000/%PROJECT_NAME%/naas-build-3'
      name: IMAGE_NAME
      type: string
    - default: .
      name: PATH_CONTEXT
      type: string
  tasks:
    - name: fetch-naas-repository
      params:
        - name: url
          value: $(params.NAAS_GIT_REPO)
        - name: revision
          value: $(params.NAAS_GIT_REVISION)
        - name: subdirectory
          value: ''
        - name: deleteExisting
          value: 'true'
      taskRef:
        kind: ClusterTask
        name: git-clone
      workspaces:
        - name: output
          workspace: workspace
    - name: fetch-notebooks-repository
      params:
        - name: url
          value: $(params.NOTEBOOKS_GIT_REPO)
        - name: revision
          value: $(params.NOTEBOOKS_GIT_REVISION)
        - name: subdirectory
          value: notebooks
        - name: deleteExisting
          value: 'false'
      runAfter:
        - fetch-naas-repository
      taskRef:
        kind: ClusterTask
        name: git-clone
      workspaces:
        - name: output
          workspace: workspace
    - name: build
      params:
        - name: IMAGE
          value: $(params.IMAGE_NAME)
        - name: TLSVERIFY
          value: 'false'
        - name: CONTEXT
          value: $(params.PATH_CONTEXT)
      runAfter:
        - fetch-notebooks-repository
      taskRef:
        kind: ClusterTask
        name: buildah
      workspaces:
        - name: source
          workspace: workspace
    - name: deploy
      params:
        - name: SCRIPT
          value: oc rollout status deploy/$(params.APP_NAME)
      runAfter:
        - build
      taskRef:
        kind: ClusterTask
        name: openshift-client
  workspaces:
    - name: workspace
I then created the deployment, service and route for my application

kind: Deployment
apiVersion: apps/v1
metadata:
  name: naas-api
spec:
  replicas: 0
  selector:
    matchLabels:
      app: %APP_NAME%
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: %APP_NAME%
        deploymentconfig: %APP_NAME%
    spec:
      containers:
        - name: naas-api
          image: >-
            image-registry.openshift-image-registry.svc:5000/%PROJECT_NAME%/naas-build-3:latest
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600

kind: Service
apiVersion: v1
metadata:
  name: naas
spec:
  ports:
    - protocol: TCP
      port: 9090
      targetPort: 9090
  selector:
    app: %APP_NAME%

kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: naas-api
spec:
  host: %APP_NAME%-%PROJECT_NAME%.<Domain of my cluster>
  to:
    kind: Service
    name: naas
    weight: 100
  port:
    targetPort: 9090
  wildcardPolicy: None

You can see that I’ve parameterised these YAMLs so that I can use the same definitions to create multiple instances. To handle the installation process I created the following shell script.

echo "Checking environment variables are set..."

if [[ -z "${APP_NAME}" ]]; then
echo "Environment variable APP_NAME not set"
exit 1
else
if [[ -z "${PROJECT_NAME}" ]]; then
echo "Environment variable PROJECT_NAME not set"
exit 1
sed -e s/%APP_NAME%/$APP_NAME/g -e s/%PROJECT_NAME%/$PROJECT_NAME/g templates/$file > $file
fi
fi

cd templates
echo "Update templates"
for file in *.yaml
do
sed -e s/%APP_NAME%/$APP_NAME/g -e s/%PROJECT_NAME%/$PROJECT_NAME/g $file > ../$file
done

cd ..

echo "Apply YAML to OpenShift"
oc apply -f project.yaml
sleep 5
oc project $PROJECT_NAME
oc apply -f git-access-secret.yaml

echo "Update Pipeline ServiceAccount"
oc get sa pipeline -o yaml > pipeline-sa.yaml
if grep -q "git-access" pipleline-sa.yaml
then
echo "Already has git-access added"
else
echo "Adding git-access"
echo "- name: git-access" >> pipeline-sa.yaml
oc apply -f pipeline-sa.yaml
fi

oc apply -f pipeline.yaml
oc apply -f pvc.yaml
oc apply -f deployment.yaml
oc apply -f service.yaml
oc apply -f route.yaml

echo "Create Pipeline Run"
oc create -f pipeline-run.yaml

rm *.yaml

This script relies on the YAMLs being in a templates subdirectory relative to where the script is. This is how I set up the git repository which I created to manage these code assets. The script expects two environment variables to be set, APP_NAME and PROJECT_NAME, where APP_NAME is the name to be given to the application instance and PROJECT_NAME is the OpenShift project created for this work.
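A typical invocation, assuming the script is saved as install.sh in the root of that repository (the filename is my choice here), looks like this:

export APP_NAME=naas-api
export PROJECT_NAME=naas
./install.sh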

Once everything is set up the next step is to trigger the pipeline to run. This can be done from the OpenShift Console via the Pipelines menu, as shown below.

Pipeline Screen
Start the process
Process parameters

However, I decided to create a PipelineRun YAML to trigger the pipeline.

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: naas-build-
  labels:
    tekton.dev/pipeline: naas-build
spec:
  params:
    - name: APP_NAME
      value: naas-api
    - name: NAAS_GIT_REPO
      value: 'https://<My git repo>/naas.git'
    - name: NOTEBOOKS_GIT_REPO
      value: 'https://<My git repo>/notebooks.git'
    - name: NAAS_GIT_REVISION
      value: ''
    - name: NOTEBOOKS_GIT_REVISION
      value: ''
    - name: IMAGE_NAME
      value: 'image-registry.openshift-image-registry.svc:5000/%PROJECT_NAME%/naas-build-3'
    - name: PATH_CONTEXT
      value: .
  pipelineRef:
    name: naas-build
  serviceAccountName: pipeline
  timeout: 1h0m0s
  workspaces:
    - name: workspace
      persistentVolumeClaim:
        claimName: build-workspace

Again this is parameterised, so running the install script creates a specific instance of this template. To start the run, oc create -f <pipeline-run yaml> is used (rather than oc apply) because the name for the run is “generated”.
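Creating a run by hand and checking on its progress looks something like this (the Tekton tkn CLI works equally well if you have it installed):

oc create -f pipeline-run.yaml
oc get pipelineruns    # shows the generated run name and its current status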

With everything in place I could trigger the pipeline, build and deploy the code with the Notebooks pulled in from the Notebooks git repository and have it accessible via the defined route. So the basic approach is working but going forward I want to look at the following:

  1. How to access data assets that are part of a Watson Studio project where the Notebooks are initially built. I have a couple of strategies for this but I need to do some more work
  2. Applying Instana to support Observability of the executing environment (looking to build on my work done around Instana enabling Notebooks in Watson Studio)
  3. Using API Connect to provide a way to present and access the API’s
