Notebook as a Service (NaaS)
Over the past few weeks I’ve been asked a couple of times about invoking a Jupyter Notebook created in IBM Watson Studio via an API. Out of the box I am not aware of a way to achieve this, but after a bit of thinking I could see that it should be possible to put a wrapper around the Notebook and have this present the API and invoke the Notebook when called.
With this concept in mind I started to search the internet for similar patterns and I found “Executing Jupyter Notebooks on serverless GCP products”, which was a pretty good match. This introduced me to Papermill, which was the key missing part I was looking for. Papermill is really a pre-processor for Notebooks which can be used to replace content in a Notebook before executing it. This provided me with a way to pass in data from an API call ready to be processed by the Notebook at execution time. As the data will vary from call to call this is a key requirement. To make things as flexible as possible I decided to focus on using JSON payloads as input to and output from the Notebook (and wrapper code). This meant that I could design a single wrapper layer and require the Notebooks to comply with a data passing scheme. The scheme I decided on was as follows:
- JSON passed via the API call to be written to a variable named input in a cell tagged with the tag parameters. The data will be inserted as a JSON object.
- The name of the file where the Notebook writes its results to be passed in a variable named output, again in the cell tagged with the parameters tag. The reason for passing a filename is that the API could be called by multiple clients at the same time, so there is a need to separate the results.
- The results from the Notebook execution will be written to the file specified in output as a JSON string.
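To make the scheme concrete, a Notebook that follows it might look like this. This is a minimal sketch: only the input and output variable names are part of the scheme, while the result shape and the example file path are illustrative.

```python
import json

# Cell tagged "parameters" -- Papermill overwrites these defaults
# with the values supplied by the wrapper at execution time
input = {"firstName": "Mary", "surname": "Jane"}   # JSON payload from the API call
output = "/tmp/output-example.json"                # file to write the results to

# Final cell: write the Notebook's results to the agreed file as a JSON string
result = {"fullName": input["firstName"] + " " + input["surname"]}
with open(output, "w") as f:
    json.dump(result, f)
```

Because the wrapper only reads whatever JSON string lands in the output file, each Notebook is free to return any shape of result it likes.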
So the high-level flow would look as follows:
When I looked at this more closely I realised that if I used the Notebook name to scope the API I could use a single wrapper to cover a collection of Notebooks. This is the approach I took, and I created the following Python wrapper code.
import datetime
import json
import logging
import os

import papermill as pm
from flask import Flask, request

app = Flask(__name__)
logging.getLogger().setLevel(logging.INFO)

@app.route('/api/<notebook>', methods=['POST'])
def add_message(notebook):
    logging.info("starting job")
    logging.info("Running Notebook " + notebook)
    content_type = request.headers.get('Content-Type')
    if content_type == 'application/json':
        logging.info("application/json content type")
        logging.info(request.data)
        content = json.loads(request.data)
        logging.info(content)
    else:
        return {"error": "Content-Type must be application/json"}, 415
    # Use the current timestamp to build per-request filenames so that
    # concurrent callers do not clash
    dtNow = datetime.datetime.now().strftime("%f")
    outputName = "/tmp/output-run-" + dtNow + ".json"
    nbOutput = "/tmp/" + notebook + dtNow + ".ipynb"
    nbInput = "notebooks/" + notebook + '.ipynb'
    parameters = {
        'input': content,
        'output': outputName
    }
    # Execute the Notebook with the injected parameters
    pm.execute_notebook(
        nbInput,
        nbOutput,
        parameters=parameters
    )
    # Load the Notebook's results file as a dictionary
    with open(outputName) as f:
        data = json.load(f)
    # Clean up the generated files
    if os.path.exists(nbOutput):
        os.remove(nbOutput)
    else:
        logging.warning("The Notebook output file does not exist")
    if os.path.exists(outputName):
        os.remove(outputName)
    else:
        logging.warning("The output file does not exist")
    logging.info("job completed")
    return data

if __name__ == '__main__':
    port = int(os.environ.get('PORT', '8080'))
    app.run(port=port)
    logging.info("All running")
This code is similar to the code in the referenced article but I’ve reworked it to support my data passing approach. In the @app.route you can see that I am using a URL parameter to hold the Notebook name. This value is then used to “resolve” the Notebook from the filesystem. I’ve not put any error handling around the “Notebook not found” path, but this is an easy add-on. In addition, the route handler uses the current timestamp for each request to generate filenames for the updated Notebook and the results file. As mentioned above, this means the code can be executed by two callers at the same time without impact. You can also see how I load the Notebook results from the output file into a JSON variable and return this to the caller. Finally, I clean up the generated Notebook and output files to keep the filesystem tidy.
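As a sketch of that easy add-on, a hypothetical helper (the name resolve_notebook and the base directory default are my own) could map the URL parameter onto the filesystem and signal a missing Notebook:

```python
import os

def resolve_notebook(notebook, base_dir="notebooks"):
    # Hypothetical helper: map the URL parameter onto a Notebook file and
    # return None when it does not exist, so the route can answer with a 404
    # instead of letting Papermill fail on a missing file
    path = os.path.join(base_dir, notebook + ".ipynb")
    return path if os.path.exists(path) else None
```

The route would then return a 404 response whenever the helper yields None.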
With the code all together I created a couple of Notebooks to support testing. Both accepted the following JSON payload.
{
"firstName": "Mary",
"surname": "Jane",,
"DOB": "01/03/1968"
}
The first Notebook simply concatenates the first name and surname and returns the result. The second calculates the person’s current age and returns that.
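As an illustration, the core of the second Notebook could look something like this. This is a sketch: the function name and response shape are my own, and the DOB format is assumed to be day/month/year as in the payload above.

```python
from datetime import datetime

def calculate_age(payload):
    # Parse the DOB field (assumed day/month/year) and compute the age in
    # whole years, subtracting one if this year's birthday has not yet passed
    dob = datetime.strptime(payload["DOB"], "%d/%m/%Y")
    today = datetime.today()
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return {"age": age}
```

The returned dictionary would then be written to the file named in the output parameter, as per the data passing scheme.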
After running a few tests locally on my Mac I was confident that the code was working, so I moved on to look at the execution environment. The natural choice was OpenShift and I decided to target the following set-up:
To support this, the first step was to containerise my wrapper code. As my code could be serving a number of consumers, I also took the opportunity to wrap the Python Flask server layer in my code with the gunicorn WSGI HTTP server to make things more robust. I started by creating a requirements.txt file to ensure that all the necessary Python libraries were pulled in during my container build.
notebook==6.5.2
flask==2.2.2
papermill==2.4.0
gunicorn==20.1.0
Next I created the following Dockerfile:
FROM python:3
WORKDIR /usr/src/app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD [ "gunicorn" , "--bind" , "0.0.0.0:9090" , "wsgi:app" ]
Here you can see that I am building on top of the python Docker image and my code is running under gunicorn, which is bound to port 9090. Again I ran a few local tests and confirmed that the build was working as desired.
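A local test of the container build might look like the following. The image tag, port mapping and Notebook name (fullname) are illustrative assumptions, not my actual values.

```shell
# Hypothetical local smoke test of the container build
docker build -t naas-api .
docker run -d -p 9090:9090 naas-api

# Call the wrapper API, passing the test payload
curl -s -X POST http://localhost:9090/api/fullname \
  -H "Content-Type: application/json" \
  -d '{"firstName": "Mary", "surname": "Jane", "DOB": "01/03/1968"}'
```

A successful call returns the JSON written by the Notebook to its output file.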
Next I set up my OpenShift environment. I have a small Single Node Cluster install running on an old ThinkPad, so I used this as my target environment. My OpenShift cluster also has the Red Hat OpenShift Pipelines operator installed. I started by creating a Project / Namespace named naas (Notebooks as a Service). I then logged into my cluster via the oc command line interface and set my target to the naas project. The first object I created was a secret with my Git access token to allow my build process to pull from my Git repositories.
kind: Secret
apiVersion: v1
metadata:
name: git-access
annotations:
tekton.dev/git-0: https://<My git repo address>
data:
password: <My token base64 encoded>
username: <My username>
type: kubernetes.io/basic-auth
Once this was created I updated the pipeline service account with this secret to allow access to my Git repositories.
kind: ServiceAccount
apiVersion: v1
metadata:
name: pipeline
namespace: naas
secrets:
- name: pipeline-token-cgzd9
- name: pipeline-dockercfg-d99xp
- name: git-access
imagePullSecrets:
- name: pipeline-dockercfg-d99xp
Next I created a PVC for the pipeline workspace and the pipeline itself for the build process.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: build-workspace
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 5Gi
storageClassName: 2tb-drive
volumeMode: Filesystem
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
name: naas-build
spec:
params:
- default: %APP_NAME%
name: APP_NAME
type: string
- default: 'https://<My Git Server>/tony-hickman/naas.git'
name: NAAS_GIT_REPO
type: string
- default: 'https://<My Git Server>/tony-hickman/notebooks.git'
name: NOTEBOOKS_GIT_REPO
type: string
- default: ''
name: NAAS_GIT_REVISION
type: string
- default: ''
name: NOTEBOOKS_GIT_REVISION
type: string
- default: 'image-registry.openshift-image-registry.svc:5000/%PROJECT_NAME%/naas-build-3'
name: IMAGE_NAME
type: string
- default: .
name: PATH_CONTEXT
type: string
tasks:
- name: fetch-naas-repository
params:
- name: url
value: $(params.NAAS_GIT_REPO)
- name: revision
value: $(params.NAAS_GIT_REVISION)
- name: subdirectory
value: ''
- name: deleteExisting
value: 'true'
taskRef:
kind: ClusterTask
name: git-clone
workspaces:
- name: output
workspace: workspace
- name: fetch-notebooks-repository
params:
- name: url
value: $(params.NOTEBOOKS_GIT_REPO)
- name: revision
value: $(params.NOTEBOOKS_GIT_REVISION)
- name: subdirectory
value: notebooks
- name: deleteExisting
value: 'false'
runAfter:
- fetch-naas-repository
taskRef:
kind: ClusterTask
name: git-clone
workspaces:
- name: output
workspace: workspace
- name: build
params:
- name: IMAGE
value: $(params.IMAGE_NAME)
- name: TLSVERIFY
value: 'false'
- name: CONTEXT
value: $(params.PATH_CONTEXT)
runAfter:
- fetch-notebooks-repository
taskRef:
kind: ClusterTask
name: buildah
workspaces:
- name: source
workspace: workspace
- name: deploy
params:
- name: SCRIPT
value: oc rollout status deploy/$(params.APP_NAME)
runAfter:
- build
taskRef:
kind: ClusterTask
name: openshift-client
workspaces:
- name: workspace
I then created the deployment, service and route for my application.
kind: Deployment
apiVersion: apps/v1
metadata:
name: naas-api
spec:
replicas: 0
selector:
matchLabels:
app: %APP_NAME%
template:
metadata:
creationTimestamp: null
labels:
app: %APP_NAME%
deploymentconfig: %APP_NAME%
spec:
containers:
- name: naas-api
image: >-
image-registry.openshift-image-registry.svc:5000/%PROJECT_NAME%/naas-build-3:latest
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
imagePullPolicy: Always
restartPolicy: Always
terminationGracePeriodSeconds: 30
dnsPolicy: ClusterFirst
securityContext: {}
schedulerName: default-scheduler
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 25%
maxSurge: 25%
revisionHistoryLimit: 10
progressDeadlineSeconds: 600
kind: Service
apiVersion: v1
metadata:
name: naas
spec:
ports:
- protocol: TCP
port: 9090
targetPort: 9090
selector:
app: %APP_NAME%
kind: Route
apiVersion: route.openshift.io/v1
metadata:
name: naas-api
spec:
host: %APP_NAME%-%PROJECT_NAME%.<Domain of my cluster>
to:
kind: Service
name: naas
weight: 100
port:
targetPort: 9090
wildcardPolicy: None
You can see that I’ve parameterised these YAMLs so that I can use the same definitions to create multiple instances. To handle the installation process I created the following shell script.
echo "Checking environment variables are set..."
if [[ -z "${APP_NAME}" ]]; then
echo "Environment variable APP_NAME not set"
exit 1
else
if [[ -z "${PROJECT_NAME}" ]]; then
echo "Environment variable PROJECT_NAME not set"
exit 1
fi
fi
cd templates
echo "Update templates"
for file in *.yaml
do
sed -e s/%APP_NAME%/$APP_NAME/g -e s/%PROJECT_NAME%/$PROJECT_NAME/g $file > ../$file
done
cd ..
echo "Apply YAML to OpenShift"
oc apply -f project.yaml
sleep 5
oc project $PROJECT_NAME
oc apply -f git-access-secret.yaml
echo "Update Pipeline ServiceAccount"
oc get sa pipeline -o yaml > pipeline-sa.yaml
if grep -q "git-access" pipleline-sa.yaml
then
echo "Already has git-access added"
else
echo "Adding git-access"
# Link the secret to the service account via oc; appending a line to the
# dumped YAML would land under imagePullSecrets (the file's last list)
oc secrets link pipeline git-access
fi
oc apply -f pipeline.yaml
oc apply -f pvc.yaml
oc apply -f deployment.yaml
oc apply -f service.yaml
oc apply -f route.yaml
echo "Create Pipeline Run"
oc create -f pipeline-run.yaml
rm *.yaml
This script relies on the YAMLs being in a templates subdirectory relative to the script. This is how I set up the git repository which I created to manage these code assets. The script expects two environment variables to be set, APP_NAME and PROJECT_NAME, where APP_NAME is the name to be given to the application instance and PROJECT_NAME is the OpenShift project created for this work.
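The substitution step the script performs can be seen in isolation (a sketch using an inline example string rather than my real template files):

```shell
# Example of the sed substitution applied to each template file
APP_NAME=naas-api
PROJECT_NAME=naas
echo "app: %APP_NAME% in project %PROJECT_NAME%" \
  | sed -e "s/%APP_NAME%/$APP_NAME/g" -e "s/%PROJECT_NAME%/$PROJECT_NAME/g"
# prints: app: naas-api in project naas
```

Each %...% placeholder in the templates is replaced globally, so the same template set can be stamped out for any application and project name.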
Once everything is set up, the next step is to trigger the pipeline. This can be done from the OpenShift Console via the Pipelines menu as shown below. However, I decided to create a PipelineRun YAML to trigger the pipeline.
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
generateName: naas-build-
labels:
tekton.dev/pipeline: naas-build
spec:
params:
- name: APP_NAME
value: naas-api
- name: NAAS_GIT_REPO
value: 'https://<My git repo>/naas.git'
- name: NOTEBOOKS_GIT_REPO
value: 'https://<My git repo>/notebooks.git'
- name: NAAS_GIT_REVISION
value: ''
- name: NOTEBOOKS_GIT_REVISION
value: ''
- name: IMAGE_NAME
value: 'image-registry.openshift-image-registry.svc:5000/%PROJECT_NAME%/naas-build-3'
- name: PATH_CONTEXT
value: .
pipelineRef:
name: naas-build
serviceAccountName: pipeline
timeout: 1h0m0s
workspaces:
- name: workspace
persistentVolumeClaim:
claimName: build-workspace
Again this is parameterised, so running the install script creates a specific instance of this template. To start the run, oc create -f <pipeline-run yaml> is used because the name for the run is “generated” (via generateName).
With everything in place I could trigger the pipeline, build and deploy the code with the Notebooks pulled in from the Notebooks git repository, and have it all accessible via the defined route. So the basic approach is working, but going forward I want to look at the following:
- How to access data assets that are part of the Watson Studio project where the Notebooks are initially built. I have a couple of strategies for this but I need to do some more work.
- Applying Instana to support observability of the execution environment (looking to build on my work done around Instana enabling Notebooks in Watson Studio).
- Using API Connect to provide a way to present and access the APIs.