Observing Watson Studio Notebooks with Instana

Tony Hickman
Sep 6, 2022 · 7 min read

As I get more and more involved with Instana, I keep discovering new areas where I would like to apply it. One such area is Jupyter Notebooks running in Watson Studio. This interests me because I’ve built a number of solutions that use scheduled Notebooks to form a data pipeline, and being able to “observe” the execution of these Notebooks would be really useful.

Given this challenge I started working with one of my colleagues (Joe Vullo) to figure out the “art of the possible”. We started by looking at the SaaS instance of Watson Studio available in IBM Cloud. After a lot of very cunning hacking and prodding, Joe did manage to get a Notebook functioning such that it could report events. However, this required the Instana agent to be installed in the Notebook runtime and was generally very messy to manage. (Joe is planning to revisit this to see if the approach can be improved, but for now we parked it.) We therefore switched our focus to IBM CloudPak for Data running on the IBM Cloud ROKS service. This approach gives us far greater access to the wider environment and, in particular, lets us install Instana into the cluster running CloudPak for Data, which is where the Notebooks are hosted and executed.

So step one was to get a CloudPak for Data instance stood up and then install Instana. This is all very simple and not something I will cover in this post, as the Instana documentation is very good.

With the environment set up, our next challenge was how to get Instana up and running in a Notebook and connected to an agent. It is important to note that, to connect to an agent, the Notebook needs to talk to the agent deployed on the worker node that hosts the Notebook runtime. This was going to be a tricky one to crack, so we focused our attention here first.

After some research (which mainly involved observing the configuration of the underlying OpenShift environment) we could see that the following was happening.

  1. At Notebook startup an environment is created (if the required environment is not already available)
  2. When a Notebook is created, a deployment, replicaset, pod and a few other configuration items are dynamically created in the underlying OpenShift environment
  3. The created pod hosts the required environment and runs it as a “conda” managed Python environment

Normally, to propagate the worker node IP address into a running pod, the OpenShift downward API field status.hostIP is used. Because the runtime environment is created dynamically, we couldn’t find a way to achieve this, so we had to look for an alternative approach. With some further exploration of a running Notebook, and comparison with the underlying OpenShift configuration, we spotted that the HOSTNAME environment variable correlated with the pod name in OpenShift. Furthermore, a status query on the pod reports the hostIP. With this knowledge it became clear that we could call the OpenShift REST API to get the status of the running pod (whose name is contained in the HOSTNAME environment variable) and extract the hostIP value from the response.
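Conceptually, the lookup boils down to a single authenticated GET against the pod status endpoint. Here is a minimal Python sketch of that call (the host, port and token values are placeholders, and certificate verification is skipped for brevity):

import os
import requests

API_HOST = "api.example.cluster"    # placeholder OpenShift API host
API_PORT = "6443"                   # placeholder API port
NAMESPACE = "zen-40"                # namespace hosting the Notebook pods
TOKEN = os.environ["ACCESS_TOKEN"]  # a token with "view" rights on NAMESPACE

pod_name = os.environ["HOSTNAME"]   # in the runtime this matches the pod name
url = f"https://{API_HOST}:{API_PORT}/api/v1/namespaces/{NAMESPACE}/pods/{pod_name}/status"
resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"}, verify=False)
print(resp.json()["status"]["hostIP"])  # the worker node IP hosting this pod

It was a simple task to wrap this call in a microservice and deploy it to the OpenShift cluster. The code we created was as follows: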

require("dotenv").config();

const express = require("express");
const https = require("https");

const ACCESS_TOKEN = process.env.ACCESS_TOKEN;
const API_HOST = process.env.API_HOST;
const API_PORT = process.env.API_PORT;
const NAMESPACE = process.env.NAMESPACE;
const port = process.env.PORT || 3000;

const app = express();

app.get("/health", (req, res) => {
console.log("hit GET /health");
res.status(200);
});

app.get("/hostIP/:hostname", (req, res) => {
console.log("hit GET /hostIP");
let host = req.params.hostname;
let path = "/api/v1/namespaces/" + NAMESPACE + "/pods/" + host + "/status";
let api_url = "https://" + API_HOST + ":" + API_PORT + path;

var options = {
"method": "GET",
"hostname": API_HOST,
"port": API_PORT,
"path": path,
headers: {
'Authorization': 'Bearer ' + ACCESS_TOKEN
}
}

var request = https.get(options, function(response) {
var chunks = [];
response.on("data", function (chunk) {
chunks.push(chunk);
});

response.on("error", function (error) {
console.log(error);
});

response.on("end", function() {
var body = Buffer.concat(chunks);
var data = JSON.parse(body.toString())
res.json({
hostIP : data.status.hostIP
});
});

});
});

app.listen(port, () => {
console.log(`Listening on port ${port}`);
});

The code required four environment variables (a sample .env file is shown after the list):

  1. ACCESS_TOKEN : The security token granting permission to invoke the OpenShift REST API. More details on this below
  2. API_HOST : The name of the host where the OpenShift API service is hosted
  3. API_PORT : The port that the OpenShift API service is listening on
  4. NAMESPACE : The namespace for CloudPak for Data, as this is where the dynamically created pod will exist
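Since the code loads its configuration via dotenv, a .env file for local testing might look like this (all values are illustrative placeholders; PORT is optional and defaults to 3000):

ACCESS_TOKEN=<service account token>
API_HOST=api.example.cluster
API_PORT=6443
NAMESPACE=zen-40
PORT=3000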

Focusing on the ACCESS_TOKEN… In order to call the OpenShift REST API an access token is needed, and the simplest way to get one is to create a ServiceAccount with the correct permissions. In our case we need to allow the ServiceAccount to “view” the namespace within which CloudPak for Data is deployed. To do this we used the following OpenShift CLI commands.

echo "Create Namespace"
oc new-project hostip
echo "Create Service Account"
oc create sa hostname
echo "Add policy NB: change this to match namespace where CloudPak for Data is installed"
oc policy add-role-to-group view system:serviceaccount:hostname -n zen-40

With a Namespace created and a ServiceAccount defined with the necessary policy we then created a deployment for the code.

kind: Deployment
apiVersion: apps/v1
metadata:
  name: hostip
  namespace: hostip
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hostip
  template:
    metadata:
      labels:
        app: hostip
    spec:
      containers:
        - name: hostip
          image: "<Container registry>/hostip:1"
          ports:
            - containerPort: 3000
              protocol: TCP
          env:
            - name: ACCESS_TOKEN
              valueFrom:
                secretKeyRef:
                  name: <name of service account secret>
                  key: token
            - name: API_HOST
              value: <API Host>
            - name: API_PORT
              value: "32318"
            - name: NAMESPACE
              value: zen-40

I’ve redacted some of the details from our deployment, but you can see where you need to make changes (the name of the service account’s token secret can be found with oc describe sa hostname -n hostip). Also, you may need to change the NAMESPACE value, as I am not sure zen-40 is a standard name.

For security reasons we decided not to create an externally facing route, and instead only allow access via a Service, which we created as follows.

kind: Service
apiVersion: v1
metadata:
  name: hostip
  namespace: hostip
spec:
  ports:
    - protocol: TCP
      port: 3000
      targetPort: 3000
  selector:
    app: hostip

With all this in place we could now invoke the service, passing the pod HOSTNAME, and receive back a JSON response containing the hostIP.
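For example, a GET to /hostIP/<pod name> returns a payload like this (the IP address is illustrative):

{
  "hostIP": "10.242.64.4"
}

The next step was to create a Notebook to test with. The following is the Notebook we ended up with.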

[Image: Instana testing Notebook]

When this Notebook is executed we see the following in Instana (NB: I’ve redacted some information about our Watson Assistant instance, but the full details are shown in Instana).

[Image: Instana Trace View]

When we ran the test we spotted that Instana was picking up two services to monitor. One was the explicit service we configured in the Notebook; the other was tied to the serviceName assigned to the pod in OpenShift.

Going back to the Notebook, let’s focus on the code in a bit more detail. The first thing we need to do is get the worker node IP address, which is where we call the microservice described earlier. First we get the HOSTNAME from the environment.

import os
hostname = os.environ['HOSTNAME']
print(hostname)

Then we make the REST call to the internal OpenShift service for the hostip microservice. The internal service is named using the convention <service name>.<namespace>.svc.cluster.local:<target port>.

import requests

# Call the hostip microservice via its cluster-internal service name
url = 'http://hostip.hostip.svc.cluster.local:3000/hostIP/' + hostname
req = requests.get(url)
data = req.json()
hostIP = data['hostIP']

We then set up the required Instana environment variables and install Instana. We use the derived worker node details, and we set INSTANA_SERVICE_NAME to a unique value for this Notebook.

%env AUTOWRAPT_BOOTSTRAP instana
%env INSTANA_AGENT_HOST {hostIP}
%env INSTANA_AGENT_PORT 42699
%env INSTANA_SERVICE_NAME instana-test
%env INSTANA_PROCESS_NAME studio-notebook
%env INSTANA_DEBUG true

!pip install instana

After Instana is installed we need to import it to initialise it. The initialisation runs asynchronously, so we use a sleep to give it time to complete. At this point we also bring in the opentracing libraries, which Instana uses to support tracing.

import instana
import time
import opentracing as ot
import opentracing.ext.tags as ext
time.sleep(10)

After this the Instana sensor was running and we could start to create traces. The first thing we did was to create a span to cover the execution of the whole Notebook.

nb_span = ot.tracer.start_active_span('Instana Test')

Next we made some calls to test the nesting of tracing. One was a call to Watson Assistant which included access keys, so I won’t share it here (it is masked in the screenshot above), but here is a sample call we used from the Instana docs.

def simple():
    with ot.tracer.start_active_span('asteroid') as pscope:
        pscope.span.set_tag(ext.COMPONENT, "Python simple example app")
        pscope.span.set_tag(ext.SPAN_KIND, ext.SPAN_KIND_RPC_SERVER)
        pscope.span.set_tag(ext.PEER_HOSTNAME, "localhost")
        pscope.span.set_tag(ext.HTTP_URL, "/python/simple/one")
        pscope.span.set_tag(ext.HTTP_METHOD, "GET")
        pscope.span.set_tag(ext.HTTP_STATUS_CODE, 200)
        pscope.span.set_tag("Pete's RequestId", "0xdeadbeef")
        pscope.span.set_tag("X-Peter-Header", "👀")
        pscope.span.set_tag("X-Job-Id", "1947282")
        time.sleep(.2)
        with ot.tracer.start_active_span('spacedust', child_of=pscope.span) as cscope:
            cscope.span.set_tag(ext.SPAN_KIND, ext.SPAN_KIND_RPC_CLIENT)
            cscope.span.set_tag(ext.PEER_HOSTNAME, "localhost")
            cscope.span.set_tag(ext.HTTP_URL, "/python/simple/two")
            cscope.span.set_tag(ext.HTTP_METHOD, "POST")
            cscope.span.set_tag(ext.HTTP_STATUS_CODE, 204)
            cscope.span.set_baggage_item("someBaggage", "someValue")
            time.sleep(.1)

simple()

Here we create a new span, asteroid, and nest a second span, spacedust, under it. The last thing we do in the Notebook is finish the Notebook-wide span.

nb_span.span.finish()

We ran tests with multiple Notebooks within the same environment and with multiple environments. Both sets of tests worked, and the correct information was reported to Instana.

Based on our work here we now have a working approach for enabling Instana tracing for Notebooks running in Watson Studio. We will continue to experiment, but so far this is looking really promising.
