Monitoring in Notebooks Engine

Available from Release 5.3.0 (Ultimate) of the Platform

Goal

This feature lets you monitor the individual and overall resource consumption of notebooks. With it you can check the status of each notebook, see its running processes and control their status, among other things.

Operation of the Notebooks

To understand how Monitoring works, it helps to know some Notebooks concepts.

Notebook execution modes

The platform notebooks (based on Apache Zeppelin) run on interpreters with different configurations, so a single notebook can use interpreters in different modes.

Interpreters in notebooks can run in three modes:

  • Shared: The interpreter process is shared by all notebooks, so this interpreter cannot run in parallel across several notebooks. A single manager serves the interpreter. Because the interpreter is not associated with a notebook and can jump from one to another, the resource metrics alone do not show which notebook was running; to get that detail, the resources entity has to be crossed with the executions entity.

  • Per notebook:

    • Scoped: The interpreter process is common to all notebooks, so the manager handles executions from several notebooks.

    • Isolated: The interpreter process is separate for each notebook, so the manager handles only one notebook. The interpreter is therefore associated with a notebook, and the interpreter name tells you at all times which notebook it belongs to. For paragraph-level detail, the resources entity has to be crossed with the executions entity.

In addition, there are execution modes in k8s in which the execution of each notebook is delegated to a pod. The manager runs inside that pod to control the various types of executions.

Based on this, the manager (the RemoteInterpreterServer process) is responsible for reporting metrics and execution information to the platform, regardless of where it runs.

Available metrics

Two complementary metrics have been created:

Resource metrics

These metrics are stored in a TimeSeries Entity (notebooks_metrics_resources). For each interpreter they report its processes, the interpreter type (shared, scoped, isolated), whether it is associated with a notebook, and its CPU and RAM consumption.

They are reported periodically. The interval is configurable at the pod level of the notebook module and defaults to 10 seconds.

For “shared” interpreters, the executions entity has to be queried to know which notebook used the interpreter.

image-20240109-220816.png

Execution metrics

These metrics (notebooks_metrics_executions) give the execution details of paragraphs, broken down by user, notebook, paragraph, interpreter, and so on.

This monitoring acts as a “history” of executions. It is stored in its own entity and can be deactivated if it is not needed.

Crossing this monitoring with the resource metrics gives the real consumption per paragraph.

image-20240109-221100.png
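The crossing of both metrics can be sketched in Python as follows. This is only an illustration under stated assumptions: the record fields (`interpreter`, `timestamp`, `cpu_percent`, `ram_mb`, `start`, `end`, and so on) are hypothetical names, not the actual schema of the two entities, and the records are assumed to have already been read from the platform.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One periodic resource report (field names are hypothetical)."""
    interpreter: str
    timestamp: float  # epoch seconds
    cpu_percent: float
    ram_mb: float


@dataclass
class Execution:
    """One paragraph execution (field names are hypothetical)."""
    interpreter: str
    notebook: str
    paragraph: str
    start: float  # epoch seconds
    end: float    # epoch seconds


def consumption_per_paragraph(samples, executions):
    """Average the resource samples that fall inside each execution window."""
    matched = defaultdict(list)
    for execution in executions:
        for sample in samples:
            same_interpreter = sample.interpreter == execution.interpreter
            in_window = execution.start <= sample.timestamp <= execution.end
            if same_interpreter and in_window:
                matched[(execution.notebook, execution.paragraph)].append(sample)

    result = {}
    for key, group in matched.items():
        result[key] = {
            "avg_cpu_percent": sum(s.cpu_percent for s in group) / len(group),
            "avg_ram_mb": sum(s.ram_mb for s in group) / len(group),
            "samples": len(group),
        }
    return result


# Made-up data: a shared interpreter reporting every 10 seconds.
samples = [
    ResourceSample("python-shared", 0, 10.0, 100.0),
    ResourceSample("python-shared", 10, 30.0, 200.0),
    ResourceSample("python-shared", 20, 50.0, 300.0),
    ResourceSample("python-shared", 30, 5.0, 100.0),
]
executions = [
    Execution("python-shared", "nb1", "p1", 0, 10),
    Execution("python-shared", "nb2", "p7", 15, 25),
]
usage = consumption_per_paragraph(samples, executions)
```

With this data, paragraph p1 of nb1 averages 20% CPU over two samples and paragraph p7 of nb2 gets the single sample taken at second 20. An execution shorter than the reporting interval may match no sample at all, in which case it does not appear in the result.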

Metrics Report

There are two methods for reporting metrics:

  • Push report from the interpreter → these environment variables (included in zeppelin-env.sh) configure access, via a digital client, to the two entities into which the metrics above are inserted.

#### Monitor reporter zeppelin onesait platform ####
export ZEPPELIN_INTERPRETER_MONITORREPORTER_ENABLE=true
export ZEPPELIN_INTERPRETER_MONITORREPORTER_DIGITALCLIENT_HOST=https://development.onesaitplatform.com/iot-broker
export ZEPPELIN_INTERPRETER_MONITORREPORTER_DIGITALCLIENT_NAME=notebook_metrics_client
export ZEPPELIN_INTERPRETER_MONITORREPORTER_DIGITALCLIENT_INSTANCE=notebook_metrics_client_interpreter
export ZEPPELIN_INTERPRETER_MONITORREPORTER_DIGITALCLIENT_TOKEN=XXXXXXX
export ZEPPELIN_INTERPRETER_MONITORREPORTER_ENTITY_RESOURCES=notebook_metrics_resources
export ZEPPELIN_INTERPRETER_MONITORREPORTER_ENTITY_EXECUTIONS=notebook_metrics_executions

  • Report from Zeppelin's REST API → a new actuator-type API reports the consumption of all interpreters (resource metrics). The execution metrics cannot be obtained this way, as they depend on the timing of each execution.
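For the push report, a small check of the environment variables listed above can catch a missing setting before the interpreter starts. The sketch below is hypothetical helper code, not part of Zeppelin or the platform. It assumes that the reporter is active only when the ENABLE variable is `true` and that every other variable is required in that case; neither point is confirmed by the source.

```python
import os

ENABLE_VAR = "ZEPPELIN_INTERPRETER_MONITORREPORTER_ENABLE"

# Assumption: all of these must be set when the reporter is enabled.
REQUIRED_VARS = [
    "ZEPPELIN_INTERPRETER_MONITORREPORTER_DIGITALCLIENT_HOST",
    "ZEPPELIN_INTERPRETER_MONITORREPORTER_DIGITALCLIENT_NAME",
    "ZEPPELIN_INTERPRETER_MONITORREPORTER_DIGITALCLIENT_INSTANCE",
    "ZEPPELIN_INTERPRETER_MONITORREPORTER_DIGITALCLIENT_TOKEN",
    "ZEPPELIN_INTERPRETER_MONITORREPORTER_ENTITY_RESOURCES",
    "ZEPPELIN_INTERPRETER_MONITORREPORTER_ENTITY_EXECUTIONS",
]


def missing_reporter_settings(env=None):
    """Return the required variables that are unset or empty.

    Returns an empty list when the reporter is not enabled, because
    there is nothing to validate in that case.
    """
    env = os.environ if env is None else env
    if env.get(ENABLE_VAR, "").strip().lower() != "true":
        return []
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_reporter_settings()` with no argument inspects the current process environment; passing a dictionary makes the check easy to try out without touching real settings.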

 

There are several endpoints:

/api/interpreter/metrics/all → get the resources of all Zeppelin interpreters as well as their status and consumption

/api/interpreter/metrics/running → get the resources of all interpreters started from Zeppelin as well as their status and consumption

/api/interpreter/metrics/notebook/{notebookId} → get all the resources of the Zeppelin interpreters for the given notebook as well as their status and consumption

/api/interpreter/metrics/running/notebook/{notebookId} → get all the resources of the interpreters started from Zeppelin for the given notebook as well as their status and consumption

/api/interpreter/metrics/interpreter/{interpreterId} → get all interpreter resources by id (python, spark, onesaitplatform, ...) as well as their status and consumption
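As an illustration of how these endpoints could be called from a script, the following Python sketch builds the URL for each case and fetches it. Several points are assumptions rather than documented behaviour: the base URL (`http://localhost:8080` is a placeholder), the use of plain GET requests, a JSON response body, and the absence of authentication, which a real deployment will most likely require.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

METRICS_ROOT = "/api/interpreter/metrics"


def metrics_url(base_url, scope="all", notebook_id=None, interpreter_id=None):
    """Build the URL of one of the metrics endpoints listed above.

    scope is "all" or "running"; notebook_id narrows the result to one
    notebook; interpreter_id selects the per-interpreter endpoint.
    """
    root = base_url.rstrip("/") + METRICS_ROOT
    if interpreter_id is not None:
        if notebook_id is not None:
            raise ValueError("use either notebook_id or interpreter_id, not both")
        return f"{root}/interpreter/{quote(interpreter_id, safe='')}"
    if scope not in ("all", "running"):
        raise ValueError(f"unknown scope: {scope!r}")
    if notebook_id is None:
        return f"{root}/{scope}"
    notebook = quote(notebook_id, safe="")
    if scope == "all":
        return f"{root}/notebook/{notebook}"
    return f"{root}/running/notebook/{notebook}"


def fetch_metrics(url, timeout=10.0):
    """GET the URL and parse the body as JSON (both are assumptions)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_metrics(metrics_url("http://localhost:8080", scope="running"))` would query the running interpreters of a Zeppelin instance reachable at that placeholder address.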

Next steps

  • Controls in the platform (notebook UI) → use the elements above in the notebook UI to see which interpreters are active, stop them easily, and so on.

  • A simple dashboard for visualizing the metrics

  • Limit the RAM and CPU used by notebook processes