Generation and publication of ML models

Introduction

Starting from the Diabetes dataset, you will generate a model that predicts a quantitative measure of disease progression one year after baseline. You are going to use:

  • File Repository on MinIO to save the original data set. You will upload the file using the Create Entity in Historical Database module.

  • Notebooks to build a parametric process that gets the data from MinIO, trains and generates the model, and logs everything in MLFlow.

  • Model Manager (MLFlow) to record all the notebook experiments and save the model and other training files.

  • Serverless module to create a scalable Python function that, using the model, can predict the progression of the disease.

Dataset

The diabetes dataset is described as follows.


Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of the 442 diabetic patients, as well as the response of interest, a quantitative measure of disease progression one year after the baseline.

Dataset characteristics:

  • Number of instances: 442.

  • Number of attributes: The first 10 columns are numerical predictive values.

  • Target: Column 11 is a quantitative measure of disease progression one year after the baseline.

Attribute information:

  • age     age in years

  • sex

  • bmi     body mass index

  • bp      average blood pressure

  • s1      tc, total serum cholesterol

  • s2      ldl, low-density lipoproteins

  • s3      hdl, high-density lipoproteins

  • s4      tch, total cholesterol / HDL

  • s5      ltg, possibly log of serum triglycerides level

  • s6      glu, blood sugar level

Note: Each of these 10 feature variables has been mean-centered and scaled by the standard deviation times the square root of n_samples (i.e., the sum of squares in each column adds up to 1).
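
Since scikit-learn ships this same preprocessed dataset, you can verify the scaling yourself with a quick check (a minimal sketch, assuming scikit-learn and NumPy are available):

import numpy as np
from sklearn.datasets import load_diabetes

X = load_diabetes().data           # 442 x 10, already mean-centered and scaled
print(X.shape)                     # (442, 10)
print(np.sum(X ** 2, axis=0))      # the sum of squares of each column is ~1.0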

Source URL:

https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:

Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.

(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

Step 1: Load data to the MinIO platform

You are going to obtain the data file from this source: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt

You are going to create an "Entity in Historical Database" from this file so you must go to this option:

Fill in the main information:

And click “Continue”. Next, you need to set all the columns of the file to the string format (the CSV file needs to be loaded with this column type).

Finally, click the “Create” button and your new entity will be created:

You can also query this entity in SQL through Presto with the Query Tool.

Step 2: Create notebook to obtain data, train and record the experiment

First of all, you are going to create a new notebook. Go to the Analytics Tools option and click on the new notebook button (+) then write a name for it.

You can also import this file containing the complete notebook for this example (you just need to set the token parameter).

The notebook has a few explanatory paragraphs about the dataset, but let's go to the code section.

The first paragraph that you are going to focus on is the import one.

This paragraph loads the required libraries and sets the base URL for the MinIO repository. The next one is the parameters paragraph, which establishes variables that can be set from outside.
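
As a rough sketch, the import paragraph could look like this (the exact library list and the MinIO base URL are assumptions; adapt them to your installation):

%python
import io
import requests
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical base URL for the MinIO file API; take it from your Control Panel
baseurl = "https://<your-platform-host>/controlpanel/api/objectstorage"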

To obtain the file path you can go to the My Files section:

Then go to MinIO:


And in the next page you can get the file path:

The token will be an X-OP-APIKey token that can access the file.
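
The parameters paragraph could then be as simple as this (the z.input calls assume the notebook engine exposes Zeppelin dynamic forms; plain assignments work just as well):

%python
# Hypothetical parameter names; filepath comes from My Files, token is an X-OP-APIKey
filepath = z.input("filepath", "<path-to-diabetes-file>")
token = z.input("token", "<your-X-OP-APIKey>")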

Next, in three paragraphs, you will load the CSV file itself from the filepath of the previous section, read it as CSV with the dataset columns (you need to pass the column names to the read_csv function), and show the loaded content:
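
Those three paragraphs might look roughly like this (the download call and header name are assumptions based on the token described above; the source file is tab-separated):

%python
# 1) Download the raw file from MinIO using the API key token
resp = requests.get(baseurl + filepath, headers={"X-OP-APIKey": token})

# 2) Read it as CSV, passing the dataset columns explicitly
columns = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6", "y"]
dataset = pd.read_csv(io.StringIO(resp.text), sep="\t", names=columns, skiprows=1)

# 3) Show the loaded content
dataset.head()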

Now that you have your file as a pandas dataframe, you can split the data into training and test sets:

Also, split these data sets into X and Y data sets for the input parameters and expected results:
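
Both splits could look like this (the 75/25 ratio and the random seed are assumptions):

%python
# Split into training and test sets
train, test = train_test_split(dataset, test_size=0.25, random_state=42)

# Separate the input parameters (X) from the expected result (Y)
train_x = train.drop(["y"], axis=1)
test_x = test.drop(["y"], axis=1)
train_y = train[["y"]]
test_y = test[["y"]]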

And run ElasticNet training with this data and get the output model in lr:

Finally, evaluate some metric for the prediction result.
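
A sketch of these two paragraphs (alpha and l1_ratio are the hyperparameters logged to MLFlow in the next step; their values here are assumptions):

%python
# Train ElasticNet and keep the fitted model in lr
alpha = 0.05
l1_ratio = 0.05
lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
lr.fit(train_x, train_y)

# Evaluate the prediction on the test set
predicted = lr.predict(test_x)
rmse = np.sqrt(mean_squared_error(test_y, predicted))
mae = mean_absolute_error(test_y, predicted)
r2 = r2_score(test_y, predicted)
print("RMSE: %s  MAE: %s  R2: %s" % (rmse, mae, r2))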

Step 3: Register training and model data in MLFlow

The Notebook Engine is integrated with the MLFlow tracking service, so all you have to do in the notebook is import the necessary "MLFlow" library and use the MLFlow tracking functions. This will be done in the import libraries section.

The connection parameters and environment variables are already done, so now you can register the parameters in MLFlow directly like this:

%python
with mlflow.start_run():
    mlflow.set_tag("mlflow.runName", "DiabetesModelGenerator")
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(lr, "model")
    mlflow.end_run()

This is the standard code for tracking an experiment in MLFlow. Include everything inside the “with mlflow.start_run():” block to start a new experiment run.

The other functions are:

  • mlflow.set_tag("mlflow.runName", ...) → (optional) sets a run name for the experiment. If you do not use it, the run will only have an autogenerated ID.

  • mlflow.log_param(...) → logs an input parameter for the experiment.

  • mlflow.log_metric(...) → logs an output metric for the experiment.

  • mlflow.sklearn.log_model(lr, "model") → logs and saves the trained model with all necessary metadata files.

If you execute this paragraph, you will get an output like this, showing that the registration process ended well:

If you go to the Model Manager user interface in the Control Panel:

You can see the execution of the experiment with all the logging parameters, metrics and files:

Clicking on the experiment opens the detail page:

And, at the bottom of the page, you can review all the files for this experiment and the model itself:

The run id on the right (runs:/859953c3a0dd4596bd15d864e91081ab/model) is important because you are going to use it to publish the model in the next step. This is the reference you need to retrieve the model from MLFlow and run some evaluations with it.
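
For example, that reference is enough to load the model back and evaluate it from any notebook paragraph (a sketch; it assumes the tracking URI is already configured, as in this notebook):

%python
model = mlflow.pyfunc.load_model("runs:/859953c3a0dd4596bd15d864e91081ab/model")
print(model.predict(test_x))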

You can also register the model in order to tag it, version it and make it available outside the experiment. You can do this with code, or with the Register button on the right side:
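
If you choose the code route, mlflow.register_model does it (the model name "DiabetesRegression" is a hypothetical example):

%python
result = mlflow.register_model(
    "runs:/859953c3a0dd4596bd15d864e91081ab/model",  # run reference from above
    "DiabetesRegression"                             # hypothetical registered model name
)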

And if you go to the model tab, you can see it and work with it.

Step 4: Create a Serverless function in Python that evaluates the data against the MLFlow model

With the model generated above, you are going to create a Python function that, given a single input or multiple inputs, obtains a prediction using the model.

The first step is to go to the Serverless Applications menu.

Next you are going to create (with the + button) a new application and you are going to fill in all the necessary fields:

You can create a new repository or use an existing one. In any case, you will have a new application like this one:

Then you can go to the "View" button and then to the functions tab:

The next step is to create or use an existing serverless function. Click on "Create Function" and you are going to create three files.

First of all, select the main branch on the right side:

Then you will create (here or in the Git repository with an external editor) the three files:

requirements.txt → libraries that your model needs to run. In this case you will have these:

fdk
protobuf==3.20.*
numpy==1.23.4
mlflow==1.19.0
mlflow-onesaitplatform-plugin==0.2.11
scikit-learn

func.yaml → the project metadata needed for the serverless function. The content will be:

schema_version: 20180708
name: diabetes-predictor
version: 0.1.1
runtime: python
build_image: fnproject/python:3.9-dev
run_image: fnproject/python:3.9
entrypoint: /python/bin/fdk /function/func.py handler
memory: 256
triggers:
  - name: endpoint
    type: http
    source: /diabetes-predictor

It is important to set the triggers.source config, which defines the endpoint for this function, together with the name and the runtime.

func.py → the content of the evaluation function itself. You have to load the libraries needed to evaluate the model: MLFlow, plus fdk for the endpoint.

You will also use environment variables for the host, the experiment and the token.

import io
import json
import logging
import os

os.environ["HOME"] = "/tmp"  # the function container needs a writable HOME

import random
import mlflow
from fdk import response

# Connection settings come from the function's environment variables
host = os.environ['HOST']
token = os.environ['TOKEN']
experimentid = os.environ['EXPERIMENTID']

tracking_uri = "https://" + host + "/controlpanel/modelsmanager"
model_uri = "onesait-platform://" + token + "@" + host + "/0/" + experimentid + "/artifacts/model"

global pyfunc_predictor

# Load the model once, at function start-up
mlflow.set_tracking_uri(tracking_uri)
pyfunc_predictor = mlflow.pyfunc.load_model(model_uri=model_uri)
logging.getLogger().info("Diabetes Progression Predictor ready")


def handler(ctx, data: io.BytesIO = None):
    try:
        logging.getLogger().info("Try")
        answer = []
        json_obj = json.loads(data.getvalue())
        logging.getLogger().info("json_obj")
        logging.getLogger().info(str(json_obj))
        if isinstance(json_obj, list):
            logging.getLogger().info("isinstance")
            answer = []
            values = []
            inputvector = []
            # Build one input vector per object in the JSON array
            for input in json_obj:
                logging.getLogger().info("for")
                logging.getLogger().info("input: " + str(input))
                inputvector = [input['age'], input['sex'], input['bmi'],
                               input['bp'], input['s1'], input['s2'],
                               input['s3'], input['s4'], input['s5'],
                               input['s6']]
                values.append(inputvector)
            predict = pyfunc_predictor.predict(values)
            answer = predict.tolist()
            logging.getLogger().info("prediction")
        else:
            answer = "input object is not an array of objects:" + str(json_obj)
            logging.getLogger().error(
                'error isinstance(json_obj, list): ' + str(isinstance(json_obj, list)))
            raise Exception(answer)
    except (Exception, ValueError) as ex:
        logging.getLogger().error('error parsing json payload: ' + str(ex))

    logging.getLogger().info("Inside Python ML function")
    return response.Response(
        ctx,
        response_data=json.dumps(answer),
        headers={"Content-Type": "application/json"}
    )

You can save everything and deploy your function with the Rocket button:

The last step is to set the environment variables used by the function (HOST, TOKEN and EXPERIMENTID, as read in func.py above) with the button:

Step 5: Model evaluation

Now you can test the model through its REST API with Postman, for example by sending a JSON array as input:
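
The same test can be done from Python (the endpoint URL is an assumption built from the triggers.source value above; the field values are the first record of the raw dataset):

import requests

# Hypothetical endpoint: host and routing prefix depend on your deployment
url = "https://<serverless-host>/diabetes-predictor"

payload = [{
    "age": 59, "sex": 2, "bmi": 32.1, "bp": 101,
    "s1": 157, "s2": 93.2, "s3": 38, "s4": 4,
    "s5": 4.8598, "s6": 87
}]

resp = requests.post(url, json=payload)
print(resp.json())  # a list with one predicted progression value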

Or you can create a model evaluator in the Dashboard Engine that uses this endpoint with some provided input:

Or you can evaluate this model in a batch or streaming data flow in the DataFlow with the corresponding evaluator component.