ML model generation and publication

Introduction

Starting from the Diabetes dataset, we're going to generate a model that predicts a quantitative measure of disease progression one year after baseline. We're going to use:

  • The MinIO file system to store the original dataset. We'll load the file using the Create Entity in Historical Database option

  • The Notebooks module to build a parametric process that gets the data from MinIO, trains and generates the model, and logs everything in MLFlow

  • The Models Manager (MLFlow) to log every experiment run from the notebook and to save the model and the other training files

  • The Serverless module to create a scalable Python REST function that uses the model to predict disease progression

Dataset

The diabetes dataset is described as follows:

----------------

Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

Data Set Characteristics:

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:

  • age     age in years

  • sex

  • bmi     body mass index

  • bp      average blood pressure

  • s1      tc, total serum cholesterol

  • s2      ldl, low-density lipoproteins

  • s3      hdl, high-density lipoproteins

  • s4      tch, total cholesterol / HDL

  • s5      ltg, possibly log of serum triglycerides level

  • s6      glu, blood sugar level

Note: Each of these 10 feature variables has been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).
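This normalization can be reproduced with NumPy. A minimal sketch on a synthetic column (not the real data): dividing each mean-centered column by its standard deviation times √n_samples is what makes the column's sum of squares equal 1.

```python
import numpy as np

# Synthetic stand-in for one feature column of the 442-patient dataset
rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=442)

# Mean-center, then scale so the column's sum of squares is 1
x_scaled = (x - x.mean()) / (x.std() * np.sqrt(len(x)))

print(round(float((x_scaled ** 2).sum()), 6))  # → 1.0
```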

Source URL:

https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:

Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.

(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

Step 1: Load data into MinIO platform

From the above link, we're going to download the file from this source: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt

We’re going to create an “Entity in Historical Database” from this file so we’ll go to this option:

We'll fill in the main information

And click on Continue. Then we need to set all the columns of the file to the string format (CSV files need to be loaded with this column type).

Finally, we click on the Create button and our new entity will be created:

We can also query this entity through the Presto engine with the query tool:

Step 2: Create a notebook to get the data, train the model, and log the experiment

First of all, we create a new notebook: we go to the Analytics Tools option, click on the new notebook (+) button, and type a name for it.

Alternatively, we can import this file, which contains the full notebook for this example (we only need to set the token parameter).

The notebook has some explanatory paragraphs about the dataset, but we're going to focus on the code section. The first paragraph we'll look at is the import one.

We load several libraries and set the base URL for the MinIO repository. The next paragraph is the parameter paragraph, which sets variables that can be injected from outside.
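As a sketch, the parameter paragraph could look like this. Every value here is a hypothetical placeholder; the real base URL, token, and filepath come from your platform instance (the next steps show where to find them):

```python
# Parameter paragraph: placeholder values, normally injected from
# outside when the notebook is run parametrically
base_url = "https://<platform-host>/controlpanel/api/objectstorage"  # assumed MinIO API base URL
token = "<X-OP-APIKey-token>"                # API key with read access to the file
filepath = "<bucket/path/diabetes.tab.txt>"  # path copied from the Files > MinIO page
```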

To get the filepath, we can go to the Files section

Then to MinIO

 

And on the next page we can get the filepath

The token will be an X-OP-APIKey token with access to the file.

The next three paragraphs load the file itself using the token and filepath from the previous section, read it as CSV with the dataset's columns (we need to pass the column names to the read_csv function), and show the loaded content.
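A sketch of that loading step. To keep it self-contained, `raw_text` is inlined with the first two rows of the tab-separated file; in the notebook it would be the body of the HTTP GET against the MinIO filepath, authenticated with the token. `target` is our name for column 11:

```python
import io

import pandas as pd

# In the notebook, raw_text would come from an HTTP GET on the MinIO
# filepath using the X-OP-APIKey token; here we inline the first two
# data rows of diabetes.tab.txt so the sketch is runnable
raw_text = (
    "AGE\tSEX\tBMI\tBP\tS1\tS2\tS3\tS4\tS5\tS6\tY\n"
    "59\t2\t32.1\t101\t157\t93.2\t38\t4\t4.8598\t87\t151\n"
    "48\t1\t21.6\t87\t183\t103.2\t70\t3\t3.8918\t69\t75\n"
)

# The column names must be passed explicitly to read_csv
columns = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6", "target"]
data = pd.read_csv(io.StringIO(raw_text), sep="\t", names=columns, skiprows=1)
print(data.shape)  # → (2, 11)
```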

Now that we have our file as a pandas dataframe, we can split the data into train and test sets.

We also split these datasets into X and Y sets, for the input features and the expected output.
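Both splits can be sketched like this, assuming the dataframe has a `target` column for the expected output (toy values stand in for the loaded data so the snippet runs on its own):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the loaded diabetes dataframe
data = pd.DataFrame({
    "bmi": [32.1, 21.6, 30.5, 25.3, 23.0, 22.6],
    "bp": [101, 87, 93, 84, 101, 89],
    "target": [151, 75, 141, 206, 135, 97],
})

# First split: train vs. test rows
train, test = train_test_split(data, test_size=0.25, random_state=42)

# Second split: X = input features, y = expected output
train_x = train.drop(columns=["target"])
train_y = train[["target"]]
test_x = test.drop(columns=["target"])
test_y = test[["target"]]

print(len(train_x), len(test_x))  # → 4 2
```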

Then we run the ElasticNet training with this data and get the output model in lr.

Finally, we evaluate some metrics on the prediction output.

Step 3: Log training and model data in MLFlow

The Notebooks module is integrated with the MLFlow tracking server, so the only thing we need to do in the notebook is import the necessary MLFlow library and use the MLFlow tracking functions. That is done in the import libs section.

The connection params and environment variables are already set, so now we can log our params in MLFlow directly like this:

%python
with mlflow.start_run():
    mlflow.set_tag("mlflow.runName", "DiabetesModelGenerator")
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(lr, "model")
    mlflow.end_run()

This is standard code for tracking an experiment in MLFlow. We include everything inside "with mlflow.start_run():" to start a new experiment run. The other functions are:

  • mlflow.set_tag("mlflow.runName", ...) → (optional) sets a run name for the experiment. Without it, we'd only have the autogenerated run ID

  • mlflow.log_param(...) → logs an input parameter of the experiment

  • mlflow.log_metric(...) → logs an output metric of the experiment

  • mlflow.sklearn.log_model(lr, "model") → logs and saves the trained model with all the necessary metadata files

If we execute this paragraph, we'll get an output like this, showing that the logging process finished correctly.

If we go to the Models Manager UI in controlpanel:

We can see the experiment execution with all the logged params, metrics, and files:

Clicking on the experiment opens the detail page:

And, at the end of the page, we can review all the files for this experiment and the model itself

The run id on the right side (runs:/859953c3a0dd4596bd15d864e91081ab/model) is important because we're going to use it to publish the model in the next step. This is the reference we need to pick up the model from MLFlow and run evaluations with it.

We can also register the model in order to label it, version it, and have it available outside of the experiment. We can do this with code, or with the Register button on the right side:
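Registering by code is a one-liner with the MLflow API. A sketch, assuming the run id from the experiment above and `DiabetesModel` as a hypothetical registry name (it requires the tracking server connection the notebook already has):

```python
import mlflow

# The same "runs:/<run_id>/model" URI shown on the experiment detail page
model_uri = "runs:/859953c3a0dd4596bd15d864e91081ab/model"

# Creates the registered model "DiabetesModel", or a new version of it
# if the name already exists in the registry
result = mlflow.register_model(model_uri, "DiabetesModel")
```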

 

And if we go to the Models tab, we can see it and work with it.

Step 4: Create a Serverless Python function that evaluates data against the MLFlow model

With the previously generated model, we're going to create and deploy a Python function that, given a single or multiple inputs, returns a prediction using the model.

The first step is to go to the Serverless Applications menu

Then we're going to create a new application (with the + button) and fill in all the necessary inputs

We can create a new repository or use an existing one. In any case, we’re going to have a new application like this

Then we can click on the "View It" button and go to the Functions tab.

The next step is to create (or reuse) a serverless function: we click on "Create Function" and we're going to create three files.

First of all, we select the main branch on the right side:

Then we're going to create (here, or externally in the Git repository) the three files:

requirements.txt → the libraries our model needs in order to run. In this case we're going to have these:

fdk
protobuf==3.20.*
numpy==1.23.4
mlflow==1.19.0
mlflow-onesaitplatform-plugin==0.2.11
scikit-learn

func.yaml → the project metadata needed by the serverless function. The content will be:

schema_version: 20180708
name: diabetes-predictor
version: 0.1.1
runtime: python
build_image: fnproject/python:3.9-dev
run_image: fnproject/python:3.9
entrypoint: /python/bin/fdk /function/func.py handler
memory: 256
triggers:
- name: endpoint
  type: http
  source: /diabetes-predictor

The triggers.source config is important because it defines the endpoint for this function, along with the name and runtime.

func.py → the evaluation function itself. We need to load the libraries to evaluate the model, MLFlow itself, and fdk for the endpoint. We also use environment variables as parametric inputs for the host, experiment, and token.

import io
import json
import logging
import os
os.environ["HOME"] = "/tmp"
import random
import mlflow
from fdk import response

host = os.environ['HOST']
token = os.environ['TOKEN']
experimentid = os.environ['EXPERIMENTID']

tracking_uri = "https://" + host + "/controlpanel/modelsmanager"
model_uri = "onesait-platform://" + token + "@" + host + "/0/" + experimentid + "/artifacts/model"

global pyfunc_predictor
mlflow.set_tracking_uri(tracking_uri)
pyfunc_predictor = mlflow.pyfunc.load_model(model_uri=model_uri)
logging.getLogger().info("Diabetes Progression Predictor ready")

def handler(ctx, data: io.BytesIO = None):
    try:
        logging.getLogger().info("Try")
        answer = []
        json_obj = json.loads(data.getvalue())
        logging.getLogger().info("json_obj")
        logging.getLogger().info(str(json_obj))
        if isinstance(json_obj, list):
            logging.getLogger().info("isinstance")
            answer = []
            values = []
            inputvector = []
            for input in json_obj:
                logging.getLogger().info("for")
                logging.getLogger().info("input: " + str(input))
                inputvector = [
                    input['age'], input['sex'], input['bmi'], input['bp'],
                    input['s1'], input['s2'], input['s3'], input['s4'],
                    input['s5'], input['s6']]
                values.append(inputvector)
            predict = pyfunc_predictor.predict(values)
            answer = predict.tolist()
            logging.getLogger().info("prediction")
        else:
            answer = "input object is not an array of objects:" + str(json_obj)
            logging.getLogger().error('error isinstance(json_obj, list): ' + str(isinstance(json_obj, list)))
            raise Exception(answer)
    except (Exception, ValueError) as ex:
        logging.getLogger().error('error parsing json payload: ' + str(ex))
    logging.getLogger().info("Inside Python ML function")
    return response.Response(
        ctx, response_data=json.dumps(answer),
        headers={"Content-Type": "application/json"}
    )

We can save everything and deploy our function with the rocket button:

The final step is to set the environment variables for the model with this button:

Now we can test our model through the REST API, for example with Postman, sending a JSON array with the input:
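A minimal Python sketch of the request body. The host is a placeholder, the `/diabetes-predictor` path comes from the triggers.source in func.yaml, and the feature values are the first row of the dataset file:

```python
import json

# Hypothetical deployed endpoint: <serverless-host> is a placeholder
url = "https://<serverless-host>/diabetes-predictor"

# The function expects a JSON array of objects with the ten feature keys
payload = [
    {"age": 59, "sex": 2, "bmi": 32.1, "bp": 101,
     "s1": 157, "s2": 93.2, "s3": 38, "s4": 4,
     "s5": 4.8598, "s6": 87},
]
body = json.dumps(payload)
print(len(payload[0]))  # → 10

# To call the deployed function (requires network access and the real host):
# import requests
# preds = requests.post(url, data=body, headers={"Content-Type": "application/json"}).json()
```

The function returns one prediction per object in the array, as shown in the handler code above.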

Or we can create a dashboard model evaluator that uses this endpoint with some provided input.

Or we can evaluate this model in a streaming or batch Dataflow with the corresponding evaluator component.