ML model generation and publication
Introduction
Starting from the diabetes dataset, we're going to generate a model that predicts a quantitative measure of disease progression one year after baseline. We're going to use:
MinIO file system to store the original dataset. We'll load the file using the Create Entity in Historical Database option
Notebooks module to have a parametric process that gets the data from MinIO, trains and generates the model, and logs everything in MLFlow
Models Manager (MLFlow) to log every experiment run of the notebook and save the model and the other training files
Serverless module to create a scalable REST Python function that uses the model to predict the disease progression
Dataset
The diabetes dataset is described as follows.
----------------
Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
Data Set Characteristics:
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
age age in years
sex
bmi body mass index
bp average blood pressure
s1 tc, total serum cholesterol
s2 ldl, low-density lipoproteins
s3 hdl, high-density lipoproteins
s4 tch, total cholesterol / HDL
s5 ltg, possibly log of serum triglycerides level
s6 glu, blood sugar level
Note: Each of these 10 feature variables has been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).
Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
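As a quick sanity check of the normalization described in the note above, we can verify it against scikit-learn's bundled copy of the dataset (a minimal sketch, independent of the platform):
from sklearn.datasets import load_diabetes

# Standardized copy of the dataset shipped with scikit-learn
X, y = load_diabetes(return_X_y=True)

print(X.shape)               # (442, 10)
print((X ** 2).sum(axis=0))  # each column's sum of squares is ~1.0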
Step 1: Load data into MinIO platform
From the page linked above, we're going to get the file from this source: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt
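Before uploading it, we can take a quick look at the raw file; a minimal sketch, assuming the file is tab-separated with a single header row:
import pandas as pd

# Peek at the raw source file (tab-separated, one header row)
raw = pd.read_csv("https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", sep="\t")
print(raw.shape)          # expected: (442, 11)
print(list(raw.columns))  # AGE, SEX, BMI, BP, S1..S6, Y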
We're going to create an "Entity in Historical Database" from this file, so we'll go to this option:
We’ll fill the main information
And click Continue. Then we need to set all the columns of the file to the string format (CSV files need to be loaded with this column type).
Finally, we click on the Create button and our new entity will be created:
We can also query this entity through the Presto engine with the Query Tool:
Step 2: Create a notebook to get the data, train the model and log the experiment
First of all, we create a new notebook: we go to the Analytics Tools option, click the new notebook (+) button and type a name for it.
Alternatively, we can import this file, which contains the full notebook for this example (we only need to set the token parameter).
The notebook has some explanatory paragraphs about the dataset, but let's focus on the code section. The first paragraph of interest is the import one.
There we load the required libraries and set the base URL of the MinIO repository. The next paragraph is the parameter paragraph, which sets variables that can be provided from outside.
To get the filepath, we can go to the Files section,
then to MinIO,
and on the next page we can copy the filepath.
The token will be an X-OP-APIKey token with access to the file.
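As an orientation, the import and parameter paragraphs could look like the sketch below; the base URL, token and filepath are placeholders to adapt to your environment, and the alpha/l1_ratio values are just examples:
%python
import requests
import numpy as np
import pandas as pd
from io import StringIO
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import mlflow
import mlflow.sklearn

# Base URL of the MinIO repository (placeholder)
baseurl = "https://<platform-host>/<minio-api-path>"

%python
# Parameters that can be provided from outside the notebook
token = "<X-OP-APIKey with access to the file>"
filepath = "<filepath copied from the MinIO file page>"
alpha = 0.05      # example value
l1_ratio = 0.05   # example value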
The next three paragraphs load the CSV file itself using the token and filepath from the previous section, read it as CSV with the dataset's column names (we need to pass the columns to the read_csv function), and show the loaded content.
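A possible shape for those three paragraphs, assuming the file is fetched over HTTP with the X-OP-APIKey header and is tab-separated (adjust the request and the separator to your setup):
%python
# Download the raw file content from MinIO using the API key
r = requests.get(baseurl + filepath, headers={"X-OP-APIKey": token})

%python
# Read it as CSV, passing the dataset's column names explicitly
columns = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6", "target"]
data = pd.read_csv(StringIO(r.text), sep="\t", names=columns, skiprows=1)

%python
# Show the loaded content
data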
Now that we have our file as a pandas dataframe, we can split the data into train and test sets,
and also split these sets into X and Y for the input features and the expected output.
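For example (the 75/25 split ratio is a common default, not something the platform mandates):
%python
# Split rows into train and test sets
train, test = train_test_split(data, test_size=0.25, random_state=42)

# Separate the input features (X) from the expected output (Y)
train_x = train.drop("target", axis=1)
test_x = test.drop("target", axis=1)
train_y = train[["target"]]
test_y = test[["target"]]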
Then we run the ElasticNet training with this data, getting the output model in lr.
Finally, we evaluate some metrics on the prediction output.
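A sketch of these two paragraphs, computing the same rmse, r2 and mae values that will be logged in MLFlow in the next step:
%python
# Train an ElasticNet regressor with the alpha and l1_ratio parameters
lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
lr.fit(train_x, train_y)

%python
# Evaluate the trained model on the test set
predicted = lr.predict(test_x)
rmse = np.sqrt(mean_squared_error(test_y, predicted))
mae = mean_absolute_error(test_y, predicted)
r2 = r2_score(test_y, predicted)
print("RMSE: %s  MAE: %s  R2: %s" % (rmse, mae, r2))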
Step 3: Log training and model data in MLFlow
The Notebooks module is integrated with the MLFlow tracking server, so the only thing we need to do in the notebook is import the "MLFlow" lib and use the MLFlow tracking functions. That is done in the import libraries section.
The connection params and environment variables are already set up, so now we can log our params in MLFlow directly like this:
%python
with mlflow.start_run():
    mlflow.set_tag("mlflow.runName", "DiabetesModelGenerator")
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("r2", r2)
    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(lr, "model")
This is standard code for tracking an experiment in MLFlow. We include everything inside “with mlflow.start_run()“ to start a new experiment. The other functions are:
mlflow.set_tag("mlflow.runName", ...) → (optional) sets a run name for the experiment; if we don't use it, we will only have the autogenerated run id.
mlflow.log_param(...) → logs an input parameter of the experiment.
mlflow.log_metric(...) → logs an output metric of the experiment.
mlflow.sklearn.log_model(lr, "model") → logs and saves the trained model with all its necessary metadata files.
If we execute this paragraph, we'll get an output like the following, indicating that the logging process finished OK.
If we go to the Models Manager UI in controlpanel:
We can see the experiment execution with all the log params, metrics and the files:
Clicking on the experiment opens the detail page:
And, at the end of the page, we can review all the files for this experiment and the model itself
The run id on the right side (runs:/859953c3a0dd4596bd15d864e91081ab/model) is important because we're going to use it for publishing the model in the next step. This is the reference we need to pick up the model from MLFlow and run some evaluations with it.
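For example, the model can be loaded back through that reference with the standard MLFlow API and evaluated against the test set from before:
%python
# Load the logged model by its run reference and evaluate it
model = mlflow.pyfunc.load_model("runs:/859953c3a0dd4596bd15d864e91081ab/model")
print(model.predict(test_x))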
We can also register the model in order to label it, version it and have it outside the experiment. We can do it by code, as sketched below, or with the Register button on the right side:
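By code, registering it is a one-liner with the standard MLFlow API (the registry name DiabetesModel is just an example):
%python
# Register the logged model in the model registry under a chosen name
mlflow.register_model("runs:/859953c3a0dd4596bd15d864e91081ab/model", "DiabetesModel")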
And if we go to the Models tab, we can see the registered model and work with it.
Step 4: Create a serverless Python function that evaluates data against the MLFlow model
With the previously generated model, we're going to create and deploy a Python function that, given a single input or multiple inputs, returns a prediction using the model.
The first step is going to the Serverless Applications menu.
Then we're going to create (with the + button) a new application, filling in all the necessary inputs.
We can create a new repository or use an existing one. In any case, we're going to end up with a new application like this.
Then we can click the "View It" button and go to the Functions tab.
The next step is to create (or reuse) a serverless function: we click on "Create Function" and we're going to create three files.
First of all, we select the main branch on the right side:
Then we're going to create (here, or externally in the Git repository) the three files:
requirements.txt → the libraries our model needs in order to run. In this case:
fdk
protobuf==3.20.*
numpy==1.23.4
mlflow==1.19.0
mlflow-onesaitplatform-plugin==0.2.11
scikit-learn
func.yaml → the project metadata needed by the serverless function. The content will be:
schema_version: 20180708
name: diabetes-predictor
version: 0.1.1
runtime: python
build_image: fnproject/python:3.9-dev
run_image: fnproject/python:3.9
entrypoint: /python/bin/fdk /function/func.py handler
memory: 256
triggers:
- name: endpoint
  type: http
  source: /diabetes-predictor
The triggers.source config is important because it defines the endpoint of this function, as are the name and the runtime.
func.py → the evaluation function itself. We need to load the libraries to evaluate the model, MLFlow itself, and fdk for the endpoint. We also use environment variables for the parametric input of the host, experiment and token:
import io
import json
import logging
import os
os.environ["HOME"] = "/tmp"  # the function container needs a writable home directory (used by MLFlow)
import mlflow
from fdk import response

# Connection parameters provided as environment variables
host = os.environ['HOST']
token = os.environ['TOKEN']
experimentid = os.environ['EXPERIMENTID']

tracking_uri = "https://" + host + "/controlpanel/modelsmanager"
model_uri = "onesait-platform://" + token + "@" + host + "/0/" + experimentid + "/artifacts/model"

# Load the model once at startup so every invocation reuses it
mlflow.set_tracking_uri(tracking_uri)
pyfunc_predictor = mlflow.pyfunc.load_model(model_uri=model_uri)
logging.getLogger().info("Diabetes Progression Predictor ready")
def handler(ctx, data: io.BytesIO = None):
    try:
        answer = []
        json_obj = json.loads(data.getvalue())
        logging.getLogger().info("Input: " + str(json_obj))
        if isinstance(json_obj, list):
            # Build one input vector per object in the payload
            values = []
            for item in json_obj:
                inputvector = [item['age'], item['sex'], item['bmi'], item['bp'], item['s1'],
                               item['s2'], item['s3'], item['s4'], item['s5'], item['s6']]
                values.append(inputvector)
            # Evaluate the whole batch against the loaded model
            predict = pyfunc_predictor.predict(values)
            answer = predict.tolist()
        else:
            answer = "input object is not an array of objects: " + str(json_obj)
            logging.getLogger().error(answer)
            raise Exception(answer)
    except (Exception, ValueError) as ex:
        logging.getLogger().error('error parsing json payload: ' + str(ex))
    return response.Response(
        ctx, response_data=json.dumps(answer),
        headers={"Content-Type": "application/json"}
    )
We can save everything and deploy our function with the rocket button:
The final step is to set the environment variables the function needs (HOST, TOKEN and EXPERIMENTID) with this button:
Now we can test our model through the REST API, for example with Postman, sending a JSON array with the inputs:
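For example, an equivalent test from Python (the host is a placeholder; the two sample inputs are the first rows of the standardized dataset):
import json
import requests

# Two sample inputs with the ten normalized features
payload = [
    {"age": 0.0381, "sex": 0.0507, "bmi": 0.0617, "bp": 0.0219, "s1": -0.0442,
     "s2": -0.0348, "s3": -0.0434, "s4": -0.0026, "s5": 0.0199, "s6": -0.0176},
    {"age": -0.0019, "sex": -0.0446, "bmi": -0.0515, "bp": -0.0263, "s1": -0.0084,
     "s2": -0.0192, "s3": 0.0744, "s4": -0.0395, "s5": -0.0683, "s6": -0.0922}
]

# POST the array to the trigger endpoint of the serverless function
r = requests.post("https://<serverless-host>/diabetes-predictor",
                  data=json.dumps(payload),
                  headers={"Content-Type": "application/json"})
print(r.json())  # one predicted progression value per input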
Or we can create a dashboard model evaluator that uses this endpoint with some provided input.
Or we can evaluate this model in a streaming or batch Dataflow with the corresponding evaluator component.