Training & deployment of models with BaseModelService
What is BaseModelService?
BaseModelService is a Python class that is distributed as part of the Python client for Onesait Platform (hereafter OSP). The code for this client is maintained by the OSP community on GitHub, and it can be installed using pip:
pip install onesaitplatform-client-services
BaseModelService allows you to:
Train models.
Retrain models to generate new versions of them.
Deploy the trained models.
Run inference with the deployed models.
All of this is done by taking advantage of the tools that OSP provides for:
Management of training datasets.
Storage of the trained models.
Control of their different versions.
Deployment through microservices.
BaseModelService abstracts away the management of all this functionality, allowing the model developer to use it in a simple way. As its name suggests, BaseModelService is a parent class from which the model developer derives a child class.
The child class will contain the specific code to train a particular model, save it to a local path, subsequently load the saved version also from a local path, and use it in inference. The model developer can use any Python library (scikit-learn, TensorFlow, PyTorch, etc.), as well as the model saving and loading mechanisms of their choice.
The rest of the interaction with OSP is already implemented in the BaseModelService parent class: downloading the dataset from a file in the File Repository or from an ontology, saving the trained models in the File Repository, downloading these models from the File Repository, checking the different versions of the same model, and selecting the preferred version.
What do I need to have in OSP?
OSP provides support for managing and storing datasets and models. To do so, the following must be configured:
A dataset in the File Repository. This will be the dataset that will be used to train the model.
Alternatively, an ontology in which the dataset in question is stored.
An ontology in which the different versions of the model are registered.
A Digital Client with which the previous ontologies are associated and through which they can be accessed.
The structure of the model version registration ontology must be as follows. In this case, the ontology created has been called SentimentAnalysisModels:
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "SentimentAnalysisModels",
    "type": "object",
    "required": [
        "SentimentAnalysisModels"
    ],
    "properties": {
        "SentimentAnalysisModels": {
            "type": "string",
            "$ref": "#/datos"
        }
    },
    "datos": {
        "description": "Info SentimentAnalysisModels",
        "type": "object",
        "required": [
            "name",
            "description",
            "asset",
            "version",
            "metrics",
            "hyperparameters",
            "model_path",
            "date",
            "active"
        ],
        "properties": {
            "name": {
                "type": "string"
            },
            "description": {
                "type": "string"
            },
            "asset": {
                "type": "string"
            },
            "version": {
                "type": "string"
            },
            "metrics": {
                "type": "array",
                "items": {
                    "type": "object",
                    "required": [
                        "name",
                        "value"
                    ],
                    "properties": {
                        "name": {
                            "type": "string"
                        },
                        "value": {
                            "type": "string"
                        },
                        "dtype": {
                            "type": "string"
                        }
                    },
                    "additionalProperties": false
                },
                "minItems": 0
            },
            "hyperparameters": {
                "type": "array",
                "items": {
                    "type": "object",
                    "required": [
                        "name",
                        "value"
                    ],
                    "properties": {
                        "name": {
                            "type": "string"
                        },
                        "value": {
                            "type": "string"
                        },
                        "dtype": {
                            "type": "string"
                        }
                    },
                    "additionalProperties": false
                },
                "minItems": 0
            },
            "model_path": {
                "type": "string"
            },
            "date": {
                "type": "string",
                "format": "date-time"
            },
            "dataset_path": {
                "type": "string"
            },
            "active": {
                "type": "boolean"
            },
            "ontology_dataset": {
                "type": "string"
            }
        }
    },
    "description": "Definition of trained models",
    "additionalProperties": true
}
The ontology fields, as shown above, are the following (an illustrative example record is shown after the list):
name: model name.
description: model description.
asset: name of the asset in which the model is framed.
version: model version.
metrics: list of evaluation metrics of the model (each item in the list consists of a name field with the name of the metric, a value field with its value, and a dtype field with the data type of the value).
hyperparameters: list of hyperparameters with which the model has been trained (each item in the list consists of a name field with the name of the hyperparameter, a value field with its value, and a dtype field with the data type of the value).
model_path: identifier of the file corresponding to the model stored in the File Repository.
date: date and time when the model was created.
dataset_path: identifier of the file corresponding to the training dataset used for training in the File Repository.
active: boolean denoting whether a version of the model is active. Usually, only one version of the model will be active, and it will be the one loaded as the serviced model.
ontology_dataset: name of the ontology in which the dataset has been saved.
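For orientation, a record registered in this ontology could look like the following; all values are made up for illustration:
{
    "SentimentAnalysisModels": {
        "name": "sentiment_analysis",
        "description": "Toy sentiment analysis model for Spanish texts",
        "asset": "SentimentAnalysis",
        "version": "0",
        "metrics": [
            {"name": "val_accuracy", "value": "0.81", "dtype": "float"}
        ],
        "hyperparameters": [
            {"name": "num_words", "value": "1000", "dtype": "int"},
            {"name": "epochs", "value": "5", "dtype": "int"}
        ],
        "model_path": "<file-id-of-the-model-in-the-file-repository>",
        "date": "2021-06-01T10:00:00Z",
        "dataset_path": "<file-id-of-the-dataset-in-the-file-repository>",
        "active": false,
        "ontology_dataset": "SentimentAnalysisDataset"
    }
}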
Creation of a BaseModelService child class: SentimentAnalysisModelService
To create an object that manages the training, saving, loading, deployment and use of a specific model, a class that inherits from BaseModelService must be created. In this tutorial, we are going to create a class that manages sentiment analysis models. It will be called SentimentAnalysisModelService:
from onesaitplatform.model import BaseModelService


class SentimentAnalysisModelService(BaseModelService):
    """Service for models of Sentiment Analysis"""

    def __init__(self, **kargs):
        """
        YOUR CODE HERE
        """
        super().__init__(**kargs)

    def load_model(self, model_path=None, hyperparameters=None):
        """Loads a previously trained model and saves it as one or more object attributes"""
        """
        YOUR CODE HERE
        """

    def train(self, dataset_path=None, hyperparameters=None, model_path=None):
        """
        Trains a model given a dataset and saves it in a local path.
        Returns a dictionary with the obtained metrics
        """
        """
        YOUR CODE HERE
        """
        return metrics

    def predict(self, inputs=None):
        """Predicts given a model and an array of inputs"""
        """
        YOUR CODE HERE
        """
        return results
As seen above, the child class must override the init, load_model, train and predict methods of BaseModelService.
Specifically, we are going to create a SentimentAnalysisModelService class that manages sentiment analysis models on Spanish data. It will be a binary text classifier: the output will be 0 for texts with negative sentiment and 1 for texts with positive sentiment. TensorFlow 2.x will be used for this purpose. A perceptron will be built whose input will be a bag of words with tf-idf. These are toy models: the intention is not to build good models, only to show how to develop them in a simple way. The model will be saved using h5 and pickle: one file with the weights resulting from the training and another file for the tokenizer that is trained to do the text preprocessing.
Thus, the libraries and classes needed to build the model as described above are imported. Along with them, the BaseModelService class is imported:
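As an illustration, and assuming the design just described (TensorFlow 2.x with the Keras Tokenizer, pandas for reading the CSV dataset, pickle for persisting the tokenizer, and os for handling local paths), the imports could be:

import os
import pickle

import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

from onesaitplatform.model import BaseModelService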
Override the init method
In the init method, you will initialize the attributes that will later be used in other methods to reference the model. Specifically, for the sentiment model to be developed in this tutorial, two attributes will be enabled: one to store the model itself (model), the neural network that will return 1 for texts with positive sentiment and 0 for texts with negative sentiment; and another one for the text preprocessor (preprocessor):
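A minimal sketch of this override could be the following; note that the attributes are initialized before calling the parent constructor, since the parent may already try to load an active model version:

class SentimentAnalysisModelService(BaseModelService):
    """Service for models of Sentiment Analysis"""

    def __init__(self, **kargs):
        # Attributes that will hold the neural network and the text preprocessor
        self.model = None
        self.preprocessor = None
        super().__init__(**kargs)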
Override the load_model method
The load_model method is in charge of building the model to be serviced from the file or files in which it has been previously saved. This method is executed when the object is created. The object constructor will search in the corresponding OSP ontology for the appropriate model, and download it from the OSP File Repository to a local directory. This download will contain exactly the files and/or directories that were created at the time the model was saved (see train method).
The following two parameters are passed to the load_model method:
model_path: the path to the local directory where the files and/or directories needed to load the model are located. The developer can assume that, in that path, they will find all the files and/or directories they created at the time they saved the model to be loaded. Therefore, they can use it to rebuild the model from those elements.
hyperparameters: this is a dictionary with all the hyperparameters that were used to train the model. They may be necessary for its reconstruction. In this example, they will not be used.
Specifically, for the SentimentAnalysisModelService class, it is assumed that the models are stored in two files: an h5 with the TensorFlow neural network and a pickle with the tokenizer object that preprocesses the text. Therefore, it is assumed that these two files have to be provided within the model_path directory:
model.h5
tokenizer.pkl
The neural network is stored in the model attribute, while the tokenizer is stored in the preprocessor attribute (both previously initialized in the init method). See the code:
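Continuing the class above, a possible implementation is sketched below; it assumes the complete Keras model was saved to model.h5 and the fitted tokenizer was pickled into tokenizer.pkl, as done in the train method further down:

    def load_model(self, model_path=None, hyperparameters=None):
        """Loads a previously trained model and saves it as object attributes"""
        # model_path is the local folder into which the files of the selected
        # model version have been downloaded from the File Repository
        self.model = tf.keras.models.load_model(os.path.join(model_path, "model.h5"))
        with open(os.path.join(model_path, "tokenizer.pkl"), "rb") as file:
            self.preprocessor = pickle.load(file)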
Override the train method
The train method is in charge of training the model. It is executed internally when the developer executes one of these methods, implemented in BaseModelService:
train_from_file_system: launches the training of a model from a dataset previously saved in the OSP File Repository.
train_from_ontology: launches the training of a model from a dataset stored in an OSP ontology.
The train method receives the following parameters:
dataset_path: the local path to the file in which the training dataset is provided. This file can have its origin in a file previously stored in the File Repository. In such a case, it will have exactly the format of the saved file. If the origin of the file is an ontology, it will have been converted to a CSV with "," as delimiter and as many columns as there are fields in the ontology records.
hyperparameters: a dictionary with the hyperparameters that were passed to the train_from_file_system or train_from_ontology methods at the time of launching the training.
model_path: the path to the local directory in which the files or directories of the trained model must be saved.
The developer will have to read the dataset from the local file provided in dataset_path. This will feed the training process. Once it is finished, the resulting model has to be saved in the directory indicated in model_path. In addition, the train method has to return a dictionary with the model evaluation metrics that the developer considers necessary.
In the case of SentimentAnalysisModelService, the training dataset will be assumed to be a CSV with "," as delimiter. This dataset will contain two columns (a small illustrative sample is shown after the list):
text: with the texts containing the opinions.
label: with a 1 for texts with positive opinion and 0 for those with negative opinion.
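For instance, the first rows of such a dataset could look like this (made-up examples):

text,label
"La película me ha encantado, la recomiendo",1
"El servicio fue lento y la comida estaba fría",0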
We train a model with TensorFlow 2.x. For the preprocessing of the texts, we use the Keras tokenizer, which converts each text into a vector of n positions representing a bag of words with tf-idf: each position of the vector denotes a word (always the same one) and holds a numeric value denoting how relevant that word is in the text in question. See the code:
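The following sketch implements this training flow within the class; the hyperparameter names num_words and epochs, the network size, and the returned metric are illustrative choices, not values fixed by BaseModelService:

    def train(self, dataset_path=None, hyperparameters=None, model_path=None):
        """Trains the sentiment model and saves it under model_path"""
        hyperparameters = hyperparameters or {}
        num_words = int(hyperparameters.get("num_words", 1000))
        epochs = int(hyperparameters.get("epochs", 5))

        # Read the CSV dataset with its "text" and "label" columns
        data = pd.read_csv(dataset_path, sep=",")
        texts = data["text"].astype(str).tolist()
        labels = data["label"].values

        # Fit the tokenizer and build tf-idf bag-of-words vectors
        tokenizer = Tokenizer(num_words=num_words)
        tokenizer.fit_on_texts(texts)
        x = tokenizer.texts_to_matrix(texts, mode="tfidf")

        # A small perceptron with a sigmoid output for binary classification
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(16, activation="relu", input_shape=(num_words,)),
            tf.keras.layers.Dense(1, activation="sigmoid")
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        history = model.fit(x, labels, epochs=epochs, validation_split=0.2)

        # Save the trained network and the fitted tokenizer to model_path
        model.save(os.path.join(model_path, "model.h5"))
        with open(os.path.join(model_path, "tokenizer.pkl"), "wb") as file:
            pickle.dump(tokenizer, file)

        # Return the evaluation metrics as a dictionary
        return {"val_accuracy": float(history.history["val_accuracy"][-1])}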
Override the predict method
The predict method receives a parameter (inputs) with the list of inputs for which inference is to be made; it calculates the output according to the model and returns it in a list. Specifically, for SentimentAnalysisModelService, the input is assumed to be a list of texts. It takes, on the one hand, the model from the model attribute and, on the other, the text preprocessor from the preprocessor attribute (both initialized in init and instantiated in load_model), and processes the inputs with them. The results are returned.
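Continuing the class, a possible implementation of this method:

    def predict(self, inputs=None):
        """Predicts given a model and an array of inputs"""
        # Turn the incoming texts into tf-idf bag-of-words vectors
        x = self.preprocessor.texts_to_matrix(inputs, mode="tfidf")
        # The network outputs one probability per text; round it to 0 (negative) or 1 (positive)
        probabilities = self.model.predict(x)
        return [int(round(float(p[0]))) for p in probabilities]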
Create an object, train and predict
It is assumed that the following items have been created in the OSP deployment of https://lab.onesaitplatform.com/:
A dataset as a CSV file with "," as a separator in the File Repository. The dataset will have two columns: text (with the texts) and label (with value 1 or 0, where 1 denotes that the text has positive sentiment and 0 denotes that the text has negative sentiment).
The same dataset in an ontology called SentimentAnalysisDataset, where each element will have a text field and a label field.
An ontology called SentimentAnalysisModels with the structure shown above.
A Digital Client associated to the two previous ontologies, called SentimentAnalysisDigitalClient.
The SentimentAnalysisModelService class described above.
With all this, the object sentiment_analysis_model_service of class SentimentAnalysisModelService is going to be created:
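The snippet below sketches this step. The constant values are placeholders, and the constructor keyword names are shown as lowercase versions of the constants described below; they are assumptions, so check the signature of BaseModelService in your version of the client:

PLATFORM_HOST = "lab.onesaitplatform.com"
PLATFORM_PORT = 443
PLATFORM_DIGITAL_CLIENT = "SentimentAnalysisDigitalClient"
PLATFORM_DIGITAL_CLIENT_TOKEN = "<digital-client-token>"
PLATFORM_DIGITAL_CLIENT_PROTOCOL = "https"
PLATFORM_DIGITAL_CLIENT_AVOID_SSL_CERTIFICATE = True
PLATFORM_ONTOLOGY_MODELS = "SentimentAnalysisModels"
PLATFORM_USER_TOKEN = "<user-token>"
TMP_FOLDER = "./tmp/"
NAME = "SentimentAnalysisModelService"

# Keyword names below are illustrative; check the BaseModelService signature
sentiment_analysis_model_service = SentimentAnalysisModelService(
    platform_host=PLATFORM_HOST,
    platform_port=PLATFORM_PORT,
    platform_digital_client=PLATFORM_DIGITAL_CLIENT,
    platform_digital_client_token=PLATFORM_DIGITAL_CLIENT_TOKEN,
    platform_digital_client_protocol=PLATFORM_DIGITAL_CLIENT_PROTOCOL,
    platform_digital_client_avoid_ssl_certificate=PLATFORM_DIGITAL_CLIENT_AVOID_SSL_CERTIFICATE,
    platform_ontology_models=PLATFORM_ONTOLOGY_MODELS,
    platform_user_token=PLATFORM_USER_TOKEN,
    tmp_folder=TMP_FOLDER,
    name=NAME
)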
The parameters passed to the object are as follows:
PLATFORM_HOST: Host of the OSP deployment to be worked on. In this case, lab.onesaitplatform.com.
PLATFORM_PORT: Port where the OSP is served.
PLATFORM_DIGITAL_CLIENT: Name of the Digital Client created in OSP to give access to the ontologies.
PLATFORM_DIGITAL_CLIENT_TOKEN: Authentication token corresponding to the Digital Client.
PLATFORM_DIGITAL_CLIENT_PROTOCOL: Protocol under which communications with OSP will be established.
PLATFORM_DIGITAL_CLIENT_AVOID_SSL_CERTIFICATE: True if connections without certificate are to be established.
PLATFORM_ONTOLOGY_MODELS: Name of the ontology where the different versions of the model created will be registered.
PLATFORM_USER_TOKEN: Authentication token of an OSP user.
TMP_FOLDER: Local directory into which elements from the OSP File Repository will be temporarily downloaded, and in which the models will be temporarily stored before being uploaded to the File Repository.
NAME: Name of the model service.
Once the sentiment_analysis_model_service object has been created, it is ready to train versions of the sentiment analysis model as defined in the SentimentAnalysisModelService class. In addition, at the time the object is created, if the OSP ontology referenced in PLATFORM_ONTOLOGY_MODELS already contains a model in active state (active True), that model will be loaded into memory and made available for use by the predict method.
To train a version of the model, one of these two methods can be executed:
train_from_file_system: launches the training of a model from a dataset previously stored in the OSP File Repository.
train_from_ontology: launches the training of a model from a dataset stored in an OSP ontology.
The following code launches the training of a model from a dataset previously uploaded to the OSP File Repository:
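The sketch below illustrates this call. The method name train_from_file_system comes from BaseModelService, but the argument names shown here (name, version, description, dataset_file_id, hyperparameters) are assumptions for illustration; check the method signature in your version of the client:

# Identifier of the dataset file previously uploaded to the OSP File Repository
DATASET_FILE_ID = "<file-id-in-the-file-repository>"

# Illustrative call; argument names may differ in your client version
sentiment_analysis_model_service.train_from_file_system(
    name="sentiment_analysis",
    version="0",
    description="Toy sentiment analysis model for Spanish texts",
    dataset_file_id=DATASET_FILE_ID,
    hyperparameters={"num_words": 1000, "epochs": 5}
)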
Notice that the value of DATASET_FILE_ID is the identifier of the file containing the dataset in the OSP File Repository.
Alternatively, the model can be trained from a dataset previously stored in an ontology. In the following code, the ontology is SentimentAnalysisDataset:
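Again as an illustration (only the method name and the ontology name come from this tutorial; the argument names are assumptions):

# Illustrative call; argument names may differ in your client version
sentiment_analysis_model_service.train_from_ontology(
    name="sentiment_analysis",
    version="1",
    description="Toy sentiment analysis model trained from an ontology",
    ontology="SentimentAnalysisDataset",
    hyperparameters={"num_words": 1000, "epochs": 5}
)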
Once a model version has been successfully trained, it will be stored in the OSP File Repository, and will be registered in the SentimentAnalysisModels ontology. Note that model versions are saved as active False. One of the versions will have to be activated in order to be available when creating a new SentimentAnalysisModelService object.
Once there is an active version of the model in SentimentAnalysisModels, when a new instance of SentimentAnalysisModelService is created, it will have this model loaded and available to be used in inference by means of the predict method:
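As an illustrative sketch (reusing the constants defined earlier, with the same caveat about constructor keyword names), the newly created object can be asked for predictions through the predict method defined in this tutorial:

# Built with the same parameters as in the previous section (keyword names illustrative)
sentiment_analysis_model_service = SentimentAnalysisModelService(
    platform_host=PLATFORM_HOST,
    platform_port=PLATFORM_PORT,
    platform_digital_client=PLATFORM_DIGITAL_CLIENT,
    platform_digital_client_token=PLATFORM_DIGITAL_CLIENT_TOKEN,
    platform_digital_client_protocol=PLATFORM_DIGITAL_CLIENT_PROTOCOL,
    platform_digital_client_avoid_ssl_certificate=PLATFORM_DIGITAL_CLIENT_AVOID_SSL_CERTIFICATE,
    platform_ontology_models=PLATFORM_ONTOLOGY_MODELS,
    platform_user_token=PLATFORM_USER_TOKEN,
    tmp_folder=TMP_FOLDER,
    name=NAME
)

# The active model version is loaded at construction time and used for inference
predictions = sentiment_analysis_model_service.predict(
    ["La película me ha encantado", "El servicio fue muy malo"]
)
print(predictions)  # e.g. [1, 0]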