Creation of a time series prediction model with Prophet
Introduction
This tutorial will explain the main steps to follow to create a predictive model with Prophet. In this case, we will create a model to make predictions based on a time series.
Prophet is an open source toolkit (launched by Facebook's Core Data Science team) for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly and daily seasonality, plus holiday effects. It has tools for Python and R.
We will use NO2 data from a Madrid weather station.
Access to the Platform's Notebooks
First, you must access the Platform CloudLab Environment with an Analytics role.
Once in the platform's control panel, access the Notebooks in the Analytics Tools menu.
Lastly, create a new Notebook.
The Notebook
The Notebook carries out the following main tasks:
- Data download.
- Environment Preparation.
- Data cleaning.
- Model creation and training.
- Model comparative example.
Data download
Download and extract the data from 2011 to 2018 from the Madrid open data portal:
We delete unused files:
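The original commands are not reproduced here. As a minimal sketch, assuming the yearly archives have already been saved locally as .zip files, the extraction and clean-up could look like this in Python. The function name, the folder layout and the default list of suffixes to keep are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_and_prune(zip_path, dest_dir, keep_suffixes=(".csv", ".txt")):
    """Extract one archive into dest_dir and delete files with other suffixes.

    Returns the sorted list of files that were kept.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)

    kept = []
    for path in sorted(dest.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() in keep_suffixes:
            kept.append(path)
        else:
            path.unlink()  # unused file: remove it
    return kept
```

Keep the archive itself outside `dest_dir`, otherwise it is deleted together with the other unused files.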
Environment Preparation
Next, you will create and activate a work environment with the necessary Python dependencies. This is a necessary task that must be performed for each model. You will use the Conda package manager.
Once the environment is activated, install the necessary libraries.
Data cleaning
First, import the Python dependencies. The virtual environment must be active.
Set variables to easily access the downloaded data. Use Python functions to avoid repeating code.
Once you have easy access to the files, create several functions to clean the data. These functions were developed for this specific case because they depend on the format of the data. Because the data covers several years, you will process each year independently and combine the results at the end.
The first functions read the data from .csv or .txt files and load it into a pandas DataFrame:
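The original reading functions are not shown here. A minimal sketch of such a loader, assuming the files are delimited text with a header row and a semicolon separator (both are assumptions about the data, and `read_month` is a hypothetical name), might be:

```python
import pandas as pd


def read_month(path, sep=";"):
    """Load one monthly .csv or .txt file into a DataFrame.

    dtype=str keeps codes such as "004" or "08" intact instead of
    letting pandas convert them to numbers.
    """
    return pd.read_csv(path, sep=sep, dtype=str)
```

Reading everything as text first makes it easier to spot malformed values before they are converted to numbers.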
Let's see an example of a loaded dataframe:
The following functions are used to format the data. You must perform these steps to obtain the data in the way Prophet needs.
Function to select the atmospheric station and the outcome magnitude:
Function to select a day:
Function to compact the day's data. This step is necessary to transpose rows and columns.
Let's see the compacted data:
The following function transposes the data:
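The original formatting functions are not reproduced here. As a hedged sketch of the selection and transposition steps, assume each row holds one day for one station and magnitude, with the 24 hourly readings in columns named H01 to H24 and the date in columns ANO, MES and DIA. All column names and function names below are assumptions for illustration.

```python
import pandas as pd

HOUR_COLS = [f"H{h:02d}" for h in range(1, 25)]  # assumed names: H01 .. H24


def select_series(df, station, magnitude):
    """Keep only the rows of one station and one measured magnitude."""
    mask = (df["ESTACION"] == station) & (df["MAGNITUD"] == magnitude)
    return df.loc[mask]


def hours_to_rows(df):
    """Turn one row per day (24 hourly columns) into one row per hour."""
    long = df.melt(id_vars=["ANO", "MES", "DIA"], value_vars=HOUR_COLS,
                   var_name="hour", value_name="y")
    day = pd.to_datetime(pd.DataFrame({
        "year": long["ANO"].astype(int),
        "month": long["MES"].astype(int),
        "day": long["DIA"].astype(int),
    }))
    # H01 is treated as 00:00 here; whether it marks the start or the end
    # of the hour in the real data is an assumption to verify.
    offset = pd.to_timedelta(long["hour"].str[1:].astype(int) - 1, unit="h")
    out = pd.DataFrame({"ds": day + offset, "y": long["y"].astype(float)})
    return out.sort_values("ds").reset_index(drop=True)
```

The output uses the column names `ds` and `y` because those are the names Prophet expects for the timestamp and the value.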
This function's result is shown below:
If you try to run these steps on every month of each year, you will run into a problem in August 2013: the data for that month contains an error on line 1421. Use a function to replace the value 001.NV with the value 001.0N. The suffix "N" is chosen because there is no way to know whether the measurement is valid, so it is marked as invalid.
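A replacement like this one can be done with a few lines of standard-library Python. The sketch below is an illustration, not the original function; the name `replace_token` and the default encoding are assumptions.

```python
from pathlib import Path


def replace_token(path, old="001.NV", new="001.0N", encoding="utf-8"):
    """Replace every occurrence of `old` with `new` in a text file, in place.

    Returns the number of replacements made.
    """
    file = Path(path)
    text = file.read_text(encoding=encoding)
    count = text.count(old)
    if count:
        file.write_text(text.replace(old, new), encoding=encoding)
    return count
```

Returning the count makes it easy to confirm that only the expected value was touched.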
The next step is to process the data using the previously developed functions. As explained above, you can perform the cleaning of the data month by month. For that purpose, use the following function:
Information about the month being processed is printed on the screen. Afterwards, you can inspect the result file with the clean data.
Repeat the previous task for every month for which you have data, then join all the output files into a single file and save it in CSV format.
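Joining the monthly outputs can be sketched as follows, assuming every output file is a CSV with the same header row. The function name is hypothetical.

```python
import csv


def join_csv_files(paths, out_path):
    """Concatenate CSV files that share a header, writing the header once.

    Returns the number of data rows written.
    """
    header = None
    rows_written = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # empty file, nothing to copy
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"unexpected header in {path}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Pass the paths in chronological order so that the joined file is already sorted by time.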
Model creation and training
The last task is creating and training the model. The trained model is then saved to a file (serialized) for later use.
The model you are going to make is a time series prediction model. This type of model needs a sequence of timestamps and the value of a variable at each timestamp. From this data, the model learns how the variable behaves over time.
This way, you can make forecasts of future values, or compare past values.
First, read the csv file with the data that you have previously cleaned. You will see a graph with a part of this data.
Use the installed library to create the Prophet model and train it with the loaded data. In the following image, you can see the iterations of the model training.
Once the model is trained, you can make a future time series to predict the values of the outcome variable.
The new time series looks like the following image, depending on the selected dates:
Predict future values with this new future time series and show the data.
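The training and prediction steps can be sketched as below. This is an illustration under assumptions, not the original notebook code: the helper names are hypothetical, the default 72-hour horizon stands for a three-day forecast, and the import path depends on the installed release (older releases are imported as `fbprophet`).

```python
import pandas as pd


def to_prophet_frame(df, time_col, value_col):
    """Return a copy with the two columns Prophet expects: ds and y."""
    out = df[[time_col, value_col]].rename(
        columns={time_col: "ds", value_col: "y"})
    out["ds"] = pd.to_datetime(out["ds"])
    return out


def fit_and_forecast(history, hours=72):
    """Fit Prophet on `history` (columns ds, y) and predict `hours` ahead."""
    # Imported here so the helper above works without Prophet installed.
    # Older releases use: from fbprophet import Prophet
    from prophet import Prophet

    model = Prophet()
    model.fit(history)
    # Hourly frequency; older pandas versions spell this alias "H".
    future = model.make_future_dataframe(periods=hours, freq="h")
    forecast = model.predict(future)
    return model, forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```

`model.plot(forecast)` then draws the history together with the predicted values and their uncertainty interval.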
Explanation:
This graph shows the hours from  to , with the last three days being the prediction values.
You can obtain several conclusions from this graph:
- The prediction data is quite similar to the training time series.
- The hourly pattern of the predictions is very similar to that of the training time series.
- We expect considerable error for the values of . This is because holidays were not considered when training the model.
- Prediction values for and will also be influenced by holidays, although to a lesser extent.
Bear in mind as well that the special anti-pollution scenarios in Madrid have not been considered. They could be modeled in the same way as holidays, but at this time there is no available data on the scenarios.
At this point, it is time for you to save the model for later use. For this example, save it in a pickle file in the same folder as the downloaded data.
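Saving and reloading with pickle can be sketched as follows. The function names and the file name in the usage note are hypothetical.

```python
import pickle


def save_model(model, path):
    """Serialize a fitted model object to a pickle file."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Load a model saved with save_model. Only unpickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

For example, `save_model(model, "no2_model.pkl")` writes the file and `load_model("no2_model.pkl")` restores the model in a later session. Newer Prophet releases also document a JSON serializer (`prophet.serialize.model_to_json`), which may be preferable to pickle depending on the installed version.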
Model comparative example
To compare the predicted values with the actual values in a time series, you will create another model with data from 2018.
This comparison model will be trained with the data prior to and will make the prediction on Dec-2018.
The steps, very similar to the previous model, are:
Data loading
Separation of train-test data (in the previous model, all the data were for training)
Model creation and training
Prediction and visualization
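The train-test separation and the final comparison can be sketched like this, assuming the cleaned data has the columns `ds` and `y` and the forecast has the columns `ds` and `yhat`. The function names are hypothetical, and the cutoff date is a parameter that you must choose yourself.

```python
import pandas as pd


def split_by_date(df, cutoff, time_col="ds"):
    """Rows before `cutoff` go to train; rows at or after it go to test."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test


def compare(test, forecast):
    """Align actual and predicted values on the timestamp."""
    merged = test.merge(forecast[["ds", "yhat"]], on="ds", how="inner")
    return merged.rename(columns={"y": "NO2_real", "yhat": "NO2_pred"})
```

Plotting the `NO2_real` and `NO2_pred` columns of the merged frame against `ds` gives a side-by-side view of actual and predicted values.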
Explanation:
The previous graph shows the inferred NO2 values (NO2_pred) and the actual NO2 values (NO2_real) for the first fifteen days of December 2018.
We must bear three things in mind:
- The data covers the first fifteen days because they include fewer holidays than the rest of the month (Christmas).
- Holiday data has not been taken into account. This data can be inserted in the model or be the input of a later model.
- Weather data has not been considered. This prediction would be the basis for another prediction model taking into account weather and pollution data.
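Holidays were not modeled in this tutorial, but Prophet accepts them as a DataFrame with the columns `holiday` and `ds`, plus optional `lower_window` and `upper_window` columns that extend the effect to the days around each date. The sketch below is one possible way to add them; the helper names, the holiday label and the window sizes are assumptions for illustration.

```python
import pandas as pd


def holidays_frame(dates, name, lower_window=0, upper_window=0):
    """Build a holidays table in the layout Prophet expects."""
    return pd.DataFrame({
        "holiday": name,
        "ds": pd.to_datetime(list(dates)),
        "lower_window": lower_window,  # days before (zero or negative)
        "upper_window": upper_window,  # days after (zero or positive)
    })


def build_model_with_holidays(holidays):
    """Create an unfitted Prophet model that includes the given holidays."""
    # Older releases use: from fbprophet import Prophet
    from prophet import Prophet

    return Prophet(holidays=holidays)
```

For example, `holidays_frame(["2017-12-06", "2018-12-06"], "constitution_day", lower_window=-2, upper_window=3)` marks Constitution Day (6 December) together with the two days before it and the three days after it, which would let the model learn departure and return effects separately from ordinary days.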
A deeper analysis of the results requires looking at separate time intervals:
From  to :
During these days, the predicted values are slightly higher than the real ones. These are the days between the weekend and Constitution Day (a national holiday), so a small percentage of the population did not work.
Day and :
These are the days before Constitution Day, so many car trips start from Madrid (Constitution exodus). You can see how pollution peaks are generated, particularly in the afternoon.
From to :
In the morning, the real pollution values do not match the expected ones. If holidays are not modeled, they introduce error over the years.
From to :
These days correspond to the car journeys back to Madrid (Constitution return), so you can also observe pollution peaks.