Monitoring on FlowEngine

Introduction

This tutorial explains how the domain monitoring added to FlowEngine works.

Three new features have been added that will help you monitor and recover your domains in case of error.

Endpoint monitoring (healthcheck)

Each domain now exposes a monitoring endpoint that reports the domain’s status. The endpoint can be called at the following URLs:

  • Internal URL (in the CaaS cluster): http://flowengineservice:5050/<domain_name>/health

  • External URL (outside the CaaS cluster): http://<installation_url>/nodered/<domain_name>/health

This healthcheck service returns the following information in JSON format:

  • CPU usage.

  • Memory usage.

  • Information about the status of the different sockets.

{
  "cpu": 1.4831932773109242,
  "memory": 117186560,
  "sockets": [
    "node 12884 rtvachet 11u IPv6 263833 0t0 TCP *:28001 (LISTEN)",
    "node 12884 rtvachet 12u IPv6 263870 0t0 TCP localhost:28001->localhost:58338 (ESTABLISHED)",
    "node 12884 rtvachet 13u IPv6 262758 0t0 TCP localhost:28001->localhost:59326 (ESTABLISHED)"
  ]
}
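As an illustration of how a client might consume this response, the following Python sketch fetches the endpoint and reduces the payload to a short summary. It assumes the socket list is published under the key `sockets` and that each entry ends with its state in parentheses, as in the sample above. The function names `fetch_health` and `summarize_health` are hypothetical and not part of the platform.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_health(url: str, timeout: float = 5.0) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_health(health: dict) -> dict:
    """Reduce a health payload to CPU, memory in MiB, and socket counts per state."""
    states = Counter()
    for line in health.get("sockets", []):
        # Assumption: the state is the trailing parenthesised token, e.g. "(LISTEN)".
        if line.endswith(")") and "(" in line:
            states[line[line.rindex("(") + 1:-1]] += 1
    return {
        "cpu": health["cpu"],
        "memory_mib": round(health["memory"] / (1024 * 1024), 1),
        "socket_states": dict(states),
    }
```

Inside the CaaS cluster, a call such as `summarize_health(fetch_health("http://flowengineservice:5050/<domain_name>/health"))`, with the placeholder replaced by a real domain name, would return the summary for that domain.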

Automatic recovery in case of error

A new property has been added to the ControlPanel that detects when a domain has stopped running. When the property is active and the domain stops, the domain is restarted automatically.

To activate it, follow these steps:

  1. Select the “My Digital Flows” option from the DEVELOPMENT menu:

     

  2. Select the new option, “edit”:



  3. The new property “Reboot on failure” will now appear:


Checking this box causes the domain to reboot whenever it stops running because of a failure. On average, about 30 seconds pass between the domain failure and its recovery.

 

Additionally, a control has been added that counts the number of reboots within a time window. If the number of reboots of a domain exceeds the threshold set for that window, the domain remains stopped and the “Reboot on failure” check is automatically deactivated. The window size and reboot threshold are defined in the following platform-level properties:

  • onesaitplatform.flowengine.reboot.count.monitor.sec: Size, in seconds, of the time window in which reboots are counted. The default value corresponds to 30 minutes (1800 seconds).

  • onesaitplatform.flowengine.reboot.count.monitor.max: Maximum number of reboots allowed during the time window defined in the previous property. The default value is 10 reboots.
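To make the interaction between the two properties concrete, here is a minimal Python sketch of a reboot counter. It is only an illustration: it assumes a sliding window, whereas the platform may count reboots differently, and the class name `RebootMonitor` is hypothetical. The defaults mirror the values described above (1800 seconds, 10 reboots).

```python
from collections import deque


class RebootMonitor:
    """Counts reboots inside a sliding time window (illustrative assumption)."""

    def __init__(self, window_sec: int = 1800, max_reboots: int = 10):
        self.window_sec = window_sec    # plays the role of ...reboot.count.monitor.sec
        self.max_reboots = max_reboots  # plays the role of ...reboot.count.monitor.max
        self._reboots = deque()         # timestamps (seconds) of recent reboots

    def register_reboot(self, now: float) -> bool:
        """Record a reboot at time `now`; return False once the threshold is exceeded."""
        self._reboots.append(now)
        # Forget reboots that have fallen out of the window.
        while self._reboots and now - self._reboots[0] > self.window_sec:
            self._reboots.popleft()
        return len(self._reboots) <= self.max_reboots
```

With the default values, ten reboots spaced one minute apart are all accepted, while an eleventh within the same 30 minutes returns `False`, which corresponds to the case where the domain remains stopped.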

Automatic domain control based on the number of sockets in a state

Sometimes a domain is active (running) but does not perform as well as desired. To monitor domains more accurately, a number of controls have been added on the number of sockets and their states. From the domain editing screen itself, you can set the maximum number of sockets, either in total or in a specific state.


Each filter is active only if you select the checkbox associated with its state. If at any time the number of sockets in the indicated state exceeds the established limit, the domain is automatically stopped. If automatic restart has also been selected, the domain restarts after 30 to 60 seconds. This delay is needed for the processes to close correctly.
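The check described above can be sketched in Python as follows. This is not the platform's implementation: the `limits` mapping, the `TOTAL` key and the function names are hypothetical, and the sketch assumes socket lines in the format returned by the health endpoint, with the state as a trailing parenthesised token. Only states present in `limits` are evaluated, which stands in for the checkboxes that are selected.

```python
import re
from collections import Counter

# Assumption: each socket line ends with its state, e.g. "(ESTABLISHED)".
STATE_RE = re.compile(r"\(([A-Z_0-9]+)\)\s*$")


def count_states(socket_lines):
    """Count sockets per state; lines without a recognisable state are ignored."""
    counts = Counter()
    for line in socket_lines:
        match = STATE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def exceeded_limits(socket_lines, limits):
    """Return {state: current_count} for every enabled limit that is exceeded."""
    counts = count_states(socket_lines)
    exceeded = {}
    for state, limit in limits.items():
        # "TOTAL" is a hypothetical key meaning all sockets regardless of state.
        current = len(socket_lines) if state == "TOTAL" else counts[state]
        if current > limit:
            exceeded[state] = current
    return exceeded
```

A non-empty result corresponds to the situation in which the domain would be stopped. A count equal to the limit does not trigger a stop, because the limit has to be exceeded.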