When to use the Dataflow module?

Introduction

This article describes how Onesait Platform's Dataflow module can help solve various scenarios that are typical of many projects.

The Dataflow module allows you to define data flows, data transformations and so on, graphically and easily. You can consult the technical capabilities of the Dataflow module in detail at this link: https://onesaitplatform.atlassian.net/wiki/spaces/DOCT/pages/2220820644.

This article is not intended to show concrete examples of data flows, but rather to present, at a higher level, the possibilities offered by this module.

Streaming data processing

One of the most widely used capabilities of the Dataflow module is streaming data processing. With the Dataflow module, data flows can be defined end to end, from the connection to the data source through to the possible destinations of the data, including any data transformations needed along the way.

Let's take a look at an example to describe the different steps and possibilities in greater detail. The image at the end of this section shows an example with two different streaming data sources. These data sources can even use different technologies, e.g. communication brokers such as Apache Kafka and Google Pub/Sub. More traditional sources, such as relational databases, NoSQL databases, FTP servers, etc., can also be used as data sources. The list of supported technologies is extensive and continues to grow; it can be found at this link: https://onesaitplatform.atlassian.net/wiki/spaces/DOCT/pages/2220821331.

The Dataflow connects to external data sources through the pipelines defined in it. Once the data is obtained, the pipelines can validate it, enrich it with other sources and, in general, perform whatever transformations are needed. Finally, the data is stored in one or more destinations.

In this case, the Dataflow connectors will subscribe to the data sources and, as soon as new data becomes available, they will process it.
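
Although these pipelines are built graphically in the Dataflow editor rather than written by hand, the following Python sketch illustrates, conceptually, what such a streaming pipeline does: it stays subscribed to a source, validates and enriches each record, and writes it to a destination. The Kafka topic, the field names and the destination endpoint are illustrative assumptions, not part of the platform documentation.

```python
# Conceptual sketch of a streaming pipeline, assuming a Kafka topic called "sensor-data",
# "deviceId"/"value" fields and a hypothetical REST destination; all of these names are
# illustrative. Real pipelines are defined graphically in the Dataflow editor.
import json
from datetime import datetime, timezone

import requests
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "sensor-data",
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# The connector stays subscribed: each new record is processed as soon as it arrives.
for message in consumer:
    record = message.value

    # Validation step: discard records that are missing mandatory fields.
    if "deviceId" not in record or "value" not in record:
        continue

    # Enrichment step: add a derived field before storing, e.g. the ingestion timestamp.
    record["ingestedAt"] = datetime.now(timezone.utc).isoformat()

    # Destination step: persist the record (hypothetical endpoint, for illustration only).
    requests.post("https://platform.example.com/api/ingest", json=record, timeout=5)
```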

Batch processing

Another very common scenario in projects is that of tasks that have to be performed periodically without any user having to monitor or trigger them.

The image at the end of this section shows a scenario similar to the previous one. The difference is that, in this case, the connectors will not be subscribed to the data sources. Instead, a scheduler will launch the tasks periodically. Onesait Platform allows this scheduling using the Flowengine component (https://onesaitplatform.atlassian.net/wiki/spaces/PT/pages/308445197).

In this scenario, the pipeline will start when the scheduler tells it to. If necessary, the scheduler itself can pass parameters to the pipelines. Once the data has been processed, the pipelines will stop until the scheduler decides to launch a new execution.
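
As a rough illustration of this pattern, the sketch below shows a scheduler that starts a pipeline run once a day and passes it runtime parameters. The start URL and parameter names are hypothetical; on the platform, this scheduling would normally be configured in the Flowengine rather than written as a script.

```python
# Conceptual sketch of scheduled batch execution: start a pipeline run once a day and
# pass it runtime parameters. The start URL and parameter names are hypothetical; on the
# platform this scheduling would normally be configured in the Flowengine.
import datetime
import time

import requests

PIPELINE_START_URL = "https://platform.example.com/dataflow/pipelines/nightly-load/start"

def launch_pipeline_run() -> None:
    # Parameters handed to the pipeline for this particular execution.
    params = {"LOAD_DATE": datetime.date.today().isoformat(), "FULL_RELOAD": False}
    requests.post(PIPELINE_START_URL, json=params, timeout=10)

while True:
    launch_pipeline_run()      # the pipeline processes the batch and stops on its own
    time.sleep(24 * 60 * 60)   # wait for the next scheduled execution (daily in this sketch)
```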

Data replication between environments

Many projects need copies of production data in test or pre-production environments. The Dataflow module can meet this need, and there are several ways to perform this data replication with it.

The image in this section shows a scenario where the Dataflow of one environment exports data directly to another environment. Other possibilities would be for the Dataflow of the destination to read the data from the source, or even to use an intermediate broker or repository. Depending on the connectivity between the environments, different variants should be used. Connectivity limitations are usually determined by security requirements.
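
By way of illustration, the sketch below copies a collection from a production database to a pre-production one; a replication pipeline would perform essentially the same read-and-write, regardless of which environment hosts it. The connection strings and the collection name are assumptions made for the example.

```python
# Conceptual sketch of replicating a collection from production to pre-production.
# The connection strings and collection name are assumptions; a Dataflow pipeline in
# either environment (or an intermediate broker) could do the same read-and-write.
from pymongo import MongoClient  # pip install pymongo

source = MongoClient("mongodb://prod-db:27017")["platform"]["measurements"]
target = MongoClient("mongodb://preprod-db:27017")["platform"]["measurements"]

batch = []
for document in source.find({}, projection={"_id": False}):  # drop _id to avoid key clashes
    batch.append(document)
    if len(batch) >= 1000:   # copy in chunks to keep memory bounded
        target.insert_many(batch)
        batch.clear()

if batch:                    # flush the final partial chunk
    target.insert_many(batch)
```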

System integration

In many projects, integration between different systems is done at the data level. In these cases, the Dataflow module can be used to obtain data from external systems and make it available for new developments on the platform. The same happens the other way around: there are customers with tools or other systems that need to store the data locally, and someone must provide that data to them.

The image at the end of this section shows an example with two instances of the Dataflow module: one dedicated to acquiring data from external sources, and the other dedicated to providing data to external systems. Having dedicated Dataflow instances makes it easier to manage pipelines when their number starts to grow.

Data centralisation

In many cases, Onesait Platform is used to centralise data from different systems; data lake-type projects are a clear example. In this case, a pipeline will be defined for each data source. Thanks to the flexibility of Dataflow, a multitude of technologies can be used as data sources, and new sources can be added without having to deploy new software, as the Dataflow pipelines are defined dynamically.
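
To make the "one pipeline per source" idea concrete, the sketch below treats the sources as configuration data, so that centralising a new system means adding an entry rather than deploying new code. The registration endpoint and the source definitions are hypothetical, purely for illustration.

```python
# Conceptual sketch of the "one pipeline per source" pattern: sources are described as
# configuration, so centralising a new system means adding an entry, not deploying code.
# The registration endpoint and source definitions are hypothetical.
import requests

SOURCES = [
    {"name": "crm-orders", "type": "jdbc", "url": "jdbc:postgresql://crm:5432/orders"},
    {"name": "iot-readings", "type": "kafka", "url": "kafka:9092"},
    # A new data source is centralised by appending another entry here.
]

for source in SOURCES:
    # Create (or update) the pipeline that ingests this source into the data lake.
    requests.post(
        "https://platform.example.com/dataflow/pipelines",
        json={"pipeline": f"ingest-{source['name']}", "source": source},
        timeout=10,
    )
```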

Conclusions

We have seen some of the typical cases that can be implemented with Dataflow. Many more can be solved with it because, if this module stands out for anything, it is its flexibility.