How to create automated tests for Dataflow pipelines?
Introduction
Onesait Platform's Dataflow module is a low-code tool for defining and executing data flows. In this article we will see how JUnit can be used to automate the testing of Dataflow pipelines.
To test the pipelines we need a Dataflow instance on which to execute them. To keep the test completely independent of external resources, we use the TestContainers library to run a Dataflow instance within the test itself. Note that it would be more efficient and faster to run the tests against a dedicated instance that already contains all the pipelines to be tested.
The purpose of this example is to show how this kind of test can be performed, so the pipeline we are going to test is very simple. Any other pipeline would be tested in a similar way.
The strategy followed in this example is to use the dataflow client library to manage the flows of an instance, execute a preview remotely and validate the values of the records in each of the stages. In other words, the same thing we would do manually and visually when using the preview of a pipeline, but automated.
Other types of tests could be performed, such as the complete execution of a pipeline and validating that the number of records in the outputs and the number and type of errors are as expected, but the information provided by this type of test is less accurate.
The complete code for this example is available on GitHub.
Dataflow Instance Configuration
As mentioned earlier, this example uses the TestContainers library, which allows for automatically starting and stopping containers in a test.
To use it with JUnit 5, you need the following dependencies:
<dependency>
    <groupId>org.testcontainers</groupId>
    <artifactId>testcontainers</artifactId>
    <version>1.18.3</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.testcontainers</groupId>
    <artifactId>junit-jupiter</artifactId>
    <version>1.18.3</version>
    <scope>test</scope>
</dependency>
Configuring the container in a basic way is very simple; you just need to annotate the test class with @Testcontainers and create an attribute with the @Container annotation. For example, in our case:
// Dataflow (StreamSets) image and port used for the test container
private static final String IMAGE = "registry.onesaitplatform.com/onesaitplatform/streamsets:4.1.0-overlord-323";
private static final int PORT = 18630;

// Container managed by TestContainers for the duration of the tests
@Container
public static GenericContainer<?> dataflow = new GenericContainer<>(IMAGE)
        .withExposedPorts(PORT)
        .withStartupTimeout(Duration.ofSeconds(120));
The TestContainers library automatically creates and destroys the container as part of the test execution (once per test class for static @Container fields, and once per test method for instance fields). This can slow down the execution of the tests. Additionally, each new container runs without any pre-installed pipelines or additional libraries, so these will need to be installed in the test process itself. There are several strategies to address this:
On the one hand, you can use an image that already has all the required pipelines and libraries installed. This is as simple as configuring a container and then creating an image from it with the command docker commit <container> <new_image_name>; a simple way to plug such an image into the tests is shown in the sketch below. On the other hand, you can have a ready and configured Dataflow instance available in an environment and use it during test execution.
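As an illustration of the first strategy, the image used by the test container can be made configurable, so that a pre-built image (for example, one created with docker commit) is used when available. This is only a sketch, and the system property name dataflow.test.image is an assumption of this example:
// Sketch: pick the Dataflow image from a system property so that a pre-built,
// fully configured image can be reused across test runs. The property name
// "dataflow.test.image" is an assumption of this example.
private static final String DEFAULT_IMAGE =
        "registry.onesaitplatform.com/onesaitplatform/streamsets:4.1.0-overlord-323";
private static final String IMAGE =
        System.getProperty("dataflow.test.image", DEFAULT_IMAGE);

@Container
public static GenericContainer<?> dataflow = new GenericContainer<>(IMAGE)
        .withExposedPorts(18630)
        .withStartupTimeout(Duration.ofSeconds(120));
Running the tests with -Ddataflow.test.image=<new_image_name> then uses the pre-built image, while the base image is used by default.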
Testing with JUnit
Creating a Dataflow Client
The first step needed to run the tests is to create an instance of the Dataflow client.
// Client pointing to the REST API of the Dataflow instance under test
ApiClient apiClient = new ApiClient(authType);
apiClient.setUserAgent("SDC CLI");
apiClient.setBasePath(getUrl(port) + "/rest");
apiClient.setUsername(user);
apiClient.setPassword(password);

// Register a deserializer so that record fields are mapped to Field objects
SimpleModule module = new SimpleModule();
module.addDeserializer(FieldJson.class, new FieldDeserializer());
apiClient.getJson().getMapper().registerModule(module);
In addition to creating the client, you can see in the code that a Deserializer is being registered. This is not mandatory. In this case, it is used to leverage the Dataflow Field classes to facilitate querying the values of the records, but the values could also be checked using the original JSON format.
The code for creating a client is encapsulated in a utility class so that it can be reused across multiple tests; it is available on GitHub.
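When the Dataflow instance is the TestContainers container started earlier, the host and port used by getUrl(port) in the snippet above can be obtained from the container itself. The following is only a sketch of what such a helper might look like, not the actual code of the utility:
// Sketch: build the base URL of the Dataflow REST API from the running container.
// getUrl(port) in the snippet above is assumed to do something along these lines.
private static String getUrl(int port) {
    return "http://" + dataflow.getHost() + ":" + dataflow.getMappedPort(port);
}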
Importing the Pipeline to Be Tested
The next step will be to import the pipeline to be tested.
See Configuring Dataflow to learn how to set up more complex scenarios.
The code for importing the pipeline is encapsulated in a utility class so that it can be reused across multiple tests; it is available on GitHub.
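As a rough illustration of what that utility does, the sketch below loads an exported pipeline definition from the test resources and imports it into the running instance through the client. The names PipelineTestUtils, importPipeline, DataflowPipelineTest and simple-pipeline.json are hypothetical and are used here only to show the intent; the exact import call depends on the Dataflow client version.
// Hypothetical sketch: "PipelineTestUtils.importPipeline", the test class
// "DataflowPipelineTest" and the resource "simple-pipeline.json" are
// illustrative names. The intent is to load the pipeline definition exported
// from the editor and push it to the Dataflow instance through the client.
String pipelineJson;
try (InputStream in = DataflowPipelineTest.class.getResourceAsStream("/pipelines/simple-pipeline.json")) {
    pipelineJson = new String(in.readAllBytes(), StandardCharsets.UTF_8);
}
String pipelineId = PipelineTestUtils.importPipeline(apiClient, pipelineJson);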
Running a Pipeline Preview
Once the pipeline is in the Dataflow instance, we will run a preview.
This involves three steps: running a preview, waiting for the result to be ready, and retrieving the preview data.
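A rough sketch of that flow is shown below. The previewApi object and its methods (runPreview, isPreviewDone, getPreviewData) are hypothetical names used only for illustration; the real calls go through the Dataflow client and are encapsulated in the utility class mentioned below.
// Hypothetical sketch: previewApi, runPreview, isPreviewDone and getPreviewData
// are illustrative names; the real calls are made through the Dataflow client.
String previewerId = previewApi.runPreview(pipelineId);

// Poll until the preview has finished, with a simple timeout.
long deadline = System.currentTimeMillis() + 30_000;
while (!previewApi.isPreviewDone(pipelineId, previewerId)) {
    if (System.currentTimeMillis() > deadline) {
        throw new AssertionError("Preview did not finish in time");
    }
    Thread.sleep(500);
}

// Retrieve the preview output so that the records of each stage can be validated.
PreviewData previewData = previewApi.getPreviewData(pipelineId, previewerId);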
The code for running a pipeline preview is encapsulated in a utility class so that it can be reused across multiple tests; it is available on GitHub.
Validating the Preview Result
Once the preview has been run, we can analyze the results to determine if they are as expected.
Essentially, this involves checking, batch by batch, whether the records in the stages of interest have the expected values; in other words, we verify that the expected data is obtained and that the expected transformations are performed before the data is sent to the destination.
The example we have included only generates one batch of data. In most cases, this will be sufficient. If tests with multiple batches are required, it is advisable to use a manageable number of them to simplify the validation of results. For each batch obtained, there will be a set of records at each stage of the pipeline. The test should verify that the values correspond to what is expected at each stage.
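A sketch of what those per-stage checks can look like for the example pipeline is shown below. The types and accessors used here (StageOutput, getInstanceName(), getRecords(), getField()), as well as the field path /text and the expected value, are assumptions made for illustration; the real test walks the preview output returned by the client and uses the Field classes mentioned earlier.
// Hypothetical sketch: the stage and record accessors, the field path "/text"
// and the expected value are illustrative. The intent is to assert, stage by
// stage, that the records contain the expected values.
for (StageOutput stage : batchOutput) {
    switch (stage.getInstanceName()) {
        case "DevRawDataSource_01":
            // The origin should emit the raw test record unchanged.
            assertEquals(1, stage.getRecords().size());
            assertEquals("expected value",
                    stage.getRecords().get(0).getField("/text").getValueAsString());
            break;
        case "Trash_01":
            // The destination should receive the transformed record.
            assertEquals(1, stage.getRecords().size());
            break;
        default:
            fail("Unexpected stage: " + stage.getInstanceName());
    }
}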
In the pipeline editor, the preview shows a visualization of the data at the output of each stage. The names DevRawDataSource_01 and Trash_01, used in the case block of the previous code snippet to determine the stage being checked, can be seen in the information panel of each stage.