How to create automated tests for Dataflow pipelines?
Introduction
Onesait Platform's Dataflow module is a low-code tool for defining and executing data flows. In this article we will see how JUnit can be used to automate the testing of Dataflow pipelines.
To test the pipelines we need a Dataflow instance on which to execute them. To keep the test completely independent of external resources, we use the TestContainers library to run a Dataflow instance within the test itself. Note that it would be more efficient and faster to run the tests against a dedicated instance that already contains all the pipelines to be tested.
The purpose of this example is to show how this kind of test can be performed, so the pipeline we are going to test is very simple. Any other pipeline would be tested in a similar way.
The strategy followed in this example is to use the dataflow client library to manage the flows of an instance, execute a preview remotely and validate the values of the records in each of the stages. In other words, the same thing we would do manually and visually when using the preview of a pipeline, but automated.
Other types of tests could be performed, such as the complete execution of a pipeline and validating that the number of records in the outputs and the number and type of errors are as expected, but the information provided by this type of test is less accurate.
The complete code for this example is available on GitHub.
Dataflow Instance Configuration
As mentioned earlier, this example uses the TestContainers library, which allows for automatically starting and stopping containers in a test.
To use it with JUnit 5, you need the following dependencies:
<dependency>
    <groupId>org.testcontainers</groupId>
    <artifactId>testcontainers</artifactId>
    <version>1.18.3</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.testcontainers</groupId>
    <artifactId>junit-jupiter</artifactId>
    <version>1.18.3</version>
    <scope>test</scope>
</dependency>
Configuring the container in a basic way is very simple; you just need to annotate the test class with @Testcontainers and create an attribute with the @Container annotation. For example, in our case:
// Dataflow (StreamSets) image and port used for the test container
private static final String IMAGE = "registry.onesaitplatform.com/onesaitplatform/streamsets:4.1.0-overlord-323";
private static final int PORT = 18630;

// Container managed by TestContainers for the duration of the tests
@Container
public static GenericContainer<?> dataflow = new GenericContainer<>(IMAGE)
        .withExposedPorts(PORT)
        .withStartupTimeout(Duration.ofSeconds(120));
The TestContainers library automatically creates and destroys the container as part of the test execution (once per test class for static @Container fields, and once per test method for instance fields). This can slow down the execution of the tests. Additionally, each new container runs without any pre-installed pipelines or additional libraries, so these will need to be installed in the test process itself. There are several strategies to address this:
On the one hand, you can use an image that already has all the required pipelines and libraries installed. This is as simple as configuring a container and then creating an image from it with the command docker commit <container> <new_image_name>; a simple way to plug such an image into the tests is shown in the sketch below. On the other hand, you can have a ready and configured Dataflow instance available in an environment and use it during test execution.
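As an illustration of the first strategy, the image used by the test container can be made configurable, so that a pre-built image (for example, one created with docker commit) is used when available. This is only a sketch, and the system property name dataflow.test.image is an assumption of this example:
// Sketch: pick the Dataflow image from a system property so that a pre-built,
// fully configured image can be reused across test runs. The property name
// "dataflow.test.image" is an assumption of this example.
private static final String DEFAULT_IMAGE =
        "registry.onesaitplatform.com/onesaitplatform/streamsets:4.1.0-overlord-323";
private static final String IMAGE =
        System.getProperty("dataflow.test.image", DEFAULT_IMAGE);

@Container
public static GenericContainer<?> dataflow = new GenericContainer<>(IMAGE)
        .withExposedPorts(18630)
        .withStartupTimeout(Duration.ofSeconds(120));
Running the tests with -Ddataflow.test.image=<new_image_name> then uses the pre-built image, while the base image is used by default.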
Testing with JUnit
Creating a Dataflow Client
The first step needed to run the tests is to create an instance of the Dataflow client.
// Client pointing to the REST API of the Dataflow instance under test
ApiClient apiClient = new ApiClient(authType);
apiClient.setUserAgent("SDC CLI");
apiClient.setBasePath(getUrl(port) + "/rest");
apiClient.setUsername(user);
apiClient.setPassword(password);

// Register a deserializer so that record fields are mapped to Field objects
SimpleModule module = new SimpleModule();
module.addDeserializer(FieldJson.class, new FieldDeserializer());
apiClient.getJson().getMapper().registerModule(module);
In addition to creating the client, you can see in the code that a Deserializer is being registered. This is not mandatory. In this case, it is used to leverage the Dataflow Field classes to facilitate querying the values of the records, but the values could also be checked using the original JSON format.
The code for creating a client is encapsulated in a utility class so that it can be reused across multiple tests; it is available on GitHub.
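When the Dataflow instance is the TestContainers container started earlier, the host and port used by getUrl(port) in the snippet above can be obtained from the container itself. The following is only a sketch of what such a helper might look like, not the actual code of the utility:
// Sketch: build the base URL of the Dataflow REST API from the running container.
// getUrl(port) in the snippet above is assumed to do something along these lines.
private static String getUrl(int port) {
    return "http://" + dataflow.getHost() + ":" + dataflow.getMappedPort(port);
}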
Importing the Pipeline to Be Tested
The next step will be to import the pipeline to be tested.
See Configuring Dataflow to learn how to set up more complex scenarios.
The code for importing the pipeline is encapsulated in a utility class so that it can be reused across multiple tests; it is available on GitHub.
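As a rough illustration of what that utility does, the sketch below loads an exported pipeline definition from the test resources and imports it into the running instance through the client. The names PipelineTestUtils, importPipeline, DataflowPipelineTest and simple-pipeline.json are hypothetical and are used here only to show the intent; the exact import call depends on the Dataflow client version.
// Hypothetical sketch: "PipelineTestUtils.importPipeline", the test class
// "DataflowPipelineTest" and the resource "simple-pipeline.json" are
// illustrative names. The intent is to load the pipeline definition exported
// from the editor and push it to the Dataflow instance through the client.
String pipelineJson;
try (InputStream in = DataflowPipelineTest.class.getResourceAsStream("/pipelines/simple-pipeline.json")) {
    pipelineJson = new String(in.readAllBytes(), StandardCharsets.UTF_8);
}
String pipelineId = PipelineTestUtils.importPipeline(apiClient, pipelineJson);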
Running a Pipeline Preview
Once the pipeline is in the Dataflow instance, we will run a preview.
This involves three steps: running a preview, waiting for the result to be ready, and retrieving the preview data.
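A rough sketch of that flow is shown below. The previewApi object and its methods (runPreview, isPreviewDone, getPreviewData) are hypothetical names used only for illustration; the real calls go through the Dataflow client and are encapsulated in the utility class mentioned below.
// Hypothetical sketch: previewApi, runPreview, isPreviewDone and getPreviewData
// are illustrative names; the real calls are made through the Dataflow client.
String previewerId = previewApi.runPreview(pipelineId);

// Poll until the preview has finished, with a simple timeout.
long deadline = System.currentTimeMillis() + 30_000;
while (!previewApi.isPreviewDone(pipelineId, previewerId)) {
    if (System.currentTimeMillis() > deadline) {
        throw new AssertionError("Preview did not finish in time");
    }
    Thread.sleep(500);
}

// Retrieve the preview output so that the records of each stage can be validated.
PreviewData previewData = previewApi.getPreviewData(pipelineId, previewerId);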
The code for running a pipeline preview is encapsulated in a utility class so that it can be reused across multiple tests; it is available on GitHub.
Validating the Preview Result
Once the preview has been run, we can analyze the results to determine if they are as expected.
Essentially, this involves checking, batch by batch, whether the records in the stages of interest have the expected values; in other words, we verify that the expected data is obtained and that the expected transformations are performed before the data is sent to the destination.
The example we have included only generates one batch of data. In most cases, this will be sufficient. If tests with multiple batches are required, it is advisable to use a manageable number of them to simplify the validation of results. For each batch obtained, there will be a set of records at each stage of the pipeline. The test should verify that the values correspond to what is expected at each stage.
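A sketch of what those per-stage checks can look like for the example pipeline is shown below. The types and accessors used here (StageOutput, getInstanceName(), getRecords(), getField()), as well as the field path /text and the expected value, are assumptions made for illustration; the real test walks the preview output returned by the client and uses the Field classes mentioned earlier.
// Hypothetical sketch: the stage and record accessors, the field path "/text"
// and the expected value are illustrative. The intent is to assert, stage by
// stage, that the records contain the expected values.
for (StageOutput stage : batchOutput) {
    switch (stage.getInstanceName()) {
        case "DevRawDataSource_01":
            // The origin should emit the raw test record unchanged.
            assertEquals(1, stage.getRecords().size());
            assertEquals("expected value",
                    stage.getRecords().get(0).getField("/text").getValueAsString());
            break;
        case "Trash_01":
            // The destination should receive the transformed record.
            assertEquals(1, stage.getRecords().size());
            break;
        default:
            fail("Unexpected stage: " + stage.getInstanceName());
    }
}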
In the pipeline editor, the preview shows a visualization of the data at the output of each stage. The names DevRawDataSource_01 and Trash_01, used in the case block of the previous code snippet to determine the stage being checked, can be seen in the information panel of each stage.