Data Refiner Use Workshop


Available since version 2.3.0-Inmortal

Introduction

With the aim of incorporating new Data Governance capabilities into Onesait Platform, a new module called DataRefiner has been included, the purpose of which is to ‘refine’ the information that is uploaded to or extracted from the Platform.

This module allows:

  • Loading data from a user interface from various sources, such as the user's own computer, a URL, or information residing in the Platform itself.

  • Uploading data in the main formats, including Excel, XML, JSON, CSV, etc.

  • Working with the data in an Excel-like interface to perform data profiling, including data cleansing, enhancement, restructuring and reconciliation.

  • Downloading the ‘refined’ data as files or uploading it to a Platform Entity.

image-20250321-073312.png

Technology on which DataRefiner is based

The DataRefiner module is based on OpenRefine, a BSD-3 licensed, Java-based, open source tool that, through an Excel-style web interface, allows data from different sources and in different formats to be loaded, understood, cleaned, reconciled and improved.

OpenRefine, as an open source project, is available for download on GitHub. In order to integrate it into the Onesait Platform, a series of extensions have been developed to optimise it and make it compatible.

The project documentation is available on its website, so if you are interested in finding out how to start a project, explore and transform data or export results, we recommend that you go there.

How to use DataRefiner

This module is available in the ML & AI > Data Refiner Tool menu.

image-20250325-132830.png

Create a Project from an External URL

For this example we will use this dataset in CSV format belonging to the Isle of Wight Council: https://www.iwight.com/documentlibrary/download/transparency-data-april-2013-csv


After accessing the tool, the list of available projects will be displayed. To create a new project, click on the ‘Create’ button at the top right of the screen.

image-20250321-082322.png

A list of the different data sources will then be displayed. From this list, select ‘Web Addresses (URLs)’.

image-20250321-082921.png

Once you have entered the URL of the CSV file mentioned above, click on the ‘Next’ button:

image-20250321-083544.png

A screen will be displayed for a few seconds indicating that the information is being loaded, and then all data will be displayed in table format:

image-20250321-083703.png

This screen is part of the data analysis step: it allows you to configure how the data are loaded and to define the format they come in. At the bottom of the screen there is a series of loading configuration options, which are common to all data sources.

image-20250321-084408.png

Once the data is as required, click on the ‘Create Project’ button:

image-20250321-085529.png

A message will appear indicating that the project is being created, and as soon as it has been created, the project will open, showing a preview of ten records.

image-20250321-091805.png

To return to the list of projects, simply click on the ‘Back’ button:

image-20250321-092836.png

In the project list you can see the new project that has just been created:

image-20250321-093009.png

Other Examples of Use

Below are a series of examples of how to use DataRefiner with different data sources.

Ingest Local Files

In this example, a project will be generated as in the previous case, but this time the source of the data will not be a URL: it will come from a CSV file existing on the computer.

For this purpose, the following CSV can be used, which is the same one used in the previous example:

Once downloaded locally, a new project will be created, selecting in this case the ‘This Computer’ option:

image-20250321-102714.png

Once the CSV file has been selected, click on the ‘Next’ button:

image-20250321-103232.png

As in the previous case, a screen will be displayed for a few seconds indicating that the information is being loaded, and then all the data will be displayed in table format:

image-20250321-103412.png

Simple Transformations

In this example we are going to transform the type of a column from string type to numeric type.

To do so, we will use one of the projects previously created from the Isle of Wight data. From the project list, select one of them. On the right hand side you will see the table with ten preview records. You are going to work with the column named ‘Amount’.

By clicking on the dropdown menu icon above the column, the available options will be displayed.

image-20250321-104748.png

The aim is to transform the data cells from text to numbers, so to do this we will navigate to Edit cells > Common transforms > To number.

image-20250321-104937.png

After executing the transformation, a notification will be displayed indicating the completion of the transformation and the number of transformed records:

image-20250321-105030.png

The total number of records is 11,285, while the transformation has been performed on 9,041 records, so some records remain untransformed. To analyse what has happened, a facet will be applied. To do this, in the column you are working with, select the option Facet > Numeric facet.

image-20250321-105410.png

On the left-hand side, a data distribution graph will then be displayed, showing information on the number of numeric, non-numeric, empty and error records:

image-20250321-105615.png

By narrowing down the data distribution and looking at the ‘Amount’ column, you will see that the data preview is updated and displays something like this:

image-20250321-111940.png

Here you can see that the problem lies in the presence of commas as thousands separators, which prevents the transformation from taking place. To solve this, these comma characters will have to be removed.

To do this, in the column options, navigate to the menu Edit cells > Transform.

image-20250321-112516.png

A window will then appear in which an expression can be entered to carry out the transformation.

image-20250321-112736.png

It is possible to use different languages, such as GREL, Python or Clojure. In this case GREL (General Refine Expression Language) will be used.

image-20250321-112754.png

To perform the conversion, a replace command will be used:

value.replace(',','')

The window will display the original value and the transformed value after applying the command, so the transformation can be checked directly:

image-20250321-113039.png

Click on the ‘OK’ button to apply the changes; a notification will be displayed indicating the changes that have occurred.

image-20250321-113316.png

Once this is done, the transformation of the column from text to number will be carried out again in Edit cells > Common transforms > To number. If everything went smoothly, the notification will indicate that all cells have been modified.

image-20250321-113515.png

The value of 8,459 records is displayed, and not 10,703 records, due to the filtering applied on the histogram of the distribution of values. Resetting it (by clicking on ‘reset’) will show all the data.

image-20250321-113710.png
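As a side note, the two steps above (removing the thousands separators and converting the result to a number) could also be combined in a single pass from Edit cells > Transform, using the Python / Jython option of the expression editor. A minimal sketch, assuming the ‘Amount’ values are plain strings such as ‘1,234.56’:

# Python / Jython expression: 'value' holds the current cell value
# Remove the ',' thousands separators and convert the result to a number
return float(value.replace(',', ''))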

Advanced Transformations

It is possible to apply other, more advanced transformations, such as deriving one column from another by applying an expression.

For example, for the data you are working with in this workshop, you will expand the options for the ‘Date’ column, and select the option Edit column > Add column based on this column.

image-20250321-114741.png

A new window will then appear in which to enter the code needed to perform the transformation and apply it to the new column.

image-20250321-114834.png

As in the previous case, it is possible to choose between different expression languages. Again, GREL will be used.

The aim is to reformat the date field into a more generic format, so the following expression will be used:

value.toDate('dd.MM.yy').toString('yyyy-MM-dd')

The window will show the new field with the transformation applied:

image-20250321-115153.png

Click on the ‘OK’ button to create the new column and a notification will be displayed indicating the changes.

image-20250321-115306.png

In the preview table, the newly created column - named NewDate - will appear in the specified format.

image-20250321-115412.png
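As with the previous transformation, the Python / Jython option could be used here instead of GREL to derive the new column. A minimal sketch, assuming the source dates really do follow the dd.MM.yy pattern used above:

# Python / Jython expression: parse the original date and reformat it as yyyy-MM-dd
from datetime import datetime
return datetime.strptime(value, '%d.%m.%y').strftime('%Y-%m-%d')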

Pivot data

It is possible to pivot the order of the data in a simple way. For this example, we will make use of another dataset, which will have to be loaded as local data:

Once the project is generated, a table with three records will be displayed:

image-20250321-124607.png

The data will be pivoted from the ‘Country Name’ column. To do this, open the options for this column, and select the option Transpose > Transpose cells across columns into rows.

image-20250321-124746.png

A window will then appear in which you indicate from which column to which column you want to pivot. This will be configured as follows:

image-20250321-125006.png

Click on the ‘Transpose’ button to apply the changes. As in previous cases, a notification will be displayed indicating the changes made:

image-20250321-125131.png

Next, the ‘year’ column will be split into two different columns. To do this, from the options of this column, select the option Edit column > Split into several columns.

image-20250321-125307.png

A window is then displayed in which you can configure how the column will be separated.

image-20250321-130844.png

The separator in this case will be ‘:’. Clicking on ‘OK’ will execute the separation and a notification will be displayed:

image-20250321-130953.png

The column separation has been successful. The next step is to rename the columns to ‘Year’ and ‘Life Expectancy’. To do this, access the column options and select the option Edit column > Rename this column.

image-20250321-131151.png

A modal will then be launched in which the new column name can be entered:

image-20250321-131530.png

After accepting the new name, a notification will be displayed informing of the change.

image-20250321-131615.png

Export data in different formats

To export data, click on the ‘Export ▼’ button.

image-20250321-115652.png

A menu with different export options and formats will be displayed.

image-20250321-115743.png

By clicking on the option of interest, the file is downloaded in the indicated format.

image-20250321-115845.png

Exporting data to Onesait Platform

Normally, after refining the data, you will want to ingest it into an Entity rather than download it locally. For this reason, there is an extension that allows you to export the project directly to an Entity.

To do this, it can be exported either from the Export > EXPORT TO ONESAITPLATFORM button or from the Extensions section: Onesait Platform > EXPORT TO ONESAITPLATFORM.

image-20250321-120438.png

A window will then be displayed in which certain Entity configuration information must be entered:

image-20250321-120559.png
  • Use Proxy: if the instance is behind a proxy, this option must be checked.

  • Proxy: URL of the proxy. Default: http://proxy.indra.es

  • ProxyPort: the proxy port. Default: 8080

  • Token: the user token, to authorise the operation.

  • OP URL Server: the URL of the instance where the Platform in use is located.

  • Ontology Name: the name of the Entity.

  • New Ontology: if you want to create a new entity, tick this field.

  • Ontology Description: the descriptive text of the Entity, for when a new one is created.

image-20250321-123955.png

Import data from Onesait Platform

Just as you can export refined data to an Entity, you can import data from the Platform itself to DataRefiner.

To do this, from the list of projects, click on the ‘Create’ button located at the top right of the screen.

image-20250321-082322.png

A list of the different data sources will then be displayed. From this list, select the ‘Onesait Platform’ option.

image-20250321-135035.png

A list of possible existing connections will be displayed, as well as the option to create a new connection. In the latter case, the following information must be entered:

image-20250321-135151.png
  • Use Proxy: if the instance is behind a proxy, this option must be checked.

  • Proxy: URL of the proxy. Default: http://proxy.indra.es

  • ProxyPort: the port of the proxy. Default: 8080

  • Name: the name of the connection to create.

  • Host: the URL of the Onesait Platform instance.

  • User: the user to connect with.

  • Vertical: the vertical of the instance to connect to.

  • Password: the user's password.

After including this information, you can click on the ‘Test’ button to test the connection.

image-20250321-135832.png

After logging in, the query editor will appear. As an example, you will get the data of the Entity ‘Helsinkipopulation’.

Depending on the type of query, the data will be obtained in one way or another. Thus, to obtain all the data, the following query will be executed:

SELECT * FROM Helsinkipopulation

Using DataRefiner via the REST API

You can perform the same operations as from the user interface via the REST API. To access the Swagger with the API operations, navigate from the Control Panel to the APIs menu located in the context menu next to the user's name.

image-20250321-140342.png

A window with shortcuts to the different APIs will be displayed, and you will have to select the ‘Control Panel’ one:

image-20250321-140440.png

Once in Swagger, in the definition selector, choose ‘Data Refiner’:

image-20250321-140534.png

The Data Refiner API will then be displayed with the /datarefiner operation, which is the one to use.

To use this operation, the following fields must be filled in:

  • Authorization: the user's Bearer token.

  • Operations: the JSON with the transformation steps, which can easily be obtained by pressing the ‘Extract’ button in the user interface:

image-20250321-141006.png
image-20250321-141033.png
  • exportType: the format in which to export the data.

  • engine: the JSON with the facet and filtering configuration to apply (the same structure as the engineConfig block in the example below, e.g. { "facets": [], "mode": "row-based" }).

  • importOptions: the JSON with the relevant options of the file format.

Once the options for the operation have been defined, the only thing left to do is to click on the ‘Execute’ button to launch the query.
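For reference, below is a minimal sketch of what this call could look like from a script, using Python and the requests library. The host, the exact path of the /datarefiner operation and the way the fields are sent (here as form data) are assumptions for illustration; the Swagger definition described above is the authoritative reference.

import json
import requests

# Hypothetical values for illustration; take the real host and path from the Swagger page
BASE_URL = "https://<your-platform-host>"
TOKEN = "<user-bearer-token>"

# Transformation steps, as obtained with the 'Extract' button in the user interface
operations = [
    {
        "op": "core/column-rename",
        "oldColumnName": "year 1",
        "newColumnName": "year",
        "description": "Rename column year 1 to year",
    }
]

# The field names below are the ones listed above; how exactly they are sent
# (form data, query parameters, JSON body...) must be checked in the Swagger definition
response = requests.post(
    BASE_URL + "/datarefiner",
    headers={"Authorization": "Bearer " + TOKEN},
    data={
        "operations": json.dumps(operations),
        "exportType": "csv",
        "engine": json.dumps({"facets": [], "mode": "row-based"}),
        "importOptions": json.dumps({"separator": ","}),
    },
)
response.raise_for_status()
print(response.text)  # refined data in the requested export format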

Example of operation to pivot columns

[
  {
    "op": "core/transpose-columns-into-rows",
    "combinedColumnName": "year",
    "startColumnName": "2005",
    "columnCount": 5,
    "ignoreBlankCells": true,
    "fillDown": true,
    "prependColumnName": true,
    "separator": ":",
    "keyColumnName": null,
    "valueColumnName": null,
    "description": "Transpose cells in 5 column(s) starting with 2005 into rows in one new column named year"
  },
  {
    "op": "core/column-split",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "year",
    "guessCellType": true,
    "removeOriginalColumn": true,
    "mode": "separator",
    "separator": ":",
    "regex": false,
    "maxColumns": 0,
    "description": "Split column year by separator"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "year 2",
    "newColumnName": "life expectancy",
    "description": "Rename column year 2 to life expectancy"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "year 1",
    "newColumnName": "year",
    "description": "Rename column year 1 to year"
  }
]
