Data Refiner Use Workshop
Available since version 2.3.0-Inmortal
Introduction
To add new Data Governance capabilities to Onesait Platform, a new module called DataRefiner has been included. Its purpose is to ‘refine’ the information that is uploaded to, or extracted from, the Platform.
This module will allow:
Loading data from a user interface from various sources, such as the user's own computer, a URL, or information residing in the Platform itself.
Uploading data in the main formats, including Excel, XML, JSON and CSV.
Working with these data in an Excel-like interface to perform data profiling, including data cleansing, enhancement, restructuring and reconciliation.
Downloading the ‘refined’ data as files, or uploading it to a Platform Entity.
Technology on which DataRefiner is based
The DataRefiner module is based on OpenRefine, a BSD-3-licensed, open-source Java tool whose Excel-style web interface allows data from different sources and in different formats to be loaded, understood, cleaned, reconciled and improved.
OpenRefine, as an open source project, is available for download on GitHub. In order to integrate it into the Onesait Platform, a series of extensions have been developed to optimise it and make it compatible.
The project documentation is available on its website, so if you are interested in finding out how to start a project, explore and transform data or export results, we recommend that you go there.
How to use DataRefiner
This module is available in the ML & AI > Data Refiner Tool menu.
Create a Project from an External URL
For this example we will use this dataset in CSV format belonging to the Isle of Wight Council: https://www.iwight.com/documentlibrary/download/transparency-data-april-2013-csv
After accessing the tool, the list of available projects will be displayed. To create a new project, click on the ‘Create’ button at the top right of the screen.
A list of the different data sources will then be displayed. From this list, select ‘Web Addresses (URLs)’.
Once you have entered the URL of the CSV file mentioned above, click on the ‘Next’ button:
A screen will be displayed for a few seconds indicating that the information is being loaded, and then all data will be displayed in table format:
This screen is part of the data analysis: here you can configure how the data are loaded and define the format in which they arrive. At the bottom of the screen there is a series of options for the loading configuration, which are common to all data sources.
Once the data is as required, click on the ‘Create Project’ button:
A message will appear indicating that the project is being created, and as soon as it is finished, the project will be displayed with a preview of ten records.
To return to the list of projects, simply click on the ‘Back’ button:
In the project list you can see the new project that has just been created:
Other Examples of Use
Below are a series of examples of how to use DataRefiner with different origins.
Ingest Local Files
In this example a project will be generated as in the previous case, but this time the data will not come from a URL; instead, it will come from a CSV file on your computer.
For this purpose, the following CSV can be used, which corresponds to the CSV used in the previous example:
Once downloaded locally, a new project will be created, selecting in this case the ‘This Computer’ option:
Once the CSV file has been selected, click on the ‘Next’ button:
As in the previous case, a screen will be displayed for a few seconds indicating that the information is being loaded, and then all the data will be displayed in table format:
Simple Transformations
In this example we are going to transform the type of a column from string type to numeric type.
To do so, we will use one of the projects previously created from the Isle of Wight data. From the project list, select one of them. On the right hand side you will see the table with ten preview records. You are going to work with the column named ‘Amount’.
By clicking on the ‘▼’ icon above the column, the available options will be displayed.
The aim is to transform the data cells from text to numbers, so to do this we will navigate to Edit cells > Common transforms > To number.
After executing the transformation, a notification will be displayed indicating that the transformation has completed, together with the number of transformed records:
The total number of records is 11,285, while the transformation has been performed on 9,041 records, so that there are still untransformed records. To analyse what has happened, a facet will be applied. To do this, in the column in which you are working, select the option Facet > Numeric facet.
On the left-hand side, a data distribution graph will then be displayed, showing information on the number of numeric, non-numeric, empty and error records:
By narrowing down the data distribution, and looking at the ‘Amount’ column, you will see that the data preview is updated and will look something like this:
Here you can see that the problem lies in the presence of ‘commas’ as thousands separators, which causes the transformation not to take place. To solve it, these ‘comma’ characters will have to be removed.
To do this, in the column options, navigate to the menu Edit cells > Transform.
A window will then appear in which a code can be entered to carry out the transformation.
It is possible to use different languages, such as GREL, Python or Clojure. In this case GREL (General Refine Expression Language) will be used.
To perform the conversion, a replace command will be used:
value.replace(',','')
The window will display the original value, and the transformed value after applying the command, so that you can check the transformation directly:
Click on the ‘OK’ button to apply the changes, and a notification will be displayed indicating the changes that have occurred.
Once this is done, the transformation of the column from text to number will be carried out again in Edit cells > Common transforms > To number. If everything went smoothly, the notification will indicate that all cells have been modified.
The value of 8,459 records is displayed, and not 10,703, because of the filtering applied to the histogram of the value distribution. Resetting it (by clicking on ‘reset’) will show all the data.
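For readers who want to reproduce this logic outside DataRefiner, the same cleanup can be sketched in plain Python. This is only an illustration of what the GREL steps do; the sample values are invented:

```python
# Reproduce the DataRefiner cleanup in plain Python:
# strip thousands separators, then convert text to numbers.
def refine_amount(raw: str):
    """Return the numeric value, or None if the cell is not numeric."""
    cleaned = raw.replace(',', '')   # same effect as GREL value.replace(',','')
    try:
        return float(cleaned)        # same effect as 'To number'
    except ValueError:
        return None                  # non-numeric cells stay untransformed

amounts = ['1,250.50', '300', 'N/A', '11,285']
print([refine_amount(a) for a in amounts])
```

Cells that fail the conversion (here, ‘N/A’) stay untransformed, which is exactly why the numeric facet above reports non-numeric records.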
Advanced Transformations
It is possible to apply a number of other more advanced transformations, such as deriving one column from another by applying transformations.
For example, for the data you are working with in this workshop, you will expand the options for the ‘Date’ column, and select the option Edit column > Add column based on this column.
A new window will then appear in which to enter the code needed to perform the transformation and apply it to the new column.
As in the previous case, it is possible to define different programming languages to be used. Again, GREL will be used.
The aim is to format the date field to something more generic, so the following code will be used as a command:
value.toDate('dd.MM.yy').toString('yyyy-MM-dd')
The window will show the new field with the transformation applied:
Click on the ‘OK’ button to create the new column and a notification will be displayed indicating the changes.
In the preview table, the newly created column, named NewDate, will appear in the specified format.
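The equivalent of this GREL expression can be sketched in plain Python with the standard library. The sample date is invented, and the sketch assumes the source dates arrive in day.month.two-digit-year form:

```python
from datetime import datetime

def reformat_date(cell: str) -> str:
    """Parse a 'dd.MM.yy' date and reformat it as 'yyyy-MM-dd',
    mirroring the GREL toDate/toString chain."""
    return datetime.strptime(cell, '%d.%m.%y').strftime('%Y-%m-%d')

print(reformat_date('05.04.13'))
```

Note that GREL date patterns follow the Java convention, where MM is the month and mm is minutes, while Python's strptime uses %m for the month in either position.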
Pivot data
It is possible to pivot the order of the data in a simple way. For this example, we will make use of this other dataset, which will have to be loaded as local data:
Once the project is generated, a table with three records will be displayed:
The data will be pivoted from the ‘Country Name’ column. To do this, open the options for this column, and select the option Transpose > Transpose cells across columns into rows.
A window will then appear in which you indicate from which column to which column you want to pivot. This will be configured as follows:
Click on the ‘Transpose’ button to apply the changes. As in previous cases, a notification will be displayed indicating the changes made:
Next, the ‘year’ column will be split into two different columns. To do this, from the options of this column, select the option Edit column > Split into several columns.
A window is then displayed in which you can configure how the column will be separated.
The separator in this case will be ‘:’. Clicking on ‘OK’ will execute the separation and a notification will be displayed:
The column separation has been successful. The next step is to rename the columns to ‘Year’ and ‘Life Expectancy’. To do this, access the column options and select the option Edit column > Rename this column.
A modal will then be launched in which the new column name can be entered:
After accepting the new name, a notification will be displayed informing of the change.
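The three steps of this section (transpose, split, rename) can be sketched in plain Python to make the data flow explicit. The wide row below is a made-up example in the shape of the life-expectancy dataset:

```python
# A wide row: one column per year, as in the source dataset.
row = {'Country Name': 'Spain', '2005': '80.1', '2006': '80.4', '2007': '80.9'}

# 1. Transpose cells across columns into rows, prepending the
#    column name so each cell becomes 'year:value'.
long_rows = [{'Country Name': row['Country Name'], 'year': f'{col}:{val}'}
             for col, val in row.items() if col != 'Country Name']

# 2. Split the combined 'year' column on ':' into two columns.
# 3. Rename them to 'Year' and 'Life Expectancy'.
result = []
for r in long_rows:
    year, value = r['year'].split(':')
    result.append({'Country Name': r['Country Name'],
                   'Year': year, 'Life Expectancy': value})

print(result)
```

Each wide row thus becomes one long row per year column, which is the same reshaping DataRefiner performs with the Transpose and Split options.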
Export data in different formats
To export data, click on the ‘Export ▼’ button.
A menu with different export options and formats will be displayed.
By clicking on the option of interest, the file is downloaded in the indicated format.
Exporting data to Onesait Platform
Normally, after refining the data, you will want to ingest it into an Entity rather than download it locally. For this reason, there is an extension that allows you to export the project directly to an Entity.
To do this, it can be exported either from the Export > EXPORT TO ONESAITPLATFORM button or from the Extensions section: Onesait Platform > EXPORT TO ONESAITPLATFORM.
A window will then be displayed in which certain Entity configuration information must be entered:
Use Proxy: if the instance is behind a proxy, this option must be checked.
Proxy: URL of the proxy. Default: http://proxy.indra.es
ProxyPort: the proxy port. Default: 8080
Token: the user token, to authorise the operation.
OP URL Server: the URL of the instance where the Platform in use is located.
Ontology Name: the name of the Entity.
New Ontology: if you want to create a new entity, tick this field.
Ontology Description: the descriptive text of the Entity, for when a new one is created.
Import data from Onesait Platform
Just as you can export refined data to an Entity, you can import data from the Platform itself to DataRefiner.
To do this, from the list of projects, click on the ‘Create’ button located at the top right of the screen.
A list of the different data sources will then be displayed. From this list, select the ‘Onesait Platform’ option.
A list of possible existing connections will be displayed, as well as the option to create a new connection. In the latter case, the following information must be entered:
Use Proxy: if the instance is behind a proxy, this option must be checked.
Proxy: URL of the proxy. Default: http://proxy.indra.es
ProxyPort: the port of the proxy. Default: 8080
Name: the name of the connection to create.
Host: the URL of the Onesait Platform instance.
User: the user to connect with.
Vertical: the vertical of the instance to connect to.
Password: the user's password.
After including this information, you can click on the ‘Test’ button to test the connection.
After logging in, the query editor will appear. As an example, you will get the data of the Entity ‘Helsinkipopulation’.
Depending on the type of query, the data will be obtained in one way or another. Thus, to obtain all the data, the following query will be executed:
SELECT * FROM Helsinkipopulation
Using DataRefiner via REST API
You can perform the same operations as in the user interface via the REST API. To access the Swagger page with the API operations, navigate from the Control Panel to the APIs menu, located in the context menu next to the user's name.
A window with shortcuts to the different APIs will be displayed, and you will have to select the ‘Control Panel’ one:
Once in Swagger, in the definition selector, choose ‘Data Refiner’:
The Data Refiner API will then be displayed with the /datarefiner operation, which is the one to use.
To use this operation, the following fields must be filled in:
Authorization: the user's Bearer token.
Operations: the list of steps to apply, which can be obtained easily by clicking the ‘Extract’ button in the user interface.
exportType: the format in which to export the data.
engine:
importOptions: the JSON with the relevant options of the file format.
Once the options for the operation have been defined, the only thing left to do is to click on the ‘Execute’ button to launch the query.
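As a minimal sketch, the request to the /datarefiner operation could be assembled as follows. The base URL and the exact endpoint path are assumptions for illustration; only the fields listed above (Authorization, operations, exportType, importOptions) come from this documentation, so check the Swagger definition for the real route and encoding:

```python
import json

# Hypothetical Control Panel URL: replace with your own instance.
BASE_URL = 'https://your-instance/controlpanel'

def build_datarefiner_request(token, operations, export_type, import_options):
    """Assemble URL, headers and form fields for the /datarefiner operation.

    'operations' is the list of steps extracted from the user interface;
    'export_type' is one of the allowed export formats (e.g. 'csv').
    """
    headers = {'Authorization': f'Bearer {token}'}      # the user's Bearer token
    payload = {
        'operations': json.dumps(operations),           # refinement steps as JSON
        'exportType': export_type,
        'importOptions': json.dumps(import_options),    # file-format options
    }
    # Hypothetical path; check the Swagger definition for the exact route.
    return f'{BASE_URL}/api/datarefiner', headers, payload

url, headers, payload = build_datarefiner_request(
    'my-token',
    [{'op': 'core/column-rename', 'oldColumnName': 'year 1', 'newColumnName': 'year'}],
    'csv',
    {'separator': ','},
)
# The call itself could then be made with e.g. requests.post(url, headers=headers, data=payload).
```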
Example of operation to pivot columns
[
  {
    "op": "core/transpose-columns-into-rows",
    "combinedColumnName": "year",
    "startColumnName": "2005",
    "columnCount": 5,
    "ignoreBlankCells": true,
    "fillDown": true,
    "prependColumnName": true,
    "separator": ":",
    "keyColumnName": null,
    "valueColumnName": null,
    "description": "Transpose cells in 5 column(s) starting with 2005 into rows in one new column named year"
  },
  {
    "op": "core/column-split",
    "engineConfig": { "facets": [], "mode": "row-based" },
    "columnName": "year",
    "guessCellType": true,
    "removeOriginalColumn": true,
    "mode": "separator",
    "separator": ":",
    "regex": false,
    "maxColumns": 0,
    "description": "Split column year by separator"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "year 2",
    "newColumnName": "life expectancy",
    "description": "Rename column year 2 to life expectancy"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "year 1",
    "newColumnName": "year",
    "description": "Rename column year 1 to year"
  }
]