Data Refiner Use Workshop

Description

In order to incorporate new Data Governance capabilities in the Platform, a new module has been included. It is called DataRefiner, and it is accessible from the Data Governance menu entry.



This component' goal is to "refine" the information that is loaded or extracted from platform. For this purpose, it allows:

  • An end user to load from a UI data from different places, such as from their own PC, from a URL or from information residing on the platform itself.

  • The tool allows data to be loaded in the main formats, including Excel, XML, JSON, CSV,...

  • The user can work with these data with an "Excel-like" interface to perform data profiling, including data cleansing, enhancement, restructuring or reconciliation.

  • The "refined" data can be downloaded as files or uploaded to the platform as an Entity.

Technology on which DataRefiner is based

Our DataRefiner is based on the OpenRefine software, to which a set of extensions have been added to work with the platform.

OpenRefine is an open-source Java tool (BSD-3 license) based on Java, that, with an Excel-style web interface, allows you to load data from different sites and in different formats, and to understand, clean, reconcile and improve them.

First of all, bear in mind that the concept of OpenRefine is that you can do the transformations from your own computer, but instead of using an application, you do it from your browser.

OpenRefine can be found in github: OpenRefine

To find more information, you can access its wiki: https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users

and the user manual

OpenRefine user manual

Usage from the user interface

Example 1 Ingest files, external url

Ingest files by user interface from an external url.

Example url: https://www.iwight.com/documentlibrary/download/transparency-data-april-2013-csv

Example 2 Ingest local files

Ingest files by user interface from the PC.

 

Example 3 simple transformation

We transform the Amount column to number type.

Edit cells>Common transforms>to number

Some of the records have not transformed well, so I will now apply a Facet by selecting in the amount column:

Facet>Numeric Facet

We see that there are elements that have not been transformed because there is a comma in the thousands separators.

So I can transform it with the tool and its language GREL (Google Refine Expression Language). I will go to the Amount column, select Edit cells>Transform and in the transformation I will indicate value.replace(',','').

We repeat:

Edit cells>Common transforms>to number

and we see that they are all in numerical format.

Example 4 more advanced transformations

I can apply other more advanced transformations, such as deriving a column from another column by applying transformations.

To do that, I will go to the Date column, select Edit column>Add column based on this column.

In this one, I will apply the date format transformation:

value.toDate('dd.mm.yy').toString('yyyy-mm-dd')

And then I will apply: Edit cells>Commons Transformations>To date

NOTE

We will use Extract... to obtain the history of changes made on the project to be able to apply them in other projects with the same structure, and also to be able to use them in the rest api that we can invoke from the platform's DataRefiner.

Example 5 Exporting data to different formats

We will show how to export data to different file formats.-

Example 6 Exporting data to the platform

Export data to the platform by creating a new ontology.

The token is filled with the oauth token. We mark new ontology and give it a name.

Example 7 Pivot data

 

We create a project with the file.

I will select the Country Name column and select Transpose>Transpose cells across columns into rows

We separate the year column into two columns.

Edit column>Split into several columns

We rename the columns to “year” and “life expectancy”.

Example 8 Importing data from a connection with onesaitplatform

 

Connection form

Use Proxy, Proxy y ProxyPort: To be filled in if a proxy is to be used.

Name: It will be the name of the connection.

Host: The Host of the platform, in this case we are indicating lab.onesaitplatform.com

User: Onesait platform user to connect to in order to have access to its resources.

Vertical: Must be filled in if verticals are used.

Password: User password to access the platform.

 

We will connect and the query editor will appear.

We will obtain the data of the example entity, Helsinkipopulation

Depending on the query we will get the data in one way or another.

Select * from…

Select Helsinki from…

 

Refine files from the DataRefiner api

We can perform the same operations that we do from the user interface, for which we only have to fill in the operations field with the steps that we can obtain in a simple way by pressing the Extract button... from the user interface. We will show some examples:

Example 1 use from api

We will refine the file that we have used for the pivot example, adding the operations we did before

{ "op": "core/transpose-columns-into-rows", "combinedColumnName": "year", "startColumnName": "2005", "columnCount": 5, "ignoreBlankCells": true, "fillDown": true, "prependColumnName": true, "separator": ":", "keyColumnName": null, "valueColumnName": null, "description": "Transpose cells in 5 column(s) starting with 2005 into rows in one new column named year" }, { "op": "core/column-split", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "year", "guessCellType": true, "removeOriginalColumn": true, "mode": "separator", "separator": ":", "regex": false, "maxColumns": 0, "description": "Split column year by separator" }, { "op": "core/column-rename", "oldColumnName": "year 2", "newColumnName": "life expectancy", "description": "Rename column year 2 to life expectancy" }, { "op": "core/column-rename", "oldColumnName": "year 1", "newColumnName": "year", "description": "Rename column year 1 to year" }

and selecting, as exportype, html for example.