Data Refiner Use Workshop
- 1 Description
- 2 Technology on which DataRefiner is based
- 3 Usage from the user interface
- 3.1 Example 1 Ingest files, external url
- 3.2 Example 2 Ingest local files
- 3.3 Example 3 simple transformation
- 3.4 Example 4 more advanced transformations
- 3.5 NOTE
- 3.6 Example 5 Exporting data to different formats
- 3.7 Example 6 Exporting data to the platform
- 3.8 Example 7 Pivot data
- 3.9 Example 8 Importing data from a connection with onesaitplatform
- 4 Refine files from the DataRefiner api
Description
In order to incorporate new Data Governance capabilities in the Platform, a new module has been included. It is called DataRefiner, and it is accessible from the Data Governance menu entry.
This component' goal is to "refine" the information that is loaded or extracted from platform. For this purpose, it allows:
An end user to load from a UI data from different places, such as from their own PC, from a URL or from information residing on the platform itself.
The tool allows data to be loaded in the main formats, including Excel, XML, JSON, CSV,...
The user can work with these data with an "Excel-like" interface to perform data profiling, including data cleansing, enhancement, restructuring or reconciliation.
The "refined" data can be downloaded as files or uploaded to the platform as an Entity.
Technology on which DataRefiner is based
Our DataRefiner is based on the OpenRefine software, to which a set of extensions have been added to work with the platform.
OpenRefine is an open-source Java tool (BSD-3 license) based on Java, that, with an Excel-style web interface, allows you to load data from different sites and in different formats, and to understand, clean, reconcile and improve them.
First of all, bear in mind that the concept of OpenRefine is that you can do the transformations from your own computer, but instead of using an application, you do it from your browser.
OpenRefine can be found in github: OpenRefine
To find more information, you can access its wiki: https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users
and the user manual
Usage from the user interface
Example 1 Ingest files, external url
Ingest files by user interface from an external url.
Example url: https://www.iwight.com/documentlibrary/download/transparency-data-april-2013-csv
Example 2 Ingest local files
Ingest files by user interface from the PC.
Example 3 simple transformation
We transform the Amount column to number type.
Edit cells>Common transforms>to number
Some of the records have not transformed well, so I will now apply a Facet by selecting in the amount column:
Facet>Numeric Facet
We see that there are elements that have not been transformed because there is a comma in the thousands separators.
So I can transform it with the tool and its language GREL (Google Refine Expression Language). I will go to the Amount column, select Edit cells>Transform and in the transformation I will indicate value.replace(',','').
We repeat:
Edit cells>Common transforms>to number
and we see that they are all in numerical format.
Example 4 more advanced transformations
I can apply other more advanced transformations, such as deriving a column from another column by applying transformations.
To do that, I will go to the Date column, select Edit column>Add column based on this column.
In this one, I will apply the date format transformation:
value.toDate('dd.mm.yy').toString('yyyy-mm-dd')
And then I will apply: Edit cells>Commons Transformations>To date
NOTE
We will use Extract... to obtain the history of changes made on the project to be able to apply them in other projects with the same structure, and also to be able to use them in the rest api that we can invoke from the platform's DataRefiner.
Example 5 Exporting data to different formats
We will show how to export data to different file formats.-
Example 6 Exporting data to the platform
Export data to the platform by creating a new ontology.
The token is filled with the oauth token. We mark new ontology and give it a name.
Example 7 Pivot data
We create a project with the file.
I will select the Country Name column and select Transpose>Transpose cells across columns into rows
We separate the year column into two columns.
Edit column>Split into several columns
We rename the columns to “year” and “life expectancy”.
Example 8 Importing data from a connection with onesaitplatform
Connection form
Use Proxy, Proxy y ProxyPort: To be filled in if a proxy is to be used.
Name: It will be the name of the connection.
Host: The Host of the platform, in this case we are indicating lab.onesaitplatform.com
User: Onesait platform user to connect to in order to have access to its resources.
Vertical: Must be filled in if verticals are used.
Password: User password to access the platform.
We will connect and the query editor will appear.
We will obtain the data of the example entity, Helsinkipopulation
Depending on the query we will get the data in one way or another.
Select * from…
Select Helsinki from…
Refine files from the DataRefiner api
We can perform the same operations that we do from the user interface, for which we only have to fill in the operations field with the steps that we can obtain in a simple way by pressing the Extract button... from the user interface. We will show some examples:
Example 1 use from api
We will refine the file that we have used for the pivot example, adding the operations we did before
{ "op": "core/transpose-columns-into-rows", "combinedColumnName": "year", "startColumnName": "2005", "columnCount": 5, "ignoreBlankCells": true, "fillDown": true, "prependColumnName": true, "separator": ":", "keyColumnName": null, "valueColumnName": null, "description": "Transpose cells in 5 column(s) starting with 2005 into rows in one new column named year" }, { "op": "core/column-split", "engineConfig": { "facets": [], "mode": "row-based" }, "columnName": "year", "guessCellType": true, "removeOriginalColumn": true, "mode": "separator", "separator": ":", "regex": false, "maxColumns": 0, "description": "Split column year by separator" }, { "op": "core/column-rename", "oldColumnName": "year 2", "newColumnName": "life expectancy", "description": "Rename column year 2 to life expectancy" }, { "op": "core/column-rename", "oldColumnName": "year 1", "newColumnName": "year", "description": "Rename column year 1 to year" }
and selecting, as exportype, html for example.