Creation of an Entity in a historical database
Since version 3.1.0-KickOff, Ontologies have been referred to as "Entities" in the Control Panel. This does not alter any functionality; the nomenclature has simply been changed for a better understanding of the concept.
Available since version 3.1.0-KickOff.
User Interface: 6.0.0-Vegas
Introduction
In the 2000s, Hadoop became the standard solution for building DataLakes, since it allowed building local clusters with commodity hardware to store and process data cheaply.
However, the open-source world has continued to evolve, and today it is very difficult with Hadoop to achieve the elasticity, simplicity and provisioning agility that Kubernetes-based solutions offer.
The Platform proposes a DataLake solution based on MinIO + Presto.
On one side we have MinIO, a distributed object storage system that implements the AWS S3 API. MinIO can be deployed on-premise and runs on top of Kubernetes, and it is currently an interesting alternative to HDFS-based environments.
For our DataLake implementation we propose Presto, an open-source distributed SQL query engine written in Java. It is designed to run interactive analytical queries against a large number of data sources (through connectors), on volumes ranging from gigabytes to petabytes.
In our case, Presto is the query engine for the data stored in MinIO: instead of setting up Hive to query data stored in HDFS with SQL, we use Presto to query the data stored in MinIO.
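To give an idea of what this looks like in practice, the sketch below shows a Presto query over a MinIO-backed table. The catalog, schema, table and field names are illustrative assumptions, not Platform defaults.

    -- Illustrative only: "minio" is assumed to be a Hive-connector catalog
    -- pointing at the S3-compatible MinIO endpoint.
    SELECT device_id, avg(temperature) AS avg_temperature
    FROM minio.historical.sensor_readings
    WHERE event_date >= '2024-01-01'
    GROUP BY device_id
    ORDER BY avg_temperature DESC;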
Advantages of this approach
The combination is more elastic than the typical Hadoop configuration, and if you have ever had to add or remove nodes in a Hadoop cluster, you will know what I mean. It can be done, but it is not easy, whereas the same task is trivial in this architecture.
With Hadoop, if you want more storage, you add more nodes (with compute included), so you end up with more compute whether you need it or not. With the object-storage architecture, if you need more compute you can add nodes to the Presto cluster and keep the storage as it is. Compute and storage are therefore not just elastic, they are independently elastic, which is good, because your compute and storage needs are also independently elastic.
Keeping a Hadoop cluster stable and reliable is a complex task: for example, upgrading a cluster usually means shutting it down, continuous upgrades are complex, and so on.
With this architecture we get a reduction in the total cost of ownership, since MinIO requires very little management and object storage is also cheaper.
Below we explain how to create an Entity of this type.
Steps
Create the Entity
From the Control Panel, navigate to the Main Concepts > My Entities menu.
This will take us to the list of available Entities. To create the Entity, click on the "+" button at the top right of the screen.
From the different types of Entities that can be created, select the "Creation Entity in Historical Database" option:
Fill in general information
This will open the Entity creation wizard, where we will have to enter some basic information:
Identification: the unique name with which to identify the Entity.
Meta-Information: tags to characterize the Entity, which will be used for filtering when searching.
Description: extended descriptive text of the Entity, such as its use, properties, characteristics, etc.
In addition, we have some more options to characterize the Entity:
Active: whether the Entity is active or blocked.
Public Entity: whether the Entity is public or private.
Once the Entity's general information and options have been defined, click on the "Continue" button to access the Entity's schema definition.
Define Data Models (JSON Schema)
Once we click on the "Continue" button, the form for the creation of the Entity on the historical database will be displayed.
Adding fields
As with other Entities, we must fill in the mandatory fields (name, meta-information and description), and then add, one by one, the fields we want our Entity to contain using the available interface:
File options
The next step is to select the format in which the data will be stored in the Entity, or the format of the file to be uploaded:
If no option is selected, the data will be stored by default in ORC format.
Likewise, if we want to store the data in CSV format, the escape, quote and separator characters must be indicated; if they are not, the default values will be used. This is very important when uploading a file, so that the data is readable by the engine.
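As a reference, and bearing in mind that the exact properties generated by the wizard may differ, in Presto's Hive connector these CSV options typically map to table properties such as the following. The table, field and bucket names are made up, and the Hive CSV format usually stores all columns as VARCHAR.

    -- Sketch only: names, location and properties are illustrative assumptions.
    CREATE TABLE historical.sensor_readings_csv (
        device_id   VARCHAR,
        temperature VARCHAR,
        event_date  VARCHAR
    )
    WITH (
        format            = 'CSV',
        csv_separator     = ',',
        csv_quote         = '"',
        csv_escape        = '\',
        external_location = 's3a://my-bucket/sensor_readings_csv/'
    );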
There is also the option to partition the data by selecting one or more fields of the Entity we want to create. These must be the last fields in the creation query and appear in the same order:
Once the data that apply to the Entity to be created have been filled in, click on the "Update SQL" button to generate the table-creation query, which can then be edited:
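For orientation, a generated query might look like the following sketch, written in Presto's Hive connector syntax (the names and the bucket location are invented, and the wizard's actual output may differ). Note that the partition field is declared last and is listed in the partitioned_by property:

    -- Sketch only: names, location and properties are illustrative assumptions.
    CREATE TABLE historical.sensor_readings (
        device_id   VARCHAR,
        temperature DOUBLE,
        event_date  VARCHAR   -- partition field, declared last
    )
    WITH (
        format            = 'ORC',
        partitioned_by    = ARRAY['event_date'],
        external_location = 's3a://my-bucket/sensor_readings/'
    );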
After this, you must generate the JSON schema that will allow you to create the entity on the platform, by clicking on the "Generate Schema" button:
When clicking on the "Create" button, if the Entity has been generated correctly, a message will appear allowing us to upload a file to the database:
This option is also available when editing the Entity, through the "Upload file to Entity" button.
Consulting the information
If we have imported data into the newly created Entity, we can use the Query Tool to check our data.
To do this, access the Tools > Query Tool menu.
In the window that appears, select the Entity whose information you want to consult, and execute the relevant query.
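For example, a simple check against the hypothetical Entity used in the earlier sketches could look like this (the Entity and field names are assumptions):

    -- Illustrative query; "sensor_readings" is the hypothetical Entity
    -- from the previous examples.
    SELECT *
    FROM sensor_readings
    WHERE event_date = '2024-01-01'
    LIMIT 100;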