Integration of PrestoDB as a multi-repository query engine

Introduction

In the 3.1.0 release of the Platform, we integrated Presto+MinIO to support DataLake-type storage for migration scenarios from Hadoop.

In this release, we are working to support Presto as a multi-repository SQL query engine, which will allow us to make analytical queries on all Platform Entities/Ontologies regardless of the repository where they are stored (this, for example, will allow us to do JOINs between a PostgreSQL and a MongoDB , or between a MinIO and an Oracle).

It is important to bear in mind that Presto is an analytical SQL query engine capable of merging other repositories and that its deployment can require a lot of memory depending on its use.

How will we support it in Platform?

On the one hand, we can create a new type of Entity: Presto Entity:

that will allow users to connect to the different catalogs registered in Presto by the Platform administrator, creating PRESTO entities.

Once the PRESTO entities are created, we can make JOINS between them, transparently to the repository.

These Presto entities are handled like the rest of the Platform Entities, being able to create dashboards on them, ingest data, publish them as REST API, …

We will also integrate the Presto UI accessible to administrator users, which allows viewing the queries executed on Presto.

ejecutadas sobre Presto

What is Presto?

PrestoDB is an open-source distributed SQL query engine built in Java, designed to launch interactive analytical queries against a large number of data sources (via connectors), supporting queries on data sources ranging from gigabytes to petabytes.

Presto is an ANSI-SQL query engine, allowing to query and manipulate data in any connected data source with the same SQL statements, functions, and operators.

 

PrestoDB was created in 2012 at Facebook. where it was initially created to solve HIVE's slowness when accessing a 300 PB data-warehouse. To solve this problem, an SQL-based MPP engine was built that would be easy to use from existing knowledge, easy to connect to any database, warehouse, or datalake, and easy to integrate with any BI tool.

What can we do?

Presto allows you to query data about your source, including among others Hive connectors, Cassandra, Kafka relational databases, Kudu, Redis, and MongoDB. A single Presto query can combine data from multiple sources, allowing for multi-store analytics.

Presto is focused on analytical queries that expect response times ranging from less than a second, to minutes.

It offers a command line to make the queries:

Connectors

Presto offers the connectors available in Presto to access data from different data sources, you can see the list here: https://prestodb.io/docs/current/connector.html

Among them: Accumulo connector, Cassandra, Druid, Elasticsearch, HIVE, JMX, Kafka, Kudu, local files, MongoDB, MySQL, Oracle, Postgresql, Redis, Redshift, SQL Server,...

JDBC driver

Presto offers a JDBC driver that allows access to the underlying data sources from any application that uses the driver.

Presto Web UI

Presto provides a web interface to monitor and manage queries. The web interface is accessible in Presto’s coordinator through HTTP.

The UI will indicate, for each query, its status:

Â