In the 2000s, Hadoop became the standard solution for building DataLakes, because it made it possible to build local clusters from basic hardware and to store and process this new data cheaply.
But the open-source world has continued to evolve, and today it is very difficult to achieve with Hadoop the elasticity, simplicity and agility in provisioning that Kubernetes-based solutions offer.
Platform proposes a DataLake solution based on MinIO and Presto.
On one hand, we have MinIO, a distributed object store that implements the AWS S3 API. MinIO can be deployed on premises and runs on top of Kubernetes, which currently makes it an interesting alternative to HDFS-based environments.
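Because MinIO implements the S3 API, an existing S3 client can usually be pointed at it by changing little more than the endpoint. The following is a minimal sketch, assuming the third-party boto3 library is installed. The endpoint URL, credentials, region, and helper names are placeholders for illustration, not values defined by the platform.

```python
def minio_client_kwargs(endpoint_url, access_key, secret_key):
    """Build keyword arguments for boto3.client("s3", ...) aimed at MinIO.

    Compared with a plain AWS S3 setup, the main difference is endpoint_url.
    """
    return {
        "endpoint_url": endpoint_url,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        # Assumption: the MinIO deployment keeps its default region name.
        "region_name": "us-east-1",
    }


def list_object_keys(bucket, **client_kwargs):
    """List object keys in a bucket (needs boto3 and a reachable MinIO)."""
    import boto3  # third-party; imported lazily so the helper above has no dependency

    s3 = boto3.client("s3", **client_kwargs)
    # A single call returns at most 1000 keys; larger buckets need pagination.
    response = s3.list_objects_v2(Bucket=bucket)
    # "Contents" is absent from the response when the bucket is empty.
    return [obj["Key"] for obj in response.get("Contents", [])]


# Hypothetical endpoint and credentials for an on-premises MinIO deployment.
kwargs = minio_client_kwargs(
    "http://minio.example.local:9000", "ACCESS_KEY", "SECRET_KEY"
)
print(sorted(kwargs))
```

With a running MinIO, `list_object_keys("datalake", **kwargs)` would return the keys stored in a (hypothetical) bucket named `datalake`.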
On the other hand, for our DataLake implementation we propose Presto, an open-source distributed SQL query engine written in Java. It is designed to run interactive analytical queries against many kinds of data sources (through connectors), over data volumes ranging from gigabytes to petabytes.
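Presto reaches each data source through a catalog file that names a connector and its settings. For S3-compatible object storage, the connector commonly used is the Hive connector, which needs a metastore for table metadata but not Hive's own query engine. The sketch below renders such a catalog file. The property names come from the Presto Hive connector; the URIs, credentials, function name, and the choice to disable SSL are assumptions made for illustration.

```python
def hive_catalog_properties(metastore_uri, s3_endpoint, access_key, secret_key):
    """Render the text of a Presto catalog file (e.g. etc/catalog/hive.properties)."""
    props = {
        "connector.name": "hive-hadoop2",
        "hive.metastore.uri": metastore_uri,
        "hive.s3.endpoint": s3_endpoint,
        "hive.s3.aws-access-key": access_key,
        "hive.s3.aws-secret-key": secret_key,
        # MinIO is typically addressed as host/bucket rather than bucket.host.
        "hive.s3.path-style-access": "true",
        # Assumption: the endpoint is plain HTTP inside the cluster.
        "hive.s3.ssl.enabled": "false",
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Hypothetical service addresses inside a Kubernetes cluster.
print(
    hive_catalog_properties(
        "thrift://metastore.example.local:9083",
        "http://minio.example.local:9000",
        "ACCESS_KEY",
        "SECRET_KEY",
    )
)
```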
In our case, Presto is the query engine for the data stored in MinIO: instead of setting up Hive to run SQL queries on data stored in HDFS, you use Presto to query the data stored in MinIO.
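To make this concrete, the sketch below builds the SQL that exposes a folder of files in MinIO as a Presto table and then queries it. It assumes a Presto catalog named `hive` that is already configured against MinIO, an existing schema `lake`, and the third-party presto-python-client package (`prestodb`) for execution. The bucket, table, column, host, and user names are all hypothetical.

```python
def external_table_ddl(catalog, schema, table, columns, location, fmt="PARQUET"):
    """Build a CREATE TABLE statement over files that already sit in object storage."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table} (\n  {cols}\n)\n"
        f"WITH (external_location = '{location}', format = '{fmt}')"
    )


def run_query(sql, host, port=8080, user="analyst", catalog="hive", schema="lake"):
    """Run one statement on Presto (needs prestodb and a reachable coordinator)."""
    import prestodb.dbapi  # third-party; imported lazily so the builder has no dependency

    conn = prestodb.dbapi.connect(
        host=host, port=port, user=user, catalog=catalog, schema=schema
    )
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()


# Hypothetical table over Parquet files stored under s3a://datalake/events/.
ddl = external_table_ddl(
    "hive",
    "lake",
    "events",
    [("event_id", "bigint"), ("event_type", "varchar")],
    "s3a://datalake/events/",
)
query = "SELECT event_type, count(*) AS n FROM hive.lake.events GROUP BY event_type"
print(ddl)
print(query)
```

Against a live cluster, `run_query(ddl, "presto.example.local")` followed by `run_query(query, "presto.example.local")` would register the table and return the aggregated rows.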
Advantages of this approach
The combination is more elastic than the typical Hadoop setup. If you have ever had to add or remove nodes in a Hadoop cluster, you will know what we mean: it can be done, but it is not easy, whereas the same task is trivial in this architecture.
With Hadoop, adding storage means adding nodes, and every node also brings compute. So if you need more storage, you get more compute whether you need it or not. With the object storage architecture, if you need more compute you can add nodes to the Presto cluster and leave the storage as it is. Compute and storage are not just elastic, they are independently elastic. This is good, because your compute needs and your storage needs also vary independently.
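A small worked example shows what this coupling costs. The numbers below are purely illustrative (400 TB of data, 64 cores of query capacity, nodes offering 20 TB or 16 cores each), and the sketch ignores replication and erasure-coding overhead.

```python
import math


def coupled_nodes(storage_tb, cores, tb_per_node, cores_per_node):
    """Hadoop-style sizing: each node brings storage and compute together,
    so the cluster must be sized for whichever need is larger."""
    return max(
        math.ceil(storage_tb / tb_per_node),
        math.ceil(cores / cores_per_node),
    )


def decoupled_nodes(storage_tb, cores, tb_per_storage_node, cores_per_compute_node):
    """Object storage plus query engine: each tier is sized on its own.
    Returns (storage_nodes, compute_nodes)."""
    return (
        math.ceil(storage_tb / tb_per_storage_node),
        math.ceil(cores / cores_per_compute_node),
    )


# Illustrative workload: storage-heavy, with modest query capacity.
print(coupled_nodes(400, 64, 20, 16))    # 20 nodes, which carry 320 cores for a 64-core need
print(decoupled_nodes(400, 64, 20, 16))  # (20, 4): 20 storage nodes and 4 compute nodes
```

In the coupled case the storage requirement alone dictates 20 nodes, and the compute that comes with them sits mostly idle. In the decoupled case the compute tier can stay at 4 nodes and grow later without touching storage.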
Keeping a Hadoop cluster stable and reliable is a complex task. For example, upgrading a cluster usually means shutting it down, and upgrading continuously without downtime is complex.
This architecture reduces the total cost of ownership, since MinIO requires hardly any management and object storage is cheaper.