DAPLAB architecture

In this page, we review some of the principal software components of the DAPLAB Big Data platform, then give more information about the hardware and architecture of the DAPLAB cluster dedicated to Big Data analysis.

The DAPLAB team is currently working on an OpenStack project. More about this and the hardware dedicated to it at the bottom of the page.

Counting the OpenStack-dedicated hardware, the DAPLAB has 1,026 cores, 5.7 TB of RAM and 575 TB of storage!

HDP - Hortonworks Data Platform

The Hadoop ecosystem being especially complex and constantly changing, DAPLAB uses Hortonworks as its Hadoop provider. HDP is a secure, enterprise-ready, open-source Apache™ Hadoop® distribution based on a centralized architecture (YARN).

In theory, DAPLAB can run any tool developed for Hadoop. In practice, the frameworks already installed, as well as the limitations imposed by HDP, must be taken into account.

The HDP architecture is presented in the figure below.

[Figure: HDP data platform]

As of August 2017, the DAPLAB platform is running HDP 2.6.x based on Hadoop 2.7.x, with Ambari 2.5.x for management.

Hadoop Distributed File System (HDFS) and YARN are the cornerstone components of Hortonworks Data Platform (HDP). HDFS is a distributed file system providing a scalable, fault-tolerant and cost-efficient storage for the data. YARN, Yet Another Resource Negotiator, is a large-scale, distributed operating system for big data applications. It decouples MapReduce's resource management and scheduling capabilities from the data processing components, enabling Hadoop to support more varied processing approaches and a broader array of applications.
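
To make HDFS a little less abstract, here is a minimal sketch of a client session using the third-party `hdfs` Python package (a WebHDFS client). The NameNode URL, username and paths are illustrative assumptions, not DAPLAB specifics:

```python
# Minimal HDFS session using the `hdfs` package (pip install hdfs).
# The NameNode URL, user and paths below are placeholders.
from hdfs import InsecureClient

client = InsecureClient('http://namenode.example.com:50070', user='alice')

# List a directory, like `hdfs dfs -ls /user/alice` would.
for entry in client.list('/user/alice'):
    print(entry)

# Write a small file to HDFS, then read it back.
with client.write('/user/alice/hello.txt', encoding='utf-8', overwrite=True) as writer:
    writer.write('hello from WebHDFS\n')

with client.read('/user/alice/hello.txt', encoding='utf-8') as reader:
    print(reader.read())
```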

On top of YARN, it is possible to plug in a versatile range of processing engines, allowing multiple processes to interact with the same data in multiple ways at the same time. Most of the tools visible in the figure are discussed in more depth in our tutorials and other resources.
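
As an illustration of one such engine, here is a hedged PySpark sketch of a word count over a file in HDFS; submitted with `spark-submit --master yarn`, it runs as a YARN application alongside other workloads reading the same data. The path and application name are assumptions:

```python
# Word count over an HDFS file with PySpark; paths are illustrative.
# Submit with `spark-submit --master yarn wordcount.py` to run on YARN.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('wordcount-sketch')
sc = SparkContext(conf=conf)

counts = (sc.textFile('hdfs:///user/alice/hello.txt')  # same data Hive or Pig could scan
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, n in counts.collect():
    print(word, n)

sc.stop()
```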

Ambari is a graphical interface to provision, manage and monitor Hadoop clusters easily, while ZooKeeper provides distributed synchronization (for maintaining configuration information and naming) through a centralized service.
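
To give a feel for what ZooKeeper's centralized service looks like to a client, here is a small sketch using the third-party `kazoo` Python library; the ensemble address and znode path are placeholders:

```python
# ZooKeeper client sketch using `kazoo` (pip install kazoo).
# The ensemble address and znode path are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1.example.com:2181')
zk.start()

# Publish a piece of shared configuration under a znode.
zk.ensure_path('/demo/config')
zk.set('/demo/config', b'max_workers=4')

# Any other process connected to the ensemble can read it back.
value, stat = zk.get('/demo/config')
print(value.decode(), stat.version)

zk.stop()
```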

About Hadoop 2

Most of the following information has been taken from the articles What is Apache Tez? and Hadoop 2 vs 1. We encourage you to read them for more in-depth explanations.

The major changes between Hadoop 1 and Hadoop 2 are the introduction of HDFS federation and of the YARN resource manager. There is also a new execution engine, Tez, which generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph.

[Figure: Hadoop 2 architecture]

In Hadoop 1, a single NameNode managed the entire namespace of a Hadoop cluster. With HDFS federation, multiple NameNode servers manage namespaces, which allows for horizontal scaling, performance improvements and multiple namespaces.

With YARN, Hadoop users are no longer limited to the I/O-intensive, high-latency MapReduce framework, but can run other processing models. YARN also simplifies resource management and brings significant performance improvements to Hadoop.

Tez is one of the new processing models of Hadoop 2. It allows developers to build end-user applications (such as Hive and Pig) with much better performance and flexibility. It gets around some limitations imposed by MapReduce and makes near-real-time query processing possible.
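
As a rough illustration (not the DAPLAB setup itself), the sketch below uses the third-party PyHive client to switch a Hive session onto the Tez engine before running a query; the host, port, user and the `events` table are placeholder assumptions:

```python
# Hive-on-Tez sketch using PyHive (pip install 'pyhive[hive]').
# Host, port, user and the `events` table are placeholders.
from pyhive import hive

conn = hive.connect(host='hive.example.com', port=10000, username='alice')
cursor = conn.cursor()

# Plan the query as a single Tez DAG rather than chained MapReduce jobs.
cursor.execute('SET hive.execution.engine=tez')
cursor.execute('SELECT category, COUNT(*) AS n FROM events GROUP BY category')
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```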


DAPLAB Software Versions

| Component         | Version |
|-------------------|---------|
| YARN              | 2.7.3   |
| HDFS              | 2.7.3   |
| MapReduce2        | 2.7.3   |
| Tez               | 0.7.0   |
| Hive              | 1.2.1   |
| Pig               | 0.16.0  |
| Sqoop             | 1.4.6   |
| ZooKeeper         | 3.4.6   |
| Ambari            | 2.5     |
| Apache Spark      | 1.6.3   |
| Apache Spark2     | 2.1.0   |
| Apache Kafka      | 0.10.1  |
| Apache Cassandra  | 3.11.0  |

DAPLAB Physical Architecture

The DAPLAB follows a typical data lake architecture.

[Figure: High-level architecture]

Users have SSH access to so-called Gateways, or Edge servers, and interact with Hadoop services from there. Users also have access to user interfaces such as Ambari or Hue. User Gateways have home directories mounted from a SAN via NFS, so that home folders are shared between gateways.
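
For scripted access, something along these lines is possible with the third-party `paramiko` SSH library; the gateway hostname and username are placeholders, and key-based authentication is assumed to be configured:

```python
# Scripted gateway access with `paramiko` (pip install paramiko).
# Hostname and username are placeholders; SSH keys are assumed set up.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('gateway.example.com', username='alice')

# Hadoop client tools live on the gateway, so commands run there.
stdin, stdout, stderr = client.exec_command('hdfs dfs -ls /user/alice')
print(stdout.read().decode())

client.close()
```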

Applications are either run as long-living YARN applications, or have dedicated gateways simply called Application Gateways. The rationale behind this is that User Gateways tend to have an unpredictable workload, highly correlated with users' moods. Applications may require a more predictable workload and are thus kept separate. Note that these Application and User Gateways would be good candidates to run on VMs.

The servers running Hadoop & friends are split into two categories with two distinct hardware configurations: storage and non-storage.

The storage nodes usually contain 12 or more spinning 2 TB drives. They have reasonably small SSDs for the OS, dual E5-XXXX v3 CPUs and 128 GB of RAM. They are used as HDFS DataNodes and YARN NodeManagers, with some additional services as well. We also bond two of their NICs to achieve 20 Gb/s of bandwidth between the servers!

The second type of server, the non-storage nodes, have 2 to 4 bigger SSDs and are intended to run master processes such as ZooKeeper, the NameNode and the ResourceManager, as well as Kafka brokers.

A third kind of server, the Utility servers, primarily hosts the services required to run the Hadoop platform, such as DHCP, DNS, Kerberos and LDAP. These servers also run PostgreSQL in master/slave mode to store the Ambari, Hive, Hue and Oozie databases.


DAPLAB Hardware

DAPLAB hardware dedicated to Big Data:
    ➙   Total cores: 402
    ➙   Total disk: 527 TB
    ➙   Total RAM: 3.196 TB

DAPLAB currently uses several hardware families; below is a summary of their specifications.

Storage nodes

| # machines | Hardware                  | CPU                             | RAM (GB) | Disk capacity (TB) |
|------------|---------------------------|---------------------------------|----------|--------------------|
| 6          | SuperServer 6017R-73THDP+ | Intel Xeon E5-2620 v3 @ 2.1 GHz | 128      | 24                 |
| 12         | SuperServer F618H6-FTPTL+ | Intel Xeon E5-2630 v3 @ 2.4 GHz | 128      | 24                 |

Total: 2.3 TB of RAM, 432 TB of storage.


Non-storage nodes

| # machines | Hardware                | CPU                             | RAM (GB) | Disk capacity (TB) |
|------------|-------------------------|---------------------------------|----------|--------------------|
| 5          | PowerEdge 2950          | Intel Xeon 5XXXX                | 32       | 2.4 - 4.2          |
| 4          | SuperServer 6028TP-HTTR | Intel Xeon E5-2630 v3 @ 2.4 GHz | 128      | 4.8                |

Total: 672 GB of RAM, 28.6 TB of storage.


Utility servers

| # machines | Hardware       | CPU                             | RAM (GB) | Disk capacity (TB) |
|------------|----------------|---------------------------------|----------|--------------------|
| 2          | PowerEdge R430 | Intel Xeon E5-2630 v3 @ 2.4 GHz | 32       | 8                  |
| 5          | PowerEdge 2950 | Intel Xeon 5XXXX                | 32       | 0.7 - 2.4          |

Total: 224 GB of RAM, 7,361 GB of storage.


In total, the DAPLAB has 402 cores, 527 TB of disk space and about 3.2 TB of RAM solely dedicated to Big Data. Moreover, some machines are reserved for projects such as the OpenStack infrastructure. Counting those, we have 1,026 cores and 5.7 TB of RAM in our fleet.

OpenStack

EPFL (École Polytechnique Fédérale de Lausanne) gave us 13 machines that we are currently using to set up an OpenStack infrastructure. The goal is to provide an IaaS (Infrastructure-as-a-Service) with virtual servers and resources able to communicate with our Hadoop platform.

OpenStack hardware

| # machines | Hardware            | CPU                                       | RAM (GB) | Disk capacity (TB) |
|------------|---------------------|-------------------------------------------|----------|--------------------|
| 13         | IBM System x3755 M3 | 4 × AMD Opteron 6176 (12 cores) @ 2.3 GHz | 192      | 2.25 - 6.25        |

Total: 624 cores, 2.5 TB of RAM, 47.65 TB of storage.