DAPLAB Technical Documentation

DAPLAB, short for Data Analytics and Processing Lab, is a place for learning and sharing knowledge about Hadoop and related technologies, including HDFS, YARN, Kafka, Spark, and Cassandra.

In vain have you acquired knowledge if you have not imparted it to others.
-- Deuteronomy Rabbah (c.900, commentary on the Book of Deuteronomy)

In this documentation, you'll find plenty of training material and examples, as well as crunchy infrastructure details. Feel free to draw inspiration from it, send us your remarks and comments, or even help improve the documentation by submitting pull requests to the GitHub project that hosts it.

The DAPLAB platform is currently running HDP 2.3.x based on Hadoop 2.7.x, with Ambari 2.2.x for management.

Commodity Hardware

Access to the DAPLAB Platform

The DAPLAB cluster follows a typical Hadoop deployment: it provides gateways and web interfaces as endpoints for interacting with the Hadoop components, but no direct access to the servers running those components. See the architecture page for more details.
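As an illustration of what talking to a Hadoop component through an HTTP endpoint can look like, here is a minimal sketch that builds a WebHDFS REST URL. WebHDFS is the standard HDFS REST API, but this page does not state whether or where DAPLAB exposes it, so the host name `gateway.example.org`, the port `50070`, and the path `/user/alice` are assumptions made purely for the example. The helper `webhdfs_url` is hypothetical and only constructs the URL; it does not contact any server.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation.

    The URL layout (/webhdfs/v1/<path>?op=<OPERATION>) follows the WebHDFS
    REST API; host and port are whatever endpoint the cluster exposes.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # Percent-encode the path but keep the slashes that separate directories.
    encoded_path = quote(path)
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{encoded_path}?{query}"


# Hypothetical endpoint and path (assumptions, not actual DAPLAB values).
url = webhdfs_url("gateway.example.org", 50070, "/user/alice", "LISTSTATUS")
print(url)
```

The resulting URL could then be requested with any HTTP client; depending on how the cluster is secured, authentication parameters or headers may also be required.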

Web Interfaces

Programmatic Interfaces

Audience

You don't need a Ph.D. in science to be interested in data or to bring a valuable perspective to it. Indeed, when searching for the needle in the data haystack, the wider the background and the broader the perspective, the better.

DAPLAB embraces this reality and is thus open to everyone; the only requirement is to have a computer :).

We also meet every Thursday evening to hack on data and discuss various Hadoop-related technologies. Feel free to join us!

Tutorials

This documentation puts a strong focus on keeping the training material up to date. The training material breaks down into three main categories.

Please navigate through the links in the "Tutorial" tab at the very top of this page, or go to the page referencing all the tutorials.