Hacky Thursday

During Hacky Thursdays, we discuss big data technologies, run small workshops, and invite various Big Data actors to give talks about pretty much everything interesting.


Every Thursday, 6 pm

Haute école d’ingénierie et d’architecture Fribourg
Département d’Informatique
Boulevard de Pérolles 80 – Case postale 32
CH-1705 Fribourg


Join us in our HipChat room, or use this invitation link if you're not yet a member.

Projects

Here is a list of some projects developed or imagined during Hacky Thursdays (most of the source code is available on GitHub):

More details below.

Project details and other cool stuff

Highly available HTTPS

What do the following have in common:

etc.?

They all share the same wildcard SSL certificate. Right! But they also all follow the same path to reach their final destination.

The goal of this project is:

There are dozens of approaches to achieve this, each with its pros and cons. And for every approach, a dozen different pieces of software can do the job.

In the case of the DAPLAB, no servers have public IPs; they are all hidden behind a router on a private subnet, so we have to rely heavily on destination NAT. Don't be surprised to see port 443 being NAT'ed in this approach.
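For illustration, such a destination NAT rule on the router could look like the following (both IP addresses are placeholders, not the actual DAPLAB addressing):

# Hypothetical DNAT rule: forward public port 443 to the internal IP of
# the SSL frontend. Both addresses are placeholders.
iptables -t nat -A PREROUTING -p tcp -d 192.0.2.10 --dport 443 \
    -j DNAT --to-destination 10.10.0.100:443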

The approach implemented to meet these requirements includes the following components that will be detailed below.

HTTPS High Availability (architecture diagram)

SSL Termination

SSL termination is a way of handling the SSL overhead (handshake, encryption, etc.) in one place, with the terminating proxy then forwarding plain traffic to the destination endpoint. This should obviously be done exclusively in trusted environments.

For this implementation, we used Nginx.

From Nginx, the clear (decrypted) traffic can be redirected virtually anywhere, including load balancing between several backend servers.
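As an illustration, a minimal Nginx configuration for SSL termination with load balancing could look like the following sketch (the hostname, certificate paths and backend addresses are placeholders, not the actual DAPLAB config):

# Hypothetical SSL termination vhost; all names and addresses are placeholders.
upstream backends {
    server 10.10.0.11:8080;
    server 10.10.0.12:8080;
}

server {
    listen 443 ssl;
    server_name example.daplab.ch;

    # The shared wildcard certificate mentioned above
    ssl_certificate     /etc/nginx/ssl/wildcard.daplab.ch.crt;
    ssl_certificate_key /etc/nginx/ssl/wildcard.daplab.ch.key;

    location / {
        # Traffic is decrypted here and forwarded in clear to the backends
        proxy_pass http://backends;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}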

Floating IP

A floating IP is an IP shared between two or more servers, but active on only one at a time. The servers run a sort of coordination protocol between them to decide which one is active, and the active one replies to ARP requests for that IP. When the active server goes down and another one detects it, the latter starts claiming ownership of the IP and starts receiving the traffic.

This is a fairly easy technique to achieve high availability on an IP address.

We used the de facto standard keepalived service for that purpose.
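A minimal keepalived sketch for such a floating IP could look like this (interface name, router ID, priority and the virtual IP are placeholders, not our actual values):

# Hypothetical VRRP instance; the standby node would use state BACKUP
# and a lower priority. All values are placeholders.
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        # the floating IP shared between the frontends
        10.10.0.100/24
    }
}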

Deployment of a new endpoint

In order to deploy new endpoints, we rely on Ansible to push the new Nginx config as well as to add the new DNS entry to the zone, as sketched below.
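The actual playbook is not reproduced here, but a hypothetical sketch of the two steps could look like this (group, variable, template and zone file names are all illustrative):

# Hypothetical Ansible play; all names are placeholders.
- hosts: ssl_frontends
  become: yes
  tasks:
    - name: Push the Nginx vhost for the new endpoint
      template:
        src: vhost.conf.j2
        dest: /etc/nginx/conf.d/{{ endpoint_name }}.conf
      notify: reload nginx

    - name: Add the DNS entry to the zone file
      # simplified: a real zone update would also bump the SOA serial
      lineinfile:
        path: /var/named/daplab.ch.zone
        line: "{{ endpoint_name }} IN A {{ floating_ip }}"

  handlers:
    - name: reload nginx
      service:
        name: nginx
        state: reloaded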

Areas of improvement


Hive Auto Partitioner

When data ingestion is not designed from the ground up with Hive in mind, it can quickly become painful to manage partitions on top of the folders.

Think about the following use case: team A is ingesting data via their preferred path, and team B wants to create an external Hive table around the data ingested by team A. You might tell me that team A should only deal with Hive-ready data, and I would agree, but the reality is slightly different.

As a result, team B will start recursively polling team A's directories and creating partitions when new folders are discovered. Repeat that with several teams, and you end up with a significant number of HDFS RPC calls just for creating Hive partitions. Is it worth the price?

The answer is obviously no, and there is an awesome tool (disclaimer: I'm one of the authors of the tool, so I might be slightly biased :)) which solves this problem in an elegant way. The tool is called Trumpet and acts as a sort of inotify for HDFS. Instead of polling for file and directory creations or changes, you can subscribe to the event stream and get notified about the events you care about.

The idea of this project is to combine Trumpet and Hive to solve a real-life data engineering problem.
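Conceptually, reacting to a Trumpet "directory created" event boils down to registering the new folder as a partition. A minimal sketch of that reaction (the table name, partition column and path are illustrative, not the project's actual code):

# Hypothetical reaction to a new-folder event: register it as a Hive
# partition. Table, column and path are placeholders.
hive -e "ALTER TABLE team_a_events ADD IF NOT EXISTS PARTITION (day='2016-01-01') LOCATION '/data/team_a/2016-01-01';"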

The project is hosted on GitHub and all the implementation details are captured there:


Hive Queries in HipChat

ChatOps is huge these days [1]. The DAPLAB is no exception to that trend. Concretely, we're using HipChat during our Hacky Thursdays.

To integrate ChatOps with data, the idea of this project was to allow anyone to run Hive queries from our HipChat team room.

Demo time :)

See the GitHub page for all the details!

Pointers


Notebook

Here is a quick tutorial on how to create a Python Notebook.

Enable Python 2.7

module load python2.7

Launch your own notebook

jupyter notebook --no-browser --ip=localhost --port=1234

Tunnel to your notebook

ssh -L1234:localhost:1234 pubgw1.daplab.ch

Remember to adapt the port to whatever you chose above; only one user can use a given port at a time.

Access your notebook

Open in your browser http://localhost:1234

Bonus

Run your notebook in a screen session to make it resilient to network failures (i.e. your notebook won't be killed if your SSH connection drops):

screen -S "jupyter"
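If your connection drops, you can later reattach to the session with:

screen -r jupyter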

See the screen manual for more details on how to use screen.


Crawling the Internet with Nutch

Pointers

Get started

First, export some environment variables:

export HADOOP_CONF=/etc/hadoop/conf
export NUTCH_HOME=/usr/local/apache-nutch
export HADOOP_CLASSPATH=.:$NUTCH_HOME/conf:$NUTCH_HOME/runtime/local/lib
export PATH=$PATH:/usr/local/apache-nutch/runtime/deploy/bin
export NUTCH_CONF=/usr/local/apache-nutch/conf

Then, run nutch:

> nutch
> crawl urlsdir crawl 1
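Here, urlsdir is the directory containing the seed list of start URLs, crawl is the output crawl directory, and 1 is the number of crawl rounds. Since the deploy scripts are on the PATH, the job presumably runs on the cluster, so the seeds live in HDFS; a sketch of the setup (the URL is illustrative):

# Create a seed file with the start URLs and push it to HDFS (placeholder URL).
echo "https://www.example.com/" > seed.txt
hdfs dfs -mkdir urlsdir
hdfs dfs -put seed.txt urlsdir/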

SwissSimilarities

If you've ever thought about wiring Spark and Cassandra together in a real-life project, this is for you. We process 1 billion molecules stored in a Cassandra cluster via Spark. And it works great!
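As a hedged sketch of how the two are typically wired, Spark can be pointed at Cassandra through the DataStax connector (the connector version and Cassandra host below are placeholders, not the project's actual setup):

# Launch Spark with the spark-cassandra-connector; version and host are
# placeholders.
spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.5 \
    --conf spark.cassandra.connection.host=cassandra.example.com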

Pointers


Zefix Notifier

In this project, we'll ingest data from the Central Business Names Index of Switzerland (Zentraler Firmenindex, or Zefix for short). We'll also let any user enter some keywords, which we'll match against the newly ingested data, notifying them (via email or callbacks) when a match is found.

Zefix

Pointers