During Hacky Thursdays, we discuss big data technologies, run small workshops, and invite various Big Data actors to give talks about pretty much everything interesting.
Here is a list of some projects developed or imagined during Hacky Thursdays (most of the source code is available on GitHub):
- Highly available HTTPS: deploy a new HTTPS endpoint in less than 30 seconds using open-source software;
- Hive Auto Partitioner: combine Trumpet and Hive to solve a real-life Data Engineer problem;
- Hive Queries in HipChat: allow anyone to run Hive queries from a HipChat team room;
- Notebook: a quick tutorial on how to set up a Python Notebook;
- Nutch crawler: a production-ready Web Crawler;
- SwissSim: an application to process 1 billion molecules stored in a Cassandra cluster via Spark;
- Zefix Notifier: ingestion of data from the Central Business Names Index of Switzerland.
More details below.
Project details and other cool stuff
Highly available HTTPS
What is the common point between
They all share the same wildcard SSL certificate. Right! But they also all follow the same path to reach their final destination.
The goal of this project is:
- Have (valid) SSL everywhere to avoid writing documentation full of
- Deploy a new HTTPS endpoint in less than 30 seconds
- Be highly available all the way to the destination endpoint (the HA of the endpoint itself is not covered here)
- Use open-source software, of course.
There are dozens of approaches to achieve this, each with its pros and cons. And for every approach, a dozen different pieces of software can do the job.
In the case of the DAPLAB, no servers have public IPs; they are all hidden behind a router on a private subnet, so we have to rely heavily on destination NAT. Don't be surprised to see port 443 being NAT'ed in the approach below.
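On a typical Linux router, such a destination NAT rule could be sketched as follows (the internal address of the SSL terminator is a made-up example, not the actual DAPLAB setup):

```shell
# Hypothetical router rule: forward all incoming HTTPS traffic
# to the internal SSL terminator at 10.0.0.10 (illustrative address).
iptables -t nat -A PREROUTING -p tcp --dport 443 \
  -j DNAT --to-destination 10.0.0.10:443
```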
The approach implemented to meet these requirements includes the following components, detailed below:
- SSL termination
- A floating IP, managed via VRRP
- A DNS failover
SSL termination is a way of handling the SSL overhead (handshake, encryption, etc.) in one place, with the proxy then forwarding plain traffic to the destination endpoint. This should obviously be done exclusively in trusted environments.
For this implementation, we used Nginx.
From Nginx, the decrypted traffic can be redirected virtually anywhere, including load balancing across several backend servers.
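A minimal Nginx server block for such an SSL termination might look like this (the hostname, certificate paths, and backend address are placeholders, not the actual DAPLAB configuration):

```nginx
server {
    listen 443 ssl;
    server_name example.daplab.ch;

    # wildcard certificate shared by all endpoints
    ssl_certificate     /etc/nginx/ssl/wildcard.daplab.ch.crt;
    ssl_certificate_key /etc/nginx/ssl/wildcard.daplab.ch.key;

    location / {
        # forward the decrypted traffic to the destination endpoint
        proxy_pass http://10.0.0.42:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}
```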
A floating IP is an IP shared between two or more servers, but active on only one at a time. The servers run a kind of coordination protocol to decide which one is active, and the active one replies to ARP requests for that IP. When the active server goes down and another one detects it, the latter starts claiming ownership of the IP and starts receiving the traffic.
This is a fairly easy technique to achieve high availability on an IP address.
We used the de facto standard keepalived service for that purpose.
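A keepalived configuration implementing the floating IP could be sketched like this (interface name, VIP, and priorities are made up for illustration):

```
vrrp_instance VI_1 {
    state MASTER            # BACKUP on the second node
    interface eth0
    virtual_router_id 51
    priority 101            # use a lower value (e.g. 100) on the backup
    advert_int 1
    virtual_ipaddress {
        10.0.0.10/24        # the floating IP claimed by the active node
    }
}
```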
Deployment of a new endpoint
In order to deploy new endpoints, we rely on Ansible to push the new Nginx config as well as to add the new DNS entry to the zone.
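As a sketch of what this automation might look like, assuming hypothetical group, file, and variable names (this is not the actual DAPLAB playbook):

```yaml
# deploy_endpoint.yml -- hypothetical playbook; names and paths are illustrative
- hosts: ssl_terminators
  vars:
    endpoint_name: myapp          # becomes myapp.daplab.ch
    backend: 10.0.0.42:8080
  tasks:
    - name: Push the Nginx vhost for the new endpoint
      template:
        src: vhost.conf.j2
        dest: "/etc/nginx/conf.d/{{ endpoint_name }}.conf"
      notify: reload nginx
  handlers:
    - name: reload nginx
      service:
        name: nginx
        state: reloaded

- hosts: dns_servers
  tasks:
    - name: Add the DNS A record pointing at the floating IP
      lineinfile:
        path: /etc/bind/db.daplab.ch
        line: "myapp IN A 10.0.0.10"
```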
Areas of improvement
- IP source logging at the destination endpoint. As of today, the destination endpoint only sees the IP address of the SSL termination, which makes it a bit awkward to count unique client IPs :)
Hive Auto Partitioner
When data ingestion is not designed from the ground up with Hive in mind, it can quickly become painful to manage partitions on top of the folders.
Think about the following use case: team A is ingesting data via their preferred path, and team B wants to create an external Hive table around the data ingested by team A. You might tell me that team A should only deal with Hive-ready data, and I would agree, but the reality is slightly different.
As a result, team B will start recursively polling team A's directories and creating partitions whenever new folders are discovered. Repeat that multiple times with several teams, and you end up with a significant amount of HDFS RPC calls just for creating Hive partitions. Is it worth the price?
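Each newly discovered folder typically translates into one more metastore call; as a hedged illustration (table, partition key, and path are made up):

```sql
-- Hypothetical: register team A's newly discovered folder as a partition
ALTER TABLE team_b.events
ADD IF NOT EXISTS PARTITION (day='2015-10-01')
LOCATION '/data/team_a/events/2015-10-01';
```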
The answer is obviously no, and there is an awesome tool (disclaimer: I'm one of the authors of the tool, so I might be slightly biased :)) which solves this problem in an elegant way. The tool is called Trumpet and acts as a sort of iNotify for HDFS. Instead of polling for file and directory creations or changes, you can subscribe to the event stream and get notified about the events you're interested in.
The idea of this project is to combine Trumpet and Hive to solve a real-life Data Engineer problem.
The project is hosted on GitHub, and all the implementation details are captured there:
Hive Queries in HipChat
See the GitHub page for all the details!
- Github repo: https://github.com/daplab/HiveQLBot
Python Notebook
Here is a quick tutorial on how to create a Python Notebook.
Enable python 2.7
module load python2.7
Launch your own notebook
jupyter notebook --no-browser --ip=localhost --port=1234
Tunnel to your notebook
ssh -L1234:localhost:1234 pubgw1.daplab.ch
Mind adapting the port to what you set before; only one user can use a given port at a time.
Access your notebook
Open in your browser http://localhost:1234
Run your notebook in a screen session to make it resilient to network failures (i.e. your notebook won't be killed if your SSH connection drops):
screen -S "jupyter"
See the screen manual for more details on how to use it.
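Putting it together, a resilient session could be started like this (reusing the session name and port from the steps above):

```shell
screen -S jupyter                 # create a named screen session
# inside the session, launch the notebook:
module load python2.7
jupyter notebook --no-browser --ip=localhost --port=1234
# detach with Ctrl-a d; reattach later with:
screen -r jupyter
```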
Crawling Internet with Nutch
First, export some environment variables:
export HADOOP_CONF=/etc/hadoop/conf
export NUTCH_HOME=/usr/local/apache-nutch
export HADOOP_CLASSPATH=.:$NUTCH_HOME/conf:$NUTCH_HOME/runtime/local/lib
export PATH=$PATH:/usr/local/apache-nutch/runtime/deploy/bin
export NUTCH_CONF=/usr/local/apache-nutch/conf
Then, run nutch:
crawl urlsdir crawl 1
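The crawl script expects the seed list to exist in HDFS before the first round; as a hedged example (seed URL and directory names are illustrative):

```shell
# Hypothetical seed setup: urlsdir holds the seed list, crawl is the output dir
echo "http://www.example.com/" > seed.txt
hdfs dfs -mkdir -p urlsdir
hdfs dfs -put seed.txt urlsdir/
crawl urlsdir crawl 1             # 1 = number of crawl rounds
```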
SwissSim
If you ever thought about wiring Spark and Cassandra together in a real-life project, this one is for you. We process 1 billion molecules stored in a Cassandra cluster via Spark. And it works great!
- Github repo: https://github.com/daplab/swisssim
- Detailed presentation: https://github.com/daplab/swisssim/raw/master/SwissSimilarity.pdf
Zefix Notifier
In this project, we ingest data from the Central Business Names Index of Switzerland (Zentraler Firmenindex, or Zefix for short). We also let any user enter keywords, which we match against the newly ingested data, notifying them (via email or callbacks) when a match is found.
- Github repo: https://github.com/daplab/zefix-notifier
- Detailed presentation: http://daplab.ch/wp-content/uploads/2015/10/Zefix.pptx