Apache Zeppelin is an online notebook that lets you interact with the DAPLAB cluster through many languages and technology backends. The interpreters currently available are listed in the table below.

Getting Started

Login

  1. Go to Zeppelin
  2. Log in using your DAPLAB credentials

Create a new notebook

On the home page or in the Notebook menu, select "Create new note". Once the notebook is open, give it a name.

Using slashes (/) in the notebook name automatically creates and/or moves the notebook into folders; for example, a notebook named "tutorials/my-note" ends up in a "tutorials" folder.

Basics

A notebook is made of cells, also called paragraphs. A cell has an interpreter that tells Zeppelin which language/backend to use to run it.

The interpreter is configured by writing %<interpreter name> at the top of the cell. Without it, Zeppelin uses the default interpreter, which you can configure by clicking the gear icon > Interpreters at the top right of the notebook (drag and drop to reorder them; the first one is the default).
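For example, a cell starting with %md is rendered with the Markdown interpreter (see the prefix table below):

%md
## A Markdown cell
Text in this cell is **rendered**, not executed.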

You can run cells by clicking the play icon on the right or by pressing Shift+Enter.

Many useful shortcuts exist in edit mode. Click the keyboard icon at the top right of a notebook to display them.

List of interpreter prefixes

Prefix                            Description
%spark, %spark2                   Spark with Scala
%spark.sql, %spark2.sql           Spark SQL syntax
%spark.dep, %spark2.dep           Load dependencies for use within Spark cells
%spark.pyspark, %spark2.pyspark   Spark with Python
%md                               Markdown cell
%sh                               Shell scripts
%jdbc(hive)                       Hive

Note: spark is Spark 1.6, spark2 is Spark 2.1.0.
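For example, a %sh cell can be used to inspect the dataset used later in this tutorial (assuming the HDFS client is available on the Zeppelin host):

%sh
# list the Batting dataset on HDFS
hdfs dfs -ls /shared/seanlahman/2011/Batting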

Batting with Spark

Let's reuse the Batting.csv example from the Hive and Pig tutorials.

Include Spark CSV

First, we need an external library to read the CSV easily. In the first cell, enter and run:

%dep
z.load("com.databricks:spark-csv_2.11:1.5.0")

%dep is used to manage dependencies. See the Zeppelin documentation for more information.
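Beyond z.load, the %dep interpreter can reset previously loaded artifacts and register additional Maven repositories. A minimal sketch (the repository URL is hypothetical):

%dep
z.reset()                                                  // forget artifacts loaded in earlier runs
z.addRepo("custom").url("https://repo.example.com/maven")  // hypothetical extra repository
z.load("com.databricks:spark-csv_2.11:1.5.0")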

If you run into the error:

Must be used before SparkInterpreter (%spark) initialized
Hint: put this paragraph before any Spark code and restart Zeppelin/Interpreter

Open the interpreter list (gear icon > Interpreters at the top right) and click the restart icon to the left of the spark2 interpreter.

Load Data

%spark2
val battingFile = "hdfs:///shared/seanlahman/2011/Batting/Batting.csv"
val batting = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load(battingFile)
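To check what inferSchema detected, print the schema:

%spark2
// columns such as yearID, R, H and G should be inferred as numeric types
batting.printSchema()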

Visualizing

First, have a look at the data:

%spark2
z.show(batting)

NOTE: z.show is a Zeppelin built-in that displays the content of a variable, such as a DataFrame. The interface lets you switch between views, such as table, pie chart, etc.
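If the full table is too wide, you can narrow it down to the columns used in the rest of this tutorial:

%spark2
z.show(batting.select("yearID", "teamID", "G", "R", "H", "HBP"))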

Compute some statistics per year:

%spark2
// make sure the sum() aggregate function is available
import org.apache.spark.sql.functions.sum

val statsPerYear = batting
    .groupBy($"yearID")
    .agg(
        sum("R").alias("total runs"),
        sum("H").alias("total hits"),
        sum("G").alias("total games"))

z.show(statsPerYear)

In the result view, select the line chart or area chart, then click on settings and drag and drop the statistics into the Values area:

(Figure: Batting with Zeppelin, statistics per year)
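The same aggregation can also be written in SQL. A sketch, assuming the DataFrame is first registered as a temporary view:

%spark2
batting.createOrReplaceTempView("batting")

%spark2.sql
SELECT yearID,
       SUM(R) AS `total runs`,
       SUM(H) AS `total hits`,
       SUM(G) AS `total games`
FROM batting
GROUP BY yearID
ORDER BY yearID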

Use an input form to display the hit-by-pitch count per team for a given year:

%spark2
// z.input(<input name>, <default value>)
val year = z.input("year", 1894)
val hbp1894 = batting
    .filter($"yearID" === year)
    .groupBy("teamID")
    .agg(sum("HBP").alias("hit by pitch"))
z.show(hbp1894)

z.input creates a simple text input. Use z.select for a dropdown and z.checkbox for multiple choices. For example, a dropdown listing all teams would be:

%spark2
// get all team names
val all_teams = batting.select("teamID").distinct().map(_.getAs[String](0)).collect()
// create and show a dropdown form
val team = z.select("selected team", all_teams.zip(all_teams).sorted)
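The selected value can then be used like the year input above. A sketch (z.select returns Any, hence the toString):

%spark2
// hit-by-pitch counts per year for the team chosen in the dropdown
val selected = team.toString
z.show(batting
    .filter($"teamID" === selected)
    .groupBy("yearID")
    .agg(sum("HBP").alias("hit by pitch")))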

More dynamic forms are described in the Zeppelin documentation.
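One example: SQL paragraphs can declare input forms inline with the ${name=default} template syntax. A sketch reusing the batting temporary view registered above:

%spark2.sql
SELECT teamID, SUM(HBP) AS `hit by pitch`
FROM batting
WHERE yearID = ${year=1894}
GROUP BY teamID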