Apache Zeppelin is an online notebook that lets you interact with the DAPLAB cluster through many languages and technology backends. Currently, Zeppelin supports:
- Spark 1.6 and Spark 2.1.0 using Python, Scala, or R
To get started:
- Go to Zeppelin
- Log in using your DAPLAB credentials
Create a new notebook
On the home page or on the notebook menu, select "create new...". Once the notebook is open, give it a new name.
Using slashes (/) in the notebook name will automatically create folders as needed and move the notebook into them.
A notebook is made of cells, also called paragraphs. Each cell has an interpreter that tells Zeppelin which language/backend to use to run the cell.
The interpreter is configured by writing %<interpreter name> at the top of the cell. Without it, Zeppelin will use the default interpreter, which you can configure by clicking on > interpreters at the top right of the notebook (drag and drop to reorder them, the first one being the default).
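As a minimal sketch, a cell that selects the Spark 2.1.0 Scala interpreter could look like the following. The values are purely illustrative and not part of the tutorial data; only the first line is specific to Zeppelin.

```scala
%spark2
// The first line selects the interpreter; everything below it is plain Scala.
// `numbers` is a hypothetical value used only for illustration.
val numbers = Seq(1, 2, 3)
println(numbers.sum)
```

Removing the first line would send the same code to whichever interpreter is currently the default.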
You can run cells by clicking on the icon on the right or using the keyboard shortcut.
Many useful shortcuts exist in edit mode. Click on the icon at the top right of a notebook to display them.
List of interpreter prefixes
||Spark with Scala|
||Spark SQL syntax|
||Load dependencies for use within Spark cells|
||Spark with Python|
spark is Spark 1.6, spark2 is Spark 2.1.0.
Battling with Spark
Include Spark CSV
First, we need an external library to easily read CSV files. In the first cell, enter and run:
%dep is used to manage dependencies. See the Zeppelin documentation for more information.
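A dependency paragraph of this kind typically looks like the sketch below. The Maven coordinates are an assumption (spark-csv 1.5.0 built for Scala 2.11), and the exact prefix may differ on a cluster where the dependency loader is tied to a specific Spark interpreter; adjust both to your setup.

```scala
%dep
// Assumption: spark-csv 1.5.0 for Scala 2.11; pick the artifact that matches
// the Scala version of the Spark interpreter you intend to use.
z.load("com.databricks:spark-csv_2.11:1.5.0")
```

This paragraph has to run before the first Spark paragraph of the session, since dependencies are resolved when the Spark interpreter starts.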
If you run into the error:
Must be used before SparkInterpreter (%spark) initialized Hint: put this paragraph before any Spark code and restart Zeppelin/Interpreter
Open the interpreter list ( > interpreters on the top right) and click on the icon on the left of the Spark interpreter to restart it, then run the cell again.
    %spark2
    val battingFile = "hdfs:///shared/seanlahman/2011/Batting/Batting.csv"
    val batting = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")      // Use first line of all files as header
      .option("inferSchema", "true") // Automatically infer data types
      .load(battingFile)
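As a side note, Spark 2.x ships with a built-in CSV reader, so with the spark2 interpreter the same file can probably be loaded without the external library. The sketch below assumes that Zeppelin exposes the SparkSession as `spark`; `battingBuiltin` is a hypothetical name chosen to avoid clashing with `batting`.

```scala
%spark2
// Assumption: `spark` is the SparkSession provided by the Spark 2 interpreter.
val battingBuiltin = spark.read
  .option("header", "true")      // Use first line as header
  .option("inferSchema", "true") // Automatically infer data types
  .csv("hdfs:///shared/seanlahman/2011/Batting/Batting.csv")
```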
First, have a look at the data:
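A minimal sketch of such a paragraph, assuming the `batting` DataFrame loaded above:

```scala
// Print the inferred column names and types as text
batting.printSchema()
// Render the rows in Zeppelin's table/chart widget
z.show(batting)
```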
z.show is a Zeppelin built-in that displays the contents of a variable. The interface lets you switch between views, such as table, pie chart, etc.
Compute some statistics per year:
    val statsPerYear = batting
      .groupBy($"yearID")
      .agg(
        sum("R").alias("total runs"),
        sum("H").alias("total hits"),
        sum("G").alias("total games"))
    z.show(statsPerYear)
On the interface, select the line chart or area chart and then click on settings. Drag-and-drop the statistics into the Values area:
Use an input form to display the number of hit-by-pitch events per team for a given year:

    // z.input(<input name>, <default value>)
    val year = z.input("year", 1894)
    val hbpForYear = batting
      .filter($"yearID" === year)
      .groupBy("teamID")
      .agg(sum("HBP").alias("hit by pitch"))
    z.show(hbpForYear)
z.input creates a simple text input. Use z.select for a dropdown and z.checkbox for multiple choices. For example, a dropdown for all teams would be:
    // get all team names
    val all_teams = batting.select("teamID").distinct().map(_.getAs[String](0)).collect()
    // create and show a dropdown form
    val team = z.select("selected team", all_teams.zip(all_teams).sorted)
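In the same spirit, z.checkbox could be used to compare several teams at once. The following is a sketch only, assuming the `batting` DataFrame and the `all_teams` array defined above; `teamOptions`, `selectedTeams` and `hbpSelected` are hypothetical names.

```scala
// Build (value, label) pairs for the form, one per team
val teamOptions = all_teams.sorted.map(t => (t, t)).toSeq
// z.checkbox returns the values that are currently checked
val selectedTeams = z.checkbox("teams", teamOptions).map(_.toString).toSeq
// Keep only the rows of the checked teams, then aggregate per team
val hbpSelected = batting
  .filter($"teamID".isin(selectedTeams: _*))
  .groupBy("teamID")
  .agg(sum("HBP").alias("hit by pitch"))
z.show(hbpSelected)
```

Changing the checked boxes and rerunning the paragraph updates the result.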
More dynamic forms are described in the Zeppelin documentation.