This tutorial assumes you have a proper environment setup to access the DAPLAB cluster.
In order to use HADOOP, it is crucial that you understand the basic functioning of HDFS, as well as some of its constraints. After a brief introduction of core HDFS concepts, this page presents a copy-paste-style tutorial to familiarize yourself with HDFS commands. It mainly focuses on user commands (uploading and downloading data to and from HDFS).
Introduction
HDFS (Hadoop Distributed File System) is one of the core components of HADOOP.
HDFS is a distributed file system designed to run on commodity hardware. It replicates data across many nodes, which makes the system fault tolerant, well suited to large data sets, and capable of high throughput.
To have a better understanding of how HDFS works, we strongly encourage you to check out the HDFS Architecture Guide.
Some remarks on HDFS
HDFS uses a simple coherency model: applications mostly need a write-once-read-many access model for files. As a result, once a file is created, written to, and closed, it becomes read-only. Appending to an HDFS file is only possible if the cluster has been explicitly configured to allow it.
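On clusters where appends are enabled, hdfs dfs -appendToFile adds data at the end of an existing file. A minimal sketch, assuming append is allowed on your cluster and using placeholder file names:

$ hdfs dfs -appendToFile morelines.txt /tmp/existingfile.txt

This is the only kind of in-place modification HDFS allows: existing bytes are never rewritten.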
HDFS is tuned to deal with large files. A typical file in HDFS is gigabytes to terabytes in size. As a result, try to avoid scattering your data in numerous small files.
HDFS is designed more for batch processing than for interactive use (high throughput over low latency), and provides only sequential access to data. If your application has other needs, check out tools like HBase, Hive, Apache Spark, etc.
“Moving Computation is Cheaper than Moving Data”
HDFS architecture
As the HDFS Architecture Guide explains, HDFS has a master/slave architecture.
An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
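You can observe this block-to-DataNode mapping yourself with the hdfs fsck tool. A quick sketch, assuming you have read access to the file (the path below is a placeholder):

$ hdfs fsck /tmp/remotefile.txt -files -blocks -locations

The output lists each block of the file along with the DataNodes holding its replicas.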
Resources
While the source of truth for HDFS commands is the source code, the documentation page describing the hdfs dfs commands is really useful. A good, simpler cheat sheet is also available here.
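The built-in help is also handy when you are offline. For instance (the -usage action is available on recent Hadoop versions):

$ hdfs dfs -help
$ hdfs dfs -usage copyFromLocal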
Basic Manipulations
In HDFS, user home folders are stored in /user and not /home as in traditional Unix/Linux filesystems.
Listing a folder
Your home folder
$ hdfs dfs -ls
Found 28 items
...
-rw-r--r-- 3 bperroud daplab_users 6398990 2015-03-13 11:01 data.csv
...
^^^^^^^^^^ ^ ^^^^^^^^ ^^^^^^^^^^^^ ^^^^^^^ ^^^^^^^^^^ ^^^^^ ^^^^^^^^
1 2 3 4 5 6 7 8
Columns, as numbered above, represent:
- Permissions, in unix-style syntax (see http://en.wikipedia.org/wiki/File_system_permissions#Notation_of_traditional_Unix_permissions);
- Replication factor (RF in short), default being 3 for a file. Directories have a RF of 0;
- Owner;
- Group owning the file;
- Size of the file, in bytes. Note that to compute the physical space used, this number should be multiplied by the RF;
- Modification date. As HDFS is mostly a write-once-read-many filesystem, this date often means creation date;
- Modification time. Same as date;
- Filename, within the listed folder.
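On Hadoop 2.x and later (an assumption about your cluster's version), the -h flag prints sizes in a human-readable form instead of raw bytes; the 6398990-byte file above would show as roughly 6.1 M:

$ hdfs dfs -ls -h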
Listing the /tmp folder
$ hdfs dfs -ls /tmp
Uploading a file
In /tmp
$ hdfs dfs -copyFromLocal localfile.txt /tmp/
The first arguments after -copyFromLocal point to local files or folders, while the last argument is a file (if only one file is listed as source) or a directory in HDFS.
Note that -copyFromLocal and -copyToLocal also support wildcards and the copy of directories.
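For instance, both of the following are valid (the local names are placeholders):

$ hdfs dfs -copyFromLocal *.csv /tmp/
$ hdfs dfs -copyFromLocal localfolder /tmp/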
hdfs dfs -put does roughly the same thing, but -copyFromLocal is more explicit when you're uploading a local file and is thus preferred.
Downloading a file
From /tmp
$ hdfs dfs -copyToLocal /tmp/remotefile.txt .
The first arguments after -copyToLocal point to files or folders in HDFS, while the last argument is a local file (if only one file is listed as source) or directory.
hdfs dfs -get does roughly the same thing, but -copyToLocal is more explicit when you're downloading a file and is thus preferred.
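Wildcards and directories work here too. For example, to fetch a whole HDFS folder into the current local directory (the remote path is a placeholder):

$ hdfs dfs -copyToLocal /tmp/remotefolder .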
Creating a folder
In your home folder
$ hdfs dfs -mkdir dummy-folder
In /tmp
$ hdfs dfs -mkdir /tmp/dummy-folder
Note that relative paths point to your home folder, /user/bperroud for instance.
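To create nested folders in one command, recent Hadoop versions support the -p flag, analogous to mkdir -p on Linux (a sketch, assuming your version has it):

$ hdfs dfs -mkdir -p dummy-folder/sub1/sub2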
Advanced Manipulations
The hdfs dfs command supports several actions that any Linux user is already familiar with. Most of their parameters are the same, but note that combined shortcuts (-rf instead of -r -f, for example) are not supported. Here is a non-exhaustive list (a short combined example follows):
- -rm [-r] [-f]: remove a file or directory;
- -cp [-r]: copy a file or directory;
- -mv: move/rename a file or directory;
- -cat: display the content of a file;
- -chmod: manipulate file permissions;
- -chown: manipulate file ownership;
- -tail, -touch, etc.
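As an illustration, here is a short hypothetical sequence combining some of these commands (the paths build on the files uploaded earlier in this tutorial):

$ hdfs dfs -cp /tmp/localfile.txt dummy-folder/copy.txt
$ hdfs dfs -mv dummy-folder/copy.txt dummy-folder/renamed.txt
$ hdfs dfs -cat dummy-folder/renamed.txt
$ hdfs dfs -rm -r dummy-folder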
Other useful commands include:
- -moveFromLocal | -moveToLocal: same as -copyFromLocal | -copyToLocal, but remove the source;
- -stat: display information about the specified path;
- -count: count the number of directories, files, and bytes under the paths;
- -du: display the size of the specified file, or the sizes of files and directories that are contained in the specified directory;
- -dus: display a summary of the file sizes;
- -getmerge: concatenate the files in src and write the result to the specified local destination file. To add a newline character at the end of each file, specify the addnl option: hdfs dfs -getmerge <src> <localdst> [addnl];
- -setrep [-R]: change the replication factor for a specified file or directory.
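For example, assuming the file and folder from the previous sections still exist:

$ hdfs dfs -getmerge /tmp/dummy-folder merged.txt
$ hdfs dfs -setrep -w 2 /tmp/localfile.txt

The -w flag tells -setrep to wait until the new replication factor is actually reached before returning.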