Setting up an RHadoop sandbox environment: Cloudera QuickStart VM

The main goal of this note is to provide some technical help for the students taking our elective course ‘Big Data Analysis Techniques’. While there are other, rather good articles on this topic on the Internet, this one is tailored to the needs of the course.

Note to the participants of the 2013 fall Big Data Analysis Techniques course: we will make the QuickStart VM available on our departmental Apache VCL cluster.

The local backend is the best choice for getting started with RHadoop and for prototyping MapReduce algorithms, as discussed in the previous note. However, in a “Hadoopless” RHadoop environment you certainly cannot learn about the Hadoop-related aspects of a complete RHadoop solution. The next step on the learning curve is setting up a single-node system, complete with HDFS, MapReduce, R and RHadoop/rmr2.
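Recall that switching rmr2 between the two backends is a single option, so anything prototyped locally can later be pointed at the VM's Hadoop installation unchanged; a minimal sketch (using nothing beyond rmr2 itself):

library(rmr2)
# prototype against the local backend: in-process execution on local files, no Hadoop required
rmr.options(backend = "local")
small <- to.dfs(1:10)
from.dfs(mapreduce(input = small, map = function(k, v) keyval(v, 2 * v)))
# once a working Hadoop is in place (as set up below), switch over
rmr.options(backend = "hadoop")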

Setting up Apache Hadoop from scratch can be a complicated task even for a single-node install. A number of companies offer distributions of Hadoop as a product; these include tools that simplify cluster setup and management, and also provide the elements of the Hadoop ecosystem in a packaging tested for interoperability.

Cloudera and Hortonworks are two companies with widely used Hadoop distributions; the Hadoop Wiki provides a reference list of “Products that include Apache Hadoop or derivative works and Commercial Support”. Notice that even IBM uses Hadoop as a key element of its Big Data platform (at least for “at-rest” data).

Both Cloudera and Hortonworks provide preconfigured, single-node environments of their products in the form of virtual machines. Starting with a VM containing a ready-to-go Hadoop setup is the quickest way to reach a fully functional single-node RHadoop system; this note walks you through that process.

Downloading the Cloudera QuickStart VM

We will use the Cloudera QuickStart VM, version cloudera-quickstart-vm-4.4.0-1-vmware. While downloading/unzipping the image of your choice (I recommend VMware, but that is only a personal preference), you should at least skim through the administrative information. Note that you will need a 64-bit host OS with hardware virtualization (VT-x on Intel processors) present and enabled. To avoid swapping, you really should have at least 6 GB of memory in your machine; the VM needs 4 GB of that. Other than that, the only further requirement is at least a basic familiarity with Linux; the operating system in the VM is CentOS 6.2.

Having a look around

The VM boots directly into a desktop environment and opens a browser with links to two web applications running locally on the VM: Hue and Cloudera Manager. (User name and password are cloudera/cloudera for both.) You should first become familiar with Cloudera Manager, starting the oozie1 service along the way. You should also take a moment to check out what those services are if you are not familiar with them (although we will use only a few of them directly).

Apart from that, you will most probably not need to do anything noteworthy here.


More interesting is the Hue web interface to those services; it provides a more convenient way to access them than the command line tools or the basic web interfaces implemented in the Apache projects. (Incidentally, some of those are directly accessible from the bookmark bar.) For basic MapReduce, you will need the Files (HDFS) and Jobs (Hadoop Jobs) applications.


RHadoop

We get R out of the box (version 3.0.1); just issue the command R in a terminal and the familiar R prompt appears.
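For example, you can confirm the version right at the prompt:

R.version.string    # should report "R version 3.0.1 ..."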

For RHadoop to work, the environment variables HADOOP_CMD and HADOOP_STREAMING have to be defined. Insert the following at the end of /etc/profile (via e.g. sudo gedit /etc/profile; also, see this on the value of the two variables):

export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar

You will have to restart the VM for the exports to take effect (e.g. sudo reboot). Next, download rmr2_2.3.0.tar.gz from GitHub, open an R session with elevated rights (sudo R) and issue:

install.packages(c('Rcpp', 'RJSONIO', 'bitops', 'digest', 'functional', 'stringr', 'plyr', 'reshape2', 'caTools'))
install.packages("/home/cloudera/Downloads/rmr2_2.3.0.tar.gz")
quit()

That's all.
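Before moving on, a quick sanity check in a fresh (post-reboot) R session does not hurt; something along these lines should confirm that both the environment variables and the package are in place:

# the two variables exported in /etc/profile should be visible from R
Sys.getenv(c("HADOOP_CMD", "HADOOP_STREAMING"))
# the package should attach without errors and report the expected version
library(rmr2)
packageVersion("rmr2")    # expect 2.3.0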

There are a number of additional steps if you would like to check the package and its compatibility with the Hadoop substrate via a battery of tests.

First, install libcurl-devel for the RCurl package:

sudo yum install libcurl-devel

Then, after sudo R:

install.packages("devtools")
library("devtools")
install_github("quickcheck", "RevolutionAnalytics", subdir="pkg")
quit()
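Assuming the repository indeed installs a package named quickcheck (as its name suggests), it is worth making sure it loads before committing to the lengthy check run:

library(quickcheck)    # should attach without errors if install_github succeeded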

Lastly, you can start the built-in checks themselves with:

R CMD check /home/cloudera/Downloads/rmr2_2.3.0.tar.gz

Note, however, that the test set contains 50+ Hadoop jobs, so it is quite lengthy. The results are not exactly spotless, but all Hadoop jobs should run successfully.

Optional: installing RStudio server

It is quite easy to set up RStudio on the VM if you prefer to use an integrated environment over running R from the terminal. We could install the desktop version into the VM; however, you may find it more convenient to access an RStudio server from a browser on your host operating system.

Download the rpm and install it (see the CentOS section of the RStudio server guide):

wget http://download2.rstudio.org/rstudio-server-0.97.551-x86_64.rpm
sudo yum install --nogpgcheck rstudio-server-0.97.551-x86_64.rpm

The default port RStudio server runs on is 8787; you can determine the IP address of the VM the usual way, using ifconfig. Open a browser on the host OS and navigate to <VM IP address>:8787. RStudio server uses the OS for authentication; thus, you can log in with the user name and password ‘cloudera’. (Note that in the next figure port 9000 is used instead of 8787; also, the IP address is not what you would see on a default VMware NAT network. To be honest, I actually run the VM on a rather hefty remote workstation, so I had to do some additional port mapping/forwarding.)


In this case you will have to take additional steps for the R sessions created by the server to pick up the necessary environment variables. One way to do it is to create the file Renviron.site in /usr/lib64/R/etc/ with the following content (note that Renviron files expect plain name=value lines, not the shell's export syntax):

HADOOP_CMD=/usr/bin/hadoop
HADOOP_STREAMING=/usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar
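New R sessions will read this file at startup; if you also want an already running RStudio session to pick up the variables without a restart, base R's readRenviron() can load the file on the spot:

# re-read the site file in the current session and verify the result
readRenviron("/usr/lib64/R/etc/Renviron.site")
Sys.getenv(c("HADOOP_CMD", "HADOOP_STREAMING"))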

One more thing: if you are using a non-English keyboard layout and Windows as a host OS, you may experience problems with special characters that you would normally type using the AltGr key. Without going into details: the web application's Ctrl+Alt shortcuts seem to hijack the AltGr keypress (which on Windows is actually Ctrl+Alt). I am sorry to say that as of now we know of no real solution for this; either switch to an English layout (where AltGr is not needed to type [, {, @ and the like) or cope with the minor discomfort of having to copy-paste certain characters.

Testing RHadoop

Let's write a few numbers in human-readable form to HDFS (in either a plain R session or an RStudio one):

library('rmr2')
to.dfs(seq(from=1, to=500, by=3), format="text", output="/user/cloudera/numbers.txt")

Using Hue, we can immediately check the results:


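If you would rather stay inside R, the file can presumably be read straight back with the same text format:

# values are the raw lines when reading with format="text"
out <- from.dfs("/user/cloudera/numbers.txt", format = "text")
head(values(out))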
Finally, we can begin to write MapReduce jobs:

a <- to.dfs(seq(from=1, to=500, by=3), output="/user/cloudera/numbers")
b <- mapreduce(input=a, map=function(k,v){keyval(v,v*v)})
c <- from.dfs(b)
d <- data.frame(key=c[["key"]], val=c[["val"]])
head(d[order(d$key),])

This gives the following output:

    key val
1     1   1
112   4  16
157   7  49
2    10 100
13   13 169
24   16 256
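From here, adding a reduce phase is the natural next step. As a quick sketch, reusing the a object from above and grouping by an arbitrary key of my own choosing (the remainder modulo 5), summing the squares per group could look like this:

# map emits (v mod 5, v^2); reduce receives all squares sharing a key and sums them
sq <- mapreduce(input = a,
                map    = function(k, v) keyval(v %% 5, v * v),
                reduce = function(k, vv) keyval(k, sum(vv)))
from.dfs(sq)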

Seems like we now have a for-real RHadoop sandbox environment.

Next steps

You should explore the datasets directory in /home/cloudera; some of the examples can be readily used with the tools you are already familiar with. (I.e. without Hive/Impala/Solr.) We will investigate cluster setups in later notes.