LinuxJunkies

Musings, Rants, Gotchas and HowTos.


Hadoop or Hadoop datanode Installation Tutorial (On a cluster)

November 20, 2011 · NPK · Tagged: Cloudera, Datanode, distributed computing, Hadoop, HDFS, Installation, MapR, Namenode, passwordless login, ssh-keygen, Tutorial, Yahoo hadoop · 1 Comment

Why another post on Hadoop Installation?

It took me a while to get a Hadoop cluster up and running, especially after looking at all the documentation and tutorials available on the internet. Moreover, for someone starting out with the Hadoop ecosystem, it can be quite frustrating to decide between a distribution like Cloudera or MapR and a direct installation from the Apache site. I have chosen the latter and it works fine for me. Yes, there are a number of good tutorials available on the internet, but I am sure this would still help a few out there like me. Before I start, I assume that you have a basic understanding of how Hadoop works, or at least a general overview. If not, I suggest you get that first.

Now, for those of you who came here by accident, I would like to quote from the Apache Hadoop website, http://hadoop.apache.org/:

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

For an introduction, I would recommend,

http://developer.yahoo.com/hadoop/tutorial/

Although it's a dated tutorial, it does give a good idea of the overall system and, yes, the MapReduce framework. Or, if you'd rather read a book, this is a must:

http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979

This book, written by Tom White, is considered ‘the’ book on Hadoop, with a hands-on approach.

I will be using the hadoop-0.20.2 version, a stable release. You can download it, or a newer release, from here:

http://hadoop.apache.org/common/releases.html

Or if you want to try out other good tutorials out there, I would suggest the following:

This is a good read and gives you great insight into the framework. This post covers a standalone setup on Ubuntu.

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

Once you are comfortable with this, move on to his next tutorial on multiple machines.

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

I would also recommend his MapReduce tutorial in Python. Although Java is the native API, as he says, Python can do the job thanks to Hadoop's Streaming API.

http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

Again, it's up to you to decide whether to go for a distribution. I would suggest Cloudera, though I won't be writing about it here.

http://www.cloudera.com/hadoop/

MapR is also worth mentioning, especially since it comes with support for analytics.

http://www.mapr.com/

So what now?

I will be going through a general Hadoop installation on an RHEL 5 machine, written as a step-by-step how-to.

1. Some prerequisites:

  • Java installation and the installation path. Make sure you have at least Java 1.6; if not, install it first.

java -version

java version "1.6.0_10-rc"
Java(TM) SE Runtime Environment (build 1.6.0_10-rc-b28)
Java HotSpot(TM) Server VM (build 11.0-b15, mixed mode)

  • Make sure you have a valid hostname. Check it using the hostname command.
  • While not necessary, it is a good idea to partition your available disks in a suitable format (more on this later). Log in as root and use the fdisk -l command to see the available partitions.
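The hostname check above can be scripted. A minimal sanity check (this assumes the hostname should be mapped in /etc/hosts, which is how the datanode entries later in this post are set up):

```shell
# Print the machine's hostname; it should not be empty or "localhost"
hostname

# The hostname should resolve locally; an /etc/hosts entry is one way
grep "$(hostname)" /etc/hosts || echo "consider adding $(hostname) to /etc/hosts"
```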

2. Now the installation

Copy the tar file you downloaded, say hadoop-0.20.2.tar.gz, to /home/hadoop/installations:

cp hadoop-0.20.2.tar.gz /home/hadoop/installations

3. Now untar the file to /usr/local/. Why there? You'll see.

sudo tar -xzvf /home/hadoop/installations/hadoop-0.20.2.tar.gz -C /usr/local/

Also give the required permissions. This is very important.

sudo chown -R hadoop:hadoop /usr/local/hadoop-0.20.2/

Create a soft link to /usr/local/hadoop-0.20.2

ln -s /usr/local/hadoop-0.20.2 /home/hadoop/hadoop

Create or copy existing configuration files

Now, as you may know, the main configuration files are core-site.xml, hadoop-env.sh, hdfs-site.xml and mapred-site.xml. Refer to one of the above tutorials on how to set them up, or better, the book “Hadoop: The Definitive Guide”. If you have them ready, copy these files to each of your datanodes (or to the main namenode, if this is your first install) and configure them appropriately. That alone would take a long time to explain, so I'll cover it in another post. If you already have them on another Hadoop server, do this (yes, it is important that all nodes share the same settings):

scp -r  hadoop@10.0.9.91:/home/hadoop/hadoop/conf/* /home/hadoop/hadoop/conf
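If you are writing the files from scratch instead, a minimal sketch for hadoop-0.20.2 could look like the following. The hostname namenode and the ports 9000/9001 are placeholders, and the dfs.name.dir path is illustrative; the data and mapred paths match the directories created below.

```xml
<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hadoop/hdfs/data</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>namenode:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/hadoop/mapred/local</value>
  </property>
</configuration>
```

Remember that these are separate files in the conf directory, and every node in the cluster should carry the same copies.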

Set the environment variables in /etc/profile using an editor like vim:

### Hadoop Environment Variables ###
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export PATH=$PATH:$HADOOP_HOME/bin

Now create the Hadoop system directories as the hadoop user.

mkdir -p /home/hadoop/mapred/local
mkdir -p /home/hadoop/mapred/system

Now create the other directories as root, but give them ownership permissions for the hadoop user.

mkdir -p /var/log/hadoop
chown hadoop:hadoop /var/log/hadoop
mkdir -p /var/run/hadoop
chown hadoop:hadoop /var/run/hadoop

Then the important data and mapred directories:

mkdir -p /disk1/hadoop/hdfs/data
mkdir -p /disk1/hadoop/mapred/local
chown -R  hadoop:hadoop   /disk1/hadoop

If you are adding a datanode to an existing Hadoop system, you should add an entry to /etc/hosts for every new datanode.
Next, set up passphraseless SSH login from the namenode to the datanodes. The idea is to copy the namenode's public key, id_dsa.pub, to each new datanode's /home/hadoop/.ssh/authorized_keys. If you don't know how to create the keys, follow this link; it is explained very lucidly.

http://rcsg-gsir.imsb-dsgi.nrc-cnrc.gc.ca/documents/internet/node31.html

# Create the .ssh directory if it does not exist
mkdir -p /home/hadoop/.ssh
scp hadoop@10.0.9.100:/home/hadoop/.ssh/authorized_keys /home/hadoop/.ssh
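If you still need to generate the key pair on the namenode, a minimal sketch follows. The post above uses DSA keys (id_dsa.pub); RSA, shown here, works the same way and is also accepted by modern OpenSSH. The datanode address is a placeholder.

```shell
# On the namenode: create a key pair with an empty passphrase, if none exists
mkdir -p ~/.ssh
test -f ~/.ssh/id_rsa || ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Append the public key to each datanode's authorized_keys, e.g.:
#   cat ~/.ssh/id_rsa.pub | ssh hadoop@10.0.9.101 "cat >> ~/.ssh/authorized_keys"
# sshd ignores the key if permissions are too loose:
#   chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
```

Afterwards, ssh from the namenode to the datanode should log in without prompting for a password.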

If you are setting up the namenode, you need to format HDFS before starting the daemons. Do not format a running Hadoop filesystem; that will erase all your data. Before formatting, ensure that the dfs.name.dir directory exists. More on this here: http://wiki.apache.org/hadoop/GettingStartedWithHadoop

hadoop namenode -format

Stop the cluster if it is already running.

stop-mapred.sh
stop-dfs.sh

Now add the IP of the new datanode to conf/slaves and conf/includes, and restart (or start) the cluster.
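Registering the node could look like this (10.0.9.101 is a placeholder IP; the conf path assumes the HADOOP_CONF_DIR export from /etc/profile, falling back to the symlinked location created earlier):

```shell
# On the namenode, add the new datanode's IP to the slaves and includes files
CONF="${HADOOP_CONF_DIR:-$HOME/hadoop/conf}"
mkdir -p "$CONF"   # no-op on an existing install
echo "10.0.9.101" >> "$CONF/slaves"
echo "10.0.9.101" >> "$CONF/includes"
```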

start-dfs.sh
start-mapred.sh

This should get you up and running. By no means is this a complete listing; I have tried to keep it short and clean. I'll write more about the configuration files and other administrative tasks in later posts. Comments and suggestions appreciated! 🙂
