Setting up an Apache Hadoop cluster on AWS EC2


Apache Hadoop is a distributed software framework for processing large data sets across clusters of computers. It is designed to scale from a single machine to thousands of machines. Setting up an Apache Hadoop cluster on AWS EC2 is a relatively involved task, so we will follow a series of simple steps to get the cluster running quickly.

There are four main modules in the Hadoop framework. They are:

  1. Hadoop Common: Common utility functions that are used in other Hadoop modules
  2. HDFS: Hadoop Distributed File System
  3. Hadoop YARN: Job scheduling and cluster resource management
  4. Hadoop MapReduce: Parallel processing of large data sets using YARN

Prerequisites

  1. AWS Account (Academy and Sandbox also work)
  2. Basic familiarity with Linux terminal
  3. SSH client installed on a personal computer

EC2 Instance setup

We need to set up an EC2 instance before we can build the Hadoop cluster. AWS EC2 provides virtual servers in the cloud. We will create one server, install Java and Hadoop on it to act as our name node, and later clone it to create three new instances for the data nodes.

Create EC2 instance

Go to the AWS console and search “ec2”. Then, click on EC2

Click on “Launch Instance”

Configure EC2 Instance creation

Fill up or choose the following data:

Name: “Name Node”

Amazon Machine Image: Ubuntu 22.04 LTS, SSD Volume Type

Architecture: 64-bit (x86)

Instance type: t2.micro

Create an RSA key pair for ssh login: In the section “Key pair (login)”, click on “create a new key pair” and fill it in as shown below to create an RSA private and public key pair. The private key file (.pem) will be downloaded; keep it in a safe folder.

Network settings

Tick the checkboxes for all three options under the Firewall (security groups) section.

  • Allow SSH traffic from: Anywhere
  • Allow HTTPS traffic from the internet
  • Allow HTTP traffic from the internet

Configure storage

8 GB is fine for testing purposes, but let’s choose 16 GB for extra storage. You can choose up to 30 GB on the free tier.

Finally, click on the “Launch Instance” orange button.

In the EC2 > instances section, we can see that the instance has been launched and is in the “running” state. If it shows “pending”, wait for some time.

Installing Java and Hadoop

SSH access to the server

We need to log in to the server via ssh (secure shell) in order to install Java and Hadoop there. Follow these steps for ssh access.

Get the public domain name

Click on the “Instance ID” and copy the “Public IPv4 DNS” as shown below.

ssh into the server

On your personal computer, type the following commands to enter the server

cd folder_name
chmod 400 hadoop_ec2.pem
ssh -i hadoop_ec2.pem ubuntu@ec2-54-159-202-213.compute-1.amazonaws.com

These commands do the following things:

  • Go to the folder where you stored your private key (.pem file) earlier (in my case, it is “hadoop_test”)
  • Change permission of the file to be read by only the current user (400 means read-only for the current user, and no access for other users)
  • Enter into the server via ssh (Replace the public DNS of the server in the last command)

We have now entered the server via ssh. The commands we enter here will run in the EC2 instance we created earlier.

Update Ubuntu on the server

Type the following command to update the server

sudo apt update -y && sudo apt upgrade -y;

If any confirmation page is shown, press Enter on the keyboard to accept the changes.

After the update has completed, go to the AWS console > EC2 > instances, click on “Instance State” and choose Reboot

After rebooting the instance, you will have to ssh into the server again with the following command (pressing the up arrow on the keyboard should bring back the previous command). You may need to wait a bit for the reboot to complete.

ssh -i hadoop_ec2.pem ubuntu@ec2-54-159-202-213.compute-1.amazonaws.com

Install Java

We now install Java 8 on the server by using the package “openjdk-8-jdk-headless”. Run the following command to install Java.

sudo apt -y install openjdk-8-jdk-headless
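
To confirm that the JDK was installed correctly, you can check the reported version (the exact build number on your instance may differ); it should report an OpenJDK 1.8 runtime.

java -version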

On the server, create a directory called “server” and move into that folder

cd
mkdir server
cd server

Install Apache Hadoop

Let’s install Hadoop 2.7.3 on the server. The download link below was copied from the Apache archive; use it in the following commands to download and extract Hadoop.

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar xvzf hadoop-2.7.3.tar.gz

Set up JAVA_HOME in hadoop-env.sh

Edit the file ~/server/hadoop-2.7.3/etc/hadoop/hadoop-env.sh using nano or vim.

nano ~/server/hadoop-2.7.3/etc/hadoop/hadoop-env.sh

In the editor, replace this line

export JAVA_HOME=${JAVA_HOME}

with the following line:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Save and Exit
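
As an optional sanity check, verify that the JDK directory you pointed JAVA_HOME at exists and that Hadoop now picks it up; the second command should print the Hadoop 2.7.3 version banner.

ls -d /usr/lib/jvm/java-8-openjdk-amd64
~/server/hadoop-2.7.3/bin/hadoop version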

Set up the core-site configuration

Now edit the file ~/server/hadoop-2.7.3/etc/hadoop/core-site.xml

nano ~/server/hadoop-2.7.3/etc/hadoop/core-site.xml

Replace this block

<configuration>
</configuration>

with the following code block:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://<nnode>:9000</value>
  </property>
</configuration>

Replace the “nnode” with the public DNS of the EC2 instance as shown below:
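
For example, using the name node DNS from earlier in this tutorial (yours will be different), the property would look like this:

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ec2-54-159-202-213.compute-1.amazonaws.com:9000</value>
  </property>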

Create Data Directory

We are going to create one Name Node, and three Data Nodes. In each of these four nodes, HDFS requires a data directory.

Enter the following commands to create a data directory and change its ownership to the user “ubuntu”.

sudo mkdir -p /usr/local/hadoop/hdfs/data
sudo chown -R ubuntu:ubuntu /usr/local/hadoop/hdfs
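
You can confirm that the directory exists and is owned by the ubuntu user with:

ls -ld /usr/local/hadoop/hdfs/data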

Creating Data Nodes

We have now created the name node and installed the packages that are common to both the name node and the data nodes. We can therefore clone this instance to create the three data nodes; that way, we won’t have to repeat the steps above, since Java and Hadoop will already be present on them.

Architecture diagram

Apache Hadoop cluster in AWS

Create an Image from the instance

In the AWS EC2 > instances, select the name node instance, and go to “Actions”

Click on “Image and templates” and then “Create image”

Enter the image name as “Hadoop Image”

Create three new instances from the created image

Go to the “Images > AMIs” and wait for the image creation to be completed (look at the “Status” column)

Then launch three new instances from the previously created image.

Set the value of the “Name” to “dnode”

Set “Number of instances” to 3 and click on “Launch instance”

Note 1: Make sure the key pair is the same one we created when launching the first EC2 instance.

Click on “Launch instance”

Now, three new data node instances have been created.

Note down the public DNS of the newly created data nodes. We will use these DNS names to ssh into the nodes.

Setting up passwordless SSH between the servers

We need to create a passwordless ssh connection between the name node and the data nodes so that the name node can communicate with the data nodes in order to manage the distributed system.

Name Node

SSH into the name node, and run the following command:

ssh-keygen

Use the default values and press Enter. It will create two new files, “id_rsa” and “id_rsa.pub”, inside the “/home/ubuntu/.ssh/” directory.

The ssh private key is located at /home/ubuntu/.ssh/id_rsa and the public key at /home/ubuntu/.ssh/id_rsa.pub. We need to copy the public key value and use it on the data nodes.

We also need to put this public key value in the authorized_keys file of the name node.

cat /home/ubuntu/.ssh/id_rsa.pub >> /home/ubuntu/.ssh/authorized_keys

Now print the content of the id_rsa.pub file so that it can be copied to the data nodes

cat /home/ubuntu/.ssh/id_rsa.pub

Copy this value

Data Nodes

Now ssh into each of the data nodes using public DNS like this:

ssh -i hadoop_ec2.pem ubuntu@ec2-54-164-170-249.compute-1.amazonaws.com

On each data node, there is a file called authorized_keys in the /home/ubuntu/.ssh/ directory. Edit this file and append the value of the id_rsa.pub file copied previously.

nano /home/ubuntu/.ssh/authorized_keys

Repeat this step for each data node
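
If you prefer not to paste the key by hand, a rough alternative is to run the following from your personal computer (it assumes the same .pem file and uses the example DNS names from this tutorial; substitute your own name node and data node addresses). It pipes the name node’s public key straight into a data node’s authorized_keys file:

ssh -i hadoop_ec2.pem ubuntu@ec2-54-159-202-213.compute-1.amazonaws.com 'cat ~/.ssh/id_rsa.pub' | \
  ssh -i hadoop_ec2.pem ubuntu@ec2-54-164-170-249.compute-1.amazonaws.com 'cat >> ~/.ssh/authorized_keys'

Run it once per data node, changing the second address each time.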

Setup ssh config in name node

Configuration for ssh can be stored in the file /home/ubuntu/.ssh/config. Copy the following content into that file on the name node, replacing nnode, dnode1, dnode2, and dnode3 with the public DNS of the respective servers.

nano /home/ubuntu/.ssh/config

In the editor, enter the following content, replacing the <nnode>, <dnode1>, <dnode2>, and <dnode3> values with the respective public DNS names.

Host nnode
  HostName <nnode>
  User ubuntu
  IdentityFile ~/.ssh/id_rsa
 
Host dnode1
  HostName <dnode1>
  User ubuntu
  IdentityFile ~/.ssh/id_rsa
 
Host dnode2
  HostName <dnode2>
  User ubuntu
  IdentityFile ~/.ssh/id_rsa
 
Host dnode3
  HostName <dnode3>
  User ubuntu
  IdentityFile ~/.ssh/id_rsa


Now, we need to verify if our ssh configuration works. Enter the following commands one by one in the name node:

ssh nnode

Type “yes” if prompted for confirmation.

Type exit to get back to the name node.

exit

Now test each of the data nodes in the same way, repeating this check for all of the hosts: nnode, dnode1, dnode2, and dnode3.
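
A quick way to test all four entries at once is the small loop below (a convenience sketch; it assumes the Host aliases above are in place). It runs hostname on each node and accepts new host keys automatically, so you won’t be prompted:

for h in nnode dnode1 dnode2 dnode3; do
  ssh -o StrictHostKeyChecking=accept-new "$h" hostname
done

Each line of output should be the internal hostname of the corresponding instance.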

Configuring Hadoop on Name Node

HDFS Properties

On the name node, edit this file: ~/server/hadoop-2.7.3/etc/hadoop/hdfs-site.xml

nano ~/server/hadoop-2.7.3/etc/hadoop/hdfs-site.xml

Replace this

<configuration>
</configuration>

with the following code

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hdfs/data</value>
  </property>
</configuration>

MapReduce properties

On the name node, we will use the provided template for the MapReduce configuration. Copy the file ~/server/hadoop-2.7.3/etc/hadoop/mapred-site.xml.template to ~/server/hadoop-2.7.3/etc/hadoop/mapred-site.xml

cp ~/server/hadoop-2.7.3/etc/hadoop/mapred-site.xml.template ~/server/hadoop-2.7.3/etc/hadoop/mapred-site.xml

Now edit the copied file ~/server/hadoop-2.7.3/etc/hadoop/mapred-site.xml

nano ~/server/hadoop-2.7.3/etc/hadoop/mapred-site.xml

Replace this

<configuration>

</configuration>

with the following code

<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value><nnode>:54311</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Note: Replace the <nnode> with the public DNS of the name node.

YARN properties

On the name node, edit the file ~/server/hadoop-2.7.3/etc/hadoop/yarn-site.xml

nano ~/server/hadoop-2.7.3/etc/hadoop/yarn-site.xml

Replace this

<configuration>

<!-- Site specific YARN configuration properties -->

</configuration>

with the following code

<configuration>

  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value><nnode></value>
  </property>

</configuration>

Note: Replace the <nnode> with the public DNS of the name node.

Set up masters and slaves

On the name node, create two new files, “~/server/hadoop-2.7.3/etc/hadoop/masters” and “~/server/hadoop-2.7.3/etc/hadoop/slaves”, for the master and slave nodes respectively. Then put the public DNS of the master (name node) and of the slaves (data nodes) into the respective files, one DNS per line, as shown in the example after the commands below.

nano ~/server/hadoop-2.7.3/etc/hadoop/masters

nano ~/server/hadoop-2.7.3/etc/hadoop/slaves
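
The masters file contains just the name node’s public DNS on a single line, for example:

ec2-54-159-202-213.compute-1.amazonaws.com

The slaves file lists one data node public DNS per line (the example below uses the first data node’s DNS from this tutorial; the other two lines are placeholders for your actual DNS names):

ec2-54-164-170-249.compute-1.amazonaws.com
<dnode2 public DNS>
<dnode3 public DNS>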

Configure Data Nodes

We also need to configure the data nodes. On each of the data nodes, edit the file ~/server/hadoop-2.7.3/etc/hadoop/hdfs-site.xml

nano ~/server/hadoop-2.7.3/etc/hadoop/hdfs-site.xml

Replace this

<configuration>
</configuration>

with the following code

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hdfs/data</value>
  </property>
</configuration>
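
Because the name node already has passwordless ssh to the data nodes, an optional shortcut (a sketch that assumes the directory layout used in this tutorial) is to write this data-node version of hdfs-site.xml once on the name node and push it to all three data nodes with scp, instead of editing the file three times by hand:

# run on the name node: write the data-node hdfs-site.xml to a temporary file
cat > /tmp/hdfs-site-datanode.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hdfs/data</value>
  </property>
</configuration>
EOF

# copy it to each data node over the passwordless ssh set up earlier
for h in dnode1 dnode2 dnode3; do
  scp /tmp/hdfs-site-datanode.xml "$h":/home/ubuntu/server/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
done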

Starting the Hadoop cluster

On the name node, format the HDFS data

 ~/server/hadoop-2.7.3/bin/hdfs namenode -format

Start the Hadoop Cluster by starting dfs and yarn

Start dfs

 ~/server/hadoop-2.7.3/sbin/start-dfs.sh

Start YARN

 ~/server/hadoop-2.7.3/sbin/start-yarn.sh

Start history server

 ~/server/hadoop-2.7.3/sbin/mr-jobhistory-daemon.sh start historyserver


Check for daemons using jps

jps

The above command should show at least the following four processes (the history server, if it started successfully, also appears as JobHistoryServer). If any of them is missing, you may have skipped a step during the setup process.

  • ResourceManager
  • NameNode
  • SecondaryNameNode
  • Jps

Running jps on the data node

On each of the data nodes, running jps should show the following three processes:

  • DataNode
  • NodeManager
  • Jps

Checking the Web UI

If everything has been set up correctly, the Web UI shows the Hadoop Cluster information at the location <nnode>:50070 (e.g. ec2-174-129-116-153.compute-1.amazonaws.com:50070)

The following screenshot shows the overview of the Hadoop cluster. We can see that the “Configured Capacity” is around 45 GB, which is the combined storage capacity of the three data nodes.

Go to the Datanodes tab to view the information of each data node.
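
The same information is also available from the command line on the name node; the HDFS admin report prints the configured capacity and details of every live data node:

~/server/hadoop-2.7.3/bin/hdfs dfsadmin -report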

Troubleshooting

Problem 1: Web UI not accessible at <nnode>:50070

If the Web UI is not showing up at that address, the most likely cause is that the name node’s security group does not allow inbound TCP connections on port 50070. So, we need to edit the security group.

To fix this, go to the name node instance

Then go to the security group

After opening the security group, we can edit the inbound rules.

Then add a new rule with the values as shown below (Custom TCP, port 50070, source set to allow access from anywhere):

This makes port 50070 accessible from the web.
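
To verify that the port is now reachable, you can request the page from your personal computer (a quick check using the example DNS from above; replace it with your own name node DNS). A 200 response means the web UI is reachable; a timeout means the rule has not taken effect.

curl -s -o /dev/null -w "%{http_code}\n" http://ec2-174-129-116-153.compute-1.amazonaws.com:50070/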

Problem 2: Data Nodes not being shown in the web UI

This issue arises if the data nodes cannot access the HDFS of the name node.

First of all, run the jps command on the data nodes to check if the “DataNode” is running. If the “DataNode” is not showing up, then you have configured the cluster incorrectly.

If the “DataNode” process is running on the data nodes, then the data nodes are unable to reach port 9000 on the name node, which is the port HDFS uses. The likely reason is that the name node and the data nodes are in separate security groups.

So, we need to make this port on the name node reachable from the data nodes by adding an inbound rule to the name node’s security group that allows traffic from the data nodes’ security group.
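
Before editing any security group, you can confirm the diagnosis from one of the data nodes with a plain bash TCP check against port 9000 (replace the DNS with your name node’s public DNS); it prints “open” if the port is reachable and “blocked” otherwise:

timeout 5 bash -c 'cat < /dev/null > /dev/tcp/ec2-54-159-202-213.compute-1.amazonaws.com/9000' \
  && echo open || echo blocked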

Note the security group of the data node

First of all, we need to note down the security group of the data node.

Go to one of the data nodes

And, note down the security group name of the data node

Here, the security group of the data node is “launch-wizard-2”

Now we need to edit the security group of the name node

Edit the security group of the name node

Go to the name node instance

Then open the security group of the name node

In the security group, click on Edit inbound rules

Then add a new rule that allows inbound TCP traffic on port 9000 with the data nodes’ security group (here, “launch-wizard-2”) as the source, as shown below:

Now, go to the web UI at <nnode>:50070 and the data nodes should appear. It might take 10–30 seconds for the changes to be reflected.

Conclusion

We have successfully set up a Hadoop cluster on AWS EC2 with one name node and three data nodes.

For more tutorials, go to https://devpercept.com

