
Deploying multinode Hadoop 2.0 cluster using Apache Ambari

October 31, 2013

The Apache Hadoop community recently made the GA release of Apache Hadoop 2.0, which is a pretty big deal. Hadoop 2.0 is basically a re-architecture and rewrite of major components of classic Hadoop, including the NextGen MapReduce framework based on Hadoop YARN and federated NameNodes. Bottom line: the architectural changes in Hadoop 2.0 allow it to scale to much larger clusters.

Deploying Hadoop manually can be a long and tedious process. I really wanted to try the new Hadoop, and I quickly realized Apache Ambari now supports the deployment of Hadoop 2.0. Apache Ambari has come a long way since last year and has really become one of my preferred Hadoop deployment tools for Hadoop 1.*.

In this article, I will go through the steps I followed to get a Hadoop 2.0 cluster running on Rackspace Public Cloud. I chose Rackspace Public Cloud because I have easy access to it, but doing it on Amazon or even on dedicated servers should be just as easy.

1. Create cloud servers on Rackspace Public Cloud.

You can create cloud servers using the Rackspace Control Panel or using their APIs directly or using any of the widely available bindings.

For the Hadoop cluster, I am using:

  • Large flavors, i.e., 8GB or above.
  • CentOS 6.* as the guest operating system.

To actually create the servers, I will use a slightly modified version of a bulk server creation script. I will create one server for Apache Ambari and a number of servers for the Apache Hadoop cluster, and then use Ambari to install Hadoop onto the cluster servers.

So basically, I have created the following servers:


and have recorded their hostnames, public/private ip addresses and root passwords for each.

2. Prepare the servers.

SSH into the newly created Ambari server, e.g., myhadoop-Ambari. Update its /etc/hosts file with an entry for each of the servers above.

Also create a hosts.txt file with the hostnames of the servers from above.

root@myhadoop-Ambari$ cat hosts.txt
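
If the addresses and hostnames recorded in step 1 are kept in one small inventory file, both hosts.txt and the /etc/hosts entries can be generated from it instead of being typed twice. The following is only a sketch under that assumption: the inventory file (called servers.list in the usage note) and both function names are hypothetical and not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical inventory format: one "ip hostname" pair per line, e.g.
#   10.0.0.11 myhadoop1
# Blank lines and lines starting with # are ignored.

# Print only the hostnames, one per line (the layout hosts.txt uses).
inventory_hostnames() {
    awk '!/^[ \t]*(#|$)/ { print $2 }' "$1"
}

# Print "ip<TAB>hostname" lines suitable for appending to /etc/hosts.
inventory_hosts_entries() {
    awk '!/^[ \t]*(#|$)/ { printf "%s\t%s\n", $1, $2 }' "$1"
}
```

With these defined, something like `inventory_hostnames servers.list > hosts.txt` followed by `inventory_hosts_entries servers.list >> /etc/hosts` would produce both files from the same source.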

At this point, from the same Ambari server, run the following script which will ssh into all of the servers specified in the hosts.txt file and set them up.

Specifically, the script will set up passwordless SSH between the servers and also disable iptables among other things.



#!/bin/bash
set -x

# Generate SSH keys
ssh-keygen -t rsa
cd ~/.ssh
cat id_rsa.pub >> authorized_keys

cd ~
# Distribute SSH keys
for host in `cat hosts.txt`; do
    cat ~/.ssh/id_rsa.pub | ssh root@$host "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys"
    cat ~/.ssh/id_rsa | ssh root@$host "cat > ~/.ssh/id_rsa; chmod 400 ~/.ssh/id_rsa"
    cat ~/.ssh/id_rsa.pub | ssh root@$host "cat > ~/.ssh/id_rsa.pub"
done

# Distribute hosts file
for host in `cat hosts.txt`; do
    scp /etc/hosts root@$host:/etc/hosts
done

# Prepare other basic things: disable SELinux, iptables and the PackageKit yum plugin
for host in `cat hosts.txt`; do
    ssh root@$host "sed -i s/SELINUX=enforcing/SELINUX=disabled/g /etc/selinux/config"
    ssh root@$host "chkconfig iptables off"
    ssh root@$host "/etc/init.d/iptables stop"
    echo "enabled=0" | ssh root@$host "cat > /etc/yum/pluginconf.d/refresh-packagekit.conf"
done

Note that this step will ask for the root password of each server before setting it up for passwordless access.
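
Before moving on, it can be worth confirming that key-based login really works on every host, since Ambari relies on it later. The function below is a minimal sketch and not part of the original scripts; its name is made up for illustration. It uses the standard OpenSSH options BatchMode=yes (fail instead of prompting for a password) and ConnectTimeout.

```shell
#!/bin/bash
# Try a non-interactive login to every host listed in the given file.
# Prints OK/FAIL per host and returns non-zero if any host failed.
check_passwordless_ssh() {
    local hosts_file="$1" failed=0 host
    while read -r host || [ -n "$host" ]; do
        [ -z "$host" ] && continue
        # -n keeps ssh from swallowing the rest of the hosts file on stdin.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "root@$host" true 2>/dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done < "$hosts_file"
    return $failed
}
```

Running `check_passwordless_ssh hosts.txt` on the Ambari server should print OK for every host; any FAIL line means the key distribution for that host needs another look.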

3. Install Ambari.

While still on the Ambari server, run the following script that will install Apache Ambari.



#!/bin/bash
set -x

# Refuse to run as a non-root user
if [[ $EUID -ne 0 ]]; then
    echo "This script must be run as root"
    exit 1
fi

# Install Ambari server
cd ~
wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/GA/ambari.repo
cp ambari.repo /etc/yum.repos.d/
yum install -y epel-release
yum repolist
yum install -y ambari-server

# Setup Ambari server
ambari-server setup -s

# Start Ambari server
ambari-server start

ps -ef | grep Ambari

Once the installation completes, you should be able to open the Ambari server's IP address in a browser (Ambari listens on port 8080 by default) and log in to its web interface.


admin/admin is the default username and password.
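
The Ambari server can take a little while to come up after `ambari-server start`. If you are scripting the whole setup, a small polling loop avoids guessing when the web interface is ready. This is a sketch only: the function name is hypothetical, and the URL in the usage note assumes the default port 8080 and the myhadoop-Ambari hostname used in this article.

```shell
#!/bin/bash
# Poll a URL until it answers with any HTTP status, or give up.
# Usage: wait_for_http URL [tries] [delay_seconds]
wait_for_http() {
    local url="$1" tries="${2:-30}" delay="${3:-5}" i code
    for ((i = 1; i <= tries; i++)); do
        # curl reports status 000 when no HTTP response was received.
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
        if [ -n "$code" ] && [ "$code" != "000" ]; then
            echo "up after $i attempt(s), HTTP $code"
            return 0
        fi
        sleep "$delay"
    done
    echo "no response from $url after $tries attempt(s)"
    return 1
}
```

For example, `wait_for_http http://myhadoop-Ambari:8080 30 5` would wait up to roughly two and a half minutes for the login page to respond.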

4. Install Hadoop.

Once logged into the Ambari web portal, it is pretty intuitive to create a Hadoop Cluster through its wizard.

It will ask for the hostnames and the SSH private key, both of which you can get from the Ambari server.

root@myhadoop-Ambari$ cat hosts.txt
root@myhadoop-Ambari$ cat ~/.ssh/id_rsa

You should be able to just follow the wizard and complete the Hadoop 2.0 installation at this point. The process to install Hadoop 1.* is almost exactly the same, although some of the services, like YARN, don’t exist there.

Apache Ambari will let you install a plethora of services including HDFS, YARN, MapReduce2, HBase, Hive, Oozie, Ganglia, Nagios, ZooKeeper, and the Hive and Pig clients. As you go through the installation wizard, you can choose which service goes on which server.

5. Validate Hadoop.

SSH to myhadoop1 and run the following script to do a word count on all the books of Shakespeare.



set -x

su hdfs - -c "hadoop fs -rmdir /shakespeare"
cd /tmp
wget http://homepages.ihug.co.nz/~leonov/shakespeare.tar.bz2
tar xjvf shakespeare.tar.bz2
now=`date +"%y%m%d-%H%M"`
su hdfs - -c "hadoop fs -mkdir -p /shakespeare"
su hdfs - -c "hadoop fs -mkdir -p /shakespeare/$now"
su hdfs - -c "hadoop fs -put /tmp/Shakespeare /shakespeare/$now/input"
su hdfs - -c "hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples- wordcount /shakespeare/$now/input /shakespeare/$now/output"
su hdfs - -c "hadoop fs -cat /shakespeare/$now/output/part-r-* | sort -nk2"
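
The last command sorts by count in ascending order, so the most frequent words end up at the very bottom of a long listing. If you only want the top few, a small filter helps. The sketch below assumes the usual wordcount output of one `word<TAB>count` pair per line; the function name is made up for illustration.

```shell
#!/bin/bash
# Read "word<TAB>count" lines on stdin and print the N most frequent words.
# Ties on count are broken alphabetically by word.
top_words() {
    local n="${1:-10}"
    sort -t $'\t' -k2,2nr -k1,1 | head -n "$n"
}
```

It can be fed straight from HDFS, for example `su hdfs - -c "hadoop fs -cat /shakespeare/$now/output/part-r-*" | top_words 20`.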

So you have your first Hadoop 2.0 cluster running and validated. Feel free to look into the scripts; they are mostly instructions from the Hortonworks docs scripted out. Have fun Hadooping!

Bash Script to Backup MySQL Databases

February 24, 2013

mysqldump is a program to do a dump of a MySQL database. It creates a .sql file, which you can then use to restore the database.

Back up a MySQL database:

mysqldump -u mysql_user -h ip -pmysql_password database_name > database_name.sql

To restore a database from the database_name.sql file:

mysql -u mysql_user -h ip -pmysql_password database_name < database_name.sql

Back up all databases on the server:
The -A (--all-databases) option dumps every database on the server in one go:

mysqldump -u mysql_user -h ip -pmysql_password -A > all_databases.sql

To restore all databases:

mysql -u mysql_user -h ip -pmysql_password < all_databases.sql

Back up a table in a MySQL database:
You can also run mysqldump at the table level:

mysqldump -u mysql_user -h ip -pmysql_password database_name table_name > table_name.sql

To restore the table to the database (the mysql client takes only the database name; the dump file itself recreates the table):

mysql -u mysql_user -h ip -pmysql_password database_name < table_name.sql

I have a bunch of MySQL databases hosted on a bunch of servers, and I have been too lazy to back them up regularly. So I wrote a quick bash script to back up the MySQL databases, creating a separate backup file for each database.

Here’s what the script does in short:
1. You provide it a list of IP addresses, usernames and passwords.
2. It will mysqldump all the databases on each host server that the user has access to.
3. It will store all the dumps in the backup_dir, compressing each dump with gzip.

Bash doesn’t really support multidimensional arrays, so I had to store the IP, username and password as a comma-separated string and split it up in each iteration. Ugly, but it gets the job done for now.
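
To make the splitting trick concrete, here is a minimal standalone sketch of it. The function name and the sample values are hypothetical. It relies on `read` with `IFS=','`, which puts whatever is left over into the last variable, so a password that itself contains commas survives intact.

```shell
#!/bin/bash
# Split one "ip,username,password" entry into its three fields.
split_server() {
    local ip user password
    # Everything after the second comma lands in $password.
    IFS=',' read -r ip user password <<< "$1"
    printf 'ip=%s user=%s password=%s\n' "$ip" "$user" "$password"
}

split_server "192.0.2.10,backup_user,s3cret"
```

Reading into three named variables instead of an array keeps the later code readable, at the cost of hard-coding the number of fields.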

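#!/bin/bash
# Script to back up all MySQL databases on different hosts.

# One "ip,username,password" entry per server.
# Placeholder value (an assumption); replace it with your own servers.
servers=(
        "192.0.2.10,backup_user,backup_password"
)

function mysql_dump() {
        local ip="$1"
        local mysql_user="$2"
        local mysql_password="$3"
        mysql_databases=`mysql -u ${mysql_user} -p${mysql_password} -h ${ip} -e "show databases" | sed '/^Database$/d'`
        for database in $mysql_databases
        do
                if [ "${database}" == "information_schema" ]; then
                        echo "Skipping $database"
                else
                        echo "Backing up ${database}"
                        mysqldump -u ${mysql_user} -p${mysql_password} -h ${ip} ${database} | gzip > "${backup_dir}/${database}.gz"
                fi
        done
}

backup_date=`date +%Y_%m_%d_%H_%M`
# Assumed layout: one timestamped directory per run.
backup_dir="backup_${backup_date}"
mkdir -p "${backup_dir}"

for server in "${servers[@]}"
do
        # Split the comma-separated entry into ip, username and password.
        IFS=',' read -r -a cols <<< "${server}"
        mysql_dump "${cols[0]}" "${cols[1]}" "${cols[2]}"
done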
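
A backup is only useful if it can be read back, and a dump that was interrupted halfway leaves a truncated .gz file behind. As a cheap sanity check after each run, `gzip -t` can test every archive in the backup directory. This is a sketch with a hypothetical function name, not part of the original script.

```shell
#!/bin/bash
# Test the integrity of every .gz file in a directory.
# Prints OK/BAD per file and returns non-zero if any archive is corrupt.
verify_dumps() {
    local dir="$1" status=0 f
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue   # directory contains no .gz files
        if gzip -t "$f" 2>/dev/null; then
            echo "OK   $(basename "$f")"
        else
            echo "BAD  $(basename "$f")"
            status=1
        fi
    done
    return $status
}
```

Calling `verify_dumps "${backup_dir}"` at the end of the backup script would flag broken archives. Note that this only checks the compression layer; it does not prove that the SQL inside restores cleanly.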