Deleting Stuck Cinder Volumes

April 15, 2016

Recently, I was using the Kilo release of OpenStack and trying to programmatically attach/detach volumes to an instance. Every once in a while, a volume would stay in the ‘in-use’ state even after the instance was destroyed. In fact, in other releases too, I have seen Cinder volumes get stuck in the in-use or error state and refuse to be deleted.

If the volume is in the ‘in-use’ status, you first have to reset it to an available state before you can issue a delete:

cinder reset-state --state available $VOLUME_ID
cinder delete $VOLUME_ID

If ‘cinder delete’ doesn’t work and you have admin privileges, you can try a force-delete:

cinder force-delete $VOLUME_ID

Maybe that will fix it, maybe it won’t. If the volume is still stuck, try going into the Cinder database and marking the volume as detached:

update volume_attachment set attach_status="detached" where id="<attachment_id>";
update volumes set attach_status="detached" where id="<volume_id>";

Once I did that, I was able to delete or force-delete any stuck volumes.
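If you are hitting this from code rather than the CLI (as in the programmatic attach/detach scenario above), the same reset-state and delete calls are available through python-cinderclient. A minimal sketch, assuming the client is installed; the credentials and Keystone endpoint below are placeholders:

from cinderclient import client

# Placeholder credentials/endpoint -- substitute your own environment's values.
cinder = client.Client('2', 'myuser', 'mypassword', 'myproject',
                       'http://keystone.example.com:5000/v2.0')

volume = cinder.volumes.get('VOLUME_ID')

# Equivalent of `cinder reset-state --state available` followed by `cinder delete`.
cinder.volumes.reset_state(volume, 'available')
cinder.volumes.delete(volume)

# With admin privileges, the equivalent of `cinder force-delete`:
# cinder.volumes.force_delete(volume)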

SQLMap – Automatic SQL Injection Testing Tool

February 3, 2016

SQLMap is to databases what the network security scanner Nmap is to networks: it scans database-backed applications for SQL injection vulnerabilities. SQLMap has been one of the favorite tools in my toolkit for a while now, but it seems not many people outside of the security space have heard of it.

SQLMap is an SQL injection testing tool that automates the process of detecting and exploiting SQL injection vulnerabilities in database servers. It’s a very powerful tool for penetration testers, but it’s also one of those tools every developer who writes code that talks to a database should learn and use.

SQLMap supports most of the popular relational databases, including MySQL, PostgreSQL, Oracle and Microsoft SQL Server. Besides detecting SQL injection, SQLMap can automatically recognize password hash formats and crack the hashes using dictionary attacks. It also lets you retrieve information from the vulnerable database once an injection point has been found.

There are a lot of bad code samples out there for people who are just getting started with web programming. Even developers who are somewhat familiar with SQL injection believe that once they parameterize their queries, they are safe, but there are many ways to get it wrong. I am not going to go into how to write parameterized queries that are safe from injection attacks. What SQLMap gives you is a tool you can point at a URL, and within a minute it will tell you whether that URL is vulnerable, so you can go back and fix the code behind it.
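As a quick illustration of the difference (a sketch using Python’s built-in sqlite3 module; the table and values are made up), the first query below is exactly the kind of string-built statement SQLMap will flag, while the second passes the value as a bound parameter:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE users (username TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('chuck', 'chuck@example.com')")

username = "x' OR '1'='1"   # attacker-controlled input

# Vulnerable: the input is spliced straight into the SQL text.
rows = conn.execute("SELECT email FROM users WHERE username = '%s'" % username).fetchall()
print(rows)   # returns every row -- the injection worked

# Parameterized: the driver binds the value, so it is treated as data, not SQL.
rows = conn.execute("SELECT email FROM users WHERE username = ?", (username,)).fetchall()
print(rows)   # returns nothing -- no user is literally named "x' OR '1'='1"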

SQLMap is a very easy tool to get started with.


SQLMap is written in Python. Assuming you have Python already installed on your system, you can install SQLMap either through git:

git clone sqlmap

Or you can download a zip or tarball

To get started, go to the sqlmap directory and find sqlmap.py. Check out what options are available using the help option:

python sqlmap.py -h


Now let’s test a URL on my website for SQL injection:

python sqlmap.py -u '<target-url>'

This launches the automated SQL injection tests and gives you the results at the end, which should be enough for basic testing.
However, you can further specify how to connect to the target URL using options such as the following:

  • --data: specify a data string to be sent through POST
  • --cookie: specify the HTTP Cookie header value
  • --random-agent: use a randomly selected HTTP User-Agent header value
  • --proxy: use a proxy to connect to the target URL
  • --tor: use the Tor anonymity network
  • --check-tor: check to see if Tor is used properly

Extract information:

Once a vulnerability has been found, you can easily extract information out of the vulnerable database. The following options are available:

  • --all: retrieve everything
  • --current-user: detect the session user
  • --current-db: detect the current database
  • --is-dba: find out if the session user is a database admin
  • --users: list database system users
  • --dbs: list databases
  • --tables: enumerate tables
  • --columns: enumerate columns
  • --dump: dump database table contents

For more details on the usage, see this wiki.

Getting started with OneOps

January 28, 2016

OneOps is an open source cloud application lifecycle management system that works with public and private cloud infrastructure providers. This means developers can write an application once and deploy it in a high availability mode across hybrid, multi-cloud environments with auto-scaling and auto-healing features. OneOps is a very good tool not just for greenfield applications, but also for forklifting legacy applications to the cloud with relative ease.

OneOps is Apache 2 licensed and was recently open sourced by WalmartLabs. In fact, Walmart’s ecommerce platform runs on a large OpenStack private cloud that is managed by OneOps.

OneOps currently supports: OpenStack, Rackspace Public cloud, AWS, and Microsoft Azure.

Getting started:
OneOps is a complicated piece of software with lots of components. In this first post, I will go through the process of getting a minimally functioning all-in-one OneOps cluster running on a server, connecting it to the Rackspace cloud, and deploying a very simple sample app on it. The goal is to get you started with the setup and playing with the different components and the UI.

I will try to cover the more complex features in future posts e.g. multi-cloud, auto-scaling, auto-healing, monitoring, alerting, CI integration, etc.

OneOps has setup scripts to deploy an all-in-one cluster in a Vagrant environment. I have tweaked the scripts a little bit so you can use them to deploy outside of Vagrant. Here I have a server on the Rackspace cloud:

  • CentOS 6
  • 15GB RAM (8GB should work too)


yum -y update; yum -y install vim git
git clone
cd setup/vagrant
VAGRANT=false bash

If the install is successful, follow the instructions printed at the end of the run and hit port 3000 to get to the OneOps portal. Sign up as a new user and log in.


Then, follow the instructions here to create an organization and add a cloud.

In my case, I created “Blog Team” as an organization and “RAX-DFW” as my cloud, choosing the rackspace-dfw cloud from the dropdown of clouds. Note the management location shown for the cloud; you will need it when configuring the inductor below.

This is the step that trips up most people getting started with their first OneOps cluster. When a cloud is first added, the admin should also create an inductor for that cloud; the inductor is the part of OneOps that actually talks to the individual clouds.

cd /opt/oneops/inductor
inductor add

This step will ask a bunch of questions. Just accept the defaults except for the couple of questions below:

Queue location? /public/oneops/clouds/rackspace-dfw
What is the authorization key? rackspacesecretkey

Also, make sure the OneOps cluster can connect to the cloud VMs; use private_ip or public_ip accordingly.

What compute attribute to use for the ip to connect (if empty defaults to private_ip)?

In case you missed anything or anything changed, you can go to /opt/oneops/inductor/clouds-available/public.oneops.clouds.rackspace-dfw/conf/ and edit any values.

There is also a line that needs to be added to the vmargs file so the inductor can talk to ActiveMQ over SSL:

vi /opt/oneops/inductor/clouds-available/public.oneops.clouds.rackspace-dfw/conf/vmargs

At this point, you should be able to restart the inductor and make sure it’s running.

inductor restart
inductor status
inductor tail

Add cloud services:
Now go back to the OneOps portal and add cloud services for the RAX-DFW cloud that you created. You will need your tenant_id, username, API key, etc., to add services like Rackspace compute, load balancer and DNS.

For DNS, if you don’t have a real registered domain, you can actually fake it with any domain name for testing, e.g.:

Cloud DNS Id*: dfw

Deploying an app:
The UI isn’t really that friendly, but follow the wizard at the top and these instructions on getting started:
1. Create assembly
2. Create platform (Choose the apache pack from the list) and commit the design
3. Create environment and deploy to the cloud.

At the end of it, you should have Apache deployed to Rackspace DFW. You can verify it by going to the IP of the Apache server created by OneOps.

TerraKube – Kubernetes on Openstack

November 14, 2015

TerraKube is the simplest way to get started with Kubernetes on OpenStack.

TerraKube is a simple tool to provision a Kubernetes cluster on top of OpenStack using HashiCorp’s Terraform. If you are unfamiliar with it, Terraform is a declarative tool for building, changing and versioning infrastructure: the desired state of the infrastructure is described in a configuration file, and Terraform takes that plan and builds the desired state. If you are familiar with AWS CloudFormation or OpenStack Heat, here’s how it compares to Heat: Terraform or Heat

TerraKube is a project that I started a few months ago while I was evaluating Kubernetes and needed a simple, quick and repeatable way to install Kubernetes on OpenStack. Keep in mind that this is a work in progress.

For the sake of this tutorial, I will assume you already have some familiarity with OpenStack and know how to use the OpenStack command line.

TerraKube Overview

What we are going to do here is install Terraform on a node, typically your workstation. TerraKube is just a Terraform configuration file called a plan. We will apply the plan, which in turn talks to OpenStack, launches instances and configures them with Kubernetes.

The Kubernetes cluster will consist of one Kubernetes Master and n number of Kubernetes nodes:

Kubernetes Master: CoreOS, etcd, kube-api, kube-scheduler, kube-controller-manager
Kubernetes Nodes: CoreOS, etcd, kube-kubelet, kube-proxy, flannel, docker


1. Install Terraform
Follow the instructions from here:

2. Upload CoreOS Image to OpenStack Glance
TerraKube deploys Kubernetes on top of CoreOS instances on OpenStack, so first we need to upload a CoreOS image to our OpenStack Image Service (Glance).

bunzip2 coreos_production_openstack_image.img.bz2
glance image-create --name CoreOs --container-format bare --disk-format qcow2 --file coreos_production_openstack_image.img --is-public True

3. Configure Terrakube

git clone
cd terrakube
mv terraform.tfvars.example terraform.tfvars

Edit your terraform.tfvars with your configuration info. Most of the configuration should be pretty straightforward.

Please note that you have to get a new etcd_discovery_url for every new cluster: the etcd_discovery_url in the terraform.tfvars file needs to be updated with a fresh value before you apply the Terraform plan.
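If you want to script that step, a new discovery URL can be fetched from the public etcd discovery service and written into terraform.tfvars. A rough sketch (the size=3 parameter and the etcd_discovery_url variable name are assumptions based on the description above):

import re
import urllib2

# Ask the public etcd discovery service for a fresh token for a 3-node cluster.
discovery_url = urllib2.urlopen('https://discovery.etcd.io/new?size=3').read().strip()

# Rewrite the etcd_discovery_url line in terraform.tfvars in place.
with open('terraform.tfvars') as f:
    tfvars = f.read()
tfvars = re.sub(r'etcd_discovery_url\s*=\s*".*"',
                'etcd_discovery_url = "%s"' % discovery_url, tfvars)
with open('terraform.tfvars', 'w') as f:
    f.write(tfvars)

print(discovery_url)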

4. Using Terrakube
Show the execution plan

terraform plan

Execute the plan

terraform apply

Once you apply the plan and wait for a few minutes, you should get an output like:

  master_ip  =
  worker_ips =

The master_ip is the Kubernetes Master and worker_ips is a list of Kubernetes nodes.

Login to the master and make sure all services are up and Kubernetes is functioning properly.

ssh core@
cd /opt/kubernetes/server/bin
./kubectl cluster-info
./kubectl get nodes

5. Running some examples
Kubernetes comes with a lot of examples that you can try out. Note that many of the examples are configured to run on top of Google Container Engine (GKE), and may not run on top of OpenStack without some tweaking. But the manifests are a pretty good starting point to learn about deploying apps on Kubernetes.

git clone ~/kubernetes

There are plenty of example applications under the examples directory; examples/guestbook is a good start.

Crawling anonymously with Tor in Python

March 5, 2014

There are a lot of valid use cases where you need to protect your identity while communicating over the public internet. By now you probably already know about Tor, and most people use Tor through the browser. The cool thing is that you can also access the Tor network programmatically, so you can build interesting tools with privacy built in.

The most common use case for hiding your identity with Tor, or for changing identities programmatically, is crawling a website like Google (well, that one is harder than you think) without being rate-limited or blocked.

This did take a fair amount of trial and error to get working, though.
First of all, let’s install Tor.

apt-get update
apt-get install tor
/etc/init.d/tor restart

You will notice that the SOCKS listener is on port 9050.

Let’s enable Tor’s ControlPort listener on port 9051. This is the port Tor listens on for communication from applications talking to the Tor controller. The hashed password enables authentication on that port so that it is not open to random access.

You can create a hashed password out of your password using:

tor --hash-password mypassword

So, update torrc with the ControlPort and the hashed password:


ControlPort 9051
HashedControlPassword 16:872860B76453A77D60CA2BB8C1A7042072093276A3D701AD684053EC4C

Restart Tor again so the configuration changes are applied.

/etc/init.d/tor restart

Next, we will install pytorctl, a Python module for interacting with the Tor controller. It lets us send commands to and receive responses from the Tor control port programmatically.

apt-get install git
apt-get install python-dev python-pip
git clone git://
pip install pytorctl/

Tor itself is not an HTTP proxy, so in order to access the Tor network over HTTP we will use Privoxy as an HTTP proxy chained to Tor’s SOCKS5 listener.

Install Privoxy.

apt-get install privoxy

Now let’s tell Privoxy to use Tor by routing all traffic through the SOCKS server at localhost port 9050.
Go to /etc/privoxy/config and enable forward-socks5:

forward-socks5 / localhost:9050 .

Restart Privoxy after making the change to the configuration file.

/etc/init.d/privoxy restart

In the script below, we’re using urllib2 with Privoxy as the proxy. Privoxy listens on port 8118 by default and forwards the traffic to port 9050, where the Tor SOCKS listener is running.
Additionally, in the renew_connection() function, I am sending a signal to the Tor controller to change identity, so you get a new identity without restarting Tor. You don’t have to change the IP, but it sometimes comes in handy when you are crawling and don’t want to be blocked based on your IP.

from TorCtl import TorCtl
import urllib2

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}

def request(url):
    # Route the request through Privoxy (port 8118), which forwards to Tor.
    def _set_urlproxy():
        proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
        opener = urllib2.build_opener(proxy_support)
        urllib2.install_opener(opener)
    _set_urlproxy()
    request = urllib2.Request(url, None, headers)
    return urllib2.urlopen(request).read()

def renew_connection():
    # Ask the Tor controller for a new identity (a fresh circuit and exit node).
    conn = TorCtl.connect(controlAddr="127.0.0.1", controlPort=9051, passphrase="your_password")
    conn.send_signal("NEWNYM")
    conn.close()

for i in range(0, 10):
    renew_connection()
    print request("")   # point this at any IP-echo service to watch the exit IP change

Running the script:


Now, watch your ip change every few seconds.

Use it, but don’t abuse it.

Dynamic Inventory with Ansible and Rackspace Cloud

March 4, 2014

Typically, with Ansible you create one or more hosts files, which it calls inventory files, and Ansible picks the servers from the hosts file and runs the playbooks against those servers. This is a simple and straightforward way to do it. However, if you are using the cloud, it’s very likely that your applications are creating and deleting servers based on some other logic, and it becomes impractical to maintain a static inventory file. In that case, Ansible can directly talk to your cloud (AWS, Rackspace, OpenStack, etc.) or a dynamic source (Cobbler, etc.) through what it calls dynamic inventory plugins, without you having to maintain a static list of servers.

Here, I will go through the process of using the Rackspace Public Cloud Dynamic Inventory Plugin with Ansible.

Install Ansible
First of all, if you have not already installed Ansible, go ahead and do so. I like to install Ansible within virtualenv using pip.

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install python-dev python-virtualenv
virtualenv env
source env/bin/activate
pip install ansible

Install Rax Dynamic Inventory Plugin
Ansible maintains an external Rackspace (RAX) inventory script, rax.py, in its repository (not sure why these plugins do not get bundled with the Ansible package). The script depends on the pyrax module, which is the Python client binding for the Rackspace Cloud.

pip install pyrax
chmod +x rax.py

The script needs a configuration file named ~/.rackspace_cloud_credentials, which will store your auth credentials to Rackspace Cloud.

cat ~/.rackspace_cloud_credentials
[rackspace_cloud]
username = <username>
api_key = <apikey>

As you can see, rax.py is a very simple script that provides a couple of options to list and show the servers in your cloud. By default, it grabs the servers in all Rackspace regions; if you are interested in only one region, you can specify RAX_REGION.

./rax.py --list
RAX_REGION=DFW ./rax.py --list
RAX_REGION=DFW ./rax.py --host some-cloud-server
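Under the hood this follows Ansible’s dynamic inventory contract: --list prints a JSON document mapping group names (regions, plus any metadata group values, as used below) to lists of hosts, and --host prints the variables for a single host. A rough sketch of how to poke at that output yourself (assuming the script is saved as rax.py in the current directory and the credentials file above is in place):

import json
import subprocess

# Run the inventory script the same way Ansible would and parse its JSON output.
inventory = json.loads(subprocess.check_output(['./rax.py', '--list']))

print(inventory.keys())                  # group names, e.g. regions and metadata groups
print(inventory.get('staging-apache'))   # hosts tagged with group=staging-apache (created below)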

Create Cloud Servers
Since you already have pyrax installed as a dependency of the inventory plugin, you can use the nova command line to create a cloud server named ‘staging-apache1’ and tag it with the staging-apache group using the metadata key-value feature.

export OS_USERNAME=<username>
export OS_PASSWORD=<apikey>
export OS_TENANT_NAME=<username>
export OS_AUTH_SYSTEM=rackspace
export OS_AUTH_URL=
nova keypair-add --pub-key ~/.ssh/ stagingkey
nova boot --image 80fbcb55-b206-41f9-9bc2-2dd7aac6c061 --flavor 2 --meta group=staging-apache --key-name stagingkey staging-apache1
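The same thing can also be done from Python with pyrax itself, which is handy when your application is the one creating and deleting servers. A minimal sketch mirroring the nova boot command above (the image and flavor IDs are the ones used there; credentials are placeholders):

import pyrax

# Authenticate against the Rackspace identity service.
pyrax.set_setting('identity_type', 'rackspace')
pyrax.set_credentials('<username>', '<apikey>', region='DFW')

cs = pyrax.cloudservers

# Boot a server and tag it with a group via metadata, like `nova boot --meta group=...`.
server = cs.servers.create('staging-apache1',
                           '80fbcb55-b206-41f9-9bc2-2dd7aac6c061',   # image id
                           '2',                                      # flavor id
                           meta={'group': 'staging-apache'},
                           key_name='stagingkey')
print(server.status)

Either way, the server shows up under the staging-apache group the next time the inventory script runs.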

If you want to install Apache on more staging servers, you would create a server named staging-apache2 and tag it with the same group name, staging-apache.

Also note that we are injecting SSH keys into the servers on creation, so Ansible will be able to log in over SSH without a password. With Ansible, you also have the option of using username/password authentication if you choose.

Once the server is booted, let’s make sure Ansible can ping all the servers tagged with the staging-apache group.

ansible -i rax.py staging-apache -u root -m ping

Run a sample playbook
Now, let’s create a very simple playbook to install Apache on the inventory.

$ cat apache.yml
- hosts: staging-apache
  tasks:
      - name: Installs apache web server
        apt: pkg=apache2 state=installed update_cache=true

Let’s run the Apache playbook against all RAX servers in the DFW region that match the hosts in the staging-apache group.

RAX_REGION=DFW ansible-playbook -i rax.py apache.yml

With static inventory, you’d be doing this instead, and manually updating the hosts file:

ansible-playbook -i hosts apache.yml

Now you can ssh into the staging-apache1 server and make sure everything is configured as per your playbook.

ssh -i ~/.ssh/id_rsa root@staging-apache1

You may add more servers to the staging-apache group, and on the next run, ansible will detect the updated inventory dynamically and run the playbooks.

The Rackspace Public Cloud is based on OpenStack Nova, so the OpenStack inventory plugin works pretty much the same way. You can look at the complete list of dynamic inventory plugins here. Adding a new inventory plugin, say for Razor, that isn’t already there would be fairly simple.

Getting started with CQL – Cassandra Query Language

February 14, 2014

One of the most intriguing things about NoSQL databases for beginners is the data modeling. The Cassandra Query Language (CQL) provides a query language similar to the Structured Query Language (SQL) that is standard in RDBMSes like MySQL.

Cassandra first experimented with CQL 1 in the Cassandra 0.8 release; it was a very basic implementation and used the underlying Thrift protocol.

CQL 2.0 made a lot of improvements but still used the Thrift protocol, which meant developers still had to know the internals of the underlying Cassandra data structures.
The current CQL spec, 3.0, became the default in Apache Cassandra 1.2 and supports all the datatypes available in Cassandra. It is not backwards compatible with CQL 2 and uses Cassandra’s native protocol. Instead of Cassandra terms like column families, it uses terms like tables, making it more familiar to SQL developers. Going forward, CQL will be the default and only way to interact with the Cassandra storage system.

Here’s a bunch of CQL queries that I collected while playing with CQL. If you have a Cassandra cluster, log in to a node that has cqlsh and issue the command:

cqlsh cassandra1 9160 -f cqls.txt

where cassandra1 is the IP of one of the Cassandra nodes.
Alternatively, you can also get inside cqlsh and source the query file:

cqlsh cassandra1 9160
source ~/cqls.txt

Here is the cqls.txt. Feel free to play with it.

-- Information about the cassandra cluster



DESCRIBE TABLE system.schema_keyspaces;

SELECT * FROM system.schema_keyspaces;
SELECT * FROM system.local;
SELECT * FROM system.peers;
SELECT * FROM system.schema_columns;
SELECT * FROM system.schema_columnfamilies;
SELECT * FROM system.schema_keyspaces  WHERE keyspace_name='system';

-- Consistency level

-- Tracing a request
TRACING ON;
SELECT * FROM system.schema_keyspaces  WHERE keyspace_name='system';

-- Keyspace
CREATE KEYSPACE web WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'};
CREATE KEYSPACE IF NOT EXISTS web WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'};
ALTER KEYSPACE web WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '2'};

SELECT * FROM system.schema_keyspaces WHERE keyspace_name='web';

USE web;

-- Create Table
CREATE TABLE users (
        username text,
        first_name text,
        last_name varchar,
        skills set<text>,
        PRIMARY KEY(username)
) WITH comment='Author records';

-- Alter table
ALTER TABLE users ADD phone_numbers list<text>;
ALTER TABLE users ADD skill_levels map<text, int>;
ALTER TABLE users ALTER last_name TYPE text;

ALTER TABLE users ADD nickname text;
ALTER TABLE users DROP nickname;

-- Entry for the table in system keyspace
SELECT * FROM system.schema_columnfamilies WHERE keyspace_name='web' AND columnfamily_name='users';

-- Insert
INSERT INTO users (username, first_name, last_name, skills) VALUES
('cnorris','chuck','norris', {'Java', 'Python'});
INSERT INTO users (username, first_name, last_name, skills) VALUES
('jskeet','jon','skeet', {'XML', 'HTML'}) USING TTL 3600;

-- Update
UPDATE users USING TTL 7200 SET first_name='jon1', last_name='skeet1' WHERE username='jskeet';
UPDATE users SET skills = skills + {'Bash'} WHERE username='cnorris';
UPDATE users SET skills = {} WHERE username='cnorris';
UPDATE users SET skill_levels = {'Java':10, 'Python':10} WHERE username='cnorris';
UPDATE users SET skill_levels['Java']=9 where username='cnorris';
UPDATE users SET phone_numbers = ['123456789', '987654321', '123456789'] +
phone_numbers WHERE username='cnorris';

-- Select
-- Select using a primary key
SELECT * FROM users WHERE username='cnorris';
SELECT username, first_name, last_name FROM users WHERE first_name='chuck';
SELECT * FROM users WHERE username in ('cnorris', 'jskeet');
SELECT username AS userName, first_name AS firstName, last_name AS lastName FROM users;

-- If I want to query by a column other than by a primary key, index it first
CREATE INDEX on users (first_name);
SELECT * FROM users WHERE first_name='chuck';

-- Index
CREATE INDEX IF NOT EXISTS last_name_index ON users (last_name);
DROP INDEX last_name_index;
DROP INDEX IF EXISTS last_name_index;

-- Count
SELECT COUNT(*) AS user_count FROM users;

-- Limit
SELECT * FROM users limit 1;

-- Allow filtering
SELECT * FROM users WHERE first_name='chuck' AND last_name='norris' ALLOW FILTERING;

-- Export
COPY users TO 'users.csv';
COPY users (username, first_name, last_name) TO 'users-select.txt';

-- Delete
DELETE FROM users WHERE username='cnorris';
DELETE FROM users WHERE username in ('cnorris', 'jskeet');
DELETE skills FROM users WHERE username='jskeet';
DELETE skill_levels['Java'] FROM users where username='jskeet';
-- Permanently remove all data from the table
TRUNCATE users;

-- Import
COPY users FROM 'users.csv';
COPY users (username, first_name, last_name) FROM 'users-select.txt';

-- Batch
BEGIN BATCH
  INSERT INTO users (username, first_name, last_name, skills) VALUES
  ('jbrown','johnny','brown', {'Java', 'Python'});
  UPDATE users SET skills = {} WHERE username='jbrown';
  DELETE FROM users WHERE username='jbrown';
APPLY BATCH;

-- Composite keys and order by
CREATE TABLE timeline (
    username text,
    posted_month int,
    body text,
    posted_by text,
    PRIMARY KEY (username, posted_month)
) WITH comment='Timeline records'
  AND compaction = { 'class' : 'LeveledCompactionStrategy' };

INSERT INTO timeline (username, posted_month, body, posted_by) VALUES ('jbrown',
1,'This is important', 'stevej');
SELECT * FROM timeline where username='jbrown' and posted_month=1;
SELECT * FROM timeline where username='jbrown' order by posted_month desc;
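If you want to run the same kind of queries from application code rather than cqlsh, the DataStax Python driver speaks CQL 3 over the native protocol. A minimal sketch, assuming the driver is installed (pip install cassandra-driver) and the web keyspace from the script above exists:

from cassandra.cluster import Cluster

# Connect to the cluster using one of the nodes as a contact point.
cluster = Cluster(['cassandra1'])
session = cluster.connect('web')

# Values are passed as bind parameters, not spliced into the CQL string.
session.execute(
    "INSERT INTO users (username, first_name, last_name) VALUES (%s, %s, %s)",
    ('jdoe', 'john', 'doe'))

for row in session.execute("SELECT username, first_name, last_name FROM users"):
    print("%s %s %s" % (row.username, row.first_name, row.last_name))

cluster.shutdown()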


Parsing SQL with pyparsing

November 1, 2013

Recently, I was working on a NoSQL database and wanted to expose a SQL interface to it so I could use it just like an RDBMS from my application. Not being very familiar with the Python ecosystem, I searched around and found a Python library called pyparsing.

Now, if you know anything about parsing, you know regexes and traditional lex-style parsers can get complicated very quickly. But after playing with pyparsing for a few minutes, I realized it makes it really easy to write and execute grammars. Pyparsing has a good API, handles whitespace well, makes debugging easy and has good documentation.
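To get a feel for the API before the SQL grammar below, here is the classic minimal example: build a grammar by combining elements, then call parseString on it.

from pyparsing import Word, alphas

# A greeting is a word, a comma, another word and an exclamation mark.
greeting = Word(alphas) + "," + Word(alphas) + "!"

print(greeting.parseString("Hello, World!"))
# -> ['Hello', ',', 'World', '!']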

The code below doesn’t cover all the edge cases of the full documented SQL grammar, but it was good enough for my use case and a good excuse to learn pyparsing anyway.

Install the pyparsing Python module:

pip install pyparsing

Here is my script:

from pyparsing import CaselessKeyword, delimitedList, Each, Forward, Group, \
        Optional, Word, alphas, alphanums, nums, oneOf, ZeroOrMore, quotedString

keywords = ["select", "from", "where", "group by", "order by", "and", "or"]
[select, _from, where, groupby, orderby, _and, _or] = [ CaselessKeyword(word)
        for word in keywords ]

table = column = Word(alphas)
columns = Group(delimitedList(column))
columnVal = (Word(nums) | quotedString)

whereCond = (column + oneOf("= != < > >= <=") + columnVal)
whereExpr = whereCond + ZeroOrMore((_and | _or) + whereCond)

selectStmt = Forward().setName("select statement")
selectStmt << (select +
        ('*' | columns).setResultsName("columns") +
        _from +
        table.setResultsName("table") +
        Optional(where + Group(whereExpr), '').setResultsName("where").setDebug(False) +
        Each([Optional(groupby + columns("groupby"),'').setDebug(False),
            Optional(orderby + columns("orderby"),'').setDebug(False)
        ]))

def log(sql, parsed):
    print "##################################################"
    print sql
    print parsed.table
    print parsed.columns
    print parsed.where
    print parsed.groupby
    print parsed.orderby

sqls = [
        """select * from users where username='johnabc'""",
        """SELECT * FROM users WHERE username='johnabc'""",
        """SELECT * FRom users""",
        """SELECT * FRom USERS""",
        """SELECT * FROM users WHERE username='johnabc' or email=''""",
        """SELECT id, username, email FROM users WHERE username='johnabc' order by email, id""",
        """SELECT id, username, email FROM users WHERE username='johnabc' group by school""",
        """SELECT id, username, email FROM users WHERE username='johnabc' group by city, school order by firstname, lastname"""

for sql in sqls:
    log(sql, selectStmt.parseString(sql))

To run the script


As soon as I wrote my crappy little version and blogged about it, I found a much more complete SQL parser example written by Paul McGuire, the author of pyparsing himself. Oh well!

Deploying multinode Hadoop 2.0 cluster using Apache Ambari

October 31, 2013

The Apache Hadoop community recently made the GA release of Apache Hadoop 2.0, which is a pretty big deal. Hadoop 2.0 is basically a re-architecture and rewrite of major components of classic Hadoop, including the next-generation MapReduce framework based on Hadoop YARN and federated NameNodes. Bottom line, the architectural changes in Hadoop 2.0 allow it to scale to much larger clusters.

Deploying Hadoop manually can be a long and tedious process. I really wanted to try the new Hadoop, and I quickly realized Apache Ambari now supports the deployment of Hadoop 2.0. Apache Ambari has come a long way since last year and had already become one of my preferred deployment tools for Hadoop 1.x.

In this article, I will go through the steps I followed to get a Hadoop 2.0 cluster running on the Rackspace Public Cloud. I chose Rackspace simply because I have easy access to it, but doing this on Amazon or even dedicated servers should be just as easy.

1. Create cloud servers on Rackspace Public Cloud.

You can create cloud servers using the Rackspace Control Panel or using their APIs directly or using any of the widely available bindings.

For Hadoop cluster, I am using:

  • Large flavors, i.e. 8GB RAM or above.
  • CentOS 6.x as the guest operating system.

To actually create the servers, I will use a slightly modified version of my bulk server creation script. I will create one server for Apache Ambari and a number of servers for the Hadoop cluster, and then use Ambari to install Hadoop onto the cluster servers.

So basically, I have created one server for Ambari (myhadoop-Ambari) and a handful of Hadoop servers (myhadoop1, myhadoop2, and so on), and have recorded the hostnames, public/private IP addresses and root passwords for each.

2. Prepare the servers.

SSH into the newly created Ambari server, e.g. myhadoop-Ambari, and update its /etc/hosts file with an entry for each of the servers above.

Also create a hosts.txt file with the hostnames of the servers from above.

root@myhadoop-Ambari$ cat hosts.txt

At this point, from the same Ambari server, run the following script which will ssh into all of the servers specified in the hosts.txt file and set them up.

Specifically, the script will set up passwordless SSH between the servers and also disable iptables among other things.


set -x

# Generate SSH keys
ssh-keygen -t rsa
cd ~/.ssh
cat id_rsa.pub >> authorized_keys

cd ~
# Distribute SSH keys
for host in `cat hosts.txt`; do
    cat ~/.ssh/id_rsa.pub | ssh root@$host "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys"
    cat ~/.ssh/id_rsa | ssh root@$host "cat > ~/.ssh/id_rsa; chmod 400 ~/.ssh/id_rsa"
    cat ~/.ssh/id_rsa.pub | ssh root@$host "cat > ~/.ssh/id_rsa.pub"
done

# Distribute hosts file
for host in `cat hosts.txt`; do
    scp /etc/hosts root@$host:/etc/hosts
done

# Prepare other basic things
for host in `cat hosts.txt`; do
    ssh root@$host "sed -i s/SELINUX=enforcing/SELINUX=disabled/g /etc/selinux/config"
    ssh root@$host "chkconfig iptables off"
    ssh root@$host "/etc/init.d/iptables stop"
    echo "enabled=0" | ssh root@$host "cat > /etc/yum/pluginconf.d/refresh-packagekit.conf"
done

Note that this step will ask for the root password of each server before setting them up for passwordless access.

3. Install Ambari.

While still on the Ambari server, run the following script that will install Apache Ambari.


set -x

if [[ $EUID -ne 0 ]]; then
    echo "This script must be run as root"
    exit 1
fi

# Install Ambari server
cd ~
cp ambari.repo /etc/yum.repos.d/
yum install -y epel-release
yum repolist
yum install -y ambari-server

# Setup Ambari server
ambari-server setup -s

# Start Ambari server
ambari-server start

ps -ef | grep Ambari

Once the installation completes, you should be able to browse to the Ambari server’s IP address (the web interface listens on port 8080 by default) and access its web interface.


admin/admin is the default username and password.

4. Install Hadoop.

Once logged into the Ambari web portal, it is pretty intuitive to create a Hadoop Cluster through its wizard.

It will ask for the hostnames and the SSH private key, both of which you can get from the Ambari server:

root@myhadoop-Ambari$ cat hosts.txt
root@myhadoop-Ambari$ cat ~/.ssh/id_rsa

You should be able to just follow the wizard and complete the Hadoop 2.0 installation at this point. The process to install Hadoop 1.x is almost exactly the same, although some of the services, like YARN, don’t exist there.

Apache Ambari will let you install a plethora of services including HDFS, YARN, MapReduce2, HBase, HIVE, Oozie, Ganglia, Nagios, ZooKeeper and Hive and Pig clients. As you go through the installation wizard, you can choose what service goes on which server.

5. Validate Hadoop:

SSH to myhadoop1 and run the following script to do a wordcount on all the works of Shakespeare.


set -x

su hdfs - -c "hadoop fs -rmdir /shakespeare"
cd /tmp
tar xjvf shakespeare.tar.bz2
now=`date +"%y%m%d-%H%M"`
su hdfs - -c "hadoop fs -mkdir -p /shakespeare"
su hdfs - -c "hadoop fs -mkdir -p /shakespeare/$now"
su hdfs - -c "hadoop fs -put /tmp/Shakespeare /shakespeare/$now/input"
su hdfs - -c "hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples- wordcount /shakespeare/$now/input /shakespeare/$now/output"
su hdfs - -c "hadoop fs -cat /shakespeare/$now/output/part-r-* | sort -nk2"

So you have your first Hadoop 2.0 cluster running and validated. Feel free to look into the scripts; it’s mostly the instructions from the Hortonworks docs scripted out. Have fun Hadooping!

Bulk creating Rackspace cloud servers using Script

October 31, 2013

I keep having to create a large number of cloud servers on the Rackspace Cloud so I can play with things like Hadoop and Cassandra.
Using the control panel to create one server at a time, record each login password and IP, and wait until each server goes active gets really tedious very quickly.
So here’s a little script that installs the REST API Python binding ‘rackspace-novaclient’ on an Ubuntu server, prompts you for the image, flavor and number of servers to create, and then goes and creates the servers.

On an Ubuntu server, first export your Rackspace Cloud auth credentials (either as root or as a sudo user):

export OS_USERNAME=<username>
export OS_PASSWORD=<apikey>
export OS_TENANT_NAME=<username>
export OS_AUTH_SYSTEM=rackspace
export OS_AUTH_URL=

Here is the actual script:


set -x

# Install the Client
if [[ $EUID -ne 0 ]]; then
	sudo apt-get update
	sudo apt-get install python-dev python-pip python-virtualenv
else
	apt-get update
	apt-get install python-dev python-pip python-virtualenv
fi

virtualenv ~/.env
source ~/.env/bin/activate
pip install pbr
pip install python-novaclient
pip install rackspace-novaclient

# Read AUTH Credentials
: ${OS_USERNAME:?"Need to set OS_USERNAME non-empty"}
: ${OS_PASSWORD:?"Need to set OS_PASSWORD non-empty"}
: ${OS_TENANT_NAME:?"Need to set OS_TENANT_NAME non-empty"}
: ${OS_AUTH_SYSTEM:?"Need to set OS_AUTH_SYSTEM non-empty"}
: ${OS_AUTH_URL:?"Need to set OS_AUTH_URL non-empty"}
: ${OS_REGION_NAME:?"Need to set OS_REGION_NAME non-empty"}
: ${OS_NO_CACHE:?"Need to set OS_NO_CACHE non-empty"}

# Write credentials to a file
cat > ~/novarc << EOF
export OS_USERNAME=$OS_USERNAME
export OS_PASSWORD=$OS_PASSWORD
export OS_TENANT_NAME=$OS_TENANT_NAME
export OS_AUTH_SYSTEM=$OS_AUTH_SYSTEM
export OS_AUTH_URL=$OS_AUTH_URL
export OS_REGION_NAME=$OS_REGION_NAME
export OS_NO_CACHE=$OS_NO_CACHE
EOF

# Prompt for what to build
read -p "Cluster name (prefix for the server names): " CLUSTER_NAME
read -p "Number of servers: " CLUSTER_SIZE
read -p "Image id: " IMAGE_ID
read -p "Flavor id: " FLAVOR_ID

# Boot the servers, saving the generated root password of each
for i in $(eval echo "{1..$CLUSTER_SIZE}"); do
	nova boot --image $IMAGE_ID --flavor $FLAVOR_ID $CLUSTER_NAME$i | grep adminPass >> 'server_passwords.txt'
done

is_not_active() {
	status=`nova show $1 | grep 'status' | awk '{print $4}'`
	if [ "$status" != "ACTIVE" ] && [ "$status" != "ERROR" ]; then
		echo "$1 in $status"
		return 0
	else
		return 1
	fi
}

# Wait for all the instances to go ACTIVE or ERROR
while true; do
	READY=1
	for i in $(eval echo "{1..$CLUSTER_SIZE}"); do
		if is_not_active $CLUSTER_NAME$i; then
			READY=0
		fi
	done
	echo "READY is $READY"
	if [ "$READY" -eq "1" ]; then
		break
	fi
	sleep 5
done

# Record the hostnames
for i in $(eval echo "{1..$CLUSTER_SIZE}"); do
	echo $CLUSTER_NAME$i >> 'hosts.txt'
done
cat hosts.txt

# Record the private and public ip of each server
record_ip() {
	private_ip=`nova show $1 | grep 'private network' | awk '{print $5}'`
	public_ip=`nova show $1 | grep 'accessIPv4' | awk '{print $4}'`
	echo $private_ip $1 >> 'etc_hosts.txt'
	echo $public_ip $1 >> 'etc_hosts.txt'
}

for i in $(eval echo "{1..$CLUSTER_SIZE}"); do record_ip $CLUSTER_NAME$i; done

cat etc_hosts.txt


Then, execute the script


Alternatively, I have put the script on GitHub, so you can also curl it and pipe it to bash.

bash <(curl -s <script-url>)

The script will wait until all the servers go to an active status and will save the IPs, hostnames and passwords for the servers into these three files:

  • etc_hosts.txt
  • hosts.txt
  • server_passwords.txt