Crawling anonymously with Tor in Python

March 5, 2014

There are a lot of valid use cases where you need to protect your identity while communicating over the public internet. It is 2014, so you probably already know about Tor. Most people use Tor through the browser. The cool thing is that you can also get access to the Tor network programmatically, so you can build interesting tools with privacy built into them.

The most common use case for hiding your identity with Tor, or changing identities programmatically, is when you are crawling a website like Google (well, this one is harder than you think) and you don't want to be rate-limited or forbidden.

This did take a fair amount of trial and error to get working, though.
First of all, let's install Tor.

apt-get update
apt-get install tor
/etc/init.d/tor restart

You will notice that the SOCKS listener is on port 9050.

Let's enable the ControlPort listener for Tor on port 9051. This is the port Tor listens on for communication from applications talking to the Tor controller. The hashed password enables authentication on the port, to prevent random access to it.

You can create a hashed password out of your password using:

tor --hash-password mypassword

So, update the torrc with the port and the hashed password.


ControlPort 9051
HashedControlPassword 16:872860B76453A77D60CA2BB8C1A7042072093276A3D701AD684053EC4C

Restart Tor again so the configuration changes are applied.

/etc/init.d/tor restart

Next, we will install pytorctl which is a python based module to interact with the Tor Controller. This lets us send and receive commands from the Tor Control port programmatically.

apt-get install git
apt-get install python-dev python-pip
git clone git://
pip install pytorctl/

Tor itself is not an HTTP proxy. So in order to get access to the Tor network, we will use Privoxy as an HTTP proxy that forwards requests through Tor's SOCKS5 port.

Install Privoxy.

apt-get install privoxy

Now let's tell Privoxy to use Tor. This will tell Privoxy to route all traffic through the SOCKS server at localhost port 9050.
Go to /etc/privoxy/config and enable forward-socks5:

forward-socks5 / localhost:9050 .

Restart Privoxy after making the change to the configuration file.

/etc/init.d/privoxy restart

In the script below, we're using urllib2 through the proxy. Privoxy listens on port 8118 by default, and forwards the traffic to port 9050, on which the Tor SOCKS listener is running.
Additionally, in the renew_connection() function, I am also sending a signal to the Tor controller to change identity, so you get new identities without restarting Tor. You don't have to change the IP, but sometimes it comes in handy when you are crawling and don't want to be blocked based on IP.

from TorCtl import TorCtl
import urllib2

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv: Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}

def _set_urlproxy():
    # Route all HTTP traffic through Privoxy on port 8118
    proxy_support = urllib2.ProxyHandler({"http": ""})
    opener = urllib2.build_opener(proxy_support)
    urllib2.install_opener(opener)

def request(url):
    _set_urlproxy()
    request = urllib2.Request(url, None, headers)
    return urllib2.urlopen(request).read()

def renew_connection():
    # Signal the Tor controller to switch to a new identity (new circuit, new exit IP)
    conn = TorCtl.connect(controlAddr="", controlPort=9051, passphrase="your_password")
    conn.send_signal("NEWNYM")
    conn.close()

for i in range(0, 10):
    renew_connection()
    print request("")  # any what-is-my-ip service works here

Running the script:


Now, watch your ip change every few seconds.

Use it, but don’t abuse it.

Dynamic Inventory with Ansible and Rackspace Cloud

March 4, 2014

Typically, with Ansible you create one or more hosts files, which it calls inventory files, and Ansible picks the servers from a hosts file and runs the playbooks against those servers. This is a simple and straightforward way to do it. However, if you are using the cloud, it's very likely that your applications are creating and deleting servers based on some other logic, and it's very impractical to maintain a static inventory file. In that case, Ansible can directly talk to your cloud (AWS, Rackspace, OpenStack, etc.) or a dynamic source (Cobbler, etc.) through what it calls dynamic inventory plugins, without you having to maintain a static list of servers.
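Under the hood, a dynamic inventory plugin is just an executable that Ansible invokes with --list (and --host <hostname>), expecting group-to-hosts JSON on stdout. A minimal sketch of that contract (the group and host names here are made up for illustration; a real plugin would query the cloud API):

```python
import json
import sys

# Hard-coded inventory for illustration; a real plugin would query
# the cloud provider's API here instead.
INVENTORY = {"staging-apache": ["staging-apache1", "staging-apache2"]}

def inventory_json(args):
    # Ansible invokes the script with --list for the full inventory,
    # or --host <name> for that single host's variables.
    if args and args[0] == "--list":
        return json.dumps(INVENTORY)
    if len(args) == 2 and args[0] == "--host":
        return json.dumps({})  # no per-host variables in this sketch
    return None

if __name__ == "__main__":
    print(inventory_json(sys.argv[1:]))
```

Make the file executable and pass it to ansible with -i, exactly like a static hosts file.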

Here, I will go through the process of using the Rackspace Public Cloud Dynamic Inventory Plugin with Ansible.

Install Ansible
First of all, if you have not already installed Ansible, go ahead and do so. I like to install Ansible within virtualenv using pip.

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install python-dev python-virtualenv
virtualenv env
source env/bin/activate
pip install ansible

Install Rax Dynamic Inventory Plugin
Ansible maintains an external inventory script on its repository (not sure why these plugins do not get bundled with the Ansible package). The script depends on the pyrax module, the client binding for the Rackspace Cloud.

pip install pyrax
chmod +x

The script needs a configuration file named ~/.rackspace_cloud_credentials, which will store your auth credentials to Rackspace Cloud.

cat ~/.rackspace_cloud_credentials
[rackspace_cloud]
username = <username>
api_key = <apikey>

 is a very simple script that provides a couple of methods to list and show servers in your cloud. By default, it grabs the servers in all Rackspace regions. If you are interested in only one region, you can specify the RAX_REGION environment variable.

./ --list
RAX_REGION=DFW ./ --list
RAX_REGION=DFW ./ --host some-cloud-server

Create Cloud Servers
Since you already have pyrax installed as a dependency of the inventory plugin, you can use the nova command line to create a cloud server named 'staging-apache1' and tag it into the staging-apache group using the metadata key-value feature.

export OS_USERNAME=<username>
export OS_PASSWORD=<apikey>
export OS_TENANT_NAME=<username>
export OS_AUTH_SYSTEM=rackspace
export OS_AUTH_URL=
nova keypair-add --pub-key ~/.ssh/ stagingkey
nova boot --image 80fbcb55-b206-41f9-9bc2-2dd7aac6c061 --flavor 2 --meta group=staging-apache --key-name stagingkey staging-apache1

If you want to install Apache on more staging servers, you would create a server named staging-apache2 and tag it with the same group name, staging-apache.

Also note, we are injecting SSH keys into the servers on creation, so Ansible will be able to do passwordless SSH login. With Ansible, you also have the option of using a username and password if you so choose.

Once the server is booted, let's make sure Ansible can ping all the servers tagged with the group staging-apache.

ansible -i staging-apache -u root -m ping

Run a sample playbook
Now, lets create a very simple playbook to install apache on the inventory.

$ cat apache.yml
- hosts: staging-apache
  tasks:
      - name: Installs apache web server
        apt: pkg=apache2 state=installed update_cache=true

Let's run the apache playbook on all rax servers in the region DFW that match the hosts in the group staging-apache.

RAX_REGION=DFW ansible-playbook -i apache.yml

With static inventory, you’d be doing this instead, and manually updating the hosts file:

ansible-playbook -i hosts apache.yml

Now you can ssh into the staging-apache1 server and make sure everything is configured as per your playbook.

ssh -i ~/.ssh/id_rsa root@staging-apache1

You may add more servers to the staging-apache group, and on the next run, ansible will detect the updated inventory dynamically and run the playbooks.

Rackspace Public Cloud is based off of OpenStack Nova, so should work pretty much the same against a plain OpenStack cloud. You can look at the complete list of dynamic inventory plugins in the Ansible repository. Adding a new inventory plugin, for, say, Razor, that isn't already there would be fairly simple.

Getting started with CQL – Cassandra Query Language

February 14, 2014

One of the most intriguing things about NoSQL databases for beginners is the data modeling. Cassandra Query Language (CQL) provides a query language that is similar to the Structured Query Language (SQL) that is the standard in many RDBMS like MySQL.

Cassandra first experimented with CQL 1 in the Cassandra 0.8 release; it was a very basic implementation and used the underlying Thrift protocol.

CQL 2.0 made a lot of improvements but still used the Thrift protocol, which meant developers still had to know the internals of the underlying Cassandra data structures.
The current CQL spec, 3.0, became the default in Apache Cassandra 1.2 and supports all the datatypes available in Cassandra. It is not backwards compatible with CQL 2, and uses Cassandra's native protocol. Instead of Cassandra terms like column families, it uses terms like tables, making it more familiar to SQL developers. Going forward, CQL will be the default and only way to interact with the Cassandra storage system.

Here's a bunch of CQL queries I collected while playing with CQL. If you have a Cassandra cluster, log in to a node that has cqlsh and issue the command:

cqlsh cassandra1 9160 -f cqls.txt

where cassandra1 is the IP of one of the Cassandra nodes.
Alternatively, you can also get inside cqlsh and source the query file.

cqlsh cassandra1 9160
source ~/cqls.txt

Here is the cqls.txt. Feel free to play with it.

-- Information about the cassandra cluster
DESCRIBE CLUSTER;
DESCRIBE KEYSPACES;

DESCRIBE TABLE system.schema_keyspaces;

SELECT * FROM system.schema_keyspaces;
SELECT * FROM system.local;
SELECT * FROM system.peers;
SELECT * FROM system.schema_columns;
SELECT * FROM system.schema_columnfamilies;
SELECT * FROM system.schema_keyspaces  WHERE keyspace_name='system';

-- Consistency level
CONSISTENCY;

-- Tracing a request
TRACING ON;
SELECT * FROM system.schema_keyspaces  WHERE keyspace_name='system';
TRACING OFF;

-- Keyspace
CREATE KEYSPACE web WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'};
CREATE KEYSPACE IF NOT EXISTS web WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'};
ALTER KEYSPACE web WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '2'};

SELECT * FROM system.schema_keyspaces WHERE keyspace_name='web';

USE web;

-- Create Table
CREATE TABLE users (
        username text,
        first_name text,
        last_name varchar,
        skills set<text>,
        PRIMARY KEY(username)
) WITH comment='Author records';

-- Alter table
ALTER TABLE users ADD phone_numbers list<text>;
ALTER TABLE users ADD skill_levels map<text, int>;
ALTER TABLE users ALTER last_name TYPE text;

ALTER TABLE users ADD nickname text;
ALTER TABLE users DROP nickname;

-- Entry for the table in system keyspace
SELECT * FROM system.schema_columnfamilies WHERE keyspace_name='web' AND
columnfamily_name='users';

-- Insert
INSERT INTO users (username, first_name, last_name, skills) VALUES
('cnorris','chuck','norris', {'Java', 'Python'});
INSERT INTO users (username, first_name, last_name, skills) VALUES
('jskeet','jon','skeet', {'XML', 'HTML'}) USING TTL 3600;

-- Update
UPDATE users USING TTL 7200 SET first_name='jon1', last_name='skeet1' WHERE username='jskeet';
UPDATE users SET skills = skills + {'Bash'} WHERE username='cnorris';
UPDATE users SET skills = {} WHERE username='cnorris';
UPDATE users SET skill_levels = {'Java':10, 'Python':10} WHERE username='cnorris';
UPDATE users SET skill_levels['Java']=9 where username='cnorris';
UPDATE users SET phone_numbers = ['123456789', '987654321', '123456789'] +
phone_numbers WHERE username='cnorris';

-- Select
-- Select using a primary key
SELECT * FROM users WHERE username='cnorris';
SELECT username, first_name, last_name FROM users WHERE first_name='chuck';
SELECT * FROM users WHERE username in ('cnorris', 'jskeet');
SELECT username AS userName, first_name as firstName, last_name AS lastName FROM users;

-- If I want to query by a column other than by a primary key, index it first
CREATE INDEX on users (first_name);
SELECT * FROM users WHERE first_name='chuck';

-- Index
CREATE INDEX IF NOT EXISTS last_name_index ON users (last_name);
DROP INDEX last_name_index;
DROP INDEX IF EXISTS last_name_index;

-- Count
SELECT COUNT(*) AS user_count FROM users;

-- Limit
SELECT * FROM users limit 1;

-- Allow filtering
SELECT * FROM users WHERE first_name='chuck' AND last_name='norris' ALLOW FILTERING;

-- Export
COPY users TO 'users.csv';
COPY users (username, first_name, last_name) TO 'users-select.txt';

-- Delete
DELETE FROM users WHERE username='cnorris';
DELETE FROM users WHERE username in ('cnorris', 'jskeet');
DELETE skills FROM users WHERE username='jskeet';
DELETE skill_levels['Java'] FROM users where username='jskeet';
-- Permanently remove all data from the table
TRUNCATE users;

-- Import
COPY users FROM 'users.csv';
COPY users (username, first_name, last_name) FROM 'users-select.txt';

-- Batch
BEGIN BATCH
INSERT INTO users (username, first_name, last_name, skills) VALUES
('jbrown','johnny','brown', {'Java', 'Python'});
UPDATE users SET skills = {} WHERE username='jbrown';
DELETE FROM users WHERE username='jbrown';
APPLY BATCH;

-- Composite keys and order by
CREATE TABLE timeline (
    username text,
    posted_month int,
    body text,
    posted_by text,
    PRIMARY KEY (username, posted_month)
) WITH comment='Timeline records'
  AND compaction = { 'class' : 'LeveledCompactionStrategy' };

INSERT INTO timeline (username, posted_month, body, posted_by) VALUES ('jbrown',
1,'This is important', 'stevej');
SELECT * FROM timeline where username='jbrown' and posted_month=1;
SELECT * FROM timeline where username='jbrown' order by posted_month desc;


Parsing SQL with pyparsing

November 1, 2013

Recently, I was working on a NoSQL database and wanted to expose a SQL interface to it, so I could use it just like an RDBMS from my application. Not being very familiar with the Python ecosystem, I quickly searched around and found a Python library called pyparsing.

Now, if you know anything about parsing, you know regexes and traditional lex-style parsers can get complicated very quickly. But after playing with pyparsing for a few minutes, I quickly realized pyparsing makes it really easy to write and execute grammars. Pyparsing has a good set of APIs, handles whitespace well, makes debugging easy and has good documentation.

The code below doesn't cover all the edge cases and the documented grammar of SQL, but it was a good excuse to learn pyparsing anyway; good enough for my use case.

Install the pyparsing Python module.

pip install pyparsing
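Before diving into the full grammar, here is a tiny, self-contained sketch of how pyparsing combinators compose (this toy grammar is mine, not part of the script that follows):

```python
from pyparsing import CaselessKeyword, Group, Word, alphas, delimitedList

# A toy fragment of SQL: SELECT <columns> FROM <table>
select = CaselessKeyword("select")
_from = CaselessKeyword("from")
identifier = Word(alphas)

# Group the comma-separated column list and attach results names,
# so the pieces can be pulled out of the parse result by name.
stmt = (select + Group(delimitedList(identifier))("columns")
        + _from + identifier("table"))

result = stmt.parseString("SELECT id, name FROM users")
print(result.table)          # -> users
print(list(result.columns))  # -> ['id', 'name']
```

The results names ("columns", "table") are the same trick the full grammar below uses via setResultsName.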

Here is the script:

from pyparsing import CaselessKeyword, delimitedList, Each, Forward, Group, \
        Optional, Word, alphas, alphanums, nums, oneOf, ZeroOrMore, quotedString
keywords = ["select", "from", "where", "group by", "order by", "and", "or"]
[select, _from, where, groupby, orderby, _and, _or] = [ CaselessKeyword(word)
        for word in keywords ]

table = column = Word(alphas)
columns = Group(delimitedList(column))
columnVal = (Word(nums) | quotedString)

whereCond = (column + oneOf("= != < > >= <=") + columnVal)
whereExpr = whereCond + ZeroOrMore((_and | _or) + whereCond)

selectStmt = Forward().setName("select statement")
selectStmt << (select +
        ('*' | columns).setResultsName("columns") +
        _from +
        table.setResultsName("table") +
        Optional(where + Group(whereExpr), '').setResultsName("where").setDebug(False) +
        Each([Optional(groupby + columns("groupby"),'').setDebug(False),
            Optional(orderby + columns("orderby"),'').setDebug(False)
            ]))

def log(sql, parsed):
    print "##################################################"
    print sql
    print parsed.table
    print parsed.columns
    print parsed.where
    print parsed.groupby
    print parsed.orderby

sqls = [
        """select * from users where username='johnabc'""",
        """SELECT * FROM users WHERE username='johnabc'""",
        """SELECT * FRom users""",
        """SELECT * FRom USERS""",
        """SELECT * FROM users WHERE username='johnabc' or email=''""",
        """SELECT id, username, email FROM users WHERE username='johnabc' order by email, id""",
        """SELECT id, username, email FROM users WHERE username='johnabc' group by school""",
        """SELECT id, username, email FROM users WHERE username='johnabc' group by city, school order by firstname, lastname"""

for sql in sqls:
    log(sql, selectStmt.parseString(sql))

To run the script


As soon as I wrote my crappy little version and blogged about it, I actually found in pyparsing's own examples, written by Paul McGuire, the author of pyparsing. Oh well!

Deploying multinode Hadoop 2.0 cluster using Apache Ambari

October 31, 2013

The Apache Hadoop community recently made the GA release of Apache Hadoop 2.0, which is a pretty big deal. Hadoop 2.0 is basically a re-architecture and re-write of major components of classic Hadoop, including the NextGen MapReduce framework based on Hadoop YARN, and federated NameNodes. Bottom line, the architectural changes in Hadoop 2.0 allow it to scale to much larger clusters.

Deploying Hadoop manually can be a long and tedious process. I really wanted to try the new Hadoop, and I quickly realized Apache Ambari now supports the deployment of Hadoop 2.0. Apache Ambari has come a long way since last year and has really become one of my preferred Hadoop deployment tools for Hadoop 1.*.

In the article below, I will go through the steps I followed to get a Hadoop 2.0 cluster running on the Rackspace Public Cloud. I chose the Rackspace public cloud because I have easy access to it, but doing it on Amazon or even dedicated servers should be just as easy.

1. Create cloud servers on Rackspace Public Cloud.

You can create cloud servers using the Rackspace Control Panel or using their APIs directly or using any of the widely available bindings.

For Hadoop cluster, I am using:

  • Large flavors, i.e. 8GB or above.
  • CentOS 6.* as the guest operating system.

To actually create the servers, I will use a slightly modified version of the bulk server creation script from a previous post. I will create one server for Apache Ambari and a number of servers for the Apache Hadoop cluster, and I will then use Ambari to install Hadoop onto the Hadoop cluster servers.

So basically, I have created the following servers:

myhadoop-Ambari
myhadoop1
myhadoop2
myhadoop3

and have recorded the hostname, public/private IP addresses and root password for each.

2. Prepare the servers.

SSH into the newly created Ambari server eg. myhadoop-Ambari. Update its /etc/hosts file with the entry for each server above.

Also create a hosts.txt file with the hostnames of the servers from above.

root@myhadoop-Ambari$ cat hosts.txt
myhadoop1
myhadoop2
myhadoop3
At this point, from the same Ambari server, run the following script which will ssh into all of the servers specified in the hosts.txt file and set them up.

Specifically, the script will set up passwordless SSH between the servers and also disable iptables among other things.


#!/bin/bash
set -x

# Generate SSH keys (accept the defaults when prompted)
ssh-keygen -t rsa
cd ~/.ssh
cat >> authorized_keys

cd ~
# Distribute SSH keys
for host in `cat hosts.txt`; do
    cat ~/.ssh/ | ssh root@$host "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys"
    cat ~/.ssh/id_rsa | ssh root@$host "cat > ~/.ssh/id_rsa; chmod 400 ~/.ssh/id_rsa"
    cat ~/.ssh/ | ssh root@$host "cat > ~/.ssh/"
done

# Distribute hosts file
for host in `cat hosts.txt`; do
    scp /etc/hosts root@$host:/etc/hosts
done

# Prepare other basic things
for host in `cat hosts.txt`; do
    ssh root@$host "sed -i s/SELINUX=enforcing/SELINUX=disabled/g /etc/selinux/config"
    ssh root@$host "chkconfig iptables off"
    ssh root@$host "/etc/init.d/iptables stop"
    echo "enabled=0" | ssh root@$host "cat > /etc/yum/pluginconf.d/refresh-packagekit.conf"
done

Note, this step will ask for the root password of each server before setting it up for passwordless access.

3. Install Ambari.

While still on the Ambari server, run the following script that will install Apache Ambari.


#!/bin/bash
set -x

if [[ $EUID -ne 0 ]]; then
    echo "This script must be run as root"
    exit 1
fi

# Install Ambari server
cd ~
cp ambari.repo /etc/yum.repos.d/
yum install -y epel-release
yum repolist
yum install -y ambari-server

# Setup Ambari server
ambari-server setup -s

# Start Ambari server
ambari-server start

ps -ef | grep Ambari

Once the installation completes, you should be able to hit the Ambari server's IP on port 8080 in your browser and access its web interface.

admin/admin is the default username and password.

4. Install Hadoop.

Once logged into the Ambari web portal, it is pretty intuitive to create a Hadoop Cluster through its wizard.

It will ask for hostnames and SSH Private Key, which you can get from the Ambari Server.

root@myhadoop-Ambari$ cat hosts.txt
root@myhadoop-Ambari$ cat ~/.ssh/id_rsa

You should be able to just follow the wizard and complete the Hadoop 2.0 installation at this point. The process to install Hadoop 1.* is almost exactly the same, although some of the services like YARN don't exist there.

Apache Ambari will let you install a plethora of services including HDFS, YARN, MapReduce2, HBase, HIVE, Oozie, Ganglia, Nagios, ZooKeeper and Hive and Pig clients. As you go through the installation wizard, you can choose what service goes on which server.

5. Validate Hadoop:

SSH to myhadoop1 and run the following script to do a wordcount on all the books of Shakespeare.


#!/bin/bash
set -x

# Clean up any output from previous runs
su hdfs - -c "hadoop fs -rm -r /shakespeare"
cd /tmp
tar xjvf shakespeare.tar.bz2
now=`date +"%y%m%d-%H%M"`
su hdfs - -c "hadoop fs -mkdir -p /shakespeare"
su hdfs - -c "hadoop fs -mkdir -p /shakespeare/$now"
su hdfs - -c "hadoop fs -put /tmp/Shakespeare /shakespeare/$now/input"
su hdfs - -c "hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-*.jar wordcount /shakespeare/$now/input /shakespeare/$now/output"
su hdfs - -c "hadoop fs -cat /shakespeare/$now/output/part-r-* | sort -nk2"

So you have your first Hadoop 2.0 cluster running and validated. Feel free to look into the scripts; it's mostly the instructions from the Hortonworks docs scripted out. Have fun Hadooping!

Bulk creating Rackspace cloud servers using Script

October 31, 2013

I keep having to create a large number of cloud servers on the Rackspace Cloud, so I can play with things like Hadoop and Cassandra.
Using the control panel to create one server at a time, record each login password and IP, and wait until each server goes active gets really tedious very quickly.
So here's a little script that installs the REST API Python binding 'rackspace-novaclient' on an Ubuntu server, prompts you for the image, flavor and number of servers to create, then goes and creates the servers.

On an Ubuntu server, first export your Rackspace Cloud auth credentials (either as root or a sudo user):

export OS_USERNAME=<username>
export OS_PASSWORD=<apikey>
export OS_TENANT_NAME=<username>
export OS_AUTH_SYSTEM=rackspace
export OS_AUTH_URL=

Here is the actual script:


#!/bin/bash
set -x

# Install the Client
if [[ $EUID -ne 0 ]]; then
	sudo apt-get update
	sudo apt-get install python-dev python-pip python-virtualenv
else
	apt-get update
	apt-get install python-dev python-pip python-virtualenv
fi

virtualenv ~/.env
source ~/.env/bin/activate
pip install pbr
pip install python-novaclient
pip install rackspace-novaclient

# Read AUTH Credentials
: ${OS_USERNAME:?"Need to set OS_USERNAME non-empty"}
: ${OS_PASSWORD:?"Need to set OS_PASSWORD non-empty"}
: ${OS_TENANT_NAME:?"Need to set OS_TENANT_NAME non-empty"}
: ${OS_AUTH_SYSTEM:?"Need to set OS_AUTH_SYSTEM non-empty"}
: ${OS_AUTH_URL:?"Need to set OS_AUTH_URL non-empty"}
: ${OS_REGION_NAME:?"Need to set OS_REGION_NAME non-empty"}
: ${OS_NO_CACHE:?"Need to set OS_NO_CACHE non-empty"}

# Write credentials to a file
cat > ~/novarc <<EOF
export OS_USERNAME=$OS_USERNAME
export OS_PASSWORD=$OS_PASSWORD
export OS_TENANT_NAME=$OS_TENANT_NAME
export OS_AUTH_SYSTEM=$OS_AUTH_SYSTEM
export OS_AUTH_URL=$OS_AUTH_URL
export OS_REGION_NAME=$OS_REGION_NAME
export OS_NO_CACHE=$OS_NO_CACHE
EOF

# Prompt for the image, flavor and number of servers to create
read -p "Cluster name (hostname prefix): " CLUSTER_NAME
read -p "Cluster size (number of servers): " CLUSTER_SIZE
read -p "Image id: " IMAGE_ID
read -p "Flavor id: " FLAVOR_ID

# Boot the servers, recording the name and root password of each
for i in $(eval echo "{1..$CLUSTER_SIZE}"); do
	password=`nova boot --image $IMAGE_ID --flavor $FLAVOR_ID $CLUSTER_NAME$i | grep adminPass | awk '{print $4}'`
	echo $CLUSTER_NAME$i $password >> 'server_passwords.txt'
done

is_not_active() {
	status=`nova show $1 | grep 'status' | awk '{print $4}'`
	if [ "$status" != "ACTIVE" ] && [ "$status" != "ERROR" ]; then
		echo "$1 in $status"
		return 0
	else
		return 1
	fi
}

# Wait for all the instances to go ACTIVE or ERROR
while true; do
	READY=1
	for i in $(eval echo "{1..$CLUSTER_SIZE}"); do
		if is_not_active $CLUSTER_NAME$i; then
			READY=0
		fi
	done
	echo "READY is $READY"
	if [ "$READY" -eq "1" ]; then
		break
	fi
	sleep 5
done

# Record the hostnames
for i in $(eval echo "{1..$CLUSTER_SIZE}"); do
	echo $CLUSTER_NAME$i >> 'hosts.txt'
done
cat hosts.txt

# Record the private and public IP of each server
record_ip() {
	private_ip=`nova show $1 | grep 'private network' | awk '{print $5}'`
	public_ip=`nova show $1 | grep 'accessIPv4' | awk '{print $4}'`
	echo $private_ip $1 >> 'etc_hosts.txt'
	echo $public_ip $1 >> 'etc_hosts.txt'
}

for i in $(eval echo "{1..$CLUSTER_SIZE}"); do record_ip $CLUSTER_NAME$i; done

cat etc_hosts.txt


Then, execute the script


Alternatively, I put the script on GitHub, so you can also do curl, pipe, bash.

bash <(curl -s

The script will wait till all the servers go to an active status, and will save the IPs, hostnames and passwords for each of the servers into these three files:

  • etc_hosts.txt
  • hosts.txt
  • server_passwords.txt

Validating JSON using Python jsonschema

August 27, 2013

A JSON data format can be described using jsonschema, which can then be used to validate input JSON and do all kinds of automated testing and processing of data. Coming from the Java and XML world, I find it very handy to validate incoming JSON requests on RESTful APIs.

jsonschema is an implementation of JSON Schema for Python, and it's pretty easy to use.

Given your json data and an associated schema for the json that you have created, using the jsonschema library is pretty easy:

pip install jsonschema
import json
import jsonschema

schema = open("schema.json").read()
print schema

data = open("data.json").read()
print data

try:
    jsonschema.validate(json.loads(data), json.loads(schema))
except jsonschema.ValidationError as e:
    print e.message
except jsonschema.SchemaError as e:
    print e

This will validate your schema first and then validate the data. If you are sure your schema is valid, you can directly use one of the available Validators.

# Use a Draft3Validator directly
try:
    jsonschema.Draft3Validator(json.loads(schema)).validate(json.loads(data))
except jsonschema.ValidationError as e:
    print e.message

This will just report the first error it catches. Interestingly, you can also use lazy validation to report all validation errors at once:

# Lazily report all errors in the instance
v = jsonschema.Draft3Validator(json.loads(schema))
for error in sorted(v.iter_errors(json.loads(data)), key=str):
    print error.message

For your reference, here is my sample JSON and its schema for you to start with. You can generate a basic schema out of your data using one of the online schema-generation tools, or you can dive into the json-schema spec yourself for the details.


    "streetAddress": "21 2nd Street",
    "city":"New York",
      "number":"212 555-1234"


  "$schema": "",
    "address": {
        "city": {
        "houseNumber": {
        "streetAddress": {
    "phoneNumber": {
          "number": {
          "type": {

Bash Script to Backup MySQL Databases

February 24, 2013

mysqldump is a program for dumping a MySQL database. It creates a .sql file, which you can then use to restore the database.

Back up a MySQL database:

mysqldump -u mysql_user -h ip -pmysql_password database_name > database_name.sql

To restore a database from the database_name.sql file:

mysql -u mysql_user -h ip -pmysql_password database_name < database_name.sql

Backup all databases on the server:

mysqldump -u mysql_user -h ip -pmysql_password -A > all_databases.sql

To restore all databases:

mysql  -u mysql_user -h ip -pmysql_password < all_databases.sql

Backup a table on a MySQL database:
You can also run mysqldump at the table level:

mysqldump -u mysql_user -h ip -pmysql_password database_name table_name > table_name.sql

To restore the table to the database:

mysql  -u mysql_user -h ip -pmysql_password database_name table_name < table_name.sql

I have a bunch of MySQL databases hosted on a bunch of servers, and I have been pretty lazy about backing them up regularly. So I wrote a quick bash script to back up the MySQL databases, creating a separate backup file for each database.

Here's what the script does, in short:
1. You provide it a list of IP addresses, usernames and passwords.
2. It mysqldumps all the databases on each host server that the user has access to.
3. It stores all the dumps in backup_dir, compressing each dump using gzip.

Bash doesn't really support multi-dimensional arrays, so I had to store the IP, username and password as a comma-separated string and split it up in each iteration. Ugly, but gets the job done for now.

#!/bin/bash
# Script to back up all mysql databases on different hosts.

# Each entry is "ip,mysql_user,mysql_password" (placeholder values)
servers=(
        ",backup_user,password1"
        ",backup_user,password2"
)

function mysql_dump() {
        local ip="$1"
        local mysql_user="$2"
        local mysql_password="$3"
        mysql_databases=`mysql -u ${mysql_user} -p${mysql_password} -h ${ip} -e "show databases"| sed /^Database$/d`
        for database in $mysql_databases
        do
                if [ "${database}" == "information_schema" ]; then
                        echo "Skipping $database"
                else
                        echo "Backing up ${database}"
                        mysqldump -u ${mysql_user} -p${mysql_password} -h ${ip} ${database} | gzip > "${backup_dir}/${database}.gz"
                fi
        done
}

backup_date=`date +%Y_%m_%d_%H_%M`
backup_dir="mysql_backup_${backup_date}"
mkdir -p "${backup_dir}"

for server in ${servers[@]}
do
        # Split the "ip,user,password" string into its fields
        IFS=',' read -ra cols <<< "${server}"
        mysql_dump ${cols[0]} ${cols[1]} ${cols[2]}
done

Error Messages in Java

December 1, 2011

Every time something goes wrong within the application, or you want to notify the user of something, you display messages to the user. These messages could be simple validation errors, or severe errors in the backend. Such events can happen anywhere within the system, and we want to be able to tell the user what happened in a user-friendly way. So it's important that these messages are well-written and consistent, and they had better be stored in a central place for maintenance.

Here's a simple and clean way to do it that I have used over the years. It's pretty simple, really: I am using an enum for all error messages, with MessageFormat so I can pass in extra variables. While the messages themselves are still in code, at least they are in one file that anyone can use and maintain.

package errors;

import java.text.MessageFormat;

public enum ErrorMessages {

     USERNAME_NOT_FOUND("User not found"),
     USERNAME_EXISTS("Username {0} already exists."),
     USERNAME_CONTAINS_INVALID_CHARS("Username {0} contains invalid characters {1}.");

     private final String message;

     ErrorMessages(String message) {
          this.message = message;
     }

     public String toString() {
          return message;
     }

     public String getMessage(Object... args) {
          return MessageFormat.format(message, args);
     }

     public static void main(String args[]) {
          System.out.println(ErrorMessages.USERNAME_CONTAINS_INVALID_CHARS.getMessage("s%acharya", "%"));
     }
}

Sometimes the above solution is not enough: you want to internationalize the messages and display different messages to the user based on the locale. In that case, you can externalize the messages to properties files, with a translation of the messages in each locale file. You can then use ResourceBundle and the current locale to load the right properties file, and construct the right message for a given key using MessageFormat.

package errors;

import java.text.MessageFormat;
import java.util.Locale;
import java.util.ResourceBundle;

public class Messages {

     public static String getMessage(String key, Object... args) {
          // Decide which locale to use
          Locale currentLocale = new Locale("en", "US");
          ResourceBundle messages = ResourceBundle.getBundle("resource", currentLocale);

          // Look up the pattern for the key, then fill in the arguments
          MessageFormat formatter = new MessageFormat(messages.getString(key), currentLocale);
          return formatter.format(args);
     }

     public static void main(String args[]) {
          System.out.println(Messages.getMessage("USERNAME_EXISTS", "sacharya"));
          System.out.println(Messages.getMessage("USERNAME_CONTAINS_INVALID_CHARS", "s%acharya", "%"));
     }
}


The following properties files should be available on the classpath so the class can see them.

USERNAME_NOT_FOUND=User not found.
USERNAME_EXISTS=Username {0} already exists.
USERNAME_CONTAINS_INVALID_CHARS=Username {0} contains invalid characters {1}.

USERNAME_NOT_FOUND=User not found.
USERNAME_EXISTS=Username {0} already exists.
USERNAME_CONTAINS_INVALID_CHARS=Username {0} contains invalid characters {1}.

Sample usages have been demonstrated in the main method above. Choose whatever way serves your purpose.

Nginx Proxy to Jetty for Java Apps

March 4, 2011

Traditionally, I used to go with Apache, Mod Jk and Tomcat to host any Java web apps. But this time I was working on a small hobby project written in Groovy on Grails and had to deploy it to a VPS with very limited resources. I had to make the most of the server configuration I had, so I went with a combination of Nginx and Jetty.

If you’ve never heard of Nginx, it is a very simple HTTP server known for its high performance, low and predictable resource consumption, and low memory footprint under load. It uses an asynchronous event-driven model to handle requests, which enables it to efficiently handle a large number of requests concurrently.

Similarly, Jetty provides a very good Java Servlet container. Jetty can be used either as a standalone application server or embedded into an application or framework as an HTTP component or servlet engine. It serves as a direct alternative to Tomcat in many cases. Because of its use of advanced NIO and its small memory footprint, it provides very good scalability.

Below, I will jot down the steps I went through to configure Nginx as a frontend to Jetty on my VPS running Ubuntu Hardy.

Install Java:

$ sudo apt-get install openjdk-6-jdk
$ java -version
java version "1.6.0_0"
OpenJDK  Runtime Environment (build 1.6.0_0-b11)
OpenJDK 64-Bit Server VM (build 1.6.0_0-b11, mixed mode)

$ which java

Install Jetty:

Download the latest version of Jetty, and upload the tar file to your directory of choice on your server.

$ scp jetty-6.1.22.tar

Now login to your server, go to the directory where you uploaded Jetty above.

$ cd /user/java
$ tar xvf  jetty-6.1.22.tar

Now you can start or stop the Jetty server using the following commands:

$ cd /user/java/jetty-6.1.22/bin
$ ./jetty.sh start
$ ps aux | grep java
root     21766  1.2 72.4 1085176 387196 ?    Sl   Mar27   1:12 /usr/lib/jvm/java-6-openjdk/
/bin/java -Djetty.home=/user/java/jetty-6.1.22 -jar
/user/java/jetty-6.1.22/start.jar /user/java/jetty-6.1.22/etc/jetty-logging.xml

The jetty logs are under jetty-6.1.22/logs if you are interested.

Open up your bash profile and set the following paths:

$ vi ~/.bash_profile

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export JETTY_HOME=/user/java/jetty-6.1.22
export PATH=$JAVA_HOME/bin:$PATH
Now that Jetty is running, you can go to its default port 8080 and verify that everything is working as expected.

Now that you have Jetty, it's time to deploy your app to the Jetty container.

$ scp myapp.war
$ jar -xvf myapp.war

$ vi /user/java/jetty-6.1.22/contexts/myapp.xml

<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Mort Bay Consulting//DTD Configure//EN" "">
<Configure class="org.mortbay.jetty.webapp.WebAppContext">
<Set name="contextPath">/</Set>
<Set name="resourceBase"><SystemProperty name="jetty.home" default="."/>/webapps/myapp</Set>
</Configure>

Restart Jetty and go to http://ipAddress:8080/myapp, and you should see your app.

Install Nginx:

$ sudo aptitude install nginx

This will install Nginx under /etc/nginx

You can start, stop or restart the Nginx server using the commands:

$ sudo /etc/init.d/nginx start
$ sudo /etc/init.d/nginx stop
$ sudo /etc/init.d/nginx restart

Go to your server IP address (or localhost if local) in your browser, and you should be able to see the default welcome page.

Nginx Proxy to Jetty:
Now, let's configure Nginx as a proxy to our Jetty server:

$ cd /etc/nginx/sites-available
$ vi default
Point your proxy_pass at the Jetty port:

location / {
    proxy_pass http://127.0.0.1:8080/;
}

Basically, Nginx listens on port 80 and forwards requests to port 8080. Jetty maps the context path / to /webapps/myapp, which means any request proxied from Nginx is served by the app.

Now if you type your IP address or domain name in the browser, content will be served from your application in Jetty. Right now, you are serving everything through Jetty, including static files like images, JavaScript and CSS. But you can easily serve the static files directly through Nginx: just add a couple of location blocks:

location /images {
    root /user/java/jetty-6.1.22/webapps/myapp;
}
location /css {
    root /user/java/jetty-6.1.22/webapps/myapp;
}
location /js {
    root /user/java/jetty-6.1.22/webapps/myapp;
}

My final configuration is:

server {
    listen   80;

    access_log  /var/log/nginx/localhost.access.log;

    location / {
        proxy_pass http://127.0.0.1:8080/;
    }
    location /images {
        root /user/java/jetty-6.1.22/webapps/myapp;
    }
    location /css {
        root /user/java/jetty-6.1.22/webapps/myapp;
    }
    location /js {
        root /user/java/jetty-6.1.22/webapps/myapp;
    }

    # redirect server error pages to the static page /50x.html
    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   /var/www/nginx-default;
    }
}