Category: Python

Crawling anonymously with Tor in Python

March 5, 2014

There are a lot of valid use cases where you need to protect your identity while communicating over the public internet. By now you probably already know about Tor, and most people use it through the browser. The cool thing is that you can also get access to the Tor network programmatically, so you can build interesting tools with privacy built into them.

The most common use case for hiding your identity with Tor, or for changing identities programmatically, is crawling a website like Google (well, this one is harder than you think) when you don’t want to be rate-limited or blocked.

It did take a fair amount of trial and error to get this working, though.
Tor
First of all, let’s install Tor.

apt-get update
apt-get install tor
/etc/init.d/tor restart

You will notice that the SOCKS listener is on port 9050.
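
If you would rather confirm that from code than from the logs, a small check like the one below should do. It is only a sketch, written for Python 3; the helper name is_port_open is mine, and it only shows that something accepts TCP connections on that port, not that it is actually Tor.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


# Example: is_port_open("127.0.0.1", 9050) for Tor's default SOCKS port.
```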

Let’s enable the ControlPort listener so that Tor listens on port 9051. This is the port on which Tor accepts commands from applications talking to the Tor controller. The hashed password enables authentication on the port, so that not just anyone can access it.

You can create a hashed password out of your password using:

tor --hash-password mypassword
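
If you are curious what that command produces: as far as I understand it, the output is "16:" followed by the hex of an 8-byte salt, a one-byte iteration indicator and a SHA-1 digest computed with the salted, iterated S2K scheme from RFC 2440. The Python 3 sketch below reproduces that under those assumptions. The function names are made up for illustration, and tor --hash-password itself remains the source of truth.

```python
import binascii
import hashlib
import os


def tor_hash_password(password, salt=None, indicator=0x60):
    """Sketch of the salted, iterated SHA-1 hash (RFC 2440 S2K style)."""
    if salt is None:
        salt = os.urandom(8)
    # The indicator byte encodes how many bytes of input get hashed.
    count = (16 + (indicator & 15)) << ((indicator >> 4) + 6)
    material = salt + password.encode("utf-8")
    digest = hashlib.sha1()
    remaining = count
    while remaining > 0:
        # Feed salt+password repeatedly, truncating the final repetition.
        chunk = material[:remaining]
        digest.update(chunk)
        remaining -= len(chunk)
    body = salt + bytes([indicator]) + digest.digest()
    return "16:" + binascii.hexlify(body).decode("ascii").upper()


def matches_hashed_password(password, hashed):
    """Re-hash password with the salt stored in hashed and compare."""
    prefix, _, hex_body = hashed.partition(":")
    if prefix != "16":
        return False
    raw = binascii.unhexlify(hex_body)
    salt, indicator = raw[:8], raw[8]
    return tor_hash_password(password, salt, indicator) == hashed.upper()
```

Calling matches_hashed_password("mypassword", "16:...") with the value from your torrc is a cheap way to catch a mismatched password before you start debugging authentication errors.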

So, update the torrc with the port and the hashed password.

/etc/tor/torrc

ControlPort 9051
HashedControlPassword 16:872860B76453A77D60CA2BB8C1A7042072093276A3D701AD684053EC4C

Restart Tor again so the configuration changes are applied.

/etc/init.d/tor restart
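
To check that the control port accepts your password without any third-party module, you can speak the control protocol directly: it is line based, and AUTHENTICATE and SIGNAL each answer with a single "250 OK" line on success. Here is a minimal Python 3 sketch, assuming the password contains no quotes or backslashes and that only single-line replies are expected. send_control_commands is a hypothetical helper name.

```python
import socket


def send_control_commands(host, port, password, commands):
    """Authenticate to a Tor control port, send commands, return reply lines."""
    replies = []
    with socket.create_connection((host, port), timeout=10) as sock:
        with sock.makefile("r", encoding="ascii", newline="\r\n") as reader:
            for line in ['AUTHENTICATE "%s"' % password] + list(commands):
                sock.sendall((line + "\r\n").encode("ascii"))
                reply = reader.readline().strip()
                replies.append(reply)
                # Any status other than 250 means the command was rejected.
                if not reply.startswith("250"):
                    break
    return replies


# Example: send_control_commands("127.0.0.1", 9051, "mypassword", ["SIGNAL NEWNYM"])
```

With the torrc settings above, the example call should come back with two "250 OK" lines; a reply starting with 515 would point at a wrong password.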

PyTorCtl
Next, we will install pytorctl, a Python module for interacting with the Tor controller. It lets us send commands to the Tor control port and receive its responses programmatically.

apt-get install git
apt-get install python-dev python-pip
git clone https://github.com/aaronsw/pytorctl.git
pip install pytorctl/

Privoxy
Tor itself is not an HTTP proxy. So, in order to get access to the Tor network, we will use Privoxy as an HTTP proxy that forwards over SOCKS5.

Install Privoxy.

apt-get install privoxy

Now let’s tell Privoxy to use Tor. This makes Privoxy route all traffic through the SOCKS server at localhost port 9050.
Go to /etc/privoxy/config and enable forward-socks5:

forward-socks5 / localhost:9050 .

Restart Privoxy after making the change to the configuration file.

/etc/init.d/privoxy restart
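
A quick way to confirm that Privoxy is forwarding to Tor is to fetch a page through it. The sketch below uses Python 3's urllib.request (the successor of urllib2) and assumes Privoxy is listening on its default address, 127.0.0.1:8118. build_tor_opener and fetch are names I picked for the example.

```python
import urllib.request

PRIVOXY_ADDRESS = "127.0.0.1:8118"  # assumed: Privoxy's default listen address


def build_tor_opener(proxy=PRIVOXY_ADDRESS):
    """Return an opener that sends HTTP and HTTPS requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def fetch(url, opener=None, timeout=30):
    """Fetch url through the proxy and return the body as text."""
    opener = opener or build_tor_opener()
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with opener.open(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example: fetch("http://icanhazip.com/") should print a Tor exit address,
# not your own.
```

Returning an opener instead of calling install_opener keeps the proxy out of the rest of the process, which is handy if only some requests should go through Tor.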

Script:
In the script below, we’re using urllib2 with the proxy. Privoxy listens on port 8118 by default and forwards the traffic to port 9050, where the Tor SOCKS listener is.
Additionally, in the renew_connection() function, I am sending a signal to the Tor controller to change the identity, so you get new identities without restarting Tor. You don’t have to change the IP, but it comes in handy when you are crawling and don’t want to be blocked based on IP.

ip_renew.py

from TorCtl import TorCtl
import urllib2

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}

# Route HTTP requests through Privoxy (port 8118), which forwards them to Tor.
# Only "http" is configured here, so https:// URLs would bypass the proxy.
proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
urllib2.install_opener(urllib2.build_opener(proxy_support))

def request(url):
    # Fetch the URL through the proxy opener installed above.
    req = urllib2.Request(url, None, headers)
    return urllib2.urlopen(req).read()

def renew_connection():
    # Use the plain-text password that was given to `tor --hash-password`.
    conn = TorCtl.connect(controlAddr="127.0.0.1", controlPort=9051, passphrase="your_password")
    # NEWNYM asks Tor to use fresh circuits for new connections.
    conn.send_signal("NEWNYM")
    conn.close()

for i in range(0, 10):
    renew_connection()
    print request("http://icanhazip.com/")

Running the script:

python ip_renew.py

Now, watch your IP change every few seconds.
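
One caveat: Tor rate-limits NEWNYM requests (to roughly one every ten seconds, as far as I know), and a new circuit can still come out of the same exit node, so a renewal does not guarantee a different address right away. A more patient loop could poll until the address actually changes. The Python 3 sketch below assumes you pass in your own get_ip and renew callables, for example wrappers around the request and renew_connection functions above. wait_for_new_ip is a hypothetical helper.

```python
import time


def wait_for_new_ip(get_ip, renew, attempts=5, delay=10.0, sleep=time.sleep):
    """Ask for a new identity, then poll until the visible IP changes.

    Returns the new IP, or None if it did not change within `attempts` polls.
    """
    old_ip = get_ip()
    renew()
    for _ in range(attempts):
        sleep(delay)  # give Tor time to build and switch to a new circuit
        new_ip = get_ip()
        if new_ip != old_ip:
            return new_ip
    return None
```

Passing sleep in as a parameter is just a convenience so the function can be exercised without real waiting.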

Use it, but don’t abuse it.

Parsing SQL with pyparsing

November 1, 2013

Recently, I was working on a NoSQL database and wanted to expose a SQL interface to it so I could use it just like an RDBMS from my application. Not being very familiar with the Python ecosystem’s libraries, I quickly searched and found a Python library called pyparsing.

Now, if you know anything about parsing, you know that regexes and traditional lex-based parsers can get complicated very quickly. But after playing with pyparsing for a few minutes, I realized that it makes writing and executing grammars really easy. Pyparsing has a good set of APIs, handles whitespace well, makes debugging easy and has good documentation.

The code below doesn’t cover all the edge cases or the full documented grammar of SQL, but it was a good excuse to learn pyparsing anyway, and it is good enough for my use case.

Install the pyparsing Python module.

pip install pyparsing

Here is my parse_sql.py

from pyparsing import CaselessKeyword, delimitedList, Each, Forward, Group, \
        Optional, Word, alphas, nums, oneOf, ZeroOrMore, quotedString

keywords = ["select", "from", "where", "group by", "order by", "and", "or"]
[select, _from, where, groupby, orderby, _and, _or] = [ CaselessKeyword(word)
        for word in keywords ]

table = column = Word(alphas)
columns = Group(delimitedList(column))
# nums is just the string "0123456789"; wrap it in Word() to match numbers
columnVal = (Word(nums) | quotedString)

whereCond = (column + oneOf("= != < > >= <=") + columnVal)
whereExpr = whereCond + ZeroOrMore((_and | _or) + whereCond)

selectStmt = Forward().setName("select statement")
selectStmt << (select +
        ('*' | columns).setResultsName("columns") +
        _from +
        table.setResultsName("table") +
        Optional(where + Group(whereExpr), '').setResultsName("where").setDebug(False) +
        Each([Optional(groupby + columns("groupby"),'').setDebug(False),
            Optional(orderby + columns("orderby"),'').setDebug(False)
            ])
        )

def log(sql, parsed):
    print "##################################################"
    print sql
    print parsed.table
    print parsed.columns
    print parsed.where
    print parsed.groupby
    print parsed.orderby

sqls = [
        """select * from users where username='johnabc'""",
        """SELECT * FROM users WHERE username='johnabc'""",
        """SELECT * FRom users""",
        """SELECT * FRom USERS""",
        """SELECT * FROM users WHERE username='johnabc' or email='johnabc@gmail.com'""",
        """SELECT id, username, email FROM users WHERE username='johnabc' order by email, id""",
        """SELECT id, username, email FROM users WHERE username='johnabc' group by school""",
        """SELECT id, username, email FROM users WHERE username='johnabc' group by city, school order by firstname, lastname"""
        ]

for sql in sqls:
    log(sql, selectStmt.parseString(sql))

To run the script:

python parse_sql.py
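
To see how named results can be consumed by application code, here is a cut-down variant of the same idea that turns a statement into a plain dictionary. It is a sketch rather than a replacement for the grammar above: it only handles a column list, a table and a single condition, it is written for Python 3, and parse_select is a name I made up. It sticks to the camelCase pyparsing names used in this post, which newer pyparsing releases still accept as compatibility aliases.

```python
from pyparsing import (CaselessKeyword, Group, Optional, Word, alphas,
                       delimitedList, nums, oneOf, quotedString)

select, _from, where = [CaselessKeyword(w) for w in ("select", "from", "where")]

column = Word(alphas)
value = Word(nums) | quotedString          # a number or a quoted string
condition = Group(column + oneOf("= != < > >= <=") + value)

statement = (select
             + Group(delimitedList(column))("columns")
             + _from
             + column("table")
             + Optional(where + condition("where")))


def parse_select(sql):
    """Parse a simple SELECT and return its parts as a dictionary."""
    parsed = statement.parseString(sql, parseAll=True)
    return {
        "table": parsed.table,
        "columns": list(parsed.columns),
        # A missing name yields an empty result, so this becomes [].
        "where": list(parsed.where),
    }
```

From there, mapping the dictionary onto whatever query API the underlying store offers is ordinary Python.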

As soon as I wrote my crappy little version and blogged about it, I actually found simpleSQL.py written by Paul McGuire, the author of pyparsing. Oh well!

Validating JSON using Python jsonschema

August 27, 2013

The JSON data format can be described using a JSON Schema, which can then be used to validate input JSON and for all kinds of automated testing and processing of data. Coming from the Java and XML world, I find it very handy for validating incoming JSON requests on RESTful APIs.

jsonschema is an implementation of JSON Schema for Python, and it’s pretty easy to use.

Given your JSON data and an associated schema that you have created, using the jsonschema library is straightforward:

pip install jsonschema

import json
import jsonschema

schema = open("schema.json").read()
print schema

data = open("data.json").read()
print data

try:
    jsonschema.validate(json.loads(data), json.loads(schema))
except jsonschema.ValidationError as e:
    print e.message
except jsonschema.SchemaError as e:
    print e

This will validate your schema first and then validate the data. If you are sure your schema is valid, you can directly use one of the available Validators.

# Use a Draft3Validator
try:
    jsonschema.Draft3Validator(json.loads(schema)).validate(json.loads(data))
except jsonschema.ValidationError as e:
    print e.message

This will just report the first error it catches. Interestingly, you can also use lazy validation to report all validation errors:

# Lazily report all errors in the instance
# (iter_errors yields the errors instead of raising them, so no try/except is needed)
v = jsonschema.Draft3Validator(json.loads(schema))
for error in sorted(v.iter_errors(json.loads(data)), key=str):
    print(error.message)
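
To make the all-errors behaviour concrete without any files on disk, here is a small self-contained sketch with an inline draft-03 schema and a deliberately broken instance. It is written for Python 3; the schema is a trimmed-down assumption of mine rather than the full sample, and collect_errors is a hypothetical helper.

```python
from jsonschema import Draft3Validator

schema = {
    "type": "object",
    "properties": {
        "address": {
            "type": "object",
            "required": True,
            "properties": {
                "city": {"type": "string", "required": True},
                "houseNumber": {"type": "number"},
            },
        }
    },
}


def collect_errors(instance):
    """Return every validation error as 'path: message', sorted."""
    validator = Draft3Validator(schema)
    return sorted(
        "%s: %s" % ("/".join(str(p) for p in error.path), error.message)
        for error in validator.iter_errors(instance)
    )


# This instance is missing "city" and has a string where a number is expected,
# so two separate errors are reported.
for line in collect_errors({"address": {"houseNumber": "12"}}):
    print(line)
```

Including error.path in the output makes it much easier to tell which part of a nested document a message refers to.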

For your reference, here is my sample JSON and its schema to get you started. You can generate a basic schema from your data using tools like http://www.jsonschema.net/, or you can dive into http://json-schema.org/ yourself for the details of the JSON Schema spec.

data.json

{
  "address":{
    "streetAddress": "21 2nd Street",
    "city":"New York",
    "houseNumber":12
  },
  "phoneNumber": [
    {
      "type":"home",
      "number":"212 555-1234"
    }
  ]
}

schema.json

{
  "type":"object",
  "$schema": "http://json-schema.org/draft-03/schema",
  "required":false,
  "properties":{
    "address": {
      "type":"object",
      "required":true,
      "properties":{
        "city": {
          "type":"string",
          "required":true
        },
        "houseNumber": {
          "type":"number",
          "required":false
        },
        "streetAddress": {
          "type":"string",
          "required":true
        }
      }
    },
    "phoneNumber": {
      "type":"array",
      "required":false,
      "items":
      {
        "type":"object",
        "required":false,
        "properties":{
          "number": {
            "type":"string",
            "required":false
          },
          "type": {
            "type":"string",
            "required":false
          }
        }
      }
    }
  }
}
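
If you want a starting point for a schema like the one above without an online tool, the basic idea of inferring types from a sample can be sketched in a few lines. This is my own rough approximation in Python 3, not how any particular generator works: infer_schema is a hypothetical helper, it looks only at the first element of each array, and it leaves out the "required" flags, which you would still add by hand.

```python
def infer_schema(value):
    """Build a rough draft-03 style schema describing one sample value."""
    if isinstance(value, dict):
        return {
            "type": "object",
            "properties": {key: infer_schema(item) for key, item in value.items()},
        }
    if isinstance(value, list):
        schema = {"type": "array"}
        if value:
            # Assume all items look like the first one.
            schema["items"] = infer_schema(value[0])
        return schema
    if isinstance(value, bool):  # must come before the number check
        return {"type": "boolean"}
    if isinstance(value, (int, float)):
        return {"type": "number"}
    if isinstance(value, str):
        return {"type": "string"}
    if value is None:
        return {"type": "null"}
    raise TypeError("unsupported JSON value: %r" % (value,))
```

Feeding it json.load(open("data.json")) and dumping the result with json.dumps(..., indent=2) gives a skeleton you can then tighten.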