Tag: python

Crawling anonymously with Tor in Python

March 5, 2014

There are plenty of valid use cases where you need to protect your identity while communicating over the public internet. It is 2013, so you probably already know about Tor. Most people use Tor through the browser. The cool thing is that you can also access the Tor network programmatically, so you can build interesting tools with privacy built in.

The most common use case for hiding your identity with Tor, or for changing identities programmatically, is crawling a website like Google (well, this one is harder than you think) when you don’t want to be rate-limited or blocked.

It did take a fair amount of trial and error to get it working, though.

Tor
First of all, let’s install Tor.

apt-get update
apt-get install tor
/etc/init.d/tor restart

You will notice that Tor’s SOCKS listener is on port 9050.

Let’s enable the ControlPort listener so that Tor listens on port 9051. This is the port on which Tor accepts commands from applications talking to the Tor controller. The hashed password enables authentication on the port, so that not just any process can access it.

You can create a hashed password out of your password using:

tor --hash-password mypassword

So, update the torrc with the port and the hashed password.

/etc/tor/torrc

ControlPort 9051
HashedControlPassword 16:872860B76453A77D60CA2BB8C1A7042072093276A3D701AD684053EC4C

Restart Tor again so the configuration changes are applied.

/etc/init.d/tor restart
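
Before installing any client library, it can help to confirm that the control port actually accepts the password. The sketch below is an illustration rather than part of the original setup: it speaks the plain-text Tor control protocol over a raw socket, sends AUTHENTICATE, and reports whether Tor answered with a 250 status. The function name check_control_auth and its parameters are made up for this example, and it is written for Python 3.

```python
import socket


def check_control_auth(password, host="127.0.0.1", port=9051, timeout=5.0):
    """Return True if the Tor control port accepts the given password.

    Hypothetical helper: sends AUTHENTICATE with the plain-text password
    (not the hash stored in torrc) and checks for a "250" success reply.
    """
    # Escape backslashes and double quotes so the quoted string stays valid.
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    command = 'AUTHENTICATE "{}"\r\n'.format(escaped)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("utf-8"))
        reply = sock.recv(1024).decode("utf-8", "replace")
        authenticated = reply.startswith("250")
        if authenticated:
            # Close the control session politely.
            sock.sendall(b"QUIT\r\n")
        return authenticated
```

With the torrc above in place, check_control_auth("mypassword") should return True, and a wrong password should return False.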

PyTorCtl
Next, we will install pytorctl, a Python module for interacting with the Tor controller. It lets us send commands to, and receive replies from, the Tor control port programmatically.

apt-get install git
apt-get install python-dev python-pip
git clone https://github.com/aaronsw/pytorctl.git
pip install pytorctl/

Privoxy
Tor itself is not an HTTP proxy. So, in order to get access to the Tor network, we will use Privoxy as an HTTP proxy that forwards to Tor through SOCKS5.

Install Privoxy.

apt-get install privoxy

Now let’s tell Privoxy to use Tor. Go to /etc/privoxy/config and enable forward-socks5, which tells Privoxy to route all traffic through the SOCKS server at localhost port 9050:

forward-socks5 / localhost:9050 .

Restart Privoxy after making the change to the configuration file.

/etc/init.d/privoxy restart
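
At this point three local ports should be open: 9050 (Tor SOCKS), 9051 (Tor ControlPort) and 8118 (Privoxy). As a quick sanity check before writing the crawler, here is a small sketch that tries a TCP connection to each one. The helper names is_listening and report are my own, it only proves that something is listening (not that it is really Tor or Privoxy), and it is written for Python 3.

```python
import socket

# Ports used in this setup; the labels are only for display.
SERVICES = {"Tor SOCKS": 9050, "Tor ControlPort": 9051, "Privoxy": 8118}


def is_listening(port, host="127.0.0.1", timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def report(services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(port) for name, port in services.items()}
```

Calling report() on the machine should show True for all three entries once Tor and Privoxy have been restarted; a False usually points at the matching config file.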

Script:
In the script below, we’re using urllib2 with Privoxy as the proxy. Privoxy listens on port 8118 by default and forwards the traffic to port 9050, where the Tor SOCKS listener is running.
Additionally, in the renew_connection() function, I am sending a signal to the Tor controller to change the identity, so you get new identities without restarting Tor. You don’t have to change the IP, but it sometimes comes in handy when you are crawling and don’t want to be blocked based on IP.

ip_renew.py

import urllib2

from TorCtl import TorCtl

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}

# Send all urllib2 HTTP traffic to Privoxy (port 8118), which forwards it to Tor.
proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118"})
urllib2.install_opener(urllib2.build_opener(proxy_support))

def request(url):
    # Fetch the URL through the proxy and return the response body.
    req = urllib2.Request(url, None, headers)
    return urllib2.urlopen(req).read()

def renew_connection():
    # Ask the Tor controller for a new identity. The passphrase is the
    # plain-text password whose hash was added to torrc.
    conn = TorCtl.connect(controlAddr="127.0.0.1", controlPort=9051, passphrase="your_password")
    conn.send_signal("NEWNYM")
    conn.close()

for i in range(0, 10):
    renew_connection()
    # Tor may rate-limit NEWNYM, so back-to-back requests can show the same IP.
    print(request("http://icanhazip.com/"))

Running the script:

python ip_renew.py

Now, watch your IP change every few seconds.
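
ip_renew.py relies on urllib2, which only exists on Python 2. For anyone on Python 3, here is a rough sketch of the fetching half using urllib.request, plus a small polling helper that waits until the reported IP actually differs from the previous one (handy because a new identity is not always ready immediately). The names make_opener, fetch and wait_for_change, the placeholder user agent and the default timings are all my own assumptions, not something from the original script.

```python
import time
import urllib.request

PRIVOXY = "http://127.0.0.1:8118"  # Privoxy's default listen address
HEADERS = {"User-Agent": "Mozilla/5.0"}  # placeholder user agent


def make_opener(proxy=PRIVOXY):
    """Build an opener that sends http:// requests through the proxy.

    Only plain HTTP is proxied here, mirroring the original script;
    https:// URLs would need their own "https" entry.
    """
    handler = urllib.request.ProxyHandler({"http": proxy})
    return urllib.request.build_opener(handler)


def fetch(opener, url, timeout=30):
    """Fetch a URL through the opener and return the body as text."""
    req = urllib.request.Request(url, headers=HEADERS)
    with opener.open(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace").strip()


def wait_for_change(get_value, previous, attempts=5, delay=5.0):
    """Call get_value() until it differs from previous.

    Returns the new value, or None if it never changed.
    """
    for _ in range(attempts):
        current = get_value()
        if current != previous:
            return current
        time.sleep(delay)
    return None
```

A typical use would be: build one opener, record ip = fetch(opener, "http://icanhazip.com/"), request a new identity, then call wait_for_change(lambda: fetch(opener, "http://icanhazip.com/"), ip) before crawling on.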

Use it, but don’t abuse it.

Parsing SQL with pyparsing

November 1, 2013

Recently, I was working on a NoSQL database and wanted to expose a SQL interface to it so I could use it just like an RDBMS from my application. Not being very familiar with the libraries in the Python ecosystem, I quickly searched and found a Python library called pyparsing.

Now, if you know anything about parsing, you know regex and traditional lex parsers can get complicated very quickly. But after playing with pyparsing for a few minutes, I realized that it makes it really easy to write and execute grammars. Pyparsing has a good set of APIs, handles whitespace well, makes debugging easy and has good documentation.
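
As a quick aside, to make the regex comparison concrete, here is a rough sketch of what a regex-only approach could look like for just the simplest SELECT form. The pattern and the function name parse_simple_select are invented for illustration. Every extra clause (GROUP BY, ORDER BY, nested conditions) would mean another optional group bolted onto the pattern, which is where this approach starts to hurt.

```python
import re

# One pattern for: SELECT <* | col, col, ...> FROM <table> [WHERE <anything>]
SELECT_RE = re.compile(
    r"^\s*select\s+(?P<columns>\*|[a-z]+(?:\s*,\s*[a-z]+)*)"
    r"\s+from\s+(?P<table>[a-z]+)"
    r"(?:\s+where\s+(?P<where>.+?))?\s*$",
    re.IGNORECASE,
)


def parse_simple_select(sql):
    """Return a dict with columns, table and raw where text, or None."""
    match = SELECT_RE.match(sql)
    if match is None:
        return None
    columns = match.group("columns")
    return {
        "columns": "*" if columns == "*" else [c.strip() for c in columns.split(",")],
        "table": match.group("table"),
        # The WHERE clause is left as an unparsed string.
        "where": match.group("where"),
    }
```

Note that the WHERE part comes back as one opaque string, so the conditions themselves still need a second parser.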

The code below doesn’t cover all the edge cases or the full documented grammar of SQL, but it was a good excuse to learn pyparsing anyway, and it is good enough for my use case.

Install the pyparsing Python module.

pip install pyparsing

Here is my parse_sql.py

from pyparsing import CaselessKeyword, delimitedList, Each, Forward, Group, \
        Optional, Word, alphas, alphanums, nums, oneOf, ZeroOrMore, quotedString

keywords = ["select", "from", "where", "group by", "order by", "and", "or"]
[select, _from, where, groupby, orderby, _and, _or] = [ CaselessKeyword(word)
        for word in keywords ]

table = column = Word(alphas)
columns = Group(delimitedList(column))
# nums is just the string of digits; wrap it in Word() to match integers
columnVal = (Word(nums) | quotedString)

whereCond = (column + oneOf("= != < > >= <=") + columnVal)
whereExpr = whereCond + ZeroOrMore((_and | _or) + whereCond)

selectStmt = Forward().setName("select statement")
selectStmt << (select +
        ('*' | columns).setResultsName("columns") +
        _from +
        table.setResultsName("table") +
        Optional(where + Group(whereExpr), '').setResultsName("where").setDebug(False) +
        Each([Optional(groupby + columns("groupby"),'').setDebug(False),
            Optional(orderby + columns("orderby"),'').setDebug(False)
            ])
        )

def log(sql, parsed):
    # Print the statement followed by each named part of the parse result.
    print("##################################################")
    print(sql)
    print(parsed.table)
    print(parsed.columns)
    print(parsed.where)
    print(parsed.groupby)
    print(parsed.orderby)

sqls = [
        """select * from users where username='johnabc'""",
        """SELECT * FROM users WHERE username='johnabc'""",
        """SELECT * FRom users""",
        """SELECT * FRom USERS""",
        """SELECT * FROM users WHERE username='johnabc' or email='johnabc@gmail.com'""",
        """SELECT id, username, email FROM users WHERE username='johnabc' order by email, id""",
        """SELECT id, username, email FROM users WHERE username='johnabc' group by school""",
        """SELECT id, username, email FROM users WHERE username='johnabc' group by city, school order by firstname, lastname"""
        ]

for sql in sqls:
    log(sql, selectStmt.parseString(sql))

To run the script:

python parse_sql.py
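
Since the whole point was a SQL front end for a NoSQL store, the parsed WHERE tokens eventually have to be applied to records. Here is one hypothetical way to do that; it is not part of parse_sql.py. It assumes a flat token list of the shape the grammar describes (column, operator, value, optionally followed by and/or and another triple), evaluates conditions strictly left to right with no operator precedence, treats rows as plain dicts, and assumes unquoted values are integers. The names matches and OPS are made up, and the code is Python 3.

```python
import operator

# Comparison operators accepted by the WHERE grammar.
OPS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    ">=": operator.ge, "<=": operator.le,
}


def _literal(token):
    """Turn a SQL literal token into a Python value."""
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return token[1:-1]  # quotedString keeps the surrounding quotes
    return int(token)  # assumption: unquoted values are integers


def _check(row, column, op, value):
    """Evaluate one 'column op value' condition against a row."""
    return column in row and OPS[op](row[column], _literal(value))


def matches(row, tokens):
    """Return True if the row satisfies the flat WHERE token list."""
    tokens = list(tokens)
    if not tokens:
        return True  # no WHERE clause matches everything
    result = _check(row, *tokens[0:3])
    # After the first triple, tokens come in groups of four:
    # connective, column, operator, value.
    for i in range(3, len(tokens), 4):
        condition = _check(row, *tokens[i + 1:i + 4])
        if tokens[i].lower() == "and":
            result = result and condition
        else:
            result = result or condition
    return result
```

For example, matches({"username": "johnabc"}, ["username", "=", "'johnabc'"]) is True, so filtering a list of records becomes a one-line list comprehension.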

As soon as I wrote my crappy little version and blogged about it, I actually found simpleSQL.py written by Paul McGuire, the author of pyparsing. Oh well!