Crawling anonymously with Tor in Python

March 5, 2014

There are a lot of valid use cases in which you need to protect your identity while communicating over the public internet. It is 2013, so you probably already know about Tor. Most people use Tor through the browser. The cool thing is that you can access the Tor network programmatically, so you can build interesting tools with privacy built in.

The most common use case for hiding your identity with Tor, or for changing identities programmatically, is crawling a website like Google (well, this one is harder than you think) when you don’t want to be rate-limited or blocked.

It did take a fair amount of trial and error to get this working, though.
Tor
First of all, let’s install Tor.

apt-get update
apt-get install tor
/etc/init.d/tor restart

You will notice that the SOCKS listener is on port 9050.
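If you want to confirm from Python that the listener is actually up, a minimal sketch is a plain TCP connection attempt. The helper name `port_is_open` is made up for this example, and it only checks that something accepts connections on the port, not that it is Tor.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (e.g. connection refused) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example, assuming Tor runs locally with the default SOCKS port:
# print(port_is_open("127.0.0.1", 9050))
```

The same check works for port 9051 once the ControlPort is enabled, which can help narrow down "Connection refused" errors.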

Let’s enable the ControlPort so that Tor listens on port 9051. This is the port on which Tor accepts commands from applications acting as Tor controllers. The hashed password enables authentication on that port, so that arbitrary processes cannot take control of Tor.

You can create a hashed password out of your password using:

tor --hash-password mypassword

So, update the torrc with the port and the hashed password.

/etc/tor/torrc

ControlPort 9051
HashedControlPassword 16:872860B76453A77D60CA2BB8C1A7042072093276A3D701AD684053EC4C
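Running tor --hash-password twice on the same password gives two different strings. That is expected: the value embeds a random salt, and any of the generated hashes is valid for that password. The sketch below reflects my understanding of the format (an OpenPGP-style salted, iterated SHA-1 over 65536 bytes); it has not been checked against the tor binary here, and the function names are made up for illustration.

```python
import binascii
import hashlib
import os

def tor_hash_password(password, salt=None):
    """Sketch of the 16:<salt><indicator><sha1> format (assumed, not verified)."""
    if salt is None:
        salt = os.urandom(8)            # fresh random salt on every call
    indicator = 0x60                    # encodes how many bytes get hashed
    count = (16 + (indicator & 15)) << ((indicator >> 4) + 6)  # 65536
    data = salt + password.encode("utf-8")
    digest = hashlib.sha1()
    # Feed salt+password repeatedly until exactly `count` bytes are hashed
    while count > 0:
        chunk = data[:count]
        digest.update(chunk)
        count -= len(chunk)
    raw = salt + bytes([indicator]) + digest.digest()
    return "16:" + binascii.hexlify(raw).decode("ascii").upper()

def split_hashed_password(hashed):
    """Split a 16:... value into (salt_hex, indicator_hex, digest_hex)."""
    body = hashed.split(":", 1)[1]
    return body[:16], body[16:18], body[18:]
```

Splitting the value from the torrc above shows the 8-byte salt up front, which is why a second run of the command produces a different but equally usable string.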

Restart Tor again so that the configuration changes are applied.

/etc/init.d/tor restart
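To see what talking to the control port looks like, here is a minimal sketch that uses only a raw socket, with no controller library. It sends the AUTHENTICATE and SIGNAL NEWNYM commands of the Tor control protocol and expects a reply starting with 250 to each. The function name `send_newnym` and its defaults are assumptions for this example, and the password is assumed to be plain ASCII.

```python
import socket

def send_newnym(password, host="127.0.0.1", port=9051, timeout=5.0):
    """Authenticate to the Tor control port and request a new identity."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("rb")

        def command(line):
            # Control-protocol commands are CRLF-terminated text lines
            sock.sendall(line.encode("ascii") + b"\r\n")
            reply = reader.readline().decode("ascii").strip()
            if not reply.startswith("250"):
                raise RuntimeError("Tor control port replied: " + reply)
            return reply

        # Escape backslashes and quotes inside the quoted password string
        escaped = password.replace("\\", "\\\\").replace('"', '\\"')
        command('AUTHENTICATE "%s"' % escaped)
        return command("SIGNAL NEWNYM")

# Example, assuming the torrc above and the clear-text password you hashed:
# send_newnym("mypassword")
```

Note that the password sent here is the original one, not the hashed value stored in torrc.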

PyTorCtl
Next, we will install pytorctl, a Python module for interacting with the Tor controller. It lets us send commands to, and receive replies from, the Tor control port programmatically.

apt-get install git
apt-get install python-dev python-pip
git clone https://github.com/aaronsw/pytorctl.git
pip install pytorctl/

Privoxy
Tor itself is not an HTTP proxy. So, in order to reach the Tor network, we will use Privoxy as an HTTP proxy that forwards traffic through SOCKS5.

Install Privoxy.

apt-get install privoxy

Now let’s tell Privoxy to use Tor. The line below tells Privoxy to route all traffic through the SOCKS server at localhost port 9050.
Go to /etc/privoxy/config and enable forward-socks5:

forward-socks5 / localhost:9050 .

Restart Privoxy after making the change to the configuration file.

/etc/init.d/privoxy restart
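A quick way to check that Privoxy is forwarding is to fetch a page through it. The sketch below is written for Python 3 (urllib.request, the successor of urllib2) and assumes Privoxy listens on its default port 8118; the function name `fetch_via_proxy` and the short User-Agent string are placeholders chosen for this example.

```python
import urllib.request

def fetch_via_proxy(url, proxy="http://127.0.0.1:8118", timeout=30):
    """Fetch url through an HTTP proxy (Privoxy by default) and return bytes."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    # A local opener keeps the proxy setting out of global urllib state
    opener = urllib.request.build_opener(handler)
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with opener.open(request, timeout=timeout) as response:
        return response.read()

# Example, assuming Privoxy and Tor are both running locally:
# print(fetch_via_proxy("http://icanhazip.com/").decode().strip())
```

If the address printed differs from your own public IP, traffic is going out through Tor.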

Script:
In the script below, we’re using urllib2 with Privoxy as the proxy. Privoxy listens on port 8118 by default and forwards the traffic to port 9050, where the Tor SOCKS listener is waiting.
Additionally, the renew_connection() function sends a signal to the Tor controller to change the identity, so you get new identities without restarting Tor. You don’t have to change the IP, but it comes in handy when you are crawling and don’t want to be blocked based on IP.

ip_renew.py

from TorCtl import TorCtl
import urllib2

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': user_agent}

def request(url):
    def _set_urlproxy():
        proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"})  # Privoxy's default listener
        opener = urllib2.build_opener(proxy_support)
        urllib2.install_opener(opener)
    _set_urlproxy()
    req = urllib2.Request(url, None, headers)  # attach the User-Agent header
    return urllib2.urlopen(req).read()

def renew_connection():
    conn = TorCtl.connect(controlAddr="127.0.0.1", controlPort=9051, passphrase="your_password")
    conn.send_signal("NEWNYM")
    conn.close()

for i in range(0, 10):
    renew_connection()
    print request("http://icanhazip.com/")

Running the script:

python ip_renew.py

Now, watch your IP change every few seconds.
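If the address does not change between iterations, the likely cause is timing: building a new circuit takes a moment, and as far as I know Tor ignores NEWNYM signals that arrive too close together. A new circuit may also end up on the same exit node. The sketch below polls until the reported IP differs from the old one or a timeout expires. The name `wait_for_new_ip` and its parameters are made up for this example, and `get_ip` stands for any callable that returns the current exit IP.

```python
import time

def wait_for_new_ip(get_ip, old_ip, timeout=60.0, interval=2.0, sleep=time.sleep):
    """Poll get_ip() until it differs from old_ip.

    Returns the new IP, or None if nothing changed before the timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = get_ip()
        if current != old_ip:
            return current
        sleep(interval)  # give Tor time to build the new circuit
    return None

# Example, reusing the request() and renew_connection() functions of ip_renew.py:
# old_ip = request("http://icanhazip.com/")
# renew_connection()
# new_ip = wait_for_new_ip(lambda: request("http://icanhazip.com/"), old_ip)
```

Returning None on timeout, rather than looping forever, keeps a crawler from hanging when Tor keeps handing back the same exit.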

Use it, but don’t abuse it.

24 Comments on Crawling anonymously with Tor in Python


  1. Arul says:

    Hi

    Thanks for this tutorial. How are you getting the hashed password?

    I tried this:

    tor --hash-password your_password

    But it returns a different hashed password every time. How should I handle that?

    Thanks
    Arul

  2. Pavan says:

    Hi

    Thanks for this tutorial. I have followed everything in the tutorial above and my program runs, but I get the same IP address all 10 times.

    Please help me solve this problem.

    Thanks
    pavan

    • coolcoolcoolia says:

      I ran into this issue too and found a fix. Changing identity through Tor takes time, so I edited the ip_renew.py file to handle it.

      1) Create two new variables underneath the “headers” variable:

      oldIP = "0.0.0.0"
      newIP = "0.0.0.0"

      2) Change the for loop to:

      for i in range(0, 10):
          if oldIP == "0.0.0.0":
              renew_connection()
              oldIP = request("http://icanhazip.com/")
          else:
              oldIP = request("http://icanhazip.com/")
              renew_connection()
              newIP = request("http://icanhazip.com/")
          while oldIP == newIP:
              newIP = request("http://icanhazip.com/")
          print request("http://icanhazip.com/")
      
  3. Ballman50 says:

    Hi,

    I get an "invalid syntax" error for print request.

    Could you help me?

    Ballman.

  4. Tiago says:

    On line 17, what should I fill in for the password (passphrase="your_password")?

  5. sbleecpa says:

    Whoa... all sorts of things exist, huh. Wow.

  6. […] is a socks proxy. Connecting to it directly with the example you cite fails with “urlopen error Tunnel connection failed: 501 Tor is not an HTTP Proxy”. As […]

  7. george militaru says:

    Hi all,
    I am new to this kind of thing. I want to start an automated ,,traing” system with some associates, but my local provider is blocking me. If someone is interested in researching this idea and joining the project, please drop me an email at:
    brigadafulger@gmail.com
    Tnks

  8. Usuario PTC says:

    Hello, greetings to all … First of all, thank you for your work and for sharing it.

    I am just starting out in programming, and running the code gives me the following error:

    Failed to read authentication cookie (permission denied): /var/run/tor/control.authcookie
    Traceback (most recent call last):
      File "ip_renew.py", line 22, in <module>
        renew_connection()
      File "ip_renew.py", line 18, in renew_connection
        conn.send_signal("NEWNYM")
    AttributeError: 'NoneType' object has no attribute 'send_signal'
    256

    I wonder whether additional Tor configuration is required, or whether I am running something incorrectly.

    Thank you for your understanding and support on this issue.

    • Italian_Friend says:

      Your Tor may be version 0.2.5.1; you need to update to 0.2.7.1.
      Here is the correct procedure:

      https://www.torproject.org/docs/debian.html.en

    • here2help says:

      You need to remove the leading hash signs (#) from the relevant lines in the torrc file.

      Specifically, the lines that read

      ## The port on which Tor will listen for local connections from Tor
      ## controller applications, as documented in control-spec.txt.
      #ControlPort 9051
      ## If you enable the controlport, be sure to enable one of these
      ## authentication methods, to prevent attackers from accessing it.
      #HashedControlPassword ……

      should read

      ## The port on which Tor will listen for local connections from Tor
      ## controller applications, as documented in control-spec.txt.
      ControlPort 9051
      ## If you enable the controlport, be sure to enable one of these
      ## authentication methods, to prevent attackers from accessing it.
      HashedControlPassword …….

    • You have probably solved your problem already, but this may help others with the same issue.
      Make sure you have Tor 0.2.7.1 or above and try launching the script using sudo:

      sudo python ip_renew.py

      I got the same error, and this solved my problem.

  9. f4lse says:

    Thanks for the simple and effective tutorial.
    Works like a breeze with 0 errors on:

    Ubuntu Server 15.10
    python 2.7.10

  10. Abi says:

    Hi,

    I'm getting an error:
    Connection refused. Is the ControlPort enabled?
    Traceback (most recent call last):
      File "ip_renew.py", line 22, in <module>
        renew_connection()
      File "ip_renew.py", line 18, in renew_connection
        conn.send_signal("NEWNYM")
    AttributeError: 'NoneType' object has no attribute 'send_signal'

    I tried different ports, but to no avail.

  11. B98 says:

    Hey guys,

    I was wondering whether it is possible to trace you back when using this kind of crawling.

    Does anyone have enough experience to answer this question?

    Any help will be warmly received.

  12. here2help says:

    hi! this is awesome, but why did you use TorCtl instead of Stem? TorCtl’s been deprecated for some time, and occasionally spits errors. I think Stem may be a cleaner option, in this case.

    thanks again for the tutorial!!

  13. PegasusWang says:

    The requests library now supports SOCKS proxies (version >= 2.10.0).

  14. markus says:

    I have the same issue too.

    After updating my Tor version I still get the same error:

    "Connection refused. Is the ControlPort enabled?"

  15. Alexandre cavalcante says:

    More than 3 years later and it is still working great!!!

  16. kn16h7 says:

    Hi,

    How do I crawl a router and scrape all the content after login?
