Tuesday, February 11, 2014

Python WebCrawler [Basic] [Practice]

Lately I have been trying to expand my knowledge by taking on personal projects that help me teach myself new things. While it can be a bit difficult to find time for side projects alongside my official school studies and work, I'm quite happy I found the time for this little project.

My programming experience is limited to a brief Python introduction last semester and a Java programming class that I am currently in. Personally I favor Python and would really like to get better at it. Unfortunately, after this Java class my programming requirements will be met for my major, so I must work on teaching myself in my spare time. Something that I really wanted to dive into was the networking portion of Python, using modules such as socket, http.client, urllib.request, and so on. (These are the Python 3 variants, as I want to focus on learning the newest version of Python as opposed to versions < 3.)

After about an hour of reading up on some of the modules in question, I decided to just throw myself into a program and try my best. I came up with the idea of making a simple webcrawler, as it seemed doable with the little knowledge I had just gained on the networking modules.

In an attempt to utilize more than just one module that I read about, I decided to go with http.client and urllib.request, as the HTTP support in the socket module was a bit limited for my goal of a webcrawler. After just a little bit of time I ended up with this:

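import http.client
import urllib.request
#Brought to you by Hunter Gregal

host = input("Please input the target host url. Ex: aptgetswag.com:\n")
dir1 = input("Please input the directory to crawl. Ex: '/pages/' or simply '/':\n")

# Example page names and extensions (placeholders; swap in your own lists).
myName = ["index", "admin", "login"]
myExt = [".html", ".php"]

# Build every name + extension combination to try.
myDict = []
for name in myName:
    for ext in myExt:
        myDict.append(name + ext)

for entry in myDict:
    # HEAD asks only for the status line and headers, not the page body.
    conn = http.client.HTTPConnection(host)
    conn.request("HEAD", dir1 + entry)
    res = conn.getresponse()
    conn.close()
    page = "http://" + host + dir1 + entry
    if res.status == 200:
        print(page + "    " + res.reason)
        # Fetch the page and print its response headers.
        usock = urllib.request.urlopen(page)
        print(usock.info())
        usock.close()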

How It Works :
The code started out as simply a way to check whether a specific URL returned an active response. I used http.client to open a connection to a user-defined host and request pages in a user-defined directory. The program tries each entry in a predefined list of page names, combined with each entry in another list of predefined extensions. When a request returns a "200" status (as opposed to a "404" or similar), the urllib.request module then retrieves the page's response headers to print out to the user.
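To make those steps easier to see one at a time, here is a rough sketch of the same idea split into small functions. The function names (build_candidates, head_status, fetch_headers, crawl), the timeout value, and the sample lists are all made up for illustration and are not part of the original program; treat it as one possible arrangement rather than the way it has to be done.

```python
import http.client
import urllib.request
from itertools import product


def build_candidates(names, extensions):
    """Return every name + extension combination, e.g. 'index' + '.html'."""
    return [name + ext for name, ext in product(names, extensions)]


def head_status(host, path, timeout=5):
    """Send a HEAD request and return (status, reason) without downloading the body."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        res = conn.getresponse()
        return res.status, res.reason
    finally:
        conn.close()


def fetch_headers(url, timeout=5):
    """Open the URL with urllib.request and return its response headers as a string."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return str(response.info())


def crawl(host, directory, names, extensions):
    """Print each candidate page that answers 200, followed by its headers."""
    for candidate in build_candidates(names, extensions):
        status, reason = head_status(host, directory + candidate)
        if status == 200:
            url = "http://" + host + directory + candidate
            print(url + "    " + reason)
            print(fetch_headers(url))


if __name__ == "__main__":
    # Hypothetical sample lists; this call is pure and needs no network.
    print(build_candidates(["index", "admin"], [".html", ".php"]))
```

Calling something like crawl("aptgetswag.com", "/", ["index"], [".html"]) would run the full check against a live host. Keeping the list-building step separate from the network calls means it can be tried out without touching any server.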

Learning Experience:
Making this simple little program was a huge step in teaching myself Python. It gave me the chance to learn in the best way I know: by just doing it. I jumped right in with no idea how to even open a socket in Python. Each problem, and even each syntax error I ran into, was a chance for me to learn from my mistakes. By the end of this program I felt comfortable that I at least had a grasp on networking in Python. I recommend this method of learning to anyone attempting to teach themselves a programming language.