The Cure to Curiosity: Python WebCrawler [Basic] [Practice]

Lately I have been trying to expand my knowledge by taking on personal projects that teach me new things. It can be difficult to find time for side projects alongside my school studies and work, so I'm quite happy I found the time for this little project.

My programming experience is limited to a brief Python introduction last semester and a Java programming class that I am currently taking. Personally I favor Python and would really like to get better at it. Unfortunately, after this Java class my programming requirements will be met for my major, so I must work on teaching myself in my spare time. Something that I really wanted to dive into was the networking side of Python, using modules such as socket, http.client, urllib.request, and so on. (These are the Python 3 variants, as I want to focus on learning the newest version of Python rather than versions < 3.)

After about an hour of reading up on the modules in question, I decided to just throw myself into a program and try my best. I came up with the idea of making a simple webcrawler, as it seemed achievable with the little knowledge I had just gained about the networking modules.

In an attempt to use more than just one of the modules I had read about, I decided to go with http.client and urllib.request, as the HTTP support in the socket module was a bit limited for my goal of a webcrawler. After just a little bit of time I ended up with this:

Code:

import http.client
import urllib.request
#WebCrawler
#Brought to you by Hunter Gregal

host=str(input("Please input the target host url. Ex: aptgetswag.com:\n"))
dir1=str(input("Please input the directory to crawl. Ex: '/pages/' or simply '/':\n"))
myExt=["php","html","js","jpeg","jpg","png","txt"]
myName=["index.","robots.","page.","password.","secret."]
myDict=[]
# Build every name + extension combination, e.g. "index.php"
a=0
b=0
while (a < len(myName)):
    while (b < len(myExt)):
        myDict.append(myName[a] + myExt[b])
        b=b+1
    b=0
    a=a+1

# Send a HEAD request for each candidate page
i=0
while (i < len(myDict)):
    conn = http.client.HTTPConnection(host)
    conn.request("HEAD", dir1 + myDict[i])
    res = conn.getresponse()
    page = str(("http://" + host + dir1 + myDict[i]))

    # On a 200, request the page again and print its response headers
    if (res.status == 200):
        print(page + " " + res.reason)
        usock = urllib.request.urlopen(page)
        print(usock.info())

    conn.close()
    i = i+1
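
As a side note, the nested while loops that fill myDict could probably be written more compactly. The following is only a sketch of one alternative, using itertools.product from the standard library; the function name build_candidates and the variable names are made up for this example and are not part of the original program.

```python
from itertools import product

# Same page names and extensions as in the program above
names = ["index.", "robots.", "page.", "password.", "secret."]
extensions = ["php", "html", "js", "jpeg", "jpg", "png", "txt"]


def build_candidates(names, extensions):
    """Return every name + extension combination (hypothetical helper)."""
    # product() walks the names in the outer position and the extensions
    # in the inner position, matching the order of the nested loops.
    return [name + ext for name, ext in product(names, extensions)]


candidates = build_candidates(names, extensions)
print(len(candidates))   # 35 (5 names x 7 extensions)
print(candidates[:3])    # ['index.php', 'index.html', 'index.js']
```

The resulting list has the same contents and order as myDict, so it could be dropped into the request loop unchanged.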

How It Works:
The code started out as simply a way to check whether a specific URL returned an active response. I used http.client to open a connection to a user-defined host and request pages in a user-defined directory. The program sends a HEAD request for each entry in a predefined list of page names, each combined with every extension in another predefined list. When a request returns a "200" status (as opposed to a "404" or similar), the urllib.request module then retrieves the page's response headers and prints them for the user.
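
One possible refinement of the flow described above: the HEAD response from http.client already carries the response headers, so the second request through urllib.request may not be necessary. The sketch below illustrates that idea under a few assumptions. The names probe, crawl, and connection_factory are hypothetical (they do not appear in the original program), and the connection_factory parameter exists only so the functions can be tried without a live server.

```python
import http.client


def probe(host, path, connection_factory=http.client.HTTPConnection, timeout=5):
    """Send one HEAD request and return (status, reason, headers).

    connection_factory is an assumption of this sketch: any callable that
    accepts (host, timeout=...) and returns an object with request(),
    getresponse() and close() will do.
    """
    conn = connection_factory(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        res = conn.getresponse()
        # The HEAD response already includes the headers, so they can be
        # read here instead of issuing a second request.
        return res.status, res.reason, res.getheaders()
    finally:
        # Close the connection even if the request raised an error
        conn.close()


def crawl(host, directory, candidates, **kwargs):
    """Yield (url, reason, headers) for each candidate that answers 200."""
    for name in candidates:
        path = directory + name
        try:
            status, reason, headers = probe(host, path, **kwargs)
        except (OSError, http.client.HTTPException) as exc:
            # Refused connections, DNS failures, timeouts, bad responses
            print("skipping " + path + ": " + str(exc))
            continue
        if status == 200:
            yield "http://" + host + path, reason, headers
```

With a real host, usage might look like `for url, reason, headers in crawl(host, dir1, myDict): print(url, reason, headers)`. The try/finally around the request is a design choice so that a failed request does not leave a connection open.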

Learning Experience:

Making this simple little program was a huge step in teaching myself Python. It gave me the chance to learn in the best way I know: by just doing it. I jumped right in with no idea how to even open a socket in Python. Each problem, and even each syntax error, was a chance for me to learn from my mistakes. By the end of this program I felt comfortable that I at least had a grasp on networking in Python. I recommend this method of learning to anyone attempting to teach themselves a programming language.



Tuesday, February 11, 2014
