Contact Us

If you are interested in our services leave your contact details below and our sales representatives will contact you.

The organization which you represent
Email address we will use to contact you
Longer contact form…
 
  • About

    mFabrik Blog is about mobile and web software development, open source and Linux. We tell exciting tales where business, technology, web and mobile convergence.

    Creative Commons License
    This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Viivi & Wagner strip scraper

Posted on May 7, 2008  by Mikko Ohtamaa
Filed Under python

I wrote this little script as a mental exercise and to prove the power of Python programming language. If anyone accepts the challenge, I’d like to see submissions in other programming langauges ;)

For the foreigners: this is the best comic in Finland, so I hope you’ll get translations soon! It tells about the relationship of a woman and a pig (sic) reflecting the deepest shadows of Finnish social life.

"""
	Creats local mirror from Viivi & Wagner strips by fetching all of them from hs.fi.

	Will create downloaded strips as
		2004/1.1.2004.gif
		2004/2.1.2004.gif
		...
		until today

	Try this in C++!

	Motivation: No one has build Viivi & Wagner search engine with speech bubble OCR support
	and I desperately wanted to find "Kottarainen lentaa korvaan" strip for my gf.

	Time to complete: 20 min.

"""

__docformat__ = "epytext"
__author__ = "Mikko Ohtamaa"
__license__ = "BSD"
__copyright__ = "2008 Mikko Ohtamaa"

import os
import re
import urllib
from BeautifulSoup import BeautifulSoup

# 1.1.2004 start page
url = "http://www.hs.fi/viivijawagner/1073386660690"

# Loop until there is no longer next link
while True:
	stream = urllib.urlopen(url)
	html = stream.read()
	stream.close()
	soup = BeautifulSoup(html)

	# Parse strip date from contents
	date = None

	# Find strip date, which is next to a title
	h1 = soup.findAll(text="Viivi ja Wagner")
	# Should be present always
	date = h1[0].parent.parent.p.string

	print "Fetching " + date

	# Scrape strip
	strip = soup.findAll("div" , { "class" : "strip" })
	img = strip[0].img

	stream = urllib.urlopen(img["src"])
	data = stream.read()
	stream.close()

	# For each year, give a new folder to avoid file system stress
	# (lotsa files in a folder kill poor Gnome)
	day, month, year = date.split(".")
	folder = year

	if not os.path.exists(folder):
		os.mkdir(folder)	

	# Store contents
	fname = os.path.join(folder, date + ".gif")
	f = open(fname, "wb")
	f.write(data)
	f.close()

	# Find next url, it is a containing one img tag
	img = soup.findAll(alt="seuraava")
        if len(img) == 0:
             break
	a = img[0].parent
	url = a["href"]

See preview


 

Other posts by Mikko Ohtamaa

 

Comments

One Response to “Viivi & Wagner strip scraper”

  1. DMC on July 10th, 2008 12:20 pm
    Gravatar

    How would one do that in c#?

    i have created a program that rips full html from a url into a textbox and then stores it to a sql database, but can’t seem to get the method of pulling the unique charecters

    Regards

    DMC

     

Leave a Reply