went to an A’s game. but chants of “Let’s go Oakland!” just felt weird.
went to an A’s game. but chants of “Let’s go Oakland!” just felt weird.
Grabbing a site’s favicon from Python code is slightly complicated by support of the standard
<link rel="icon" type="image/png" href="/path/image.png"/>
and the older favicon.ico in the server’s root director. But parsing a site’s html using Python’s sgmllib is easy enough.
import sgmllib
def get_favicon_url(url):
"""
Get the favicon used for site at url.
- First try parsing for a favicon link in the html head.
- Then try just /favicon.ico.
- If neither found, return None
"""
class FaviconFinder(sgmllib.SGMLParser):
"""
A Parser class for finding the favicon used (if specified).
"""
def __init__(self, verbose=0):
sgmllib.SGMLParser.__init__(self, verbose)
self.favicon_url = None
def start_link(self, attributes):
attributes = dict(attributes)
if not self.favicon_url:
if 'rel' in attributes and 'icon' in attributes['rel']:
if 'href' in attributes:
self.favicon_url = attributes['href']
# Try to parse html at url and get favicon
if not url.startswith('http://') or url.startswith('https://'):
url = 'http://%s' % url
try:
site = urllib.urlopen(url)
contents = site.read()
favicon_parser = FaviconFinder()
favicon_parser.feed(contents)
except:
pass
# Another try block in case the parser throws an exception
# AFTER finding the appropriate url.
try:
if favicon_parser.favicon_url:
return favicon_parser.favicon_url
else:
url = '/'.join(url.split('/',3)[2:])
root_directory = url.split('/')[0]
favicon = httplib.HTTPConnection(root_directory)
favicon.request('GET','/favicon.ico')
response = favicon.getresponse()
if response.status == 200:
return 'http://%s/favicon.ico' % root_directory
favicon = httplib.HTTPConnection('www.' + root_directory)
favicon.request('GET','/favicon.ico')
response = favicon.getresponse()
if response.status == 200:
return 'http://%s/favicon.ico' % ('www.' + root_directory)
except:
pass
return None
It was late in the game when I got accepted to UT Graduate School. A lateness I attribute to every other applicant being much more interesting than me but ultimately deciding to instead go on and do something completely amazing.
So I had a week to decide, and everyone had already hired interns. Even my inside contacts couldn’t change that the hiring had already been done. So I resigned myself to the prospects of being unemployed for the summer and started making myself useful. But I jumped on an internship opportunity that popped up.
After a typical get-to-know-you type phone interview, they tossed a programming assignment my way. Certainly more interesting than the “insert this number into a linked list” and “implement a graph search” kinds of interview problems thrown at me before.
The idea I had was to rewrite the django.utils.html.urlize function. From the docs: Convert any URLs in text into clickable links. …
More completely described in the source. So that night, I cracked my knuckles and cranked something out. For comparison:
if middle.startswith('http://') or middle.startswith('https://'):
url = urlquote(middle, safe='/&=:;#?+*')
...
trimmed_url = trim_url(middle)
middle = '<a href="%s"%s>%s</a>' % (url, nofollow_attr, trimmed_url)
if lower(current).startswith(prefix):
url = PREFIXES[pref] + current
...
if url:
result.append(unicode('%s<a href="%s"%s>%s</a>%s'
% (beg, url, ' rel="nofollow"' if nofollow else '', link, end))))
Which prompted a quick response of:
Looks very good — definitely a lot more readable than the Django version. … A quick benchmark shows that your version is twice as fast as Django’s version.
And an internship offer. Now to be fair, my code is doing approximately half the work Django’s is, so it’s expected that mine is twice as fast. But there is an open bug to optimize urlize, so I tack another task onto my todo.
Since “you should always optimize for readability first” 1, it should be much easier to optimize my code than Django’s.
There was one big lesson I learned about software development that I picked up in a Algorithms and Data Structures class at Rice. When implementing a large application, get something working first, then add features one at a time. In that vein, I got swample working and online quickly. But soon thereafter, I focused on optimization.
As I touched on last time, from the beginning I wanted to support all possible words up to the maximum of sixteen letters. To do that, I knew that I would have to be clever in solving to keep the wait time to a minimum. So from the beginning I implemented the narrowing I mentioned. However, once the application was live, I knew I’d have to go back and revisit optimizing the solver. Which Google App Engine quickly reminded me of by spouting CPU quota warnings.
This request used a high amount of CPU, and was roughly 1.4 times over the average request CPU limit. High CPU requests have a small quota, and if you exceed this quota, your app will be temporarily disabled.
So I cleaned up my prototype code and turned to a profiler. Python has an included profiler, so getting some statistics about my algorithm was easy enough.
>>> cProfile.run('b.solve()')
1926701 function calls (1916099 primitive calls) in 3.640 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.014 0.014 3.640 3.640 :1()
1 1.489 1.489 3.626 3.626 board.py:20(solve)
2338/16 0.710 0.000 1.171 0.073 board.py:35(__bfs)
...
Which gave me some direction on where to focus my attention. From there, I refactored code to make use of pretty simple efficiency improvements and cut down the number of unnecessary string operations that I had thrown in just to get something working. Since I already had a working prototype, I could ensure that every optimization I made both (a) didn’t break the application and (b) actually lowered processing time.
Immediately wait time to the end user was dramatically lowered, and I haven’t gotten one of those CPU warnings since I posted the change.
The best advice that’s post-it noted to my monitor is “patience/do it now!”. I lean left or right depending on context. It’s been a lean right couple of weeks. It’s better to do something and be making progress than have a great idea. A step forward is a step forward.
So this past week, I wrote and launched Swample. It’s something Richard and I had been talking about throwing together for a while now. Our first serious attempt at a web application. So with the launch of Google App Engine, I finally got around to it and knocked out a prototype.
It’s a pretty trivial little application for now, I’m looking to add features as we go. There was only one algorithmic sticking point. I wanted to tell the user during the game whether or not the words entered were valid. Which meant I had to solve the board up front before I set the user loose. But how can I solve the board fast enough so that the user isn’t waiting forever for a game to load?
Most solutions I saw via google just simply give up at a certain word length. Not counting words with more than 8 letters, presumably because of a scrabble dictionary. This certainly simplifies the solving algorithm and the space required for a dictionary. But for me wasn’t a valid solution.
In the unlikely event that a users finds a word that’s more than 8 letters, I didn’t want to disappoint with an ‘word not found’. And a brute force breadth first search was simply too slow. So I solved the problem by pruning the breadth first search.
If you’re looking at words and your word starts off with ‘QuZR’, there’s no point looking any more. So at each step along the way, only consider the segment of the dictionary that starts with the word you’ve built so far. For example, in python:
#narrow dictionary before next step of bfs ndict = dict([(word, 0) for word in words if word.startswith(start)])
This cuts huge dead end paths. As well as lowers your time searching in the dictionary, since you’re only searching a subset of it. It’s maybe memory intensive, but that’s the trade-off I was looking to make.
The last line of my last post bothered me from the time I wrote it.
However, the real lesson is that I should have just posted the sketch. The desire for perfection is useless.
Not that I don’t agree with it, I do. I just couldn’t precisely define what I actually meant by it.
Mathematics is a field where perfection is everything. So it bothered me that I could say that the desire for perfection in a proof was useless, yet I knew I agreed.
I now finally understand what I meant. The key is context. I was sketching a proof not to convince anyone of the truth of the statement but to justify to myself my own use of a termination condition that I had already written into code. The sketch I drew up, despite being incorrect, convinced me of that condition’s correctness and gave me a glimpse of why. Though that glimpse was incomplete, it served it’s purpose in the context.
Anyway, at some point, I’ll actually follow through with writing things other than maths here.
I was writing some code last week to compute the period of the continued fraction expansion of the square root of the natural numbers. When I needed a condition to know when the repetition started to end my recursion. There’s a natural point when double the leading term appears that pops up in the tables.
But I wasn’t entirely convinced, so I sketched a quick proof to make myself believe before I ended up coding it into my recursion.
So in keeping with my ideals to keep everything in direct contact with each other. Like my notes from classes at Rice. Math rubbing shoulders with Literature. Science next to art. I intended on posting it up here.
However, in fleshing out the details of my proof, it ended up being much more complicated than the simple induction that was enough to convince me to use it. I ended up falling back onto the shoulders of giants and read the section on it in Waclaw Sierpinski’s Elementary Theory of Numbers.
Ultimately, it’s all for naught anyway, since a simple recursion on the error term is a terrible way to tackle the problem because of the lack of floating-point precision. The exact problem continued fractions are trying to solve in the first place.
However, the real lesson is that I should have just posted the sketch. The desire for perfection is useless.
For the last couple years of my time at Rice I kept all my class notes in one black marble bound composition notebook. Spring 2005. Fall 2005. Every class in the same book marked off. English notes on post-modern novels next to frantically scribbled “Let ∂>0.” next to the next little snippet of my next short story.
When I graduated I intended on keeping the same composition notebook. No longer marked off by semesters, but one book kept until full. And there it sits in my laptop bag labeled Spring 2007 until … But there’s only one page written in it. My musings coming to terms with why I really do have to break up with Nikki and my flash fiction end-of-a-chapter description.
So here I am, almost a year later with a realization. The black marble composition book will no longer work. I don’t carry it around with a freshly sharpened #2 pencil everywhere I go anymore. But what I do do now is sit around for a large portion of everyday staring into the warm glow of a computer monitor. Wasting most of my time reading random internet articles. It’s time for me to abandon my #2 pencil and take up my keyboard.
Same story different language.