Random
16TH OF OCTOBER, THE YEAR 2006
GUNNING FOR GUH
I am not dead, just rather busy. I know, I got through four years of college and still managed to squeeze in a little time on the weekends for poor old Guh, but I spend more than enough time in front of a computer these days, so you’ll forgive me if I spend my free moments . . . refreshing my RSS feeds.
One of the things occupying me this weekend was a school assignment that required us to write a program that calculates the Gunning-Fog Index for a passage of text. You know how The Man says some prose is written at a 10th grade level? Or how The Jackass used to say he was reading at an 8th grade level when you and he were in 6th? Well, one common measure of that grade level is the Gunning-Fog Index, and there’s an equation to calculate it:
$$\text{GF} = 0.4 \times \left( \frac{\text{words}}{\text{sentences}} + 100 \times \frac{\text{complex words}}{\text{words}} \right)$$
where “complex words” have three or more syllables, excluding proper nouns, compound words, and jargon, and not counting common suffixes (-ed, -ing) as syllables.
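The index is usually stated as 0.4 times the sum of the average sentence length and the percentage of complex words. Here is a minimal sketch of just that arithmetic, assuming the three counts have already been worked out somewhere else. The function name `gunning_fog`, its parameters, and the choice to return 0 for empty text are my own assumptions for illustration, not part of the assignment.

```python
def gunning_fog(num_words, num_sentences, num_complex_words):
    """Return the Gunning-Fog index from pre-computed counts."""
    if num_words == 0 or num_sentences == 0:
        return 0.0  # assumption: score empty text as 0 instead of failing
    avg_sentence_length = num_words / float(num_sentences)
    percent_complex = 100.0 * num_complex_words / num_words
    return 0.4 * (avg_sentence_length + percent_complex)


# 100 words in 5 sentences, 10 of them complex:
# average sentence length 20, complex words 10 percent
print(gunning_fog(100, 5, 10))
```

The hard part is not this function but producing honest counts to feed it.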
Excluding proper nouns and counting syllables is actually a bit tougher than it sounds. I have the code put together (handed it in this morning), but I want to brush it up a little bit before I post it. I really wanted to run it on my Random posts to see if there are any trends, but my first pass doesn’t seem to reveal any interesting patterns:

More later…
Much much later…
Ok, here’s a slightly better chart, where I filtered out the outliers (zeros and values above 25), and linked all the points to their respective posts.
If you click through some of those you’ll probably notice some fishiness, mostly due to posts with barely any words. They should all score quite low due to low sentence count and mostly low words/sentence, but there are some weird exceptions. I think there’s a semi-significant trend toward a more consistent level of readability in the past few years, which is kind of neat. Or at least a higher minimum. Actually, the readability of my writing seems to be settling around the grade level at which I started this blog: junior year of high school, grade 11. That’s actually kind of cool.
I spent a lot of time putting together a fancy schmancy DIV-based bar chart too, somewhat similar to those on last.fm, but WordPress actually choked on a post that big. Go figure. Code in a bit…
Methods
First step was to pull all my posts out of the database, filtering by category and ordering by date. Then I did a little pre-processing, stripping out <code> content and all HTML tags. Then I ran my Gunning-Fog code on each post, which is based on a simple tokenizer and sentence boundary detector, part-of-speech tagging based on the Brown Corpus General Fiction and Press Reportage, word stemming based on WordNet, and proper noun removal based on the POS tags. I didn’t try to deal with compound words or jargon.
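To give a feel for the pre-processing step, here is a rough sketch that drops `<code>` elements along with their contents, strips the remaining tags, and does a very naive sentence split. The regexes and the names `strip_markup` and `split_sentences` are assumptions made up for this example. They are a crude stand-in, not the tokenizer or boundary detector described above.

```python
import re

# <code>...</code> including its content, across line breaks
CODE_BLOCK = re.compile(r'<code\b[^>]*>.*?</code>', re.IGNORECASE | re.DOTALL)
# any remaining tag
TAG = re.compile(r'<[^>]+>')
# whitespace that follows sentence-ending punctuation
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def strip_markup(html):
    """Drop <code> elements with their content, then any remaining tags."""
    text = CODE_BLOCK.sub(' ', html)
    text = TAG.sub(' ', text)
    return re.sub(r'\s+', ' ', text).strip()


def split_sentences(text):
    """Naive boundary detection: split after ., ! or ? followed by whitespace."""
    return [s for s in SENTENCE_END.split(text) if s]


post = '<p>I wrote <em>some</em> code. Here it is:</p><code>x = 1</code><p>Neat!</p>'
clean = strip_markup(post)
print(clean)
print(split_sentences(clean))
```

A splitter this simple trips over abbreviations like "e.g." and over ellipses, which is one reason short posts can end up with odd scores.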
Here’s some conceivably interesting code:
Syllable Counting
import re

def countSyllables(word):
    """Estimate the number of syllables in a word with a few regex heuristics."""
    # pad with a space so a leading vowel still follows a non-vowel
    word = ' ' + word
    flags = re.IGNORECASE | re.VERBOSE
    syllablePattern = r'''
        [^aeiouy][aeiouy]        # consonant-vowel
    '''
    addPattern = r'''
        [^aeiouy]ia[^aeiouy]     # ia, e.g. diabetic
        | ism$                   # -ism, e.g. postmodernism
    '''
    removePattern = r'''
        [aeiouy][^aeiouy]e$      # ends with vowel-consonant-e
        | [^aeiouy]ense$         # -ense, e.g. tense, nonsense
    '''
    syllables = len(re.findall(syllablePattern, word, flags))
    additions = len(re.findall(addPattern, word, flags))
    exceptions = len(re.findall(removePattern, word, flags))
    return syllables + additions - exceptions
Making the Chart with Matplotlib
from matplotlib import dates, figure
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas

DPI = 72.0
FIGWIDTH = 450.0    # pixels
FIGHEIGHT = 500.0   # pixels

# only look at GFs between 0 and 25 (assume the others are outliers)
# resultsPart is my list of WP posts with Gunning-Fog indices
targetResults = [(pid, date, title, gf) for pid, date, title, gf in resultsPart
                 if 0 < gf < 25]

# format date axis
years = dates.YearLocator()    # every year
months = dates.MonthLocator()  # every month
monthsFmt = dates.DateFormatter('%m')
yearsFmt = dates.DateFormatter('%Y')

# setup and make the figure/plot
fig = figure.Figure()
fig.set_size_inches(FIGWIDTH/DPI, FIGHEIGHT/DPI)
ax = fig.add_subplot(111)
ax.xaxis.set_major_formatter(yearsFmt)
xdata = [dates.date2num(date) for pid, date, title, gf in targetResults]
ydata = [gf for pid, date, title, gf in targetResults]
plot = ax.scatter(xdata, ydata)

# print the fig to file (plotName is the output file path)
canvas = FigureCanvas(fig)
canvas.print_figure(plotName, dpi=DPI)
Making an HTML Image Map from a Matplotlib Plot
def plot2imgmap(plot, hrefs=None, onclicks=None):
    """Generate an HTML image map for the points in a scatter plot. Adapted
    from code by Andrew Dalke at
    http://www.dalkescientific.com/writings/diary/archive/2005/04/24/interactive_html.html

    @param plot      RegularPolyCollection from matplotlib. The sort of thing
                     returned by scatter()
    @param hrefs     List of URIs for the href attributes of the
                     elements
    @param onclicks  List of contents for the onclick attribute of the
                     elements
    """
    dpi = plot.figure.get_dpi()
    img_width = int(plot.figure.get_figwidth() * dpi)
    img_height = int(plot.figure.get_figheight() * dpi)
    # transform the data points into pixel coordinates
    trans = plot.get_transform()
    data = plot.get_verts(trans)
    xdata = [x for x, y in data]
    ydata = [y for x, y in data]
    xcoords, ycoords = trans.seq_x_y(xdata, ydata)
    # pair each point with the attributes for its map element
    if hrefs and onclicks:
        attrList = ['href="%s" onclick="%s"' % (href, onclick)
                    for href, onclick in zip(hrefs, onclicks)]
        mapElts = zip(xcoords, ycoords, attrList)
    elif hrefs:
        mapElts = zip(xcoords, ycoords, ['href="http://pageofguh.org/random/%s"'
                                         % href for href in hrefs])
    elif onclicks:
        mapElts = zip(xcoords, ycoords, ['onclick="%s"' % onclick for onclick
                                         in onclicks])
    outStr = '''

'
    return outStr
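For the last step, turning (x, y, attributes) tuples into markup, here is a minimal sketch of one way it could look. It is not the code from the function above: the name `build_image_map`, the `map_name` and `radius` parameters, and the choice of circular `<area>` elements are all assumptions for illustration. It also assumes the pixel coordinates are measured from the bottom-left corner of the image, so y has to be flipped for HTML.

```python
def build_image_map(points, img_height, map_name='plotmap', radius=5):
    """Build a <map> of circular <area> elements.

    points is a list of (x, y, attrs) tuples, where x and y are pixel
    coordinates measured from the bottom-left corner of the image and
    attrs is a ready-made attribute string such as 'href="..."'.
    """
    areas = []
    for x, y, attrs in points:
        # HTML image maps measure y from the top edge, so flip it
        top_y = img_height - y
        areas.append('<area shape="circle" coords="%d,%d,%d" %s />'
                     % (round(x), round(top_y), radius, attrs))
    return '<map name="%s">\n%s\n</map>' % (map_name, '\n'.join(areas))


print(build_image_map([(100, 400, 'href="/random/1"')], img_height=500))
```

The image itself then needs a matching `usemap="#plotmap"` attribute for the browser to connect the two.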
HTML-Only Bar Chart
I wrote this before I got the client-side image map working, but I still think it’s pretty cool.
import math

def divchart(data, chartWidth=500, barElt="div", barAttrs=[]):
    """Takes a sequence of input data and labels and returns a string containing
    a block element-based HTML bar chart.

    @param data        List of (dataPoint, label) tuples
    @param chartWidth  Width of the chart in pixels
    @param barElt      HTML element to be used for the chart bars
    @param barAttrs    List of additional HTML attributes for each bar. Must
                       be the same length as data
    """
    outStr = '
    # float so the width calculations below never use integer division
    maxPt = float(max([pt for pt, label in data]))
    # make the x axis
    axisStr = '
    for tick in range(int(math.ceil(maxPt))):
        tick += 1
        width = tick / maxPt * chartWidth
        axisStr += '''
''' % (width, tick)
    axisStr += '
'
    outStr += axisStr
    # attach the extra attributes to each bar if there is one per data point
    if len(data) == len(barAttrs):
        data = [(pt, label, barAttrs[i]) for i, (pt, label) in enumerate(data)]
    else:
        data = [(pt, label, '') for pt, label in data]
    for pt, label, attr in data:
        width = pt / maxPt * chartWidth
        outStr += '''
<%s %s class="bar" style="width: %0.2fpx">
%0.2f
%s
</%s>''' % (barElt, attr, width, pt, label, barElt)
    outStr += axisStr
    outStr += '
'
    return outStr
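The core idea of a DIV-based bar chart is just to scale each bar's pixel width against the largest value. Here is a stripped-down sketch of that idea, with no axis. The function name `simple_div_chart` and the `chart` and `bar` class names are assumptions for illustration and are not taken from the function above.

```python
from html import escape


def simple_div_chart(data, chart_width=500):
    """Return an HTML bar chart built from <div> elements.

    data is a list of (value, label) tuples with non-negative values.
    """
    max_value = max(value for value, label in data)
    rows = []
    for value, label in data:
        # scale each bar relative to the largest value
        width = value / max_value * chart_width if max_value else 0
        rows.append('<div class="bar" style="width: %0.2fpx">%0.2f %s</div>'
                    % (width, value, escape(label)))
    return '<div class="chart">\n%s\n</div>' % '\n'.join(rows)


print(simple_div_chart([(12.0, 'Post A'), (6.0, 'Post B')]))
```

A background color on `.bar` in the stylesheet is all it takes to make the bars visible.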

3 COMMENTS
what’s on the horizontal axis? time? have you read the serenity/firefly comic? covers the time between the final episode of firefly and the movie. pretty cool. jasten lent it to me. 10 bucks…expensive for a short ass comic. waiting for frank miller to arrive: 300 and ronin. letters. b.
The only reason my blog might be great one day, is because people are required to post completely anonymously, and all PROPER nouns are filtered and replaced by *****, so that no names or places are visible. And that’s the way it should be. And the way I’d like it.
Is that possible to do, and how can it be added to my blog?
http://whenyouhavetotellsomeone.blogspot.com
Hi Bill,
The only solutions I can think of require some programming. The stupidly simple way to filter proper nouns would be to simply look for capitalized words, but obviously that would be a simple constraint to avoid, and you’d have problems with sentence boundaries. The harder way would be to write and train some kind of part-of-speech tagger and/or shallow parser (check out the syllabus for the class in which I did the work above). There are probably more innovative solutions involving dictionary look-ups (most proper nouns won’t be in a dictionary, or they will be capitalized) or using Google search frequencies.
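For what it's worth, the stupidly simple version might look something like the sketch below. It masks any capitalized word that does not start a sentence and leaves the pronoun "I" alone. The function name `mask_capitalized`, the regexes, and the sentence-boundary rule are all assumptions for illustration, and it has exactly the weaknesses described above.

```python
import re

WORD = re.compile(r"[A-Za-z']+")


def mask_capitalized(text, mask='*****'):
    """Replace capitalized words that do not start a sentence."""
    out = []
    pos = 0
    sentence_start = True
    for match in WORD.finditer(text):
        gap = text[pos:match.start()]
        # punctuation between words decides whether a new sentence began
        if re.search(r'[.!?]', gap):
            sentence_start = True
        word = match.group()
        is_pronoun_i = word.split("'")[0] == 'I'  # I, I'm, I'd, ...
        if word[0].isupper() and not sentence_start and not is_pronoun_i:
            word = mask
        out.append(gap + word)
        pos = match.end()
        sentence_start = False
    out.append(text[pos:])
    return ''.join(out)


print(mask_capitalized('Yesterday I met Bill in Boston. He was fine.'))
```

Names at the start of a sentence slip straight through, and anyone who types a name in lowercase defeats it entirely.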