24TH OF SEPTEMBER, THE YEAR 2006
I’m taking a class on natural language processing, the art/science/practice of using computers to derive meaning from natural human language. For our current assignment, we have to create Amazon-style concordances for freely available bodies of text: basically tag clouds of the most frequent words. It’s kind of cool to see how obvious a work’s identity is from word frequency alone, usually because of character names. Each concordance shows the 100 most frequent words, sans common ones like “and” and “the,” with each word sized by its count measured in standard deviations. I log-normalized the sizes a little (base 1.3). See if you can guess which Project Gutenberg books these are.
added always answer anything attention aunt away believe Bennet Bingley both came cannot Catherine certainly Charlotte Collins cried Darcy daughter dear do does done each Elizabeth enough evening family father feelings felt found friend Gardiner go happiness happy having hear heard herself himself hope however indeed Jane Kitty ladies Lady least less letter Lizzy Longbourn looked Lucas Lydia manner marriage Miss morning mother must myself Netherfield nor nothing often Oh opinion place pleasure present quite received replied room saw seemed seen shall sister sisters soon speak subject tell therefore though till told towards up went whole whom Wickham wish young
already always answered anything asked away behind between black brought business came cannot case chair course cried dear do done door doubt end enough eyes face far father found friend front gave go hand hands hardly having head heard himself Holmes however indeed knew lady leave light looked matter mind minutes Miss money morning must myself night nothing Oh once perhaps place possible put quite rather remarked room round saw seemed seen shall Sherlock side sir small son strange Street tell though told took turned understand until up upon Watson went wife window wish within woman words years yet young
Ahab air almost along among away Aye between boat boats body both called came Captain captain certain CHAPTER crew cried dead deck do does each end eyes face far feet fish Flask found full go God half hand hands head heard heart high himself hold men Moby moment must night nothing Oh once Pequod place poor Queequeg round saw sea seemed seen ship side sight sir small soon sort Sperm sperm Starbuck stood strange Stubb tell thee things thou though thus thy till times towards up upon voyage water went whale Whale whales whaling white White whole ye years yet
almost along among answered aspect away bosom breast brought came character child Chillingworth clergyman dark days deep Dimmesdale do enough evil eyes face far felt forest forth Governor hand hath head heart herself Hester himself however human indeed itself kept kind knew less letter light looked looking men mind minister moment mother must myself nature nor nothing once Pearl perhaps physician poor Prynne public Roger scarlet seemed seen shall shame side sin smile soul speak spirit stood strange thee themselves thou Thou though thus thy towards truth up upon voice whether whole whom whose wild within woman years yet young
allied almost America amount animals become beings believe between birds both breeds cannot case cases certain character characters closely common conditions country degree descendants descended differences different difficulty distinct do domestic doubt during each either extinct fact facts far form formations forms found genera general generally genus given groups habits having hybrids importance important individuals inhabitants instance intermediate islands large less manner modification modified must natural nature nearly number often organic organs parts period plants present probably produced productions quite seeds seems selection several single slight small sometimes species state structure theory though thus together varieties whether whole within yet
actually agreement already anyway Art Audie away bad bed book Boston both came car Colonelonic comm cool couple course days do doctor doing done door electronic end enough eyes face Fede felt found Foundation friends fucking full getting go Gran Gutenberg hand hands hard head hell himself http idea Jersey job Junta knew License Linda London looked making MassPike maybe means meet moment must nice office Oh OK once pretty private Project put roof room says set side sir stuff talk tell terms things though told took Trepan Tribe tried trying until up used wanted went whole working works years
So for all y’all nerds, here’s the code I used to generate these. It’s mostly based on the nltk_lite NLP package for Python, which includes a class called FreqDist, short for frequency distribution. You add a bunch of things to it, and it reports how often each item was added. More in the docs. If you spot any bugs, leave a comment. I’m not exactly the greatest programmer ever, so no doubt there are problems, or at least more efficient solutions.
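A quick aside for anyone without nltk_lite installed: the add-things-then-ask-for-counts idea can be roughly approximated with `collections.Counter` from the standard library. This is only a stand-in sketch in Python 3 and not the FreqDist API; the sample words and the `counts` and `top` names are made up for illustration.

```python
from collections import Counter

# Hypothetical sample tokens, just for illustration
words = "the whale the ship the whale Ahab".split()

counts = Counter()
for word in words:
    counts[word] += 1  # roughly the role fd.inc(word) plays in the listing

# most_common(n) returns (word, count) pairs, most frequent first
top = counts.most_common(2)
print(top)
```

The listing that follows uses FreqDist itself.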
import re
from math import log, sqrt

# assumed import path for FreqDist in nltk_lite
from nltk_lite.probability import FreqDist

# tokenizer (the class project) and STOPLIST are defined elsewhere


def amazonConc(text):
    # tokenize the text, using a tokenizer a partner and I developed for class
    t = tokenizer.parseTokens(text)

    # make a FreqDist of word frequencies, ignoring words in the stoplist
    fd = FreqDist()
    for word in t:
        if word.lower() not in STOPLIST:
            # letters only; [A-z] would also let through [ \ ] ^ _ and `
            if re.match(r'^[A-Za-z]+$', word):
                fd.inc(word)

    # return a list of (word, count) pairs
    d = []
    for word in fd.sorted_samples():
        d.append((word, fd.count(word)))
    return d


def tagCloud(tags, logB=None):
    # take the stdev of the counts
    std = stdev([freq for tag, freq in tags])

    # recalc the counts in terms of stdevs
    f = [(tag, freq / std) for tag, freq in tags]

    # log normalize the counts (test logB here, not the log function)
    if logB is not None:
        f = [(tag, int(log(devs, logB))) for tag, devs in f]
    else:
        f = [(tag, int(devs)) for tag, devs in f]

    # format the HTML string
    # (the markup in these strings got stripped from the post;
    # <p> and <big> are assumed stand-ins for the lost tags)
    html = '<p>'
    html += ' '.join([(devs * '<big>' + tag + devs * '</big>') for tag, devs in f])
    html += '</p>'
    return html


def stdev(seq):
    """Returns the standard deviation of the values in the sequence seq"""
    # convert before dividing so integer counts don't truncate the mean
    mean = float(sum(seq)) / len(seq)
    sumSquares = sum([(i - mean) ** 2 for i in seq])
    std = sqrt(float(sumSquares) / (len(seq) - 1))
    return std
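To make the sizing arithmetic a bit more concrete, here is a small Python 3 sketch of the same steps: divide each count by the standard deviation of all the counts, optionally take a log in the chosen base, and truncate to an integer level. It leans on `statistics.stdev` from the standard library (which uses the same n - 1 divisor as the stdev function above) instead of the hand-rolled version. The `size_levels` name and the sample counts are invented for illustration and are not from any of the books.

```python
from math import log
from statistics import stdev  # sample standard deviation (n - 1 divisor)


def size_levels(counts, log_base=None):
    """Map (word, count) pairs to integer size levels, mirroring tagCloud's math."""
    std = stdev([count for _, count in counts])
    levels = []
    for word, count in counts:
        devs = count / std  # the count expressed in standard deviations
        if log_base is not None:
            devs = log(devs, log_base)  # log normalize in the given base
        levels.append((word, int(devs)))  # int() truncates toward zero
    return levels


# Made-up counts, for illustration only
sample = [("whale", 1200), ("ship", 350), ("sea", 200), ("boat", 100)]

print(size_levels(sample))                # linear levels
print(size_levels(sample, log_base=1.3))  # log-normalized levels
```

Words whose counts fall below one standard deviation come out with a negative level after the log. That is harmless when the level is used as a string repeat count, since Python treats a negative repeat as zero and the word just gets no extra sizing.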
2 COMMENTS
hey ken-ichi, that’s really cool!
are those from the entire book?
how about posting some code?
Your wish is my command, Brent-san. Even if it screws up my formatting (need to look into this…)