Random
· 31ST OF DECEMBER, THE YEAR 2006
COMPUTERS VS. ENGLISH! EPIC SHOWDOWN!
For my final project in my Natural Language Processing class, I tried to develop provocative metrics for assessing writing quality in fiction. That is, describing passages of fiction with numbers that could, in theory, provoke discussion about their stylistic qualities, qualities wholly independent of their meaning. I tried a bunch of things, like measuring alliteration, consonance, and passages containing steady rhythms of emphatic syllables or particular phonemes, but the simplest things seemed to yield the most provocative results.
Unique Words

This is a plot of the number of unique words in a given “window” of text. Essentially, I chose a window size of, say, 200 words (“tokens” in NLP jargon), and then for each adjacent window of 200 words, counted how many unique words occurred. The books that I looked at were largely just single chapter excerpts, with the exception of Moby Dick, The Scarlet Letter, Eastern Standard Tribe, and Pride & Prejudice. I personally consider Perdido Street Station to be poorly written, Eastern Standard Tribe mediocre, Black Swan Green, Cloud Atlas, Never Let Me Go, and Pride & Prejudice to be excellent, and the rest to be so-so or unknown to me. I was happy to see that Black Swan Green distinguished itself. It’s also neat that Never Let Me Go and Harry Potter hung down near the bottom, with few unique words. These two patterns showed up in most of my experiments: Black Swan Green was consistently weird, and Never Let Me Go and Harry Potter were often “simple.” The latter may be due to Harry Potter being written for children and young adults, and Never Let Me Go being written from the perspective of a young adult recalling her childhood.
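The windowed count described above can be sketched roughly as follows. This is not the module attached to the report: the regex tokenizer and the function names `tokenize` and `unique_words_per_window` are assumptions made for illustration, and the sketch reads "adjacent" as non-overlapping windows and drops any leftover tokens at the end.

```python
import re


def tokenize(text):
    # Crude stand-in tokenizer (an assumption, not the report's tokenizer):
    # lowercase the text and keep runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def unique_words_per_window(tokens, window_size=200):
    # Walk the token list in adjacent, non-overlapping windows and count
    # the distinct tokens in each one. A trailing partial window is dropped
    # so that every count is measured over the same number of tokens.
    counts = []
    for start in range(0, len(tokens) - window_size + 1, window_size):
        window = tokens[start:start + window_size]
        counts.append(len(set(window)))
    return counts


if __name__ == "__main__":
    sample = "the cat sat on the mat the end"
    print(unique_words_per_window(tokenize(sample), window_size=4))
```

Plotting one such list of counts per book, window by window, would give a curve like the ones in the figure; a lower curve means more repeated words per window.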
Sentence Length

This graph plots sentence length for all the works under examination, and I threw in sentences from the Bulwer-Lytton Fiction Contest, where contestants compete to write the worst sentences possible. Again, Black Swan Green stands out, with an exceedingly limited range of sentence lengths and the lowest mean length. Interestingly, the Bulwer-Lytton sentences were all over the place. Their high mean sentence length is probably due to the perception that run-on sentences are “bad,” so many of the Bulwer-Lytton entries just go on forever.
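For a rough idea of how sentence-length figures like these can be computed, here is a minimal sketch. It assumes a naive sentence splitter (break after `.`, `!`, or `?` followed by whitespace) and whitespace-separated words; the function names are hypothetical, and a real sentence tokenizer would handle abbreviations and quoted dialogue far better.

```python
import re
from statistics import mean


def sentence_lengths(text):
    # Naive splitter (an assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace. Length is the number of whitespace-separated words.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [len(s.split()) for s in sentences if s]


def summarize(lengths):
    # Mean, shortest, and longest sentence length for one work.
    return {"mean": mean(lengths), "min": min(lengths), "max": max(lengths)}


if __name__ == "__main__":
    sample = "It was dark. It was a very stormy night! Why?"
    lengths = sentence_lengths(sample)
    print(lengths)
    print(summarize(lengths))
```

The spread between `min` and `max` corresponds to the "range" discussed above, and `mean` to the mean sentence length.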
Details
If you’re interested in all the gory details and the experiments that yielded less interesting results, you can download the PDF of my report below. I’ve also included a couple Python modules I used for calculating my metrics. They’re, uh, far from perfect, but I’m pretty sure they work as described. They’re a little interdependent, and most depend on the nltk_lite NLP package for Python, so be advised.
