Natural Language Processing (NLP) is a pretty cool thing to have on your resume, right? So I figured I’d repurpose an old personal project, make it more abstract, and incorporate NLP in the process.
The Challenge
Here is the challenge: given a book with a number of different point-of-view (POV) characters, figure out who the POV character is for each chapter.
People reading this are probably wondering why anyone would ever want to know this. Well, I just wanted to know if a character who had ostensibly died was really dead, or if they’d have a chapter later in the book. I also have POV characters whose chapters I really look forward to, and if I know one is coming up, I am more inclined to continue pushing through the current chapter.
First Attempts
A year and a half back, I did this for Cryptonomicon, by Neal Stephenson, a book every geek either has read or should read 🙂 But I did it in a very naive way: frequency counts.
Strategy: given a finite set of names, for each chapter in a book, count the number of times each name appears. Whoever has the most mentions must be the POV character. Right? Riiiiight…
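Here is a minimal sketch of that counting strategy in Python. It is not the original code; the function name `guess_pov_by_frequency` and the sample chapter text are made up for illustration, and it assumes each chapter is already available as a plain string.

```python
import re
from collections import Counter

def guess_pov_by_frequency(chapter_text, names):
    """Return the name mentioned most often in the chapter, or None if no name appears."""
    counts = Counter()
    for name in names:
        # \b keeps a short name like "Root" from matching inside "Rooted".
        pattern = r"\b" + re.escape(name) + r"\b"
        counts[name] = len(re.findall(pattern, chapter_text))
    if not counts or max(counts.values()) == 0:
        return None
    return counts.most_common(1)[0][0]

# Invented sample text, not a quote from the book.
chapter = "Waterhouse looked up. Shaftoe said nothing. Waterhouse kept walking."
print(guess_pov_by_frequency(chapter, ["Waterhouse", "Shaftoe", "Root"]))  # Waterhouse
```

Note that the list of names still has to be supplied by hand, and a chapter where everyone keeps talking about an absent character would fool it.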

It wasn’t bad, but it wasn’t great. For one thing, I had to give the program the set of characters before I ran it. For another, frequency counts are by no means a perfect determinant.
Why NLP?
Imagine, instead, if the code could figure out who is speaking by looking for key phrases like “Waterhouse thought to himself, wow, a spider web!”
So personal verbs like felt, loved, heard, and thought are very good determinants. There are constraints, too: for instance, the verb shouldn’t count if it appears inside quotes. Also, that private verb might be preceded by a generic pronoun (he felt, she loved) rather than a name.
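As a rough sketch of the idea, the Python below counts "name + private verb" pairs after throwing away anything inside double quotes. The verb list is just the four verbs mentioned above, the function name `private_verb_subjects` is hypothetical, and plain regular expressions stand in for real NLP here. Pronouns are not handled in this version.

```python
import re

# Starter list taken from the verbs mentioned in the text; a real list would be longer.
PRIVATE_VERBS = ("felt", "loved", "heard", "thought")

# Matches text inside straight or curly double quotes.
QUOTED = re.compile(r'"[^"]*"|“[^”]*”')

def private_verb_subjects(text, names):
    """Count 'Name <private verb>' pairs that occur outside quoted speech."""
    narration = QUOTED.sub(" ", text)  # drop dialogue before matching
    verbs = "|".join(PRIVATE_VERBS)
    counts = {}
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\s+(?:" + verbs + r")\b"
        counts[name] = len(re.findall(pattern, narration))
    return counts

# Invented sample text: the quoted "Shaftoe felt sick" should not count.
text = ('Waterhouse thought to himself, wow, a spider web! '
        '"Shaftoe felt sick," said Root. Shaftoe heard nothing.')
print(private_verb_subjects(text, ["Waterhouse", "Shaftoe", "Root"]))
# {'Waterhouse': 1, 'Shaftoe': 1, 'Root': 0}
```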
What I think I need to do is use part-of-speech tagging to analyze each sentence, come up with weights, and figure out who is the subject of the sentence based on the context (previous or subsequent sentences).
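To make the weighting idea concrete, here is a crude Python sketch that does not use part-of-speech tagging at all. It swaps in a simple stand-in: a pronoun followed by a private verb is credited to the most recently mentioned name, at a lower weight than an explicit name. The weights, the function name `score_pov`, and the sample text are all assumptions for illustration, and the heuristic ignores gender and quoted speech.

```python
import re

PRIVATE_VERBS = ("felt", "loved", "heard", "thought")
NAME_WEIGHT = 1.0     # assumed weight for "Waterhouse thought"
PRONOUN_WEIGHT = 0.5  # assumed lower weight for "he thought"

def score_pov(text, names):
    """Score each name; pronouns are credited to the last name seen before them."""
    name_alt = "|".join(re.escape(n) for n in names)
    verbs = "|".join(PRIVATE_VERBS)
    # One pass over the text: a name or pronoun, optionally followed by a private verb.
    token = re.compile(
        r"\b(?:(?P<name>" + name_alt + r")|(?P<pron>[Hh]e|[Ss]he))\b"
        r"(?P<verb>\s+(?:" + verbs + r")\b)?"
    )
    scores = {n: 0.0 for n in names}
    last_named = None
    for m in token.finditer(text):
        if m.group("name"):
            last_named = m.group("name")
            if m.group("verb"):
                scores[last_named] += NAME_WEIGHT
        elif m.group("verb") and last_named is not None:
            scores[last_named] += PRONOUN_WEIGHT
    return scores

# Invented sample text.
text = "Waterhouse walked in. He felt tired. Shaftoe thought about lunch. He heard a noise."
print(score_pov(text, ["Waterhouse", "Shaftoe", "Root"]))
# {'Waterhouse': 0.5, 'Shaftoe': 1.5, 'Root': 0.0}
```

A real tagger should do better than "last name mentioned", since it can tell which noun is actually the subject of the sentence.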
For that, I’ll use an NLP library, something like OpenNLP from Apache or LingPipe. I will document my exploration in the next post.
