Read the POD documentation.
This is another project in my experiments with Natural Language Processing (NLP). I've called it 'Mockingbird' because it attempts to imitate the writing style of another written work. The real name is Babbler, but I decided to use that as an umbrella for my NLP projects and research.
I had heard some time ago, must've been 1993, about software written to help spoof someone else on USENET. "Wow, great idea!" thought my hacker self, but as usual I didn't have the malevolence to seek out the tool or to find a victim to target. I didn't give it much thought until I came across an MS-DOS program (also called Babbler) that claimed to do the same. I used it to imitate a friend's eclectic stream-of-consciousness writing, and ran a contest among his friends: pick out which passage the human had written. MS-DOS Babbler lost, no contest, but it still got me thinking.
Later, I wrote TinGoth and used it as a puppet. People found it very entertaining (especially its caustic attitude towards its creator), and I settled on the idea of getting TinGoth to speak for itself. First I explored AI attempts like Eliza, Racter and Parry (I'll show off my successes with those later). The rumour about the USENET imitation software came to mind again, so I started working on something similar.
The method I decided on for the imitator is very similar to a statistical technique called Markov Chains. I didn't know it at the time; all I had to work on was a goal and a cloudy memory of seeing a similar algorithm before. A Markov Chain is a sequence with a finite number of different values, where each value in the sequence depends only on the previous value. Strictly speaking it's states and probability vectors, but when you're dealing with Markov Chains you usually only care about the final state/value, whereas I'm interested in the whole sequence. In the case of imitating someone's style, I'm hoping to learn the probability of a word occurring after another word.
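The idea shows up clearly on a toy corpus. Here is a Python sketch (the real Babbler is Perl) using an invented two-sentence corpus, just to illustrate word-follows-word probabilities:

```python
from collections import Counter

# Invented toy corpus standing in for an author's text.
corpus = "the cat sat . the dog sat ."
words = corpus.split()

# Count which word follows each word: a first-order Markov chain.
follows = {}
for prev, cur in zip(words, words[1:]):
    follows.setdefault(prev, Counter())[cur] += 1

# Flatten counts into probabilities: after "the", cat and dog are
# equally likely; after "sat", a full stop is certain.
probs = {w: {nxt: n / sum(c.values()) for nxt, n in c.items()}
         for w, c in follows.items()}
print(probs["the"])  # {'cat': 0.5, 'dog': 0.5}
```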
First I start by reading a large text of the human's writing. I've been using Carroll's "Alice's Adventures in Wonderland" and the King James Bible book of Genesis, thanks to Project Gutenberg, plus an essay a friend of mine wrote on a theological subject. The babbler parses the text into a stream of tokens, where each space-separated word is a token, and the start and end of a sentence are special tokens. A hashtable is formed, where each key-token has a list of the value-tokens that have followed it in the text (and how often each value-token occurred, which is later flattened into probabilities). After finishing the text, the program has an imitation matrix ready to start churning out sentences in the author's style.
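That tokenising-and-counting pass can be sketched in Python (the original is Perl; the `<s>`/`</s>` token names and the crude sentence-splitting regex below are my assumptions, not the real parser's rules):

```python
import re
from collections import defaultdict, Counter

START, END = "<s>", "</s>"  # stand-ins for the special tokens

def build_matrix(text):
    """Build the key-token -> Counter(value-token) table from a text."""
    matrix = defaultdict(Counter)
    # Crude sentence split on ., ! or ? -- an assumption for this sketch.
    for sentence in re.split(r"[.!?]+", text):
        tokens = sentence.split()
        if not tokens:
            continue
        chain = [START] + tokens + [END]
        for key, value in zip(chain, chain[1:]):
            matrix[key][value] += 1
    return matrix

matrix = build_matrix("Alice was beginning to get very tired. Alice was considering.")
# "was" has now been seen following "Alice" twice.
```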
To go from imitation matrix to output, the babbler starts with the start-of-sentence special key-token, finds its list of value-tokens, and chooses one at random (with the same frequency as was observed in the text). The chosen value-token is sent to output and becomes the new key-token, and the process repeats. This continues until the end-of-sentence special token becomes the key-token, and output stops.
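The generation loop, again sketched in Python rather than the original Perl (`<s>` and `</s>` are my stand-in names for the special tokens):

```python
import random
from collections import Counter

START, END = "<s>", "</s>"  # stand-ins for the special tokens

def babble(matrix, rng=random):
    """Walk the matrix from START until END turns up, emitting one sentence."""
    key, out = START, []
    while key != END:
        counter = matrix[key]
        # Draw the next token with the same frequency observed in the text.
        key = rng.choices(list(counter), weights=list(counter.values()))[0]
        if key != END:
            out.append(key)
    return " ".join(out)

# A tiny hand-made matrix; with one choice per key the walk is deterministic.
demo = {START: Counter(hello=1), "hello": Counter(world=1),
        "world": Counter({END: 1})}
print(babble(demo))  # hello world
```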
I'd rather discuss these shortcomings before showing you the sample output: the output can be quite impressive, so with luck you'll have forgotten them by the time you see it.
- Imitating sentence-by-sentence suits a reader with a short attention span, but the text being analyzed is rarely a series of unrelated sentences. Authors like Lewis Carroll, Ghod Almighty or my friend Pete tend to write in paragraphs, and in larger structures like essays or complex structures like verse.
- The imitation matrix has nothing that could be mistaken for comprehension. Grammar is often broken in the output, simply because so many English words are flexible enough to serve as several parts of speech.
- Quotations are sometimes orphaned, leaving a single " mark stranded in a sentence. For now I've merely obliterated them, but sometimes (especially when more power goes into creating the imitation matrix) a human can infer where the quotation marks "should" be placed.
- I used to have a method of saving imitation matrices to disk, but after out-of-memory problems using Data::Dumper, I'll have to write a new one. In the meantime, the examples on this page are going to be slow, because each has to rebuild the imitation matrix from scratch.
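One workaround is streaming the table to disk a key at a time, instead of serialising the whole structure into one huge string the way Data::Dumper does. A hypothetical Python sketch (one key per line, tab-separated; this is not the original save format):

```python
from collections import Counter

def save_matrix(matrix, path):
    """Write one key per line, so no single giant string is ever built."""
    with open(path, "w") as f:
        for key, counter in matrix.items():
            cells = [key] + [f"{tok} {n}" for tok, n in counter.items()]
            f.write("\t".join(cells) + "\n")

def load_matrix(path):
    """Read the same format back into a key -> Counter table."""
    matrix = {}
    with open(path) as f:
        for line in f:
            key, *cells = line.rstrip("\n").split("\t")
            matrix[key] = Counter({tok: int(n) for cell in cells
                                   for tok, n in [cell.rsplit(" ", 1)]})
    return matrix
```

Tokens are space-separated words, so a tab can't appear inside one, which is what makes the flat format safe here.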