Documentation
Read the POD documentation.
History
This is another project in my experiments with Natural Language Processing (NLP). I've called it 'Mockingbird' because it attempts to imitate the writing style of another written work. Its real name is Babbler, but I decided to use that name as an umbrella for my NLP projects and research.
I had heard some time ago, must've been 1993, about software written to help spoof someone else on USENET. "Wow, great idea!" thought my hacker self, but as usual I didn't have the malevolence to seek out this tool or to find a victim to target. I didn't think much of it until I came across an MS-DOS program (also called Babbler) that claimed to do the same. I employed it to imitate a friend's eclectic stream-of-consciousness writing, and offered his friends a contest: identify which passage was written by the human. MS-DOS Babbler lost, no contest, but it still got me thinking.
Later, I wrote TinGoth and used it as a puppet. People found it very entertaining (especially its caustic attitude towards its creator), and I settled on the idea of getting TinGoth to speak for itself. First I explored AI attempts like Eliza, Racter and Parry (I'll show off my successes in that area later). The rumour about the USENET imitation software came to mind again, so I started working on something similar.
The method I decided on for the imitator is very similar to a statistical technique called Markov Chains. I didn't know it at the time; all I had to work from was a goal and a cloudy memory of seeing a similar algorithm before. A Markov Chain is a sequence with a finite number of different values, where each value in the sequence depends only on the previous value. Strictly speaking it's states and probability vectors, but when you're dealing with Markov Chains you usually only care about the last state/value, whereas I'm interested in the whole sequence. In the case of imitating someone's style, I'm hoping to learn the probability of a word occurring when it follows another word.
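As a made-up example (the counts aren't from any of my texts): if "the" is followed by "cat" three times and by "dog" once in the training text, the imitator estimates P(cat | the) = 3/4 and P(dog | the) = 1/4, and it will choose the word after "the" with exactly those odds.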
First I start by reading a large text of the human's writing. I've been using Carroll's "Alice's Adventures in Wonderland" and the King James Bible book of Genesis, thanks to Project Gutenberg, and an essay written by a friend of mine on a theological subject. The babbler parses the text into a stream of tokens, where each space-separated word is a token, and the start and end of a sentence are special tokens. A hash table is formed, where each key-token has a list of value-tokens that have followed it in the text (and how often each value-token occurs, which is later flattened into probabilities). After finishing the text, the program has an imitation matrix ready to start churning out sentences in the author's style.
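Here is a minimal sketch of that table-building step in Perl. The token names ("__START__", "__END__") and the variable and sub names are my own illustration, not Babbler's actual internals, and I'm keeping raw counts rather than flattening them into probabilities:

    use strict;
    use warnings;

    my %followers;   # key-token => { value-token => count }

    sub learn_sentence {
        my ($sentence) = @_;
        # Space-separated words, bracketed by the special tokens.
        my @tokens = ('__START__', split(/\s+/, $sentence), '__END__');
        for my $i (0 .. $#tokens - 1) {
            $followers{ $tokens[$i] }{ $tokens[$i + 1] }++;
        }
    }

    # Feed it a couple of sentences of "training" text.
    learn_sentence($_) for
        'the cat sat on the mat',
        'the dog sat on the rug';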
To go from imitation matrix to output, the babbler starts with the start-of-sentence special key-token, finds its list of value-tokens, and chooses a value-token randomly (with the same frequency as was observed in the text). The value-token is sent to output, and becomes the new key-token for repeating the process. This continues until the end-of-sentence special token has become the key-token, and output stops.
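Continuing the sketch above (same caveats: the names are assumed, and I'm sampling from raw counts rather than pre-flattened probabilities), the generation loop looks roughly like this:

    sub babble {
        my @out;
        my $key = '__START__';
        while (my $choices = $followers{$key}) {
            # Weighted random pick: each candidate is chosen with
            # probability proportional to its observed count.
            my $total = 0;
            $total += $_ for values %$choices;
            my $pick = rand($total);
            for my $next (keys %$choices) {
                $pick -= $choices->{$next};
                if ($pick < 0) { $key = $next; last; }
            }
            last if $key eq '__END__';
            push @out, $key;
        }
        return join ' ', @out;
    }

    print babble(), "\n";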
Shortcomings
I'd rather talk about these before showing you the sample outputs, because the output can be quite impressive and hopefully you'll have forgotten these by the time you see it.
- Imitating sentence-by-sentence is fine for someone with a short attention span, but the text being analyzed is rarely a series of unrelated sentences. Authors like Lewis Carroll, Ghod Almighty or my friend Pete tend to write in paragraphs, and in larger structures like essays or complex structures like verse.
- The imitation matrix doesn't have anything that could be mistaken for comprehension. Grammar is often broken in the output, simply because so many English words can serve as more than one part of speech.
- Quotations are sometimes orphaned, with a single " mark in a sentence. I've merely obliterated them, but sometimes (especially when using more power in creating the imitation matrix) a human will infer where quotation marks "should" be placed.
- I used to have a method of saving imitation matrices to disk, but after out-of-memory problems using Data::Dumper, I will have to write a new one (one possible approach is sketched below). In the meantime, the examples on this page are going to be slow, because they have to rebuild the imitation matrix each time.
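For what it's worth, Perl's core Storable module can serialize a hash of hashes to disk without building one huge dump string in memory the way Data::Dumper does. This is only a hedged sketch of one possible replacement, not Babbler's actual save method, and the filename is made up:

    use Storable qw(nstore retrieve);

    # Write the follower-count table built earlier to disk...
    nstore(\%followers, 'imitation-matrix.stor');

    # ...and later read it back without re-parsing the source text.
    my $matrix = retrieve('imitation-matrix.stor');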