NAME

Chatbot::Babbler - an object that attempts to mimic writing style


SYNOPSIS

        $yakker = new Chatbot::Babbler "textfile.txt";
        print $yakker->speak() . "\n";

        $yakker = new Chatbot::Babbler;
        $yakker->analyze_line($paragraph);
        print $yakker->speak() . "\n";


DESCRIPTION

Babbler is based on some code I saw elsewhere. I can't find it again, so I'm just going to reuse the algorithm as I remember it. I found out later that the algorithm is based on the statistical technique called Markov chains (only I don't use matrix multiplication). This is what I remember:

1 - read a text file, and break it up into individual words or groups of words as tokens.

2 - create a hash of arrays, with each token as a key, and each word that follows that token in the text as an element in the array. Special words are 'end of sentence' and 'beginning of sentence'.

3a - To create a sentence, start with a 'beginning' token.

3b - Get a random word out of the token's array, and output it.

3c - Change the token by taking out its first word and tacking the new word from step 3b onto the end of it.

Repeat steps 3b and 3c until the new word is the special 'end of sentence' word.

4 - Store the hash of arrays efficiently, for ghod's sake, because it will be huge. tie() may be a good idea, as well as Compress::Zlib. I'm using Data::Dumper for now.
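Steps 1 to 3 can be sketched in a few lines of Perl. This is only an illustration with a token size of one word, not the module's actual code: the names learn() and babble(), the marker strings and the 100-word safety cap are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical markers for the two special words (names are assumptions).
my $BEGIN = '__BEGIN__';
my $END   = '__END__';

# Steps 1 and 2: split the text into sentences and words, then record,
# for each word, every word that follows it in the text.
sub learn {
    my ($chain, $text) = @_;
    my $count = 0;
    for my $sentence (split /(?<=[.!?])\s+/, $text) {
        my @words = split ' ', $sentence;
        next unless @words;
        my $token = $BEGIN;
        for my $word (@words, $END) {
            push @{ $chain->{$token} }, $word;
            $token = $word;
            $count++;
        }
    }
    return $count;
}

# Step 3: start at the 'beginning' token and keep picking a random
# follower until the 'end of sentence' word turns up.
sub babble {
    my ($chain) = @_;
    my @out;
    my $token = $BEGIN;
    while (@out < 100) {                  # safety cap, an assumption
        my $followers = $chain->{$token} or last;
        my $word = $followers->[ int rand @$followers ];
        last if $word eq $END;
        push @out, $word;
        $token = $word;
    }
    return join ' ', @out;
}

my %chain;
learn(\%chain, 'The cat sat. The dog sat down.');
print babble(\%chain), "\n";
```

Because a word that appears more often after a token is stored more often in that token's array, picking a random element reproduces the frequencies of the source text without any matrix arithmetic.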

Hopefully the current version of Chatbot::Babbler.pm is available on CPAN:

        http://www.perl.com/CPAN/modules/by-module/Chatbot/

INSTALLATION

To install this module, just copy it into the Chatbot subdir of your local site_perl directory (use perl -V to see where it is). I'll get a CPAN wrapper around it, I promise, but I'm developing it to be pure Perl and use only CORE modules, so it doesn't matter that much.


USAGE

Ideally you have a text file kicking around of a style you particularly like (I've been using alice30.txt, or "Alice's Adventures in Wonderland", which I took from Project Gutenberg. Long live PG!). You can analyze the text file when you create the object, or later (if you wish to include more than one text file).

        use Chatbot::Babbler;

        $yakker = new Chatbot::Babbler "alice30.txt";
        $yakker->analyze_file("alice31.txt");

You can also feed sentences into Babbler one at a time:

        $yakker->analyze_line(qq["Mr. Carroll!
                                  What a wonderful story!" exclaimed Alice]);

When you think the Babbler has read enough, you can ask it to attempt to speak in the same style:

        print $yakker->speak();

Frequency

The analyze(), analyze_fh(), analyze_file() and analyze_line() methods all take an optional second parameter that controls how recurring and mixed-case words are counted. By default, Babbler will attempt to use words with the same frequency as the text it was given, and will treat words of different lettercases as different words completely.

$yakker->analyze($x,0); is the default

$yakker->analyze($x,1); folds all words to lowercase

$yakker->analyze($x,2); ignore frequency, but be case sensitive

$yakker->analyze($x,3); ignore frequency, fold to lowercase

To make things easier, I've got a couple of constants you can import.

        use Chatbot::Babbler qw(IGNORE_FREQ IGNORE_CASE);

        $yakker->analyze($x,(IGNORE_FREQ|IGNORE_CASE));
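The four numeric modes behave like two bit flags OR'd together. The sketch below defines its own stand-in constants so that it runs without the module installed; the values 1 and 2 are inferred from the list of modes above, and describe_mode() is a made-up helper for illustration only.

```perl
use strict;
use warnings;

# Assumed values, inferred from the numeric modes listed above.
use constant IGNORE_CASE => 1;
use constant IGNORE_FREQ => 2;

# Hypothetical helper: spell out what a mode number asks for.
sub describe_mode {
    my ($mode) = @_;
    my $case = ($mode & IGNORE_CASE) ? 'fold to lowercase' : 'case sensitive';
    my $freq = ($mode & IGNORE_FREQ) ? 'ignore frequency'  : 'keep frequency';
    return "$freq, $case";
}

printf "%d: %s\n", $_, describe_mode($_) for 0 .. 3;
print "combined: ", (IGNORE_FREQ | IGNORE_CASE), "\n";
```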

Saving State

        $success = $yakker->save('filename.dat');
        $tokens_loaded = $yakker->load('filename.dat');

Use these in case you want to reuse Babbler's memory. Saving uses Data::Dumper right now, so you could examine the Babbler's memory pretty easily if you wanted. Saving and loading are included so you can use Babbler for analyzing an ongoing process (like, say, people chatting in an IRC channel).
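For the curious, here is roughly what a Data::Dumper round trip of a hash of arrays looks like. save_chain() and load_chain() are hypothetical helpers written for this example; the module's own save() and load() may well differ. Note that reloading evaluates the file as Perl code, so this approach only suits files you wrote yourself.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: write the hash of arrays as Perl source text.
sub save_chain {
    my ($chain, $file) = @_;
    local $Data::Dumper::Indent   = 1;    # compact but still readable
    local $Data::Dumper::Sortkeys = 1;    # same output on every run
    open my $fh, '>', $file or return 0;
    print {$fh} Data::Dumper->Dump([$chain], ['chain']);
    close $fh or return 0;
    return scalar keys %$chain;           # number of keys stored
}

# Hypothetical helper: read the file back and evaluate it.
sub load_chain {
    my ($file) = @_;
    open my $fh, '<', $file or return;
    my $code = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    my $chain;                            # the dumped text assigns to $chain
    eval $code;
    return $@ ? undef : $chain;
}

my %memory = (the => [ 'cat', 'dog' ], cat => ['sat']);
my $file   = 'babbler-demo.dat';
my $saved  = save_chain(\%memory, $file);
my $back   = load_chain($file);
print "saved $saved keys, reloaded ", scalar(keys %$back), "\n";
unlink $file;
```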


METHODS

new($token_size)

Creates a new Babbler, optionally loading data right away. token_size is how many words are used in a token (but words are added one at a time). Token size is 1 word by default, and if you increase it you will get results that mimic better but may be accordingly less interesting. (At a token size of 3, when fed the complete text of Alice's Adventures In Wonderland, it appeared to be quoting passages instead of making its own.)
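To see why a larger token size mimics more closely, it helps to list the (token, next word) pairs a single sentence produces. The helper token_pairs() and the '^' and '$' boundary markers below are assumptions made for this sketch and are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: list [token, next word] pairs for one sentence.
sub token_pairs {
    my ($size, @words) = @_;
    my @window = ('^') x $size;           # '^' stands for beginning of sentence
    my @pairs;
    for my $word (@words, '$') {          # '$' stands for end of sentence
        push @pairs, [ join(' ', @window), $word ];
        shift @window;                    # take out the token's first word
        push @window, $word;              # tack the new word on the end
    }
    return @pairs;
}

for my $size (1, 2) {
    print "token size $size:\n";
    printf "  %-12s -> %s\n", "[$_->[0]]", $_->[1]
        for token_pairs($size, qw(the cat sat down));
}
```

With longer tokens, fewer places in the text share the same key, so each array holds fewer choices and the output tends to follow the original wording.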

load('filename.dat')

Restores a Babbler memory state to this object. Returns the number of tokens loaded.

save('filename.dat')

Saves a Babbler's memory state. Can be restored later with load('filename.dat'). Returns the number of keys stored.

QUACK: Getting out of memory errors.

analyze($something)

My attempt at a magical analyze method; you can pass it a string, a filename, a file *GLOB or a FileHandle object and it will call the appropriate analyze function. Returns the number of tokens memorized.

I differentiate between filenames and strings by looking for spaces (which means no sentences of one word). I assume that any GLOB that is passed is a filehandle.
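A dispatch along those lines might look like the sketch below. classify_input() is a hypothetical stand-in written for illustration; the checks the module really performs may be different.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical classifier for the argument given to analyze().
sub classify_input {
    my ($thing) = @_;
    return 'filehandle' if ref(\$thing) eq 'GLOB';    # a bare glob: *STDIN
    return 'filehandle' if ref($thing) eq 'GLOB';     # \*STDIN, or open my $fh
    return 'filehandle'                               # FileHandle, IO::File
        if blessed($thing) && $thing->isa('IO::Handle');
    return 'string' if $thing =~ /\s/;                # has spaces: a sentence
    return 'filename';                                # no spaces: a file name
}

print classify_input('alice30.txt'), "\n";
print classify_input('Curiouser and curiouser!'), "\n";
print classify_input(\*STDIN), "\n";
```

The catch mentioned above shows up here too: a one-word sentence has no spaces, so it would be taken for a filename.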

analyze_file('filename.txt');

Reads the text from a text file and analyzes it, adding the results to the Babbler's memory. Returns the number of tokens memorized.

analyze_line('"Is that the way YOU manage?" Alice asked.');

Reads the text from the scalar (string) passed to it, and adds the analyzed results to the Babbler's memory. Returns the number of tokens memorized.

speak()

Returns a sentence from what the Babbler has learned.

err()

Returns the most recent error message if a problem occurred. Consult it if the analyze, load or save functions returned 0/false.

freq()

Returns the least restrictive frequency setting that's been used in analyses.


BUGS

There are many bugs, I'm sure. I suspect I could use memory more efficiently, and the inelegance of using Data::Dumper to record the state of the Babbler's memory is a perceived bug if not a real one. Actually, it's developed into a real bug because I get Out Of Memory errors on my MS-DOS box if I try to use Data::Dumper.

Quotation marks are clobbered to dodge the problem of making sure there are no orphaned quote marks in the sentences made by speak(). As a human, I find it pretty obvious where to put them back in, but sentience is funny that way.

Sentences generated by speak() may be grammatically incorrect. This happens most often when learning from text written in a casual style. I've been trying to add a new parameter in the constructor (object property $obj->{' tsize'}) to make tokens consist of more than one word, which should make for more accurate imitations at the expense of more resources consumed.

The case-insensitive frequency mode for analyzing text works by folding everything to lowercase, including ALL CAPS words and Proper Nouns -- this isn't so much a bug as a 'gotcha!' for people who forget about it and are then puzzled by lowercase-folded output.

When analyzing, the Babbler assumes the language's tokens are organized left-to-right in the text file. This makes Babbler less effective if you were to give it most of the languages east of the Mediterranean.


AUTHOR

Moses Moore <mozai@canada.com>, 20 Dec 1999

Thank you, John Nolan, for Chatbot::Eliza, and Joseph Weizenbaum for Eliza.

I should mention Robert S. (Bob) Fritzius <rsf1@ra.msstate.edu> for writing a character-based Babbler 18 Dec 1992, but I must admit I never heard of the software until after I wrote this module.

Another find, with a similar name, is written in QuickBASIC:

        http://www.cs.cmu.edu/afs/cs.cmu.edu/project/ai-repository/ai/areas/nlp/misc/babbler/