Oblomov
Hello all,
I'm building a Markov chain handler with a couple of twists over the usual naive handlers.

First of all, I consider both words and non-words (separators), so the string scanner uses the regexp /\w+|\W/u. This gives much better handling of punctuation and so on, but it has the downside that I need twice the order of a naive handler to get the same context (word-punctuation-word-punctuation instead of word-word).
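To show what I mean, here's how the scanner splits a (made-up) example sentence into word and separator tokens:

  "Hello, world! How are you?".scan(/\w+|\W/u)
  # => ["Hello", ",", " ", "world", "!", " ", "How", " ", "are", " ", "you", "?"]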
Secondly, it's bidirectional, so it can create a sentence by starting from the middle and adding both 'next' and 'previous' words. This obviously doubles the memory requirements, as we store two chains instead of one.
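To illustrate the idea, here's a toy order-1 sketch with plain hashes standing in for the real classes (not the actual prototype code):

  # Grow a sentence in both directions from a seed word; nil marks the
  # start/end of a sentence, so we stop growing when we draw it.
  nxt = { "cat" => ["sat"], "sat" => [nil] }
  prv = { "cat" => ["the"], "the" => [nil] }
  sentence = ["cat"]
  sentence.push(nxt[sentence.last].sample)     until sentence.last.nil?
  sentence.unshift(prv[sentence.first].sample) until sentence.first.nil?
  puts sentence.compact.join(" ")   # => "the cat sat"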
My prototype implementation is available at http://oblomov.dnsalias.org/git in the rbot-mark.git repository.
The structure is rather simple. I have a ChanceHash class to abstract the collection of possible next (or previous) words, with their corresponding chances, and a MarkovChainer class which abstracts the engine: the chains are stored in @mkv, a simple hash whose keys are arrays of symbols and whose values are hashes of the form {:rev => ChanceHash, :next => ChanceHash}. nil is used to mark the beginning/end of a sentence.
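To make the layout concrete, a hand-written order-2 entry would look roughly like this (plain hashes with counts standing in for ChanceHash, and ignoring the separator tokens to keep it short):

  # One @mkv entry: the key is the context, the values describe what can
  # follow or precede it; nil marks the start/end of a sentence.
  @mkv = {
    [:the, :cat] => {
      :next => { :sat => 3, :ran => 1 },   # words seen after "the cat"
      :rev  => { :saw => 2, nil  => 1 }    # words seen before "the cat"
    }
  }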
Although this works remarkably well, there are huge memory issues: learning my test text (210K+ words, with 20K+ unique words) is infeasible for orders > 4 (at order 4 it takes about half a gigabyte of memory).
Now, the final application (this is a plugin for the IRC bot rbot) will store the data in a BDB, which will enormously ease the memory load, but I was wondering whether it's possible to optimize memory usage in a way that would allow the code to be used 'live', without an on-disk backend, while still being able to go up to order 6 (at least).
Any suggestions?