T
Tom Anderson
Evening all,
I'd like to implement a move-to-front coder, and i'm wondering how to do
it - specifically, what data structure to use. This is not a java-specific
question, and i haven't googled yet, so i'm just chucking this out in case
anyone else finds it interesting.
For those who aren't familiar with move-to-front (from here, MTF) coding
(and unless you've delved into the implementation of bzip2 compression,
you probably aren't), it's a scheme for coding a sequence of symbols from
an alphabet as numbers such that higher-order entropy (unevenness in the
position of symbols in the sequence) is converted into first-order entropy
(unevenness in the frequency of symbols). As an example, this string over
the alphabet {A, B, C, D}:
ABABBACDDDCCCDDDCAABBCBA
Becomes this sequence of numbers:
011101230010010012030212
The first string contains six of each of the letters; the second contains
10 zeroes, 8 ones, 4 twos and 3 threes. If you encoded the former with
Huffman coding, you'd have to allocate each symbol two bits, and it would
take 48 bits. You could encode the latter with one bit for 0, two for 1,
and three each for 2 and 3, which would make 40 bits in total. An amazing
saving, i'm sure you'll all agree. It's even better with bigger alphabets!
The way it works is that you put the symbols of the alphabet into a sort
of stack, in some fixed but arbitrary order. Like this:
A
B
C
D
To encode a symbol, you (a) find the depth of the symbol in the stack and
write that out and then (b) move that symbol to the top of the stack. So
for instance, if you had a C to encode, you'd write out a 2, and then have
a stack like this:
C
A
B
D
And encoding the string above looks like this:
ABABBACDDDCCCDDDCAABBCBA <-- the string
AABABBACDDDCCCDDDCAABBCB <-- top
BBABAABACCCDDDCCCDCCAABC the (before)
CCCCCCCBAAAAAAAAAADDCCAA stack (movement)
DDDDDDDDBBBBBBBBBBBBDDDD <-- bottom
011101230010010012030212 <-- the code
I've described this as a stack with a top, but in the literature it's
something which has a front rather than a top, hence the name.
Anyway, that's the classic MTF coder. It's not the most exciting or
effective compression technique in the world, but it's what i need in my
app right now.
Although in my app, things are slightly more complicated. I don't have a
fixed alphabet: i have a sequence of arbitrary symbols. Initially, my set
of possible symbols is empty, and as symbols come through, i learn about
them and add them to the set. I can still apply MTF to this, though, just
with a growing stack. Like this (again with letters - and this time i
think making the string less compressible, but never mind):
AABAABABCABCBDBADDDBEFADFDEBAFCDDBEFDCDDFEBAA
AABAABABCABCBDBADDDBEFADFDEBAFCDDBEFDCDDFEBA
ABBABABCABCBDBAAADBEFADFDEBAFCCDBEFDCCDFEB
ABCAACCDBBBADBEFAAFDEBAFFCDBEFFFCDFE
AACCCCCADBEEEAFDEBAAFCDBEEEECDF
CADBBBBAFDEBBAFCCBBBBBCD
CCCCCCCCCDEEEAAAAAAAAAC
!0!10111!2221!132002!!44213444550454341023450
Where ! means that the symbol isn't currently in the stack.
So, how would you implement an MTF coder?
Oh, and as an added complication, the symbols in my app are arbitrary
objects. They could be literally anything - Strings, Maps, Dates,
Customers, Tomcat instances, etc. And for an object to match an entry in
the stack, they have to be identical (==) rather than just equal.
The obvious way is with a stack (implemented with a list), as described:
public class MoveToFrontCoder<T> {
private List<T> stack = new ArrayList<T>();
/** @return the code for the object, or -1 if it's new */
public int encode(T obj) {
stack.add(0, obj);
Iterator<T> it = stack.iterator();
it.next(); // discard the one we just added
int idx = 0;
while (it.hasNext()) {
if (it.next() == obj) {
it.remove();
return idx;
}
++idx;
}
return -1;
}
}
Note that i can't just use indexOf because that's defined on top of
equals, not ==. Hence the iterator.
That's simple and works, but its average-case behaviour is O(n). Of
course, the whole point of MTF coding is that the normal case isn't
average (IYSWIM), so you'd hope it would be nearer O(1) in practice.
But is there a way to do it that has a better worse-case (if not
worst-case) behaviour?
I could use a LinkedList in the above structure, and it might be faster,
but it's still O(n).
Could i use an IdentityHashMap from objects to Integers, holding the
position of that object in the stack? But then how do i do updates?
My gut tells me that i could use an IdentityHashMap in which the values
were links in a linked list, which i could then rearrange easily, but i
can't see how i could compute stack positions in O(1) that way. And my gut
mostly does perl, so its advice is not usually worth bothering with.
A priority queue of some sort? Like a heap? But finding the index of an
object in a heap is O(n). Or possibly more - i think it's equivalent to
sorting, so O(n log n).
An IdentityHashMap pointing to nodes in an enfilade? I don't actually
understand those well enough to be sure that they're applicable, but i
think they might be, and Ted Nelson invented them, so they're cool.
Right, time to sleep on it, i think.
tom
I'd like to implement a move-to-front coder, and i'm wondering how to do
it - specifically, what data structure to use. This is not a java-specific
question, and i haven't googled yet, so i'm just chucking this out in case
anyone else finds it interesting.
For those who aren't familiar with move-to-front (from here, MTF) coding
(and unless you've delved into the implementation of bzip2 compression,
you probably aren't), it's a scheme for coding a sequence of symbols from
an alphabet as numbers such that higher-order entropy (unevenness in the
position of symbols in the sequence) is converted into first-order entropy
(unevenness in the frequency of symbols). As an example, this string over
the alphabet {A, B, C, D}:
ABABBACDDDCCCDDDCAABBCBA
Becomes this sequence of numbers:
011101230010010012030212
The first string contains six of each of the letters; the second contains
10 zeroes, 8 ones, 4 twos and 3 threes. If you encoded the former with
Huffman coding, you'd have to allocate each symbol two bits, and it would
take 48 bits. You could encode the latter with one bit for 0, two for 1,
and three each for 2 and 3, which would make 40 bits in total. An amazing
saving, i'm sure you'll all agree. It's even better with bigger alphabets!
The way it works is that you put the symbols of the alphabet into a sort
of stack, in some fixed but arbitrary order. Like this:
A
B
C
D
To encode a symbol, you (a) find the depth of the symbol in the stack and
write that out and then (b) move that symbol to the top of the stack. So
for instance, if you had a C to encode, you'd write out a 2, and then have
a stack like this:
C
A
B
D
And encoding the string above looks like this:
ABABBACDDDCCCDDDCAABBCBA <-- the string
AABABBACDDDCCCDDDCAABBCB <-- top
BBABAABACCCDDDCCCDCCAABC the (before)
CCCCCCCBAAAAAAAAAADDCCAA stack (movement)
DDDDDDDDBBBBBBBBBBBBDDDD <-- bottom
011101230010010012030212 <-- the code
I've described this as a stack with a top, but in the literature it's
something which has a front rather than a top, hence the name.
Anyway, that's the classic MTF coder. It's not the most exciting or
effective compression technique in the world, but it's what i need in my
app right now.
Although in my app, things are slightly more complicated. I don't have a
fixed alphabet: i have a sequence of arbitrary symbols. Initially, my set
of possible symbols is empty, and as symbols come through, i learn about
them and add them to the set. I can still apply MTF to this, though, just
with a growing stack. Like this (again with letters - and this time i
think making the string less compressible, but never mind):
AABAABABCABCBDBADDDBEFADFDEBAFCDDBEFDCDDFEBAA
AABAABABCABCBDBADDDBEFADFDEBAFCDDBEFDCDDFEBA
ABBABABCABCBDBAAADBEFADFDEBAFCCDBEFDCCDFEB
ABCAACCDBBBADBEFAAFDEBAFFCDBEFFFCDFE
AACCCCCADBEEEAFDEBAAFCDBEEEECDF
CADBBBBAFDEBBAFCCBBBBBCD
CCCCCCCCCDEEEAAAAAAAAAC
!0!10111!2221!132002!!44213444550454341023450
Where ! means that the symbol isn't currently in the stack.
So, how would you implement an MTF coder?
Oh, and as an added complication, the symbols in my app are arbitrary
objects. They could be literally anything - Strings, Maps, Dates,
Customers, Tomcat instances, etc. And for an object to match an entry in
the stack, they have to be identical (==) rather than just equal.
The obvious way is with a stack (implemented with a list), as described:
public class MoveToFrontCoder<T> {
private List<T> stack = new ArrayList<T>();
/** @return the code for the object, or -1 if it's new */
public int encode(T obj) {
stack.add(0, obj);
Iterator<T> it = stack.iterator();
it.next(); // discard the one we just added
int idx = 0;
while (it.hasNext()) {
if (it.next() == obj) {
it.remove();
return idx;
}
++idx;
}
return -1;
}
}
Note that i can't just use indexOf because that's defined on top of
equals, not ==. Hence the iterator.
That's simple and works, but its average-case behaviour is O(n). Of
course, the whole point of MTF coding is that the normal case isn't
average (IYSWIM), so you'd hope it would be nearer O(1) in practice.
But is there a way to do it that has a better worse-case (if not
worst-case) behaviour?
I could use a LinkedList in the above structure, and it might be faster,
but it's still O(n).
Could i use an IdentityHashMap from objects to Integers, holding the
position of that object in the stack? But then how do i do updates?
My gut tells me that i could use an IdentityHashMap in which the values
were links in a linked list, which i could then rearrange easily, but i
can't see how i could compute stack positions in O(1) that way. And my gut
mostly does perl, so its advice is not usually worth bothering with.
A priority queue of some sort? Like a heap? But finding the index of an
object in a heap is O(n). Or possibly more - i think it's equivalent to
sorting, so O(n log n).
An IdentityHashMap pointing to nodes in an enfilade? I don't actually
understand those well enough to be sure that they're applicable, but i
think they might be, and Ted Nelson invented them, so they're cool.
Right, time to sleep on it, i think.
tom