There is no problem on deletion.
A deletion may force you to split nodes. For example, use the DAWG from
the wikipedia article which represents the words "tap", "taps", "top",
"tops":
digraph "g1" {
n0 -> n1 [ label="t" ];
n1 -> n2 [ label="a" ];
n1 -> n2 [ label="o" ];
n2 -> n3 [ label="p" ];
n3 -> n4 [ label="s" ];
n3 -> n5 [ label="EOW" ];
n4 -> n5 [ label="EOW" ];
}
[use the dotty program from the graphviz package to view the graph]
When you delete "tops" from this DAWG, nodes n2 and n3 can no longer be
used for "top" since they would also lead to "tops". So you have to
create new nodes for "top":
digraph "g2" {
n0 -> n1 [ label="t" ];
n1 -> n2 [ label="a" ];
n1 -> n6 [ label="o" ];
n2 -> n3 [ label="p" ];
n3 -> n4 [ label="s" ];
n3 -> n5 [ label="EOW" ];
n4 -> n5 [ label="EOW" ];
n6 -> n7 [ label="p" ];
n7 -> n5 [ label="EOW" ];
}
In this case I think this is the only possible result of the deletion,
but in more complex DAWGs there may be more than one possible result, so
you also have to find the optimal out of several possible results.
For insertion, one needs a
dictionary of suffixes - which is, essentially, given by parent
pointers, since they form a DAWG as well.
An insertion may add or remove nodes (or both). For example if we start
at ("tap", "top"):
digraph "g3" {
n0 -> n1 [ label="t" ];
n1 -> n2 [ label="a" ];
n1 -> n2 [ label="o" ];
n2 -> n3 [ label="p" ];
n3 -> n5 [ label="EOW" ];
}
and add "taps", we need to split n2 and n3 to get graph g2. And if we
then add "tops" we can again merge nodes to get to graph g1. Having a
suffix dictionary probably helps (I haven't tried to implement it), but
in any case it doesn't look quite straightforward to me, and a suffix
dictionary needs extra space and since the whole reason for a DAWG is to
save space you wouldn't want that (I'm not sure whether you can even
call a DAWG with parent pointers a DAWG).
"Significant" in which sense? For me, half an order of magnitude is
significant. 30% is not - in most situations Perl is used for.
My rule of thumb is that it's worth considering if it's an improvement
by a factor of at least two (i.e., it saves at least 50% of memory). But
of course that depends on the problem. If I have a 16 bit system a
reduction from 70k (too big) to 60k (small enough) may be worthwhile. On
my workstation with 4GB RAM I may not care about the possibility of
reducing memory consumption from 2GB to 0.5 GB.
I would expect it to be of second type for English words - but it is
just a guts feeling...
My gut feeling is that it's better. But I don't have any data for it so
I won't make any claims.
(And I was considered "DAWG vs *folded* tries"
- those where a node contains a substring, not a char.)
I found several references to "folded tries", but all of them were about
storing unicode strings and they stored only part of a character per
node. Anyway, whatever optimization you use, you can also use it for a
DAWG, so I don't see the difference.
DAWG may be MUCH worse than a folded trie - even if you fold a DAWG.
You didn't show why my assumption that DAWG will degenerate into a trie
in the worst case is false.
hp