cr88192
for various reasons, I added an imo ugly hack to my xml parser.
basically, I wanted the ability to have binary payloads within the xml parse
trees.
this was partly because I came up with a binary xml format (mentioned more
later), and thought it would be "useful" to be able to store binary data
inline with this format, and still wanted to keep things balanced (whatever
the binary version can do, the textual version can do as well).
the approach involved, well, a bastardized subset of xml-data.
the attribute 'dt:dt' now has a special meaning (along with the rest of the
'dt' namespace prefix), and the contents of such nodes are parsed specially
(though still within xml's syntactic rules, eg, as a normal xml text glob).
(an alternative would have been to leave the trees and textual parser as-is
and handle this only in the binary-xml reader/writer, but I opted to hack the
text version since, otherwise, there is no real point to this feature
anyways...).
<foo dt:dt="binary.base64">
adz6A7dG9TaB41H7D6G5KSt3
</foo>
I wonder if anyone has suggestions for a better approach?
(the risk here being that one might want to use the 'dt' prefix for
something else, and this is an actual parser-hack rather than a more generic
semantics hack...).
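fwiw, the decode side of the hack is basically just running a base64 decoder
over the text glob whenever 'dt:dt' says 'binary.base64'. a minimal sketch in
c (generic code, names made up for illustration; not the actual parser code):

```c
#include <stddef.h>

/* map a base64 character to its 6-bit value, or -1 for
   whitespace, '=', and anything else to be skipped */
static int b64val(int c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* decode base64 text into dst, returning the byte count;
   skipping non-alphabet chars matters here, since the xml
   text glob keeps the surrounding newlines/indentation */
size_t Base64_Decode(const char *src, unsigned char *dst)
{
    unsigned long acc = 0;
    int nbits = 0;
    size_t n = 0;

    for (; *src; src++) {
        int v = b64val(*src);
        if (v < 0) continue;
        acc = (acc << 6) | v;
        nbits += 6;
        if (nbits >= 8) {
            nbits -= 8;
            dst[n++] = (unsigned char)(acc >> nbits);
        }
    }
    return n;
}
```

the encoder side is the same thing in reverse, plus wrapping the output in
the `<foo dt:dt="binary.base64">...</foo>` node.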
other comments:
for the most part, my xml stuff is being used for "offline" uses, though
avoiding my stuff blowing up when dealing with generic xml would be
preferable (this happens occasionally, making me go and patch up whatever in
the parser/other code).
no schemas or similar are used, really, anywhere. pretty much everything
tends to be hard-coded c code, which is expected to deal with whatever.
likewise, many of the formats are ad-hoc and can change randomly (most
things use an "ignore what is not understood" policy).
my parser only really implements a subset of xml anyways (dtd's, external
entities, .... are ignored; only the built-in entities are handled). it does,
however, have basic namespace support, handles things like CDATA, skips
comments, ....
about the binary format (for anyone that cares):
well, I presently have no real intent on trying to push it as any form of
standard or trying to compete with anything.
everyone can do as they will imo; I just want to use the format for things
for which the textual variety is not that well suited, eg, being intermixed
with other binary data (as a particular example, storing the object
files+bytecode for a newer interpreter of mine; other possible uses may
include persistent data stores and similar, or storage of geometric/3d data,
for which I have typically used other formats, ...).
for some of these things, xml is used as an internal representation,
sometimes flattening to some other external representation (eg:
line-oriented text or a binary format).
it is, well, significantly faster than my textual parser, largely because of
the dramatic reduction in memory allocation; this is partly because, as a
matter of the format's operation, most strings are merged. likewise, the
output is a bit smaller (around 20-30% of the original size in my testing),
which is a bit worse than what I can get from "real" compressors, but this
is no big loss.
also, the code is smaller (though, after adding features like integers,
lz+markov text compression, cdata, ... it is around 800 loc, vs. about 1700
for the text parser). initially it was about 300 loc, but features cost some.
it uses a byte-based structure vaguely similar to wbxml.
note, however, that most of the structure/codespace is built dynamically,
and the codespace is divided in a uniform manner: most of it goes to the text
mru (127 values), the tag mru (63 values), and the namespace mru (31 values);
the remaining values encode literals (to be added to the mru's, located at
the end of each mru's range) or special predefined codes (the low 32 values).
attribute names also have an mru, but its codespace overlaps the one for tags
(since a tag and an attribute don't occur in the same context).
also, at present, tags include no means to omit the attribute or content end
markers, as basic testing (using statistical methods to eliminate the
markers) did not show "good enough" results to justify having it (nor do I
have a good place to put the flag bits, eg, without somewhat reducing the mru
size).
also, only ascii/utf-8 is supported.
no string tables are used; everything is inline.
for the most part, it uses linear mru caching of strings, and it also does
not store lengths for most things (the exception being binary chunks). as a
result, random access is not possible in its present form.
the lz+markov encoding used for text is aimed more at speed than at decent
compression (lz77 would likely do better in terms of compression, but would
lead to slower encoding). however, hex dumps show that it does "good enough"
at hacking away at the text strings... a single window is shared between all
the strings (eg, for the purpose of eliminating common strings).
(compression could likely be improved, eg, by use of huffman coding, but
this is neither cheap nor simple, and thus likely not worth it).
probably good enough for my uses anyways...
0x00: general purpose ending marker
0x01..0x1F: Special, Reserved
0x20..0x3E: Namespace Prefix MRU
0x3F: Namespace String
0x40..0x7E: Opening Tag/Attr MRU
0x7F: Opening Tag/Attr String
0x80..0xFE: Text MRU
0xFF: Text String
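decoding-wise, the table above means the reader can classify any code byte
with a few range checks, eg (a trivial sketch; the enum names are made up):

```c
/* classify a code byte per the codespace table above */
typedef enum {
    C_END,      /* 0x00: general purpose ending marker   */
    C_SPECIAL,  /* 0x01..0x1F: special, reserved         */
    C_NS_MRU,   /* 0x20..0x3E: namespace prefix mru      */
    C_NS_STR,   /* 0x3F: namespace literal string        */
    C_TAG_MRU,  /* 0x40..0x7E: opening tag/attr mru      */
    C_TAG_STR,  /* 0x7F: tag/attr literal string         */
    C_TEXT_MRU, /* 0x80..0xFE: text mru                  */
    C_TEXT_STR  /* 0xFF: text literal string             */
} CodeKind;

CodeKind Code_Kind(int b)
{
    if (b == 0x00) return C_END;
    if (b <= 0x1F) return C_SPECIAL;
    if (b <= 0x3E) return C_NS_MRU;
    if (b == 0x3F) return C_NS_STR;
    if (b <= 0x7E) return C_TAG_MRU;
    if (b == 0x7F) return C_TAG_STR;
    if (b <= 0xFE) return C_TEXT_MRU;
    return C_TEXT_STR;
}
```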
Node: [<NS>] <TAG> <ATTR*> 0 <BODY*> 0
Attr: [<NS>] <TAG> <TEXT*>
Body: <NODE>|<TEXT>
0x10, Integer (VLI)
0x11, LZ+Markov Text String
0x12, CDATA:
0x12 <TEXT*> 0
0x13, Binary Data:
0x13 [<NS>] <TAG> <ATTR*> 0 <UVLI len> <BYTES[len]>
UVLI: basically the same as the variable-byte ints in WBXML;
VLI: UVLI, but with the sign put in the LSB.
(skipping on the rest...).
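for reference, the UVLI/VLI codings could look about like this in c. the
UVLI part follows WBXML's mb_u_int32 (big-endian, 7 bits per byte,
continuation in the high bit); the exact sign-in-LSB mapping is a guess
here, assuming a zigzag-style scheme (0, -1, 1, -2, ...):

```c
#include <stdint.h>

/* encode a wbxml-style multibyte uint: 7 bits per byte, most
   significant group first, high bit set on all but the final
   byte. returns the number of bytes written. */
int Uvli_Encode(unsigned char *out, uint32_t v)
{
    unsigned char tmp[5];
    int i = 0, n = 0;

    do {
        tmp[i++] = v & 0x7F;
        v >>= 7;
    } while (v);
    while (i > 1)
        out[n++] = tmp[--i] | 0x80;
    out[n++] = tmp[0];
    return n;
}

/* decode; returns the number of bytes consumed */
int Uvli_Decode(const unsigned char *in, uint32_t *rv)
{
    uint32_t v = 0;
    int n = 0;

    while (in[n] & 0x80)
        v = (v << 7) | (in[n++] & 0x7F);
    v = (v << 7) | in[n++];
    *rv = v;
    return n;
}

/* signed variant with the sign in the LSB (zigzag-style mapping
   assumed): 0->0, -1->1, 1->2, -2->3, ... so small magnitudes of
   either sign still fit in one byte */
int Vli_Encode(unsigned char *out, int32_t v)
{
    uint32_t u = (v < 0) ? (((uint32_t)(-(v + 1)) << 1) | 1)
                         : ((uint32_t)v << 1);
    return Uvli_Encode(out, u);
}

int Vli_Decode(const unsigned char *in, int32_t *rv)
{
    uint32_t u;
    int n = Uvli_Decode(in, &u);
    *rv = (u & 1) ? (-(int32_t)(u >> 1) - 1) : (int32_t)(u >> 1);
    return n;
}
```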