XML Not good for Big Files (vs Flat Files)

Chris Uppal · Apr 7, 2006

James McGill wrote:

[me:]

And I thought /I/ was strange !

[...]
Now somebody is going to come out of the woodwork claiming that yacc is
fun.

Yacc /is/ fun.

(I said I was strange ;-)

-- chris

Oliver Wong · Apr 7, 2006

Stefan Ram said:
NB: If "id" was declared as an »ID attribute« in the DTD, then

might not be valid XML, because in XML »ID values must
uniquely identify the elements which bear them« is a validity
constraint. But here, »id« might be declared as an »IDREF
attribute«.

Right, sorry.

... and some of these choices then will be restricted by the
restrictions of XML. For example, when one wants to put
emphasis on the roses by mapping each rose to an XML element,
some of the restrictions mentioned in my previous post apply.

You could "declare" a rose "x", and then start describing it, e.g.

<rose id="x"/>
<roseOwnership idref="x" owner="Jack"/>
<roseOwnership idref="x" owner="Jill"/>
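As a rough illustration of how a reader might resolve those references, here is a minimal Python sketch. The `<garden>` wrapper element and the function name `owners_by_rose` are assumptions added only so the fragment parses as one document; nothing here performs DTD validation, it just follows the `idref` values by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element <garden>; the three inner lines are from the post.
DOC = """<garden>
  <rose id="x"/>
  <roseOwnership idref="x" owner="Jack"/>
  <roseOwnership idref="x" owner="Jill"/>
</garden>"""


def owners_by_rose(xml_text):
    """Map each rose id to the owners named by roseOwnership elements."""
    root = ET.fromstring(xml_text)
    roses = {rose.get("id") for rose in root.iter("rose")}
    owners = {rose_id: [] for rose_id in roses}
    for link in root.iter("roseOwnership"):
        target = link.get("idref")
        if target not in roses:
            # A validating parser would reject a dangling IDREF; we do it by hand.
            raise ValueError(f"idref {target!r} does not match any rose")
        owners[target].append(link.get("owner"))
    return owners


print(owners_by_rose(DOC))
```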

You seem not to like having information implied via parent-child
relationship, but I didn't quite understand why. I suspect the
rose-emphasized XML would more likely traditionally be written as something
like

<rose>
<owners>
<person idref="Jack"/>
<person idref="Jill"/>
</owners>

</rose>

- Oliver

Oliver Wong · Apr 7, 2006

Chris Uppal said:
Oliver said:

Is this even possible? Wouldn't the escaping mechanism depend on what
the punctuations of the file format are?

I don't see why not. There are several broad categories of encoding[*]
techniques.

([*] don't take the word "encoding" to imply that the format is not
normally readable.)

One simply requires that the text format is self-delimiting and that /any/
text should be interpreted according to the rules of the encoding. So the
syntax of the context is irrelevant. E.g. a length prefix, or a strong
quoting convention like the 'xyz' strings in Unix Bourne shell and its
derivatives.
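A length prefix is probably the simplest of these to sketch. The following Python is only an illustration under assumed conventions (a decimal byte count, a colon, then the raw bytes); the names `encode_field` and `decode_fields` are made up for the example.

```python
def encode_field(text):
    """Prefix the UTF-8 bytes of text with their length and a colon."""
    data = text.encode("utf-8")
    return str(len(data)).encode("ascii") + b":" + data


def decode_fields(blob):
    """Split a concatenation of length-prefixed fields back into strings."""
    fields = []
    pos = 0
    while pos < len(blob):
        colon = blob.index(b":", pos)      # end of the length prefix
        length = int(blob[pos:colon])
        start = colon + 1
        fields.append(blob[start:start + length].decode("utf-8"))
        pos = start + length               # payload is skipped, never scanned
    return fields


values = ["a,b", 'say "hi"', "back\\slash\n"]
blob = b"".join(encode_field(v) for v in values)
print(decode_fields(blob) == values)
```

Because the reader never inspects the payload, commas, quotes, backslashes and newlines need no escaping at all.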

Another possibility is similar, but the encoding is parameterised. For
instance a C-like escape mechanism could be parameterised on
the Start character (defaults to ")
the End character (defaults to Start)
the Escape character (defaults to \)
the range of characters that need to be escaped (defaults to End and
Escape itself).
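The four parameters listed above could be sketched roughly as follows in Python. This is an assumption-laden illustration, not a proposed standard: `make_codec` and its keyword names are invented, and Start, End and Escape are each assumed to be a single character.

```python
def make_codec(start='"', end=None, escape="\\", extra=""):
    """Build (encode, decode) for a C-like escape with configurable punctuation."""
    end = start if end is None else end          # End defaults to Start
    must_escape = set(end + escape + extra)      # defaults to End and Escape

    def encode(text):
        body = "".join(escape + ch if ch in must_escape else ch for ch in text)
        return start + body + end

    def decode(quoted):
        if (len(quoted) < len(start) + len(end)
                or not quoted.startswith(start) or not quoted.endswith(end)):
            raise ValueError("missing start or end delimiter")
        body = quoted[len(start):len(quoted) - len(end)]
        out = []
        chars = iter(body)
        for ch in chars:
            if ch == escape:
                ch = next(chars, None)           # take the escaped character literally
                if ch is None:
                    raise ValueError("dangling escape character")
            out.append(ch)
        return "".join(out)

    return encode, decode


encode, decode = make_codec()
print(encode('say "hi"'))
pipe_encode, pipe_decode = make_codec(start="|", escape="^")
print(pipe_encode("a|b^c"))
```

The same two functions serve a format with different punctuation simply by passing different arguments, which is the point of the parameterisation.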

Another set of possibilities are like URL-encoding or the numerical
character entities in XML/HTML (I may have the name wrong; I mean things
like &#2345; but not &amp;). In this case the mechanism is necessarily
parameterised on the surrounding format, since that determines what /has/
to be escaped.
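A small Python sketch of that last family, using decimal numeric character references. The function name `ncr_escape` and the `reserved` parameter are assumptions for illustration; the surrounding format supplies the reserved set, and `&` is always added because it introduces a reference. Decoding relies on the standard library's `html.unescape`.

```python
import html


def ncr_escape(text, reserved):
    """Replace each reserved character (and '&') with a numeric reference."""
    reserved = set(reserved) | {"&"}
    return "".join(f"&#{ord(ch)};" if ch in reserved else ch for ch in text)


# Reserved set chosen by an XML-like host format.
xml_like = ncr_escape('a<b & "c"', '<>"')
print(xml_like)
print(html.unescape(xml_like))

# The same mechanism, parameterised for a comma-separated host format.
print(ncr_escape("x,y", ","))
```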

And so on. My point is that it /could/ have been done (a "best practise"
RFC perhaps). Sad that it was not...

When you said "an escape mechanism which could be used in any file
format"[*], I figured that for any escape mechanism you could come up
with, I could devise a file format in which it would not work. For example,
if you use something like C's string escape mechanism, I could define a
format where the open and close string punctuation was the '\' character,
and the '"' character denotes that the rest of the line is a comment. So the
string "Hello World" with a comment after it would be written:

\Hello World\"This is a string literal.

Similar tricks can be used for '&' and ';' for XML.

As for "parameterised encoding", to me this is only marginally better
than each file format having its own conventions for escaping.

- Oliver

[*] I see now you actually said "*almost* any file format"

Oliver Wong · Apr 7, 2006

Chris Uppal said:
Or even CSV without headers but with an XML description of the columns
(and
applicable quoting conventions ;-).

I wrote an RPG engine which uses tile based maps. The map itself is, of
course, tabular data (it's just a 2D array of integers, the integers acting
as indexes into a tile-type palette, e.g. 0 = transparent area, 1 = sand, 2
= grass, etc.)

I use XML as the file format for the game data. To represent the map, I
basically embed a CSV document in the XML. I don't remember the file format
off the top of my head, but it's something like:

<map version="1">
  <background>

  </background>
  <tilePalette>
    <tile id="1" pic="sand.png" walkable="true" damage="0" />
    <tile id="2" pic="grass.png" walkable="true" damage="0" />
    <tile id="3" pic="water.png" walkable="false" damage="0" />
    <tile id="4" pic="poison-swamp.png" walkable="true" damage="5" />

  </tilePalette>
  <tileData width="50" height="30">1,3,1,3,5,7,2,4,5,6,3,2,4,7,8,4,3</tileData>
  <objects>

  </objects>
</map>
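For what it's worth, reading a file like that back is short. The Python below is a sketch under assumptions: the element and attribute names follow the half-remembered format above, `load_map` is a made-up name, and the sample is shrunk to a 4x2 map so the cell count actually matches width times height.

```python
import xml.etree.ElementTree as ET

# Shrunken sample in the same shape as the format above (4 x 2 instead of 50 x 30).
SAMPLE = """<map version="1">
  <tilePalette>
    <tile id="1" pic="sand.png" walkable="true" damage="0" />
    <tile id="2" pic="grass.png" walkable="true" damage="0" />
    <tile id="3" pic="water.png" walkable="false" damage="0" />
  </tilePalette>
  <tileData width="4" height="2">1,3,1,3,2,2,3,1</tileData>
</map>"""


def load_map(xml_text):
    """Return (palette, grid): tile attributes by id, and rows of tile ids."""
    root = ET.fromstring(xml_text)
    palette = {int(tile.get("id")): dict(tile.attrib) for tile in root.iter("tile")}
    data = root.find("tileData")
    width, height = int(data.get("width")), int(data.get("height"))
    cells = [int(value) for value in data.text.split(",")]   # the embedded CSV
    if len(cells) != width * height:
        raise ValueError(f"expected {width * height} cells, got {len(cells)}")
    grid = [cells[row * width:(row + 1) * width] for row in range(height)]
    return palette, grid


palette, grid = load_map(SAMPLE)
print(grid)
print(palette[3]["walkable"])
```

The XML parser handles the structured parts and a one-line split handles the tabular part, so neither format has to do the other's job.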

- Oliver

bugbear · Apr 7, 2006

Homer said:
That's great. Put tons of repeating tags inside the file and make it
huge and now everybody is saying how to make it small with
Gzip/Binary,...

Yep. Best of both worlds. The parser sees the nice
XML and the comms sees a small file. Your objection
to this is... ?

Third field (between delimiters; whatever it is) is phone number. Any
file has File Spec Document (unless you XML lovers have replaced it with
some XML equivalent).

When the sender and receiver are agreed on format there is no need to
repeat labels. Like what you write on a postal envelope. Or you told your
wife your name is John 20 years ago. No need to wear a name tag just in
case you change your name (if you change your name tell her one more
time; sending File Spec Doc to receiver)

And make sure you change the software at both ends in sync.
Always an interesting exercise when there are multiple
installed copies of the generator and receiver software.

File formats which are semi self-documenting are always superior
w.r.t backwards compatible upgrades.

It's easy to add a new field to an XML format without
breaking existing software.
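A minimal Python sketch of why that works, assuming a made-up `<contact>` record with `name` and `phone` fields and a later version that adds `email`. The old reader looks fields up by name, so the extra element is simply never visited.

```python
import xml.etree.ElementTree as ET

# Hypothetical record formats: V2 adds an <email> field in the middle.
V1 = "<contact><name>John</name><phone>555-1234</phone></contact>"
V2 = ("<contact><name>John</name><email>john@example.org</email>"
      "<phone>555-1234</phone></contact>")


def read_contact(xml_text):
    """An 'old' reader that only knows about name and phone."""
    root = ET.fromstring(xml_text)
    return {"name": root.findtext("name"), "phone": root.findtext("phone")}


print(read_contact(V1))
print(read_contact(V2))   # same result; the unknown <email> is ignored
```

A positional flat-file reader given the same change would now read the e-mail address as the phone number.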

Other formats have had this virtue, but it's always
at the cost of redundancy.
For example, it's quite easy to add a field to TIFF,
since each field carries its own size; you might not
know what a field means, but you know enough to skip
over it.
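The skip-what-you-don't-know idea can be sketched in Python with a simplified tag-length-value layout. This is not TIFF's real on-disk structure; the 2-byte tag, 4-byte length header and the names `pack_field` and `read_known` are assumptions for illustration only.

```python
import struct

HEADER = struct.Struct(">HI")   # assumed layout: 2-byte tag, 4-byte payload length


def pack_field(tag, payload):
    """Serialise one field as header plus raw payload bytes."""
    return HEADER.pack(tag, len(payload)) + payload


def read_known(blob, known_tags):
    """Collect fields whose tag we understand; step over everything else."""
    found = {}
    pos = 0
    while pos < len(blob):
        tag, length = HEADER.unpack_from(blob, pos)
        pos += HEADER.size
        if tag in known_tags:
            found[tag] = blob[pos:pos + length]
        pos += length            # the size alone is enough to skip the field
    return found


blob = (pack_field(1, b"Jack")
        + pack_field(99, b"field added by newer software")
        + pack_field(2, b"555-1234"))
print(read_known(blob, {1, 2}))
```

The six header bytes per field are the redundancy being paid for.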

Minimal, non redundant formats have costs and problems all their own.

I've been a programmer long enough to have faced the
field issues of upgrading software using "compact,
non redundant formats", and I didn't enjoy it.

Try writing a simple parser for QuickDraw sometime.

BugBear

Roedy Green · Apr 7, 2006

. I know that ASN.1
(for example) offers some very formal grammars that happen to be
accepted as industry standards; but I am quite certain that it's
anything but a pleasant framework to design with.

The claim is you don't have to. You can use an XML schema.

Roedy Green · Apr 7, 2006

Yep. Best of both worlds. The parser sees the nice
XML and the comms sees a small file. Your objection
to this is... ?

I have four objections:

1. The compression method is like putting a fat woman in a girdle. You
can do better if you start with someone without rolls of fat.

2. The file requires fat code to parse. Not suited for handhelds.

3. The file is slow to parse and create.

4. When you are done there is still no guarantee the file matches the
schema.

James McGill · Apr 7, 2006

2. The file requires fat code to parse. Not suited for handhelds.

Whoa there, Roedy. You've gone from an argument about data freight for
30 million records, to optimization for mobile apps.

James McGill · Apr 7, 2006

The claim is you don't have to. You can use an XML schema.

I guess the question is, why would you then add another layer of
complexity, if you've already got an XSD that models your data to your
satisfaction? I realize that if I was working for you, you would insist
on a tightly packed, formalized wire format. That's cool. I've had to
do similar things to map between an XML representation of DNS data, and
the IETF wire format for the records. I don't think an ASN model would
be any weirder than that.

Steve Wampler · Apr 7, 2006

Oliver said:
When you said "an escape mechanism which could be used in any file
format"[*], I figured that for any escape mechanism you could come
up with, I could devise a file format in which it would not work. For
example, if you use something like C's string escape mechanism, I could
define a format where the open and close string punctuation was the '\'
character, and the '"' character denotes that the rest of the line is a
comment. So the string "Hello World" with a comment after it would be
written:

\Hello World\"This is a string literal.

Encodes as:

\\Hello World\\\"This is a string literal

Stefan Ram · Apr 7, 2006

Oliver Wong said:
You seem not to like having information implied via
parent-child relationship, but I didn't quite understand why.

I have no problem with the parent-child relationship, but with
the (ab)use of the /type/ of the child to name the /relation/
to its parent (instead of the type of the child as the
designation »type« implies). Using the /type/ to name the
/relation/ contradicts its designation »type«.

I suspect the rose-emphasized XML would more likely
traditionally be written as something like
<rose>
<owners>
<person idref="Jack"/>
<person idref="Jill"/>
</owners>

</rose>

Possibly I can clarify my intentions by using another
language with structured attributes. In my language »Unotal«
one can write:

< &rose owner=< &person Jack > owner=< &person Jill >>

Here, »owner« can be recognized as the name of a /binary/
relation by the following »=«, while »rose« can be recognized
as the name of a /unary/ relation (like a type) by the
preceding »&«. In Unotal, this is always so, so it is
easier to read.

In XML, element types are sometimes used for /unary/ relations
(sometimes for real types, as the name implies), but sometimes
(ab)used for /binary/ relations (to specify the parent-child
relationship). So when reading a child element type in XML,
one does not know, whether it gives the type of this element
or names the relationship to its parent.

~~~

I am working on a implementation of a reader and writer for
Unotal in Java, and have a small application that uses this to
implement Unotal as its file storage format in Java:

http://www.purl.org/stefan_ram/pub/joodo

The Java source code for the Unotal implementation will be
released later, but a description of Unotal is available at:

http://www.purl.org/stefan_ram/pub/unotal_en

This page also contains the Unotal syntax specification, which
is written in Unotal itself and then was automatically
translated to HTML and ASCII from there.

Oliver Wong · Apr 7, 2006

Steve Wampler said:
Oliver said:

When you said "an escape mechanism which could be used in any file
format"[*], I figured that for any escape mechanism you could come
up with, I could devise a file format in which it would not work. For
example, if you use something like C's string escape mechanism, I could
define a format where the open and close string punctuation was the '\'
character, and the '"' character denotes that the rest of the line is a
comment. So the string "Hello World" with a comment after it would be
written:

\Hello World\"This is a string literal.

Encodes as:

\\Hello World\\\"This is a string literal

When I try to decode that using my file format, I see an empty string
(i.e. \\), followed by illegal characters (i.e. Hello World), followed by
an empty string (\\), followed by an unterminated string (i.e. \).

I think you're thinking that escaping applies "above" my file format.
Usually, escaping occurs "within" a file format. E.g. the escaping mechanism
for C strings only applies within C strings, and not outside of C strings.
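To make that reading concrete, here is a toy Python tokenizer for the hypothetical format: `\` opens and closes a string, `"` starts a comment that runs to the end of the line, and (an assumption of this sketch) anything else outside a string is illegal. The name `tokenize` is made up.

```python
def tokenize(line):
    """Split one line of the hypothetical format into (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(line):
        ch = line[pos]
        if ch == "\\":                                  # string delimiter
            close = line.find("\\", pos + 1)
            if close == -1:
                tokens.append(("unterminated string", line[pos + 1:]))
                break
            tokens.append(("string", line[pos + 1:close]))
            pos = close + 1
        elif ch == '"':                                 # comment to end of line
            tokens.append(("comment", line[pos + 1:]))
            break
        else:                                           # text outside any string
            end = pos
            while end < len(line) and line[end] not in '\\"':
                end += 1
            tokens.append(("illegal", line[pos:end]))
            pos = end
    return tokens


original = '\\Hello World\\"This is a string literal.'
escaped = '\\\\Hello World\\\\\\"This is a string literal'
print(tokenize(original))
print(tokenize(escaped))
```

The first line comes out as one string and one comment; the C-escaped version comes out as an empty string, illegal text, another empty string, and an unterminated string.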

- Oliver

Oliver Wong · Apr 7, 2006

Roedy Green said:
I have four objections:

1. The compression method is like putting a fat woman in a girdle. You
can do better if you start with someone without rolls of fat.

Hmm, I'd say it's more like having two tools, each one doing what it
does very well (XML to represent tree-structured data, gzip to compress
arbitrary data), rather than one tool that does both semi-well (ASN.1 to
represent the tree-structured data, and to compress it). You can also swap
tools in and out (you can't read gzip? How about 7zip? rar? Starting from
XML, I can compress to any format you want).
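A quick Python sketch of the "swap tools in and out" point. The record layout and `build_xml` are invented for the example, and the compressors are the ones in Python's standard library (gzip, bz2, lzma) rather than 7zip or rar. No sizes are quoted here because they depend on the data; run it to see.

```python
import bz2
import gzip
import lzma


def build_xml(count):
    """Generate a repetitive, tag-heavy XML document as UTF-8 bytes."""
    rows = "".join(
        f"<record><name>Person {i}</name><phone>555-{i:04d}</phone></record>"
        for i in range(count)
    )
    return f"<records>{rows}</records>".encode("utf-8")


xml_bytes = build_xml(1000)

# The XML layer is untouched; only the compression layer changes.
for name, codec in (("gzip", gzip), ("bz2", bz2), ("lzma", lzma)):
    packed = codec.compress(xml_bytes)
    assert codec.decompress(packed) == xml_bytes   # lossless round trip
    print(f"{name}: {len(xml_bytes)} bytes -> {len(packed)} bytes")
```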

2. The file requires fat code to parse. Not suited for handhelds.

3. The file is slow to parse and create.

Developer time might be more valuable than CPU-time though. If the
claims are true that ASN.1 is a pain to work with, it might be better to
work with XML and take a slight performance hit but get the code
working and bug free sooner. As ASN.1 matures and more tools and APIs are
available for it, this advantage XML has will weaken.

4. When you are done there is still no guarantee the file matches the
schema.

When you're dealing with malformed files, there's no guarantee with
ASN.1 either. I could take a perfectly valid 50 megabyte ASN.1 file, flip
the last bit[*], and then send it to you, and you'd probably have to decode
the whole file before you find out that it's corrupted. You've complained
that too many people muck around with XML files using text editors, thus causing
problems. Well, nothing stops people from mucking around with ASN.1 files
using hex editors, or hell, even plain ASCII text editors if they wanted to.
The argument for using specialized validating ASN.1 editors can just as well
be made for specialized validating XML editors.

- Oliver

[*] Change "last bit" with some other bit if it turns out flipping the
last bit in an ASN.1 file doesn't actually corrupt it.

Steve Wampler · Apr 7, 2006

Oliver said:
When I try to decode that using my file format, I see an empty string
(i.e. \\), followed by illegal characters (i.e. Hello World), followed
by an empty string (\\), followed by an unterminated string (i.e. \).

I think you're thinking that escaping applies "above" my file format.
Usually, escaping occurs "within" a file format. E.g. the escaping
mechanism for C strings only applies within C strings, and not outside of C
strings.

Oh, sorry, I was assuming the discussion was about escaping as part of
the file format (i.e. independent of the content), not as something
that is content-specific. I wasn't trying to show C-string escaping,
but rather how "something like C's string escape mechanism" might apply as
part of the file format.

Roedy Green · Apr 7, 2006

Developer time might be more valuable than CPU-time though. If the
claims are true that ASN.1 is a pain to work with, it might be better to
work with XML and take a slight performance hit but get the code
working and bug free sooner. As ASN.1 matures and more tools and APIs are
available for it, this advantage XML has will weaken.

If you can devise tools so that ASN.1 is transparent, it then becomes
even easier to deal with than compression, which is not. And you get
faster, simpler parsing and guaranteed validation.

James McGill · Apr 7, 2006

Developer time might be more valuable than CPU-time though.

An overview of XML, sufficient to start doing production work, can be
learned in a day or so. This is especially true if you use a binding
and code generation framework (i.e., Castor).

An equivalent sufficient understanding of ASN.1 is closer to a full
semester, graduate level university course.

These two frameworks are entirely different, serve different purposes,
and it's really not appropriate to equate them.

I understand Roedy's point of view (and the thread has inspired me to
try to fill in this gap in my knowledge), but he keeps asserting that
because you can automate a 1:1 mapping between XSD and ASN, there should
be no reason not to use ASN. The premise is valid, but the argument
isn't persuading anyone.

It's one thing to base your interfaces on ASN, and expect people to use
it. It's something else to expect people to abandon XML just because
you assert ASN is better, claiming it's easy to work with or simple to
understand. It's not. Which is pretty much the reason that amateurs
and professionals alike are using XML in every context under the sun,
because they want to, whereas people are using ASN because they have
to.

Just because I've got mustard in my refrigerator doesn't mean I wouldn't
make my own mayonnaise if I had the time and the weather was right and I
had Belgian house guests I needed to impress.

Roedy Green · Apr 7, 2006

It's one thing to base your interfaces on ASN, and expect people to use
it. It's something else to expect people to abandon XML just because
you assert ASN is better, claiming it's easy to work with or simple to
understand.

But that is not what I am suggesting. I am suggesting using XSD
schemas and using a high level programming interface that squirts out
ASN.1 or XML alternatively and parses it directly. If this is done
correctly, you as programmer would notice almost no change, other than
1. guaranteed schema conformance
2. compact files
3. faster parsing.

I am willing to go with something other than ASN.1 but which gains the
advantages of
1. compactness
2. rapid parsing
3. small footprint decoder.
4. binary format that people won't screw with.

Zip compression does not meet those criteria other than improved
compactness.

I have not verified the ASN.1 website claims, but it seems to me that if
XSD and ASN.1 schemas are interchangeable, you should be
able to create a high level tool that creates or consumes either format.

Granted the ASN.1 format is much more complicated to HUMANS, but it
is orders of magnitude simpler to COMPUTERS. For computer to computer
communication it does not matter if the format is convenient for
humans so long as there is an easy way of converting it to a human
readable form, and as I understand it, there is. I have seen ASN.1
browsers.

James McGill · Apr 7, 2006

Granted the ASN.1 format is much more complicated to HUMANS, but it
is orders of magnitude simpler to COMPUTERS.

I'm with you, Green. Hope you didn't take my comments the wrong way,
which you didn't. I know how thick your skin is.

You inspired me to start reading the book...

http://www.oss.com/asn1/dubuisson.html

Oliver Wong · Apr 7, 2006

Roedy Green said:
I am willing to go with something other than ASN.1 but which gains the
advantages of
1. compactness
2. rapid parsing
3. small footprint decoder.
4. binary format that people won't screw with.

(4) will never be true, no matter what format you're using. Microsoft
designed their Xbox 360 with loads of DRM and security to make sure people
wouldn't screw with it. But people did anyway. It's human nature to be
curious, and to want to take things apart to see how they work.

- Oliver

Monique Y. Mudama · Apr 7, 2006

anyway. It's human nature to be curious, and to want to take
things apart to see how they work.

Well, maybe geek nature =)
