XML Not good for Big Files (vs Flat Files)

Steve Wampler

Timo said:
Of course there is. There are various ways to define schemes for XML
documents:

http://en.wikipedia.org/wiki/XML_schema#XML_schema_languages

No, there isn't. There *is* if *someone else* defines the schema, but
if I'm defining it, exactly what is going to stop me? [Please note, if
you've come into this discussion late, that I (personally) am *not*
advocating doing so.]

I've actually seen XML (*not* mine!) where the person had defined an
array's contents via (paraphrasing, this is a while ago, fortunately):

<array size=15>
<a1>5</a1>
<a2>13</a2>
...
<a15>37</a15>
</array>

(I suppose this allowed position-independent arrangement of the elements, but
there are certainly better ways, even in XML...)
 
Steve Wampler

Timo said:
Have a look at XSDs.

I have. I stand by my statement. What about XSD *isn't* about syntax?
Granted, XSDs provide very fine-grained control over syntactic issues.
 
Roedy Green

The Canadian government, which I've been led to understand is the most
progressive on Earth, etc.

A government with a smaller population to serve has a huge
advantage when it comes to being light on its feet. I worked for a
Canadian crown corporation writing an RFP for about a million dollars
worth of computer equipment. I was in Seattle for a New Year's eve
party and met a guy doing something similar there. We both bitched
about all the silly regulations and petty legalities. We decided to
swap RFPs to see who had it worse. His was ten times thicker.

The thing that blows my mind about the US bureaucracy is that crooks
have managed to embezzle trillions of dollars over the last decade and
hardly anyone even knows about it. See
http://mindprod.com/politics/iraqeconomics.html near the bottom.
Mastermind crooks pulled off the heist of the century and it did not
even make the front page.

The amount of activity and the amounts of money are so huge that nobody
stays on top of what is going on. Further the amounts of money are so
huge that corruption and coverup are guaranteed.
 
Roedy Green

My guess is that you don't really understand either my post, or
XML. It's not the FORMAT of XML, it's the fact that it contains
MEANING. So, if the sender and receiver have a shared ontology
that says that FirstName is someone's first name, then the data
<FirstName>John<FirstName> i

Even a CSV file with a first line of field names contains the same
amount of information as the obese XML, for a file like the one shown.

What the raw XML provides is not particularly useful information. You
can glean that by inspecting the file. The information you want, which
is missing, is how well validated each of the fields is: what guarantees
exist on values, what the complete set of possibilities for each
enumeration is, and what they mean.

Since the early DOS days I have been exporting data to people in
several formats: SQL, CSV, and fixed-length ASCII fields. I generate
a separate human-readable "schema" file that describes each field,
including its limits, length and offset.

Nobody has ever had trouble interpreting one of the files.

For a FLAT file there is no need to use tags. They are needed only when
you have a structured file.
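
A minimal sketch of the fixed-length approach, assuming a made-up three-field layout. The field names, offsets and lengths here are hypothetical; in practice they would be read off the separate human-readable schema document.

```python
# Hypothetical schema: (field name, offset, length).
SCHEMA = [
    ("first_name", 0, 10),
    ("last_name", 10, 10),
    ("phone", 20, 7),
]

def parse_record(line):
    """Slice one fixed-length record into a dict, trimming the padding."""
    return {name: line[off:off + length].rstrip()
            for name, off, length in SCHEMA}

# Build a sample record padded to the widths the schema declares.
record = "John".ljust(10) + "Smith".ljust(10) + "5555555"
print(parse_record(record))
```

No tags appear in the data at all; the layout lives in one place and every record is the same width.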
 
Roedy Green

Hierarchical data, dude. What if someone has more than one phone
number? With the comma-delimited flat file approach, it's not readily
apparent how you could implement that.

<Person>
<PhoneNumber>...</PhoneNumber>
<PhoneNumber>...</PhoneNumber>

You use a comma to represent any field which is not present. You
don't just have a list of phone numbers, you assign them specific
functions. You have something like this:

cell
home
work
800
fax
messages
emergency

The other way you do it is to have a separate phone numbers file (this
is SQL-think). Then you can have an arbitrary number of phone numbers.

The phone number file has the form

account#, phone

If you are exporting data only to import it into SQL again, this is a much
more convenient format than XML hierarchy. SQL does not handle
variable numbers of things well directly, so you end up having to
write a complicated mess of XML export and import handling code, as
well as the process taking 100 times longer than it need do.
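
A rough sketch of the separate phone-numbers file, assuming a header line and invented account numbers and phone values (none of these come from a real export). It only shows that an arbitrary number of phones per account falls out of a plain two-column file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export: one row per (account#, phone) pair.
PHONES_CSV = """account,phone
1001,555-0100
1001,555-0101
1002,555-0199
"""

def phones_by_account(text):
    """Group phone numbers under their account number."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        grouped[row["account"]].append(row["phone"])
    return dict(grouped)

print(phones_by_account(PHONES_CSV))
```

Each row maps directly onto a row of a two-column SQL table, so the import needs no hierarchy handling.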
 
Roedy Green

<Album>
<Artist> Stevie Wonder </Artist>
<Title> Innervisions </Title>
<Producer> .. </Producer>
<Track number=1 name=".."/>
<Track number=2 name=".."/>
... etc..
<Price> £5</Price>
</Album>

Hand-coded XML is almost guaranteed to contain errors. Unless you do
something to insist XML is validated before use, all you have done is
invent yet another avenue for data corruption. You can't even tell
if it has been validated against some schema.
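
As a small illustration, the quoted Album fragment would itself be rejected by a parser, because its Track attributes are unquoted. The sketch below uses Python's standard xml.etree.ElementTree and checks well-formedness only; that is a much weaker test than validation against a schema, which the standard library does not provide. The helper name is made up for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML. This is NOT schema validation."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Unquoted attribute value, as in the hand-written example.
UNQUOTED = '<Album><Track number=1 name=".."/></Album>'
# The same fragment with the attribute value quoted.
QUOTED = '<Album><Track number="1" name=".."/></Album>'

print(is_well_formed(UNQUOTED), is_well_formed(QUOTED))
```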

It is the same bloody mess that HTML has foisted on us.
 
Roedy Green

With XML, it's possible to express unambiguously any possible string of
characters (using, e.g., entity-references).

You have made a much better case for binary strings that don't need
fancy XML escaping than you have for XML.
 
Roedy Green

Okay, I guess it is widely supported. I just haven't happened to have
come across anything in my development work that ever made use of it
(that I know of). I shouldn't have generalized that to the rest of
the world.

Actually you probably have, but did not recognize it. You have a
digital cert, perhaps self-signed, do you not?

ASN.1 is used to define all manner of things, from the format of
digital certificates, credit card transactions, cell phone messages
 
Roedy Green

In the first example is 5555555 a phone number, or
part of the address?

The traditional way to handle that is with a first line
consisting of field names, plus a separate document describing
each field in proper detail, including what it means.

Have you ever written a computer program to submit something to a bank
or to the government of any country? The specification for a single
file comes as a book. There are paragraphs on every field.

The XML description is just a fraction of the information. And, for a
flat file, there is no need to spell the tags out over and over and
over. Any programmer understands the first time. The repetition just
introduces the complication that the tags might NOT be perfectly
repetitive.

XML is for tree structured data. It is hopeless at anything else.
 
James McGill

The thing that blows my mind about the US bureaucracy is that crooks
have managed to embezzle trillions of dollars over the last decade and
hardly anyone even knows about it.

Controversial opinion, informed by partisan bias, and not one that I
necessarily disagree with. Take it to alt.politics (where I read your
posts and often correspond).

So, what's the ASN.1 equivalent of JAXB?
 
Andrew McDonagh

<Pet>
<Type>Dog</Type>
You use a comma to represent any field which is not present. You
don't just have a list of phone numbers, you assign them specific
functions. You have something like this:

One of XML's greatest advantages over CSV, flat files, etc., is that it
supports schema evolution without requiring code changes.

Because applications look only for the XML nodes they know about, they
ignore all other nodes. So in the Person node example, should we need
to add a child node <Pets>, we can do so without harming the
existing app.
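
A minimal sketch of that behaviour, assuming a Person document with FirstName and LastName children (the documents and the function name are invented for illustration). The reader asks only for the nodes it knows about, so a later <Pets> child changes nothing for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical "version 1" document.
OLD_DOC = "<Person><FirstName>John</FirstName><LastName>Smith</LastName></Person>"
# Hypothetical later version with an extra <Pets> child.
NEW_DOC = ("<Person><FirstName>John</FirstName><LastName>Smith</LastName>"
           "<Pets><Pet><Type>Dog</Type></Pet></Pets></Person>")

def read_name(text):
    """Look up only the known child nodes; anything else is ignored."""
    person = ET.fromstring(text)
    return person.findtext("FirstName"), person.findtext("LastName")

print(read_name(OLD_DOC))
print(read_name(NEW_DOC))
```

This holds only for readers written in this lookup-by-name style; a reader that walks children by position would still break.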
 
Kent Paul Dolan

Homer said:
I am a little bit tired of this obsession people
have with XML and XML technology.

Bad call.

Please share your thoughts and let me know if I am
thinking in a wrong way.

Yes, you are.

I believe some people are over using XML all over
the place.

Nope, it's pretty much become the data encoding
method of choice purely on its merits.

Nowadays Canadian Government is pushing XML to its
organization as standard for data/file transfer.

Excellent! They are taking the appropriate steps to
avoid the universal experience of first world
governments in the sixth decade of computer handling
of government data, that files which are not
self-describing "go stale" and become
uninterpretable over long periods of time as
technologies supersede one another.

Anecdote: I once worked for/alongside the US
National Ocean Survey. The original survey
documents, from 1803, in paper RECORD logbooks
visually identical to ones you can purchase in a
stationery shop today, were still in use as active
data. At the same time, I was tasked with finding
some digital technology that would endure even half
a century. The sad conclusion was that at the time
(1975), no such technology existed. The point isn't
that DVDs have solved that problem (they haven't),
but that government records are still of interest
decades-to-centuries after they are first encoded.
Only self-describing documents have a prayer of
meeting that requirement.

Huge files moving between companies now include
tones of XML Tags repeating all over the file and
slowing down networks and crashing applications
because of size.

1) XML tags are highly redundant, so XML files,
compressed, are little larger than the same data in
alternative encodings.

2) XML isn't guaranteed to be "legal" until the
whole document has been parsed, but that doesn't
prevent the document from being parsed as it is
received and stored internally in some much more
compact format than the transmittal format. So,
if a program crashes trying to cope with an XML
document, that same document will overwhelm the
program in _any_ encoding.

3) Thus, your complaint is properly about large
document transmittal, not the XML encoding of
those documents.
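
Point 2 can be sketched with an incremental parser. The example below assumes Python's standard xml.etree.ElementTree.XMLPullParser and a made-up PersonList document delivered in arbitrary chunks; the chunk boundaries and element names are illustrative only.

```python
import xml.etree.ElementTree as ET

def count_people(chunks):
    """Feed the document piece by piece and handle each Person as it completes."""
    parser = ET.XMLPullParser(events=("end",))
    count = 0
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == "Person":
                count += 1
                # Drop the record's children and text once handled; the
                # emptied element itself stays attached to its parent.
                elem.clear()
    return count

# Hypothetical chunks, deliberately split in the middle of a tag.
chunks = [
    "<PersonList><Person><First",
    "Name>John</FirstName></Person>",
    "<Person><FirstName>Jane</FirstName></Person></PersonList>",
]
print(count_people(chunks))
```

The whole document never has to be held as text; only the record currently being assembled does.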
I am not objecting to the whole technology. I know
advantages of XML and using it all the times for
Config files or our web oriented applications but
using it as standard for moving big files is going
too far.

Would it be a good guess that French rather than
English is your native language? Yeesh.

Anyway, despite that I myself put off learning XML
far too long, and still can't claim competence with
it, XML isn't just a fad, it is the wave of the
future.

HTH

xanthian.
 
Roedy Green

XML isn't particularly useful for the original sender and receiver.
They would probably be better off using a binary format. It is useful
for the third party who wants his product to interact or compete with
the software used by sender and receiver and therefore needs to
reverse engineer the protocol being used between them. In this
context, a high level of protocol redundancy is extremely useful since
it makes it reasonably easy for

So what if instead you wrote your schema, then used automated tools
to create a much more compact ASN.1 binary file that you can parse 100
times faster and can turn back into fluffy XML any time you want using
the ASN.1 schema? It really amounts to a more-clever-than-usual
compression scheme for XML, in that you can read it directly rather
than having to decompress it first.

Then look on fluffy XML as a debugging dump format. For computer-to-
computer exchange you create and parse ASN.1. The fluffy
form never exists except conceptually.

Your problem now is making sure XSD and ASN.1 schemas for files are
easily available. You stop exchanging schema-less unvalidated files.
You stop exchanging fluffy XML. You exchange only compact ASN.1 files
and store your large XML files as ASN.1. You might still leave
configuration files as XML, though a smart app would parse them any
time they change to make sure they pass muster and thereafter use
the compact ASN.1 files. The advantage is the app does not need to
load a whacking great XML parser and schema every time it loads. All
it needs is a tiny binary "parser" which is not even parsing in the
classic sense.
 
Roedy Green

SMTP isn't a very good protocol by any stretch of the
imagination, but it is _simple_ and you can very easily hook into it

And because it was so simple look what a fucking mess email is in.
People who write email clients are not simpletons. They need a
protocol that works, not one you can understand in five minutes.

SMTP was a hack to do an email demo. It was not rethought once the
problems of scale and spam became apparent.
 
Roedy Green

<LastName>Smith</LastName>
<PhoneNum>5555555</PhoneNum>
<Address>37 Finch Ave.</Address>

what about,

<PersonList>
<Person firstName="John" lastName="Smith" phoneNum="5555555"
address="37 Finch Ave." />
</PersonList>

In that particular case, you might still want phoneNum as a tag so you
could have multiples. But even so, you still bulk up your 30 million
record file with the same information specified over and over and
over. Computers and even humans hear you the first time.

For computer-to-computer communication you need to put the format
information up front in a computer-understandable way. Then the data
can be densely packed with minimal tags. To view the data you need
something that understands the header and can either display it in
conventional XML format or like a tree, or like a spreadsheet or in
some custom template that is maximally convenient for viewing the
particular data of interest. The whole point of all that tagging
originally was so you could extract just what was currently of
interest. You should not be looking at raw XML normally.
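
One way to picture the "format up front, dense data after" idea is a header line followed by untagged rows, with a viewer that expands them into conventional XML on demand. This is only a sketch: the comma-separated layout is an assumption, and the field names are borrowed from the Person example above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical dense file: field names stated once, then bare rows.
DENSE = """firstName,lastName,phoneNum,address
John,Smith,5555555,37 Finch Ave.
"""

def to_xml(text):
    """Expand the dense rows into a conventional tagged XML string."""
    root = ET.Element("PersonList")
    for row in csv.DictReader(io.StringIO(text)):
        person = ET.SubElement(root, "Person")
        for field, value in row.items():
            ET.SubElement(person, field).text = value
    return ET.tostring(root, encoding="unicode")

print(to_xml(DENSE))
```

The tagged form exists only when somebody asks to look at it; the stored file carries each field name exactly once.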
 
Monique Y. Mudama

Actually you probably have, but did not recognize it. You have a
digital cert, perhaps self-signed, do you not?

ASN.1 is used to define all manner of things, from the format of
digital certificates, credit card transactions, cell phone messages

Hence my "(that I know of)" fudge =)
 
Oliver Wong

Roedy Green said:
You have made a much better case for binary strings that don't need
fancy XML escaping than you have for XML.

The problem with a "straight-to-binary" approach is that you'd have to
use custom tools to process the data. With XML, you can use a generic XML
editor, or, worst case, a simple text editor.

I don't "mind" ASN.1 so much if only the editors were more readily
available. From my perspective, it's almost the same as using gzip to unzip
a file yielding an XML document, and then using an XML Editor on the
resulting XML document.

- Oliver
 
Oliver Wong

Roedy Green said:
Hand coded XML is almost guaranteed to contain errors.

Guaranteed is a bit strong here. I've written XML documents by hand
before and got them right on the first try.

Unless you do
something to insist XML is validated before use, all you have done is
invent yet another avenue for data corruption.

The only other place corruption could occur is the names of the elements,
the names of the attributes, or some of the punctuation (e.g. '<', '>', '/').
Should such corruption occur, it's trivial for a human to fix them, and some
software tools are pretty good at guessing at the fixes as well.

Contrast this with the majority of so-called "binary" formats.

You can't even tell
if it has been validated against some schema.

I think for most file formats, you cannot tell, just by looking at the
file, if it was "checked" for correctness before it arrived on your
harddisk. You could check it for correctness, just like you can check an XML
document for correctness, but you can't check that whoever wrote it first
validated it before sending it to you.

It is the same bloody mess that HTML has foisted on us.

I think HTML is pretty good for the problems it tries to solve
(human-writable-and-readable representation of documents in a
platform-independent fashion, with some hyperlinking functionality), and every
version is better than the last. The only serious competition I can think of
is LaTeX, and I found it far more difficult to use than HTML, though it is
more powerful.

- Oliver
 
Oliver Wong

Steve Wampler said:
I've actually seen XML (*not* mine!) where the person had defined an
array's contents via (paraphrasing, this is a while ago, fortunately):

<array size=15>
<a1>5</a1>
<a2>13</a2>
...
<a15>37</a15>
</array>

(I suppose this allowed position-independent arrangement of the elements, but
there are certainly better ways, even in XML...)

Of course, that should read:

<array size="15">
<element index="1">5</element>
<element index="2">13</element>
...
<element index="15">37</element>
</array>

Having elements with different names (e.g. "a1", "a2", etc.)
representing the same "kind" of thing is a no-no.

As others have said, XML's strengths are more apparent when the data to
store is hierarchical, rather than flat (as an array is).

- Oliver
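
For what it's worth, the corrected form is also easy to read back regardless of element order. A minimal sketch, assuming 1-based index attributes as in the example and a shortened three-element array (the values are taken from the post, the ordering is shuffled on purpose):

```python
import xml.etree.ElementTree as ET

# Shortened, deliberately out-of-order version of the corrected array.
DOC = """<array size="3">
<element index="2">13</element>
<element index="1">5</element>
<element index="3">37</element>
</array>"""

def read_array(text):
    """Place each element by its 1-based index attribute, whatever the order."""
    root = ET.fromstring(text)
    values = [None] * int(root.get("size"))
    for el in root.findall("element"):
        values[int(el.get("index")) - 1] = int(el.text)
    return values

print(read_array(DOC))
```

With names like a1, a2, ... the reader would instead have to generate or pattern-match tag names, which is the no-no being described.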
 
Oliver Wong

Roedy Green said:
Actually you probably have, but did not recognize it. You have a
digital cert, perhaps self-signed, do you not?

ASN.1 is used to define all manner of things, from the format of
digital certificates, credit card transactions, cell phone messages

I think there are two different intended meanings of "use" here:

A: I've never used C++. All my development work is in Java.
B: What about that OS you're running? That's written in C++!

I've never used ASN.1 in the sense that person A is thinking, though if
ASN.1 is used for credit cards, probably a heck of a lot of people have used
ASN.1 in the sense that person B is thinking (including myself).

- Oliver
 
