XML Not Good for Big Files (vs Flat Files)

Roedy Green

(4) will never be true, no matter what format you're using. Microsoft
designed the Xbox 360 with loads of DRM and security to make sure
people wouldn't screw with it. But people did anyway. It's human
nature to be curious, and to want to take things apart to see how
they work.

That is not what I mean. I want to stop people hand-editing XML and
then passing the files off into the world unvalidated. I have no
problem with people peeking under the hood. I love to do it, so why
should I stop others?

With a binary format, you need to use an editor or converter that
won't let you make a syntax error.

Consider HTML, a close relative of XML. Probably less than 1% of web
pages in the world are grammatically correct. This complicates
browsers and creates headaches with browser compatibility. People
"test" with one browser and assume it will work in others.

Consider what would happen if Tim Berners-Lee had defined a binary
HTML format. The idea was you hand composed fluffy HTML or used an
editor, and then ran a converter/uploader to put a compact version on
the web. Then there would be 99% accurate HTML. You would only have
to deal with the relatively minor problem of bad converter software.
The converter could be forgiving, without foisting the crap on
everyone else since it would be converted to accurate binary format
that very few people would be tempted to tamper with.

Imagine what would happen if Java were distributed without being
validated by JavaC first.

At the very least, a fluffy XML file should be digitally signed (or
something weaker) to indicate that it has been verified against some
schema, or that it has been mechanically generated by some program
certified not to make grammatical errors.
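That verification step is cheap to bolt on. Here is a minimal sketch
using the JDK's javax.xml.validation API; the file names
(phonebook.xsd, phonebook.xml) are made up for illustration:

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class VerifyBeforeShipping {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // Made-up file names: the schema the document claims to follow,
        // and the hand-edited document itself.
        Schema schema = factory.newSchema(new File("phonebook.xsd"));
        Validator validator = schema.newValidator();
        // Throws SAXException on the first well-formedness or validity
        // error, so nothing invalid gets passed off into the world.
        validator.validate(new StreamSource(new File("phonebook.xml")));
        System.out.println("phonebook.xml is valid; safe to publish.");
    }
}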
 
Jhair Tocancipa Triana

Homer said:
John,Smith,5555555,37 Finch Ave.

<FirstName>John</FirstName>
<LastName>Smith</LastName>
<PhoneNum>5555555</PhoneNum>
<Address>37 Finch Ave.</Address>

And tags are repeating and repeating:
1) XML tags are highly redundant,

IMHO, redundancy is (in most cases) bad.

Homer said:
so XML files, compressed, are little larger than alternative
encoding techniques.

You are comparing compressed XML files with non-compressed files in
other encodings (in the example above, comma-separated values).

If you compress the comma-separated values file, it will still be
smaller than the equivalent XML file...
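The claim is easy to test. Here is a minimal sketch using the JDK's
java.util.zip.GZIPOutputStream; the sample records are made up, and
they are repeated because on a single record the gzip header alone
would swamp the measurement (real data would also vary from record
to record, which matters for the outcome):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipSizeCheck {
    // Gzip a string and return the compressed size in bytes.
    static int gzippedSize(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return buf.size();
    }

    public static void main(String[] args) throws IOException {
        // Made-up records, repeated 1000 times each.
        String csv = "John,Smith,5555555,37 Finch Ave.\n".repeat(1000);
        String xml = ("<FirstName>John</FirstName>"
                    + "<LastName>Smith</LastName>"
                    + "<PhoneNum>5555555</PhoneNum>"
                    + "<Address>37 Finch Ave.</Address>\n").repeat(1000);
        System.out.println("CSV: " + csv.length() + " bytes raw, "
                           + gzippedSize(csv) + " gzipped");
        System.out.println("XML: " + xml.length() + " bytes raw, "
                           + gzippedSize(xml) + " gzipped");
    }
}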
 
Jhair Tocancipa Triana

bugbear said:
And, w.r.t repeating tags; 1 word. gzip.
Several applications simply use gzip'd XML
to get a good compromise.
gzip (and other compressors) are rather good
at crunching off the kind of trivial
repetition you object to.

That speaks well for gzip, but not for XML itself...
 
Jhair Tocancipa Triana

Yep. Best of both worlds. The parser sees the nice
XML and the comms sees a small file. Your objection
to this is... ?

Not everybody uses gzip.

E.g., the cost of integrating compression into a given piece of XML
processing software could be too high for some people.
 
Jhair Tocancipa Triana

Yes but, now we know what all the data means. Your example is quite
clear, but what about this one:

Lawrence,David,Maynard,MA

Could mean several things:
(1) Lawrence David lives in Maynard, MA.
(2) David Lawrence lives in Maynard, MA.
(3) David Maynard lives in Lawrence, MA.
(4) Maynard David lives in Lawrence, MA.
etc. You see where I'm going with this.

<FirstName>Lawrence</FirstName>
<LastName>David</LastName>
<City>Maynard</City>
<State>MA</State>

leaves no question.

FirstName,LastName,City,State<-----------------HEADER
Lawrence,David,Maynard,MA

leaves no question either...
 
Lasse Reichstein Nielsen

[zipped xml]
I have four objections:

1. The compression method is like putting a fat woman in a girdle. You
can do better if you start with someone without rolls of fat.

If the fatness is only a problem during transit, then the solution
should be one that reduces size during transit without affecting
behavior where it's not needed. Zipped XML satisfies that, and
both zip/unzip and xml-parsing code are readily available.
2. The file requires fat code to parse. Not suited for handhelds.

A zip stream decompressor and a stream based XML parser are actually
quite simple pieces of code.
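To illustrate how little code is involved, here is a minimal sketch
using the JDK's GZIPInputStream and SAX parser; the file name
people.xml.gz is made up:

import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class GzippedXmlReader {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Decompression is just one extra stream wrapper; the parser
        // never sees anything but plain XML.
        try (GZIPInputStream in =
                 new GZIPInputStream(new FileInputStream("people.xml.gz"))) {
            parser.parse(in, new DefaultHandler() {
                @Override
                public void startElement(String uri, String local,
                                         String qName, Attributes atts) {
                    System.out.println("element: " + qName);
                }
            });
        }
    }
}

Being stream based, neither the decompressor nor the parser ever holds
the whole document in memory.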
3. The file is slow to parse and create.

That might be true, but CPU-based compression is so much faster than
disk-based I/O (or network I/O, for that matter) that the time spent
compressing is more than paid for by the time saved storing or
sending the result.
But yes, it would be faster not having to create the millions of
characters of the XML file.
4. When you are done there is still no guarantee the file matches the
schema.

Garbage will be garbage, no matter what envelope it's traveling in.

/L
 
Jhair Tocancipa Triana

Hierarchical data, dude. What if someone has more than one phone
number? With the comma-delimited flat file approach, it's not readily
apparent how you could implement that. With something like

<Person>
  <FirstName>John</FirstName>
  <LastName>Smith</LastName>
  <PhoneNum>5555555</PhoneNum>
  <PhoneNum>5556666</PhoneNum>
</Person>

we can have as many PhoneNums as we want associated with a person,
and because it's all hierarchical we can just walk up the hierarchy
to see who these PhoneNums belong to.

You have been able to achieve the same result for decades in the
scenario you describe, using two files (one for the persons and
another for the phone numbers) and joining their contents (e.g. after
loading them into a relational database), as the sketch below shows.
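A minimal sketch of that two-file approach in Java; the file names
(persons.csv, phones.csv) and record layouts are made up:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class FlatFileJoin {
    public static void main(String[] args) throws IOException {
        // persons.csv: PersonId,FirstName,LastName (made-up layout)
        Map<String, String> persons = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get("persons.csv"))) {
            String[] f = line.split(",");
            persons.put(f[0], f[1] + " " + f[2]);
        }
        // phones.csv: PersonId,PhoneNum -- one line per number, so a
        // person may appear any number of times (the one-to-many case).
        for (String line : Files.readAllLines(Paths.get("phones.csv"))) {
            String[] f = line.split(",");
            System.out.println(persons.get(f[0]) + ": " + f[1]);
        }
    }
}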

So XML offers nothing new in the scenario you describe...
 
Monique Y. Mudama

Consider what would happen if Tim Berners-Lee had defined a binary
HTML format. The idea was you hand composed fluffy HTML or used an
editor, and then ran a converter/uploader to put a compact version
on the web. Then there would be 99% accurate HTML. You would only
have to deal with the relatively minor problem of bad converter
software. The converter could be forgiving, without foisting the
crap on everyone else since it would be converted to accurate binary
format that very few people would be tempted to tamper with.

Well, imo, that would have slowed down the development of the web
quite a bit. Certainly would have slowed down my grasp of HTML. The
great thing about the web was that, if you were curious about how a
site worked, you could just look at the source and figure out how to
apply it to your own page. Because of this, it was very easy to learn
and run with.

Effectively, the browser is your HTML compiler, and the problem is
that browsers allowed all sorts of sloppiness. Now they're reaping
the pain of the mess they allowed to grow.
 
Roedy Green

If the fatness is only a problem during transit, then the solution
should be one that reduces size during transit without affecting
behavior where it's not needed. Zipped XML satisfies that, and
both zip/unzip and xml-parsing code are readily available.

Zipped XML is not suitable for handhelds. It is not suitable for
routine transport either. If it were, HTML would be zipped too. It is
too time-consuming and too CPU-intensive.
 
Roedy Green

That might be true, but CPU based compression is so much faster than
disk based IO (or network IO for that matter) that the time spent
compressing is more than paid for by the time saved saving or sending
it.
But yes, it would be faster not having to create the millions of
characters of the XML file.

I would have thought that with all the improvements in CPU power,
transmissions would be routinely compressed by now. But they are not,
except when the file can be precompressed, as with program downloads.
Nobody compresses JSP output on the fly. Nobody even precompresses
HTML.

The problem is it puts a heavy burden on the server. Desktop clients
would be happy to decompress, but servers are not prepared to spend
the cost of compressing. It would increase their costs too much.

The other problem is the explosion of small devices that don't have
the ROM, RAM or CPU power to support a heavy duty decompression and
parser.

A binary data format that gave the benefits of compression without the
CPU or RAM overhead would effectively double the wireless bandwidth --
something in short supply.
 
Roedy Green

Garbage will be garbage, no matter what envelope it's traveling in.

There are three kinds of garbage: erroneous content, content that
fails schema validity checks, and badly formatted content. You can
all but eliminate the second and third kinds by insisting on some
kind of computer processing (a creator-verifier, a verifier, or a
converter) before the file goes out. As long as you let people edit
files with a generic text editor and send them out into the world
unverified, you are asking for a repeat of the HTML mess.
 
Roedy Green

Timo Stamm said:
HTML *is* zipped on many webservers today. It is very easy to set up
using Apache and mod_gzip.

See this article for more information:
http://www.webreference.com/internet/software/servers/http/compression/

Timo

"Webmasters typically see a 150-160% increase in Web server
performance, and a 70% - 80% reduction in HTML/XML/JavaScript
bandwidth utilized, using this module. Overall the bandwidth savings
are approximately 30 to 60%. (This factors in the transmission of
graphics.) Here's a test run by Remote Communications using their
modified Apachebench above."

This would be for pre-compressed data. That precludes JSP or SSI. With
either of those, you get a big CPU hit per transaction to individually
compress.

Ideally, HTML would have a compression scheme that could exploit the
fact that so much of HTML is boilerplate common to many pages, e.g.
the headers and footers. As it is, gzip can't do much with a small
file like http://mindprod.com/jgloss/plan9.html. Ideally, with a
preloaded site-wide compression dictionary, such a page could crunch
down to a few bytes.
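Something close to that already exists in the deflate format itself:
a preset dictionary shared by compressor and decompressor. A minimal
sketch using the JDK's Deflater/Inflater; the dictionary contents and
sample page are made up:

import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictionaryDemo {
    public static void main(String[] args) throws Exception {
        // A made-up "site dictionary" of boilerplate shared by pages.
        byte[] dict = ("<html><head><title></title></head><body>"
                     + "<h1></h1><p></p></body></html>")
                          .getBytes(StandardCharsets.UTF_8);
        byte[] page = ("<html><head><title>Plan 9</title></head><body>"
                     + "<h1>Plan 9</h1><p>tiny page</p></body></html>")
                          .getBytes(StandardCharsets.UTF_8);

        // Compress with the dictionary preloaded.
        Deflater def = new Deflater();
        def.setDictionary(dict);
        def.setInput(page);
        def.finish();
        byte[] out = new byte[1024];
        int len = def.deflate(out);
        System.out.println(page.length + " bytes -> " + len);

        // The decompressor must preload the very same dictionary.
        Inflater inf = new Inflater();
        inf.setInput(out, 0, len);
        byte[] restored = new byte[1024];
        int n = inf.inflate(restored);
        if (inf.needsDictionary()) {
            inf.setDictionary(dict);
            n = inf.inflate(restored);
        }
        System.out.println(new String(restored, 0, n,
                                      StandardCharsets.UTF_8));
    }
}

The catch, of course, is that both ends must agree on the dictionary
out of band, which plain HTTP gzip encoding does not provide for.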
 
Timo Stamm

Roedy said:
I would have thought with all the improvements in CPU power that by
now transmissions would be routinely compressed. But they are not,
ONLY when the file can be precompressed, such as program downloads.
Nobody compresses JSP on the fly. Nobody even precompresses HTML.

www.google.com
www.msdn.com
www.ibm.com
www.sun.com

... they all send gzip-compressed HTML.

I think your info is a bit outdated. All major browsers have
supported gzip encoding since about 1999 or so. And CPU power seems
to be good enough today to compress on the fly.

The other problem is the explosion of small devices that don't have
the ROM, RAM or CPU power to support a heavy duty decompression and
parser.

The user agent has to tell the server that it can handle compressed
data. It can do so with a simple header:

Accept-Encoding: gzip

If this header is not present, the server cannot send compressed data.
So a low-profile device that can't handle compressed data simply
shouldn't send a request for compressed data.
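Here is a minimal sketch of that negotiation on the server side,
using the JDK's built-in com.sun.net.httpserver; the page content is
made up:

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipNegotiationServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", exchange -> {
            byte[] body = "<html><body>hello</body></html>"
                              .getBytes(StandardCharsets.UTF_8);
            String accept = exchange.getRequestHeaders()
                                    .getFirst("Accept-Encoding");
            if (accept != null && accept.contains("gzip")) {
                // Client advertised gzip support, so compress.
                exchange.getResponseHeaders().set("Content-Encoding", "gzip");
                exchange.sendResponseHeaders(200, 0); // chunked body
                try (OutputStream out =
                         new GZIPOutputStream(exchange.getResponseBody())) {
                    out.write(body);
                }
            } else {
                // No Accept-Encoding: gzip header: send it plain.
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            }
        });
        server.start();
    }
}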


Timo
 
Roedy Green

The user agent has to tell the server that it can handle compressed
data. It can do so with a simple header:

Accept-Encoding: gzip

If this header is not present, the server cannot send compressed data.
So a low-profile device that can't handle compressed data simply
shouldn't send a request for compressed data.

I was trying to figure out how to set up a website with mixed html
and gz. I figured I would need basically two complete websites, one
with links to the .html.gz versions and one with links to the .html
versions. But I guess what you are suggesting is letting Tomcat
create and manage the pair and always use .html links. Would this
preclude SSI, or is the compression done on every transaction?
 
Roedy Green

www.google.com
www.msdn.com
www.ibm.com
www.sun.com

... they all send gzip compressed HTML.

You made my day. Transmitting files as fluffy as HTML has irritated
the heck out of me for years. I am tickled that it was handled so
transparently, too.

Now if only I can get my ISP to let me run a server that will do that.

I also checked your compression number and discovered it was correct,
even for Java's ZipOutputStream. This is better than I thought; I had
been doing benchmarks with other types of file thrown in, which
lowered the average. I still think that with a bit of cleverness,
something like a referenced style sheet giving a dictionary of the
common tags and vocabulary, you could squeeze another notch out of
it, but with 67% compression you are getting the bulk of the fat out.

What might happen as this compression becomes commonplace is that
CPUs might get some special hardware assist for dealing with it, so
even small devices can rapidly decompress it.
 
Timo Stamm

Roedy said:
I was trying to figure out how to set up a website with mixed html and
gz. I figured I would need basically two complete websites, ones with
links to the .html.gz versions and one to the .html versions.

Fortunately this is not necessary.

But I
guess what you are suggesting is letting Tomcat create and manage the
pair and always use .html links.

The links don't have to change at all. But if the client accepts a
gzip-encoding, the webserver can respond with compressed data.

This is what HTTP 1.1 specifies.

It is unimportant whether the compressed content of the response is
pregenerated or created on the fly.

Would this preclude SSI, or is the compression done on every transaction?


AFAIK, a request for a file that uses SSI should be handled like this:

- webserver receives request for resource
- webserver looks up resource
- webserver scans file for SSI instructions
- webserver reads file and included file into buffer
- webserver sets content length header in response
- webserver writes data to response


Using gzip compression, it could look like this:

- webserver receives request for resource
- webserver looks up resource
- webserver scans file for SSI instructions
- webserver reads file and included file into buffer
- webserver compresses data if wanted
- webserver sets content length header in response
- webserver writes data to response


So I think that SSIs shouldn't be affected at all.
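In code, the only new step is compressing the buffer before the
Content-Length is computed. A minimal sketch; the method name and the
assumption that the SSI-expanded page is already sitting in a byte
array are made up:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class SsiResponseBody {
    // expandedPage: the file with all SSI includes already merged in.
    // Returns the bytes to send; Content-Length is the returned length.
    static byte[] prepareBody(byte[] expandedPage, boolean clientAcceptsGzip)
            throws IOException {
        if (!clientAcceptsGzip) {
            return expandedPage;
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(expandedPage);
        }
        return buf.toByteArray();
    }
}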


Timo
 
Lasse Reichstein Nielsen

[mod_gzip saves bandwidth and increases web server performance]
This would be for pre-compressed data. That precludes JSP or SSI.

It can just as easily be generated content. Google gzips their
responses. I guess it's cheaper to buy processing power than bandwidth
these days.

/L
 
