Andy said:
> Yes, about 4 years ago - it's last century's problem.
> Even then, I was juggling XML and rich-media. XML is primarily a
> format for text content, so it's just _tiny_ in comparison to any
> image or video data.
I don't think that's the kind of data the OP had in mind. In the
context of video data, it might indeed be tiny by comparison, but I
suspect that most of us work with "last century's data" and so we still
think about things like bandwidth, efficiency, and other anachronistic
concepts of engineering.
> Mainly I work in RDF. Fairly long names, lots of repetition of
> properties like "type", and honking great URIs all over the place.
> Switching <foo> to <fo> isn't going to make a blind bit of difference.
In that context, maybe not, but let's try an experiment with real data
of the non-RDF variety.
The experiment:
I chose the Wake County, North Carolina voter database as the source for
my sample data. It's freely downloadable from the web, contains a very
typical kind of name and address data, and is large enough (with 415613
records) to draw some useful conclusions. I extracted the
first five fields of each record of that plain-text database, which the
state government labels voter_reg_number, last_name, first_name,
midl_name, and name_sufx. I think those are sufficiently expressive
names that we'd all be able to guess their meanings without a second
thought, so I used them as tag names, too. After wrapping each record in
<voter></voter> delimiters and the whole thing in <voters></voters>
tags, with minimal other markup, my test file is 60685379 bytes long
using an 8-bit encoding and Unix-style line endings (one per record).
Compression:
First, I tried various techniques to reduce the size of the XML file.
The original file is voters1.xml and each voter record has these fields:
voter_reg_number, last_name, first_name, midl_name, and name_sufx
The second file is voters2.xml and each voter record has these fields:
reg_number, last_name, first_name, midl_name, and name_sufx
(The change is that voter_reg_number became just reg_number.)
The third file is voters3.xml and each voter record has these fields:
reg_number, name
Within name there are four fields: last, first, midl, and sufx
(The change is that name now has subfields.)
The fourth file is voters4.xml and each voter record has these fields:
reg_number, foo
Within foo there are four fields: last, first, midl, and sufx
(The change is that name is changed to foo.)
The fifth file is voters5.xml and each voter record has these fields:
reg_number, fo
Within fo there are four fields: last, first, midl, and sufx
(The change is that foo is changed to fo.)
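To make the five layouts concrete, here is a minimal Python sketch that renders one record in each form and prints its length. The sample values are made up, and the exact element layout (one <voter> element per line, no attributes or extra whitespace) is an assumption, since the post does not show the files themselves.

```python
from xml.sax.saxutils import escape

def flat(values, reg_tag):
    """Layouts 1 and 2: five sibling fields directly under <voter>."""
    names = [reg_tag, "last_name", "first_name", "midl_name", "name_sufx"]
    body = "".join(f"<{n}>{escape(v)}</{n}>" for n, v in zip(names, values))
    return f"<voter>{body}</voter>\n"

def nested(values, group_tag):
    """Layouts 3 to 5: the name parts sit inside one grouping element."""
    reg, *name_parts = values
    inner = "".join(
        f"<{n}>{escape(v)}</{n}>"
        for n, v in zip(["last", "first", "midl", "sufx"], name_parts)
    )
    return (
        f"<voter><reg_number>{escape(reg)}</reg_number>"
        f"<{group_tag}>{inner}</{group_tag}></voter>\n"
    )

# Hypothetical record, not taken from the real database.
sample = ["000000001", "SMITH", "JOHN", "Q", ""]

layouts = {
    "voters1": flat(sample, "voter_reg_number"),
    "voters2": flat(sample, "reg_number"),
    "voters3": nested(sample, "name"),
    "voters4": nested(sample, "foo"),
    "voters5": nested(sample, "fo"),
}

# The sample is plain ASCII, so character count equals byte count.
sizes = {name: len(text) for name, text in layouts.items()}
for name, size in sizes.items():
    print(f"{size:4d} {name}")
```

Under these assumptions each step saves a fixed number of bytes per record (12, 27, 2, and 2 respectively), regardless of the field values.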
Here are the sizes and names of the files generated:
60685379 voters1.xml
55697543 voters2.xml
44474912 voters3.xml
43643606 voters4.xml
42812300 voters5.xml
18250 voters1.xml.bz2
17519 voters2.xml.bz2
14251 voters3.xml.bz2
13921 voters4.xml.bz2
12520 voters5.xml.bz2
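For anyone repeating the measurement, a listing like the one above could be produced with a short script along these lines. This is only a sketch: it uses Python's standard bz2 module at compression level 9, which is an assumption about the settings, and the helper names bz2_size and report are hypothetical.

```python
import bz2
import os

def bz2_size(path, chunk_size=1 << 20):
    """Return how many bytes the file at `path` occupies after bzip2 compression."""
    compressor = bz2.BZ2Compressor(9)  # level 9 is an assumed setting
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # compress() may return b"" while data is still buffered internally.
            total += len(compressor.compress(chunk))
    total += len(compressor.flush())
    return total

def report(paths):
    """Print original and compressed sizes for each file that exists."""
    for path in paths:
        if os.path.exists(path):
            print(f"{os.path.getsize(path):>10} {path}")
            print(f"{bz2_size(path):>10} {path}.bz2")
```

Calling report(f"voters{i}.xml" for i in range(1, 6)) in the directory holding the test files prints one pair of lines per file; reading in chunks keeps memory use small even for the 60 MB original.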
I'll leave it to you to analyze all the details, since I've provided all
the data to do that, but I thought I'd point out a couple of salient
points. Just a judicious use of shorter tags gives a compressed file
that's 22% smaller (voters3.xml.bz2 compared to voters1.xml.bz2) and no
less comprehensible to humans. Also, note that contrary to your guess,
a change of a single tag from <foo> to <fo> yields a 10% decrease in
size in the compressed files (voters5.xml.bz2 compared to
voters4.xml.bz2) even though the uncompressed versions of those files
differ in size by less than 2%.
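The percentages quoted here can be checked directly from the listed sizes. A small Python sketch of the arithmetic (the function name pct_smaller is just illustrative):

```python
def pct_smaller(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before

# Sizes copied from the listing, keyed by file number.
xml_sizes = {1: 60685379, 2: 55697543, 3: 44474912, 4: 43643606, 5: 42812300}
bz2_sizes = {1: 18250, 2: 17519, 3: 14251, 4: 13921, 5: 12520}

print(f"voters1 -> voters3, compressed:   {pct_smaller(bz2_sizes[1], bz2_sizes[3]):.1f}%")
print(f"voters4 -> voters5, compressed:   {pct_smaller(bz2_sizes[4], bz2_sizes[5]):.1f}%")
print(f"voters4 -> voters5, uncompressed: {pct_smaller(xml_sizes[4], xml_sizes[5]):.1f}%")
```

This prints roughly 21.9%, 10.1%, and 1.9%, matching the rounded figures in the text.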
Conclusions:
1. Using shorter tags may indeed save transmission time.
2. Restructuring "flat" data may give better results without sacrificing
clarity to human readers.
3. Sometimes results are counterintuitive and data-dependent. Measuring
effects on your actual data and comparing those to the engineering
problem to be solved is the only sure way to proceed.
I hope that helps clarify things. If anyone would like to duplicate
this experiment, you can find the raw data at
http://msweb03.co.wake.nc.us/bordelec/Waves/WavesOptions.asp
Ed