humongous flat file


Dennis Farr

It has been suggested that rather than convert an already large flat
file, with many similar rows, to XML, some type of header be attached
to the file, containing some sort of meta-XML description of the rows
that follow. The hope is that the result will not grow as large as a
pure XML file, but still be easy to exchange. Multiple vendors would
still be able to track format changes easily. The size of the flat
file, without XML, is already an issue.
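
Roughly, I imagine something like this (every name here is invented,
purely to illustrate the shape of it):

<skeletons>
  <table name="customer">
    <field name="id" len="8"/>
    <field name="surname" len="20"/>
  </table>
</skeletons>
customer:00000001Smith
customer:00000002Jones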

If it is not already apparent, I'm new to XML. Does anything like this
already exist? Thanks.

Dennis Farr
Treefrog Enterprises

-- "Can giraffes swim?" --
 

Andy Dingley

Dennis Farr said:
If it is not already apparent, I'm new to XML. Does anything like this
already exist? Thanks.

It's a bad idea; don't do it. These ideas were popular in the last
century, when the "verbosity" of XML was seen as a problem.

It isn't. Get over it.


If you want to do XML, then do it. It's not rocket science.

Don't invent some whacko new pseudo-XML protocol to fix problems that
aren't there.

If you hate XML, then just say so. Enjoy your punch cards.
 

Denis Saunders

Dennis Farr said:
It has been suggested that rather than convert an already large flat
file, with many similar rows, to XML, some type of header be attached
to the file, containing some sort of meta-XML description of the rows
that follow. The hope is that the result will not grow as large as a
pure XML file, but still be easy to exchange. Multiple vendors would
still be able to track format changes easily. The size of the flat
file, without XML, is already an issue.

If it is not already apparent, I'm new to XML. Does anything like this
already exist? Thanks.

Dennis Farr
Treefrog Enterprises

-- "Can giraffes swim?" --

If your flat file contains fixed-length records and the data is textual, then
you may already be carrying overhead in redundant trailing spaces. Those
spaces would not be carried over to the XML file, so you may see a
significant reduction in file size. There is also no need to be overly
verbose in your XML tag names: for instance, a <CustomersSurname> tag can
be reduced to <CS> as long as you keep uniqueness. Descriptive tag names are
irrelevant to storing the data; an end application can supply the wordy
descriptions.
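
As a rough illustration of what tag length alone costs per record
(Python; the record itself is made up):

# A made-up customer record, once with descriptive tags and once with
# the shortest unique tags.
long_tags = ("<Customer><CustomersSurname>Smith</CustomersSurname>"
             "<CustomersForename>Jan</CustomersForename></Customer>")
short_tags = "<C><CS>Smith</CS><CF>Jan</CF></C>"

# Uncompressed bytes per record; multiply by the row count.
print(len(long_tags), len(short_tags))  # 105 33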

Denis
 

Dennis Farr

Denis Saunders said:
If your flat file contains fixed-length records and the data is textual, then
you may already be carrying overhead in redundant trailing spaces. Those
spaces would not be carried over to the XML file, so you may see a
significant reduction in file size. There is also no need to be overly
verbose in your XML tag names: for instance, a <CustomersSurname> tag can
be reduced to <CS> as long as you keep uniqueness. Descriptive tag names are
irrelevant to storing the data; an end application can supply the wordy
descriptions.

Denis

Thanks. My data files are a mixture of rows from several database
tables; for the most part there is no white space, just tens of
(mostly short, fixed-length, encoded) columns per table, so even the
shortest tag names would at least double the size of the file.

It would be nice to give an XML-like skeleton for each type of
database row at the top of the file, then just tag each record with the
table it comes from and use the appropriate skeleton to parse the text
inside the tag. There may be thousands to tens of thousands of rows of
each type, so the size savings would be considerable, and if there is a
way to do this while staying within established standards, that would
make my day.
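
A sketch of what I am picturing (Python; the header format and every
name in it are invented here, not any existing standard):

import io
import xml.etree.ElementTree as ET

# Invented hybrid format: an XML skeleton header describing each row
# type, then one line per record, tagged with its table name.
HEADER = """<skeletons>
  <table name="customer">
    <field name="id" len="8"/>
    <field name="surname" len="20"/>
  </table>
</skeletons>"""

DATA = """customer:00000001Smith
customer:00000002Jones
"""

# Parse the skeleton header once.
layouts = {}
for table in ET.fromstring(HEADER):
    layouts[table.get("name")] = [
        (field.get("name"), int(field.get("len"))) for field in table
    ]

# Expand each tagged row using its table's layout.
for line in io.StringIO(DATA):
    table_name, _, raw = line.rstrip("\n").partition(":")
    record, pos = {}, 0
    for field_name, width in layouts[table_name]:
        record[field_name] = raw[pos:pos + width].rstrip()
        pos += width
    print(table_name, record)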

I know it is a bit stone-age to complain about storage space, but that
depends on the details of the application, and quadrupling the size
of a really large file can still be expensive. Size also affects
transmission time, especially if encryption is involved. I'm not
knocking XML; I'm hoping to make XML more attractive to more people.
 

Steven Dilley

Dennis Farr said:
"Denis Saunders" <[email protected]> wrote in message

Thanks. My data files are a mixture of rows from several database
tables and for the most part there is no white space but tens of
(mostly short and fixed length and encoded) columns per table, so the
shortest tag names would at least double the size of the file.

It seems like you really, really want to use CSV, but also to get the
seal of approval as XML. The advantage of XML is that there are a lot
of parsers for reading it; if you kludge up the content, you lose that.
However, you can do:

<everything>
<file1>
<row-csv>1,2,3333</row-csv>
</file1>
</everything>

You can also add the CSV headings. Highly unrecommended.
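
If you did go that route, stock parsers would still read it; a minimal
sketch (Python, standard library):

import csv
import xml.etree.ElementTree as ET

doc = """<everything>
<file1>
<row-csv>1,2,3333</row-csv>
</file1>
</everything>"""

# A stock XML parser finds the rows; the csv module splits them and
# handles any quoting or embedded commas.
for row in ET.fromstring(doc).iter("row-csv"):
    for fields in csv.reader([row.text]):
        print(fields)  # ['1', '2', '3333']
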
Dennis Farr said:
I know it is a bit stone-age to complain about storage space, but that
depends on the details of the application, and quadrupling the size
of a really large file can still be expensive. Size also affects
transmission time, especially if encryption is involved. I'm not
knocking XML; I'm hoping to make XML more attractive to more people.

Don't forget compression. All the repetitive tags are reduced to a few bits
each.
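
A quick way to see that for yourself (Python; toy data, so treat the
exact numbers as illustrative only):

import bz2

# One verbose record repeated many times, the way tag-heavy XML is.
record = (b"<voter><last_name>SMITH</last_name>"
          b"<first_name>JAN</first_name></voter>\n")
doc = record * 100_000

# The repeated tags all but vanish under compression.
print(len(doc), len(bz2.compress(doc)))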
 

Andy Dingley

Dennis Farr said:
Size also affects
transmission time, especially if encryption is involved.

No it doesn't. If there are repeated strings in the file, then it
improves compression efficiency. All significant transmissions are
compressed these days, so this verbosity just doesn't matter in
practice. This "XML is inefficient, so use cryptic 2-character element
names" approach is completely bogus.
 

Ed Beroset

Andy said:
No it doesn't. If there are repeated strings in the file, then it
improves compression efficiency. All significant transmissions are
compressed these days, so this verbosity just doesn't matter in
practice. This "XML is inefficient, so use cryptic 2-character element
names" approach is completely bogus.


Have you tried testing that hypothesis? I have, and although I hate
cryptic 2-character element names just as much as you do, the fact is
that it actually does compress better. Here's a link to an IBM site
which illustrates this using test data:

http://www-106.ibm.com/developerworks/xml/library/x-matters13.html

Note, however, that there are probably better ways to address this than
the method mentioned in the article. One possibility might be

http://www.w3.org/TR/wbxml/

It's worth noting that this is NOT a W3C Recommendation. It's also
worth noting that I haven't actually ever tried WBXML, so you can
consider this my own untested hypothesis and treat it accordingly! :)

I would be interested to hear from those who have successfully used
alternative encodings for XML, especially ones for which the size
reduction was a primary motivation.

Ed
 

Andy Dingley

Ed Beroset said:
Have you tried testing that hypothesis?

Yes, about 4 years ago - it's last century's problem.

Even then, I was juggling XML and rich-media. XML is primarily a
format for text content, so it's just _tiny_ in comparison to any
image or video data. There's just no point in worrying over element
name lengths when there are JPEGs on the same server.

Mainly I work in RDF. Fairly long names, lots of repetition of
properties like "type", and honking great URIs all over the place.
Switching <foo> to <fo> isn't going to make a blind bit of difference.

Encoding schemes for embedding binary data into XML content, now
_that's_ an issue worth saving bytes over.
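
For instance, base64, the usual way of smuggling binary into XML text,
inflates the payload by a third before you even start (Python, standard
library):

import base64
import os

payload = os.urandom(30_000)         # stand-in for some binary blob
encoded = base64.b64encode(payload)  # how binary usually rides in XML
print(len(payload), len(encoded))    # 30000 40000: a third bigger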
 

Ed Beroset

Andy said:
Yes, about 4 years ago - it's last century's problem.

Even then, I was juggling XML and rich-media. XML is primarily a
format for text content, so it's just _tiny_ in comparison to any
image or video data.

I don't think that's the kind of data the OP had in mind. In the
context of video data, it might indeed be tiny by comparison, but I
suspect that most of us work with "last century's data" and so we still
think about things like bandwidth, efficiency, and other anachronistic
concepts of engineering.

Andy said:
Mainly I work in RDF. Fairly long names, lots of repetition of
properties like "type", and honking great URIs all over the place.
Switching <foo> to <fo> isn't going to make a blind bit of difference.

In that context, maybe not, but let's try an experiment with real data
of the non-RDF variety.

The experiment:

I chose the Wake County, North Carolina voter database as the source for
my sample data. It's freely downloadable from the web, contains very
typical name and address data, and is large enough (with 415613
records) to support some useful conclusions. I extracted the
first five fields of each record of that plain-text database, which the
state government labels voter_reg_number, last_name, first_name,
midl_name, and name_sufx. I think those are sufficiently expressive
names that we'd all be able to guess their meanings without a second
thought, so I used them as tag names, too. Wrapping each record in
<voter></voter> delimiters, the whole thing in <voters></voters>
tags, and adding minimal other stuff, my test file turns out to be 60685379
bytes long using an 8-bit encoding and Unix-style line endings (one per
record).
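
Generating the test file takes only a few lines. A sketch of what I did
(Python; it assumes a tab-delimited export and hypothetical file names,
so adjust for the actual download format):

import csv
from xml.sax.saxutils import escape

FIELDS = ["voter_reg_number", "last_name", "first_name",
          "midl_name", "name_sufx"]

with open("voters.txt", newline="") as src, \
     open("voters1.xml", "w") as out:
    out.write("<voters>\n")
    for rec in csv.reader(src, delimiter="\t"):  # delimiter is a guess
        out.write("<voter>")
        # zip() pairs each tag with its field and stops after five.
        for tag, value in zip(FIELDS, rec):
            out.write("<%s>%s</%s>" % (tag, escape(value), tag))
        out.write("</voter>\n")
    out.write("</voters>\n")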

Compression:

First, I tried various techniques to reduce the size of the XML file.

The original file is voters1.xml and each voter record has these fields:
voter_reg_number, last_name, first_name, midl_name, and name_sufx

The second file is voters2.xml and each voter record has these fields:
reg_number, last_name, first_name, midl_name, and name_sufx
(The change is that voter_reg_number became just reg_number.)

The third file is voters3.xml and each voter record has these fields:
reg_number, name
Within name there are four fields: last, first, midl, and sufx
(The change is that name now has subfields.)

The fourth file is voters4.xml and each voter record has these fields:
reg_number, foo
Within foo there are four fields: last, first, midl, and sufx
(The change is that name is changed to foo.)

The fifth file is voters5.xml and each voter record has these fields:
reg_number, fo
Within fo there are four fields: last, first, midl, and sufx
(The change is that foo is changed to fo.)

Here are the sizes and names of the files generated:

60685379 voters1.xml
55697543 voters2.xml
44474912 voters3.xml
43643606 voters4.xml
42812300 voters5.xml

18250 voters1.xml.bz2
17519 voters2.xml.bz2
14251 voters3.xml.bz2
13921 voters4.xml.bz2
12520 voters5.xml.bz2

I'll leave it to you to analyze all the details, since I've provided all
the data to do that, but I thought I'd point out a couple of salient
points. Just a judicious use of shorter tags gives a compressed file
that's 22% smaller (voters3.xml.bz2 compared to voters1.xml.bz2) and no
less comprehensible to humans. Also, note that contrary to your guess,
changing a single tag from <foo> to <fo> yields a 10% decrease in the
size of the compressed files (voters5.xml.bz2 compared to
voters4.xml.bz2), even though the uncompressed versions of those files
decreased in size by less than 2%.

Conclusions:
1. Using shorter tags may indeed save transmission time.
2. Restructuring "flat" data may give better results without sacrificing
clarity to human readers.
3. Sometimes results are counterintuitive and data-dependent. Measuring
effects on your actual data and comparing those to the engineering
problem to be solved is the only sure way to proceed.
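
If anyone wants to rerun the measurements on their own variants, the
loop is short (Python; it assumes the file names above):

import bz2

# Print raw and bz2-compressed sizes for each variant.
for name in ["voters1.xml", "voters2.xml", "voters3.xml",
             "voters4.xml", "voters5.xml"]:
    with open(name, "rb") as f:
        raw = f.read()
    print(name, len(raw), len(bz2.compress(raw)))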

I hope that helps clarify things. If anyone would like to duplicate
this experiment, you can find the raw data at
http://msweb03.co.wake.nc.us/bordelec/Waves/WavesOptions.asp

Ed
 

Andy Dingley

When the data is as voluminous as, for example, an individual's
genetic makeup on the back of a health card, what if the space taken
up by the XML tags is much larger than the data itself?

What indeed. Moore's Law. Throw some hardware at it.

The problem is not about storing this stuff. My mobile phone gives a
gazillion bytes over to just storing ring tones. I don't even know how
big the HD in my laptop is; it's just "big". Storage is not today's
big problem.

Now go to a library and work with MARC records for a while (or SS7, or
almost anything where ASN.1 has played a part). Then find some old
records from such a system and try to make sense of them. Chances are
you can't. This is a serious problem. Find a digital dataset that's
over 10 years old and try to read it. The failure rate is terrifying
(read up on the BBC's Domesday Project).

I don't give a damn about storage size - not my problem, I've got
computers to do that for me. What I care about is future human
understandability, or if I'm really lucky, machine understandability.
Is that the next logical step of evolution after XML?
Bioinformatics is just one example of really huge data files.

Go take a look at Stanford's Protege project.

Or RDF, or DAML, or OWL.


Right track? It's not even leaving the station.

This is a regular approach to the problem, and it's more bogus than a
Cayman Islands $3 bill. Taking the dataset (with the implicit
assumption that all XML data is extracted from an RDBMS) and then
labelling it as "row/column" adds nothing to the semantics of the
representation; it just perpetuates the database structure you've
pulled it from. It's no better than CSV!

XML has a restrictive data model. It's a single-rooted tree, when the
real world is more like a directed graph. But even so, it's a lot more
expressive than this narrow "everything is a rectangular grid"
approach.
 

Steven Dilley

This depends on the sequence. Encrypt-then-compress does poorly:
the repetitive tags are transformed into dissimilar strings, and those don't
compress. Compress-then-encrypt is as good as plain compression.
Q: Which order is actually used? What does https do? What if the
source files are already encrypted?
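
A toy demonstration of why the order matters (Python, standard library;
the "encryption" is just an XOR with a random keystream, a stand-in for
a real cipher):

import bz2
import os

doc = b"<voter><last_name>SMITH</last_name></voter>\n" * 10_000

def toy_encrypt(data: bytes) -> bytes:
    # Good ciphertext looks like noise, and noise does not compress.
    key = os.urandom(len(data))
    return bytes(a ^ b for a, b in zip(data, key))

compress_then_encrypt = toy_encrypt(bz2.compress(doc))
encrypt_then_compress = bz2.compress(toy_encrypt(doc))

# The first stays tiny; the second ends up about the size of the
# original document.
print(len(compress_then_encrypt), len(encrypt_then_compress))
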
Ed Beroset said:
Have you tried testing that hypothesis? I have, and although I hate
cryptic 2-character element names just as much as you do, the fact is
that it actually does compress better. Here's a link to an IBM site
which illustrates this using test data:

http://www-106.ibm.com/developerworks/xml/library/x-matters13.html

Very interesting analysis. To get the maximum compression, it looks like
we need to compress before sending, rather than relying on the comm
link to choose compression for us.
 
