using large XML for interfaces

averst69

Hi y'all,

A client of our company asked me to look for alternatives for the
current interfacing method they use now.
They use large files in which the data are stored in records like
below:
0000000XX21123456789DoeJohn11091901MWashington etc.
Fields, in order: record ID, ID, name, birth, sex, place of birth


These files are huge (over 100 MB).
They asked our company to look for an alternative method of interfacing
these data. They were thinking of using XML.
What I've read so far is that, for large data sets, it's
not wise to use XML:
- The files become even bigger
- Processing time goes up, with massive memory usage

What do you guys think of these arguments? Are there any other
alternatives?

greetz Aschwin
 
Andy Dingley

These files are huge (over 100 MB).
They asked our company to look for an alternative method of interfacing
these data. They were thinking of using XML.

XML is an easy, almost trivial, drop-in replacement for CSV file
interchange like this. There are only a couple of issues to be aware
of:

* Use an event-driven parser like SAX, not a monolithic "parse it all
then use it" DOM

* XML requires a "complete" document, so it's hard to read
documents that are still being appended to. (There must be exactly one
root element: its start tag opens the document and its end tag closes
it.)
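
A minimal sketch of the event-driven approach, using Python's standard xml.sax module. The element names (people, person, surname) and the sample data are assumptions made up for illustration; nothing in this thread defines a schema. The handler receives one event at a time, so the parser never needs the whole document in memory.

```python
import io
import xml.sax


class PersonHandler(xml.sax.ContentHandler):
    """Counts <person> records and collects surnames as events arrive."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.surnames = []  # demo only; real code would process and discard
        self._in_surname = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "surname":
            self._in_surname = True
            self._buf = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self._in_surname:
            self._buf.append(content)

    def endElement(self, name):
        if name == "surname":
            self.surnames.append("".join(self._buf))
            self._in_surname = False
        elif name == "person":
            self.count += 1


# Hypothetical sample document (element names are assumptions).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<people>
  <person><id>123456789</id><surname>Doe</surname><forename>John</forename></person>
  <person><id>987654321</id><surname>O'Reilly</surname><forename>Ann</forename></person>
</people>
"""

handler = PersonHandler()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.count, handler.surnames)
```

For a real file, a filename or an open binary file object can be passed to xml.sax.parse in place of the BytesIO.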

Non issues are:

* Verbosity. Practical XML documents are frequently smaller than
equivalent fixed-field documents because they handle sparse data much
better. They can even be smaller than some CSV formats.

* Verbosity. Yes, XML adds repeated tag names to the document. In
practice this just isn't a problem (for one thing they get compressed
very well in transmission). It's certainly no reason to try for
unreadable <NAM> <ADR> element names!

* Speed. XML parsers are efficiently coded against formal syntax
definitions. They almost always beat custom-written informal parsers
written in application scripting languages.
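
The compression point can be checked with a rough experiment like the one below. It is only a sketch: the records are synthetic and highly repetitive (hypothetical element names, the same person in every record apart from the ID), so the ratio it prints will overstate what real data gives.

```python
import gzip

# Synthetic records with hypothetical element names; only the ID varies.
records = [
    f"<person><id>{i:09d}</id><surname>Doe</surname>"
    f"<forename>John</forename><birthplace>Washington</birthplace></person>\n"
    for i in range(10_000)
]
raw = ("<people>\n" + "".join(records) + "</people>\n").encode("utf-8")

# Repeated tag names are exactly what a general-purpose compressor removes.
packed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzipped: {len(packed)} bytes")
```

Running the same measurement on a sample of the real interface file would be the meaningful test.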

Particular benefits are:

* Reliability. XML _works_. It works reliably for any input data too,
because it's a well thought-through protocol. Wave goodbye to all those
awkward names that broke the comment or apostrophe escaping algorithm
you had coded by a junior intern. "O'Reilly" won't break it, nor will
<an arabic name I can't even paste into Usenet>

* Internationalization. Oh yes. It just does it. For any encoding.
With no effort on your part. Rejoice!

* Interoperability. Your XML is my XML. Guaranteed. No more CSV
encoding hangups between systems.
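
As an illustration of the escaping and internationalization points, here is a small round trip through Python's standard xml.etree.ElementTree. The names and the element names (people, name) are invented for the example. The library does the escaping on output and undoes it on input, so no hand-written escaping code is involved.

```python
import xml.etree.ElementTree as ET

# Awkward values: apostrophe, markup characters, and non-ASCII text.
names = ["O'Reilly", "Smith & Sons <Ltd>", "Jürgen", "やまだ", "Dvořák"]

root = ET.Element("people")
for n in names:
    ET.SubElement(root, "name").text = n

# Serialize to UTF-8 bytes; '&' and '<' are escaped by the library.
data = ET.tostring(root, encoding="utf-8")

# Parse the bytes back and recover the original strings unchanged.
parsed = [e.text for e in ET.fromstring(data).iter("name")]
print(parsed == names)
```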


There _are_ good ways to break XML.

In particular, XML isn't a database. A 100MB document is certainly
workable as a transfer document, but it's not usually a good idea to
load it into a DOM and then try repeated random lookups into it.
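
On the producing side, a transfer document of that size can also be written one record at a time, without building a tree. The sketch below uses XMLGenerator from Python's standard xml.sax.saxutils. The field names and widths in LAYOUT are hypothetical: the thread shows a sample record but never gives the real column widths, so they are assumptions for illustration only.

```python
import io
from xml.sax.saxutils import XMLGenerator

# HYPOTHETICAL layout: (field name, width). Not taken from the real interface.
LAYOUT = [("record_id", 11), ("id", 9), ("surname", 10), ("forename", 10),
          ("birth", 8), ("sex", 1), ("birthplace", 15)]


def split_record(line):
    """Cut one fixed-width line into a dict of stripped field values."""
    fields, pos = {}, 0
    for name, width in LAYOUT:
        fields[name] = line[pos:pos + width].strip()
        pos += width
    return fields


def convert(lines, out):
    """Stream fixed-width lines to XML, one <person> element per line."""
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startDocument()
    gen.startElement("people", {})
    for line in lines:
        gen.startElement("person", {})
        for name, value in split_record(line.rstrip("\n")).items():
            if value:  # empty fields are simply omitted
                gen.startElement(name, {})
                gen.characters(value)  # escaping is handled here
                gen.endElement(name)
        gen.endElement("person")
    gen.endElement("people")
    gen.endDocument()


# One made-up record, padded to the hypothetical widths above.
sample = ("0000000XX21" + "123456789" + "Doe".ljust(10) + "John".ljust(10)
          + "11091901" + "M" + "Washington".ljust(15))

out = io.StringIO()
convert([sample], out)
print(out.getvalue())
```

With a real file, the list of lines would be the open input file itself and out an open output file, so only one record is held in memory at a time.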

XML isn't a messaging protocol either. If you have lots of tiny
messages flying around, then wrap up your XML in something else (maybe
SOAP) and use that. You can't do some of the tricks you used to do with
a CSV file, such as treating them like a pipe and reading from one end
whilst still writing to the other.

XML has rules, so stick to them. Vanilla ASCII isn't too much trouble,
but if you're going to throw lots of "<", や or ř around, then
learn what the options for encoding them are and use them correctly.
 
Jürgen Kahrs

A client of our company asked me to look for alternatives for the
current interfacing method they use now.

Why are they looking for alternatives?
Why change it if it works?

They use large files in which the data are stored in records like
below:
0000000XX21123456789DoeJohn11091901MWashington etc.
Fields, in order: record ID, ID, name, birth, sex, place of birth


These files are huge (over 100 MB).
They asked our company to look for an alternative method of interfacing
these data. They were thinking of using XML.

If they use XML, they are "buzz-word compliant".
Are there technical or political reasons for XML?
If there are political reasons, then stop arguing
in technical terms.

What I've read so far is that, for large data sets, it's
not wise to use XML:
- The files become even bigger
- Processing time goes up, with massive memory usage

Andy Dingley has summarized the advantages of XML quite well.
I disagree with him when it comes to file size: in your case
(fixed-format data) the XML variant _will_ be bigger.

What do you guys think of these arguments? Are there any other
alternatives?

Andy has already pointed out that XML data may contain
any German Umlaut, Cyrillic or Japanese special character
that you will ever find. This _is_ an advantage.
 
Andy Dingley

Jürgen Kahrs said:
Why are they looking for alternatives?
Why change it if it works?

My current project involves pumping vast comma-delimited and
fixed-field files around between vendors, with what looks like similar
data to the OP.
Believe me, there are _plenty_ of reasons to move to XML, not just
fashion.

If they use XML, they are "buzz-word compliant".

XML hasn't been a hot buzzword for years now; it's just plumbing.

If there are political reasons, then stop arguing
in technical terms.

Always wise advice!

I disagree with him when it comes to file size: In your case
(fixed format data) the XML variant _will_ be bigger.

I've just seen a factor-of-4 shrinkage in going to XML from a fixed-field file
with address data in it. Most of the original file was simply empty
space for spare address lines, but we were faithfully shipping it
around as a couple of MB of whitespace.
 
Juergen Kahrs

Andy said:
My current project involves pumping vast comma-delimited and
fixed-field files around between vendors, with what looks like similar
data to the OP.

But he had fixed-width data; it looked like there were
no sparse lines.

XML hasn't been a hot buzzword for years now; it's just plumbing.

The problem for many conservative Unix users is that
the plumbing can't be done with their usual toolset.
XML requires a new toolset and deprecates the old toolset.

I've just seen a factor-of-4 shrinkage in going to XML from a fixed-field file
with address data in it. Most of the original file was simply empty
space for spare address lines, but we were faithfully shipping it
around as a couple of MB of whitespace.

If there really are blank fields in the data, then it
sounds plausible that the amount of data shrinks.
But the OP's data didn't look sparse.
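
Both sides of the size argument can be seen in a small comparison like the one below. It is a sketch under stated assumptions: the field names and widths in WIDTHS are invented (an address-style record with spare lines), and the two records are chosen as extremes, one mostly empty and one with every field filled to its full width.

```python
from xml.sax.saxutils import escape

# HYPOTHETICAL fixed-field layout with spare address lines.
WIDTHS = {"name": 30, "addr1": 40, "addr2": 40, "addr3": 40,
          "addr4": 40, "city": 30}


def fixed_width(record):
    """Every record occupies the full width, padded with spaces."""
    return "".join(record.get(name, "").ljust(width)
                   for name, width in WIDTHS.items()) + "\n"


def as_xml(record):
    """Only non-empty fields are written; tag names are the overhead."""
    body = "".join(f"<{name}>{escape(value)}</{name}>"
                   for name, value in record.items() if value)
    return f"<person>{body}</person>\n"


# Sparse record: most address lines are unused.
sparse = {"name": "John Doe", "addr1": "1 Main Street", "city": "Washington"}
# Dense record: every field filled to its full width.
dense = {name: "x" * width for name, width in WIDTHS.items()}

for label, rec in (("sparse", sparse), ("dense", dense)):
    print(label, "fixed:", len(fixed_width(rec)), "xml:", len(as_xml(rec)))
```

With these made-up records the sparse one is smaller as XML and the dense one is larger, so which way the real interface file goes depends on how full its fields are.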
 
