Java and huge XML file to be parsed

Jezuch

Roedy Green wrote:
You are missing my point. I believe that both XML and HTML, the thing
actually posted should be binary formats. No one would ever read or
edit them directly, guaranteed to meet the spec, preparsed. Anything
hand-coded with notepad is guaranteed to have some errors.(..)

But then you'd never see WWW as it is today. Heck, you'd never see WWW at all ;)
 
Roedy Green

But then you'd never see WWW as it is today. Heck, you'd never see WWW at all ;)

I disagree. The tools just clean up. Think how many times you go to a
website and the code does not work with your browser.

Had we used a binary format:

1. web browsing would be at least twice as fast.

2. you would have far fewer problems with browsers not rendering as
expected.

It is just people would have used more appropriate tools to create the
web content.

Note how often people are now moving to PDF. Part of that is for fine
control, but much of it is simply to avoid all the variation in
HTML and syntax errors.
 
Christophe Vanfleteren

Roedy said:
You are missing my point. I believe that both XML and HTML, the thing
actually posted should be binary formats. No one would ever read or
edit them directly, guaranteed to meet the spec, preparsed. Anything
hand-coded with notepad is guaranteed to have some errors. Even
though I validate my HTML daily, you will always find some HTML errors
in there, and also some quasi errors that I tell the verifier to
ignore. My site is very clean compared with most.

I'm afraid that doesn't make much sense. Validation is a binary property.
Either something validates, or it doesn't.

Why are you so afraid about someone putting out non-validating HTML/XML? If
all browsers had started out with strict parsers (and if the WYSIWYG
programs created valid HTML), we wouldn't have had the problem with HTML we
have today. Browsers were way too liberal in what they accepted, and that
got us in the mess we are in today. If the web had started out with an xml
format, no non-validating pages would be found, since no browser would let
you view them (and I suspect that even the most ignorant Frontpage monkey
tests their pages at least once, albeit just in the very latest IE
version :)
See http://mindprod.com/jgloss/xml.html and
http://mindprod.com/projects/htmlcompactor.html for the sort of
formats I had in mind.


When you want to view the HTML/XML you use a viewer or editor.
Traditionalists could fluff it up to something like conventional HTML or
XML for viewing. I would prefer something more graphic like a JTree or
WYSIWYG.

How many of you are old enough to remember WordStar? It was
conceptually easy to understand because you embedded visible tags in
your text. Then Word came along and hid the tags, and just let you
think in terms of the final outcome. It drove everyone mad at first
since Word did such a bad job of the internal tags, but in the long
run the impossibility of getting invalid or unbalanced tags won out.

XML is just about data, so you don't have that same problem. With
HTML it would be a lot easier to collapse and clean up a preparsed tree.

There are HTML/XML editors out there that let you view your page as a tree.
So I guess it is even possible without a binary format.
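
A minimal sketch of that tree view, assuming nothing beyond the JDK's built-in DOM parser. The class name XmlTreeViewSketch and the method toTree are made up for illustration; it parses ordinary text XML and prints the element tree as an indented outline, which is roughly what such editors show:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: renders text XML as an indented tree outline.
final class XmlTreeViewSketch {

    // Parses the XML text and returns one line per element or text node.
    static String toTree(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            append(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            // The parser rejects anything that is not well-formed.
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Depth-first walk: two spaces of indentation per nesting level.
    private static void append(Node node, int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append(node.getNodeName()).append('\n');
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                append(children.item(i), depth + 1, out);
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.append(indent).append('"').append(text).append("\"\n");
            }
        }
    }

    public static void main(String[] args) {
        System.out.print(toTree("<html><body><p>Hi</p></body></html>"));
    }
}
```

The same walk could feed a Swing JTree instead of a StringBuilder; the point is only that the tree is recoverable from the text form once it parses.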
 
Jezuch

Roedy Green wrote:
It is just people would have used more appropriate tools to create the
web content.

This one is *the* problem. People are lazy. Imagine what would happen if you
developed something like this and said to them "it's all fine, but you have
to use THIS tool". I presume that no one would bother to get it...
 
Roedy Green

This one is *the* problem. People are lazy. Imagine what would happen if you
developed something like this and said to them "it's all fine, but you have
to use THIS tool". I presume that no one would bother to get it...

If XML and HTML were binary formats there would be MORE tools to
choose from because it is so much easier to work with a binary format
than one you have to parse and that is CRAM FULL OF SYNTAX ERRORS.
 
Roedy Green

I'm afraid that doesn't make much sense. Validation is a binary property.

In an ideal world that would be true, but it most certainly is not in
the world of HTML.

Have a look at the hundreds of option switches on HTMLValidator.
See http://mindprod.com/jgloss/htmlvalidator.html

Look at how many official W3C validation HTML standards there are at
http://mindprod.com/jgloss/htmlcheat.html#DOCTYPE

There are so many to allow for varying degrees of anal retentiveness.

If HTML were a binary format this would not be a concern to anyone but
tool writers.

When you go for a human readable, human editable format, you
necessarily introduce tolerance for error, variation and general
sloppiness. With a binary format, you can be like Mussolini and make
the trains run on time, without anyone feeling the internal Fascism.
 
Stefan Ram

Roedy Green said:
If HTML were a binary format this would not be a concern to
anyone but tool writers.

On a digital computing engine or storage system, an HTML or
XML document indeed usually is stored as a sequence of binary
digits.

What does "binary format" mean to you?
 
Roedy Green

On a digital computing engine or storage system, an HTML or
XML document indeed usually is stored as a sequence of binary
digits.

What does "binary format" mean to you?

See the two essays I referred to earlier in this thread.
 
Roedy Green

See the two essays I referred to earlier in this thread.

Let me sell you the idea in stages.

What if XML were not considered valid unless it contained a signature
by a tool that included a checksum that asserted the file conformed to
the DTD and XML in general. The tool would identify itself as part of
the signature.

An XML-generating library or editor would provide this.

The next stage would be certification of such verifiers.
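
A rough sketch of what such a stamp might look like, under loud assumptions: the tool name "example-checker/0.1", the stamp layout "tool:checksum", and the class XmlStampSketch are all invented here, the check is only for well-formedness (not conformance to a DTD), and CRC32 is a plain checksum, not a cryptographic signature.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stamping tool: refuses to stamp XML that does not parse.
final class XmlStampSketch {

    // Assumed tool identifier; the thread does not name one.
    static final String TOOL_ID = "example-checker/0.1";

    // True if the JDK's SAX parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // CRC32 over the UTF-8 bytes of the document, as lowercase hex.
    static String checksum(String xml) {
        CRC32 crc = new CRC32();
        crc.update(xml.getBytes(StandardCharsets.UTF_8));
        return Long.toHexString(crc.getValue());
    }

    // Sender side: check first, then emit "tool:checksum".
    static String stamp(String xml) {
        if (!isWellFormed(xml)) {
            throw new IllegalArgumentException("refusing to stamp malformed XML");
        }
        return TOOL_ID + ":" + checksum(xml);
    }

    // Receiver side: trust the named tool, only confirm the bytes are unchanged.
    static boolean verify(String xml, String stamp) {
        return stamp.equals(TOOL_ID + ":" + checksum(xml));
    }

    public static void main(String[] args) {
        String doc = "<a><b/></a>";
        String s = stamp(doc);
        System.out.println(s + " verifies: " + verify(doc, s));
    }
}
```

Note that the receiver in this sketch does no parsing before accepting the stamp; it only recomputes the checksum. That is the whole bargain being proposed, and it is only as good as the trust placed in the stamping tool.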
 
Stefan Ram

Roedy Green said:
Let me sell you the idea in stages.
What if XML were not considered valid unless it contained a signature
by a tool that included a checksum that asserted the file conformed to
the DTD and XML in general.

XML files already now are considered valid only if they are
valid (relative to a DTD).

When you design a contract with business partners, you are
free to negotiate that, when a party is liable to deliver XML
files, any file that is not a valid XML file will not have to
be accepted as a fulfillment of that liability by the other
party.

Otherwise, it will be difficult to make major browser
manufacturers create browsers that display HTML and XHTML
files only if they are valid. The tolerance of most browsers
("quirks mode") is a cause of the many invalid files around.
But IE is already more strict in this regard, when it is given
an XML file with the file type or file name extension "xml".

The border line in this case possibly is not between "text
format" and "binary format" but between strictness and
sloppiness (tolerance towards invalid files) of the tools
available.

(Still - without a reference to XML, validity, and all that:
it might be difficult to define the notion "binary file" - Can
one write a Java program that, given any file, will print
"yes" or "no" in order to indicate whether that file is a
"binary file"? If I were given the task, my solution would be
"System.out.println( \u0022yes\u0022 );" for every file [which
is a readable file].)
 
Roedy Green

Otherwise, it will be difficult to make major browser
manufacturers create browsers that display HTML and XHTML
files only if they are valid.

Yes, the way it is now. But if the validation process also produced a
compact binary representation, then there would be no fuss and no need
to have ANY unvalidated files ever passed around. The binary format
is an implied validation. Very few people would try to construct them
without a tool that prevents errors.
 
Roedy Green

"binary file" - Can
one write a Java program that, given any file, will print
"yes" or "no" in order to indicate whether that file is a
"binary file"?

By a binary file, I mean one that does not need a parser to
read it. It is predigested, with everything in the most convenient,
most compact possible form. Strings are counted, offsets precomputed,
tokens represented by small ints, everything for the convenience of
the computer, and nothing for inherent human readability. If you want
to read it, convert it to something readable or use a viewer, which
can then instantly extract from the file just what it wants without a
giant parser. You CAN'T put data into the file that does not conform
to its contract. E.g. fields should have declared bounds and types.
Lengths of various fields are implicit. They don't have to be demarcated
for fixed-length fields.

Dates are stored as UTC timestamps. Zip codes are stored as binary
ints. Strings are stored in UTF-8, no farting around with variable
encodings.

The opposite of a binary file is a printable or human-readable file
which is designed to be directly human comprehensible and editable
with Notepad.
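
To make that description concrete, here is a small sketch using java.io.DataOutputStream and DataInputStream. The record layout (one token byte, a counted UTF-8 string, a UTC timestamp as a long, a zip code as an int), the token value, and the class name BinaryRecordSketch are all assumptions made up for illustration, not a format anyone in this thread has specified:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical record format: token byte, counted UTF-8 string, long, int.
final class BinaryRecordSketch {

    // Assumed token table: a small int stands in for a tag name.
    static final int TOKEN_ADDRESS = 1;

    static byte[] write(String name, long utcMillis, int zip) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeByte(TOKEN_ADDRESS);            // token as a small int
            byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
            out.writeInt(utf8.length);               // counted string: length first
            out.write(utf8);
            out.writeLong(utcMillis);                // fixed 8 bytes, no delimiter
            out.writeInt(zip);                       // fixed 4 bytes, no delimiter
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the fields straight off the stream; no tokenizer or grammar involved.
    static String read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int token = in.readUnsignedByte();
            if (token != TOKEN_ADDRESS) {
                throw new IllegalArgumentException("unknown token " + token);
            }
            int length = in.readInt();
            if (length < 0 || length > in.available()) {
                throw new IllegalArgumentException("bad string length " + length);
            }
            byte[] utf8 = new byte[length];
            in.readFully(utf8);
            String name = new String(utf8, StandardCharsets.UTF_8);
            long utcMillis = in.readLong();
            int zip = in.readInt();
            return name + "|" + utcMillis + "|" + zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] record = write("Victoria", 0L, 90210);
        System.out.println(record.length + " bytes: " + read(record));
    }
}
```

The reader still has to reject a bad token or an impossible length, so "no parser" here means no text grammar, not no checking at all.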
 
Steven J Sobol

Christophe Vanfleteren said:
There are HTML/XML editors out there that let you view your page as a tree.
So I guess it is even possible without a binary format.

One of the major browsers will even let you do that (Mozilla and derivatives,
using the DOM Inspector).
 
Steven J Sobol

Stefan Ram said:
On a digital computing engine or storage system, an HTML or
XML document indeed usually is stored as a sequence of binary
digits.

What does "binary format" mean to you?

I'm thinking he meant "not human readable using a character set like
ASCII or UTF-8"...
 
William Brogden

I disagree. The tools just clean up. Think how many times you go to a
website and the code does not work with your browser.

Had we used a binary format:

1. web browsing would be at least twice as fast.

2. you would have far fewer problems with browsers not rendering as
expected.

Wow, you are really off-base with this one Roedy. Most people
got started with web pages by copying the "source" and modifying
it. It was easy because you could do it with just about any editor.

Sure, a lot of bad web pages resulted, but it got people interested,
then they could go buy an HTML for Dummies book.

Bill
 
Roedy Green

Wow, you are really off-base with this one Roedy. Most people
got started with web pages by copying the "source" and modifying
it. It was easy because you could do it with just about any editor.

You could have done the same thing with a binary format. It is not as
though the format would be proprietary. You could either have used a
binary editor with cut/paste, or you could have converted to a fluffy
representation, edited, and converted back before posting.

You would be far FURTHER ahead as a newbie, because the
editor/preparer would ensure you always created valid html.
 
Roedy Green

One of the major browsers will even let you do that (Mozilla and derivatives,
using the DOM Inspector).

But why go to all that bother of parsing? Why not pass the document
around pre-parsed and hence error-free? The sender, not the recipient,
should be the one to deal with errors.
 
Dimitri Maziuk

Roedy Green sez:
You are missing my point.

No. The point I'm disagreeing with is that WYSIWYG is better
than notepad and Word is better than Wordstar. I mean, I'm
sure they are, for certain values of "better". It just so
happens that my definition of "better" seems to be quite
different from yours.

....I believe that both XML and HTML, the thing
actually posted should be binary formats.

Which part of "Text Markup Language" escapes you? XML exists
only because of dot-net-bubble: we _have_ to be able to embed
not-text in HTML, otherwise we'd have to fill our website with
actual _content_ instead of shiny flash animations, and... khmm,
ermm, we don't actually have it.

The whole point of XML is to be "like HTML". If we wanted a
binary format, we could've used ASN.1.

.... Anything
hand-coded with notepad is guaranteed to have some errors.

Reminds me of electronic commerce 101 elective I took in the
uni. We had a 2-part assignment: you create a website, teacher
looks at it and tells you to change a thing or two, and you do
the change while he's watching. The moment I opened html file
in notepad, the teacher said "you can go, you passed".

Anyone who knows how to hand-code $foo in notepad has way more
clue about $foo than most gooey wysiwyg lusers. Both will make
mistakes, but only the hand-coder knows how to fix them.

.... It drove everyone mad at first
since Word did such a bad job of the internal tags, but in the long
run the impossibility of getting invalid or unbalanced tags won out.

Nah, you're definitely posting from a different universe. Last
I looked at HTML output of Word, unbalanced tags was exactly
what I saw.

Must be nice, to live in a world where software works the way
it should...

Dima
 
Roedy Green

Which part of "Text Markup Language" escapes you? XML exists
only because of dot-net-bubble:

You can use whatever representation you like for creating the markup.
That should be human-friendly. Different purposes require different
sorts of editing tools. However, when you hand it over to others it
should be in a computer-friendly, i.e. error-free, rigidly standard,
binary format.

If you hand around raw human-created markup you are INEVITABLY going
to be distributing errors and variation. You introduce slop and error,
polluting the planetary information base. You cause others to choke on
your errors.

Simple validation does not work. Think of what fraction of the
planet's XML or HTML documents would pass a complete W3C validation
suite: perhaps under 1%. Using a binary format solves that problem in
one fell swoop, with the additional benefits of:

1. more compact, faster download.
2. faster processing.
3. tighter specification.
4. fewer people have to understand it.
5. simpler classes needed to process it, important in handhelds.
 
