Newbie question about <! and <?

Asger Jørgensen

Hi there

I am writing an XML parser for some very strict XML files and it
is going quite well, but I have a little trouble with the tags that
start with <! and <? and I don't exactly know what to do with them.
There is of course also the <!-- comment, but that is more logical to me.

At the beginning of the document I have
<?xml version...... that makes sense

then sometimes there is a:
<?xml-stylesheet.....that also makes sense

then the document starts:
<Invoice xmlns=........

Then sometimes down in the document there is a:
<?TestInstance
ResponseTo="smtp:[email protected]"
description= "apply your comment here"
?>

and this is where I get confused.

Would somebody be so kind as to explain to me what the rules are
for those tags and what they are used for / what I can do with them.

I have also seen some other XML documents with an extra
<!TagName, at least I think I have, because now I can't reproduce
it without getting an error from IE.

Thanks in advance
Asger
 
Richard Tobin

Asger Jørgensen said:
I am writing an XML parser for some very strict XML files

If you're serious about writing an XML parser, you'll have to read
the standard in detail.

http://www.w3.org/TR/REC-xml/

You need to know XML thoroughly before you write a parser for it! You
can't do it just by looking at examples.

-- Richard
 
Asger Jørgensen

Hi Richard

Richard Tobin said:
If you're serious about writing an XML parser, you'll have to read
the standard in detail.

I had already spent a lot of time on w3.org/TR/REC-xml
before I asked my question.
http://www.w3.org/TR/REC-xml/

You need to know XML thoroughly before you write a parser for it! You
can't do it just by looking at examples.

As I mentioned, these XML files are very strict and I have absolutely
no intention of writing an XML parser that can read all kinds of XML files.
If that was what I needed I would use one already made.

I need to get some values from these XML files, nothing else.
But I do need to get the right values.

If You have an answer to my question, please tell me.
And please don't waste Your energy on telling me what I can't do.

I am a newbie to XML files, but I am most certainly not a newbie
when it comes to writing parsers.

Thanks in advance
Asger
 
Joe Kesselman

You really could find answers to these questions by looking at a good
XML tutorial and/or the XML specification.

<! and !> delimit comments. They should contain no information you
actually care about. (See http://www.w3.org/TR/REC-xml/#sec-comments)

<? and ?> delimit two different things. At the very start of the
document (before anything except the optional byte-order mark), they
delimit the XML Declaration or Text Declaration, which specifies among
other things which character set the XML file was written in. (See
http://www.w3.org/TR/REC-xml/#sec-TextDecl and
http://www.w3.org/TR/REC-xml/#dt-xmldecl).

Elsewhere, they delimit Processing Instructions, which are metadata that
is meaningful only to the specific application(s) they are intended for
and whose syntax (outside of the PI's delimiters and name) is pretty
much determined by those applications. If you're dealing with PIs, you
have to ask whoever created them to explain what they're intended to
mean -- or ignore them, because XML intended them to be used only for
hints for efficient processing rather than for information that's
actually important to the document's meaning. (See
http://www.w3.org/TR/REC-xml/#dt-pi)
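
As a rough illustration of that distinction, here is a minimal sketch in Python (picked only for illustration; the sample document, namespace URI and stylesheet name are made up). It uses the standard library's xml.dom.minidom, which reports processing instructions as their own node type, separate from elements and comments, and does not report the XML declaration as a PI:

```python
from xml.dom import minidom

# Hypothetical sample modelled on the fragments quoted in the question;
# the namespace URI, href and attribute values are invented.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="invoice.xsl"?>
<Invoice xmlns="urn:example:invoice">
  <!-- a comment: not part of the data -->
  <?TestInstance description="apply your comment here"?>
  <Total>100.00</Total>
</Invoice>"""


def list_processing_instructions(xml_text):
    """Return (target, data) pairs for every PI in the document."""
    doc = minidom.parseString(xml_text.encode("utf-8"))
    found = []

    def walk(node):
        # PIs are a distinct node type; comments and elements are skipped here.
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            found.append((node.target, node.data))
        for child in node.childNodes:
            walk(child)

    walk(doc)
    return found


pis = list_processing_instructions(SAMPLE)
for target, data in pis:
    print(target, "->", data)
```

The parser hands back only the target name and an uninterpreted data string. Anything that looks like attributes inside a PI is just text as far as XML is concerned, so its meaning is up to whoever defined the PI.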


Note that, as hinted at above and by the other respondents, handling XML
properly involves a number of complications including character set
issues, numeric character references, parsed entity references, and so
on. Writing a reliable XML parser is a decent term project; it isn't
trivial. If at all possible, I *strongly* recommend that you use an
off-the-shelf XML parser rather than reinventing this particular wheel;
they're available in many programming languages at this point, often as
free software, and letting someone else deal with all these details will
save you a LOT of work. You may be able to get away with something
quick-and-dirty now -- the "desperate perl hacker" school of code
development, where it only has to work for one dataset and then will be
thrown away -- but that kind of shortcut *will* turn around and bite
your kneecaps off sooner or later. Better to solve the problem properly,
once, and not have to worry about it, especially when you can take
advantage of someone else's already-existing solution.
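
For comparison, a minimal sketch of the off-the-shelf route, here with Python's standard xml.etree.ElementTree. The language, the namespace URI and the element names are assumptions for illustration and are not taken from the actual invoice format:

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice; namespace URI and element names are invented.
INVOICE = """<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:example:invoice">
  <?TestInstance description="skipped by the default tree builder"?>
  <!-- comments are skipped as well -->
  <ID>4711</ID>
  <Total currency="DKK">1250.00</Total>
  <Note>Fish &amp; chips &#248;</Note>
</Invoice>"""

# Prefix-to-URI map used only by the find/findtext calls below.
NS = {"inv": "urn:example:invoice"}

root = ET.fromstring(INVOICE.encode("utf-8"))

invoice_id = root.findtext("inv:ID", namespaces=NS)
total = root.find("inv:Total", NS)
note = root.findtext("inv:Note", namespaces=NS)

print(invoice_id, total.get("currency"), total.text, note)
```

Encoding detection, entity references such as &amp; and numeric character references such as &#248; are all resolved by the library before the values reach the application.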
 
Joe Kesselman

By the way: The W3C specs are *not* easy to read, and not intended to be
easy to read -- the editorial principle is "prescriptive, not
descriptive". If you need something which is suitable for newbies, there
are *many* good tutorials on the web (and some which aren't so good,
unfortunately), and those will be much easier to read than the
specification itself.

One good place to start, with everything from basic tutorials to
advanced articles, is http://www.ibm.com/xml (Claimer: I do work for IBM
so I'm biased, but the Developer Connection really is the best
collection I've found for a general introduction to XML and related
technologies.)

If you need to understand the XML specification in all its gory detail,
I highly recommend the Annotated XML Spec, where Tim Bray went through
and explained the actual meaning and intent of all the legalese. Alas,
it hasn't been updated for XML 1.1, but it's still tremendously
valuable. This can be found at http://www.xml.com/pub/a/axml/axmlintro.html

You've got a lot of homework reading to catch up on. Have fun... <smile/>
 
Asger Jørgensen

Hi Joe

Thank You very much for explaining things in English.

I will just deal with both as if they were comments, and throw them away.

Thanks also for the links, especially to:
http://www.xml.com/axml/testaxml.htm
That one is really cool.

And to make You rest easy..;-) It is really not that difficult to write this
XmlParser.
There are a lot of very strict rules for these XML documents:
Always UTF-8, and since the environment that I'm writing code for
isn't Unicode at all I can just convert to national characters before I
start reading.

A little fewer than 1000 tag names in three different namespaces.
Only 30 attribute names. No empty tags and no mixed types.

I'm almost done writing the parser and I have not yet written
300 lines of code.

And as mentioned earlier, I am happy that I don't have to write
a real XmlParser; then there would be a looooooot more to consider.

Thanks again
Best regards
Asger
 
Joseph Kesselman

Johannes said:
Just for the records: According to the quoted source the comment
delimiters are '<!--' and '-->'.

Yep. Sorry; I was a bit distracted.
 
Peter Flynn

Asger Jørgensen said:
And to make You rest easy..;-) It is really not that difficult to write this
XmlParser.

It's not difficult to write a program that recognises a small subset of
XML syntax.

Writing a real parser is *very* hard.

Asger Jørgensen said:
There are a lot of very strict rules for these XML documents:

The XML Spec explains why. The same applies to any formal language, like
C, Java, FORTRAN, Ada, etc. XML is no different.

Asger Jørgensen said:
And as mentioned earlier, I am happy that I don't have to write
a real XmlParser; then there would be a looooooot more to consider.

Right. I think a number of us would be interested to know why you don't
just use an existing parser. Why do you need to write your own?

///Peter
 
Joseph Kesselman

Peter said:
Writing a real parser is *very* hard.

I wouldn't say "very", if you aren't trying to do validation and if
you're willing to limit the input character sets (or use some other
library to deal with that level of things). But it's certainly on the
order of being a term project.
 
Richard Tobin

Peter Flynn said:
It's not difficult to write a program that recognises a small subset of
XML syntax.
Writing a real parser is *very* hard.

It seems that when the OP used the phrase "xml parser for some very
strict xml files" he meant "parser for a very restricted subset of XML",
rather than "parser for files that require a strict XML parser" (which
is how I interpreted it).

-- Richard
 
Asger Jørgensen

Hi Richard

Richard Tobin said:
It seems that when the OP used the phrase "xml parser for some very
strict xml files" he meant "parser for a very restricted subset of XML",
rather than "parser for files that require a strict XML parser" (which
is how I interpreted it).

And You are absolutely right!

Kind regards
Asger
 
Joseph Kesselman

I still wonder: If you're going to constrain the documents to a tiny
subset of XML... Are you really sure you want to use XML syntax? The
power of XML comes from interoperability, and if you aren't taking
advantage of that you might well find that something like CSV or
name/value (properties file) would be a better solution.
 
Asger Jørgensen

Hi Peter

Peter Flynn said:
It's not difficult to write a program that recognises a small subset of
XML syntax.

That's what I'm doing.

Peter Flynn said:
Writing a real parser is *very* hard.

I agree.

Meaning a very strict subset of the XML syntax, plus that every tag name
is defined with children and everything, no empty tags etc.

Peter Flynn said:
Right. I think a number of us would be interested to know why you don't
just use an existing parser. Why do you need to write your own?

Because of speed!!
I can parse a file with only 1-4 conditions per character plus a pre-conversion
to the local code page, which is a lot faster than Microsoft's or any other parser I
have seen. Usually the parsers work in WideString UTF-16, which makes
everything much more difficult.
Luckily for me I don't need that, since this project is strictly national.

And when it comes to validation, I win big time..:)

The Xml, Xls and Xsl systems are very smart when it comes to usability
everywhere, but they are also extremely slow.
In the files I am working with, not even 10% is actual data;
the rest is XML tags.
Just like Java..;-)

I guess You can hear that I'm not impressed by XML, but hey, I'm just
an old C/C++ guy from the time when resources were something
You HAD TO consider.
In other words, a dying breed.;-)

Thanks for the help to all.
Kind regards
Asger
 
Stefan Ram

Joseph Kesselman said:
I still wonder: If you're going to constrain the documents to a tiny
subset of XML... Are you really sure you want to use XML syntax? The
power of XML comes from interoperability, and if you aren't taking
advantage of that you might well find that something like CSV or
name/value (properties file) would be a better solution.

Implementing XML is easier,
because there is a specific grammar.

For CSV

http://secretgeek.net/csv_trouble.asp

or property files,
one would need to take an additional step backwards
and start by finding or specifying a grammar.

Or can someone show me /the/ CSV grammar,
or /the/ properties grammar?
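
A small sketch of why the choice of grammar matters, in Python (chosen only for illustration; the sample line is made up). The same line yields different fields depending on whether it is split naively on commas or read with the quoting rules of the "excel" dialect, which is the default in Python's csv module and only one of several CSV conventions in use:

```python
import csv
import io

# Made-up record: a quoted field containing a comma, and one containing quotes.
line = 'Smith,"Aarhus, Denmark","He said ""hi"""\n'

# Reading 1: "CSV" means "split on commas".
naive = line.rstrip("\n").split(",")

# Reading 2: quoted fields may contain commas; "" inside quotes is a literal ".
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split sees four fields and the dialect-aware reader sees three, so two programs that both claim to read CSV can disagree about the same file.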
 
Asger Jørgensen

Hi Joseph

Joseph Kesselman said:
I still wonder: If you're going to constrain the documents to a tiny subset
of XML... Are you really sure you want to use XML syntax? The power of XML
comes from interoperability, and if you aren't taking advantage of that you
might well find that something like CSV or name/value (properties file)
would be a better solution.

Unfortunately I don't decide the format of the files; I just have to receive
the files, check if they are correct, do some calculations and show the
result to the user.
And then store them in my own format, which definitely isn't XML.;-)

There is one benefit though: if the user wants to see the source file
they have received, I am told that I can combine some files and then
show the result in the browser, which makes that part easier, I hope.
But that is not until later in the process, so I might be back with some
questions about that later.

Kind regards
Asger
 
Asger Jørgensen

Hi Stefan

Implementing XML is easier,
because there is a specific grammar.

You might be right when it comes to talking with others about
the file, but when it comes to the actual coding I don't think You are
right. Unless of course You have a ready-made library that does all the
hard work.
In MY opinion:
The easiest/fastest is still some good old C coding like printf.
Then use either fixed size or non-fixed size depending on the data.
Fixed size is much faster to read.
For readable files I would use tabs and line breaks, and if it isn't a must
that the file can be read, then I would use some of the control
characters that can't be typed from the keyboard; that way You
don't have to replace tabs and line breaks in the original data
(if there are any).
One for separating each element and another for new block.
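
A minimal sketch of that control-character idea, in Python rather than C (the language, the function names and the choice of the ASCII Unit Separator and Record Separator are assumptions for illustration). It only works as long as the data itself never contains those two characters:

```python
UNIT_SEP = "\x1f"    # ASCII Unit Separator: between the elements of a block
RECORD_SEP = "\x1e"  # ASCII Record Separator: between blocks


def dump(records):
    """Join lists of strings into one string, with no escaping at all."""
    return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in records)


def load(text):
    """Inverse of dump(); assumes the data never contains the separators."""
    if not text:
        return []
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Tabs and line breaks in the data pass through untouched.
rows = [["4711", "line one\nline two"], ["4712", "tab\there"]]
encoded = dump(rows)
decoded = load(encoded)
print(decoded == rows)
```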

Checked the link.;-)
No, I wouldn't write my own CSV parser unless speed was
VERY important. Borland C++ Builder has one
implemented in all the stringlists.

Kind regards
Asger
 
Andy Dingley

I will just deal with both as if they were comments, and throw them away.

You can only do that if you also forbid the use of CDATA sections.

I imagine that you can do this in your situation, but clearly record
that you've made this choice, don't just leave it to chance in the
future.
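
To make the CDATA caveat concrete, here is a rough sketch in Python (chosen only for illustration; the function names and the sample string are made up, and DOCTYPE declarations are deliberately not handled). A rule that throws away everything starting with <! or <? also eats the start of a CDATA section, while a scanner that recognises CDATA separately can keep its text:

```python
import re

# Comments, processing instructions and CDATA sections, matched in one pass.
# The construct that starts first wins, so markup-like text inside a CDATA
# section or a comment is not mistaken for a separate construct.
_TOKEN = re.compile(
    r"<!--.*?-->"                        # comment
    r"|<\?.*?\?>"                        # PI or XML declaration
    r"|<!\[CDATA\[(?P<cdata>.*?)\]\]>",  # CDATA section: keep its content
    re.DOTALL,
)


def escape_text(s):
    """Escape characters that would otherwise be read as markup."""
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def drop_comments_and_pis(xml_text):
    """Remove comments and PIs; turn CDATA sections into escaped text."""
    def repl(match):
        cdata = match.group("cdata")
        return "" if cdata is None else escape_text(cdata)
    return _TOKEN.sub(repl, xml_text)


def naive_strip(xml_text):
    """The shortcut under discussion: drop anything that starts with <! or <?."""
    return re.sub(r"<[!?].*?>", "", xml_text, flags=re.DOTALL)


doc = ('<?xml version="1.0"?><a><!-- note -->'
       '<b><![CDATA[1 < 2 & <!-- kept -->]]></b><?pi data?></a>')

careful = drop_comments_and_pis(doc)
naive = naive_strip(doc)
print(careful)
print(naive)
```

The naive version silently loses the content of <b> and leaves a stray ]]> behind, which is the kind of failure that recording the "no CDATA" restriction is meant to guard against.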

Always UTF-8, and since the environment that I'm writing code for
isn't Unicode at all I can just convert to national characters before I
start reading.

I don't understand this para.

If the content "Isn't Unicode at all", then I presume you mean that
it's plain old ASCII character set. In this case (ignoring the
possibility of a UTF-8 BOM) then the encoding is also ASCII and is
thus also UTF-8 simultaneously.

So how could you have "national characters" occurring? (by which I
assume that you mean non-ASCII characters from an ISO-8859-* character
set)
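
The byte-level facts behind this question can be checked with a short sketch in Python (chosen only for illustration). The function name to_national and the choice of ISO-8859-1 as the "national" code page are assumptions, not something stated in the thread:

```python
def to_national(utf8_bytes, codepage="iso-8859-1"):
    """Decode UTF-8 input and re-encode it in an 8-bit code page.

    Characters the code page cannot represent are replaced by '?'.
    """
    return utf8_bytes.decode("utf-8").encode(codepage, errors="replace")


# Pure ASCII text has the same bytes in ASCII, ISO-8859-1 and UTF-8.
ascii_same = "Invoice".encode("ascii") == "Invoice".encode("utf-8")

# A non-ASCII letter differs: U+00F8 is two bytes in UTF-8, one in ISO-8859-1.
name = "Asger J\u00f8rgensen"
utf8 = name.encode("utf-8")
national = to_national(utf8)

# Anything outside the target code page is lost in the conversion.
lossy = to_national("\u20ac".encode("utf-8"))  # euro sign is not in ISO-8859-1

print(ascii_same, len(utf8), len(national), lossy)
```

So a file can be valid UTF-8 and still contain national letters. The pre-conversion is safe only if every character in the input exists in the target code page.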
 
Andy Dingley

Writing a real parser is *very* hard.

Writing a real parser is only "hard". Writing a partial, hacked-up
parser is much harder.

The difference is that "real" parsers are only attempted by people who
know what they're doing, have read the Dragon et al. and are probably
building it with some existing framework. Hacked-up parser builders
generally start with Perl and a few good intentions. They have to work
_much_ harder, to only get part of the way.
 
