XML Parser Components?

mahesh.kanakaraj

Hi Folks,

This is my first post to this group, and I really am not sure whether
this is the right group to ask my question. If it's not an appropriate
question to this group, please correct me and guide me to the right
place.

The thing is, I have been asked to design an XML parser using C. I have
done some study on XML so far and I know that I should have a design
before I start my coding.

And since I am new to parsers, I really am confused about
what the components of my parser should be. All I know now is that I need a
validating component that validates the XML file, which should then
pass the XML file on to the parsing component for parsing.

My confusion lies in the parsing component. It's like I can't decide
what the sub-components of the parsing component should be.

Would some of you be kind enough to enlighten me on this issue?

Thanks in Advance.

Mahesh.
 
Joe Kesselman

mahesh.kanakaraj said:
All I know now is that I need a
validating component that validates the XML file, which should then
pass the XML file on to the parsing component for parsing.

It's usually done the other way around -- write a nonvalidating parser
to deal with the syntactic issues, then attach the validator to that.
(That isn't the only solution, or always the best solution, just the
easiest way to think about the problem.)
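
To make that layering concrete, here is a minimal sketch in C. It assumes a toy subset of XML (only `<name>` and `</name>` tags, with no attributes, comments, entities, or empty-element tags), and every name in it (`Handler`, `parse`, `Validator`, `parse_and_validate`) is hypothetical rather than taken from any real library. The nonvalidating parser only checks that tags nest properly and reports events; the validator is just one handler attached to those events, here checking element names against an allowed list as a stand-in for a real DTD or schema check.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Events reported by the nonvalidating parser (hypothetical interface). */
typedef struct {
    void (*start)(void *ctx, const char *name);
    void (*end)(void *ctx, const char *name);
    void *ctx;
} Handler;

/* Toy well-formedness parser: understands only <name> and </name>.
   Returns 0 if all tags nest correctly, -1 otherwise. */
int parse(const char *s, const Handler *h)
{
    char stack[16][32];              /* names of the currently open elements */
    int depth = 0;

    assert(s && h);
    while (*s) {
        if (*s != '<') { s++; continue; }        /* skip character data */
        int closing = (s[1] == '/');
        const char *p = s + 1 + closing;         /* first char of the name */
        const char *q = strchr(p, '>');
        if (!q) return -1;                       /* unterminated tag */
        size_t n = (size_t)(q - p);
        if (n == 0 || n >= sizeof stack[0]) return -1;
        char name[32];
        memcpy(name, p, n);
        name[n] = '\0';
        if (closing) {
            if (depth == 0 || strcmp(stack[depth - 1], name) != 0)
                return -1;                       /* mismatched end tag */
            depth--;
            if (h->end) h->end(h->ctx, name);
        } else {
            if (depth == 16) return -1;          /* nesting too deep for the toy stack */
            strcpy(stack[depth++], name);
            if (h->start) h->start(h->ctx, name);
        }
        s = q + 1;
    }
    return depth == 0 ? 0 : -1;                  /* every element must be closed */
}

/* Validator layered on top: counts elements whose name is not allowed,
   then forwards each event to an optional next handler. */
typedef struct {
    const char *const *allowed;      /* NULL-terminated list of legal names */
    int errors;
    const Handler *next;
} Validator;

static void v_start(void *ctx, const char *name)
{
    Validator *v = ctx;
    int ok = 0;
    for (const char *const *a = v->allowed; *a; a++)
        if (strcmp(*a, name) == 0) ok = 1;
    if (!ok) v->errors++;
    if (v->next && v->next->start) v->next->start(v->next->ctx, name);
}

static void v_end(void *ctx, const char *name)
{
    Validator *v = ctx;
    if (v->next && v->next->end) v->next->end(v->next->ctx, name);
}

/* Returns -1 if the input is not well-formed, else the number of validity errors. */
int parse_and_validate(const char *xml, const char *const *allowed)
{
    Validator v = { allowed, 0, NULL };
    Handler h = { v_start, v_end, &v };
    if (parse(xml, &h) != 0) return -1;
    return v.errors;
}
```

The point of the split is that `parse` knows nothing about validity: an application that does not need validation passes its own `Handler` straight to `parse`, and one that does puts the `Validator` in between.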

mahesh.kanakaraj said:
My confusion lies on the parsing component. It's like I can't decide
what should be the sub-components of the parsing component.

For a basic implementation, read any good book on parser design and/or
feed the XML grammar into any standard parser generator tool (e.g. the
YACC/LEX set).

Strong suggestion that -- unless this is a class assignment or you
believe you have a new approach that has significant advantages -- you
consider instead using one of the many parsers already available. (And I
assume that if the latter applied, you wouldn't have posted this vague a
question.) Reinventing wheels is sometimes useful; reimplementing
existing wheels is generally a waste of resources.
 
Jürgen Kahrs

Joe said:
Strong suggestion that -- unless this is a class assignment or you
believe you have a new approach that has significant advantages -- you
consider instead using one of the many parsers already available.

Joe is right. If you really think that you should
write your own parser, be prepared to deal with all
the details of Unicode. For example, have you ever
heard of the BOM at the beginning of an XML file ?
Will your parser be able to deal with UTF-7 as well
as UTF-32 ?

Use Expat or libxml:

http://expat.sourceforge.net/
http://xmlsoft.org/
 
Joe Kesselman

Jürgen Kahrs said:
Joe is right. If you really think that you should
write your own parser, be prepared to deal with all
the details of Unicode.

Well, one can start with an I/O library that handles Unicode; those
exist too. And sometimes it does make sense to have an implementation
that only supports a limited set of encodings, if you are certain that
those are all your application is ever going to see.

But there are lots of details in XML itself, especially if you want a
modern XML environment that supports namespaces, validation against
schemas, the standard XML APIs (DOM and/or SAX)...

A basic XML parser is a reasonable term project. A practical, efficient,
robust, validating XML parser is rather more. So unless this is a class
assignment (or equivalent), I'd definitely go back to whoever said "write
one" and ask them why they want you to do that.
 
mahesh.kanakaraj

Jürgen Kahrs said:
Joe is right. If you really think that you should
write your own parser, be prepared to deal with all
the details of Unicode. For example, have you ever
heard of the BOM at the beginning of an XML file ?
Will your parser be able to deal with UTF-7 as well
as UTF-32 ?

My parser needs to worry only about UTF-8, which, I think, is not that
difficult to deal with compared to what you were asking about (the other UTFs).
 
Jürgen Kahrs

mahesh.kanakaraj said:
My parser needs to worry only about UTF-8, which, I think, is not that
difficult to deal with compared to what you were asking about (the other UTFs).

Even UTF-8 data may contain a Byte-Order-Mark (BOM).
Be prepared to read up to 4 bytes per "character".
The byte order within UTF-8 is fixed; it only varies
in encodings such as UTF-16 and UTF-32.
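
A minimal sketch of what that means for a UTF-8-only parser, assuming the whole input is already in a byte buffer. In UTF-8 the BOM, when present, is the three bytes EF BB BF, and each character takes 1 to 4 bytes. The function names `utf8_bom_length` and `utf8_decode` are made up for this example; they are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes to skip at the start of the input:
   3 if it begins with the UTF-8 BOM (EF BB BF), otherwise 0. */
size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    assert(buf);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}

/* Decode one code point starting at buf.
   Returns the number of bytes consumed (1 to 4) and stores the code
   point in *cp, or returns 0 if the sequence is truncated or malformed. */
size_t utf8_decode(const unsigned char *buf, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t c, min;

    assert(buf && cp);
    if (len == 0) return 0;

    if (buf[0] < 0x80) { *cp = buf[0]; return 1; }          /* ASCII */
    else if ((buf[0] & 0xE0) == 0xC0) { n = 2; c = buf[0] & 0x1F; min = 0x80; }
    else if ((buf[0] & 0xF0) == 0xE0) { n = 3; c = buf[0] & 0x0F; min = 0x800; }
    else if ((buf[0] & 0xF8) == 0xF0) { n = 4; c = buf[0] & 0x07; min = 0x10000; }
    else return 0;                      /* stray continuation or invalid lead byte */

    if (len < n) return 0;              /* sequence cut off by end of input */
    for (i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (buf[i] & 0x3F);
    }
    /* Reject overlong forms, values above U+10FFFF, and surrogates. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;

    *cp = c;
    return n;
}
```

A reader loop would call `utf8_bom_length` once, advance past the result, and then call `utf8_decode` repeatedly, treating a return of 0 as a fatal encoding error.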

But (as Joe suggested), there are libraries that
do the conversion for you. Use iconv, which is a POSIX
interface (see "man iconv"); GNU libiconv is one implementation of it.
 
mahesh.kanakaraj

Jürgen Kahrs said:
Even UTF-8 data may contain a Byte-Order-Mark (BOM).
Be prepared to read up to 4 bytes per "character"
and be prepared to read them in any byte-order.

I shall make sure to handle the BOM.

Jürgen Kahrs said:
But (as Joe suggested), there are libraries that
do the conversion for you. Use the libiconv, which
is a POSIX lib (see "man iconv").

I surely will look into libiconv. And I thank all of you guys who
have given suggestions and such.
 
