XML Parser Components?

mahesh.kanakaraj · Sep 14, 2006

Hi Folks,

This is my first post to this group, and I am not sure whether
this is the right group for my question. If it's not an appropriate
question for this group, please correct me and point me to the right
place.

The thing is, I have been asked to design an XML parser in C. I have
done some study on XML so far and I know that I should have a design
before I start my coding.

Since I am new to parsers, I am confused about what the
components of my parser should be. All I know now is that I need a
validating component that validates the XML file, which should then
pass the XML file on to the parsing component for parsing.

My confusion lies in the parsing component. I can't decide
what the sub-components of the parsing component should be.

Would some of you be kind enough to enlighten me on this issue?

Thanks in Advance.

Mahesh.

Joe Kesselman · Sep 14, 2006

mahesh.kanakaraj said:
validating component that validates the XML file, which should then
pass the XML file on to the parsing component for parsing.

It's usually done the other way around -- write a nonvalidating parser
to deal with the syntactic issues, then attach the validator to that.
(That isn't the only solution, or always the best solution, just the
easiest way to think about the problem.)
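
To make that layering concrete, here is a minimal sketch in C. Everything in it is an assumption for illustration: the names `xml_handler`, `xml_scan`, and `validate` are hypothetical, the scanner only recognises bare `<name>` and `</name>` tags (no attributes, comments, entities, or nesting checks), and the "validator" is a toy stand-in that merely checks element names against an allowed list rather than a real DTD or schema.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define NAME_MAX_LEN 63

/* Events delivered by the (hypothetical) non-validating scanner. */
typedef struct {
    void (*start_element)(void *user, const char *name);
    void (*end_element)(void *user, const char *name);
} xml_handler;

/* Toy scanner: recognises only <name> and </name>; everything else is
 * skipped as text. Returns 0 on success, -1 on a malformed tag. */
int xml_scan(const char *doc, const xml_handler *h, void *user)
{
    assert(doc != NULL && h != NULL);
    const char *p = doc;
    while (*p) {
        if (*p != '<') { p++; continue; }   /* character data: skip */
        p++;
        int closing = (*p == '/');
        if (closing) p++;
        char name[NAME_MAX_LEN + 1];
        size_t n = 0;
        while (isalnum((unsigned char)*p) && n < NAME_MAX_LEN)
            name[n++] = *p++;
        name[n] = '\0';
        if (n == 0 || *p != '>') return -1; /* empty name or no '>' */
        p++;
        if (closing) {
            if (h->end_element) h->end_element(user, name);
        } else {
            if (h->start_element) h->start_element(user, name);
        }
    }
    return 0;
}

/* Validator state: attached to the scanner through the handler. */
typedef struct {
    const char *const *allowed;
    size_t allowed_count;
    int errors;
} validator;

/* Counts every start tag whose name is not in the allowed list. */
static void validator_start(void *user, const char *name)
{
    validator *v = user;
    for (size_t i = 0; i < v->allowed_count; i++)
        if (strcmp(v->allowed[i], name) == 0) return;
    v->errors++;
}

/* Returns the number of disallowed elements, or -1 on a syntax error. */
int validate(const char *doc, const char *const *allowed, size_t count)
{
    validator v = { allowed, count, 0 };
    xml_handler h = { validator_start, NULL };
    if (xml_scan(doc, &h, &v) != 0) return -1;
    return v.errors;
}
```

The point of the split is that `xml_scan` knows nothing about validity; an application that does not need validation can pass its own handler and skip the validator entirely.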

mahesh.kanakaraj said:
My confusion lies in the parsing component. I can't decide
what the sub-components of the parsing component should be.

For a basic implementation, read any good book on parser design and/or
feed the XML grammar into any standard parser generator tool (eg the
YACC/LEX set).

Strong suggestion that -- unless this is a class assignment or you
believe you have a new approach that has significant advantages -- you
consider instead using one of the many parsers already available. (And I
assume that if the latter applied, you wouldn't have posted this vague a
question.) Reinventing wheels is sometimes useful; reimplementing
existing wheels is generally a waste of resources.

Jürgen Kahrs · Sep 14, 2006

Joe said:
Strong suggestion that -- unless this is a class assignment or you
believe you have a new approach that has significant advantages -- you
consider instead using one of the many parsers already available. (And I

Joe is right. If you really think that you should
write your own parser, be prepared to deal with all
the details of Unicode. For example, have you ever
heard of the BOM at the beginning of an XML file?
Will your parser be able to deal with UTF-7 as well
as UTF-32?

Use Expat or libxml:

http://expat.sourceforge.net/
http://xmlsoft.org/

Joe Kesselman · Sep 14, 2006

Jürgen Kahrs said:
Joe is right. If you really think that you should
write your own parser, be prepared to deal with all
the details of Unicode.

Well, one can start with an I/O library that handles Unicode; those
exist too. And sometimes it does make sense to have an implementation
that only supports a limited set of encodings, if you are certain that
those are all your application is ever going to see.

But there are lots of details in XML itself, especially if you want a
modern XML environment that supports namespaces, validation against
schemas, the standard XML APIs (DOM and/or SAX)...

A basic XML parser is a reasonable term project. A practical, efficient,
robust, validating XML parser is rather more. So unless this is a class
assignment (or equivalent), I'd definitely go back to whoever said "write
one" and ask them why they want you to do that.

mahesh.kanakaraj · Sep 15, 2006

Jürgen Kahrs said:
Joe is right. If you really think that you should
write your own parser, be prepared to deal with all
the details of Unicode. For example, have you ever
heard of the BOM at the beginning of an XML file?

Will your parser be able to deal with UTF-7 as well
as UTF-32?

My parser needs to worry only about UTF-8, which, I think, is not as
difficult to deal with as the other encodings you mentioned (UTF-7 and UTF-32).

Jürgen Kahrs · Sep 15, 2006

mahesh.kanakaraj said:
My parser needs to worry only about UTF-8, which, I think, is not as
difficult to deal with as the other encodings you mentioned (UTF-7 and UTF-32).

Even UTF-8 data may contain a Byte-Order-Mark (BOM).
Be prepared to read up to 4 bytes per "character"
and be prepared to read them in any byte-order.
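
As a rough illustration of those two points, here is a sketch in C of skipping a UTF-8 BOM and decoding one character of up to 4 bytes. The function names `utf8_bom_length` and `utf8_decode` are made up for this example. Note that in UTF-8 the BOM is always the fixed byte sequence EF BB BF and the bytes of a character always arrive in the same order; byte-order handling proper only comes into play for UTF-16 and UTF-32 input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 3 if buf starts with the UTF-8 BOM (EF BB BF), otherwise 0. */
size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3;
    return 0;
}

/* Decodes one code point from s into *out.
 * Returns the number of bytes consumed (1 to 4), or 0 if the input is
 * empty, truncated, or not well-formed UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    size_t need;
    uint32_t cp, min;

    if (len == 0) return 0;
    if (s[0] < 0x80) { *out = s[0]; return 1; }            /* ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; min = 0x10000; }
    else return 0;                      /* stray continuation or invalid lead */

    if (len < need) return 0;           /* truncated sequence */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    *out = cp;
    return need;
}
```

A reader would typically call `utf8_bom_length` once on the start of the buffer, advance by the result, and then call `utf8_decode` in a loop, stopping or reporting an error when it returns 0.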

But (as Joe suggested), there are libraries that
do the conversion for you. Use libiconv, which implements
the POSIX iconv interface (see "man iconv").
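
For what it's worth, a conversion through the iconv interface can look roughly like the sketch below. The wrapper name `latin1_to_utf8` is hypothetical, and the encoding names "ISO-8859-1" and "UTF-8" are assumed to be supported by the installed iconv implementation (they are in glibc and GNU libiconv). On some systems you also need to link with -liconv.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Converts in_len bytes of ISO-8859-1 text to UTF-8 in one call.
 * Returns the number of bytes written to out, or (size_t)-1 if the
 * converter cannot be opened or the conversion fails (for example
 * because out_cap is too small). */
size_t latin1_to_utf8(const char *in, size_t in_len, char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");   /* to, from */
    if (cd == (iconv_t)-1) return (size_t)-1;

    /* iconv advances these pointers and decrements the counters. */
    char *src = (char *)in;   /* iconv takes char **, input is not modified */
    char *dst = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;

    size_t rc = iconv(cd, &src, &in_left, &dst, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) return (size_t)-1;
    return out_cap - out_left;
}
```

A real parser would more likely keep one `iconv_t` open and feed it the input in chunks, handling the E2BIG and EINVAL cases between calls; the one-shot form above just keeps the example short.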

mahesh.kanakaraj · Sep 18, 2006

Jürgen Kahrs said:
Even UTF-8 data may contain a Byte-Order-Mark (BOM).
Be prepared to read up to 4 bytes per "character"
and be prepared to read them in any byte-order.

I shall make sure to handle the BOM.

But (as Joe suggested), there are libraries that
do the conversion for you. Use libiconv, which implements
the POSIX iconv interface (see "man iconv").

I will certainly look into libiconv. Thank you all for the
suggestions.
