"Functional"

BartC

Ian Collins said:
So you expect windows users to be using your X toolkit?

Seriously, if every piece of software that uses XML included its own
parser, half the programming world would be writing XML parsers. That's
why we have libraries.

Yes, but massive great libraries which half the time dwarf the tasks
they are called upon to deal with.

In Malcolm's case, he is parsing XML which he himself has generated. So the
program can be minimal (but retaining the advantage of being in a standard
format).

I recently was thinking about using XML (to store a program structure), but
have had unhappy dealings with XML before. I instead created my own format
(the file doesn't need to go anywhere) to do the same thing. The parser is
about 150 lines of code.
 
Les Cargill

Malcolm said:
XML is a sort of accepted standard for information exchange. I
use if for the Baby X resource compiler, not because the data is
complex or recursive - essentially it's just a list of files to
pack into a compileable C source - but because it's easier than
defining a custom convention.
Whilst you can represent trees or other recursive structures in
XML directly, most XML files aren't like that.
I wrote a vanilla xml parser (http://www.malcolmmclean.site11.com/www).
There were other parsers available on the web, but I didn't like
any of them. The snag is that to support the full xml spec you need
lots of complex code, and then to load in a massive file you have to
devise a complex system for keeping most of it on disk. The XML people
call it a "pull parser". This is all overkill if you just want a crossword
file containing a 15x15 grid and maybe thirty one-line clues, or a
list of maybe a dozen files.


I was directed to ezXML at one point, and it's a nice, simple XML
parser. It only gets you to "leaf nodes" and then you have to interpret
it into a coherent whole by traversing child/sibling pointer
relationships.

Just in case you are interested in/tired of maintaining your own.

http://ezxml.sourceforge.net/
 
Malcolm McLean

I was directed to ezXML at one point, and it's a nice, simple XML
parser. It only gets you to "leaf nodes" and then you have to interpret
it into a coherent whole by traversing child/sibling pointer
relationships.

Just in case you are interested in/tired of maintaining your own.



http://ezxml.sourceforge.net/

I've just looked at the source.

#include <unistd.h>
#include <sys/types.h>

So unfortunately it won't compile on every C compiler.
It's probably perfectly good if you can guarantee a Unix-like system, but it can't replace the vanilla xml parser.
 
Stephen Sprunk

I wrote a vanilla xml parser
(http://www.malcolmmclean.site11.com/www). There were other parsers
available on the web, but I didn't like any of them. The snag is that
to support the full xml spec you need lots of complex code, and then
to load in a massive file you have to devise a complex system for
keeping most of it on disk.

That depends on what you mean by "massive"; there shouldn't be any
problem loading typical files of reasonable size on 32-bit systems or
even ridiculously large files on 64-bit systems. Let the OS's virtual
memory system deal with mapping data in and out as needed.
This is all overkill if you just want a crossword file containing a
15x15 grid and maybe thirty one-line clues, or a list of maybe a
dozen files.

So you load it, extract the data you need, and then unload it.
If you use a library then no-one can compile your code
unless they have that library installed, which means that no-one will
compile your code unless very motivated to use it.

A dependency on a common XML library shouldn't be a big deal; most
programs I deal with depend on a dozen or more libraries, of which I
usually have most or all of them installed already anyway. If you're
really worried about it, just include the library's code within your
distribution (either source or binary). That's still better than having
to write and maintain your own XML parser.

S
 
Malcolm McLean

A dependency on a common XML library shouldn't be a big deal; most
programs I deal with depend on a dozen or more libraries, of which I
usually have most or all of them installed already anyway. If you're
really worried about it, just include the library's code within your
distribution (either source or binary). That's still better than having
to write and maintain your own XML parser.
Once you depend on half a dozen libraries, adding an extra dependency isn't
a big change. But going from being self-contained to having an external dependency
is a big step in the wrong direction. It means that your code is unlikely
to remain useful for very long, because sooner or later one of the externals
is going to break, become unavailable, require a proprietary compiler, or
otherwise cause the program to fail. An ffmpeg build broke on Microsoft, for
example. I suspect it was done deliberately.

Have you actually read the vanilla xml parser? It took about a day to write.
In my view, that's time well invested. It won't handle arbitrarily complex
xml that depends on all the difficult areas of the standard. But any time I
need a config file or a small database, I can simply include this one file.
If something goes wrong, the source is simple enough for any competent C
programmer to understand it in an hour or so.

Once code passes a certain level of complexity, you need to maintain it.
It uses various constructs, or depends on externals, which break. But with
simple, standard C functions, generally you don't.
 
Stephen Sprunk

Once you depend on half a dozen libraries, adding an extra dependency
isn't a big change.
Exactly.

But going from being self-contained to having an external dependency is a big
step in the wrong direction. It means that your code is unlikely to
remain useful for very long, because sooner or later one of the
externals is going to break, become unavailable, require a
proprietary compiler, or otherwise cause the program to fail.

As I said, if you're worried about that, just include the library's code
within your distribution so there's no external build-time or run-time
dependency. Many projects do that, especially on Windows.
Have you actually read the vanilla xml parser? It took about a day to
write. In my view, that's time well invested. It won't handle
arbitrarily complex xml that depends on all the difficult areas of
the standard. But any time I need a config file or a small database,
I can simply include this one file. If something goes wrong, the
source is simple enough for any competent C programmer to understand
it in an hour or so.

I don't doubt that; I'm just not a fan of reinventing the wheel.

S
 
Seebs

Once you depend on half a dozen libraries, adding an extra dependency isn't
a big change. But going from being self-contained to having an external dependency
is a big step in the wrong direction. It means that your code is unlikely
to remain useful for very long, because sooner or later one of the externals
is going to break, become unavailable, require a proprietary compiler, or
otherwise cause the program to fail.

How long is "very long"? I've had code that depends on curses run pretty much
consistently from the late 80s through today.

-s
 
Malcolm McLean

How long is "very long"? I've had code that depends on curses run pretty much
consistently from the late 80s through today.
In my case, we had a machine, a single 386 that served 20 of us, that ran
curses. But I bought a DOS machine for myself, and I had a whole 386
processor just for me. Fortunately the instructor had told us to write
a little abstraction layer over curses, so I rewrote it for DOS, and I could
shuttle code between my home machine and the class.
Then I got a Unix machine for my first job, so curses would have been
useful again. Except that it also ran X. Most of the users were artists,
and they didn't like curses-type interfaces. Next job was Windows and games
console based.
Then I used a Linux machine for my PhD. So I could have dusted off my
old curses code. But it was long forgotten by then. However, I've still got
a file for loading a bitmap, written when I had the first 386. If you look at the Baby X
resource compiler (http://www.maclolmmclean/site11.com/www/BabyX/BabyX.html )
you can see it. It's still going strong.
 
Aleksandar Kuktin

Hello group! :)

[snip]

If you look at the Baby X resource compiler
(http://www.maclolmmclean/site11.com/www/BabyX/BabyX.html ) you can see
it. It's still going strong.

This URL is definitely not transporting me to BabyX. I tried
`http://www.malcolmmclean.site11.com/www' that you mentioned in one of
the previous posts, but that was also a dud. It was only when I went to
`http://www.malcolmmclean.site11.com/' and clicked on the 'www/' that I
was able to get somewhere.

I'm using Midori.
 
Nobody

Seriously, if every piece of software that uses XML included its own
parser, half the programming world would be writing XML parsers.

And most of them would be half-baked parsers for whichever unspecified
subset of XML the program uses for its output (i.e. not actually XML at
all, just something similar enough to confuse people).

If you're going to support XML, you need to accept any file which matches
the schema, not just those which the application generated itself.
Otherwise, you can't edit the files with standard tools, can't create
files with standard libraries, etc. IOW, you may as well just fwrite() the
in-memory representation to disc.
 
Nobody

In Malcolm's case, he is parsing XML which he himself has generated. So the
program can be minimal (but retaining the advantage of being in a standard
format).

The output might be in a standard format (or it might not, if it's being
coded to match an implementation rather than a specification). But does
that really help if you can't do anything with the file beside load it
straight back in? If the parser won't read the result of editing the file
with e.g. xsltproc, there isn't a great deal of point in having it in
XML in the first place.
 
Malcolm McLean

And most of them would be half-baked parsers for whichever unspecified
subset of XML the program uses for its output (i.e. not actually XML at
all, just something similar enough to confuse people).

If you're going to support XML, you need to accept any file which matches
the schema, not just those which the application generated itself.
Otherwise, you can't edit the files with standard tools, can't create
files with standard libraries, etc. IOW, you may as well just fwrite() the
in-memory representation to disc.
You edit the file with a text editor. Or the program writes it to disk in
xml format, and you can open and examine it with a text editor.
For the vast majority of programs, it's possible to produce a file which
is valid xml but which defeats the loader in some way. If you declare a
clue as 2 across when 2 is at the head of a down-only word, for example,
a crossword program is going to have to reject your file. So if it also
rejects it if you set up some complex scheme involving namespaces and
entities, which it can't understand because it doesn't support those features,
it's not a qualitative change.

The alternative to using xml is to declare a specific syntax, as used by
the Microsoft resource compiler, which pre-dates xml. So with the MS resource
compiler you declare a bitmap

disk1 BITMAP "disk.bmp"

there's nothing too bad about this. But the user's going to be asking "does
the id need to be in quotes, or just the path? How do I comment out a line?
Are lines terminated by semi-colons? How do I continue a line if I've got a
long path?" If you use xml, it's easier, because people know the conventions.

It is a potential problem that some third party may automatically generate
a Baby X resource compiler script in valid xml which looks to the reader
like a well-formed script file, but which in fact the Baby X compiler can't
parse. But it's unlikely to affect many people, and it's unlikely to be
hard to overcome. Ultimately if there's a demand that it accepts fully-featured
xml, then of course I'd consider replacing the vanilla xml parser with a
bigger module.
 
Ian Collins

Malcolm said:
You edit the file with a text editor. Or the program writes it to disk in
xml format, and you can open and examine it with a text editor.
For the vast majority of programs, it's possible to produce a file which
is valid xml but which defeats the loader in some way. If you declare a
clue as 2 across when 2 is at the head of a down-only word, for example,
a crossword program is going to have to reject your file. So if it also
rejects it if you set up some complex scheme involving namespaces and
entities, which it can't understand because it doesn't support those features,
it's not a qualitative change.

The alternative to using xml is to declare a specific syntax, as used by
the Microsoft resource compiler, which pre-dates xml.

Or use a simple, well known and supported format such as JSON. My
"full" JSON parser is about 200 lines of code.
 
Malcolm McLean

In many cases it would satisfy the most important requirement: you can
tick the "uses XML" checkbox. Unfortunately there is never a "uses
XML well" checkbox...

From xmlsoft.org:

The latest versions of libxslt can be found on the xmlsoft.org server. (NOTE that you need the libxml2, libxml2-devel, libxslt and libxslt-devel packages installed to compile applications using libxslt.) Igor Zlatkovic is now the maintainer of the Windows port, he provides binaries. CSW provides Solaris binaries, and Steve Ball provides Mac OS X binaries.

It's not that I don't appreciate what these people are doing. But this is
totally inappropriate for reading in a 1K or so list of maybe 20 images
and fonts. You only use that library if you have a need for heavy-duty
processing, where I'm sure it's good and often a sensible option.
 
Seebs

In many cases it would satisfy the most important requirement: you can
tick the "uses XML" checkbox. Unfortunately there is never a "uses
XML well" checkbox...

<proprietary_data>kasflkj13912ekahdkjha</proprietary_data>

I have seen some spectacular examples of that genre.

-s
 
Ian Collins

Robert Wessel wrote:

To drag back a bit of topicality, this is similar to a "C" compiler
for a small embedded system that doesn't support float, has 16 bit
longs and doesn't allow recursion. It may be a perfectly useful
language in its domain, and being C-like might very well improve the
learning curve for people using it. But it would be wrong to call it
"C".

People sometimes forget the "X" in XML stands for "eXtensible", I don't
think there's an equivalent for "subsetable" :)
 
BartC

Nobody said:
The output might be in a standard format (or it might not, if it's being
coded to match an implementation rather than a specification). But does
that really help if you can't do anything with the file beside load it
straight back in? If the parser won't read the result of editing the file
with e.g. xsltproc, there isn't a great deal of point in in having it in
XML in the first place.

XML looks deceptively simple at first sight. Maybe it actually is simple, as
far as syntax goes. So why the need for all these complicated libraries? And
what could xsltproc do to an XML file that would render it unreadable to a
simple parser?

You've got start-tags, end-tags, and attributes; what else is there?
Unsupported character escapes or minor things like that? It might be easier
to add support for that than struggle with someone else's over-the-top
implementation!

And what would you want to do to the file anyway? The data will obviously
only make sense to this specific application; if there's a problem with the
content, that is going to be a problem whatever library is used to read it.
 
Malcolm McLean

XML looks deceptively simple at first sight. Maybe it actually is simple, as
far as syntax goes. So why the need for all these complicated libraries? And
what could xsltproc do to an XML file that would render it unreadable to a
simple parser?

You've got start-tags, end-tags, and attributes; what else is there?
Unsupported character escapes or minor things like that? It might be easier
to add support for that than struggle with someone else's over-the-top
implementation!

And what would you want to do to the file anyway? The data will obviously
only make sense to this specific application; if there's a problem with the
content, that is going to be a problem whatever library is used to read it.
You've got the problems of character encodings, which is inherent in a world
that's moving from ASCII to unicode as a default for information exchange.
The vanilla xml parser supports only 8-bit chars, though if I extend it
unicode support is first on the list. However, if you use unicode, you
can't then embed string literals in calls to the parser, the wide
character libraries may be unavailable, and you can't read the output (I've
no way of knowing whether a string of characters that is pure gibberish to
me has been processed correctly or not, often even if it's displayed in the
right font). It's a source of endless problems.

Then they complicated the system with things like namespaces and entities.
There's a famous exploit called "billion laughs" which defines an entity
as "lol", then defines another entity as two of those, another as two of
those (so four), and so on, until you break an average parser with a file
containing a couple of hundred characters. The format would have been
better, in my view, if those things had never been added to it, and in
fact they're not needed for most applications. But a fully-featured parser
has to support it.
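The exploit Malcolm describes looks roughly like this (abbreviated; real versions nest many more levels, and the canonical form expands tenfold per level rather than doubling):

```xml
<?xml version="1.0"?>
<!DOCTYPE bomb [
  <!ENTITY lol  "lol">
  <!ENTITY lol2 "&lol;&lol;">
  <!ENTITY lol3 "&lol2;&lol2;">
  <!-- ... each level doubles; 30 levels is ~2^30 copies of "lol" ... -->
]>
<bomb>&lol3;</bomb>
```

A parser that expands entities naively allocates memory exponential in the size of the input, which is why a tiny file can break it.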

Then if the file is too big to fit in memory, or so big that the processor
takes non-trivial time to run through it, you need a complex system for
parsing it. There's a whole terminology for families of parsers that allow
different types of access. Some legitimate uses of xml can be quite big.
But of course, if you're just getting together a list of files and strings,
as with the Baby X resource compiler, then it's most unlikely that someone's
going to want to create a script with millions of items. So it's acceptable
just to load the whole thing into memory at once then find elements by
O(N) access functions.
 
Nobody

XML looks deceptively simple at first sight. Maybe it actually is simple, as
far as syntax goes. So why the need for all these complicated libraries? And
what could xsltproc do to an XML file that would render it unreadable to a
simple parser?

To an actual XML parser: nothing.

The problem comes when people try to parse XML using a bunch of regexps
which were obtained through trial and error (i.e. testing them on some
sample XML files and tweaking them until they work on those test cases).

That approach often leads to something which e.g. can't even handle
whitespace in any context where it didn't occur in the sample files.
You've got start-tags, end-tags, and attributes; what else is there?

Just getting those right is apparently too hard for some people. E.g.
attributes could be in any order (many XML parsers store attributes in an
associative array, so order is unlikely to be preserved), if whitespace is
allowed it can be any combination of whitespace characters, etc.
Unsupported character escapes or minor things like that? It might be easier
to add support for that than struggle with someone else's over-the-top
implementation!

And what would you want to do to the file anyway? The data will obviously
only make sense to this specific application; if there's a problem with the
content, that is going to be a problem whatever library is used to read it.

A good example is performing "bulk" processing, e.g. a simple search and
replace in many files (where the original application requires a dozen
mouse clicks to load and save each file plus another half a dozen for each
individual change).

If the data is in XML, you just need to cook up an XSL transformation
(or similar) then you can process all of the files with one command. Well,
unless the application's "XML" parser can't actually read anything other
than its own output, as you probably aren't going to find off-the-shelf
XML tools which offer the option of restricting to their output to that
which can be read by John Doe's pseudo-XML subset.

On the plus side, most of the real XML parsers were written by people who
still have the scars from trying to deal with what either Netscape or
Microsoft thought "HTML" meant. Consequently, they don't attempt to be
fault-tolerant (this may seem like a good idea in theory, but in practice
it means that every bug in a popular implementation ends up redefining the
de-facto standard until it's so complex that writing a parser which can
handle more than 50% of "HTML as deployed" is more work than the Apollo
program).

So at least we don't normally have to worry about the output being a
superset of the standard (if it doesn't conform, hardly anything will
parse it). We just have to worry about the hordes of strcmp-and-regexp
parsers turning the de-facto standard into an ever-shrinking subset of the
real thing.
 
Rui Maciel

Keith said:
If your data is not inherently recursive, perhaps XML is not the best
way to represent it.

In any case, it seems wasteful to write a custom parser that handles a
*subset* of XML when there are so many open source full XML parsers out
there.

This isn't exactly true. When a generic parser is adopted, the need to write a custom parser
doesn't go away. Instead, the only thing that is accomplished is that the job of writing a
single parser is replaced with two jobs: implement and maintain a third-party component, and
write a custom parser for the output of that generic parser. Whether it's through a schema
definition and/or through a set of routines, you're going to write that second parser.

A custom parser might be smaller in terms of total code size,
but a full parser is much smaller in terms of code *that you have to
write and maintain*.

This can only be true if you assume you don't have to parse the output of the generic parser,
and even then it's still highly debatable. For example, a custom recursive descent parser for
an INI-type document format can be written in less than 500 LoC, including a couple hundred LoCs
for hand-written state tables. This is your complete parser, which performs all data
validations you might wish for and handles any error with the document structure which you can
come up with. If you use a parser generator then you'll be able to implement that parser with a
fraction of those LoC.


Rui Maciel
 
