James Kanze schrieb:
is valid XML, as far as I know.
Yes, and it contains 6 tokens: '<', 'foo', 'attr' '=' '"value"'
and '/>'. XML is a little special, since it's very context
dependent, but in most contexts, white space (including new
lines) cannot be part of a token, and in the few where it can,
most normal tokens shouldn't be recognized, so you'll need
separate scanning logic anyway.
Am I mistaken or is this three times the same suggestion?
Yes. I'm typing on a laptop, so typing errors are frequent.
And I'm using a real editor, so one character can end up
repeating the previous command (e.g. insertion).
I initially wanted to implement a finite-state machine (using
an enum for the states), but soon realized there essentially
always is a fixed order:
1. Try to read a BOM
2. Try to read an XML declaration
3. Ignore any whitespace
4. Read the root element
5. Read the first child element
...
So what I have so far are several SkipXXX methodes (SkipBOM,
SkipWhitespace) and so on, each of which advances the char pointer in
the buffer.
As soon as it's tried to move to/past the end of the buffer, the
buffer's contents are memmove'd to the beginning of the buffer and the
remainder is refilled with data from the stream.
I'll admit that XML is a bit special, and where you are in the
file makes a difference. In particular, I'd start by reading
the first four characters, in order to make a guess as to the
encoding; if there is a BOM, skip it, but for other guesses, set
up the correct input encoding, and reread from start. And I
would likely use a different state machine for the declaration
than for the rest. In addition, in specific contexts (e.g.
after having seen '<!--'), I'd also use a different state
machine. And I'd probably do some input filtering (e.g.
normalizing newlines) before running the state machine.
What do you think of that approach?
That's more or less the next level. The state machine allows
splitting the input up into tokens, and the next level takes
care of assembling those tokens into elements and their
associated data.