File-Reading Best Practices?

Stefan Ram · Apr 4, 2010

Jonathan Lee said:
Is there some mandate that obliges me to stay within
the exact context?

You should know whether you want to make assertions
about the real teacher the OP refered to or an invented
teacher of an invented example. What applies to an
invented example must not necessarily apply to the
real world situation of the OP.

Jonathan Lee · Apr 4, 2010

You should know whether you want to make assertions
about the real teacher the OP refered to or an invented
teacher of an invented example.

I did know. I invented one. It was a guess. I'm not referring
to an absolute truth that I know with epistemic certainty.
It was a conjecture. I'm hypothesizing. Aforementioned
hazarded assertion can be considered or rejected at the
reader's discretion. My normal mode of speech is to represent
my own opinion, as opposed to a constant string of facts.
If this is somehow unusual, a transgression, faux pas,
incorrect use of grammar, or otherwise undesirable in this
forum, I am deeply, deeply, truly, honestly, sorry.

Really.

--Jonathan

Andreas Wenzke · Apr 4, 2010

Jonathan said:
You could write an LL(k) parser using the EBNF grammar provided by the
XML specification. I think I read somewhere that XML is LL(1) so you
could
get by reading a char at a time.

I think the parser table would be huge...

But, in fact the XML I'm reading contains a context-free grammar and the
actual project I'm working on is a LR(1) parser-table generator...

Jonathan Lee · Apr 4, 2010

I think the parser table would be huge...

Maybe. It's just that, as has been pointed out, XML is complicated.
By time you make all the allowances you may as well have written
a general parser.

Of course, I think you've also said that you're allowed to make
a lot of assumptions (eg., iso-8859-1 and such). So maybe it
won't be that much work. You would know better than I would.

But, in fact the XML I'm reading contains a context-free grammar and the
actual project I'm working on is a LR(1) parser-table generator...

Err.. then I guess I don't understand where your buffering
problems are coming in. If you only need 1 token of look-ahead...

But maybe that was answered elsewhere. I haven't read the
entirety of the thread.

--Jonathan

Andreas Wenzke · Apr 4, 2010

Jonathan said:
Err.. then I guess I don't understand where your buffering
problems are coming in. If you only need 1 token of look-ahead...

My project assignment is to write an LR(1) parser-table generator for
context-free grammars given in XML format.
For instance, something like this:

<grammar start-symbol="S">
<rule lhs="S">
<terminal>foo</terminal>
</rule>
</grammar>

That would be the XML for a grammar that only contains one rule:

S --> foo

And my program should build a LR(1) parser table from that.
I don't have to write the actual LR(1) parser, though.

Anyway, my original question was how one would best read the input
grammar. In this case it's XML, yes, but I might as well have to read
some other format.

Stefan Ram · Apr 4, 2010

Andreas Wenzke said:
Anyway, my original question was how one would best read the input
grammar. In this case it's XML, yes, but I might as well have to read
some other format.

To learn how to write readers (=parser+scanner) in C++ not
using »STL« C++ features, you could read books about
compiler design in C, such as

Alan Holub, "Compiler Design in C," Prentice-Hall, 1990, ISBN
0-13-155045-4, or

Bennett, J.P. "Introduction to Compiling Techniques - A First Course Using
Ansi C, Lex and Yacc," McGraw Hill Book Co, 1990, ISBN 0-07-707215-4.

Andreas Wenzke · Apr 4, 2010

Andrew said:
> So he's teaching C++ as a "shitty C"?

No, not at all.
The lecturer really does a very good job, and knowing him and having
many years of programming experience from other languages, I think I am
more competent to judge.
So please leave him alone. Thanks.

You need to write a lexer whose job is to read entire tokens.
You can do that by reading individual characters into a token
class. The lexer's job is to make sure that an "<" and an ">"
and a "html" are all distinct, complete entities.

The actual XML-parsing work will then be done on those tokens.

Sounds good.
How would you build the token class, then?

Andreas Wenzke · Apr 4, 2010

Leigh said:
I suspect the lecturer wants you to work this out for yourself?

Actually, he probably doesn't as the actual assignment is the
implementation of a parser-table generator. ;-)

Andrew Poelstra · Apr 5, 2010

No, not at all.
The lecturer really does a very good job, and knowing him and having
many years of programming experience from other languages, I think I am
more competent to judge.
So please leave him alone. Thanks.

I'm not insulting him, merely his choice of language - if
you want to use C++ without the STL, chances are you would
be better off using a simpler language like C, which gets
in your way a /lot/ less often.

Actually, for a project like this, I don't see any benefit
of C++ over C. It's just added complexity.

Sounds good.
How would you build the token class, then?

Well, there probably wouldn't be a lot to it. To start, you need
to store what kind of token it is - a <, >, />, tag, entity, etc,
and also the exact text, so that given a tag, you know how to
tell what it is.

From there you read in the tokens, ensure you are getting tokens
that make sense in their context, and build an XML tree or parse
table or whatever it is that you want to do.

If it isn't overkill, you might want to check out the source
of the lemon parser-generator and see how it does its thing.

Andreas Wenzke · Apr 5, 2010

Andrew said:
I'm not insulting him, merely his choice of language - if
you want to use C++ without the STL, chances are you would
be better off using a simpler language like C, which gets
in your way a /lot/ less often.

Actually, for a project like this, I don't see any benefit
of C++ over C. It's just added complexity.

He did teach an introduction into object-oriented programming.
But you just can't fit both an introduction into imperative,
object-oriented programming as well as some simple datastructures
(tries, linked lists, ...) and the STL into one class.
Unless you accept shitty results, that is.

From there you read in the tokens, ensure you are getting tokens
that make sense in their context, and build an XML tree or parse
table or whatever it is that you want to do.

I have come to the conclusion that it just doesn't make sense to
implement the XML parsing myself.
It would cost too much time only to implement a very basic, bug-prone
parser, so I'll think I'll go for the third-party library after all.
I'll start a new thread about that, though.

Thanks to everyone for their input and giving me enlightenment in the
end. ;-)

James Kanze · Apr 6, 2010

James Kanze schrieb:

is valid XML, as far as I know.

Yes, and it contains 6 tokens: '<', 'foo', 'attr' '=' '"value"'
and '/>'. XML is a little special, since it's very context
dependent, but in most contexts, white space (including new
lines) cannot be part of a token, and in the few where it can,
most normal tokens shouldn't be recognized, so you'll need
separate scanning logic anyway.

Am I mistaken or is this three times the same suggestion?

Yes. I'm typing on a laptop, so typing errors are frequent.
And I'm using a real editor, so one character can end up
repeating the previous command (e.g. insertion).

I initially wanted to implement a finite-state machine (using
an enum for the states), but soon realized there essentially
always is a fixed order:

1. Try to read a BOM
2. Try to read an XML declaration
3. Ignore any whitespace
4. Read the root element
5. Read the first child element
...

So what I have so far are several SkipXXX methodes (SkipBOM,
SkipWhitespace) and so on, each of which advances the char pointer in
the buffer.
As soon as it's tried to move to/past the end of the buffer, the
buffer's contents are memmove'd to the beginning of the buffer and the
remainder is refilled with data from the stream.

I'll admit that XML is a bit special, and where you are in the
file makes a difference. In particular, I'd start by reading
the first four characters, in order to make a guess as to the
encoding; if there is a BOM, skip it, but for other guesses, set
up the correct input encoding, and reread from start. And I
would likely use a different state machine for the declaration
than for the rest. In addition, in specific contexts (e.g.
after having seen '<!--'), I'd also use a different state
machine. And I'd probably do some input filtering (e.g.
normalizing newlines) before running the state machine.

What do you think of that approach?

That's more or less the next level. The state machine allows
splitting the input up into tokens, and the next level takes
care of assembling those tokens into elements and their
associated data.

James Kanze · Apr 6, 2010

* Pete Becker:

How about seven?

< foo attr = "value" / >

Well I haven't checked the grammar either. ;-)

I actually did check it, not too long ago, and it's six

.
(Unless, of course, the sequence is in a CDATA section.)

Jorgen Grahn · Apr 6, 2010

I disagree with that assertion; there is no single best-practice. See below.

Why universities prohibit STL?

Stupidity. There is no other word for it.

I think the simplest way to read a file is by using a memory-mapped
files. They are not standard though. Does your university allow them?

If you use mmap() you cannot read from non-file streams, like standard
input. That can be a major annoyance for users, and limit the
usefulness of your program. I'm not saying it's never to be used;
just be careful with it.

Personally I almost always read line by line -- but I almost always
deal with line-oriented text files too. With XML you can have
gigabytes of data between <tag> and </tag> (or between < and >),
so that strategy is unfortunately not relevant.

/Jorgen

Reading a file using istearm	4	Dec 13, 2007
reading in and parsing through a binary file	9	Feb 2, 2009
How to unget a line when reading from a file/streamiterator/generator?	5	Apr 28, 2008
Reading little-endian data from a file in a portable manner	46	Jul 16, 2010
reading and parsing fixed length text file	1	Nov 8, 2005
ASP.NET 2.0/C# Response to client is masterpage instead of file.	2	Jul 31, 2007
Reading file error	2	Feb 19, 2004
Redirect COUT to file	6	Jul 20, 2004

File-Reading Best Practices?

Stefan Ram

Jonathan Lee

Andreas Wenzke

Jonathan Lee

Andreas Wenzke

Stefan Ram

Andreas Wenzke

Andreas Wenzke

Andrew Poelstra

Andreas Wenzke

James Kanze

James Kanze

Jorgen Grahn

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads