Ben Bacarisse
Ike Naar said:More confusion. Probably you meant:
... so '---abc;' tokenises to ...
Sigh. Yes, thanks.
Michael Press said:Ben Bacarisse said:Michael Press said:Thank you all for explaining this to me. I heard people
speak of a tokenizer and a semantic parser so I thought
that identifying tokens and finding meaning are two
entirely separate processes in the formal model used to
generate a machine executable. So by my, incorrect,
picture we first identify tokens. ---x produces the
list ("-", "-", "-", "x", ";"). Then the grammar finds
meaning.
That description is not wrong. The only part that is wrong is the
actual list of tokens. The process *is* a two-phase one[1]: first find
the tokens and then use the grammar to find the structure (I'd reserve
the word "meaning" for something else, but that really is an unimportant
detail).
It's not clear from your example if you thought that a token was just a
synonym for a character (it would have been clear if you'd used a
multi-character variable rather than 'x')
I know that abc is a token.
---abc -> ("-", "-", "-", "abc").
but one way or another all you
got wrong was the details of the rule used for finding the tokens.
Yes. The reason I got it wrong is that I considered
going from the string "--" to the operator pre-decrement
to be a two-step affair.
Because making "--" a token is looking ahead to the phase
where meaning for the C programming language is found.
I disagree. The token "--" can only be picked out by
defining it to be a token a priori; and we do that only
because we know in a later phase it will be found to have meaning.
Roberto Waltman said:... tokens so '---abc;' tokenizes ...
Thank you all for explaining this to me. I heard people
speak of a tokenizer and a semantic parser so I thought
that identifying tokens and finding meaning are two
entirely separate processes in the formal model used to
generate a machine executable. ...
... So by my, incorrect,
picture we first identify tokens. ---x produces the
list ("-", "-", "-", "x", ";").
... Then the grammar finds
meaning. Since that list could be parsed to a
meaningful C construct, it would be. To my way of
thinking the token identifying phase and the phase that
finds meaning are confused when the token identifier
decides that the meaning of ---x is (--)-x.
Ben Bacarisse said:Michael Press said:Ben Bacarisse said:<snip>
Thank you all for explaining this to me. I heard people
speak of a tokenizer and a semantic parser so I thought
that identifying tokens and finding meaning are two
entirely separate processes in the formal model used to
generate a machine executable. So by my, incorrect,
picture we first identify tokens. ---x produces the
list ("-", "-", "-", "x", ";"). Then the grammar finds
meaning.
That description is not wrong. The only part that is wrong is the
actual list of tokens. The process *is* a two-phase one[1]: first find
the tokens and then use the grammar to find the structure (I'd reserve
the word "meaning" for something else, but that really is an unimportant
detail).
It's not clear from your example if you thought that a token was just a
synonym for a character (it would have been clear if you'd used a
multi-character variable rather than 'x')
I know that abc is a token.
---abc -> ("-", "-", "-", "abc").
but one way or another all you
got wrong was the details of the rule used for finding the tokens.
Yes. The reason I got it wrong is that I considered
going from the string "--" to the operator pre-decrement
to be a two-step affair.
It is. Seriously, bear with me... If I write (as the only text in a C
program)
-- -- --
this is tokenised into three '--' tokens but it has no meaning. It is
not even clear if I have a pre- or a post-decrement operator or some
mixture of them. Since the source does not parse, in effect I have
neither -- it could be argued that there are no operators there at all.
When a token like '*' is recognised, it can't yet be decided what it
means. There are three possible meanings that come to mind
(multiplication, pointer declarator syntax and pointer dereference, but
there are more, at least in C99). The token is an abstract thing with
no meaning until the parser decides what it is. '--' has only two
meanings, and they are closely related, so it's easy to think of '--'
as always being decrement, but in fact it is just '--' until the parser
can give it a meaning.
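For example, all three of those uses can appear in a few lines of
ordinary C (a small illustrative program; the variable names are
arbitrary):

    #include <stdio.h>

    int main(void)
    {
        int a = 6, b = 7;
        int c = a * b;             /* '*' as the multiplication operator    */
        int *p = &c;               /* '*' in a pointer declarator           */
        int d = *p;                /* '*' as the unary dereference operator */
        printf("%d %d\n", c, d);   /* prints: 42 42 */
        return 0;
    }

The tokeniser emits exactly the same '*' token in all three places; only
the parser tells the three uses apart.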
Because making "--" a token is looking ahead to the phase
where meaning for the C programming language is found.
I can see how you think that, but there is no meaning. There is a
fixed set of strings starting with '-' that can be tokens ('-', '--',
'->' and '-=') and the tokeniser chooses the longest one that it finds.
There is no need to consider what any of these mean, or indeed whether
any make sense -- whichever is the longest is the one that will be chosen.
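To make that concrete, here is a minimal sketch of how a lexer might
pick out the '-' family (the function name is invented for the sake of
illustration; a real lexer applies the same longest-match rule across
the whole punctuator list):

    #include <string.h>

    /* Return the length of the '-' punctuator starting at s, trying the
       two-character candidates before the one-character one.  That is
       all "maximal munch" amounts to: only the fixed list and the
       lengths matter, never what the tokens mean. */
    static int dash_token_len(const char *s)
    {
        if (strncmp(s, "--", 2) == 0) return 2;
        if (strncmp(s, "->", 2) == 0) return 2;
        if (strncmp(s, "-=", 2) == 0) return 2;
        if (s[0] == '-') return 1;
        return 0;   /* no '-' punctuator here */
    }

Given "---abc", the first call returns 2 ('--'), the next returns 1
('-'), and the identifier rule then picks up 'abc'.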
I disagree. The token "--" can only be picked out by
defining it to be a token a priori; and we do that only
because we know in a later phase it will be found to have meaning.
Yes, all the tokens *are* defined up front and quite independently of
whether they will or will not have any meaning later in the parse. One
example is the '*' that I talked about above. It is a token with many
possible meanings but it can be picked out as a token without any
reference to any of them. Perhaps more revealing is that the input
a\b
has three tokens (technically pre-processor tokens, but let's not get
into that difference!) namely 'a', '\' and 'b' despite the fact that the
middle one has no meaning at all. (Notation is a problem now -- I'm
using ''s to delimit a token with no reference to C's character
constants.)
[Aside: I should not have talked about syntax because that can be
confusing in this context. C has two levels of grammar and they are
collected together in Appendix A, which is "the syntax". The first
part is called the "lexical grammar" and it describes the approximate
set of rules for recognising tokens. It's not quite all there because
of some details to do with header file names and phases 1 and 2. The
second part (the "phrase structure grammar" as it is called) is what
most people refer to as the syntax of C. The upshot is that for this
purpose, I prefer the term you use that I criticised earlier: "the
meaning"!]
<snip>
Thank you all for explaining this to me. I heard people
speak of a tokenizer and a semantic parser so I thought
that identifying tokens and finding meaning are two
entirely separate processes in the formal model used to
generate a machine executable. So by my, incorrect,
picture we first identify tokens. ---x produces the
list ("-", "-", "-", "x", ";"). Then the grammar finds
meaning.
That description is not wrong. The only part that is wrong is the
actual list of tokens. The process *is* a two-phase one[1]: first find
the tokens and then use the grammar to find the structure (I'd reserve
the word "meaning" for something else, but that really is an unimportant
detail).
It's not clear from your example if you thought that a token was just a
synonym for a character (it would have been clear if you'd used a
multi-character variable rather than 'x')
I know that abc is a token.
---abc -> ("-", "-", "-", "abc").
but one way or another all you
got wrong was the details of the rule used for finding the tokens.
Yes. The reason I got it wrong is that I considered
going from the string "--" to the operator pre-decrement
to be a two-step affair.
It is. Seriously, bear with me... If I write (as the only text in a C
program)
-- -- --
this is tokenised into three '--' tokens but it has no meaning. It is
not even clear if I have a pre- or a post-decrement operator or some
mixture of them. Since the source does not parse, in effect I have
neither -- it could be argued that there are no operators there at all.
I agree that the token identifier does not find meaning.
So in --abc it identifies ("--", "abc", ";").
When presented with ---abc it should generate ("---", "abc", ";")
[email protected] said:On Mar 23, 11:02 pm, Michael Press <[email protected]> wrote:
But "--" (and "-") are tokens in C, while "---" is not. Specifically
they're "punctuator" tokens. There's an explicit list in the standard
(see the snippet below from C99). Other tokens are keywords (another
explicit list - "for", "if", etc.), constants (integers, floats,
characters, enums, and variants of those - for example, hex and octal
integers), identifiers (user defined names of various sorts), string
literals, header names, preprocessing numbers and comments.
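In a hand-written lexer those categories typically end up as something
like the following enumeration (a sketch only: the names here are
invented, and the standard actually splits the classification between
preprocessing tokens and tokens):

    /* Illustrative token categories; the names are not from the standard. */
    enum token_kind {
        TOK_KEYWORD,      /* "for", "if", ...                       */
        TOK_IDENTIFIER,   /* user-defined names                     */
        TOK_CONSTANT,     /* integer, floating, character constants */
        TOK_STRING,       /* string literals                        */
        TOK_PUNCTUATOR,   /* "-", "--", "->", "*", ...              */
        TOK_HEADER_NAME,  /* <stdio.h> in a #include line           */
        TOK_PP_NUMBER     /* preprocessing numbers                  */
    };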
So in --abc it identifies ("--", "abc", ";").
When presented with ---abc it should generate ("---", "abc", ";")
Some of what you describe accords entirely with my
understanding of grammars and parsing. First identify
tokens, then attempt to find meaning in a language
using a grammar that embodies the language. When
the token identifier takes ---abc; and generates
("--", "-", "abc", ";") it crosses the line into
assigning meaning.
In any other context this would be a pointless nit-pick, but here I
think it may matter. Enum constants are not pre-processor tokens and
these are what are being discussed, I think (i.e. this whole discussion
is really about translation phase 3). If they were pp-tokens, it would
indeed suggest the sort of meaning that Michael Press is concerned
about. The tokeniser is already recognising identifiers, so being able
to find enum constants (which are just identifiers that have been
declared in a particular way) would mean that the tokeniser does, in
fact, "know too much".
'Doctor, doctor, it hurts when I do this!'
'Then stop doing that.'
Yeah, Ritchie got some ideas out of Algol 68, and then messed them up because he
didn't fully understand them. This is one of them, for which Algol 68 had a
solution involving TADs, TAMs, NOMADs, and MONADs. But this is one mistake that
it's too late to fix, so just put in the extra space character and forget about it.
China Blue Meanies said:Do you still use =+ =- or =* ?
Kenneth Brody said:'Doctor, doctor, it hurts when I do this!'
'Then stop doing that.'
Yeah, Ritchie got some ideas out of Algol 68, and then messed them up because he
didn't fully understand them. This is one of them, for which Algol 68 had a
solution involving TADs, TAMs, NOMADs, and MONADs. But this is one mistake that
it's too late to fix, so just put in the extra space character and forget about it.
At least this way there's no arguing that the compiler picked the
"wrong" interpretation[1]. I don't know anything about Algol, but how
would it resolve something like "x---y", which has two "valid"
interpretations if not for the "maximal munch" rule?
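A quick test shows what the C rule gives (a minimal program; it prints
the values noted in the comments):

    #include <stdio.h>

    int main(void)
    {
        int x = 5, y = 2;
        int r = x---y;             /* tokenised as x -- - y, i.e. (x--) - y */
        printf("%d %d\n", r, x);   /* prints: 3 4 -- x was post-decremented */
        return 0;
    }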
James Kuyper said:On 03/24/2011 12:02 AM, Michael Press wrote:
...
But "---" is not a valid C token, so if it the tokenizer did generate
that tokenization, the parser would have to reject the code as a syntax
error. That string can be broken up into a sequence of valid tokens as
either {"--", "-"}. or {"-", "--"}, or {"-", "-", "-"}. The C standard
mandates the first option.
...
The rule that mandates interpreting "---" as {"--", "-"} says nothing
about the meaning of those tokens; it's based purely upon their length.
Michael Press said:They would not be picked out specially if it were not
known that they would have meaning later.
So what is a token? Somebody in this thread said they
have no meaning. If that were so, any digraph of
non-alphanumeric characters might be a token. Since
there is a special list of tokens then we are already
assigning meaning, or have in mind a number of meanings
such as with "*".
Not that it has any weight, my picture was that meaning
was found through a grammar that finds its way through
token lists, and any string of characters jammed
together without white space that is not an identifier
is passed on as is. In principle the token identifier could be
easily modified to identify tokens for Perl, bash, Java, Fortran, ...
Kenneth Brody said:At least this way there's no arguing that the compiler picked the
"wrong" interpretation[1]. I don't know anything about Algol, but how
would it resolve something like "x---y", which has two "valid"
interpretations if not for the "maximal munch" rule?
The scheme referred to is only for Algol 68, by the way. In Algol 68
"x---y" must be parsed as "x - (-(-y))" and this remains true even if
there are arbitrary user-defined operators in scope. This is because
the *second* symbol is a monad: a symbol that can be, on its own, a
monadic (i.e. unary) operator. By preventing any operator from having a
monad as its second symbol, the parser can always tell (given only a
little context) whether a monadic or dyadic (binary) operator is meant.
The other operator symbols are called nomads and the rules for permitted
operators are based on these symbol sets (I think they were given in
another post).
It's a clever scheme, but it is not as easy to grasp as C's rule.
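Spelling both readings out in C terms (the values are chosen arbitrarily
for illustration):

    #include <stdio.h>

    int main(void)
    {
        int x1 = 5, x2 = 5, y = 2;
        int c_reading   = (x1--) - y;    /* C's maximal-munch parse of x---y   */
        int a68_reading = x2 - (-(-y));  /* the Algol 68 parse described above */
        printf("%d %d\n", c_reading, a68_reading);   /* prints: 3 3 */
        printf("%d %d\n", x1, x2);                   /* prints: 4 5 */
        return 0;
    }

The two parses happen to produce the same value here; the visible
difference is the side effect -- only the C reading decrements x.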