FAQ Topic - How can I create a Date object from a String? (2011-02-15)

John G Harris

The primary way this works is by having each production continue until
meeting some character not allowed in the production. In this case this
means that the Identifier production will parse straight through the
entire string until encountering whitespace, a line terminator, a
punctuator or a div punctuator.
This is really quite simple...
<snip>

There! You've said it yourself. The lexical parser takes in as many
characters as it can when parsing the next input element (identifier,
whitespace, etc). It does this not because the programmer thought it was
a good idea, nor because it is a common practice, but because the
language standard says the parser bloody well *must* do so. The standard
uses the now-famous text

"The source text of an ECMAScript program is first converted into a
sequence of input elements, which are tokens, line terminators,
comments, or white space. The source text is scanned from left to right,
repeatedly taking the longest possible sequence of characters as the
next input element."

to say so. This is really even simpler. I hope Evertjan has understood
that this text supplements the syntax specification.
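
As a rough illustration of the "longest possible sequence" rule, here is a minimal scanner sketch. It is an assumption-laden toy, not the standard's grammar: the names RULES and tokenize are made up for this example, identifier names are ASCII-only, and only a few punctuators are recognised. Deciding whether an IdentifierName is a reserved word would happen after the scan and is left out.

```javascript
// Toy scanner: at each position every rule is tried and the LONGEST match
// becomes the next input element. RULES and tokenize are hypothetical names;
// this covers only a tiny fragment of the ECMAScript lexical grammar.
const RULES = [
  ["WhiteSpace", /^[ \t]/],
  ["IdentifierName", /^[A-Za-z_$][A-Za-z0-9_$]*/],
  ["Punctuator", /^(?:===|==|=|\(|\)|;)/],
];

function tokenize(source) {
  const elements = [];
  let pos = 0;
  while (pos < source.length) {
    const rest = source.slice(pos);
    let best = null;
    for (const [type, pattern] of RULES) {
      const match = pattern.exec(rest);
      // Keep whichever candidate consumes the most characters.
      if (match && (best === null || match[0].length > best.text.length)) {
        best = { type, text: match[0] };
      }
    }
    if (best === null) {
      throw new SyntaxError("Unexpected character at position " + pos);
    }
    elements.push(best);
    pos += best.text.length;
  }
  return elements;
}

// "newb" comes out as ONE IdentifierName, never as "new" followed by "b".
console.log(tokenize("a = newb();").map((e) => e.type + " " + JSON.stringify(e.text)));
```

With a space in the input, as in "new b", the same sketch yields three elements: an identifier name, white space, and another identifier name.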

John
 
John G Harris

That follows from how (the) source code is parsed. The tokenizer (or:
scanner) sees

a = newb();

and applies

InputElementRegExp ::
WhiteSpace
LineTerminator
Comment
Token
RegularExpressionLiteral

repeatedly, resulting in the following input elements:

"a"      Token : IdentifierName
" "      WhiteSpace
"="      Token : Punctuator
" "      WhiteSpace
"newb"   Token : IdentifierName
"("      Token : Punctuator
")"      Token : Punctuator
";"      Token : Punctuator
<snip>

The lexical syntax specification says what valid strings can be
produced, which is why they are called production rules. The preamble to
the syntax says that the source text is deemed to be a sequence of input
elements. Let us see what can be produced by two adjacent input
elements.

Assume the left hand one can't be a division operator (it makes no
difference here anyway) so we have

InputElementRegExp InputElementRegExp

One possibility is that each of these is a token so we have

Token Token

One possibility is that each of these is an identifier name (ES5 naming
scheme) so we have

IdentifierName IdentifierName

One possibility is that an identifier name is a reserved word; another
is that it is an identifier, so we can have

ReservedWord Identifier

One possibility for a reserved word is a keyword, so we can have

Keyword Identifier

One possibility for a keyword is 'new'; one possibility for an
identifier is identifier start (to keep it simple), so we can have

new IdentifierStart

One possibility for an identifier start is a Unicode letter, so we can
have

new UnicodeLetter

Let us choose the letter 'b' for the Unicode letter, so we finally have
the adjacent input elements

new b

with NOTHING IN BETWEEN THE ELEMENTS.


Only the rule "The source text is scanned from left to right, repeatedly
taking the longest possible sequence of characters as the next input
element." can stop the parser treating this as the production forest
that was used to produce the input substring 'newb'.
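
The effect can be checked in any engine. The sketch below is only an illustration: the functions b and newb and the helper run are made up for this example, and the source strings are handed to the Function constructor so that the engine's own scanner decides how to split them.

```javascript
// Hypothetical functions, used only to tell the two readings apart.
function b() { this.kind = "constructed by b"; }
function newb() { return "result of calling newb"; }

// Compile a function body from a string, with b and newb passed in as parameters.
const run = (body) => new Function("b", "newb", body)(b, newb);

const joined = run("return newb();");  // scanned as the single identifier newb
const spaced = run("return new b();"); // scanned as the keyword new, then the identifier b

console.log(joined);            // the string returned by newb
console.log(spaced instanceof b); // true: an object constructed by b
```

If the scanner were free to split 'newb' into 'new' and 'b', the first call would construct a b object instead of calling newb.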


John
 
