Parse::RecDescent: special treatment for keywords

A. Farber

Hi,

I'm trying to parse this kind of text:

TARGET Snake.app
TARGETTYPE app
UID 0x100079be 0x103F5BE8
TARGETPATH \system\apps\Snake
SOURCEPATH ..\UiSrc
SOURCE CSApplication.cpp CSAppUI.cpp CSDocument.cpp
CSView.cpp CSViewControl.cpp CSSettingsDialog.cpp
SOURCE CSHighScoreDialog.cpp CSKeyboardReader.cpp
CSGameDrawer.cpp CSPauseNoteDialog.cpp
CSHighscoreStore.cpp CSPlayDialog.cpp
CSHelpDialog.cpp
SOURCE CSConnectionNoteDialog.cpp
CSAsyncWait.cpp // added 13.02.2004
CSGameOverNoteDialog.cpp // added 24.7.2001
SOURCE ..\audiosrc\CSoundBank.cpp CdBitmapManager.cpp
DOCUMENT ..\group\Snake.loc

by the following grammar:

startrule: comment(s) | directive(s)
comment: m{//[^\\n]*}
directive: keyword value(s)
{ print "KEYWORD: $item{keyword}\n" }
value: file | type | uid
file: m{[\\w\\\\/.-]+}
type: 'app' | 'dll'
uid: /0[xX][0-9a-fA-F]+/
keyword:
'AIF' |
'DOCUMENT' |
'LANG' |
'LIBRARY' |
'RESOURCE' |
'SOURCE' |
'SOURCEPATH' |
'SYSTEMINCLUDE' |
'TARGET' |
'TARGETPATH' |
'TARGETTYPE' |
'UID' |
'USERINCLUDE'

But I only get a single line printed out:

KEYWORD: TARGET

Probably because the very first line is being
parsed as the "keyword" TARGET, with all the remaining
words in the file being parsed as "value"s.

I've tried to change the last rule in my grammar to
a set of regexes (which has uglified it as well):

keyword:
/^\s*AIF/ |
/^\s*DOCUMENT/ |
/^\s*LANG/ |
/^\s*LIBRARY/ |
/^\s*RESOURCE/ |
/^\s*SOURCE/ |
/^\s*SOURCEPATH/ |
/^\s*SYSTEMINCLUDE/ |
/^\s*TARGET/ |
/^\s*TARGETPATH/ |
/^\s*TARGETTYPE/ |
/^\s*UID/ |
/^\s*USERINCLUDE/

But the problem persists. I wonder if there is a
nice way to solve this (probably frequent) problem?
I.e. I'd like the words matching the keyword list
above to be parsed as "keyword", not as "value",
provided they are found at the beginning of a line.

Thank you for any suggestions
Alex

PS: Also, is there a way to make the grammar
case-insensitive, without using /regexes/i
(since that would make my grammar less readable)?
 
Anno Siegel

A. Farber said:
> Hi,
>
> I'm trying to parse this kind of texts:
>
> TARGET Snake.app
> TARGETTYPE app
> UID 0x100079be 0x103F5BE8
> TARGETPATH \system\apps\Snake
> SOURCEPATH ..\UiSrc
> SOURCE CSApplication.cpp CSAppUI.cpp CSDocument.cpp
> CSView.cpp CSViewControl.cpp CSSettingsDialog.cpp
> SOURCE CSHighScoreDialog.cpp CSKeyboardReader.cpp
> CSGameDrawer.cpp CSPauseNoteDialog.cpp
> CSHighscoreStore.cpp CSPlayDialog.cpp
> CSHelpDialog.cpp
> SOURCE CSConnectionNoteDialog.cpp
> CSAsyncWait.cpp // added 13.02.2004
> CSGameOverNoteDialog.cpp // added 24.7.2001
> SOURCE ..\audiosrc\CSoundBank.cpp CdBitmapManager.cpp
> DOCUMENT ..\group\Snake.loc
>
> by the following grammar:
>
> startrule: comment(s) | directive(s)

This will allow a sequence of comments, or a sequence of directives,
but not both. Your example data has trailing comments mixed with
directives. How do you expect this grammar to parse those?

> comment: m{//[^\\n]*}
> directive: keyword value(s)
> { print "KEYWORD: $item{keyword}\n" }
> value: file | type | uid

Since "file" matches "type" and "uid" items too, file must be the
*last* alternative. Otherwise "file" will catch everything, and
"type" and "uid" don't get a chance.

> file: m{[\\w\\\\/.-]+}

I don't know how you quoted that string, but there seem to be too many
backslashes in that pattern.

> type: 'app' | 'dll'
> uid: /0[xX][0-9a-fA-F]+/
> keyword:
> 'AIF' |
> 'DOCUMENT' |
> 'LANG' |
> 'LIBRARY' |
> 'RESOURCE' |
> 'SOURCE' |
> 'SOURCEPATH' |

Again, "SOURCE" will match before "SOURCEPATH" gets a chance. Swap them.

> 'SYSTEMINCLUDE' |
> 'TARGET' |
> 'TARGETPATH' |
> 'TARGETTYPE' |

Same problem. "TARGET" should come last.

> 'UID' |
> 'USERINCLUDE'
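
The ordering problem noted for "SOURCE" and "TARGET" can be seen in isolation with a small sketch like the one below. The three rule names (short_first, long_first, bounded) are made up for the demonstration and are not part of the grammar under discussion; the third rule is an assumption of mine about an alternative fix, using a word boundary instead of reordering.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical rules that differ only in how the alternatives are written.
my $grammar = <<'EOG';
short_first: 'SOURCE' | 'SOURCEPATH'
long_first:  'SOURCEPATH' | 'SOURCE'
bounded:     /SOURCE\b/ | /SOURCEPATH\b/
EOG

my $parser = Parse::RecDescent->new($grammar) or die "bad grammar";
my $input  = 'SOURCEPATH ..\UiSrc';

# Call each rule on the same input and record which keyword text it matched.
my %matched = map { $_ => $parser->$_($input) }
              qw(short_first long_first bounded);

printf "%-11s matched %s\n", $_, $matched{$_} for sort keys %matched;
```

Alternatives are tried in the order written and the first success wins, so short_first should stop at SOURCE and leave "PATH" behind, while the other two rules should take the whole keyword.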

> But only get the single line printed out:
>
> KEYWORD: TARGET
>
> Probably because the very first line is being
> parsed as the "keyword" TARGET with all the rest
> words in the file being parsed as "value"s.

Right, that's what happens.

By default, Parse::RecDescent assumes a free-form grammar where any
kind of white space can serve as a token separator. Since the line
structure is important in your case, you must change the default
token prefix to something that doesn't eat line feeds.

$Parse::RecDescent::skip = '[ \t]*';

is a way of doing this, but a <skip: ...> directive may be more
appropriate. You will then *somehow* have to include an end-of-line
element in your grammar to control the effect of line-ends.
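
To make the effect of the prefix visible, here is a rough sketch that parses the same two-line text under both settings. The toy rules "words" and "word", the helper count_words and the sample text are invented for the illustration. The prefix is set before the parser is built and stays in effect while it runs, which should work no matter at which point the module reads the variable.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical toy grammar: "words" returns how many words it collected.
my $grammar = <<'EOG';
words: word(s) { $return = scalar @{ $item[1] } }
word:  /\w+/
EOG

my $text = "TARGET Snake\nUID 0x1\n";

# Build a parser and run it while the given prefix pattern is in effect.
sub count_words {
    my ($skip) = @_;
    local $Parse::RecDescent::skip = $skip;
    my $parser = Parse::RecDescent->new($grammar) or die "bad grammar";
    return $parser->words($text);
}

my $free_form  = count_words('\s*');     # default prefix: newlines are skipped too
my $line_bound = count_words('[ \t]*');  # blanks and tabs only: stops at the newline

print "default prefix:   $free_form words\n";
print "line-wise prefix: $line_bound words\n";
```

With the default prefix the parser should run straight across the line break and collect all four words; with the restricted prefix it should stop after the two words of the first line, which is what makes an explicit end-of-line token possible.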

What follows is an example grammar that (at least) distinguishes
correctly between lines with a keyword and those without. It is
still a ways off from what you eventually want to achieve, but
it is a step in the right direction:

use Parse::RecDescent;

my $grammar = <<'EOG';
startrule: line(s)
line: key_line | continuation_line
key_line: keyword value(s) eol
{ print "keyword: $item{ keyword}, value(s): @{ $item{ 'value(s)'}}\n" }
continuation_line: value(s) eol
{ print "continuation value(s): @{ $item{ 'value(s)'}}\n" }
eol: /\n/

value: type | uid | file
file: m{[\w/\\.-]+}
type: 'app' | 'dll'
uid: /0[xX][0-9a-fA-F]+/
keyword:
'AIF' |
'DOCUMENT' |
'LANG' |
'LIBRARY' |
'RESOURCE' |
'SOURCEPATH' |
'SOURCE' |
'SYSTEMINCLUDE' |
'TARGETPATH' |
'TARGETTYPE' |
'TARGET' |
'UID' |
'USERINCLUDE'
EOG

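# Keep newlines out of the token prefix so that eol can match them.
$Parse::RecDescent::skip = '[ \t]*';
my $p = Parse::RecDescent->new( $grammar) or die "boo";

# $text is assumed to hold the contents of the input file.
defined $p->startrule( $text) or die "no match";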

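One possible next step is to return data from the actions instead of printing, and to fold continuation lines into the directive they belong to. The following is a cut-down sketch under several assumptions of mine, not the grammar above: the keyword list is shortened, every value is treated as a plain file token, comments are not handled, and the folding rule (a line without a keyword extends the previous directive) is a guess at what is wanted.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Each line yields [ keyword-or-undef, values... ]; startrule returns all lines.
my $grammar = <<'EOG';
startrule: line(s) /\z/ { $return = $item[1] }
line: key_line | continuation_line
key_line: keyword value(s) eol { $return = [ $item[1], @{ $item[2] } ] }
continuation_line: value(s) eol { $return = [ undef, @{ $item[1] } ] }
eol: /\n/
value: m{[\w/\\.-]+}
keyword: 'SOURCEPATH' | 'SOURCE' | 'TARGETTYPE' | 'TARGET'
EOG

my $text = <<'EOT';
TARGET Snake.app
SOURCE CSApplication.cpp CSAppUI.cpp
CSView.cpp
SOURCEPATH ..\UiSrc
EOT

# The prefix must not swallow newlines, or eol could never match.
$Parse::RecDescent::skip = '[ \t]*';
my $parser = Parse::RecDescent->new($grammar) or die "bad grammar";
my $lines  = $parser->startrule($text) or die "parse failed";

# Attach the values of keyword-less lines to the preceding directive.
my @directives;
for my $line (@$lines) {
    my ($keyword, @values) = @$line;
    if (defined $keyword) {
        push @directives, { keyword => $keyword, values => [@values] };
    }
    elsif (@directives) {
        push @{ $directives[-1]{values} }, @values;
    }
}

print "$_->{keyword}: @{ $_->{values} }\n" for @directives;
```

The trailing /\z/ in startrule makes the parse fail outright when some line cannot be matched, rather than silently stopping at the first bad line.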
Anno
 
