BibTeX parser

Daniel Carrera

Hi all,

I have a difficult problem and I need some smart people to give me a hand.
So I knew where to go for that. :)

I'm trying to figure out how to write a parser for BibTeX files. It's not
easy. A single BibTeX entry might look like this:

@BOOK{texbook,
author = "Donald O'brian",
title = "The {{\TeX}book}",
publisher = 'Addison-Wesley',
year = 1984,
key = {Don's key}
}


I think you can see the problem.

There is a nested collection of {squiggly} brackets, as well as "double
quotes" and 'single quotes'. I'm not sure how to represent this
structure, nor how to parse it.


If I only had to deal with {brackets} I could use an n-ary tree. To
parse it, I would start with one node and move one character at a time.
Every time I saw a { I'd make a new node. Every time I saw a } I would
come back up.
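
The brace-only idea described above could be sketched in Ruby roughly as follows. This is only an illustration under stated assumptions, not a BibTeX parser: the method name `parse_braces` and the nested-array representation of the tree are made up for this example.

```ruby
# Hypothetical helper: builds an n-ary tree from brace-delimited text.
# Each node is an Array whose elements are Strings (runs of plain text)
# or child Arrays (one per {...} group).
def parse_braces(text)
  root = []
  stack = [root]            # stack.last is the node currently being filled
  buffer = +""

  # Move any pending plain text into the current node.
  flush = lambda do
    stack.last << buffer unless buffer.empty?
    buffer = +""
  end

  text.each_char do |ch|
    case ch
    when "{"
      flush.call
      child = []
      stack.last << child   # attach the new node to its parent
      stack.push(child)     # go down into it
    when "}"
      flush.call
      raise ArgumentError, "unbalanced '}'" if stack.size == 1
      stack.pop             # come back up
    else
      buffer << ch          # quotes and apostrophes are ordinary text here
    end
  end

  flush.call
  raise ArgumentError, "unclosed '{'" unless stack.size == 1
  root
end

tree = parse_braces("The {{\\TeX}book}")
```

For the title from the entry above, `tree` would be a string followed by a nested array holding the two brace levels. Note that this sketch treats `'` and `"` as plain characters, so it says nothing yet about the quoting problem.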


Now, when you add "double" quotes, the problem becomes more complicated,
but doable. I could first extract all the quotes and use an array where
quoted and non-quoted text alternate (for instance) and then parse using
the brackets to make an n-ary tree.


But if I have 'single' quotes also, things can get very complicated. I
will have to deal with things like:

{Dan's book}

and

"O'brian"

And at this point I am truly at a loss.


I hope one of the more experienced programmers here can offer some
insight.

Thanks a lot,
 
Mauricio Fernández

> Hi all,
>
> I have a difficult problem and I need some smart people to give me a hand.
> So I knew where to go for that. :)
>
> I'm trying to figure out how to write a parser for BibTeX files. [...]
> If I only had to deal with {brackets} I could use an n-ary tree. To
> parse it, I would start with one node and move one character at a time.
> Every time I saw a { I'd make a new node. Every time I saw a } I would
> come back up.
>
> Now, when you add "double" quotes, the problem becomes more complicated,
> but doable. I could first extract all the quotes and use an array where
> quoted and non-quoted text alternate (for instance) and then parse using
> the brackets to make an n-ary tree.
>
> But if I have 'single' quotes also, things can get very complicated. [...]
> And at this point I am truly at a loss.

Looks like you're doing the parser by hand... wouldn't it be easier
with, say, racc? As for the lexer, you could simply split (well, not
String#split but you get the idea) on spaces & special chars ({}"'\);
creating a grammar to handle this should be fairly easy. Another
advantage is that you could build an AST and use it to represent the
data. If needed you could simplify it later to transform "recursive"
nodes (i.e. those resulting from recursive productions) into arrays;
this is more convenient and IIRC it's what you'd get with Rockit. You
might also want to try the latter, but in my past experience I found it
to be too buggy :-(

mmm I guess Coco/Rb could be a good option too, since you also get a
lexer, and LL(1) should be enough for this.
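
The "split on spaces and special chars" lexer suggested above might look something like the sketch below, using `StringScanner` from Ruby's standard library. The token names `:SPECIAL` and `:WORD`, the method name `bibtex_tokens`, and the decision to also treat `@`, `=` and `,` as special characters are assumptions made for this example, not something racc or BibTeX prescribes.

```ruby
require "strscan"

# Hypothetical lexer: returns an array of [type, text] pairs.
def bibtex_tokens(src)
  ss = StringScanner.new(src)
  tokens = []
  until ss.eos?
    if ss.skip(/\s+/)
      next                                  # whitespace only separates tokens
    elsif (t = ss.scan(/[{}"'\\@=,]/))
      tokens << [:SPECIAL, t]               # one token per special character
    elsif (t = ss.scan(/[^\s{}"'\\@=,]+/))
      tokens << [:WORD, t]                  # a run of ordinary characters
    end
  end
  tokens
end

bibtex_tokens(%q(key = {Don's key})).each { |type, text| puts "#{type} #{text}" }
```

The apostrophe in `Don's` simply shows up as its own token; it is then up to the grammar to decide that, inside braces or double quotes, it is ordinary text. Because this sketch throws whitespace away, a real lexer would probably need to keep it (or switch modes) inside field values.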

--
_ _
| |__ __ _| |_ ___ _ __ ___ __ _ _ __
| '_ \ / _` | __/ __| '_ ` _ \ / _` | '_ \
| |_) | (_| | |_\__ \ | | | | | (_| | | | |
|_.__/ \__,_|\__|___/_| |_| |_|\__,_|_| |_|
Running Debian GNU/Linux Sid (unstable)
batsman dot geo at yahoo dot com

If loving linux is wrong, I dont wanna be right.
-- Topic for #LinuxGER
 
Daniel Carrera

> Looks like you're doing the parser by hand...

Yes, because I am a parser-newbie and I don't know better.

> wouldn't it be easier with, say, racc? As for the lexer [snip]

What's racc? Do you have a link?

What's a lexer?

Where can I learn how to make good parsers? I'd really like to do this
right.

> creating a grammar to handle this should be fairly easy.

Beautiful! I like easy. :)

> Another advantage is that you could build an AST and use it to represent
> the data.

I'll also need a link where I can learn what an AST is.

> but in my past experience I found it to be too buggy :-(

Buggy is bad. I'll stick to something reliable.

> mmm I guess Coco/Rb could be a good option too, since you also get a
> lexer, and LL(1) should be enough for this.

Links for Coco/Rb?


Thanks for all the help! I just *knew* I was doing something wrong.
Thanks for pointing me in the right direction.

I will do a google search for "lexer", "grammar" and the other things you
mentioned and I didn't understand. But if you have good links for me I'd
love to get them.

Thanks again.

Cheers,
 
Mauricio Fernández

>> Looks like you're doing the parser by hand...
>
> Yes, because I am a parser-newbie and I don't know better.
>
>> wouldn't it be easier with, say, racc? As for the lexer [snip]
>
> What's racc? Do you have a link?

http://i.loveruby.net/en/racc.html
It's a parser generator similar to YACC (kind of a de facto standard) for
Ruby.

> What's a lexer?

Normally parsing is done in two steps. In the typical example of an
arithmetic expression
    1 + 2
the lexer would break the input into _tokens_ (syntactic atoms)
    type      value (semantic info)
    NUMBER    1
    OPPLUS
    NUMBER    2
which would be passed to the parser, which would recognize the expression
according to a number of rules (productions).
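
To make the two steps concrete, here is a rough hand-written sketch in Ruby. It is not what racc generates; the method names `lex` and `parse_sum` are invented for this example, and the parser only knows the single rule `expr : NUMBER (OPPLUS NUMBER)*`.

```ruby
require "strscan"

# Step 1, the lexer: turns the input text into [type, value] tokens.
def lex(input)
  ss = StringScanner.new(input)
  tokens = []
  until ss.eos?
    if ss.skip(/\s+/)
      next                                   # ignore whitespace
    elsif (t = ss.scan(/\d+/))
      tokens << [:NUMBER, t.to_i]            # value carries the semantic info
    elsif ss.scan(/\+/)
      tokens << [:OPPLUS, nil]               # operators need no value
    else
      raise ArgumentError, "unexpected character #{ss.getch.inspect}"
    end
  end
  tokens
end

# Step 2, the parser: checks the tokens against the rule
#   expr : NUMBER (OPPLUS NUMBER)*
# and computes the sum as it goes.
def parse_sum(tokens)
  tokens = tokens.dup
  type, value = tokens.shift
  raise ArgumentError, "expected NUMBER" unless type == :NUMBER
  sum = value
  until tokens.empty?
    op, = tokens.shift
    raise ArgumentError, "expected OPPLUS" unless op == :OPPLUS
    type, value = tokens.shift
    raise ArgumentError, "expected NUMBER" unless type == :NUMBER
    sum += value
  end
  sum
end

puts parse_sum(lex("1 + 2"))
```

With a parser generator you would write only the rule and the action attached to it; the token-checking loop in `parse_sum` is the part the tool produces for you.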
> Where can I learn how to make good parsers? I'd really like to do this
> right.

There are a billion books on this, essentially any on compiler
construction. Reading such a thing would probably be overkill, and you
can probably get by if you read the tutorial included in bison's .info
documentation (bison is the GNU parser generator, compatible with yacc;
if you understand how to use it you can make use of racc similarly).

> Links for Coco/Rb?

http://raa.ruby-lang.org/list.rhtml?name=coco-rb

You could also try Seattle.rb's pure-Ruby port of Coco/R:
http://www.zenspider.com/ZSS/Products/CocoR/index.html

> Thanks for all the help! I just *knew* I was doing something wrong.
> Thanks for pointing me in the right direction.
>
> I will do a google search for "lexer", "grammar" and the other things you
> mentioned and I didn't understand. But if you have good links for me I'd
> love to get them.

You can start here
http://www.gnu.org/software/bison/manual/html_node/Language-and-Grammar.html
and then proceed to the examples... Even though they're in C, they
should be very helpful if you use racc later.


 
