How to scan Java source texts?

Stefan Ram

I'd like to scan Java source texts, printing one token per line.

I thought it might be possible with the compiler API, and
have read that it can return an AST, but I do not know how
to just obtain the tokens from the source code AST.

I am able to write a scanner for Java myself, but this would
take days. So I would like to shortcut it by using a Java SE
(with JDK) call. (I would not like to use a third-party
library, because when I use the Java SE compiler API, I can
be sure that this will be up-to-date with future Java versions.)

So, the best solution would be a short program getting this
information out of the Java compiler API. But I cannot find
an example of this on the web.
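
For reference, here is a minimal sketch of what the supported compiler API (javax.tools plus com.sun.source) can deliver: it parses a source string into an AST and collects the identifier and literal nodes. The names AstDemo, StringSource and leaves are made up for this illustration, and it assumes a full JDK is running the program. Operators such as += show up only as node kinds in the tree, not as token text, which is why this route does not yield a plain token list.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.IdentifierTree;
import com.sun.source.tree.LiteralTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// Hypothetical helper class for illustration; not part of any library.
final class AstDemo {
    /** In-memory source file, so the example needs no files on disk. */
    private static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Parses the source and collects identifier and literal nodes in visiting order. */
    static List<String> leaves(String className, String code) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("no system Java compiler available (JRE only?)");
        }
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null,
                Collections.singletonList(new StringSource(className, code)));
        final List<String> out = new ArrayList<>();
        TreeScanner<Void, Void> visitor = new TreeScanner<Void, Void>() {
            @Override
            public Void visitIdentifier(IdentifierTree node, Void p) {
                out.add(node.getName().toString());
                return super.visitIdentifier(node, p);
            }

            @Override
            public Void visitLiteral(LiteralTree node, Void p) {
                out.add(String.valueOf(node.getValue()));
                return super.visitLiteral(node, p);
            }
        };
        try {
            // parse() stops after the syntax phase; no attribution or code generation.
            for (CompilationUnitTree unit : task.parse()) {
                visitor.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(
                leaves("T", "class T { int f(int a, int b) { a += b + 1; return a; } }"));
    }
}
```

The visitor only sees what the tree keeps, so punctuation, operators and comments would have to be recovered some other way.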

What does not seem to work is:

public class Main
{ public static void main( final java.lang.String[] args )throws java.io.IOException
  { final java.io.File javaFile = new java.io.File( "Main.java" );
    final java.io.FileReader file = new java.io.FileReader( javaFile );
    final java.io.StreamTokenizer streamTokenizer = new java.io.StreamTokenizer( file );
    for( int i; true; )
    { i = streamTokenizer.nextToken();
      if( i == java.io.StreamTokenizer.TT_EOF )break;
      java.lang.System.out.println( streamTokenizer.sval ); }}}

Still, this gives the idea of what I want to accomplish.
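
To make visible why the attempt above falls short, here is a small sketch that labels what a default-configured java.io.StreamTokenizer reports. The class name StreamTokenizerDemo and the method describe are invented for this example. As far as I can tell, three defaults get in the way: sval is only set for words and quoted strings (so operators print as null), numbers are parsed into the double nval, and a lone "/" starts a comment that runs to the end of the line.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; shows the default StreamTokenizer behaviour only.
final class StreamTokenizerDemo {
    /** Describes each token that a default-configured StreamTokenizer reports. */
    static List<String> describe(String source) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(source));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("word:" + st.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("number:" + st.nval);
                        break;
                    case '"':
                    case '\'':
                        out.add("quoted:" + st.sval); // quotes stripped, escapes decoded
                        break;
                    default:
                        out.add("char:" + (char) st.ttype); // sval is null for these
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : describe("a+=b")) {
            System.out.println(token);
        }
    }
}
```

Reconfiguring the tokenizer (ordinaryChar, slashStarComments and so on) helps a little, but it still has no notion of multi-character operators such as +=.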

For example, the scanner should decompose:

a+=b +"c\"d/*e"/*f*/
+g;

into

a
+=
b
+
"c\"d/*e"
/*f*/
+
g
;

(The comment »/*f*/« may just as well be dropped; also, there is
no need for any further information, such as token types.)
 
Stefan Ram

Stefan Ram said:
I am able to write a scanner for Java myself, but this would
take days. So I would like to shortcut it by using a Java SE
(with JDK) call. (I would not like to use a third-party

It might not be easy to get this right. For example, a
well-known, popular source-code indenter formatted the
several thousand lines of my Java project well, except for a
single case, where the source text »a=4.436e+3« was split
by a line break at the wrong place, giving something like

a=4.436e
+3
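
A split like this happens when a tool treats the sign as an operator without checking whether it belongs to an exponent. The following is a minimal sketch of the check, using a deliberately simplified pattern for decimal literals (no hexadecimal floats, no underscores, and it does not reject every invalid suffix combination). FloatLiteralDemo and literalAt are names invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is a simplification of the real literal syntax.
final class FloatLiteralDemo {
    // digits with optional fraction (or a leading-dot fraction),
    // optional exponent with optional sign, optional float/double suffix
    private static final Pattern DECIMAL = Pattern.compile(
            "(?:\\d+\\.?\\d*|\\.\\d+)(?:[eE][+-]?\\d+)?[fFdD]?");

    /** Returns the numeric literal that starts at offset start, or null if there is none. */
    static String literalAt(String source, int start) {
        Matcher m = DECIMAL.matcher(source);
        m.region(start, source.length());
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(literalAt("a=4.436e+3", 2));
    }
}
```

The sign is consumed only when it directly follows e or E and is itself followed by digits, so the exponent stays inside the literal.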
 
markspace

Stefan Ram said:
It might not be easy to get this right. For example, a


No, it's not. I recommend a third-party library. ANTLR has a Java
grammar already worked out. There are also other dedicated Java parsers.

Note that you're talking about two things here: lexing and parsing. A
lexer breaks text up into tokens; a parser decides how to interpret the
result. Parsers traditionally have a lot more contextual information,
whereas lexers are just simpler state machines that break up text.
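
To illustrate the lexing half, here is a minimal hand-written sketch that prints source text back as one token per list entry. It is not a complete Java lexer: Unicode escapes, text blocks and the special handling of >> in generics are left out, and the operator table is my own selection. MiniLexer, tokenize and the helper methods are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified lexer for illustration; not a full implementation.
final class MiniLexer {
    // Multi-character operators, longest first so that ">>>=" wins over ">>".
    private static final String[] OPERATORS = {
        ">>>=", "<<=", ">>=", ">>>", "...", "->", "::", "++", "--", "&&", "||",
        "==", "!=", "<=", ">=", "+=", "-=", "*=", "/=", "&=", "|=", "^=", "%=",
        "<<", ">>"
    };

    /** Splits the source into tokens; comments are kept, whitespace is dropped. */
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (src.startsWith("//", i)) {                       // line comment
                while (i < src.length() && src.charAt(i) != '\n') i++;
            } else if (src.startsWith("/*", i)) {                // block comment
                int end = src.indexOf("*/", i + 2);
                i = (end < 0) ? src.length() : end + 2;
            } else if (c == '"' || c == '\'') {                  // string or char literal
                i++;
                while (i < src.length() && src.charAt(i) != c) {
                    i += (src.charAt(i) == '\\') ? 2 : 1;        // skip escaped character
                }
                i = Math.min(i + 1, src.length());               // include closing quote
            } else if (Character.isJavaIdentifierStart(c)) {     // identifier or keyword
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) i++;
            } else if (Character.isDigit(c)
                    || (c == '.' && i + 1 < src.length() && Character.isDigit(src.charAt(i + 1)))) {
                i = scanNumber(src, i);
            } else {
                i += operatorLength(src, i);
            }
            tokens.add(src.substring(start, i));
        }
        return tokens;
    }

    /** Consumes a numeric literal; a sign is accepted only directly after the exponent letter. */
    private static int scanNumber(String s, int start) {
        boolean hex = s.startsWith("0x", start) || s.startsWith("0X", start);
        int i = start + 1;
        while (i < s.length()) {
            char c = s.charAt(i);
            char prev = s.charAt(i - 1);
            boolean exponentSign = (c == '+' || c == '-')
                    && (hex ? (prev == 'p' || prev == 'P') : (prev == 'e' || prev == 'E'));
            if (Character.isLetterOrDigit(c) || c == '.' || c == '_' || exponentSign) i++;
            else break;
        }
        return i;
    }

    /** Length of the longest known operator at position i, or 1 for a single character. */
    private static int operatorLength(String s, int i) {
        for (String op : OPERATORS) {
            if (s.startsWith(op, i)) return op.length();
        }
        return 1;
    }

    public static void main(String[] args) {
        for (String token : tokenize("a+=b +\"c\\\"d/*e\"/*f*/\n+g;")) {
            System.out.println(token);
        }
    }
}
```

Each branch is one state of the machine: once the first character decides the token class, the loop simply runs to the end of that token. No grammar knowledge is needed, which is the difference from a parser.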
 
Jeff Higgins

Stefan Ram said:
I'd like to scan Java source texts, printing one token per line.

Do you mean these tokens:

Stefan Ram said:
I thought it might be possible with the compiler API, and
have read that it can return an AST, but I do not know how
to just obtain the tokens from the source code AST.

An AST is built from the tokens above.

[snip]
 
Jeff Higgins

Stefan Ram said:
Yes. That's why the compiler still might have a copy of
the tokens lying around somewhere or might have a method
to get the next token. I just can't find such a method.

I suspect, but don't know, that these tokens may have lost some
of the information associated with their being 'InputElements'
by the time the AST is constructed. It shouldn't be too hard
to find a Java lexer that will output as you request.
I'll look around when I have a little more time.
 
Jeff Higgins

From OpenJDK:

package com.sun.tools.javac.parser;

/** The lexical analyzer maps an input stream consisting of
 *  ASCII characters and Unicode escapes into a token sequence.
 *
 *  <p><b>This is NOT part of any supported API.
 *  If you write code that depends on this, you do so at your own risk.
 *  This code and its internal interfaces are subject to change or
 *  deletion without notice.</b>
 */
public class Scanner implements Lexer {
 
Stefan Ram

Jeff Higgins said:
I don't see a way to do what you want using the existing API.

Thanks for your remarks, which helped me
to find out that it can be done, once one
is willing to use the »com.sun....« classes,
such as »Scanner«. »tools.jar« needs to be on
the classpath for this.

Now, there is indeed the risk that these classes
will change in future JDK versions. But still
I estimate them to be more stable than some
third-party libraries. For example, for the same
purpose I previously used a third-party program that
has not been adapted to Java >= 1.5, so I
now needed to find some other means to accomplish
this for Java >= 1.5.
 
Jeff Higgins

Well, right there under my nose! :-O :))
It was fun building my own compiler, though!
Maybe a new language to go with it: Jeffa!
 
Roedy Green

Stefan Ram said:
I'd like to scan Java source texts, printing one token per line.

You mean Java source code? I wrote a finite-state-machine parser for
Java snippets (i.e. incomplete Java and Java with syntax errors) with
the intention of classifying each token and printing it out in a
special colour and font.

The source is available at http://mindprod.com/products1.html#JDISPLAY;
the class of most interest would be com.mindprod.jprep.JavaState.

You could use it exactly as is. It creates binary token files.
All you would need to do is write a reader for the token file, and
display each token one per line ignoring most of the information
encoded in the token type.
 
Jeff Higgins

Stefan Ram said:
Thanks for your remarks!, which helped me
to find out that it can be done, once one
is willing to use the »com.sun....«-classes,
such as »Scanner«. »tools.jar« needs to be in
the classpath for this.

The only problem I see now is gaining access to the protected
Scanner constructor outside of the com.sun.tools.javac.parser package.

Maybe I'll try extending my newly built javac
to include a -tokenize extension as above.
 
Stefan Ram

Jeff Higgins said:
The only problem I see now is gaining access to the protected
Scanner constructor outside of the com.sun.tools.javac.parser package.

You need to use a factory method of a ScannerFactory, and to
get that, you need to use yet another factory-like method of
ScannerFactory, which needs a Context, but this time you can
use Context's default constructor.
 
Jeff Higgins

ScannerFactory.instance(new Context()).newScanner(args[0], true);

Wonderful! Thanks.
 
