Generic AST in XML for any language

Kalahan

Does anyone know if there is such a thing as a standard to represent
the basic elements of a language (functions, variables, classes)? And
generated in XML?

I know that the title might be misleading about the meaning of an AST,
but I have a project in mind and I don't want to replicate work. Also,
that might be aiming too high if we start adding functional languages,
aspect-oriented programming, etc.

Also I would appreciate if you could point me to projects where I can
get a good XML representation of a source file.
 
Ira Baxter

The OMG is pushing something called the Abstract Syntax Tree Metamodel
(ASTM). This attempts to define a universal AST having low-level
details such as expressions, operators, operands, the usual. It
doesn't quite succeed because you can't model every language in a single
representation nicely (the UNCOL problem again!). It compromises by
allowing "Specific" Abstract Syntax Tree models for specific languages,
which is a concession to the UNCOL problem but not a satisfactory one.

Here's a (rather dated) tutorial:
http://www.omg.org/news/meetings/workshops/ADM_2005_Proceedings_FINAL/T-3_Newcomb.pdf
You can find the actual standards documents at the OMG site.
It has an interchange model in XML.

Because there are competing factions inside OMG for every standard
(sigh), there's *another* "standard" called the Knowledge Discovery
Metamodel (KDM).
This models programs in rather larger chunks, as the OP requested
(functions, statements but not details inside statements AFAIK),
but because nobody is satisfied with that, they attempt to model
flow information between the chunks. So it tries to be chunkier than
the AST (easier to produce results) but more detailed at the same time.
http://www.omg.org/spec/KDM/1.0/

There aren't a lot of folks writing tools to produce this information.
We build lots of language front ends (see
http://www.semanticdesigns.com/Products/DMS/FrontEnds.html) and have
been asked a number of times why (by OMG people involved in these
standards) we don't produce data in either form. The answer is pretty
simple: putting data in this form is only useful if you have strong
machinery that can consume it and reason from it.

I think the KDM is wrong-headed in that it is lossy; you don't
get the details of the code, and that's needed for deep reasoning.
We only want to do deep reasoning (shallow reasoning leads
to useless answers or bad false positives), so that's not a good route.
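To make the lossiness point concrete, here is a minimal sketch of a chunk-level outline built with Python's standard `ast` and `xml.etree.ElementTree` modules. The element names (`unit`, `class`, `function`, `param`, `variable`) are made up for illustration and are not KDM's actual schema; the only point is that two functions with different bodies become indistinguishable once the statement details are dropped.

```python
import ast
import xml.etree.ElementTree as ET


def outline(source: str) -> ET.Element:
    """Build a coarse XML outline: classes, functions, and assigned names only.

    Element names are hypothetical, not taken from any standard.
    """
    root = ET.Element("unit")

    def visit(body, parent):
        for stmt in body:
            if isinstance(stmt, ast.ClassDef):
                # Recurse so that methods and class attributes are listed too.
                visit(stmt.body, ET.SubElement(parent, "class", name=stmt.name))
            elif isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef)):
                fn = ET.SubElement(parent, "function", name=stmt.name)
                for arg in stmt.args.args:
                    ET.SubElement(fn, "param", name=arg.arg)
                # The function body is deliberately ignored: this is the lossy part.
            elif isinstance(stmt, ast.Assign):
                for target in stmt.targets:
                    if isinstance(target, ast.Name):
                        ET.SubElement(parent, "variable", name=target.id)

    visit(ast.parse(source).body, root)
    return root


src_a = "def f(x):\n    return x + 1\n"
src_b = "def f(x):\n    return x * 2\n"

xml_a = ET.tostring(outline(src_a), encoding="unicode")
xml_b = ET.tostring(outline(src_b), encoding="unicode")

# Different programs, identical outline.
print(xml_a == xml_b)
```

An outline like this is cheap to produce and easy to compare across languages, but any analysis that needs to know what `f` actually computes has nothing left to work with.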

The ASTM at least tries not to be lossy, but it attempts to jam
everything into a single model.
My personal belief is that attempting to jam every language's
syntax into one pretty much makes deep reasoning impossible
because you again lose detail.

So, I have a "+" operator in the AST. What does it mean?
"+" in C with 2's complement overflows?
String "+" in Java? Python "+" with infinite precision?
You can fix this with "+" by marking it with the precise dialect from which
it came,
but now I have "+~C", "+~Java.stringtype", "+~Python2_6",
and it's hard to argue I now have a universal tree.
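A small sketch of the "+" problem, under some loud assumptions: the `Add` node, the dialect tags ("C.int32", "Java.String", "Python.int") and the `evaluate` function are hypothetical names invented here, and the C case assumes 32-bit wraparound even though ISO C leaves signed overflow undefined. It only illustrates that one node shape needs a different meaning per dialect tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Add:
    """A '+' node tagged with the dialect it came from (tags are hypothetical)."""
    dialect: str
    left: object
    right: object


def wrap_int32(n: int) -> int:
    """Reduce n to a signed 32-bit two's-complement value."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


def evaluate(node: Add):
    if node.dialect == "C.int32":
        # Assumption: wraparound behavior. ISO C leaves signed overflow undefined.
        return wrap_int32(node.left + node.right)
    if node.dialect == "Java.String":
        # String concatenation: operands are converted to strings first.
        return str(node.left) + str(node.right)
    if node.dialect == "Python.int":
        # Arbitrary-precision integer addition.
        return node.left + node.right
    raise ValueError(f"no semantics defined for dialect {node.dialect!r}")


print(evaluate(Add("C.int32", 2147483647, 1)))     # -2147483648
print(evaluate(Add("Python.int", 2147483647, 1)))  # 2147483648
print(evaluate(Add("Java.String", "1", 2)))        # 12
```

Every consumer of the tree ends up with a dispatch like the one in `evaluate`, which is the sense in which the tree is no longer universal: the shape is shared, the meaning is not.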

Finally, I'm more interested in the reasoning results than the
intermediate tree representation. So we've concentrated on building
specific trees for specific language dialects, and building
language-specific analyzers (using generic analysis support machinery
to the extent we can define it). So we concentrate on building
machinery to process the trees (and downstream analyses such as
control- and data-flow, points-to analysis, ....).

Our tools can export XML for the trees we generate. This is pretty
easy to do. But we do it for "checkbox" reasons: if somebody asks, we
can do it. None of our customers actually use this, because 1) the
trees are enormous as text files, and 2) after you've exported the XML
from our tools, you are in a tool vacuum. Now what do you do with
them?
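For a feel of how easy the export is and how large the result gets, here is a rough sketch that dumps a full syntax tree to XML. It uses Python's own `ast` module and `xml.etree.ElementTree` purely as a stand-in; it is not how DMS or any other tool mentioned in this thread does it, and the element layout (node class as tag, scalar fields as attributes) is an arbitrary choice made for the example.

```python
import ast
import xml.etree.ElementTree as ET


def to_xml(node: ast.AST) -> ET.Element:
    """Convert an ast node to an XML element (layout is an arbitrary choice)."""
    elem = ET.Element(type(node).__name__)
    for name, value in ast.iter_fields(node):
        if isinstance(value, ast.AST):
            # Single child node: wrap it in an element named after the field.
            ET.SubElement(elem, name).append(to_xml(value))
        elif isinstance(value, list):
            child = ET.SubElement(elem, name)
            for item in value:
                if isinstance(item, ast.AST):
                    child.append(to_xml(item))
                else:
                    ET.SubElement(child, "value").text = repr(item)
        elif value is not None:
            # Scalar field (identifier, constant, ...): store as an attribute.
            elem.set(name, repr(value))
    return elem


source = "def add(a, b):\n    return a + b\n"
root = to_xml(ast.parse(source))
xml_text = ET.tostring(root, encoding="unicode")

# Compare the size of the source with the size of its XML tree.
print(len(source), len(xml_text))
```

Even for this two-line function the XML is many times longer than the source, and that is before adding source positions or type information.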
 
BGB / cr88192

Kalahan said:
Does anyone know if there is such a thing as a standard to represent
the basic elements of a language (functions, variables, classes)? And
generated in XML?

I know that the title might be misleading about the meaning of an AST,
but I have a project in mind and I don't want to replicate work. Also,
that might be aiming too high if we start adding functional languages,
aspect-oriented programming, etc.

Also I would appreciate if you could point me to projects where I can
get a good XML representation of a source file.

I use XML internally for several of my frontends.

But, alas, there is nothing really standard about it, nor does it
extend to "any language". Usually, one will have to live with a
situation that many pieces of the syntax and semantics will vary from
one language to the next, and so different frontends would necessarily
produce ASTs with differing contents and differing meanings.

admittedly, within a narrow family of languages there is a lot of overlap, so
more can be similar than different:
for example: C, C++, Java, C#, and maybe ECMAScript (JavaScript and
ActionScript) could all use an essentially very similar AST structure.

however, once it comes to the problem of specific languages, the
potentially drastic semantic differences come up.

for example, if the C is still to be valid C, the Java still valid Java, and
the JS valid JS, then some pain begins, as these languages each manage
things like types, memory references, ... very differently, and eventually
these issues will need to be addressed.

in many cases, common ground can be found, and one can address some issues
via simple internal translation, but in many other cases it is less trivial,
and one ends up having to use a "common superset" strategy for many parts of
the backend.

for example, one may end up dealing with maybe around 8+ different basic
array types, several different variations as to how to manage OO features
(C++ vs Java vs C# vs JS).

there may be cases where there is no single good way to do something,
leading to open-ended problems (this is an extra issue with signature
strings, since it may lead to issues like inconsistent name-mangling
behavior, extra code complexity, ...). one may also find cases of mutual
incompatibility, where neither language can directly map their data to the
other.

in other cases, things may need to be left as context dependent or ambiguous
(for example, signature strings may have some context-dependent types and
notations, ...).

something trivial in one place may also be a terrible pain in another, ...

often, the best option available is to try to be generic (keep one thing
from depending on the specifics of another, and allow things to be passed
along cleanly and easily when possible).


but, anyways, here is a current compiler dump:
http://cr88192.dyndns.org/2010-03-13_bscc.zip

it is currently (mostly) under a mix of Public Domain and MIT licensing (and
is now GPL-free), but a few parts come from Apache (mostly the Java
classlib, but I have partly started on attempting my own implementation of
the classlib). (note: Java support is not particularly tested or
complete...).

a lot is still needed WRT documenting the thing, ...
 
Olaf Krzikalla

Kalahan said:
Does anyone know if there is such a thing as a standard to represent
the basic elements of a language (functions, variables, classes)?

Functions, variables & classes aren't basic elements of a language
at all. Maybe they are basic elements of a (rather narrow) subset of
programming languages but that's a whole different thing...

Kalahan said:
And generated in XML?

... making an XML representation of even programming languages in general
rather impossible IMHO.

Whatever. clang (clang.llvm.org) can output C files as XML like gccxml.
You can even check for differences in the approaches of these two tools.

Best
Olaf Krzikalla
[It is definitely not possible to create a universal intermediate form,
as people have been relearning since the original UNCOL project in the
1950s, but it should be possible to do common analyses for interesting
parts of semantically similar languages. -John]
 
