Generic AST in XML for any language

Kalahan · Mar 11, 2010

Does anyone know if there is such a thing as a standard to represent
the basic elements of a language (functions, variables, classes)? And
generated in XML?

I know that the title might be misleading about the meaning of an AST
but I have a project in mind and I don't want to replicate work. Also
that might be aiming too high if we start adding functional languages,
aspect-oriented programming, etc.

Also I would appreciate if you could point me to projects where I can
get a good XML representation of a source file.

Ira Baxter · Mar 13, 2010

The OMG is pushing something called the Abstract Syntax Tree Metamodel
(ASTM). This attempts to define a universal AST having low-level
details such as expressions, operators, operands, the usual. It
doesn't quite succeed because you can't model every language in a single
representation nicely (the UNCOL problem again!). It compromises by
allowing "Specific" Abstract Syntax Tree models for specific languages,
which is a concession to the UNCOL problem but not a satisfactory one.

Here's a (rather dated) tutorial:
http://www.omg.org/news/meetings/workshops/ADM_2005_Proceedings_FINAL/T-3_Newcomb.pdf
You can find the actual standards documents at the OMG site.
It has an interchange model in XML.

Because there are competing factions inside OMG for every standard
(sigh), there's *another* "standard" called the Knowledge Discovery
Metamodel (KDM). This models programs in rather larger chunks, as the
OP requested (functions, statements, but not details inside statements
AFAIK), but because nobody is satisfied with that, they attempt to model
flow information between the chunks. So it tries to be chunkier than
the AST (easier to produce results) but more detailed at the same time.
http://www.omg.org/spec/KDM/1.0/
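
To make the "chunkier" level concrete, here is a minimal sketch in
Python. It is not the KDM interchange format: the element names
(unit, class, function, variable) and the helper names are made up for
illustration, and Python's own ast module stands in for a real front
end. It records only declarations and deliberately skips everything
inside function bodies, which is the kind of detail a coarse model drops.

```python
import ast
import xml.etree.ElementTree as ET

# Hypothetical sample input; any Python source string would do.
SOURCE = '''
LIMIT = 10

class Counter:
    def bump(self, n):
        return n + 1

def main():
    c = Counter()
    return c.bump(LIMIT)
'''

def outline(node, parent):
    """Record classes, functions and assigned names; never descend into function bodies."""
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.ClassDef):
            # Recurse so that methods appear nested under their class.
            outline(child, ET.SubElement(parent, "class", name=child.name))
        elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
            ET.SubElement(parent, "function", name=child.name)
        elif isinstance(child, ast.Assign):
            for target in child.targets:
                if isinstance(target, ast.Name):
                    ET.SubElement(parent, "variable", name=target.id)

def outline_xml(source):
    """Return an ElementTree element holding the coarse outline of `source`."""
    root = ET.Element("unit")  # invented root element name
    outline(ast.parse(source), root)
    return root

if __name__ == "__main__":
    print(ET.tostring(outline_xml(SOURCE), encoding="unicode"))
```

The expression `n + 1` and the call `c.bump(LIMIT)` leave no trace in
the result, so nothing downstream can reason about them.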

There aren't a lot of folks writing tools to produce this information.
We build lots of language front ends (see
http://www.semanticdesigns.com/Products/DMS/FrontEnds.html) and have
been asked a number of times why (by OMG people involved in these
standards) we don't produce data in either form. The answer is pretty
simple: putting data in this form is only useful if you have strong
machinery that can consume it and reason from it.

I think the KDM is wrong-headed in that it is lossy; you don't
get the details of the code, and that's needed for deep reasoning.
We only want to do deep reasoning (shallow reasoning leads
to useless answers or bad false positives) so that's not a good route.

The ASTM at least tries not to be lossy, but it attempts to jam
everything into a single model.
My personal belief is that attempting to jam every language's
syntax into one pretty much makes deep reasoning impossible
because you again lose detail.

So, I have a "+" operator in the AST. What does it mean?
"+" in C with 2's complement overflows?
String "+" in Java? Python "+" with infinite precision?
You can fix this with "+" by marking it with the precise dialect from
which it came, but now I have "+~C", "+~Java.stringtype", "+.Python2_6",
and it's hard to argue I now have a universal tree.
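
A small Python sketch of why one "+" node is not enough. The three
labels are the ones used above; the function names and the dispatch
table are invented for illustration. The C case models 32-bit two's
complement wraparound, which is an assumption about a typical
implementation (ISO C leaves signed overflow undefined), and the Java
case only approximates string conversion with Python's str().

```python
def add_c_int32(a, b):
    """Assumed model: 32-bit two's complement addition that wraps around."""
    total = (a + b) & 0xFFFFFFFF
    return total - 0x100000000 if total >= 0x80000000 else total

def add_java_string(a, b):
    """Rough model of Java string '+': convert both operands to text and concatenate."""
    return str(a) + str(b)

def add_python_int(a, b):
    """Python integers have arbitrary precision, so the sum never wraps."""
    return a + b

# One entry per dialect-tagged operator; the tree is no longer universal.
SEMANTICS = {
    "+~C": add_c_int32,
    "+~Java.stringtype": add_java_string,
    "+.Python2_6": add_python_int,
}

def evaluate(op, a, b):
    """Look up the dialect-specific meaning of a tagged operator and apply it."""
    return SEMANTICS[op](a, b)

if __name__ == "__main__":
    for op in SEMANTICS:
        print(op, evaluate(op, 2147483647, 1))
```

The same operands give three different answers, and any analyzer
consuming the tree needs a case per tag, which is the point being made.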

Finally, I'm more interested in the reasoning results than the
intermediate tree representation. So we've concentrated on building
specific trees for specific language dialects, and building
language-specific analyzers (using generic analysis support machinery
to the extent we can define it). So we concentrate on building
machinery to process the trees (and downstream analyses such as
control- and data-flow, points-to analysis, ....).

Our tools can export XML for the trees we generate. This is pretty
easy to do. But we do it for "checkbox" reasons: if somebody asks, we
can do it. None of our customers actually use this, because 1) the
trees are enormous as text files, and 2) after you've exported the XML
from our tools, you are in a tool vacuum. Now what do you do with
them?
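
For a feel of both points, here is a minimal sketch using Python's
standard ast and xml.etree.ElementTree modules. It is not how our
tools export XML; the layout (one element per node type, one wrapper
element per field, scalars as attributes) and the name to_xml are
assumptions chosen for brevity.

```python
import ast
import xml.etree.ElementTree as ET

def to_xml(node):
    """Convert a Python ast node into an XML element, recursively."""
    elem = ET.Element(type(node).__name__)
    for field, value in ast.iter_fields(node):
        items = value if isinstance(value, list) else [value]
        wrapper = None
        scalars = []
        for item in items:
            if isinstance(item, ast.AST):
                # Child nodes go under a wrapper element named after the field.
                if wrapper is None:
                    wrapper = ET.SubElement(elem, field)
                wrapper.append(to_xml(item))
            elif item is not None:
                scalars.append(repr(item))
        if scalars:
            # Names, constants and similar leaf values become attributes.
            elem.set(field, " ".join(scalars))
    return elem

# Hypothetical two-line input program.
SOURCE = "def add(a, b):\n    return a + b\n"

if __name__ == "__main__":
    xml_text = ET.tostring(to_xml(ast.parse(SOURCE)), encoding="unicode")
    print(len(SOURCE), "characters of source,", len(xml_text), "characters of XML")
    print(xml_text)
```

Writing the exporter is the easy part. Even for this two-line function
the XML is several times larger than the source, and whoever receives
it still has to build every analysis on top of it.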

BGB / cr88192 · Mar 13, 2010

Kalahan said:
Does anyone know if there is such a thing as a standard to represent
the basic elements of a language (functions, variables, classes)? And
generated in XML?

I know that the title might be misleading about the meaning of an AST
but I have a project in mind and I don't want to replicate work. Also
that might be aiming too high if we start adding functional languages,
aspect-oriented programming, etc.

Also I would appreciate if you could point me to projects where I can
get a good XML representation of a source file.

I use XML internally for several of my frontends.

but, alas, there is nothing really standard about it, nor does it
extend to "any language". usually, one will have to live with a
situation that many pieces of the syntax and semantics will vary from
one language to the next, and so different frontends would necessarily
produce ASTs with differing contents and differing meanings.

admittedly, within a narrow family of languages there is a lot of overlap, so
more can be similar than different:
for example: C, C++, Java, C#, and maybe ECMAScript (JavaScript and
ActionScript) could all use an essentially very similar AST structure.

however, once it comes to the problem of specific languages, the
potentially drastic semantic differences come up.

for example, if the C is still to be valid C, the Java still valid Java, and
the JS valid JS, then some pain begins, as these languages each manage
things like types, memory references, ... very differently, and eventually
these issues will need to be addressed.

in many cases, common ground can be found, and one can address some issues
via simple internal translation, but in many other cases it is less trivial,
and one ends up having to use a "common superset" strategy for many parts of
the backend.

for example, one may end up dealing with maybe around 8+ different basic
array types, several different variations as to how to manage OO features
(C++ vs Java vs C# vs JS).

there may be cases where there is no single good way to do something,
leading to open-ended problems (this is an extra issue with signature
strings, since it may lead to issues like inconsistent name-mangling
behavior, extra code complexity, ...). one may also find cases of mutual
incompatibility, where neither language can directly map their data to the
other.

in other cases, things may need to be left as context dependent or ambiguous
(for example, signature strings may have some context-dependent types and
notations, ...).

something trivial in one place may also be a terrible pain in another, ...

often, the best option available is to try to be generic (keep one thing
from depending on the specifics of another, and allow things to be passed
along cleanly and easily when possible).

but, anyways, here is a current compiler dump:
http://cr88192.dyndns.org/2010-03-13_bscc.zip

it is currently (mostly) under a mix of Public Domain and MIT licensing (and
is now GPL-free), but a few parts come from Apache (mostly the Java
classlib, but I have partly started on attempting my own implementation of
the classlib). (note: Java support is not particularly tested or
complete...).

a lot is still needed WRT documenting the thing, ...

Olaf Krzikalla · Mar 15, 2010

Kalahan said:
Does anyone know if there is such a thing as a standard to represent
the basic elements of a language (functions, variables, classes)?

Functions, variables & classes aren't basic elements of a language
at all. Maybe they are basic elements of a (rather narrow) subset of
programming languages but that's a whole different thing...

And generated in XML?

... making an XML representation of even programming languages in general
rather impossible IMHO.

Whatever. clang (clang.llvm.org) can output C files as XML like gccxml.
You can even check for differences in the approaches of these two tools.

Best
Olaf Krzikalla
[It is definitely not possible to create a universal intermediate form,
as people have been relearning since the original UNCOL project in the
1950s, but it should be possible to do common analyses for interesting
parts of semantically similar languages. -John]

Nikolaos Kavvadias · Mar 18, 2010

Hi all

this rather unknown tool (called "c2xml") does a good job in producing
an XML representation of C source:

http://www.plutospin.com/c2xml.html

I think it is GPL'ed.

Kind regards
Nikolaos Kavvadias

Hans-Peter Diettrich · Mar 20, 2010

Nikolaos said:
this rather unknown tool (called "c2xml") does a good job in producing
an XML representation of C source:

http://www.plutospin.com/c2xml.html

My C converter should not have problems with complex structures, and
will be easily extendable for output in XML instead of Pascal:

http://sourceforge.net/projects/topas/

DoDi
