The OMG is pushing something called the Abstract Syntax Tree Metamodel
(ASTM). This attempts to define a universal AST with low-level
details such as expressions, operators, operands, the usual. It
doesn't quite succeed because you can't model every language nicely in
a single representation (the UNCOL problem again!). It compromises by
allowing "Specific" Abstract Syntax Tree models for specific languages,
which is a concession to the UNCOL problem but not a satisfactory one.
Here's a (rather dated) tutorial:
http://www.omg.org/news/meetings/workshops/ADM_2005_Proceedings_FINAL/T-3_Newcomb.pdf
You can find the actual standards documents at the OMG site.
The ASTM has an interchange model in XML.
Because there are competing factions inside OMG for every standard
(sigh), there's *another* "standard" called the Knowledge Discovery
Metamodel (KDM).
This models programs in rather larger chunks, as the OP requested
(functions and statements, but not the details inside statements, AFAIK),
but because nobody is satisfied with that, it also attempts to model
flow information between the chunks. So it tries to be chunkier than
the AST (easier to produce results) but more detailed at the same time.
http://www.omg.org/spec/KDM/1.0/
There aren't a lot of folks writing tools to produce this information.
We build lots of language front ends (see
http://www.semanticdesigns.com/Products/DMS/FrontEnds.html), and OMG
people involved in these standards have asked us a number of times why
we don't produce data in either form. The answer is pretty
simple: putting data in this form is only useful if you have strong
machinery that can consume it and reason from it.
I think the KDM is wrong-headed in that it is lossy; you don't
get the details of the code, and those are needed for deep reasoning.
We only want to do deep reasoning (shallow reasoning leads
to unhelpful answers or bad false positives), so that's not a good
route for us.
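
To make "lossy" concrete, here is a minimal sketch using Python's standard `ast` module. It is only an analogy, not the KDM: `chunk_summary` is a hypothetical helper that keeps function names and statement kinds and throws away everything inside the statements.

```python
import ast


def chunk_summary(source):
    """Return (function name, [statement kinds]) pairs.

    A deliberately lossy view: nothing inside a statement survives.
    Hypothetical helper for illustration only; this is not the KDM.
    """
    tree = ast.parse(source)
    return [
        (node.name, [type(stmt).__name__ for stmt in node.body])
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    ]


ADD = "def f(a, b):\n    return a + b\n"
SUB = "def f(a, b):\n    return a - b\n"

# The chunk-level view cannot tell the two functions apart...
print(chunk_summary(ADD) == chunk_summary(SUB))

# ...while the full AST still records Add versus Sub.
print(ast.dump(ast.parse(ADD)) == ast.dump(ast.parse(SUB)))
```

Any analysis that cares whether the code adds or subtracts has nothing to work with once only the chunk-level view is left.
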
The ASTM at least tries not to be lossy, but it attempts to jam
everything into a single model.
My personal belief is that attempting to jam every language's
syntax into one model pretty much makes deep reasoning impossible
because you again lose detail.
So, I have a "+" operator in the AST. What does it mean?
"+" in C with 2's complement overflows?
String "+" in Java? Python "+" with infinite precision?
You can fix this for "+" by marking it with the precise dialect from which
it came,
but now I have "+~C", "+~Java.stringtype", "+.Python2_6",
and it's hard to argue I now have a universal tree.
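
A rough sketch of what those tagged operators imply, written in Python. The three functions are my own simplified models, not anything taken from the ASTM: the C one assumes 32-bit wraparound as typical hardware does it (ISO C actually leaves signed overflow undefined), and the Java one just converts both operands to text.

```python
def add_c_int32(a, b):
    """Model 32-bit two's complement addition that wraps on overflow.

    Assumption: mimics typical hardware; ISO C leaves signed
    overflow undefined.
    """
    result = (a + b) & 0xFFFFFFFF
    return result - 0x100000000 if result >= 0x80000000 else result


def add_java_string(a, b):
    """Simplified model of Java's String "+": convert to text, concatenate."""
    return str(a) + str(b)


def add_python_int(a, b):
    """Python's own int "+": arbitrary precision, never wraps."""
    return a + b


# Dialect tags copied from the labels above; the mapping is hypothetical.
PLUS_SEMANTICS = {
    "+~C": add_c_int32,
    "+~Java.stringtype": add_java_string,
    "+.Python2_6": add_python_int,
}


def evaluate_plus(tag, a, b):
    """Dispatch on the dialect tag; a bare "+" is not enough to pick one."""
    return PLUS_SEMANTICS[tag](a, b)


INT32_MAX = 2**31 - 1
print(evaluate_plus("+~C", INT32_MAX, 1))            # -2147483648
print(evaluate_plus("+~Java.stringtype", "ab", 1))   # ab1
print(evaluate_plus("+.Python2_6", INT32_MAX, 1))    # 2147483648
```

Same operands, same "+" node, three different answers. The table of tags is where the "universal" tree quietly stops being universal.
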
Finally, I'm more interested in the reasoning results than the
intermediate tree representation. So we've concentrated on building
specific trees for specific language dialects, and on building
language-specific analyzers (using generic analysis support machinery
to the extent we can define it). That is, we put our effort into the
machinery that processes the trees (and into downstream analyses such as
control- and data-flow, points-to analysis, ...).
Our tools can export XML for the trees we generate. This is pretty
easy to do. But we do it for "checkbox" reasons: if somebody asks, we
can do it. None of our customers actually uses this, because 1) the
trees are enormous as text files, and 2) after you've exported the XML
from our tools, you are in a tool vacuum. Now what do you do with
them?
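
To give a feel for point 1, here is a small sketch that dumps a Python AST as XML using only the standard library. The XML layout is made up for illustration (it is not our tools' export format, nor the ASTM interchange format), and `to_xml` is a hypothetical helper.

```python
import ast
import xml.etree.ElementTree as ET


def to_xml(node):
    """Convert a Python ast node to an XML element (ad hoc layout)."""
    elem = ET.Element(type(node).__name__)
    for name, value in ast.iter_fields(node):
        if isinstance(value, ast.AST):
            # Single child node: wrap it in an element named after the field.
            ET.SubElement(elem, name).append(to_xml(value))
        elif isinstance(value, list):
            # List-valued field: one wrapper element holding all the items.
            child = ET.SubElement(elem, name)
            for item in value:
                if isinstance(item, ast.AST):
                    child.append(to_xml(item))
                else:
                    ET.SubElement(child, "value").text = repr(item)
        elif value is not None:
            # Leaf data (identifiers, constants) becomes an attribute.
            elem.set(name, repr(value))
    return elem


source = "x = a + b * 2\n"
xml_text = ET.tostring(to_xml(ast.parse(source)), encoding="unicode")

print(xml_text)
print(len(source), "characters of source,", len(xml_text), "characters of XML")
```

Even for a one-line assignment the XML is much longer than the source, and this sketch doesn't carry source positions, comments, or any of the other details a real front end records.
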