What is Expressiveness in a Computer Language

M

Marshall

Chris said:
As in: "the value of one variable can (be/refer to/depend on) the
identity of another variable"? I can certainly see this as as
reasonable concept to consider.

Yes, I think it's important. We've become so accustomed, in
modern languages, to considering variables as first-class,
that we cannot see any more that many of the problems
attributed to variables (such as aliasing) are actually
problems only with *first-class* variables. The historical
reality that non-first-class variables are associated with
more antique languages (Fortran/pre-OO) makes us shy
away from even considering removing first-class status
from variables. But I think it's an important point in the
design space (although I certainly respect Andreas'
reluctance; to paraphrase, "first class or bust." :) )

After all, what are the alternatives? Purely-functional
languages remove themselves from a large class of
problems that I consider important: data management.
I have explored the OO path to its bitter end and am
convinced it is not the way. So what is left? Uniqueness
types and logic programming, I suppose. I enjoy logic
programming but it doesn't seem quite right. But notice:
no pointers there! And it doesn't seem to suffer from the
lack. I really don't know enough about uniqueness types
to evaluate them. But again, I already have my proposed
solution: variables but no first-class variables. (And yes,
I'm acutely aware of how problematic that makes
closures. :-()

There's confusion here coming from different usages of the word
variable. Let us talk instead of values, and of the abstract structures
that gives them meaning.

I don't think that will work. We cannot discuss constraints, nor
aliasing, without bringing variables in to the picture. (Aliasing
may exist without variables but it is a non-problem then.)

In both cases (invariants in a hypothetical
imperative language, and in a relational database), the constraints make
reference to these structures of values (relations, for example, or
various kinds of data structures), and not to the individual values or
objects that they contain.

I am not certain what this means.

In both cases, the problem is not that we
don't know what structures to check to verify the invariant; rather,
it's that we have to check ALL of the values in that structure.

But not all values in all structures, as you do in the presence
of aliasing.

As someone pointed out, this is to be expected in a world of mutable
things with identity that are globally locatable. It is simple fact
that if I tell you "I spoke to Barbara's husband", you may need to trace
down who Barbara's husband is before you could discover that, for
example, maybe I actually spoke to your boss, or to your nephew's best-
friend's father. If databases are capable of modeling these kinds of
relationships (and of course they are), then they are as susceptible to
"aliasing" -- in a logical sense that avoids mention of pointer -- as
anyone else.

I don't see that they're the same thing. Similar is some ways, yes,
but not the same.

As I said much earlier, the terminology is problematic, and it may
be that we (or just I) simply lack the precision of terms necessary
for the conversation.


Marshall
 
A

Andreas Rossberg

Darren said:
Not really. The way Hermes handles this is with destructive assignment.
Each variable holds a value, and you (almost) cannot have multiple
variables referring to the same value.

OK, this is interesting. I don't know Hermes, is this sort of like a
dynamically checked equivalent of linear or uniqueness typing?
If you want to assign Y to X, you use
X := Y
after which Y is considered to be uninitialized. If you want X and Y to
have the same value, you use
X := copy of Y
after which X and Y have the same value but are otherwise unrelated, and
changes to one don't affect the other.

Mh, but if I understand correctly, this seems to require performing a
deep copy - which is well-known to be problematic, and particularly
breaks all kinds of abstractions. Not to mention the issue with
uninitialized variables that I would expect occuring all over the place.
So unless I'm misunderstanding something, this feels like trading one
evil for an even greater one.

- Andreas
 
J

Joachim Durchholz

Marshall said:
What about my example of SQL? Mutation, no pointers, no aliasing.
Yet: useful.

Sorry, but SQL does have aliasing.

E.g. if you have records that have name="John", surname="Doe", the
statements
SELECT * FROM persons WHERE name = "John"
and
SELECT * FROM persons WHERE name = "Doe"
are aliases of each other.

The alias is actually in the WHERE clause. And this *can* get you into
trouble if you have something that does
UPDATE ... WHERE name = "John"
and
UPDATE ... WHERE surname = "Doe"
e.g. doing something with the Johns, then updating the names of all
Does, and finally restoring the Johns (but not recognizing that changing
the names of all Does might have changed your set of Johns).

Conceptually, this is just the same as having two different access path
to the same memory cell. Or accessing the same global variable through a
call-by-reference parameter and via its global name.

BTW with views, you get not just aliasing but what makes aliasing really
dangerous. Without views, you can simply survey all the queries that you
are working with and lexically compare table and field names to see
whether there's aliasing. With views, the names that you see in a
lexical scope are not enough to determine aliasing.
E.g. if you use a view that builds upon the set of Johns but aren't
aware of that (possibly due to abstraction barriers), and you change the
name field of all Does, then you're altering the view without a chance
to locally catch the bug. That's just as bad as with any pointer
aliasing problem.

Regards,
Jo
 
A

Andreas Rossberg

Marshall said:
After all, what are the alternatives? Purely-functional
languages remove themselves from a large class of
problems that I consider important: data management.

Maybe, but I have yet to see how second-class variables are really more
adequate in dealing with it.

And note that even with second-class state you can still have aliasing
issues - you just need mutable arrays and pass around indices. Keys in
databases are a more general form of the same problem.
I have explored the OO path to its bitter end and am
convinced it is not the way. So what is left? Uniqueness
types and logic programming, I suppose. I enjoy logic
programming but it doesn't seem quite right. But notice:
no pointers there! And it doesn't seem to suffer from the
lack.

Uh, aliasing all over the place! Actually, I think that logic
programming, usually based on deep unification, brings by far the worst
incarnation of aliasing issues to the table.

- Andreas
 
J

Joachim Durchholz

Marshall said:
void foo() {
int i = 0;
int j = 0;
j = 1;
i = 2;
// check value of j here. It is still 1, no matter what you filled
// in above.
// The assignment to i cannot be made to affect the value of j.
}

Those two local primitive variables cannot be made to have the same
identity.

Sure. To state it more clearly, they cannot be aliases.
But you can update them, so this is an example of mutability
without the possibility of identity.

You're being a bit sloppy with terminology here. "Identity" in the
phrase above refers to two entities, in the sense of "i and j cannot be
identical".
Identity is actually a property of a single entity, namely that what
remains constant regardless of what you do with the entity.

Regards,
Jo
 
J

Joachim Durchholz

Marshall said:
By your definition, "pointer" and "variable" are synonyms. That doesn't
seem like a good idea to me. (What if i and j end up in registers?
I have not heard it said that registers have addresses.)

There are no registers in the JVM ;-P

More specifically, where Joe said "pointer" he really meant "reference".
i refers to a variable, and you really want to make sure that the first
i refers to the same variable as the second i, so whatever is referred
to by i is really an identity.
Okay, so "i" and "i" have the same identity.

Vacuous but true.
I suppose you could argue that the language's namespace is an
address-space, and variable names are addresses.
Correct.

At this point, though, you've made the concept of identity so broad
that it is now necessarily a property of all languages that use named
values, whether updatable or not. I think you would also have to call
"3" a void pointer by this scheme.

It's not a void pointer, it's a pointer to an integer, but you're
correct: "3", when written in a program text, is a reference to the
successor of the successor of the successor of zero.

If you find that silly and would like to stick with equality, then
you're entirely correct. Natural numbers are entirely immutable by
definition, and I'm definitely not the first one to observe that
identity and equality are indistinguishable for immutable objects.
Clearly there is *some* difference between a language which allows
explicit pointers and thus aliasing and one that doesn't.

You can have aliasing without pointers; e.g. arrays are fully sufficient.
If i = j, then a and a [j] are aliases of the same object.

(After I observed that, I found it no longer a surprise that array
optimizations are what Fortran compiler teams sink most time into. The
aliasing problems are *exactly* the same as those with pointers in C -
though in practice, the Fortranistas have the advantage that the
compiler will usually know that a and b [j] cannot be aliases of
each other, so while they have the same problems, the solutions give the
Fortranistas more leverage.)
What term would you use? First-class variables?

I think it's more a quasi-continuous spectrum.

On one side, we have alias-free toy languages (no arrays, no pointers,
no call-by-reference, just to name the most common sources of aliasing).

SQL is slightly more dangerous: it does have aliases, but they are
rarely a problem (mostly because views aren't a very common thing, and
even if they are used, they aren't usually so abstract that they really
hide the underlying dependencies).

Next are languages with arrays and call-by-reference. "Pascal without
pointers" would be a candidate.
Here, aliasing occurs, and it can and does hide behind abstraction
barriers, but it's not a serious problem unless the software becomes
*really* large.

The end of the line are languages that use references to mutable data
everywhere. All OO languages are a typical example of that.

Regards,
Jo
 
J

Joachim Durchholz

Marshall said:
Oddly, I feel the opposite. While it's true there are many domains
for which purely functional programming is a fine choice, there
are some domains for which it is insufficient. Any kind of data
managament, for example, requires that you be able to update
the information.

Sure, the data itself cannot be managed non-destructively.

However, the programs that operate on the data need not do destructive
updates internally. That way, I can have pointers inside the programs
without giving up aliasing.
(Aliasing is really, really, really important. It's the reason why
mathematicians can handle infinite objects - they copy just the alias,
i.e. the name or description, instead of the object itself. It's one of
the things that make abstraction useful.)
But then what do you do with that function?

I pass it to an engine that's imperative.

However, that engine doesn't need to do aliasing anymore.

In other words, I'm separating mutability and aliasing, so that they
can't combine and explode.
Let's say I have an
employee database. John Smith just got hired on 1/1/2006 with
a salary of $10,000. I need to record this fact somewhere. How
do I do that without variables? Current-employees is a variable.
Even if I have the space to keep all historical data, so I'm not
deleting anything, I still have to have a variable for the latest
version of the accumulated data. I can solve this without
pointers, but I can't solve it without variables.

As I said elsewhere, the record has an identity even though it isn't
explicit in SQL.
(SQL's data manipulation language is generally not considered half as
"clean" as the query language; I think the core of the problemsis that
DML combines mutability and identity.)
I should like to learn more about these. I have some vague
perception of the existence of linear logic, but not much
else. However, I also already have an excellent solution
to the pointer aliasing problem, so I'm less motivated.

Pointers are just the most obvious form of aliasing. As I said
elsewhere, there are others: array indexing, by-reference parameters, or
even WHERE clauses in SQL statements. In fact, anything that selects an
updatable piece of data from a collection is an alias, and incurs the
same problems as pointers, though they may come in different disguises.
I strongly object; this is quite incorrect. I grant you that from the
50,000 foot level they appear identical, but they are not. To
qualify as a reference, there need to be reference and dereference
operations on the reference datatype; there is no such operation
is SQL.

Would you say the relational algebra has references?
Yes.

Or, consider the classic prolog ancestor query. Let's say we're
setting up as follows

father(bob, joe).
father(joe, john).

Is "joe" a reference, here? After all, when we do the ancestor
query for john, we'll see his father is joe and then use that to
find joe's father. Keys in SQL are isomorphic to joe in the
above prolog.

Agreed. They are both references ;-P
Primary keys are updatable; there is nothing special about them.

Right - I was wrong with identifying keys and identities. In fact the
identity of an SQL record cannot be directly obtained (unless via those
ID fields).
The records still have identities. It's possible to have two WHERE
clauses that refer to the same record, and if you update the record
using one WHERE clause, the record returned by the other WHERE clause
will have changed, too.
Alas, transaction isolation levels are a performance hack.
I cannot defend them on any logical basis. (Aside: did you mean
serializable, rather than repeatable read?)

Possibly. There are so many isolation levels that I have to look them up
whenever I want to get the terminology 100% correct.
Yes, aliasing introduces a lot of problems. This is one reason
why closures make me nervous.

Me too.
On the other hand, they are so immensely useful.
Simply don't create closures with mutable data in them. Or, to the very
least, make sure that no aliases outside the closure exist.

Regards,
Jo
 
M

Marshall

Joachim said:
Sorry, but SQL does have aliasing.

Well. I suppose we do not have an agreed upon definition
of aliasing, so it is hard to evaluate either way. I would
propose using the same one you used for identity:

if there are two variables and modifying one also modifies
the other, then there is aliasing between them.

I avoided mentioning equality to include, for example,
having an array i that is an alias to a subset of array j.
Please feel free to critque this definition, and/or supply
an alternative.

E.g. if you have records that have name="John", surname="Doe", the
statements
SELECT * FROM persons WHERE name = "John"
and
SELECT * FROM persons WHERE name = "Doe"
are aliases of each other.

The alias is actually in the WHERE clause.

Not by my definition, because there is only one variable here.

And this *can* get you into
trouble if you have something that does
UPDATE ... WHERE name = "John"
and
UPDATE ... WHERE surname = "Doe"
e.g. doing something with the Johns, then updating the names of all
Does, and finally restoring the Johns (but not recognizing that changing
the names of all Does might have changed your set of Johns).

The fact that some person might get confused about the
semantics of what they are doing does not indicate aliasing.
It is easy enough to do an analysis of your updates and
understand what will happen; this same analysis is impossible
with two arbitrary pointers, unless one has a whole-program
trace. That strikes me as a significant difference.

Conceptually, this is just the same as having two different access path
to the same memory cell. Or accessing the same global variable through a
call-by-reference parameter and via its global name.

There are similarities, but they are not the same.

BTW with views, you get not just aliasing but what makes aliasing really
dangerous. Without views, you can simply survey all the queries that you
are working with and lexically compare table and field names to see
whether there's aliasing. With views, the names that you see in a
lexical scope are not enough to determine aliasing.
E.g. if you use a view that builds upon the set of Johns but aren't
aware of that (possibly due to abstraction barriers), and you change the
name field of all Does, then you're altering the view without a chance
to locally catch the bug. That's just as bad as with any pointer
aliasing problem.

It is certainly aliasing, but I would not call it "just as bad." There
are
elaborate declarative constraint mechanisms in place, for example.
And the definition of the view itsef is a declarative, semantic
entity. Whereas a pointer is an opaque, semantics-free address.


Marshall
 
M

Marshall

Andreas said:
Maybe, but I have yet to see how second-class variables are really more
adequate in dealing with it.

And note that even with second-class state you can still have aliasing
issues - you just need mutable arrays and pass around indices. Keys in
databases are a more general form of the same problem.

So for array a, you would claim that "a[5]" is an alias for
(a part of) "a"? That seems to stretch the idea of aliasing
to me. With these two expressions, it is obvious enough
what is going on; with two arbitrary pointers, it is not.

It seems to me the source of the problem is the opacity
of pointers, but perhaps I am missing something.

Uh, aliasing all over the place! Actually, I think that logic
programming, usually based on deep unification, brings by far the worst
incarnation of aliasing issues to the table.

Hmmm. Can you elaborate just a bit?


Marshall
 
M

Marshall

Joachim said:
Sure. To state it more clearly, they cannot be aliases.
Yes.



You're being a bit sloppy with terminology here. "Identity" in the
phrase above refers to two entities, in the sense of "i and j cannot be
identical".
Identity is actually a property of a single entity, namely that what
remains constant regardless of what you do with the entity.

It is a fair critque. I'll try to be more precise.


Marshall
 
M

Marshall

Joachim said:
You can have aliasing without pointers; e.g. arrays are fully sufficient.
If i = j, then a and a [j] are aliases of the same object.


I am having a hard time with this very broad definition of aliasing.
Would we also say that a[1+1] and a[2] are aliases? It seems
to me, above, that we have only a, and with only one variable
there can be no aliasing.

A further question:

given a 32 bit integer variable x, and offsets i and j (equal as in
the above example) would you say that

x &= (1 << i)
and
x &= (1 << j)

are aliased expressions for setting a particular bit in x?

I am not being facetious; I am trying to understand the limits
of your definition for aliasing.

(After I observed that, I found it no longer a surprise that array
optimizations are what Fortran compiler teams sink most time into. The
aliasing problems are *exactly* the same as those with pointers in C -
though in practice, the Fortranistas have the advantage that the
compiler will usually know that a and b [j] cannot be aliases of
each other, so while they have the same problems, the solutions give the
Fortranistas more leverage.)


I don't understand this paragraph. On the one hand, it seems you
are saying that C and Fortran are identically burdened with the
troubles caused by aliasing, and then a moment later it seems
you are saying the situation is distinctly better with Fortran.

I think it's more a quasi-continuous spectrum.

On one side, we have alias-free toy languages (no arrays, no pointers,
no call-by-reference, just to name the most common sources of aliasing).

SQL is slightly more dangerous: it does have aliases, but they are
rarely a problem (mostly because views aren't a very common thing, and
even if they are used, they aren't usually so abstract that they really
hide the underlying dependencies).

Next are languages with arrays and call-by-reference. "Pascal without
pointers" would be a candidate.
Here, aliasing occurs, and it can and does hide behind abstraction
barriers, but it's not a serious problem unless the software becomes
*really* large.

The end of the line are languages that use references to mutable data
everywhere. All OO languages are a typical example of that.

Now with this, it appears you are agreeing that SQL has an advantage
vis-a-vis aliasing compared to OO languages. Yes? If so, we are
agreeing on the part I care about, and the specifics of just what
we call aliasing are not so important to me.


Marshall
 
R

rossberg

Marshall said:
Andreas said:
And note that even with second-class state you can still have aliasing
issues - you just need mutable arrays and pass around indices. Keys in
databases are a more general form of the same problem.

So for array a, you would claim that "a[5]" is an alias for
(a part of) "a"? That seems to stretch the idea of aliasing
to me.

Not at all, I'd say. In my book, aliasing occurs whenever you have two
different "lvalues" that denote the same mutable cell. I think it would
be rather artificial to restrict the definition to lvalues that happen
to be variables or dereferenced pointers.

So of course, a aliases a[j] when i=j, which in fact is a well-known
problem in some array or matrix routines (e.g. copying array slices
must always consider overlapping slices as special cases).
Hmmm. Can you elaborate just a bit?

Logic variables immediately become aliases when you unify them.
Afterwards, after you bind one, you also bind the other - or fail,
because both got already bound the other way round. Unfortunately,
unification also is a deep operation, that is, aliasing can be induced
into variables deeply nested into some terms that happen to get
unified, possibly without you being aware of any of this (the
unification, nor the variable containment). To avoid accidental
aliasing you essentially must keep track of all potential unifications
performed by any given predicate (including transitively, by its
subgoals), which totally runs square to all concerns of modularity and
abstraction.

I've never been much of a logic programmer myself, but our language
group has a dedicated LP and CP background, and people here have
developed very strong opinions about the adequateness of unrestricted
logic variables and particularly unification in practice. I remember
somebody calling Prolog "the worst imperative language ever invented".

- Andreas
 
D

Darren New

Andreas said:
OK, this is interesting. I don't know Hermes, is this sort of like a
dynamically checked equivalent of linear or uniqueness typing?

I'm not sure what linear or uniqueness typing is. It's typestate, and if
I remember correctly the papers I read 10 years ago, the folks at
TJWatson that invented Hermes also invented the concept of typestate.
They at least claim to have coined the term.

It's essentially a dataflow analysis that allows you to do the same
sorts of things that "don't read variables that may not yet have been
assigned to", except that you could annotate that variables change to
the state of "uninitialized" after they've already been initialized.
Mh, but if I understand correctly, this seems to require performing a
deep copy - which is well-known to be problematic, and particularly
breaks all kinds of abstractions.

Um, no, because there are no aliases. There's only one name for any
given value, so there's no "deep copy" problems. A deep copy and a
shallow copy are the same thing. And there are types of values you can
assign but not copy, such as callmessages (which are problematic to copy
for the same reason a stack frame would be problematic to copy).

I believe, internally, that there were cases where the copy was "deep"
and cases where it was "shallow", depending on the surrounding code.
Making a copy of a table and passing it to another process had to be a
"deep" copy (given that a column could contain another table, for
example). Making a copy of a table and using it for read-only purposes
in the same process would likely make a shallow copy of the table.
Iterating over a table and making changes during the iteration made a
copy-on-write subtable, then merged it back into the original table when
it was done the loop, since the high-level semantic definition of
looping over a table is that you iterate over a copy of the table.

The only thing close to aliases are references to some other process's
input ports (i.e., multiple client-side sockets connected to a
server-side socket). If you want to share data (such as a file system or
program library), you put the data in a table in a process, and you hand
out client-side connections to the process. Mostly, you'd define such
connections as accepting a data value (for the file contents) with the
parameter being undefined upon return from the call, and the file name
as being read-only, for example. If you wanted to store the file, you
could just pass a pointer to its data (in the implementation). If you
wanted a copy of it, you would either copy it and pass the pointer, or
you'd pass the pointer with a flag indicating it's copy-on-write, or you
could pass the pointer and have the caller copy it at some point before
returning, depending on what the caller did with it. The semantics were
high-level with the intent to allow the compiler lots of leeway in
implementation, not unlike SQL.
Not to mention the issue with
uninitialized variables that I would expect occuring all over the place.

The typestate tracks this, and prevents you from using uninitialized
variables. If you do a read (say, from a socket) and it throws an "end
of data" exception, the compiler prevents you from using the buffer you
just tried but failed to read.

Indeed, that's the very point of it all. By doing this, you can run
"untrusted" code in the same address space as trusted code, and be
assured that the compiler will prevent the untrusted code from messing
up the trusted code. The predecessor of Hermes (NIL) was designed to let
IBM's customers write efficient networking code and emulations and such
that ran in IBM's routers, without the need for expensive (in
performance or money) hardware yet with the safety that they couldn't
screw up IBM's code and hence cause customer service problems.
So unless I'm misunderstanding something, this feels like trading one
evil for an even greater one.

In truth, it was pretty annoying. But more because you wound up having
to write extensive declarations and compile the declarations before
compiling the code that implements them and such. That you didn't get to
use uninitialized variables was a relatively minor thing, especially
given that many languages nowadays complain about uninitialized
variables, dead code, etc. But for lots of types of programs, it let you
do all kinds of things with a good assurance that they'd work safely and
efficiently. It was really a language for writing operating systems in,
when you get right down to it.
 
J

Joe Marshall

Marshall said:
What the implementation looks like shouldn't affect how we speak
of the logical model. In the above code, there are no pointers.

By your definition, "pointer" and "variable" are synonyms. That doesn't
seem like a good idea to me. (What if i and j end up in registers?
I have not heard it said that registers have addresses.)

You probably won't like this, but....

The example you give above is a bit ambiguous as to whether or not it
shows `true' mutation. So what do I mean by `true' mutation? The code
above could be transformed by alpha-renaming into an equivalent version
that does no mutation:

void foo() {
int i = 0;
int j = 0;

// put any code here you want

int jj = 1;
int ii = 2;
// rename all uses of i to ii and j to jj after this point.

}

Once I've done the renaming, we're left with a pure function. I claim
that this sort of `mutation' is uninteresting because it can be
trivially eliminated through renaming. Since we can eliminate it
through renaming, we can use a simplified semantics that does not
involve a `store', and use a substitution model of evaluation.

The more interesting kind of mutation cannot be eliminated through
renaming. The semantic model requires the notion of a stateful
`store', and the model is not referentially transparent: substitution
does not work. These qualities are what make mutation problematic. My
claim is that this kind of mutation cannot be had without some model of
pointers and references.

There is a further distinguishing characteristic: Can I write a
program that detects the mutation (and acts differently depending on
whether there is mutation or not)? It is unclear from your example
above whether you intended this to be possible, but certainly there is
no way for a caller of foo to tell that foo reassigns an internal
variable. If there a procedure is extrinsically a pure function, then
there is no real meaning to any `mutation' that goes on inside; there
is always a way to turn the internal mutation into a pure function
through alpha-renaming.
Clearly there is *some* difference between a language which allows
explicit pointers and thus aliasing and one that doesn't. What term
would you use? First-class variables?

My argument to you in re this statement:
Again, I disagree: it is posible to have mutability without
pointers/identity/objects.

is that mutability without a notion of pointers, identity, or objects
is trivially eliminated through alpha renaming and thus isn't the
`mutability' of interest.
 
J

Joachim Durchholz

Marshall said:
Well. I suppose we do not have an agreed upon definition
of aliasing, so it is hard to evaluate either way. I would
propose using the same one you used for identity:

if there are two variables and modifying one also modifies
the other, then there is aliasing between them.

I think that's an adequate example.
For a definition, I'd say it's a bit too narrow unless we use a fairly
broad definition for "variable". I.e. in a C context, I'd say that a->b
is a variable, too, as would be foo(blah)->x.
I avoided mentioning equality to include, for example,
having an array i that is an alias to a subset of array j.

This would mean that there's aliasing between, say, a list of
transaction records and the balance of the account (since modifying the
list of transactions will change the balance, unless the object isn't
properly encapsulated).
For purposes of this discussion, it's probably fair to say "that's a
form of aliasing, too", even though it's quite indirect.

Arrrrrgh... I made a most braindamaged, stupid mistake here. The second
SELECT should have used the *surname* field, so it would be

Then, if there's a record that has name = "John", surname = "Doe", the
two WHERE clauses have aliasing in the form of overlapping result sets.
Not by my definition, because there is only one variable here.

Sorry, my wording was sloppy.

I meant to say that 'in SQL, you identify records via clauses like WHERE
name = "John", i.e. WHERE clauses are a kind of identity'.

This is still not precise - the identity of an SQL record isn't
explicitly accessible (except the nonstandard OID facility that some SQL
engines offer). Nevertheless, they do have an identity, and there's a
possibility of aliasing - if you change all Johns, you may also change a
Doe.
The fact that some person might get confused about the
semantics of what they are doing does not indicate aliasing.
It is easy enough to do an analysis of your updates and
understand what will happen; this same analysis is impossible
with two arbitrary pointers, unless one has a whole-program
trace. That strikes me as a significant difference.

Sure. I said that aliases in SQL aren't as bad as in other programs.

Once you get abstraction mixed in, the analysis becomes less
straightforward. In the case of SQL, views are such an abstraction
facility, and they indeed can obscure what you're doing and make the
analysis more difficult. If it's just SQL we're talking about, you
indeed have to look at the whole SQL to check whether there's a view
that may be involved in the queries you're analysing, so the situation
isn't *that* different from pointers - it's just not a problem because
the amount of code is so tiny!
There are similarities, but they are not the same.

What are the relevant differences? How does the semantics of a WHERE
clause differ from that of a pointer, in terms of potential aliasing?

My line of argument would be this:
Pointers can be simulated using arrays, so it's no surprise that arrays
can emulate all the aliasing of pointers.
Arrays can be simulated using WHERE (with SELECT and UPDATE), so I'd say
that SQL can emulate all the aliasing of arrays, and hence that of pointers.
It is certainly aliasing, but I would not call it "just as bad." There
are
elaborate declarative constraint mechanisms in place, for example.

Yes, but they can give you unexpected results.
It smells a bit like closing the gates after the horses are out.

On the other hand, *if* there is identity and updates, no matter how we
handle them, there must be *some* way to deal with the problems that
ensue. Having declarative constraints doesn't sound like the worst way
to do that.
And the definition of the view itsef is a declarative, semantic
entity. Whereas a pointer is an opaque, semantics-free address.

Oh, that's a very BCPL view. It was still somewhat true for K&R C, but
it's not really applicable to ANSI C, not at all to C++, and even less
to languages that associate well-encapsulated types with pointers (such
as Ada, Eiffel, or Java).
Unless, of course, you mean "address" if you say "pointer" ;-)

Regards,
Jo
 
J

Joachim Durchholz

Marshall said:
Joachim said:
You can have aliasing without pointers; e.g. arrays are fully sufficient.
If i = j, then a and a [j] are aliases of the same object.


I am having a hard time with this very broad definition of aliasing.
Would we also say that a[1+1] and a[2] are aliases? It seems
to me, above, that we have only a, and with only one variable
there can be no aliasing.


a[1] is a variable, too. You can assign values to it.

Inside function foo when called as foo(a[1]), it may even be referenced
with yet another name and updated in isolation; foo may not even know
it's an array element. (This is sort-of true in C, and definitely true
in languages that don't have pointer arithmetic. If you pass a[1] to a
call-by-reference parameter in Pascal, the callee has no way to access
the parameter as if it were an array element.)
A further question:

given a 32 bit integer variable x, and offsets i and j (equal as in
the above example) would you say that

x &= (1 << i)
and
x &= (1 << j)

are aliased expressions for setting a particular bit in x?
Yes.

I am not being facetious; I am trying to understand the limits
of your definition for aliasing.

Understood and appreciated.
(After I observed that, I found it no longer a surprise that array
optimizations are what Fortran compiler teams sink most time into. The
aliasing problems are *exactly* the same as those with pointers in C -
though in practice, the Fortranistas have the advantage that the
compiler will usually know that a and b [j] cannot be aliases of
each other, so while they have the same problems, the solutions give the
Fortranistas more leverage.)


I don't understand this paragraph. On the one hand, it seems you
are saying that C and Fortran are identically burdened with the
troubles caused by aliasing, and then a moment later it seems
you are saying the situation is distinctly better with Fortran.


That's exactly the situation.
Aliasing is present in both language, but in Fortran, the guarantee that
two names aren't aliased are easier to come by.
Even more non-aliasing guarantees exist if the language has a more
expressive type system. In most cases, if you know that two expressions
have different types, you can infer that they cannot be aliases.

Of course, the life of compiler writers if of less importance to us; I
added them only because the reports of Fortran compilers having aliasing
problems due to array indexes made me aware that pointers aren't the
only source of aliasing. That was quite a surprise to me then.
Now with this, it appears you are agreeing that SQL has an advantage
vis-a-vis aliasing compared to OO languages. Yes?

Sure.
I think that making an aliasing bug in SQL is just as easy as in any
pointer-ridden language, but transactions and constraints (plus the
considerably smaller typical size of SQL code) make it far easier to
diagnose any such bugs.
If so, we are agreeing on the part I care about, and the specifics of
just what we call aliasing are not so important to me.

I'm not sure whether I'm getting the same mileage out of this, but the
relevance of a problem is obviously a function on what one is doing on a
day-to-day basis, so I can readily agree with that :)

Regards,
Jo
 
C

Chris Smith

Marshall wrote...
I am having a hard time with this very broad definition of aliasing.
Would we also say that a[1+1] and a[2] are aliases? It seems
to me, above, that we have only a, and with only one variable
there can be no aliasing.

The problem with this (and with the relational one as well) is that the
practical problems don't go away by redefining "variable" to mean larger
collections of values.
given a 32 bit integer variable x, and offsets i and j (equal as in
the above example) would you say that

x &= (1 << i)
and
x &= (1 << j)

are aliased expressions for setting a particular bit in x?

I don't know about Joachim, but yes, I would.
 
C

Chris F Clark

Joachim said:
You can have aliasing without pointers; e.g. arrays are fully sufficient.
If i = j, then a and a [j] are aliases of the same object.


Marshall said:
I am having a hard time with this very broad definition of aliasing.
Would we also say that a[1+1] and a[2] are aliases? It seems
to me, above, that we have only a, and with only one variable
there can be no aliasing.

As you can see from the replies, all these things which you do not
consider aliasing are considered aliasing (by most everyone else, in
fact these are exactly what is meant by aliasing). There is good
reason for this. Let us explain this by looking again at the SQL
example that has been kicked around some.

Some set of records are shared by both select clauses. Those records
are the aliased records. If the set is empty there is no problem. If
we do not update the records from one of the select clauses, we also
have no problem. However, if we update the records for one of the
select clauses and the set of aliased records is not empty, the
records returned by the other select clause have changed. Therefore,
everything we knew to be true about the "other" select clause may or
may not still be true. It's the may not case we are worried about.

For example, in the "imperative Mi-5" Q asks Moneypenny to run the
first query (name = John) to get a list of agents for infiltrating
Spectre. In another department, they're doing sexual reassignments
and updating the database with the second query (surname = Doe) and
making the name become Jane (along with other changes to make the
transformation correct). So, when Q select "John Doe" from the first
list and sends that agent on a mission Q is really sending "Jane Doe"
and Spectre realizes that the agent is one and kills her. Not a happy
result.

In the "functional Mi-5", the sexual reassignment group, makes clones
of the agents reassigned, so that there is now both a "John Doe" and a
"Jane Doe". Of course, the payroll is a little more bloated, so that
when Q says "John Doe" isn't needed for any further assignments, the
"Garbage Collection" department does a reduction in force and gets rid
of "John Doe". The good news is that they didn't send Jane Doe to her
death.

The problem with aliasing, we want the things we knew true in one part
of our program, i.e. the Johns on Q's list are still Johns and not
Janes, to be true after we've run another part of our program. If the
second part of our program can change what was true after the first
part of our program, we can't depend on the results from the first
part of our program, i.e. Q can send Jane Doe into certain death.

The problem is compounded by the complexities of our programs. When Q
selected his list, he didn't know of the department doing the
reassignments (and couldn't know because they were part of a
top-secret project). So, since bureaucracies grow to meet the
increasing needs of the bureaucracy, often the solution is increased
complexity, more regulations and paperwork to fill out, semaphores and
locks to hold on critical data, read-only data, status requests, etc.
All to keep Q from killing Jane by accident. Sometimes they work. the
reassignment department has to wait 30-days for the clearance to
perform their operation in which time John Doe completes the
infiltration of Spectre saves the world from destruction and is ready
for his next assignment.

The key point is that each record (and even each field in each record)
if it can be known by two names, is an alias. It is not sufficient to
talk about "whole" variables as not being aliased if there is some way
to refer to some part of the variable and change that part of the
variable. Thus, a[1+1] is an alias for a[2] if you can have one part
of the code talking about one and another part of the code talking
about the other.

To put it one still final way, consider the following code;
assert(sorted(array a));
a[1] = 2;
assert(sorted(array a)); // is this still true?

Because a[1] is an alias of array a, you cannot be certain that the
2nd assert will not fire (fail) without additional analysis.

assert(sorted(array a));
assert(a[0] <= 2);
assert(2 <= a[2]);
a[1] = 2;
assert(sorted(array a)); // this is still true!

-Chris
 
M

Marshall

Joachim said:
I pass it to an engine that's imperative.

However, that engine doesn't need to do aliasing anymore.

In other words, I'm separating mutability and aliasing, so that they
can't combine and explode.

This would seem to be identical to what I am proposing.

As I said elsewhere, the record has an identity even though it isn't
explicit in SQL.

Hmmmm. What can this mean?

In general, I feel that "records" are not the right conceptual
level to think about.

In any event, I am not sure what you mean by non-explicit
identity. I would say, records in SQL have value, and their
identity is exactly their value. I do not see that they have
any identity outside of their value. We can uniquely identify
any particular record via a key, but a table may have more
than one key, and an update may change the values of one
key but not another. So it is not possible in general to
definitely and uniquely assign a mapping from each record
of a table after an update to each record of the table before
the update, and if you can't do that, then where
is the record identity?

(The situation is even more dire when one considers that
SQL actually uses bag semantics and not set semantics.)


[...]
Pointers are just the most obvious form of aliasing. As I said
elsewhere, there are others: array indexing, by-reference parameters, or
even WHERE clauses in SQL statements. In fact, anything that selects an
updatable piece of data from a collection is an alias, and incurs the
same problems as pointers, though they may come in different disguises.

Okay. At this point, though, the term aliasing has become extremely
general. I believe "i+1+1" is an alias for "i+2" under this definition.
That is so general that I am concerned it has lost its ability to
identify problems specific to pointers.

Agreed. They are both references ;-P

Again, by generalizing the term this far, I am concerned with a
loss of precision. If "joe" in the prolog is a references, then
"reference" is just a term for "data" that is being used in a
certain way. The conection with a specfic address space
has been lost in favor of the full domain of the datatype.

Right - I was wrong with identifying keys and identities. In fact the
identity of an SQL record cannot be directly obtained (unless via those
ID fields).
The records still have identities. It's possible to have two WHERE
clauses that refer to the same record, and if you update the record
using one WHERE clause, the record returned by the other WHERE clause
will have changed, too.

Is this any different from saying that an expression that includes
a variable will produce a different value if the variable changes?

In other words, "i+1" will return a different value if we update i.

It seems odd to me to suggest that "i+1" has identity. I can see
that i has identity, but I would say that "i+1" has only value.

But perhaps the ultimate upshoot of this thread is that my use
of terminology is nonstandard.

Possibly. There are so many isolation levels that I have to look them up
whenever I want to get the terminology 100% correct.

Hmmm. Is it that there are so many, or that they are simply not
part of our daily concern? It seems to me there are more different
styles of parameter passing than there are isolation levels, but
I don't usually see (competent) people (such as yourself) getting
call-by-value confused with call-by-reference.

Me too.
On the other hand, they are so immensely useful.
Simply don't create closures with mutable data in them. Or, to the very
least, make sure that no aliases outside the closure exist.

Thank you for the interesting exchange.


Marshall
 
J

Joachim Durchholz

Marshall said:
Hmmmm. What can this mean?

In general, I feel that "records" are not the right conceptual
level to think about.

They are, when it comes to aliasing of mutable data. I think it's
justified by the fact that aliased mutable data has a galling tendency
to break abstraction barriers. (More on this on request.)
In any event, I am not sure what you mean by non-explicit
identity.

The identity isn't visible from inside SQL. (Unless there's an OID
facility available, which *is* an explicit identity.)
I would say, records in SQL have value, and their
identity is exactly their value.

Definitely not. You can have two equal records and update just one of
them, yielding non-equal records; by my definition (and by intuition),
this means that the records were equal but not identical.
I do not see that they have
any identity outside of their value. We can uniquely identify
any particular record via a key, but a table may have more
than one key, and an update may change the values of one
key but not another. So it is not possible in general to
definitely and uniquely assign a mapping from each record
of a table after an update to each record of the table before
the update, and if you can't do that, then where
is the record identity?

Such a mapping is indeed possible. Simply extend the table with a new
column, number the columns consecutively, and identify the records via
that column.

But even if you don't do that, there's still identity. It is irrelevant
whether the programs can directly read the value of the identity field;
the adverse effects happen because updates are in-place. (If every
update generated a new record, then we'd indeed have no identity.)
Okay. At this point, though, the term aliasing has become extremely
general. I believe "i+1+1" is an alias for "i+2" under this definition.

No, "i+1+1" isn't an alias in itself. It's an expression - to be an
alias, it would have to be a reference to something.

However, a[i+1+1] is an alias to a[i+2]. Not that this is particularly
important - 1+1 is replacable by 2 in every context, so this is
essentially the same as saying "a[i+2] is an alias of a[i+2]", which is
vacuously true.

There's another aspect here. If two expressions are always aliases to
the same mutable, that's usually easy to determine; this kind of
aliasing is usually not much of a problem.
What's more of a problem are those cases where there's occasional
aliasing. I.e. a and a[j] may or may not be aliases of each other,
depending on the current value of i and j, and *that* is a problem
because the number of code paths to be tested doubles. It's even more of
a problem because testing with random data will usually not uncover the
case where the aliasing actually happens; you have to go around and
construct test cases specifically for the code paths that have aliasing.
Given that references may cross abstraction barriers (actually that's
often the purpose of constructing a reference in the first place), this
means you have to look for your test cases at multiple levels of
software abstraction, and *that* is really, really bad.
That is so general that I am concerned it has lost its ability to
identify problems specific to pointers.

If the reference to "pointers" above means "references", then I don't
know about any pointer problems that cannot be reproduced, in one form
or the other, in any of the other aliasing mechanisms.
Again, by generalizing the term this far, I am concerned with a
loss of precision. If "joe" in the prolog is a references, then
"reference" is just a term for "data" that is being used in a
certain way. The conection with a specfic address space
has been lost in favor of the full domain of the datatype.

Aliasing is indeed a more general idea that goes beyond address spaces.

However, identity and aliasing can be defined in fully abstract terms,
so I welcome this opportunity to get rid of a too-concrete model.
Is this any different from saying that an expression that includes
a variable will produce a different value if the variable changes?

Yes.
Note that the WHERE clause properly includes array indexing (just set up
a table that has continuous numeric primary key, and a single other column).

I.e. I'm not talking about how a is an alias of a[i+1] after updating
i, I'm talking about how a may be an alias of a[j].
It seems odd to me to suggest that "i+1" has identity.

It doesn't (unless it's passed around as a closure, but that's
irrelevant to this discussion).
"i" does have identity. "a" does have identity. "a[i+1]" does have
identity.
Let me say that for purposes of this discussion, if it can be assigned
to (or otherwise mutated), it has identity. (We *can* assign identity to
immutable things, but it's equivalent to equality and not interesting
for this discussion.)
I can see
that i has identity, but I would say that "i+1" has only value.
Agreed.

But perhaps the ultimate upshoot of this thread is that my use
of terminology is nonstandard.

It's somewhat restricted, but not really nonstandard.
Hmmm. Is it that there are so many, or that they are simply not
part of our daily concern?

I guess it's the latter. IIRC there are four or five isolation levels.
It seems to me there are more different
styles of parameter passing than there are isolation levels, but
I don't usually see (competent) people (such as yourself) getting
call-by-value confused with call-by-reference.

Indeed.
Though the number of parameter passing mechanisms isn't that large
anyway. Off the top of my head, I could recite just three (by value, by
reference, by name aka lazy), the rest are just combinations with other
concepts (in/out/through, most notably) or a mapping to implementation
details (by reference vs. "pointer by value" in C++, for example).

Regards,
Jo
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,236
Members
46,822
Latest member
israfaceZa

Latest Threads

Top