Austin said:
Relational databases come from the realisation that more than one
program needs to work with a given set of data, you don't want to store
more than one copy of any given datum, and there should be a formal way
of modelling such things. Thus, you have the Relational Model of Data --
which, as I have mentioned several times, is a model that represents
tuples, attributes, and relations that can be queried using relational
algebra.
Yes, thanks for re-iterating that... full support from me.
Mr Heath's assertion that ER modelling is about physical storage is
completely incorrect;
Austin, please put the rocks down again, and listen carefully.
You'll see I'm supporting your view, and adding to it. Don't
immediately assume that because I said something you didn't
relate to, that you can't learn something new from someone
who's been doing relational *and* object modeling at both
conceptual and physical levels since you were in short trousers.
But before I show why I'm not incorrect, let me first say that
we have an unavoidable need to work with the physical models
we must create, and that there exist no tools (to my knowledge)
that do enough to alleviate the problems that arise from that.
It's not a problem with the relational model, but it *is* a
problem with our need for physical mappings.
there *is* a physical layer that can be modelled,
but it is *primarily* about the logical layer, and the physical
characteristics DO NOT MATTER at that logical layer.
SQL is, at best, an approximation of the relational model
You can spit all you like about how SQL causes these
problems, but it isn't really the cause; rather it's the
absence of a better alternative. Until you can point to
a better alternative, you're fighting a pointless battle
with the practitioners who maintain that relational
databases are hard to work with. You're saying they
shouldn't be hard, and you're right. They're saying they
*are* hard, and they're right. Read on to see why.
Ok, with that out of the way, let me introduce fact-based
modeling, and show how it solves problems that repeatedly
even occur in properly-designed relational databases. Sit
tight, this is going to take a while. I'll try to make it
shorter; please don't pick nits with my shortcuts until
you've seen the whole picture.
Suppose we want build a simple schema to record which types
of beer people like. We can record the following elementary
facts:
Person is known by Name.
Beer is known by BeerName.
Person is fond of Beer.
Here we have two entity types (Person and Beer), and two
things that might be value types (Name and BeerName). Each
entity type has a defined reference mode ("is known by"),
which each form a fact type that's a 1:1 relationship.
Finally we have one fact type, with the reading "is fond of",
in which Person and Beer both play a role.
Now this is enough information for most folk to work out
what's going on, but not yet enough to define a relational
schema; we need to know whether we will record whether a
person may be fond of more than one beer. I'm assuming that
a beer might be liked by more than one person, but lets
assume that my initial fact said "Person has favourite Beer"
instead of "Person is fond of Beer". It's also the case
that a given Person might not like any beer, but let's
assume that we have other reasons to record such a Person.
We can represent this model using the relation:
Person(Name*, BeerName?)
where * means primary key, and ? means null is allowed.
Further normalization requires:
Person(Name*)
FavouriteBeer(Name*, BeerName*)
which avoids the NULL value. Both are valid choices, since
they can be mapped to one another without loss - though the
former is a preferable physical model. Notice how in neither
case did we need a Beer() relation. That's because all fact
roles of Beer have been absorbed into one of the tables you
see. So far so good... until we get a change request.
We have to record all the beers a person likes in priority
order. Now the second form, which wasn't preferred because
of its additional physical cost, is preferable, because we
can simply add a column "priority" to FavouriteBeer, renaming
it PreferredBeers.
All we did was add a fact "Person has Preference for Beer",
and the new constraint allows more than one Beer per Person,
yet all our relational queries are written incorrectly, and
we have to create migration code to construct a new table,
and revisit all our queries to map to the new tables. SQL
helps a little by allowing us to construct views, but the
views can't hide the fact that the Person table no longer
has a BeerName attribute.
The story goes on... now change things so that a Name
is made up of a FamilyName and a GivenName, and again so
that FamilyName is the primary key of a Family entity which
has a functionally dependent attribute MedicareNumber. You
can't avoid such schema migrations, and you can't avoid
the fact that you must create an efficient physical schema
each time, so you wind up in a constant tradeoff between
keeping your schema clean and refactoring your queries.
Notice that I've said *nothing* here about SQL that isn't
true of any effective relational system in existence. The
need to store more than one fact per tuple (for performance
reasons) is the cause. It's these *compound facts* that
create the schema evolution problems that SQL suffers from,
yet compounding is unavoidable for performance reasons.
Hence my comments about disk storage, which I stand by.
The only solution is a system that enables us to completely
hide the physical layer from the user (from the queries),
and that's what fact-based modeling can do. The details of
how are contained in Terry Halpin's book "Information
Modeling and Relational Databases", and are implemented in
at least four significant database design tools, three of
which he designed (the other is CaseTalk,
www.casetalk.com).
The most recent is an open source plugin for Visual Studio
called NORMA, which is available as a CTP as the "orm"
project at <
http://sourceforge.net/projects/orm/>. Talk
about an iceberg in hell! This thing is already *way* cool.
I'm visiting the team (which Terry leads) for two days in
May before the Rails conference.
A fact-based model is still relational, and still has all
the benefits of being built on first order logic and the
predicate calculus, yet it's also intrinsically different
from what has come to be commonly known as "the relational
model".
In particular, all relevant join paths are known in the
schema or can be added without breaking it. So I can say
the conceptual query:
Person(Name@) who
has more than 4 Preference
for Beer(BeerName@)
where the @ sign says that I want this as part of my result
set, and this will return the names and preferred beers
(without priority details) of all beer connoisseurs. Leave
off the @ sign from Person(Name) and you just get the beer
names without repetition.
I didn't need to say JOIN or WHERE anywhere there, so it
was easy to write (also easy to build with a graphical
tool), and extensive study has shown that such queries are
highly resistant to being broken as the schema grows.
The result set is not in first normal form, BTW. It's a
tree, or in more complex cases, a graph. It starts to look
very like a de-serialized collection of objects (what I call
a fact constellation, because it selects a meaningful group
of stars from a starry sky of facts), and that's how
programmers need it to be - it immediately addresses O/RM
requirements with no further work.
This is what relational databases *should* be like, but
aren't. A raw beginner can use a graphical query builder
(like the one in InfoModeler - Google that) that would make
an experienced DBA quake in their boots. Such queries aren't
hard to translate to SQL against the underlying physical
model, as long as you preserve the derivations by which
the physical schema was created. I've written Ruby code that
already handles some cases.
The remaining problem is that the existing tools are only
design tools, that generate database schemata and static
data layers. What's needed in addition is a flexible runtime
and query processor, written in a dynamic language, and it's
to create such an animal that I registered the ActiveFacts
project on RubyForge. There's no content there yet (sorry
Christian), but if anyone who's willing to do the background
study, they're welcome to help out. I hang out on the new
Yahoo group "information_modeling", and need folk to discuss
ActiveFacts with while I develop it to the point where I have
enough on which to base collaboration.
I plan to post a significant literature review there sometime
soon, to help people to get started. It's somewhat a pity
that Terry named his method Object Role Modeling, even though
neither the word "Object" nor the acronym "ORM" had been used
as they do now... but if we agree to call it fact-based
modeling or information modeling I think we can avoid
confusion. A good overview of Object Role Modeling is at
<
http://msdn2.microsoft.com/en-us/library/aa290383(VS.71).aspx>
Clifford Heath.