[OT] large db question about no joins

D

Daniel Fetchinson

[off but interesting topic]

Hi folks, I've come across many times the claim that 'joins are bad'
for large databases because they don't scale. Okay, makes sense, we
agree. But what I don't get, although I watched a couple of online
videos on this topic (one by the creator of flickr who gave a talk at
djangoconf and another one by a dev guy from facebook, I think the
links were posted here too) is how do people solve most basic problems
without joins? I guess the whole database layout has to be designed
with this in mind but suppose I have the following layout, how would I
redesign it so that I can have the same functionalities?

Let's say I have a table called zoo, another one called cage, another
one called animal and another one called leg. Each animal has some
number of legs, each cage has animals in it and each zoo has a number
of cages in it. Obviously these are represented by foreign keys. If I
need the total number of animals in a zoo, or the total number of legs
in the zoo I would use a query with join(s).

What would be the corresponding database layout that would scale and I
could get the total number of legs in the zoo or total number of
animals in the zoo without join(s)?

Cheers,
Daniel

[/off but interesting topic]
 
A

Aaron Brady

[off but interesting topic]

Hi folks, I've come across many times the claim that 'joins are bad'
for large databases because they don't scale. Okay, makes sense, we
agree. But what I don't get, although I watched a couple of online
videos on this topic (one by the creator of flickr who gave a talk at
djangoconf and another one by a dev guy from facebook, I think the
links were posted here too) is how do people solve most basic problems
without joins? I guess the whole database layout has to be designed
with this in mind but suppose I have the following layout, how would I
redesign it so that I can have the same functionalities?

Let's say I have a table called zoo, another one called cage, another
one called animal and another one called leg. Each animal has some
number of legs, each cage has animals in it and each zoo has a number
of cages in it. Obviously these are represented by foreign keys. If I
need the total number of animals in a zoo, or the total number of legs
in the zoo I would use a query with join(s).

What would be the corresponding database layout that would scale and I
could get the total number of legs in the zoo or total number of
animals in the zoo without join(s)?

Cheers,
Daniel

[/off but interesting topic]

for cage in zoo:
for animal in cage:
for leg in animal:
count+= 1

</unhelpful>

I think that relational algebra wasn't designed to be able to
sacrifice joins.

Wait a second., how many legs in the zoo??
 
P

Paul Rubin

Daniel Fetchinson said:
Hi folks, I've come across many times the claim that 'joins are bad'
for large databases because they don't scale.

I think that means joins with very large result sets and lots of
different values being matched on between the two tables. The usual
use of a join in, say, web server programming, is to look up something
with just a few results and just one value being matched. That scales
pretty well.
 
D

Daniel Fetchinson

Hi folks, I've come across many times the claim that 'joins are bad'
IMO that's bull...

If you think that's bull, why do you think the google app engine or
bigtable doesn't support joins?

Cheers,
Daniel
 
R

Robert Kern

If you think that's bull, why do you think the google app engine or
bigtable doesn't support joins?

"Large database" is not synonymous with "distributed database".

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
N

norseman

Robert said:
....(snip)

"Large database" is not synonymous with "distributed database".

=======================
True!

And cross-code lookup tables can make otherwise very large 'bytes on
disk' rather small overall.

Z3 in common_names.dbf African Pygmy Zebra
Z3 in zoological.dbf ThatBigLongNameNobodyCanPronounce
^
|
Stored in MAIN dbf along with head count, bin number and
"Zoo" (location) code to use the Zoo thing.

Pgm knows where to find the dbfs and which fields to match.

But then everybody already knew that.


Steve
 
P

Paul Rubin

Daniel Fetchinson said:
If you think that's bull, why do you think the google app engine or
bigtable doesn't support joins?

Join is a relational database concept. Bigtable is not a relational
database. Of course it does lookups, but they are not the same as
relational joins.
 
D

Daniel Fetchinson

Hi folks, I've come across many times the claim that 'joins are bad'
"Large database" is not synonymous with "distributed database".

Yes, sorry for using the wrong words, my question then is how do I
solve a problem similar to my zoo/cage/animal/leg problem with
distributed databases?

Cheers,
Daniel
 
D

Daniel Fetchinson

If you think that's bull, why do you think the google app engine or
Join is a relational database concept. Bigtable is not a relational
database. Of course it does lookups, but they are not the same as
relational joins.

True! I gave a use case that can easily be represented in a relational
database, and what I'm wondering now is how I would represent it in a
bigtable (not relational database) in such a way that I can easily do
the lookups I did when I represented it in a relational database.

Cheers,
Daniel
 
D

Daniel Fetchinson

[off but interesting topic]
What would be the corresponding database layout that would scale and I
could get the total number of legs in the zoo or total number of
animals in the zoo without join(s)?

Cheers,
Daniel

[/off but interesting topic]

That all comes down to the keywords, efficiency, robustness and
performance and you can only pick one of them. So which two can you
sacrifice? The good news is that it is only theoretical, if you have
your requirements (i.e. the query with the right results has to return
within an acceptable time for the user not to get frustrated) who cares
if it is not as fast as it theoretically could be? Especially if this
means you can sacrifice the theoretically performance for easier
deployment and maintenance.

To get back to the layout question, if I need to have a count of the
legs at any given time, I would like to see the business process that
requires this operation.

Well, I gave the concrete example of zoo/cage/animal/leg because this
*is* the business logic. I need to know for example the total number
of animals, this is pretty understandable if you have a zoo. Or you
mean that I should give another example?
Using the business process I could probably
narrow down the scope quite a bit like, it is not necessary to have a
precise count at time *now* but it is good enough to have an exact count
which is no more then 4 seconds old but the query does need to return
within 50 ms.

No, I'd like to have the exact number at any given time.
One solution could be to build a central data warehouse where all info
is stored in. Then have satellite db's that does ETL syncing with the
central one. The layout of the satellites contain an optimised table
version for the queries you want to throw at it.

Let's say I need to implement this on the google app engine, so I
can't make too many choices on my own. On the GAE, how would I do
this?

Cheers,
Daniel
 
P

Paul Rubin

Daniel Fetchinson said:
Yes, sorry for using the wrong words, my question then is how do I
solve a problem similar to my zoo/cage/animal/leg problem with
distributed databases?

In the case of BigTable, you could write a suitable MapReduce task.
You might look at the Wikipedia articles about MapReduce or Hadoop, or
the Hadoop docs, to see how this sort of thing is done.
 
D

Daniel Fetchinson

Well, I gave the concrete example of zoo/cage/animal/leg because this
It might be the business logic but not the process.
http://en.wikipedia.org/wiki/Business_process


Fair enough, but this is then the *exact* scope you like to answer me in.

All other data modeling (e.g. EAV, relational, object orientated) will
result in a structure requiring extra calculation when doing the query
(and thus according your requirement isn't acceptable).

So I need to store the data in such a way that it is accessible before
the query with the minimal amount of processing.

Thus the only thing left is writing a subroutine in my code which
recalculates the amount of legs when there is a change in it and store
it just as a pair value:
amount_of_legs = Positive Integer

Actually, this is a pretty good idea! In my case there are not so many
writes, but tons of reads. So doing this is actually absolutely
feasible.
The thing with bogus requirements for an imaginary example is that the
solution is imaginary bogus too, leaving little room for applicability
in the real world.

It's not so bogus I think. But here is another example, even more
concrete, maybe that will show more clearly what I'm talking about,
sorry if I was not clear enough: let's say you have a music store and
you have cd's in the store.

In an relational database setting you would have a table for artists,
a table for cd's and a table for songs and a table for comments where
people can comment on songs. All of this with obvious foreign keys.
Now you want to display on your website the total number of cd's, the
total number of songs and the total number of comments because that
looks very cool on top of every page: 1242342 cd's and 134242342342
songs and 284284728347284728 comments!

Okay, so you need to issue a query with joins in your relational db.
Or in another query, you want to select all comments of an artist,
meaning all comments that were comments on a song belonging to cd's of
a given artist. This would also involve joins.

How would I do these queries if I can't have joins or in other words
how would I lay out my data storage?

Thanks by the way for your help and replies,
Daniel
 
D

Daniel Fetchinson

In an relational database setting you would have a table for artists,
What actually is happening here is that feature x which is usually in an
abstraction layer (like a RDBM) is not available to you. Which means you
have to write code which does the same but on your side of the code.

I played a bit around with the GAE some time ago and designed the
storage of all data in an sort of EAV model
(http://en.wikipedia.org/wiki/Entity-Attribute-Value_model)
The abstraction layer had all my meta code, counting stuff up was a
matter of creating a list of id's from one section and with that list
selecting the appropriate records on the other section, then count up
the number of results or use the result again, it is very likely you
have to do quite a lot of queries before you get the result.

So in other words, just lay out the data which makes the most sense to
you, the pain of recreating SQL like logic is there no matter what
layout you choose.

Thanks, this wikipedia entry was actually very useful as well as your
other comments.

Thanks again,
Daniel
 
A

Aaron Brady

Daniel Fetchinson wrote:

<cut>







What actually is happening here is that feature x which is usually in an
abstraction layer (like a RDBM) is not available to you. Which means you
have to write code which does the same but on your side of the code.

I played a bit around with the GAE some time ago and designed the
storage of all data in an sort of EAV model
(http://en.wikipedia.org/wiki/Entity-Attribute-Value_model)
The abstraction layer had all my meta code, counting stuff up was a
matter of creating a list of id's from one section and with that list
selecting the appropriate records on the other section, then count up
the number of results or use the result again, it is very likely you
have to do quite a lot of queries before you get the result.

So in other words, just lay out the data which makes the most sense to
you, the pain of recreating SQL like logic is there no matter what
layout you choose.

Instead of creating a new table, just create an iterator that returns
successive entries in it. Just don't change the size of either table
during iteration.
 
T

Tim Wintle

So in other words, just lay out the data which makes the most sense
to
you, the pain of recreating SQL like logic is there no matter what
layout you choose.

I have to say that given the amount of pain most people seem to go
through when they first learn about relational databases, it seems
ironic how much people come to expect that that's the standard way to
query data!

At the end of the day it's all just one long sequential load of 0s and
1s on disk, and some code somewhere has to do all the logic.

(I'm a relational database user btw, but it seems to happen with object
database people too - we get so used to our own paradigm that we don't
think about other ways of doing the task)

Tim Wintle
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,236
Members
46,822
Latest member
israfaceZa

Latest Threads

Top