That's a straw man argument. Copying an already built database to
another copy of the database won't take significantly longer than copying
an already built file. In fact, it's the same operation.
I don't understand what you're trying to get at.
Each bit of data follows a particular path through the system. Each
bit of data has its own requirements for availability and consistency.
No, relational DBs don't have the same performance characteristics as
other data systems because they do different things.
If you have data that fits a particular style well, then I suggest
using that style to manage that data.
Let's say I have data that needs to hang around for a little while and
then disappear into the archives. Let's say you hardly ever do random
access on this data because you always work with it serially or in
large batches. This is exactly like the recipient data for the email campaign.
Another straw man. I'm sure you can come up with many contrived
examples to show one particular operation being faster than another.
Benchmark writers (bad ones) do it all the time. I'm saying that in
normal, real-world situations, where you are collecting billions of data
points and need to actually use the data, a properly designed
database running on a good database engine will generally be better than
using flat files.
You're thinking in the general. Yes, an RDBMS does wonderful things in the
general case. However, in very specific circumstances, an RDBMS does a
whole lot worse.
Think of the work involved in sharding an RDBMS instance. You need to
properly implement two-phase commit above and beyond the normal work
involved. I haven't run into a multi-master replication system that is
trivial. When you find one, let me know, because I'm sure there are
caveats and corner cases that make things really hard to get right.
Compare this to simply distributing flat files to one of many
machines. It's a whole lot easier to manage and easier to understand,
explain, and implement.
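To make that concrete, here's a rough sketch of the flat-file side in plain
shell. The host names, target directory, and hash-the-filename scheme are all
made up for illustration, not a recipe:

    #!/bin/sh
    # Hypothetical pool of storage machines; any stable list will do.
    HOSTS="store01 store02 store03 store04"
    N=4
    for f in batch-*.dat; do
        # Hash the file name so the same file always lands on the same host.
        sum=$(printf '%s' "$f" | cksum | cut -d' ' -f1)
        idx=$(( sum % N + 1 ))
        host=$(echo $HOSTS | cut -d' ' -f"$idx")
        # Plain copy: no two-phase commit, no replication protocol.
        scp "$f" "$host:/data/incoming/"
    done

Losing a machine or rebalancing still takes work, but it's ordinary file
management, not a distributed commit protocol.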
You should use the right tool for the job. Sometimes the data doesn't
fit the profile of an RDBMS, or the RDBMS overhead makes managing the
data more difficult than it needs to be. In those cases, it makes a
whole lot of sense to try something else out.
Not sure what a "shadow page operation" is, but index operations are
only needed if you have to have fast access for reading the data back. If
it doesn't matter how long it takes to read the data back, then don't
index it. I have a hard time believing that anyone would want to save
billions of data points and not care how fast they can read selected
parts back or organize the data, though.
I don't care how the recipients for the email campaign were indexed. I
don't need an index because I don't do random accesses. I simply need
the list of people I am going to send the email campaign to, properly
filtered and de-duped, of course. This doesn't have to happen within
the database. There are wonderful tools like "sort" and "uniq" to do
this work for me, far faster than an RDBMS can do it. In fact, I don't
think you can come up with a faster solution than "sort" and "uniq".
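Concretely, the whole job is a short pipeline. The file names here are made
up for the example:

    # recipients.txt:   one address per line, unsorted, with duplicates
    # unsubscribed.txt: addresses that must be filtered out
    LC_ALL=C sort recipients.txt | uniq > recipients.sorted
    LC_ALL=C sort unsubscribed.txt > unsubscribed.sorted
    # comm -23 keeps lines that appear only in the first file,
    # i.e. recipients who are not on the unsubscribe list
    comm -23 recipients.sorted unsubscribed.sorted > send_list.txt

LC_ALL=C keeps both sorts in the same byte order, which comm requires.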
Not with the database engines that I use. Sure, speed and load are
connected whether you use databases or flat files, but a proper database
will scale up quite well.
I know for a fact that "sort" and "uniq" are far faster than any
RDBMS. The reason why is obvious: they stream through the data
sequentially, with no transaction log, locking, or index maintenance to
pay for.