python-noob - which container is appropriate for later exportinginto mySql + matplotlib ?

C

Chris Angelico

Roger... But google says sqlite is supposed to be ACID compliant (although
maybe not "fully" as you indicate, I'm not sure about this)...

What your Google hits are telling you is that sqlite can (if
configured correctly) pass level 2. But it doesn't guarantee anything
about the other levels, so it's easy to have an, uhh, ACID leak.
Ok, that doesn't sound to be so simple after all...

I gave a fairly wordy run-down of what I tested, but it's actually
fairly simple in concept: Do a huge bunch of transactions, and keep a
log of what's returned from the COMMIT query; then pull the power out.
Ok, it would be nice to hear/read the opinion from another in here who've
been working (a lot?) with sqlite...

Agreed. I'm sure someone will chime in.
I'm not so rich, so I prefer to go for a free database solution rather than
an expensive license... I've heard good things about oracle and that's also
what they used at my previous company, but it's not something I am willing
to pay for, from my private/own money for my sparetime-projects...

I concur with Walter's assessment: You want PostgreSQL. It's free/open
source software (highly permissive MIT-like license), massively
trusted, and scales up beautifully. (That last one may not be
significant to you, but it's still good to know your database can
handle hundreds or thousands of tps on basic hardware.)

ChrisA
 
R

rusi

What your Google hits are telling you is that sqlite can (if
configured correctly) pass level 2. But it doesn't guarantee anything
about the other levels, so it's easy to have an, uhh, ACID leak.



I gave a fairly wordy run-down of what I tested, but it's actually
fairly simple in concept: Do a huge bunch of transactions, and keep a
log of what's returned from the COMMIT query; then pull the power out.


Agreed. I'm sure someone will chime in.


I concur with Walter's assessment: You want PostgreSQL. It's free/open
source software (highly permissive MIT-like license), massively
trusted, and scales up beautifully. (That last one may not be
significant to you, but it's still good to know your database can
handle hundreds or thousands of tps on basic hardware.)

ChrisA

Dunno why you guys are ACIDing a hapless python+SQL noob.
As far as I can see he did not even know what ACID was... Just
happened to start with mysql (without evidently knowing the DBMS area)
and Cousin Stanley's recommendation to step a notch down from mysql to
sqlite seems to me to be spot-on for his requirement.

To the OP:
Steven is welcome to his views about use of databases. Good to
remember that everyone does not agree with him. This includes the
firefox devs as well as python devs.

In particular, sqlite in python is quite special. All the other
databases have bridge modules to talk from python to the database.
Which means that python runs and the database runs and the two talk
asynchronously across the bridge using what is called a 'client-server
model'. Now client-server is powerful and sophisticated and earlier it
was the only option. That is the noob database programmer had to
grapple with sql (the basic stuff) along with the transaction/ACID
advanced stuff.

Sqlite changed the rules of the game. Sqlite allows programmers to
play with sql without having to deal with client server headaches at
the same time.
Python amplified that change by bundling it with python.

In short Python+Sqlite is a boon for beginners to programming+DBMS
 
S

someone

Chris Angelico said:
2) The database engine must employ some form of write-ahead log.
[...]
one way or another, there must be a way to detect half-done
transactions.

3) The operating system and filesystem must support a forced file
synchronization (fsync/fdatasync), so the database engine can wait for
the data to be written to disk.

4) The underlying media (hard disk, SSD, USB stick, etc) must respond
to the fsync call by actually writing the content to persistent
storage before returning.

Some of the early Unix file systems were very fragile. One of the
(often under-appreciated) major advances in BSD (it was certainly in
4.2, not sure how much earlier) was a new filesystem which was much more
robust in the face of hardware failures and system crashes. Prior to

Are you talking about (journaling?) filesystems such as ext3, ext4, JFS,
ReiserFS and XFS ?

http://en.wikipedia.org/wiki/Journaling_file_system
BSD, the on-disk data could be left in an inconsistent state if the
system crashed at the wrong time. In BSD, data was written to disk in
such a way that every operation could either be backed out cleanly or
had enough information to complete the transaction.

Journaling filesystems? I myself use ext4... There's a comparison here:

http://en.wikipedia.org/wiki/Comparison_of_file_systems

?
 
S

someone

(<paraphrasing> but I do care about ACID compliance)

Sounds to me that PostgreSQL is your man, then.

Oh, ok. Thanks! BTW: I just read: "Yahoo runs a multi-petabyte modified
PostgreSQL database that processes billions of events per day" - that's
truely amazing, I think...

I think maybe I'll experiment a bit with both mySql (small/medium sized
databases) and for critical/important stuff I should go with
PostgreSQL... Glad to hear this... Then I know what to look at...
 
S

someone

What your Google hits are telling you is that sqlite can (if
configured correctly) pass level 2. But it doesn't guarantee anything
about the other levels, so it's easy to have an, uhh, ACID leak.

Ok, thank you very much, this is something I couldn't easily see in the
first place... I think what I should do now is to play a bit with sqlite
and then afterwards, when I'm happy I would begin to play with
postgresql and be very happy with it, knowing that I can (hopefully) use
that for all important projects in the rest of my life :)

I might also play a bit with mySql, because it's my impression that it
also have a big user-group. But I read that postgresql is MUCH more
"safe" to use (and a bit slower) than postgresql which on the other hand
is VERY safe, being fully ACID-compliant...
I gave a fairly wordy run-down of what I tested, but it's actually
fairly simple in concept: Do a huge bunch of transactions, and keep a
log of what's returned from the COMMIT query; then pull the power out.

I'll try it (or something similar) out one day in the future and see
what happens with the "corrupted" changes due to pulling out the network
cable while transmitting data...
Agreed. I'm sure someone will chime in.


I concur with Walter's assessment: You want PostgreSQL. It's free/open
source software (highly permissive MIT-like license), massively
trusted, and scales up beautifully. (That last one may not be
significant to you, but it's still good to know your database can
handle hundreds or thousands of tps on basic hardware.)

I understand that scaling is VERY important and if I could choose
between two "equally" opensource systems and one of them scales better
than the other, I would definately work with the one that scales the
most - that means that I don't have to learn how to use a whole new
system, if I already learnt the system that scales best...

And I just found on google that yahoo runs a HUGE PostgreSQL database...
Very interesting - I'll definately try to play around with postgreSQL at
some time in the future...
 
S

someone

......
......

Dunno why you guys are ACIDing a hapless python+SQL noob.

That's ok - I'm very interested in hearing/reading this, so don't worry :)
As far as I can see he did not even know what ACID was... Just

I think I know it know (maybe not all the details, but generally I know
that it should be ACID-compliant for critical data to avoid corruption
and bad data) :)
happened to start with mysql (without evidently knowing the DBMS area)
and Cousin Stanley's recommendation to step a notch down from mysql to
sqlite seems to me to be spot-on for his requirement.

Agree - but after that I would like to play with a client/server-system,
so that's also interesting to hear about...
To the OP:
Steven is welcome to his views about use of databases. Good to
remember that everyone does not agree with him. This includes the
firefox devs as well as python devs.

Yes, I think I understand this discussion. I'm sorry to hear that the
sqlite-database-files sometimes become corrupted. I haven't experienced
this problem myself (AFAIR), because ~90% of the time I'm on chromium.
In particular, sqlite in python is quite special. All the other
databases have bridge modules to talk from python to the database.
Which means that python runs and the database runs and the two talk
asynchronously across the bridge using what is called a 'client-server
model'. Now client-server is powerful and sophisticated and earlier it

Yes, got it :)
was the only option. That is the noob database programmer had to
grapple with sql (the basic stuff) along with the transaction/ACID
advanced stuff.

Yep, I understand your intentions...
Sqlite changed the rules of the game. Sqlite allows programmers to
play with sql without having to deal with client server headaches at
the same time.
Python amplified that change by bundling it with python.

In short Python+Sqlite is a boon for beginners to programming+DBMS

I completely agree with you that Python+Sqlite is really really great...
But soon I'll also move on to using a client/server model and therefore
I also appreciate the other comments/discussion related to e.g. failure
or non-"fully-ACID compliance" of sqlite, which maybe can explain this
firefox problem with corrupted database(s)...

I think I learned a lot from this thread and know what I should be
working on now...
 
D

Dennis Lee Bieber

True ACID compliance demands support at every level:

1) The application has to operate in logical units of work, which -
apart from with DB2 - requires an explicit "BEGIN" query, or
single-statement transactions.
While SQLite3 normally runs in an auto-commit mode, the Python
DB-API spec, in general, requires that auto-commit be turned off. "The
Definitive Guide to SQLite" states that the Python adapter scans
queries, and will start a transaction if the query is one that will
change data (insert/replace/update). Read-only queries stay auto-commit
until one of the data change queries is submitted and not committed.
3) The operating system and filesystem must support a forced file
synchronization (fsync/fdatasync), so the database engine can wait for
the data to be written to disk.

4) The underlying media (hard disk, SSD, USB stick, etc) must respond
to the fsync call by actually writing the content to persistent
storage before returning.

Failure at any level means the overall system is not ACID compliant.
PostgreSQL has a huge amount of code in it to try to deal with (or at
least recognize) a level-3 failure, but nothing in the database engine
can deal with level 1 or 4 issues.

You'd have to actually test it. The easiest way is to get two
computers, side by side, and run the database engine on one and a
monitor on the other. To test some SSDs at work, I knocked together a
little program that worked somewhat thus:

* Connect to the database over TCP/IP (easy, as we were doing this
with PostgreSQL)

You don't with SQLite -- or, properly, it is not to an SQLite
port... It would be something like an NFS mounted file share -- and we
all know how uncertain file locking is over NFS. said:
* Create a table with a number of rows with an ID and a counter,
initialized to 0
* Repeatedly, in parallel, perform a transaction:
- Increment the counter on one of the rows (at random)
- Increment a "possible" in-memory counter for that row
- Commit the database transaction
- Increment a "confirmed" in-memory counter for that row
* When an error of "database seems to be down" is detected, wait for
it to come up again, then query the table. The counters must all be at
least their corresponding "possible" value and at most the
"confirmed".
SQLite is a "file server" database (like M$ JET engine [aka:
"Access"]). It's locking system is multi-stage. It allows multiple
concurrent readers on a "shared" lock state. Only one connection can
perform write operations ("reserved" lock) alongside the readers. A
second connection attempting to perform a write will be rejected with a
database locked condition. Then it really gets nasty -- the writer
attempts to commit the update: The first step is to block other
connections from even entering the read state (the "pending" lock).
However, the writer itself is blocked until all remaining readers have
exited; only then does it have exclusive access to and SQLite makes
changes to the database file itself (prior to that, the writer
connection is changing page images in memory)

So in your example above, the first process to submit an update
command is going to lock all the others from submitting updates AND will
itself be held from committing the update until all the other processes
have closed (commit or rollback their "read sessions"). Since the Python
adapter basically does auto-commit for all until an update is attempted,
the first process to submit the update will get the reserved lock, and
the other reading sessions can pop in and out freely. On the attempt to
commit, the other sessions will be blocked from even entering a read
session, and there should be no other session trying to start a write
transaction, so the odds are that the commit goes through (there is a
very small window in the locking sequence in which two Python
connections that submit update queries might get into the "shared read"
state, and one then has to back out when the other gets the "reserved"
lock.

In the commit phase, SQLite first tries to ensure the rollback
journal is flushed to disk -- but that apparently is out of its control;
it can submit a sync command to the OS, but has to rely on what the OS
tells it about the state of the writes to disk (the book indicates that
some IDE drives would lie when queried about sync status, while still
having unwritten data in the on-board buffers). After the rollback
journal it submits the data to the database. I

Crash during journal write: restart finds no journal, that transaction
is lost but the database itself is clean

Crash after journal during database update, restart finds journal,
assumes database is suspect, and rolls back the pages, database is
restored to pre-transaction state

Crash after database sync during removal of journal, restart either
finds journal still there and rolls back the pages restoring to
pretransaction state, or the file was removed from the directory and
SQLite determines database file is good with the last transaction in
place.
 
S

someone

[ ....]
* Create a table with a number of rows with an ID and a counter,
initialized to 0
* Repeatedly, in parallel, perform a transaction:
- Increment the counter on one of the rows (at random)
- Increment a "possible" in-memory counter for that row
- Commit the database transaction
- Increment a "confirmed" in-memory counter for that row
* When an error of "database seems to be down" is detected, wait for
it to come up again, then query the table. The counters must all be at
least their corresponding "possible" value and at most the
"confirmed".
SQLite is a "file server" database (like M$ JET engine [aka:
"Access"]). It's locking system is multi-stage. It allows multiple
concurrent readers on a "shared" lock state. Only one connection can
perform write operations ("reserved" lock) alongside the readers. A
second connection attempting to perform a write will be rejected with a
database locked condition. Then it really gets nasty -- the writer
attempts to commit the update: The first step is to block other
connections from even entering the read state (the "pending" lock).
However, the writer itself is blocked until all remaining readers have
exited; only then does it have exclusive access to and SQLite makes
changes to the database file itself (prior to that, the writer
connection is changing page images in memory)

Ok, this makes sense... It's not something I'll bother about to begin
with, but maybe later (for critical apps) I can see that this is important.

[ ....]
In the commit phase, SQLite first tries to ensure the rollback
journal is flushed to disk -- but that apparently is out of its control;
it can submit a sync command to the OS, but has to rely on what the OS
tells it about the state of the writes to disk (the book indicates that
some IDE drives would lie when queried about sync status, while still
having unwritten data in the on-board buffers). After the rollback
journal it submits the data to the database. I

I agree, this must be a problem, when the OS is lying...
Crash during journal write: restart finds no journal, that transaction
is lost but the database itself is clean

Crash after journal during database update, restart finds journal,
assumes database is suspect, and rolls back the pages, database is
restored to pre-transaction state

Crash after database sync during removal of journal, restart either
finds journal still there and rolls back the pages restoring to
pretransaction state, or the file was removed from the directory and
SQLite determines database file is good with the last transaction in
place.

Ok, this is a bit more advanced - I'll try to make my own experiments
now and then after some time I guess I can dig more into these details,
thanks.
 
R

Roy Smith

[QUOTE="someone said:
Some of the early Unix file systems were very fragile. One of the
(often under-appreciated) major advances in BSD (it was certainly in
4.2, not sure how much earlier) was a new filesystem which was much more
robust in the face of hardware failures and system crashes. Prior to

Are you talking about (journaling?) filesystems such as ext3, ext4, JFS,
ReiserFS and XFS ?

http://en.wikipedia.org/wiki/Journaling_file_system[/QUOTE]

No, I'm talking about

http://en.wikipedia.org/wiki/Berkeley_Fast_File_System

Journaling came along later.
 
S

someone

If it were me I wouldn't use MySQL for anything at all. I'd use sqlite
for little non-critical local applications, and Postgres for the rest.

Ok, thank you. I just came across a blog that said pytables is also a
very good option?

http://www.pytables.org/moin/PyTables?action=AttachFile&do=view&target=non-indexed.png
Postgres is not difficult at all, provided you RTFM and follow the
instructions (the documentation is superb). And whichever you use, you
need to learn SQL anyway.

Good to hear... I'll dig more into it, thank you...
 
C

Chris Angelico

I think maybe I'll experiment a bit with both mySql (small/medium sized
databases) and for critical/important stuff I should go with PostgreSQL

PostgreSQL isn't majorly slower than MySQL, and it's a lot more
trustworthy in terms of database constraints and so on. MySQL is
designed as a place for a single application to store its data, and it
assumes that the application is king; PostgreSQL is designed as a
database against which application(s) may execute queries, therefore
it assumes that the database administrator is king.

With heavy read/write workloads, I'd put my money on PostgreSQL every
time; MySQL has a much greater problem with wide locks (eg
table-level) and consequent loss of concurrency.

ChrisA
 
C

Chris Angelico

While SQLite3 normally runs in an auto-commit mode, the Python
DB-API spec, in general, requires that auto-commit be turned off. "The
Definitive Guide to SQLite" states that the Python adapter scans
queries, and will start a transaction if the query is one that will
change data (insert/replace/update). Read-only queries stay auto-commit
until one of the data change queries is submitted and not committed.

Okay, that's good. Point still stands, though, that the application
has to use BEGIN/COMMIT correctly; the size of the logical unit of
work should be defined by what's one logical action, not by what gives
the best performance.
You don't with SQLite -- or, properly, it is not to an SQLite
port... It would be something like an NFS mounted file share -- and we
all know how uncertain file locking is over NFS. <G>

Sure, but you could easily make a tiny "SQLite server" that accepts
socket connections, reads integers, and writes back "OK" when the
transaction's committed. The only difference is that you have to write
two halves instead of letting the DB itself be the other half.
So in your example above, the first process to submit an update
command is going to lock all the others from submitting updates AND will
itself be held from committing the update until all the other processes
have closed (commit or rollback their "read sessions").

Ah, that'd be a problem. What if each row is in its own file, though?
Would that work? That is, instead of:

UPDATE durability_test_table SET counter=counter+1 WHERE id=:random_value

you use:

UPDATE durability_test_:random_value SET counter=counter+1

(except, of course, that SQL parameterization wouldn't work there, so
it'd be Python string manipulation) - this way, transactions will lock
only against other transactions manipulating the same entry, which is
effectively the same as row-level locking. With 2-3 times as many
"rows" as threads, there should be very little lock contention.

ChrisA
 
C

Chris Angelico

Ok, thank you. I just came across a blog that said pytables is also a very
good option?

http://www.pytables.org/moin/PyTables?action=AttachFile&do=view&target=non-indexed.png
From what I gather, that's looking at performance of a non-indexable
query on a 10,000,000-row table. That's going to suck whatever you do,
and the exact level of suckitude doesn't really prove much. (Note that
even the best options are taking half a second for this single query.)

A better test of a database is transactions per second of something
that approximates to your real workload. For instance, English
Wikipedia has roughly a hundred edits per minute (assessed by me just
now by looking at the Recent Changes), and some ridiculous number of
page reads per minute (not assessed, but believed to be somewhere
between 11 and Graham's number); so a test of a proposed new database
would have to mimic this ratio. Most of the queries involved should be
able to be answered using indexes; in some cases, ONLY using the index
(eg if you just want to know whether or not a row exists).

PyTables may well outperform PostgreSQL in real usage, but that one
graph doesn't tell me that. (Not to mention that it's measuring a
somewhat old PG.)

ChrisA
 
D

Dennis Lee Bieber

Ah, that'd be a problem. What if each row is in its own file, though?
Would that work? That is, instead of:

UPDATE durability_test_table SET counter=counter+1 WHERE id=:random_value

you use:

UPDATE durability_test_:random_value SET counter=counter+1
SQLite3 requires one to "attach" other files.

ATTACH "file.spec" AS internalname

If tables in the files share names, you have to use a qualified name
to access them:

select * from internalname.table

{the original connected database file can be qualified as "main"}

I'll admit I don't know enough about SQLite3 to know if the locks
are global, or per attached file. The information may be in the book
(the first edition is actually larger than the second edition! and may
have more of the gritty internals described) -- but you'll forgive me if
I don't spend the evening reading to find out. If per file, you could
have an update on each file with no locking conflict between them.
 
S

someone

query on a 10,000,000-row table. That's going to suck whatever you do,
and the exact level of suckitude doesn't really prove much. (Note that
even the best options are taking half a second for this single query.)

Interesting... Thank you very much for that information...
A better test of a database is transactions per second of something
that approximates to your real workload. For instance, English
Wikipedia has roughly a hundred edits per minute (assessed by me just
now by looking at the Recent Changes), and some ridiculous number of
page reads per minute (not assessed, but believed to be somewhere
between 11 and Graham's number); so a test of a proposed new database
would have to mimic this ratio. Most of the queries involved should be
able to be answered using indexes; in some cases, ONLY using the index
(eg if you just want to know whether or not a row exists).

PyTables may well outperform PostgreSQL in real usage, but that one
graph doesn't tell me that. (Not to mention that it's measuring a
somewhat old PG.)

Ok, thank you very much... Sounds to me like PostgreSQL it is, then :)
 
S

someone

PostgreSQL isn't majorly slower than MySQL, and it's a lot more
trustworthy in terms of database constraints and so on. MySQL is
designed as a place for a single application to store its data, and it
assumes that the application is king; PostgreSQL is designed as a
database against which application(s) may execute queries, therefore
it assumes that the database administrator is king.

With heavy read/write workloads, I'd put my money on PostgreSQL every
time; MySQL has a much greater problem with wide locks (eg
table-level) and consequent loss of concurrency.

Ok, thank you very much... Sounds like PostgreSQL is the best option for
me to go on to from here, after I've played a bit my sqlite...
 
S

Steven D'Aprano

To the OP:
Steven is welcome to his views about use of databases.

I haven't given any views about databases. I've given my view on
application developers -- specifically, Firefox -- using a not-quite ACID
database in a way that is fragile, can cause data loss, and adds lots
more complexity to the application AND the end-user experience. And for
what? Simple data that would be much better in a simpler format, such as
bookmarks.

Good to remember
that everyone does not agree with him. This includes the firefox devs as
well as python devs.

I don't see what the Python devs have to do with it. They don't use
Sqlite for Python's internals, and the fact that there is a module for
sqlite doesn't mean squat. There's a module for parsing Sun AU audio
files, that doesn't mean the Python devs recommend that they are the best
solution to your audio processing and multimedia needs.

I'm not saying that Sqlite doesn't have it's uses, although I personally
haven't found them yet. And as for the Firefox devs, well, I'll just let
Jamie Zawinski show their l33t des1gn ski11z in context:

http://www.jwz.org/blog/2003/01/more-khtml/

Okay, that's ten years old. What do you think the odds are that Firefox
has a nice, clean design by now? Well, I suppose it's possible, but when
it takes a minimum of NINE files to do the equivalent of "Hello World" in
Firefox, I wouldn't put money on it:

http://kb.mozillazine.org/Getting_started_with_extension_development

I mean, really -- bookmarks, in a single-user application, and they store
it in a database. You can't even have two instances of Firefox running at
the same time.

The consequences of this over-engineered solution is that Firefox is more
complex and fragile than it needs be, and less reliable than it could be.
When your bookmarks database gets corrupt, which is easy, the browser
History and Back button stop working, which then pushes responsibility
for fixing the database corruption back on the user. So the Firefox
developers actually end up paying the costs of a non-lightweight
implementation, but without the benefits. They don't even get to remove
the old bookmarks to HTML code, since they still need it for manual
exports and backups.

Considering the rest of the Firefox architecture (XUL, XUL everywhere!),
using sqlite probably feels like a lightweight solution to the devs.

"The Mork database structure used by Mozilla Firefox v1-2 is unusual to
say the least. It was originally developed by Netscape for their browser
(Netscape v6) and the format was later adopted by Mozilla to be used in
Firefox. It is a plain text format which is not easily human readable
and is not efficient in its storage structures. For example, a single
Unicode character can take many bytes to store. The developers
themselves complained it was extremely difficult to parse correctly and
from Firefox v3, it was replaced by MozStorage which is based on an
SQLite database."


http://wordpress.bladeforensics.com/?p=357

http://en.wikipedia.org/wiki/Mork_(file_format)
 
R

rusi

I haven't given any views about databases.

You are twisting "use of databases" to just "about databases"

And heres what you said:
Using a database for such lightweight data as bookmarks is, in my
opinion, gross overkill and adds to the complexity of Firefox. More
complexity leads to more bugs…

Not that I would disagree with that for general databases, just for
something as atypical as sqlite.
In short, you are being hypnotized by the word 'database' and not
seeing that sqlite is a very strange instance of that species.
http://en.wikipedia.org/wiki/Etymological_fallacy
+
http://en.wikipedia.org/wiki/Accident_(fallacy)
I've given my view on
application developers -- specifically, Firefox -- using a not-quite ACID
database in a way that is fragile, can cause data loss,

FUD
Are you saying that flat-files dont lose data?
and adds lots
more complexity to the application AND the end-user experience. And for
what?

Strange argument: If I call a one line re.match(..) that hooks into
5000 arcane lines of the re module, on whose account is the complexity
-- mine or python's?

From a programmer's POV if 10 lines of flat-file munging are reduced
to two lines of SQL its a reduction of 10 to 2.
Simple data that would be much better in a simpler format, such as
bookmarks.


I don't see what the Python devs have to do with it. They don't use
Sqlite for Python's internals, and the fact that there is a module for
sqlite doesn't mean squat. There's a module for parsing Sun AU audio
files, that doesn't mean the Python devs recommend that they are the best
solution to your audio processing and multimedia needs.

Python made a choice to include AU file support when Sun existed and
looked more respectable than MS. Today the support continues to exist
probably for backward compatibility reasons. "The code's already
written. Why remove it?"
Sure but it has its costs -- memory footprint, sources-size etc --
which are deemed negligible enough to not bother.

Likewise python 2.5 made a choice to include sqlite. Following RoR's D
Hansson we may call it an 'opinionated choice.' That choice implies
that the devs decided that a fixed-cost of bundling sqlite with python
is deemed better than each programmer installing/rolling-his-own etc

I'm not saying that Sqlite doesn't have it's uses, although I personally
haven't found them yet. And as for the Firefox devs, well, I'll just let
Jamie Zawinski show their l33t des1gn ski11z in context:

http://www.jwz.org/blog/2003/01/more-khtml/

Faulty generalization fallacy:
http://en.wikipedia.org/wiki/Faulty_generalization
Because some code in firefox is bad, every choice of firefox is bad?
[Actually I am surprised that you agree with *that example*: Would you
claim that a void returning, no-argument function is better than one
with arguments and return values? Anyways thats really far away from
this discussion…]

To the OP:
Lets deconstruct ACID.

Consistency+Atomicity:
Lets say you write some stack code like this
stack[top] = newvalue
top += 1

And if you catch the machine state between the two assignments, you
will find an *inconsistent* stack because that code is *non-atomic*
Should you bother? Yes if you have concurrency, no if not.

Likewise Isolation is vacuously guaranteed if you are the sole guy
running your code.

As for Durability, if you randomly turn off your machine when your
program is running, yes you may lose the results of your program. You
may lose much else!

IOW if you are alone on your machine, all discussion of ACID is moot
 
C

Chris Angelico

FUD
Are you saying that flat-files dont lose data?

If they do, a human being can easily open them up and see what's
inside. Suppose bookmarks are stored like this:

r"""Some-Browser-Name web bookmarks file - edit with care
url: http://www.google.com/
title: Search engine
icon: whatever-format-you-want-to-use

url: http://www.duckduckgo.com/
title: Another search engine

url: http://www.python.org/

url: ftp://192.168.0.12/
title: My FTP Server
desc: Photos are in photos/, videos are in videos/
Everything else is in other/
user: root
pass: secret
"""

The parsing of this file is pretty simple. Blank line marks end of
entry; indented line continues the previous attribute (like RFC822),
everything else is "attribute: value". (You might even be able to
abuse an RFC822 parser/compositor for the job.) The whole file has to
be read and rewritten for any edits, so it's unsuited to gigabytes of
content; but we're talking about *web browser bookmarks* here. I know
some people have a lot of them, but hardly gigs and gigs. And if you
think they will, then all you need to do is have multiple files, eg
one for each folder in the bookmark tree.

Now suppose it gets damaged somehow. Firstly, that's a lot less likely
with a simple file format and a "write to temp file, then move temp
file over main file" setup; but mainly, it's very easy to
resynchronize - maybe there'll be one bookmark (or a group of
bookmarks) that get flagged as corrupted, but everything after that
can be parsed just fine - as soon as you get to a blank line, you
start parsing again. Very simple. Well suited to a simple task. (Note,
however, that the uber-simple concept I've posited here would have the
same concurrency problems that Firefox has. At very least, it'd rely
on some sort of filesystem-level lock when it starts rewriting the
file. But this is approximately similar to running two instances of a
text editor and trying to work with the same file.)
From a programmer's POV if 10 lines of flat-file munging are reduced
to two lines of SQL its a reduction of 10 to 2.

The complexity exists in a variety of places. The two lines of SQL
hide a morass of potential complexity; so would a massive regex. The
file itself is way harder for external tools to manage. And all of it
can be buggy. With a simple flat-file system, chances are you can turn
it into a nested list structure and a dict for indexing (or possibly a
collections.OrderedDict), and then you have the same reduction - it's
just simple in-memory operations, possibly followed by a save() call.
All the options available will do that, whether flat-file or database.
Python made a choice to include AU file support when Sun existed and
looked more respectable than MS. Today the support continues to exist
probably for backward compatibility reasons. "The code's already
written. Why remove it?"
Sure but it has its costs -- memory footprint, sources-size etc --
which are deemed negligible enough to not bother.

Actually, this is one place where I disagree with the current decision
of the Python core devs: I think bindings for other popular databases
(most notably PostgreSQL, and probably MySQL since it's so widely
used) ought to be included in core, rather than being shoved off to
PyPI. Databasing is so important to today's world that it would really
help if people had all the options right there in core, if only so
they're more findable (if you're browsing docs.python.org, you won't
know that psycopg is available). Currently the policy seems to be "we
don't include the server so why should we include the client"; I
disagree, I think the client would stand nicely on its own. (Does
Python have a DNS server module? DNS client? I haven't dug deep, but
I'm pretty sure I can do name lookups in Python, yet running a DNS
server is sufficiently arcane that it can, quite rightly, be pushed
off to PyPI.) But this is minor, and tangential to this discussion.
Faulty generalization fallacy:
http://en.wikipedia.org/wiki/Faulty_generalization
Because some code in firefox is bad, every choice of firefox is bad?

It's a matter of windows into the philosophy, rather than specific
examples. Requiring nine files to do a "Hello World" extension
suggests a large corpus of mandatory boilerplate; imagine, for
instance, that my example bookmarks file structure had demanded
_every_ attribute be provided for _every_ bookmark, instead of
permitting the defaults. That would demonstrate overkill in design,
and the sort of person who would produce that is probably unable to
simplify code for the same reasons.
As for Durability, if you randomly turn off your machine when your
program is running, yes you may lose the results of your program. You
may lose much else!

IOW if you are alone on your machine, all discussion of ACID is moot

No, no, a thousand times no! If I am doing financial transactions,
even if I'm alone on my machine, I will demand full ACID compliance.
Randomly turning off the machine is a simulation of the myriad
possible failures - incoming power failure (or UPS failure, if you
have one), power supply goes boom, motherboard gets fried, operating
system encounters a hard failure condition, cleaning lady unplugs the
server to put her vacuum cleaner onto the UPS... anything. The point
of ACID compliance is that you might lose the results of *this run* of
the program, but nothing more; and if any other program has been told
"That's committed", then it really has been. Without some such
guarantee, you might lose *all the data you have stored*, because
something got corrupted. Partial guarantees of acidity are
insufficient; imagine if power failure during ALTER TABLE can result
in your whole database being unreadable.

With the setup I described above, everything works beautifully if the
OS guarantees an atomic mv() operation. Even if it doesn't, you can
probably figure out what's going on by inspecting the file state; for
instance, you can assume that a non-empty main file should be kept
(discarding the temporary), but if the main file is empty or absent
AND the temporary is readable and parseable, use the temporary. (This
assumes that a fresh install creates a non-empty file, otherwise
there's ambiguity at initial file creation which would need to be
resolved. But you get the idea.)

Of course, that uber-simple option does require a full file rewrite
for every edit. But like I said, it's designed for simplicity, not
concurrent writing.

ChrisA
 
N

Ned Deily

Actually, this is one place where I disagree with the current decision
of the Python core devs: I think bindings for other popular databases
(most notably PostgreSQL, and probably MySQL since it's so widely
used) ought to be included in core, rather than being shoved off to
PyPI. Databasing is so important to today's world that it would really
help if people had all the options right there in core, if only so
they're more findable (if you're browsing docs.python.org, you won't
know that psycopg is available). Currently the policy seems to be "we
don't include the server so why should we include the client"; I
disagree, I think the client would stand nicely on its own. (Does
Python have a DNS server module? DNS client? I haven't dug deep, but
I'm pretty sure I can do name lookups in Python, yet running a DNS
server is sufficiently arcane that it can, quite rightly, be pushed
off to PyPI.) But this is minor, and tangential to this discussion.

For the bindings to be useful, Python batteries-included distributions
(like python.org installers) would either need to also ship the various
DB client libraries for all supported platforms (including Windows),
which adds complexity and potentially intractable license issues, or
there would need to be reverse-engineered implementations of the client
libs or wire protocols, either option adding fragility and complex
testing issues. DNS client lookups use published, well-understood
Internet-standard protocols, not at all like talking to a third-party
database, be it open-source or not. Sqlite3 is certainly an anomaly in
that it is not-only open source but designed to be a lightweight,
compatible library that runs on just about everything, and with a
fanatical devotion to compatibility and documentation. These days just
about every major product or operating system platform ships with or
uses a copy of sqllite3 for something.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,139
Messages
2,570,805
Members
47,356
Latest member
Tommyhotly

Latest Threads

Top