Taking Python enterprise level?...


simn_stv

On 26.02.10 05:01, D'Arcy J.M. Cain wrote:

So when you talk about ACKs, you don't mean the ones at the TCP level
(darn, whatever ISO layer that is...), but at some higher level?

Diez

I think it's the TCP level he's referring to, or is it?...
If it is, that means he's doing some 'mean' network-level scripting,
impressive... but I never thought Python could go that deep into network
programming!...
 

D'Arcy J.M. Cain

Sure it wouldn't have sped it up a bit, even a bit? Probably the
development and maintenance time would be a nightmare, but it should
speed the app up a bit...

What do you mean by "even a bit?" The bulk of the time is in sending
bits on the wire. Computer time was always negligible in this
situation. Yes, I can write all of my applications in assembler to get
a 0.00000000000001% increase in speed but who cares?

If you have decent algorithms in place, then 99% of the time I/O will be
your bottleneck, and if it isn't, then you have a compute-heavy problem
that assembler isn't going to fix.

And even if I get a 100% increase in speed, I still lose. Computer
time is cheaper than programmer time by so many orders of magnitude
that it isn't even worth factoring in the speedup.
 

D'Arcy J.M. Cain

I think it's the TCP level he's referring to, or is it?...

No, I mean in our own application layer.
If it is, that means he's doing some 'mean' network-level scripting,
impressive... but I never thought Python could go that deep into network
programming!...

What I meant was that we just keep sending packets, which TCP/IP keeps
in order for us by reassembling out-of-order and retransmitted
packets. Asynchronously, we send back to our own application an ACK
that our app-level packet was finally received. It's a sliding window
protocol. http://en.wikipedia.org/wiki/Sliding_Window_Protocol
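
A minimal sketch of that idea in Python, purely for illustration (the class and method names here are invented, not taken from the poster's application): the sender keeps a bounded set of unacknowledged application-level packets and slides the window as the peer's ACKs arrive, independently of TCP's own acknowledgements.

import collections

class SlidingWindowSender:
    # Toy application-level sliding window: keep at most `window` packets
    # in flight; an asynchronous ACK for sequence number n slides the window.

    def __init__(self, window=4):
        self.window = window
        self.next_seq = 0                           # next sequence number to assign
        self.in_flight = collections.OrderedDict()  # seq -> payload awaiting ACK

    def can_send(self):
        return len(self.in_flight) < self.window

    def send(self, payload, transmit):
        # Hand the payload to `transmit` (e.g. a socket write) and remember it
        # until the peer acknowledges it at the application level.
        if not self.can_send():
            raise RuntimeError("window full; wait for ACKs")
        seq = self.next_seq
        self.next_seq += 1
        self.in_flight[seq] = payload
        transmit(seq, payload)
        return seq

    def ack(self, seq):
        # Called when the peer's application-level ACK for `seq` finally
        # arrives, possibly much later than the TCP-level acknowledgement.
        self.in_flight.pop(seq, None)               # slide the window

sent = []
s = SlidingWindowSender(window=2)
s.send(b"chunk-0", lambda seq, p: sent.append((seq, p)))
s.send(b"chunk-1", lambda seq, p: sent.append((seq, p)))
print(s.can_send())   # False: the window of 2 is full
s.ack(0)              # peer confirms chunk 0 at the app level
print(s.can_send())   # True: the window slid, room for one more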
 

mdipierro

but this would probably be to the detriment of my database design,
which is a no-no as far as I'm concerned. The way the tables would be
structured requires 'joins' when querying the db; or could you
elaborate a little??

Joins are the bottleneck of most web apps that rely on relational
databases. That is why non-relational databases such as Google App
Engine, CouchDB, and MongoDB do not even support joins. You have to try to
minimize joins as much as possible by using tricks such as
denormalization and caching.
hmmm...!, still thinking about what you mean by this statement also.

I meant 512MB. The point is you need a lot of RAM because you want to
run multiple Python instances, cache in RAM as much as possible, and
also allow the database to buffer in RAM as much as possible. You will
see RAM usage tends to spike when you have lots of concurrent
requests.
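
As a toy illustration of the caching half of that advice (the function name, cache key, and 60-second lifetime are arbitrary assumptions, not anything from this thread): each Python instance can keep recent query results in RAM so that repeated requests skip the database entirely.

import time

_cache = {}          # query key -> (timestamp, rows)
CACHE_TTL = 60.0     # seconds; arbitrary figure for the sketch

def cached_query(key, run_query):
    # Return cached rows for `key` if still fresh; otherwise call
    # `run_query()` (the real database hit) and remember the result in RAM.
    now = time.time()
    hit = _cache.get(key)
    if hit is not None and now - hit[0] < CACHE_TTL:
        return hit[1]
    rows = run_query()
    _cache[key] = (now, rows)
    return rows

# Usage sketch: the lambda stands in for an expensive join.
rows = cached_query("hosts-with-reservations",
                    lambda: [("10.0.0.1", "2010-02-26")])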
 

D'Arcy J.M. Cain

Joins are the bottleneck of most web apps that rely on relational
databases. That is why non-relational databases such as Google App
Engine, CouchDB, and MongoDB do not even support joins. You have to try to
minimize joins as much as possible by using tricks such as
denormalization and caching.

I keep seeing this statement but nothing to back it up. I have created
many apps that run on Python with a PostgreSQL database with a fully
normalized schema and I can assure you that database joins were never
my problem unless I made a badly constructed query or left off a
critical index.
I meant 512MB. The point is you need a lot of RAM because you want to
run multiple Python instances, cache in RAM as much as possible, and
also allow the database to buffer in RAM as much as possible. You will
see RAM usage tends to spike when you have lots of concurrent
requests.

Put as much memory as you can afford/fit into your database server.
It's the cheapest performance boost you can get. If you have a serious
application put at least 4GB into your dedicated database server.
Swapping is your enemy.
 

Aahz

Put as much memory as you can afford/fit into your database server.
It's the cheapest performance boost you can get. If you have a serious
application put at least 4GB into your dedicated database server.
Swapping is your enemy.

Also, put your log/journal files on a different spindle from the database
files. That makes a *huge* difference.
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

"Many customs in this life persist because they ease friction and promote
productivity as a result of universal agreement, and whether they are
precisely the optimal choices is much less important." --Henry Spencer
 

mk

D'Arcy J.M. Cain said:
I keep seeing this statement but nothing to back it up. I have created
many apps that run on Python with a PostgreSQL database with a fully
normalized schema and I can assure you that database joins were never
my problem unless I made a badly constructed query or left off a
critical index.

I too have done that (Python/PGSQL), even adding a complicated layer of
SQLAlchemy on top of it, and have not had an issue with this: when I
profiled one of my apps, it turned out that it spent most of its
computation time... rendering HTML. Completely unexpected: I expected the DB
to be the bottleneck (although with huge datasets this might change).

Having said that, re evidence that joins are bad: from what I've *heard*
about Hibernate in Java from people who used it (I haven't used
Hibernate apart from "hello world"), in the case of complicated object
hierarchies it supposedly generates a lot of JOINs, and that supposedly
kills DB performance.

So there *may* be some evidence that joins are indeed bad in practice.
If someone has something specific/interesting on the subject, please post.

Regards,
mk
 

Tim Wintle

So there *may* be some evidence that joins are indeed bad in
practice.
If someone has something specific/interesting on the subject, please post.

I have found joins to cause problems in a few cases - I'm talking about
relatively large tables though - roughly order 10^8 rows.

I'm on MySQL normally, but that shouldn't make any difference - I've
seen almost the same situation on Oracle.

Consider this simple example:

/* Table A */
CREATE TABLE TableA (
project_id BIGINT NOT NULL,
cost INT,
date DATETIME,
PRIMARY KEY (project_id, date)
);

/* Table projects */
CREATE TABLE projects (
client_id BIGINT NOT NULL,
project_id BIGINT NOT NULL,
INDEX(client_id)
);


.... now the index on TableA has been optimised for queries against date
ranges on specific project ids which should more or less be sequential
(under a load of other assumptions) - but that reduces the efficiency of
the query under a join with the table "projects".

If you denormalise the table, and update the first index to be on
(client_id, project_id, date) it can end up running far more quickly -
assuming you can access the first mapping anyway - so you're still
storing the first table, with stored procedures to ensure you still have
correct data in all tables.
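
To make that bookkeeping concrete, here is a toy sketch of keeping such a denormalised copy in sync from application code; it uses Python's bundled sqlite3 purely for illustration, with an ordinary transaction in place of the stored procedures mentioned above, and table/column names loosely following the example (none of this is from Tim's actual system).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (
        project_id INTEGER NOT NULL,
        cost INTEGER,
        date TEXT,
        PRIMARY KEY (project_id, date)
    );
    CREATE TABLE projects_denorm (
        client_id INTEGER NOT NULL,
        project_id INTEGER NOT NULL,
        cost INTEGER,
        date TEXT
    );
    CREATE INDEX idx_denorm ON projects_denorm (client_id, project_id, date);
""")

def record_cost(client_id, project_id, cost, date):
    # Write the normalised row and the denormalised copy in one transaction,
    # so reads by (client_id, project_id, date) never see them out of step.
    with conn:  # commits both inserts together, or rolls both back
        conn.execute("INSERT INTO TableA VALUES (?, ?, ?)",
                     (project_id, cost, date))
        conn.execute("INSERT INTO projects_denorm VALUES (?, ?, ?, ?)",
                     (client_id, project_id, cost, date))

record_cost(client_id=1, project_id=42, cost=100, date="2010-02-26")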

I'm definitely glossing over the details - but I've definitely got
situations where I've had to choose denormalisation over purity of data.


Rolled-up data tables are another situation - when you know half your
queries are grouping by field "A", it's sometimes a requirement to store
that rolled-up result.


Tim
 

Steve Holden

mk said:
I too have done that (Python/PGSQL), even adding a complicated layer of
SQLAlchemy on top of it, and have not had an issue with this: when I
profiled one of my apps, it turned out that it spent most of its
computation time... rendering HTML. Completely unexpected: I expected the DB
to be the bottleneck (although with huge datasets this might change).

Having said that, re evidence that joins are bad: from what I've *heard*
about Hibernate in Java from people who used it (I haven't used
Hibernate apart from "hello world"), in the case of complicated object
hierarchies it supposedly generates a lot of JOINs, and that supposedly
kills DB performance.

So there *may* be some evidence that joins are indeed bad in practice.
If someone has something specific/interesting on the subject, please post.
I suspect that this myth is propagated from the distributed database
world: joining tables across two different servers can indeed be
problematic from a performance point of view.

However, the classic advice in database design is to start with a
normalized design and then vary it only if you need to for performance
reasons (which will also involve taking a hit on the coding side,
especially if updates are involved).

regards
Steve
 

Philip Semanchuk

I too have done that (Python/PGSQL), even adding a complicated layer
of SQLAlchemy on top of it, and have not had an issue with this: when I
profiled one of my apps, it turned out that it spent most of its
computation time... rendering HTML. Completely unexpected: I
expected the DB to be the bottleneck (although with huge
datasets this might change).

Having said that, re evidence that joins are bad: from what I've
*heard* about Hibernate in Java from people who used it (I haven't
used Hibernate apart from "hello world"), in the case of complicated
object hierarchies it supposedly generates a lot of JOINs, and that
supposedly kills DB performance.

So there *may* be some evidence that joins are indeed bad in
practice. If someone has something specific/interesting on the subject,
please post.

It's an unprovable assertion, or a meaningless one depending on how
one defines the terms. You could also say "there *may* be some
evidence that Python lists are bad in practice". Python lists and SQL
JOINs are like any part of a language or toolkit. They're good tools
for solving certain classes of problems. They can also be misapplied
to problems that they're not so good at. Sometimes they're a
performance bottleneck, even when solving the problems for which
they're best. Sometimes the best way to solve a performance bottleneck
is to redesign your app/system so you don't need to solve that kind of
problem anymore (hence the join-less databases). Other times, the cure
is worse than the disease and you're better off throwing hardware at
the problem.

My $.02
Philip
 

mk

Hello Tim,

Pardon the questions but I haven't had the need to use denormalization
yet, so:

Tim said:
/* Table A */
CREATE TABLE TableA (
project_id BIGINT NOT NULL,
cost INT,
date DATETIME,
PRIMARY KEY (project_id, date)
);

/* Table projects */
CREATE TABLE projects (
client_id BIGINT NOT NULL,
project_id BIGINT NOT NULL,
INDEX(client_id)
);


.... now the index on TableA has been optimised for queries against date
ranges on specific project ids which should more or less be sequential
(under a load of other assumptions) - but that reduces the efficiency of
the query under a join with the table "projects".

If you denormalise the table, and update the first index to be on
(client_id, project_id, date) it can end up running far more quickly -

IOW you basically merged the tables as follows?

CREATE TABLE projects (
client_id BIGINT NOT NULL,
project_id BIGINT NOT NULL,
cost INT,
date DATETIME,
INDEX(client_id, project_id, date)
);

From what you write further in the mail I conclude that you have not
eliminated the first table, just made table projects look like I wrote
above, right? (and used stored procedures to make sure that both tables
contain the relevant data for client_id and project_id columns in both
tables)

Have you had some other joins on denormalized keys? I.e., for example, how
does the join of a hypothetical TableB with projects on projects.client_id
behave with such big tables? (Because I assume that you obviously can't
denormalize absolutely everything, so this implies the need of doing
some joins on denormalized columns like client_id.)
assuming you can access the first mapping anyway -

? I'm not clear on what you mean here.
so you're still
storing the first table, with stored procedures to ensure you still have
correct data in all tables.

Regards,
mk
 

mk

Philip said:
It's an unprovable assertion, or a meaningless one depending on how one
defines the terms. You could also say "there *may* be some evidence that
Python lists are bad in practice". Python lists and SQL JOINs are like
any part of a language or toolkit. They're good tools for solving
certain classes of problems. They can also be misapplied to problems
that they're not so good at. Sometimes they're a performance bottleneck,
even when solving the problems for which they're best. Sometimes the
best way to solve a performance bottleneck is to redesign your
app/system so you don't need to solve that kind of problem anymore
(hence the join-less databases). Other times, the cure is worse than the
disease and you're better off throwing hardware at the problem.

Look, I completely agree with what you're saying, but: that doesn't
change the possibility that joins may be expensive in comparison to
other SQL operations. This is the phrase I should have used perhaps;
'expensive in comparison with other SQL operations' instead of 'bad'.

Example from my app, where I behaved "by the book" (I hope) and
normalized my data:

$ time echo "\c hrs;
SELECT hosts.ip, reservation.start_date, architecture.architecture,
os_kind.os_kind, os_rel.os_rel, os_version.os_version, project.project,
email.email FROM hosts
INNER JOIN project ON project.id = hosts.project_id
INNER JOIN architecture ON hosts.architecture_id = architecture.id
INNER JOIN os_kind ON os_kind.id = hosts.os_kind_id
INNER JOIN os_rel ON hosts.os_rel_id = os_rel.id
INNER JOIN os_version ON hosts.os_version_id = os_version.id
INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
INNER JOIN reservation on reservation.id =
reservation_hosts.reservation_id
INNER JOIN email ON reservation.email_id = email.id

;" | psql > /dev/null

real 0m0.099s
user 0m0.015s
sys 0m0.005s



$ time echo "\c hrs;
> SELECT hosts.ip FROM hosts;
> SELECT reservation.start_date FROM reservation;
> SELECT architecture.architecture FROM architecture;
> SELECT os_rel.os_rel FROM os_rel;
> SELECT os_version.os_version FROM os_version;
> SELECT project.project FROM project;
> SELECT email.email FROM email;
> " | psql > /dev/null

real 0m0.046s
user 0m0.008s
sys 0m0.004s

Note: I've created indexes on those tables, both on data columns like
hosts.ip and on .id columns.

So yes, joins apparently are at least twice as expensive as simple
selects without joins, on a small dataset. Not a drastic increase in
cost, but something definitely shows.

It would be interesting to see what happens when row numbers increase to
large numbers, but I have no such data.
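
For what it's worth, the same comparison can also be timed from inside Python, which keeps psql startup and connection setup out of each measurement. This is a sketch only: it assumes the same hrs database, the psycopg2 driver, and abbreviated versions of the queries above; adjust the names to the real schema.

import time
import psycopg2  # assumes the PostgreSQL driver is installed

conn = psycopg2.connect("dbname=hrs")  # same database as in the psql test
cur = conn.cursor()

JOINED = """
    SELECT hosts.ip, reservation.start_date, email.email
    FROM hosts
    INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
    INNER JOIN reservation ON reservation.id = reservation_hosts.reservation_id
    INNER JOIN email ON reservation.email_id = email.id
"""  # abbreviated version of the full join above

SEPARATE = [
    "SELECT hosts.ip FROM hosts",
    "SELECT reservation.start_date FROM reservation",
    "SELECT email.email FROM email",
]

def timed(queries, repeats=100):
    # Run the queries `repeats` times and return the average wall-clock time.
    start = time.time()
    for _ in range(repeats):
        for q in queries:
            cur.execute(q)
            cur.fetchall()
    return (time.time() - start) / repeats

print("join:     %.4fs" % timed([JOINED]))
print("separate: %.4fs" % timed(SEPARATE))

Averaging over many repetitions on an already-open connection gives a somewhat fairer picture than a single timed psql invocation, although the join will still usually cost more than the trivial selects.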

Regards,
mk
 

D'Arcy J.M. Cain

Maybe. Don't start with denormalization. Write it properly and only
consider changing if profiling suggests that that is your bottleneck.
With a decent database engine and proper design it will hardly ever be.
From what you write further in the mail I conclude that you have not
eliminated the first table, just made table projects look like I wrote
above, right? (and used stored procedures to make sure that both tables
contain the relevant data for client_id and project_id columns in both
tables)

Note that rather than speeding things up this could actually slow
things down depending on your usage. If you do lots of updates and you
have to write extra information every time then that's worse than a few
extra reads, especially since read data can be cached but written data
must be pushed to disk immediately in an ACID database.
 

Philip Semanchuk

Look, I completely agree with what you're saying, but: that doesn't
change the possibility that joins may be expensive in comparison to
other SQL operations. This is the phrase I should have used perhaps;
'expensive in comparison with other SQL operations' instead of 'bad'.

Well OK, but that's a very different argument. Yes, joins can be
expensive. They're often still the best option, though. The first step
people usually take to get away from joins is denormalization which
can improve SELECT performance at the expense of slowing down INSERTs,
UPDATEs, and DELETEs, not to mention complicating one's code and data
model. Is that a worthwhile trade? Depends on the application. As I
said, sometimes the cure is worse than the disease.

Don't worry about joins until you know they're a problem. As Knuth
said, premature optimization is the root of all evil.

Good luck
Philip

PS - Looks like you're using Postgres -- excellent choice. I miss
using it.
 

simn_stv

Till I think I absolutely need to trade off easier and less
complicated code, a better db structure (from a relational perspective),
and generally fewer "headaches" for speed, I think I'll stick with the
joins for now!... ;)

The thought of denormalization really doesn't appeal to me...
 

mk

Philip said:
Well OK, but that's a very different argument. Yes, joins can be
expensive. They're often still the best option, though. The first step
people usually take to get away from joins is denormalization which can
improve SELECT performance at the expense of slowing down INSERTs,
UPDATEs, and DELETEs, not to mention complicating one's code and data
model. Is that a worthwhile trade?

I'd say that in more than 99% of situations: NO.

More than that: if I hadn't normalized my data as it should have been
normalized, I wouldn't be able to do the complicated querying that I really,
really have to be able to do due to business logic. A few of my queries
have a few hundred lines each, with many sub-queries and multiple
many-to-many joins: I *dread the thought* of what would happen if I had to
reliably do that in a denormalized db and still ensure data integrity
across all the business logic contexts. And performance is still more
than good enough, so there's no point for me, in the contexts I
normally work in, to denormalize data at all.

It's just interesting for me to see what happens in that <1% of situations.
Depends on the application. As I
said, sometimes the cure is worse than the disease.

Don't worry about joins until you know they're a problem. As Knuth said,
premature optimization is the root of all evil.

Sure -- the cost of joins is just interesting to me as a 'corner case'.
I don't have datasets large enough for this to matter in the first place
(and I probably won't have them that huge).
PS - Looks like you're using Postgres -- excellent choice. I miss using it.

If you can, I'd recommend using an SQLAlchemy layer on top of
Oracle/MySQL/SQLite, if that's what you have to use: this *largely*
insulates you from the problems underneath, and it does the job of
translating into each peculiar dialect very well. For my purposes,
SQLAlchemy worked wonderfully: it's very flexible, it has a mid-level SQL
expression language if normal querying is not flexible enough (and normal
querying is VERY flexible), it has a ton of nifty features like autoloading,
it rarely fails because of some lower-level DB quirk, AND its high-level
object syntax is so similar to SQL that you quickly and intuitively grasp it.

(And if you have to, or prefer to, write some query in "low-level" SQL, as I
have done a few times, it's still easy to make SQLAlchemy slurp the
result into objects, provided you ensure that all of the necessary
columns are in the query result.)
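
A small sketch of the autoloading and hand-written-SQL features mentioned here, for illustration only: it assumes a reasonably recent SQLAlchemy (1.4 or later), an installed PostgreSQL driver, and the hosts table from the earlier example; swap in your own connection URL and column names.

from sqlalchemy import create_engine, MetaData, Table, select, text

engine = create_engine("postgresql:///hrs")   # same hrs database as above
metadata = MetaData()

# Autoloading: reflect the existing table definition from the database
hosts = Table("hosts", metadata, autoload_with=engine)

with engine.connect() as conn:
    # SQL expression language: the query is built from Python objects
    stmt = select(hosts.c.ip).where(hosts.c.ip.like("10.%"))
    for row in conn.execute(stmt):
        print(row.ip)

    # Dropping down to hand-written SQL when that is more convenient
    for row in conn.execute(text("SELECT ip FROM hosts")):
        print(row.ip)

Mapping such raw-SQL results onto ORM classes works along the same lines, as long as the statement returns every column the mapped class expects.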

Regards,
mk
 

Tim Wintle

Hello Tim,

Pardon the questions but I haven't had the need to use denormalization
yet, so:
IOW you basically merged the tables as follows?

CREATE TABLE projects (
client_id BIGINT NOT NULL,
project_id BIGINT NOT NULL,
cost INT,
date DATETIME,
INDEX(client_id, project_id, date)
);
Yup

From what you write further in the mail I conclude that you have not
eliminated the first table, just made table projects look like I wrote
above, right? (and used stored procedures to make sure that both tables
contain the relevant data for client_id and project_id columns in both
tables)
Yup

Have you had some other joins on denormalized keys? I.e., for example, how
does the join of a hypothetical TableB with projects on projects.client_id
behave with such big tables? (Because I assume that you obviously can't
denormalize absolutely everything, so this implies the need of doing
some joins on denormalized columns like client_id.)

For these joins (for SELECT statements) this _can_ end up running faster
- of course all of this depends on what kind of queries you normally end
up getting and the distribution of data in the indexes.

I've never written anything that started out with a schema like this,
but several have ended up getting denormalised as the projects have
matured and query behaviour has been tested.
? I'm not clear on what you mean here.

I'm referring to not eliminating the first table, as you concluded.
 

Tim Wintle

Maybe. Don't start with denormalization. Write it properly and only
consider changing if profiling suggests that that is your bottleneck.

Quite - and I'd add: cache reads as much in front-end machines as is
permissible in your use case before considering denormalisation.
With a decent database engine and proper design it will hardly ever be.

I completely agree - I'm simply responding to the request for an example
where denormalisation may be a good idea.

Tim
 
