Why Is Escaping Data Considered So Magical?

Nobody · Jun 27, 2010

Except nobody has yet shown an alternative which is easier to get right.

For SQL, use stored procedures or prepared statements. For HTML/XML, use a
DOM (or similar) interface.

Jorgen Grahn · Jun 27, 2010

Jorgen Grahn said:
Jorgen Grahn said:

I thought it was well-known that the solution is *not* to try to
sanitize the input -- it's to switch to an interface which doesn't
involve generating an intermediate executable. In the Python example,
that would be something like os.popen2(['zcat', '-f', '--', untrusted]).

Click to expand...

That???s what I mean. Why do people consider input sanitization so hard?

I'm not sure you understood me correctly, because I advocate
*not* doing input sanitization. Hard or not -- I don't want to know,
because I don't want to do it.

/Jorgen

Jorgen Grahn · Jun 27, 2010

Yes. And then some.

Among web developers, the median level of programming knowledge amounts to
the first 3 chapters of "Learn PHP in 7 Days".

It doesn't help the the guy who wrote PHP itself wasn't much better.

Yep. Search the BugTraq archives for "SQL injection". And most of those
are for widely-deployed middleware; the zillions of bespoke site-specific
scripts are likely to be worse.

Also: http://xkcd.com/327/

Priceless!

As is often the case with xkcd, I learned something, too: there's a
widely used web application/portal/database thingy which silently
strips some characters from my input. I thought it had to do with
HTML, but it's in fact exactly the sequences "'", ')', ';' and '--'
from the comic, and a few more like '>' and undoubtedly some I haven't
noticed yet.

That is surely "input sanitization" gone horribly wrong: I enter "6--8
slices of bread", but the system stores "68 slices of bread".

Well known by anyone with a reasonable understanding of the principles of
programming, but somewhat less well known by the other 98% of web
developers.

There's a world of difference between a skilled chef and the people
flipping burgers for a minimum wage. And between a chartered civil
engineer and the people laying the asphalt. And between what you
probably consider a programmer and the people doing most web development.

I don't know them, so I wouldn't know ... What I would *expect* is
that safe tools are provided for them, not just workarounds so they
can keep using the unsafe tools. That's what Python did, with its
multitude of alternatives to os.system and os.popen.

Anyway, thanks. It's always nice to be able to map foreign terminology
like "SQL injection" to something you already know.

/Jorgen

Jorgen Grahn · Jun 27, 2010

A long while ago I came up with this macro:

#define Descr(v) &v, sizeof v

making the correct version of the above become

snprintf(Descr(buf), foo);

This is off-topic, but I believe snprintf() in C can *never* safely be
the only thing you do to the buffer: you also have to NUL-terminate it
manually in some corner cases. See the documentation.

/Jorgen

Carl Banks · Jun 27, 2010

I get worried when people talk about how easy it is to do something
safely. Let me suggest a couple of things you might not have considered:

1) Somebody is running your application (or the database server) with
the locale set to something unexpected. This might change how numbers,
dates, currency, etc, get formatted, which could change the meaning of
your constructed SQL statement.

2) Somebody runs your application with a different PYTHONPATH, which
causes a different (i.e. malicious) urllib module to get loaded, which
makes urllib.quote() do something you didn't expect.

Seriously, almost every other kind of library uses a binary API. What
makes databases so special that they need a string-command based API?
How about this instead (where this a direct binary interface to the
library):

results = rdb_query(table = model,
columns = [model.name, model.number])

results = rdb_inner_join(tables = [records,tags],
joins = [(records.id,tags.record_id)]),
columns = [record.name, tag.name])

Well, we know the real reason is that C, Java, and friends lack
expressiveness and so constructing a binary query is an ASCII
nightmare. Still, it hasn't stopped binary APIs in other kinds of
libraries.

Carl Banks

Roy Smith · Jun 27, 2010

Carl Banks said:
Seriously, almost every other kind of library uses a binary API. What
makes databases so special that they need a string-command based API?
How about this instead (where this a direct binary interface to the
library):

results = rdb_query(table = model,
columns = [model.name, model.number])

results = rdb_inner_join(tables = [records,tags],
joins = [(records.id,tags.record_id)]),
columns = [record.name, tag.name])

Well, we know the real reason is that C, Java, and friends lack
expressiveness and so constructing a binary query is an ASCII
nightmare. Still, it hasn't stopped binary APIs in other kinds of
libraries.

Well, the answer to that one is simple. SQL, in the hands of somebody
like me, can be used to express a few pathetic joins and what I do with
it could probably be handled with the kind of API you're describing.
But, the language has far more expressivity than that, and a
domain-specific language is really a good fit for what it can do.

The problem is not so much that SQL queries are described as text
strings, but that the distinction between program and data gets lost if
you build the query as one big string. What you need (and which the
Python API supplies) is the ability to clearly distinguish between "this
text is my program" and "this text is a value which my program uses".

Python has the same problem. If I had a text string, s, which I read
from some external source, and wanted to interpret that string as an
integer, I could do (at least) two different things.

# Thing 1
myInteger = int(s)

# Thing 2
myInteger = eval(s)

for properly formed input, either one works, but thing 2 loses the
distinction between program and data and is thus dangerous. Exactly
like building a SQL query by smashing a bunch of strings together.

Owen Jacobson · Jun 28, 2010

Thatâ€™s the same fallacious argument I pointed out earlier.

In the sense that "using correct manual escaping leads to SQL injection
vulnerabilities", yes, that's a fallacious argument on its own.
However, as sites like BUGTRAQ amply demonstrate, generating SQL
through string manipulation is a risky development practice[0]. You can
continue to justify your choice to do so however you want, and you may
even be the One True Developer capable of getting it absolutely right
under all circumstances, but I'd still reject patches that introduced a
SQLString-like function and ask that you resubmit them using the
database API's parameterization tools instead.

Assuming for the sake of discussion that your SQLString function
perfectly captures the transformation required to turn an arbitrary str
into a MySQL string literal. How do you address the following issues?

1. Other (possibly inexperienced) developers reading your source who
may not have the skills to correctly implement the same transform
correctly learn from your programs that writing your own query munger
is okay.
1a. Other (possibly inexperienced) developers decide to copy and paste
your function without fully understanding how it works, in tandem with
any of the other issues below. (If you think this is rare, I invite you
to visit stackoverflow or roseindia some time.)

2. MySQL changes the quoting and escaping rules to address a
bug/feature request/developer whim, introducing a new set of corner
cases into your function and forcing you to re-learn the escaping and
quoting rules. (For people using DB API parameters, this is a matter of
upgrading the DB adapter module to a version that supports the modified
rules.)

3. You decide to switch from MySQL to a more fully-featured RDBMS,
which may have different quoting and escaping rules around string
literals.
3a. *Someone else* decides to port your program to a different RDBMS,
and may not understand that SQLString implements MySQL's quoting and
escaping rules only.

4. MySQL AB finally get off their collective duffs and adds real
parameter separation to the MySQL wire protocol, and implements real
prepared statements to massive speed gains in scenarios that are
relevant to your interests; string-based query construction gets left
out in the cold.
4a. As with case 3, except that instead of the rules changing when you
move to a new RDBMS, it's the relative performance of submitting new
queries versus reusing a parameterized query that changes.

On top of the obvious issue of completely avoiding quoting bugs, using
query parameters rather than escaping and string manipulation neatly
saves you from having to address any of these problems (and a multitude
of others) -- the DB API implementation will handle things for you, and
you are propagating good practice in an easy-to-understand form.

I am honestly at a loss trying to understand your position. There is a
huge body of documentation out there about the weaknesses of
string-manipulation-based approaches to query construction, and the use
of query parameters is so compellingly the Right Thing that I have a
very hard time comprehending why anyone would opt not to use it except
out of pure ignorance of their existence. Generating executable code --
including SQL -- from untrusted user input introduces an large
vulnerability surface for very little benefit.

You don't handle function parameters by building up python-language
strs containing the values as literals and eval'ing them, do you?

-o

[0] If you want to be *really* pedantic, string-manipulation-based
query construction is strongly correlated with the occurrence of SQL
injection vulnerabilities and bugs, which is in turn not strongly
correlated with very many other practices. Happy?

Carl Banks · Jun 28, 2010

Except for the huge number that deal with text protocols or languages.

No, not really. Almost all types of libraries have binary APIs,
including those that deal with text protocols or language. Any
control with string commands is something that's built on top of the
binary API. And culturally, programmers interfacing those libraries
expect to and are expected to use the binary API for low-level
programming.

RDBs, as a whole, either don't have binary APIs or they have them but
no one really uses them.

Because SQL is a text language.

Circular logic. I'm disappointed, usually when you sit on your
reinforced soapbox and pretense the air of infinite expertise you at
least use reasonable logic.

Also, I was asking about databases. "SQL is a text language" is not
the answer to the question "Why do RDBs use string commands instead of
binary APIs"?

Carl Banks

Carl Banks · Jun 28, 2010

Seriously, almost every other kind of library uses a binary API. What
makes databases so special that they need a string-command based API?
How about this instead (where this a direct binary interface to the
library):

Click to expand...

results = rdb_query(table = model,
columns = [model.name, model.number])

Click to expand...

results = rdb_inner_join(tables = [records,tags],
joins = [(records.id,tags.record_id)]),
columns = [record.name, tag.name])

Click to expand...

Well, we know the real reason is that C, Java, and friends lack
expressiveness and so constructing a binary query is an ASCII
nightmare. Still, it hasn't stopped binary APIs in other kinds of
libraries.

Click to expand...

Well, the answer to that one is simple. SQL, in the hands of somebody
like me, can be used to express a few pathetic joins and what I do with
it could probably be handled with the kind of API you're describing.
But, the language has far more expressivity than that, and a
domain-specific language is really a good fit for what it can do.

I'm not the biggest expert on SQL ever, but the only thing I can think
of is expressions. Statements don't express anything very complex,
and could straightforwardly be represented by function calls. But
it's a fair point.

The problem is not so much that SQL queries are described as text
strings,

No, it is the problem, or part of it. String commands are inherently
prone to injection attacks, that's the main problem with them.

but that the distinction between program and data gets lost if
you build the query as one big string.

That too.

Carl Banks

Owen Jacobson · Jun 28, 2010

Seriously, almost every other kind of library uses a binary API. What
makes databases so special that they need a string-command based API?
How about this instead (where this a direct binary interface to the
library):

Click to expand...

results = rdb_query(table = model,
columns = [model.name, model.number])

Click to expand...

results = rdb_inner_join(tables = [records,tags],
joins = [(records.id,tags.record_id)]),
columns = [record.name, tag.name])

Click to expand...

Well, we know the real reason is that C, Java, and friends lack
expressiveness and so constructing a binary query is an ASCII
nightmare. Still, it hasn't stopped binary APIs in other kinds of
libraries.

Click to expand...

Well, the answer to that one is simple. SQL, in the hands of somebody
like me, can be used to express a few pathetic joins and what I do with
it could probably be handled with the kind of API you're describing.
But, the language has far more expressivity than that, and a
domain-specific language is really a good fit for what it can do.

Click to expand...

I'm not the biggest expert on SQL ever, but the only thing I can think
of is expressions. Statements don't express anything very complex,
and could straightforwardly be represented by function calls. But
it's a fair point.

Off the top of my head, I can think of a few things that would be
tricky to turn into an API:

* Aggregation (GROUP BY, aggregate functions over arbitrary
expressions, HAVING clauses).
* CASE expressions.
* Subqueries.
* Recursive queries (in DBMSes that support them).
* Window clauses (likewise).
* Set operations between queries (UNION, DIFFERENCE, INTERSECT).
* A surprisingly rich set of JOIN clauses beyond the obvious inner
natural joins.
* Various DBMS-specific locking hints.
* Computed inserts and updates.
* Updates and deletes that include joins.
* RETURNING lists on modification queries.
* Explicit (DBMS-side) cursors.

This is by no means an exhaustive list.

Of course, it's possible to represent all of this via an API rather
than a language, and libraries like SQLAlchemy make a reasonable
attempt at doing just that. However, not every programming language has
the kind of structural flexibility to do that well: a library similar
to SQLalchemy would be incredibly clunky (if it worked at all) in, say,
Java or C#, and it'd be nearly impossible to pull off in C. Even LDAP,
which is defined more in terms of APIs than languages, forgoes trying
to define a predicate API and uses a domain-specific filtering language
instead.

There's certainly a useful subset of SQL that could be trivially
replaced with an API. Simple by-the-numbers CRUD queries don't exercise
much of SQL's power. In fact, we can do that already: any ORM can
handle that level just fine.

-o

Owen Jacobson · Jun 28, 2010

Seriously, almost every other kind of library uses a binary API. What
makes databases so special that they need a string-command based API?
How about this instead (where this a direct binary interface to the
library):

Click to expand...

results = rdb_query(table = model,
columns = [model.name, model.number])

Click to expand...

results = rdb_inner_join(tables = [records,tags],
joins = [(records.id,tags.record_id)]),
columns = [record.name, tag.name])

Click to expand...

Well, we know the real reason is that C, Java, and friends lack
expressiveness and so constructing a binary query is an ASCII
nightmare. Still, it hasn't stopped binary APIs in other kinds of
libraries.

Click to expand...

Well, the answer to that one is simple. SQL, in the hands of somebody
like me, can be used to express a few pathetic joins and what I do with
it could probably be handled with the kind of API you're describing.
But, the language has far more expressivity than that, and a
domain-specific language is really a good fit for what it can do.

Click to expand...

I'm not the biggest expert on SQL ever, but the only thing I can think
of is expressions. Statements don't express anything very complex,
and could straightforwardly be represented by function calls. But
it's a fair point.

Off the top of my head, I can think of a few things that would be
tricky to turn into an API:

* Aggregation (GROUP BY, aggregate functions over arbitrary
expressions, HAVING clauses).
* CASE expressions.
* Subqueries.
* Recursive queries (in DBMSes that support them).
* Window clauses (likewise).
* Set operations between queries (UNION, DIFFERENCE, INTERSECT).
* A surprisingly rich set of JOIN clauses beyond the obvious inner
natural joins.
* Various DBMS-specific locking hints.
* Computed inserts and updates.
* Updates and deletes that include joins.
* RETURNING lists on modification queries.
* Explicit (DBMS-side) cursors.

This is by no means an exhaustive list.

Of course, it's possible to represent all of this via an API rather
than a language, and libraries like SQLAlchemy make a reasonable
attempt at doing just that. However, not every programming language has
the kind of structural flexibility to do that well: a library similar
to SQLalchemy would be incredibly clunky (if it worked at all) in, say,
Java or C#, and it'd be nearly impossible to pull off in C. Even LDAP,
which is defined more in terms of APIs than languages, forgoes trying
to define a predicate API and uses a domain-specific filtering language
instead.

There's certainly a useful subset of SQL that could be trivially
replaced with an API. Simple by-the-numbers CRUD queries don't exercise
much of SQL's power. In fact, we can do that already: any ORM can
handle that level just fine.

-o

Carl Banks · Jun 28, 2010

Seriously, almost every other kind of library uses a binary API. What
makes databases so special that they need a string-command based API?
How about this instead (where this a direct binary interface to the
library):
results = rdb_query(table = model,
columns = [model.name, model.number])
results = rdb_inner_join(tables = [records,tags],
joins = [(records.id,tags.record_id)]),
columns = [record.name, tag.name])
Well, we know the real reason is that C, Java, and friends lack
expressiveness and so constructing a binary query is an ASCII
nightmare. Still, it hasn't stopped binary APIs in other kinds of
libraries.
Well, the answer to that one is simple. SQL, in the hands of somebody
like me, can be used to express a few pathetic joins and what I do with
it could probably be handled with the kind of API you're describing.
But, the language has far more expressivity than that, and a
domain-specific language is really a good fit for what it can do.

Click to expand...

Click to expand...

I'm not the biggest expert on SQL ever, but the only thing I can think
of is expressions. Statements don't express anything very complex,
and could straightforwardly be represented by function calls. But
it's a fair point.

Click to expand...

Off the top of my head, I can think of a few things that would be
tricky to turn into an API:

* Aggregation (GROUP BY, aggregate functions over arbitrary
expressions, HAVING clauses).
* CASE expressions.
* Subqueries.
* Recursive queries (in DBMSes that support them).
* Window clauses (likewise).
* Set operations between queries (UNION, DIFFERENCE, INTERSECT).
* A surprisingly rich set of JOIN clauses beyond the obvious inner
natural joins.
* Various DBMS-specific locking hints.
* Computed inserts and updates.
* Updates and deletes that include joins.
* RETURNING lists on modification queries.
* Explicit (DBMS-side) cursors.

This is by no means an exhaustive list.

I don't know the exact details of all of these, but I'm going to opine
that at least some of these are easily expressible with a function
call API. Perhaps more naturally than with string queries. For
instance, set operations:

query1 = rdb_query(...)
query2 = rdb_query(...)

final_query = rdb_union(query1,query2)

or

final_query = query1 & query2

I'm not sure why GROUP BY couldn't be expressed by a keyword
argument. The complexity of aggregate functions and computed inserts
comes mainly from expressions (which Roy Smith already mentioned), the
actual statements are simple.

Of course, it's possible to represent all of this via an API rather
than a language, and libraries like SQLAlchemy make a reasonable
attempt at doing just that. However, not every programming language has
the kind of structural flexibility to do that well: a library similar
to SQLalchemy would be incredibly clunky (if it worked at all) in, say,
Java or C#, and it'd be nearly impossible to pull off in C.

Yeah, which was kind of my original theory.

Carl Banks

Stephen Hansen · Jun 28, 2010

I'm not the biggest expert on SQL ever, but the only thing I can think
of is expressions. Statements don't express anything very complex,
and could straightforwardly be represented by function calls.

See, there's really two kinds of SQL out there.

There's the layman's SQL which is pretty straight-forward. Sure, it can
start looking a little complicated if you get multiple clauses in the
WHERE line (and maybe you're ambitious and do a simple inner join), but
its probably still not bad. That can get translated into an API pretty
easily.

Then there's the type of SQL that results in DBA's having jobs-- and
deservedly so. Its *really* a very flexible and powerful language
capable of doing quite a lot to bend, flex, twist, and interleave that
data in the server while building up a result set for you.

I'm honestly only really in the former camp with a toe into the latter
(I use aggregation and windowing functions over some interesting joins
on occasion, but it takes effort). So I can't give a lot of serious
examples to *prove* I'm right.

So I just have to say: based on my experience and admittedly limited
imagination, converting the full expressive power of SQL into a regular
sort of API would be a very, very, very hairy sort of mess. SQLAlchemy
can do the layman's SQL, and can *kind of* do a *little bit* of the
advanced stuff-- but usually, it does the advanced stuff by just making
it very easy for you to shove it out of the way and do SQL directly.

But still: that's the structured part of SQL which belongs in a string.
The data does not. It should be obvious that when a database provides
you a mechanism to pass data in such that it doesn't need sanitization*
at all, that's preferable to actually doing sanitization, even if you're
divinely capable of perfect sanitization and even if sanitization is a
trivial task that a monkey should be able to handle.

--

... Stephen Hansen
... Also: Ixokai
... Mail: me+list/python (AT) ixokai (DOT) io
... Blog: http://meh.ixokai.io/

P.S. *My computer /swears/ sanitization is spelled wrong. Either I'm
high or it's high. Stupid old school mac mini.

Stephen Hansen · Jun 28, 2010

I don't know the exact details of all of these, but I'm going to opine
that at least some of these are easily expressible with a function
call API. Perhaps more naturally than with string queries. For
instance, set operations:

query1 = rdb_query(...)
query2 = rdb_query(...)

final_query = rdb_union(query1,query2)

or

final_query = query1& query2

But, see, that's not actually what's going on behind the scenes in the
database. Unless your "query1" and "query2" objects are opaque
pseudo-objects which do not actually represent results -- the query
planners do a *lot* of stuff by looking at the whole query and computing
just how to go about executing all of the instructions.

The engine of a SQL database is a pretty sophisticated little pieces of
coding. Because SQL is declarative, the engine is able to optimize just
how to do everything when it looks at the full query, and even try out a
few different ideas at first before deciding on just which path to take.
(This is an area where parametrized queries is even more important: but
I'm not sure if MySQL does proper prepared queries and caching of
execution plans).

If you go and API it, then you're actually imposing an order on how it
processes the query... unless your API is just a sort of opaque wrapper
for some underlining declarative structure. (Like ORM's try to be)

--

... Stephen Hansen
... Also: Ixokai
... Mail: me+list/python (AT) ixokai (DOT) io
... Blog: http://meh.ixokai.io/

Carl Banks · Jun 28, 2010

Then there's the type of SQL that results in DBA's having jobs-- and
deservedly so. Its *really* a very flexible and powerful language
capable of doing quite a lot to bend, flex, twist, and interleave that
data in the server while building up a result set for you.

All right, I get it.

I'm not talking about SQL, I'm talking about RDBs. But I guess it is
important for serious RDBs to support queries complex enough that a
language like SQL is really needed to express it--even if being called
from an expressive language like Python. Not everything is a simple
inner joins. I defer to the community then, as my knowledge of
advanced SQL is minimal.

We'll just have accept the risk of injection attacks as a trade off,
and try to educate people to use placeholders when writing SQL.

Carl Banks

Owen Jacobson · Jun 28, 2010

But, see, that's not actually what's going on behind the scenes in the
database. Unless your "query1" and "query2" objects are opaque
pseudo-objects which do not actually represent results -- the query
planners do a *lot* of stuff by looking at the whole query and
computing just how to go about executing all of the instructions.

I believe that *is* his point: that we can replace the SQL language
with a "query object model" that lets us specify what we want without
resorting to string-whacking when our needs are dynamic, without
changing the rest of the workflow. This is obviously true: each RDBMS
does something very much like what Carl is proposing, internally.
However, implementing such an API usefully (never mind comfortably) in
a cross-language way is... difficult, and an RDBMS that can only be
used from Python (or even from Python and other Smalltalk-like
languages) is not terribly useful at all.

-o

Carl Banks · Jun 28, 2010

Kindly stop inventing straw men to attack; I deny the position you're
painting for me.

No, this is not a straw man, you are 100% percent guilty of circular
logic as I accused you of.

Plus, I will not kindly do anything for you unless you kindly stop
being condescending and self-righteous when answering questions and
start treating people with respect. It's not just me, you do it to
newbies who have reasonable questions and you end up making them feel
like assholes just for asking. You don't just act that way to newbies
that deserve it. You are part of the reason people are here accusing
the Python community of being unfriendly and unhelpful.

And that's not a strawman, either.

Carl Banks

Carl Banks · Jun 28, 2010

But, see, that's not actually what's going on behind the scenes in the
database. Unless your "query1" and "query2" objects are opaque
pseudo-objects which do not actually represent results

That's exactly what they are. Nothing is actually sent to the
database until the user starts retrieving results. This is fairly
common thing for some interfaces to do.

For instance, OpenGL almost always returns immediately after a command
is posted without doing anything. The driver will queue the command
in memory until some event happens to trigger it (maybe a signal from
the graphics that is is done processing commands, or the queue being
full, or an explicit flush request from the user).

Incidentally, OpenGL has its own DSL for per-vertex and per-pixel
operations (known as vertex and fragment shaders) that replaces an
older binary API. I daresay it's a little less at risk for an
injection attack, seeing that the shaders run on the GPU and only run
simple math operations. But you never know.

Carl Banks

Kushal Kumaran · Jun 28, 2010

What am I passing, then?

Here's what gcc tells me (I declared buf as char buf[512]):
sprintf.c:8: warning: passing argument 1 of â€˜snprintfâ€™ from
incompatible pointer type
/usr/include/stdio.h:363: note: expected â€˜char * __restrict__â€™ but
argument is of type â€˜char (*)[512]â€™

You just need to lose the & from the macro.

Kushal Kumaran · Jun 28, 2010

This is off-topic, but I believe snprintf() in C can *never* safely be
the only thing you do to the buffer: you also have to NUL-terminate it
manually in some corner cases. See the documentation.

snprintf goes to great lengths to be safe, in fact. You might be
thinking of strncpy.

is list comprehension necessary?	15	Oct 26, 2010
comp.lang.c Answers to Frequently Asked Questions (FAQ List)	1	Feb 1, 2004

Why Is Escaping Data Considered So Magical?

Nobody

Jorgen Grahn

Jorgen Grahn

Jorgen Grahn

Carl Banks

Roy Smith

Owen Jacobson

Carl Banks

Carl Banks

Owen Jacobson

Owen Jacobson

Carl Banks

Stephen Hansen

Stephen Hansen

Carl Banks

Owen Jacobson

Carl Banks

Carl Banks

Kushal Kumaran

Kushal Kumaran

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads