Why Is Escaping Data Considered So Magical?

  • Thread starter Lawrence D'Oliveiro

Paul Rubin

Cameron Simpson said:
The original V7 (and probably earlier) UNIX filesystem has 16 byte directory
entries: 2 bytes for an inode and 14 bytes for the name. You could use 14
bytes of that name, and strncpy makes it effective to work with that data
structure.

Why not use memcpy for that?
 
Steven D'Aprano

Not an exaggeration: it's an absolute. It literally says that any time
you try to solve a problem with a regex, (A) it won't solve the problem
and (B) it will in itself become a problem. And it doesn't tell you
why: you're supposed to accept or reject this without thinking.

It's a *two sentence* summary, not a reasoned and nuanced essay on the
pros and cons for REs.

Sheesh, I can just imagine you as a child, arguing with your teacher on
being told not to run with scissors -- "but teacher, there may be
circumstances where running with scissors is the right thing to do, you
are guilty of over-simplifying a complex topic into a single simplified
sound bite, instead of providing a detailed, rich heuristic for analysing
each and every situation in full before making the decision whether or
not to run with scissors".

If you look at the quote carefully, instead of making a knee-jerk
reaction, you will see that it is *literally* correct. Given some
problem, having decided to solve it with a regex, you DO have two
problems:

(1) Merely making the decision "use REs" doesn't actually solve the
original problem, any more than "use a hammer" solves the problem of "how
do I build a table?". You've decided on an approach and a tool, but your
original problem still applies.

(2) AND you now have the additional problem of dealing with regular
expressions, which are notoriously hard to write, harder to debug,
difficult to maintain, often slow, and incapable of solving certain common
problems (such as parsing nested parentheses).

So it might be a short, simplified quip, but it *is* literally correct.
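
To make the nested-parentheses point concrete: a classical regular
expression cannot track arbitrary nesting depth, while a plain counter can.
The following is only a minimal sketch in C; the function name
parens_balanced is made up for illustration, and it checks balance only
rather than building a parse tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if every '(' in s is closed by a later ')'. */
static bool parens_balanced(const char *s)
{
    size_t depth = 0;  /* number of currently open parentheses */

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s == '(') {
            depth++;
        } else if (*s == ')') {
            if (depth == 0)
                return false;  /* closing paren with nothing open */
            depth--;
        }
    }
    return depth == 0;  /* anything still open is unbalanced */
}
```

A call such as parens_balanced("f(a, g(b, (c)))") returns true, while
"(()" and ")(" are both rejected.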


How can that be a good thing to keep in mind?

Because many people consider REs to be some sort of panacea for solving
every text-based problem, and it's a good thing to open their eyes.
 
Cameron Simpson

| > The original V7 (and probably earlier) UNIX filesystem has 16 byte directory
| > entries: 2 bytes for an inode and 14 bytes for the name. You could use 14
| > bytes of that name, and strncpy makes it effective to work with that data
| > structure.
|
| Why not use memcpy for that?

Because when you've pulled names _out_ of the directory structure they're
conventional C strings, ready for conventional C string mucking about:
NUL terminated, with no expectation that any memory is allocated beyond
the NUL.

Think of strncpy as a conversion function. Your source is a conventional
C string of unknown size, your destination is a NUL padded buffer of
known size. "Copy at most n bytes of this string into the buffer, pad
with NULs."
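
As a rough sketch of that conversion, assuming a V7-style 16-byte entry
(the struct and helper names v7_dirent, set_name and get_name are made up
for illustration and are not taken from any real header):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define V7_NAMELEN 14  /* width of the name field described above */

/* Hypothetical layout of a 16-byte directory entry: 2 + 14 bytes. */
struct v7_dirent {
    uint16_t inode;
    char     name[V7_NAMELEN];  /* NUL padded, NOT NUL terminated when full */
};

/* C string -> fixed field: copy at most 14 bytes, pad the rest with NULs. */
static void set_name(struct v7_dirent *d, const char *name)
{
    assert(d != NULL && name != NULL);
    strncpy(d->name, name, sizeof d->name);
}

/* Fixed field -> C string: out must have room for V7_NAMELEN + 1 bytes. */
static void get_name(const struct v7_dirent *d, char *out)
{
    assert(d != NULL && out != NULL);
    memcpy(out, d->name, sizeof d->name);
    out[V7_NAMELEN] = '\0';  /* a full 14-byte name has no NUL of its own */
}
```

memcpy fits the second direction because the field has a known size. In
the first direction it would read 14 bytes from a source string that may be
much shorter, which is the case strncpy handles.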

Cheers,
 
Lawrence D'Oliveiro

Michael said:
Okay, I will. Your code passes a char** when a char* is expected.

No it doesn’t.
Consider this variation where I use a dynamically allocated buffer
instead of static:

And so you misunderstand the difference between a C array and a pointer.
 
Michael Torrie

No it doesn’t.

You're right; it doesn't. Your code passes char (*)[512].

warning: passing argument 1 of ‘snprintf’ from incompatible pointer type
/usr/include/stdio.h:385: note: expected ‘char * __restrict__’ but
argument is of type ‘char (*)[512]’
And so you misunderstand the difference between a C array and a
pointer.

You make a pretty big assumption.

Given "char buf[512]", buf's type is char * according to the compiler
and every C textbook I know of. With a static char array, there's no
need to take its address since it *is* the address of the first
element. Taking the address can lead to problems if you ever substitute
a dynamically-allocated buffer for the statically-allocated one. For
one-dimensional arrays at least, static arrays and pointers are
interchangeable when calling snprintf. You do not agree?

Anyway, this is far enough away from Python.
 
Jorgen Grahn

It's a *two sentence* summary, not a reasoned and nuanced essay on the
pros and cons for REs.

Well, perhaps you cannot say anything useful about REs in general in
two sentences, and should use either more words, or not say anything
at all.

The way it was used in the quoted text above is one example of what I
mean. (Unless other details have been trimmed -- I can't check right
now.) If he meant to say "REs aren't really a good solution for this
kind of problem, even though they look tempting", then he should have
said that.
Sheesh, I can just imagine you as a child, arguing with your teacher on
being told not to run with scissors -- "but teacher, there may be
circumstances where running with scissors is the right thing to do, you
are guilty of over-simplifying a complex topic into a single simplified
sound bite, instead of providing a detailed, rich heuristic for analysing
each and every situation in full before making the decision whether or
not to run with scissors".

When I was a child I expected that kind of argumentation from adults.
I expect something more as an adult.

/Jorgen
 
Jorgen Grahn

You're right. I normally don't use sizeof(char). This is obviously a
contrived example; I just wanted to make the example such that there's
no way the original poster could argue that the crash is caused by
something other than &buf.

Then again, it's always a bad idea in C to make assumptions about
anything.

There are some things you cannot assume, others which few fellow
programmers care to memorize, and others which you can often get
away with (like assuming an int is more than 16 bits, when your code
is tied to a modern Unix anyway).

But sizeof(char) is always 1.
If you're on Windows and want to use the unicode versions of
everything, you'd need to do sizeof(). So using it here would remind
you that when you move to the 16-bit Microsoft unicode versions of
snprintf, you need to change the sizeof(char) lines as well to sizeof(wchar_t).

Yes -- see "unless you might change the type later" above.
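
One way to avoid spelling out the element type at all is to count elements
with sizeof, so the same expression survives a later switch from char to
wchar_t. This is only a sketch under stated assumptions: it uses the
standard C99 swprintf from <wchar.h> rather than any Microsoft-specific
variant, and the names COUNT_OF, format_sum_narrow and format_sum_wide are
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Element count of a true array; gives a wrong answer for a pointer. */
#define COUNT_OF(a) (sizeof (a) / sizeof (a)[0])

/* Narrow version: count is in chars, and sizeof(char) is 1 by definition. */
static int format_sum_narrow(char *out, size_t count, int a, int b)
{
    assert(out != NULL && count > 0);
    return snprintf(out, count, "%d + %d = %d", a, b, a + b);
}

/* Wide version: count is in wchar_t elements, not in bytes. */
static int format_sum_wide(wchar_t *out, size_t count, int a, int b)
{
    assert(out != NULL && count > 0);
    return swprintf(out, count, L"%d + %d = %d", a, b, a + b);
}
```

With "char n[32]" or "wchar_t w[32]", passing COUNT_OF(n) or COUNT_OF(w)
yields 32 in both cases, whereas a bare sizeof w would give the size in
bytes.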

/Jorgen
 
Stephen Hansen

Well, perhaps you cannot say anything useful about REs in general in
two sentences, and should use either more words, or not say anything
at all.

The way it was used in the quoted text above is one example of what I
mean. (Unless other details have been trimmed -- I can't check right
now.) If he meant to say "REs aren't really a good solution for this
kind of problem, even though they look tempting", then he should have
said that.

The way it is used above (Even with more stripping) is exactly where it
is legitimate.

Regular expressions are a powerful tool.

The use of a powerful tool when a simple tool is available that achieves
the same end is inappropriate, because power *always* has a cost.

The entire point of the quote is that when you look at a problem, you
should *begin* from the position that a complex, powerful tool is not
what you need to solve it.

You should always begin from a position that a simple tool will suffice
to do what you need.

The quote does not deny the power of regular expressions; it challenges
the widely held assumption and belief that comes from *somewhere* that they
are the best way to approach any problem that is text related.

Does it come off as negative towards regular expressions? Certainly. But
not because of any fault of re's on their own, but because there is this
widespread perception that they are the swiss army knife that can solve
any problem by just flicking out the right little blade.

It's about redefining perception.

Regular expressions are not the go-to solution for anything to do with
text. Regular expressions are the tool you reach for when nothing else
will work.

It's not your first step; it's your last (or, at least, one that happens
way later than most people come around expecting it to be).

--

... Stephen Hansen
... Also: Ixokai
... Mail: me+list/python (AT) ixokai (DOT) io
... Blog: http://meh.ixokai.io/
 
Nobody

Given "char buf[512]", buf's type is char * according to the compiler
and every C textbook I know of.

No, the type of "buf" is "char [512]", i.e. "array of 512 chars". If you
use "buf" as an rvalue (rather than an lvalue), it will be implicitly
converted to char*.

If you take its address, you'll get a "pointer to array of 512 chars",
i.e. a pointer to the array rather than to the first element. Converting
this to a char* will yield a pointer to the first element.

If buf was declared "char *buf", then taking its address will yield a
char**, and converting this to a char* will produce a pointer to the first
byte of the pointer, which is unlikely to be useful.
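
The distinction can be checked directly. The following is a minimal sketch
with made-up helper names; it relies only on sizeof and on comparing
addresses after conversion to void *.

```c
#include <assert.h>
#include <stddef.h>

/* Size of the object that &buf points at: the whole array. */
static size_t size_via_array_pointer(void)
{
    char buf[512];
    char (*whole)[512] = &buf;   /* pointer to array of 512 chars */
    return sizeof *whole;
}

/* Size of the object that the converted buf points at: one char. */
static size_t size_via_element_pointer(void)
{
    char buf[512];
    char *first = buf;           /* implicit conversion to char* */
    return sizeof *first;
}

/* 1 if &buf and buf hold the same address, which is why &buf "works". */
static int array_address_equals_first_element(void)
{
    char buf[512];
    return (void *)&buf == (void *)buf;
}

/* 1 if &p and p hold the same address; they do not for a real pointer. */
static int pointer_address_equals_target(void)
{
    char buf[512];
    char *p = buf;
    assert(p == &buf[0]);
    return (void *)&p == (void *)p;
}
```

The first two functions return 512 and 1. The third returns 1 and the
fourth returns 0, which is the case where passing &p to a function that
expects a char* goes wrong.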
 
Jean-Michel Pichavant

Stephen said:
The way it is used above (Even with more stripping) is exactly where
it is legitimate.

Regular expressions are a powerful tool.

The use of a powerful tool when a simple tool is available that
achieves the same end is inappropriate, because power *always* has a
cost.

The entire point of the quote is that when you look at a problem, you
should *begin* from the position that a complex, powerful tool is not
what you need to solve it.

You should always begin from a position that a simple tool will
suffice to do what you need.

The quote does not deny the power of regular expressions; it
challenges the widely held assumption and belief that comes from
*somewhere* that they are the best way to approach any problem that is
text related.

Does it come off as negative towards regular expressions? Certainly.
But not because of any fault of re's on their own, but because there
is this widespread perception that they are the swiss army knife that
can solve any problem by just flicking out the right little blade.

It's about redefining perception.

Regular expressions are not the go-to solution for anything to do with
text. Regular expressions are the tool you reach for when nothing else
will work.

It's not your first step; it's your last (or, at least, one that happens
way later than most people come around expecting it to be).

Guys, this dogmatic discussion already took place on this list. Why
start again?
re is part of the Python standard library, for some purpose I guess.

JM
 
Roy Smith

Stephen Hansen said:
The quote does not deny the power of regular expressions; it challenges
the widely held assumption and belief that comes from *somewhere* that they
are the best way to approach any problem that is text related.

Well, that assumption comes from historical unix usage where traditional
tools like awk, sed, ed, and grep, made heavy use of regex, and
therefore people learned to become proficient at them and use them all
the time. Somewhat later, the next generation of tools such as vi and
perl continued that tradition. Given the tools that were available at
the time, regex was indeed likely to be the best tool available for most
text-related problems.

Keep in mind that in the early days, people were working on hard-copy
terminals (http://en.wikipedia.org/wiki/ASR-33), so economy of
expression was a significant selling point for regexes.

Not trying to further this somewhat silly debate, just adding a bit of
historical viewpoint to answer the implicit question you ask as to where
the assumption came from.
 
Stephen Hansen

re is part of the Python standard library, for some purpose I guess.

No, *really*?

So all those people who have been advocating that it's useless and shouldn't
be are already too late?

Damn.

Well, there goes *that* whole crusade we were all out on. Since we can't
destroy re, maybe we can go club baby seals.

--

... Stephen Hansen
... Also: Ixokai
... Mail: me+list/python (AT) ixokai (DOT) io
... Blog: http://meh.ixokai.io/
 
Stephen Hansen

Well, that assumption comes from historical unix usage where traditional
tools like awk, sed, ed, and grep, made heavy use of regex, and
therefore people learned to become proficient at them and use them all
the time.

Oh, I'm fully aware of the history of re's -- but it's not those old hats
and even their students and the unix geeks I'm talking about.

It's the newbies and people wandering into the language with absolutely
no idea about the history of unix, shell scripting and such, who so
often arrive with the idea firmly planted in their head, that I wonder
at. Sure, there's going to be a certain amount of cross-pollination from
unix-geeks to students-of-students-of-students-of unix geeks to spread
the idea, but it seems too pervasive for that. I just picture a
re-vangelist camping out in high schools and colleges selling the party
line or something :)

--

... Stephen Hansen
... Also: Ixokai
... Mail: me+list/python (AT) ixokai (DOT) io
... Blog: http://meh.ixokai.io/

P.S. And no, unix geeks is not a pejorative term.
 
Mel

Nobody said:
Given "char buf[512]", buf's type is char * according to the compiler
and every C textbook I know of.

References from Kernighan & Ritchie _The C Programming Language_ second
edition:
No, the type of "buf" is "char [512]", i.e. "array of 512 chars". If you
use "buf" as an rvalue (rather than an lvalue), it will be implicitly
converted to char*.

K&R2 A7.1
If you take its address, you'll get a "pointer to array of 512 chars",
i.e. a pointer to the array rather than to the first element. Converting
this to a char* will yield a pointer to the first element.

K&R2 A7.4.2


        Mel.
 
Michael Torrie

No, the type of "buf" is "char [512]", i.e. "array of 512 chars". If you
use "buf" as an rvalue (rather than an lvalue), it will be implicitly
converted to char*.

Yes, this is true. I misstated. I meant that most textbooks I've seen
say to just use the variable in an *rvalue* as a pointer (can't think of
any lvalue use of an array).

K&R states that arrays (in C anyway) are always *passed* by pointer,
hence when you pass an array to a function it automatically decays into
a pointer. Which is what you said. So no need for & and the compiler
warning you get with it. That's all.

If the OP was striving for pedantic correctness, he would use &buf[0].
 
John Nagle

Nobody said:
Given "char buf[512]", buf's type is char * according to the compiler
and every C textbook I know of.

References from Kernighan & Ritchie _The C Programming Language_ second
edition:
No, the type of "buf" is "char [512]", i.e. "array of 512 chars". If you
use "buf" as an rvalue (rather than an lvalue), it will be implicitly
converted to char*.

Yes, unfortunately. The approach to arrays in C is just broken,
for historical reasons. To understand C, you have to realize that
in the early versions, function declarations weren't visible when
function calls were compiled. That came later, in ANSI C. So
parameter passing in C is very dumb. Billions of crashes due
to buffer overflows later, we're still suffering from that mistake.

But this isn't a Python issue.

John Nagle
 
Lawrence D'Oliveiro

Michael said:
Your case is still not persuasive.

So persuade me. I have given an example of code written the way I do it. Now
let’s see you rewrite it using your preferred technique, just to prove that
your way is simpler and easier to understand.

Enough hand-waving, let’s see some code!
 
Lawrence D'Oliveiro

The approach to arrays in C is just broken, for historical reasons.

Nevertheless, it is at least self-consistent. To return to my original
macro:

#define Descr(v) &v, sizeof v

As written, this works whatever the type of v: array, struct, whatever.
So parameter passing in C is very dumb.

Nothing to do with the above issue.
 
Rami Chowdhury

Nevertheless, it is at least self-consistent. To return to my original
macro:

#define Descr(v) &v, sizeof v

As written, this works whatever the type of v: array, struct, whatever.

Doesn't seem to, sorry. Using Michael Torrie's code example, slightly
modified...

[rami@tigris ~]$ cat example.c
#include <stdio.h>

#define Descr(v) &v, sizeof v

int main(int argc, char ** argv)
{
char *buf = malloc(512 * sizeof(char));
const int a = 2, b = 3;
snprintf(Descr(buf), "%d + %d = %d\n", a, b, a + b);
fprintf(stdout, buf);
free(buf);
return 0;
} /*main*/

[rami@tigris ~]$ clang example.c
example.c:11:18: warning: incompatible pointer types passing 'char **', expected
'char *' [-pedantic]
snprintf(Descr(buf), "%d + %d = %d\n", a, b, a + b);
^~~~~~~~~~
example.c:4:18: note: instantiated from:
#define Descr(v) &v, sizeof v
^~~~~~~~~~~~
<<snip>>
[rami@tigris ~]$ ./a.out
Segmentation fault
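
For comparison, here is a sketch of a variant that behaves the same for a
stack array and a heap buffer, because it passes the buffer itself and
carries the size explicitly. The names BUF_SIZE, write_sum and alloc_sum
are made up for illustration. With a "char *buf", "sizeof buf" is the size
of the pointer and "&buf" is a char**, so a macro built on those two
expressions cannot describe a malloc'd buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE 512  /* hypothetical buffer size, matching the examples */

/* Formats into buf; the caller states how many bytes buf really has. */
static int write_sum(char *buf, size_t size, int a, int b)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%d + %d = %d\n", a, b, a + b);
}

/* Heap variant: the size travels alongside the pointer, not via sizeof. */
static char *alloc_sum(int a, int b)
{
    char *buf = malloc(BUF_SIZE);
    if (buf != NULL)
        write_sum(buf, BUF_SIZE, a, b);
    return buf;  /* caller frees; NULL if allocation failed */
}
```

For a stack array, write_sum(arr, sizeof arr, 2, 3) is fine, since sizeof
of a true array is its full size. For the heap case the size has to come
from wherever the allocation was made.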
 
Lawrence D'Oliveiro

Rami said:
Doesn't seem to, sorry. Using Michael Torrie's code example, slightly
modified...

char *buf = malloc(512 * sizeof(char));

Again, you misunderstand the difference between a C array and a pointer.
Study the following example, which does work, and you might grasp the point:

ldo@theon:hack> cat test.c
#include <stdio.h>

int main(int argc, char ** argv)
{
char buf[512];
const int a = 2, b = 3;
snprintf(&buf, sizeof buf, "%d + %d = %d\n", a, b, a + b);
fprintf(stdout, buf);
return 0;
} /*main*/
ldo@theon:hack> ./test
2 + 3 = 5
 
