Can anyone write this recursion for simple regexp more beautifullyand clearly than the braggarts

P

Phil Carmody

Richard Heathfield said:
[Followups set to comp.lang.c]

In <[email protected]>,
Nick Keighley wrote:

why use assembler when you have C?

I can think of only two reasons. The first is easily dispensed with:
you might use assembly language instead of C because you WANT to. In
the professional sphere, this is unlikely to be considered a good
enough reason, but it's a perfectly adequate reason for hobbyists
(which obviously includes professional programmers in their hobbyist
capacity). "Because I want to" requires no further justification for
*any* decision whatsoever in any field of endeavour (provided the
decision is yours to make and doesn't break any local laws).

The second reason is perhaps best considered as a series of
conditions:

1) you have measured the performance of the C version

AND

2) you have determined that the performance does not meet the
constraints of your specification

AND

3) you have taken all reasonable steps to optimise your C code
(specifically, ensuring that you have chosen good algorithms and
implemented them properly in performance terms, and selected an
appropriately aggressive optimisation level on your compiler)

AND

4) you have taken a few relatively UNreasonable steps to optimise your
C code (e.g. loop unrolling, common subexpression elimination that
the compiler has somehow missed, etc)

AND

5) you have, or can acquire, or can hire, the requisite skill in
assembly language for your targeted platforms

AND

6) the assembly language version shows significant performance gains
over the C version

AND

7) performance on the platforms you support is more important than
portability to platforms that you don't yet support.

THEN you have a case for using assembly language.

In other words, almost never.

Remind me how I flush and invalidate a cache line in C?

Some of us still write for OSes, you know.

Phil
 
S

Seebs

Remind me how I flush and invalidate a cache line in C?
cacheflush(&foo);

:)

Some of us still write for OSes, you know.

Yup. That said, I've seen plenty of systems that provide C-level access
to some functionality like that, within reason. Past that... Well, that's
why we have low-level code that honors the C ABI so you can call into it.

-s
 
R

Robert Maas, http://tinyurl.com/uh3t

No, I don't think so. A typical API (Application Program Interface
From: (e-mail address removed) (Richard Tobin)
That's an ABI (Application Binary Interface), not a typical API.

I don't want to get into a big argument over what each of us
prefers to *call* something, that's a fruitless exercise. To my
mind, what you call an ABI is a special case of API. An API in the
generic sense is *documentation* as to how to call the system
utilities, together with "what's really happening under the hood".
Under the hood it's always binary. Whether it's expressed as
assembly language or C or Pascal is irrelevant.

My experience comes from the Macintosh, specifically System 6.0.3
where I interfaced in **binary** using Macintosh Allegro Common
Lisp 1.2.2, Pocket Forth, and Sesame C as an outer framework, but
all actual code I wrote involved hexadecimal opcodes for the OS or
toolbox traps and machine-level textual names like assembly
language for the individual registers (for OS traps that took
parameters and returned results in registers and in parameter
blocks linked from these registers) and pre-defined push/pop stack
callables ("functions" in Lisp, "words" in Forth, I forget what in
Sesame C).

The documentation for this API was in a multi-volume technical
document called "Inside Macintosh". It gave an "under the hood"
general explantion of how these traps worked, that you push
parameters on the stack (for toolbox traps. the high-level
routines) or allocated a block of memory of a particular size and
organzation to hold essentially a STRUCT of parameters and put a
pointer to that block in a register (for OS traps, the low-level
routines). For OS traps, you copy parameters into slots in that
parameter block and/or load other registers individually. Then you
execute the special opcode for that particular trap. Then you
unwind the stack or copy data out of the parameter block and/or
individual registers.

Then in the sections dealing with individual traps, it showed the
PASCAL code for calling the trap, and also cited the opcode in
hexadecimal. If you had MPW (Macintosh Programmer's Workshop)
installed on your Macintosh, which I didn't because it was too
expensive for my budget even when I had a paying job, it was all
pretty easy. Otherwise you had to make do, as I did, handcoding the
hexadecimal opcodes and hand-arranging the stack operations or
parameter-block building and destructuring. Note that Inside
Macintosh did **not** specify any C interface. If you had MPW you
needed to read your MPW documentation to see how to relate the
PASCAL documentation to what you really needed to code in C. If you
didn't have MPW, then you had to "roll your own", as I did.

So would you consider "Inside Macintosh", when used by somebody who
doesn't code in Pascal and doesn't have MPW available, to be an ABI
or API per your particular use of languaage?

MACL 1.2.2 had a rather nice set of functions and/or macros (long
time since I had a computer that could run MACL so don't expect to
remember such details) for building/destructuring the parameter
block and loading/fetching the individual registers. Pocket Forth
has a set of "words" that are almost as easy to use. Sesame C
didn't have anything nice at all. In Sesame C I would have to
manually embed the hexadecimal bytes for the traps in an assembly
language module that satisfied the C function-calling conventions,
and then link it as an assembly language module into the C program
build, and call it as a function from main() directly for test
purposes or indirectly in an application. About the only OS trap I
ever coded for Sesame C was allocating a block of memory, because
Sesame C didn't have malloc() so I had to write my own malloc()
from scratch using OS traps to get the memory and OS traps to
release it when I called my own free(). Unfortunately invoking a
separate OS trap for each malloc() call was horrendously slow, so
then I re-wrote it to allocate a huge block once then sub-divide it
to satisfy malloc() requests. By comparison, in MACL I implemented
high-level parameter-validating functions to interface to
manipulation of "resources", allowing me to examine and remove the
WDEF virus that had infected most of my diskettes, and to the
simple-harmonic-tone generator, allowing me to add "good" and
"wrong" sound effects to my flashcard-drill program, and allowing
me to play simple tunes from MACL using exactly the same syntax
that the HyperTalk "play" command used, and a few other kinds of
traps I forget now.

My Macintosh Plus died in mid-1999, so I haven't done any of this
for over ten years. MACL 1.2.2 doesn't run under System 7.5.5 on my
Macintosh Performa, and being unemployed this whole time I couldn't
afford to upgrade to MACL 1.3.2 which is said to work fine on this
system, so I couldn't continue my Inside Macintosh work using MACL.
Sesame C so totally sucks that I would rather avoid using it. No
malloc (except the crock I wrote myself), no STRUCTs at all, only
two data types 'int' and 'void*', no arrays at all (that's why I
needed malloc, to emulate arrays and STRUCTs), you get the
picture??? Talk about C being a high-level assembly langauge,
Sesame C isn't even that, it's a mid-level assembly language.

So that leaves Pocket Forth as the only programming system on my
Macintosh that is possibly worthwhile for coding toolbox/OS traps
and higher-level software based on them. Someday soon I may
implement a subset of Common Lisp, with no garbage reclamation, or
with a reference-count system, based in Pocket Forth, hopefully
portable to just about any Forth anywhere in the world, with a
per-OS-API low-level layer to provide a machine-independent
Forth-API, and everything above that layer exactly the same in all
Forths on all OSs, true portability.
 
S

sfuerst

THEN you have a case for using assembly language.
In other words, almost never.

Actually, the use of assembly language is a little more common than
that. Sometimes you just have to use it. The biggest case is when
you need to use functionality of your processor that isn't exported by
the C language.

Eamples of this are:
i/o instructions that talk to hardware.
atomic and memory barrier instructions, so parallel primitives can be
written.
Things that deal with or create calling conventions: i.e. creating a
user-space threading library needs some way of expressing and
implementing a stack-switch. Calling a C++ member function pointer
from C requires an asm thunk to get the parameters in the right
registers/stack slots.

The other case is where performance is part of the specification.
There, you are forced to use assembly to access instructions that
otherwise are not emitted by the C compiler.

Examples of this are:
Vectorization, SSE, SSE2, Altivec etc. (compiler intrinsics help, but
often emit extra register-register copies that slow things down.)
Bit-mashing instructions like bitscan forward, reverse and popcount

Finally, we get to the "implement a whole function in asm" part.
Here, yes, it is often a last resort in an all-out search for
performance. In my experience, functions that benefit the most are
ones where you just can't get the compiler to emit efficient code
because of some hidden knowledge you have, but can't quite tell it.
Otherwise, these days compilers do a pretty good job, and mindlessly
converting C into assembly is unlikely to create much of a speed-up.

Steven
 
F

Flash Gordon

sfuerst said:
Actually, the use of assembly language is a little more common than
that. Sometimes you just have to use it. The biggest case is when
you need to use functionality of your processor that isn't exported by
the C language.

Eamples of this are:
i/o instructions that talk to hardware.
atomic and memory barrier instructions, so parallel primitives can be
written.

Unless they are exposed as extensions, yes you can need assembler to
deal with those.
Things that deal with or create calling conventions: i.e. creating a
user-space threading library needs some way of expressing and
implementing a stack-switch.

Again, although you only write such things rarely.
Calling a C++ member function pointer
from C requires an asm thunk to get the parameters in the right
registers/stack slots.

Ask over in comp.lang.c++ and they will explain how to call C++ from C
without resorting to such fragile and dirty tricks.
The other case is where performance is part of the specification.
There, you are forced to use assembly to access instructions that
otherwise are not emitted by the C compiler.

Sometimes, yes, but modern compilers are often pretty good at this...
Examples of this are:
Vectorization, SSE, SSE2, Altivec etc. (compiler intrinsics help, but
often emit extra register-register copies that slow things down.)
Bit-mashing instructions like bitscan forward, reverse and popcount

Finally, we get to the "implement a whole function in asm" part.
Here, yes, it is often a last resort in an all-out search for
performance. In my experience, functions that benefit the most are
ones where you just can't get the compiler to emit efficient code
because of some hidden knowledge you have, but can't quite tell it.
Otherwise, these days compilers do a pretty good job, and mindlessly
converting C into assembly is unlikely to create much of a speed-up.

Agreed.
 
F

Flash Gordon

Richard said:
Why over there? Why is there more on topic than here for that?

Because the C++ standard defines the relevant mechanisms, not the C
standard, and the C++ people will know what you need to do in C++ to
make it work.

The C side of it is as simple as calling the bit provided from the C++
side, so nothing to discuss.
 
F

Flash Gordon

Richard said:
Oh. So you did know the answer.

I'm aware there is an answer, but I'm also aware that I'm not a C++
programmer so I don't know all the details. I'm also aware that some of
the people over in comp.lang.c++ *do* know the details.
 
F

Flash Gordon

Richard said:
For calling C++ from C?

Yes.

are certainly right, however, that clc++ is far more likely to give
the right answer than clc.

So why comment further on it here, rather than either forgetting or
asking where the experts are?
 
S

sfuerst

Because the C++ standard defines the relevant mechanisms, not the C
standard, and the C++ people will know what you need to do in C++ to
make it work.

Actually they probably won't. The ABI for member function pointers is
highly compiler and operating system dependent. The C++ standard says
nothing about how they are implemented at a low level, just how they
work when called from the C++ source code. Compiler writers have used
that leeway to have all sorts of implementations.
The C side of it is as simple as calling the bit provided from the C++
side, so nothing to discuss.

The whole point was doing things the other way around. With the C +
asm combo, you can call pretty much any function written in any other
language. This fact tends to make C the glue-language of choice. The
only exception I know about are things that require va_args.
Unfortunately, one of the constraints is knowing how many arguments to
pass to your arbitrary other function, (and also what their types
are).

A more specific example is trying to call something in a microsoft
windows library from C. Unfortunately, microsoft likes making C++ -
only interfaces. You can convert most of them to things that are C
callable with a few pointer tricks, but microsoft didn't export the
"thiscall" calling convection as an attribute you can use. So you
need a simple shim to move the first function argument from the stack
to the %ecx register. (Yes people actually do this. Calling graphics
library functions from C via asm without having to go through a C++
shim is a performance win in an area which people care about.)

The 64bit case is different. There, microsoft by default puts the
first argument in %rcx if it isn't a floating point number, so no
thunk required. However, now the calling convention is different from
64bit linux, which places the first argument in %rdi. For the longest
time gcc didn't support the windows calling convention (and vise
versa), so if you wanted to call a function in a windows 64bit DLL
from linux, you needed some other asm thunk to get the parameters in
the right registers/stack slots.

Steven
 
F

Flash Gordon

sfuerst said:
Actually they probably won't. The ABI for member function pointers is
highly compiler and operating system dependent. The C++ standard says
nothing about how they are implemented at a low level, just how they
work when called from the C++ source code. Compiler writers have used
that leeway to have all sorts of implementations.

<snip>

The standard doesn't need to. If you want to know, ask the experts, but
if not I don't really care if you make life hard for yourself.
 
F

Flash Gordon

sfuerst said:
Actually they probably won't. The ABI for member function pointers is
highly compiler and operating system dependent. The C++ standard says
nothing about how they are implemented at a low level, just how they
work when called from the C++ source code. Compiler writers have used
that leeway to have all sorts of implementations.

<snip>

The standard doesn't need to define an ABI. If you want to know, ask the
experts, but if not I don't really care if you make life hard for yourself.
 
I

Ian Collins

Flash said:
<snip>

The standard doesn't need to define an ABI. If you want to know, ask the
experts, but if not I don't really care if you make life hard for yourself.

The experts would live on a compiler specific group, not c.l.c++. But
no one in their right mind would want to call a C++ member though a
function pointer from C.
 
J

James Kuyper

sfuerst said:
Actually they probably won't. The ABI for member function pointers is
highly compiler and operating system dependent. The C++ standard says
nothing about how they are implemented at a low level, just how they
work when called from the C++ source code. Compiler writers have used
that leeway to have all sorts of implementations.

Without specifying how it works at that low level, the C++ standard does
specify at a high level how to declare a C function so that it can be
called from C++ code, and how to define a function to be compiled by C++
that will be callable from C. The fundamental mechanism is something
called "language linkage". A lot of the details are
implementation-specific, but the basic mechanism is not.
 
I

Ian Collins

James said:
Without specifying how it works at that low level, the C++ standard does
specify at a high level how to declare a C function so that it can be
called from C++ code, and how to define a function to be compiled by C++
that will be callable from C. The fundamental mechanism is something
called "language linkage". A lot of the details are
implementation-specific, but the basic mechanism is not.

No one's disputing that. The point was "Calling a C++ member function
pointer" from C which a different issue entirely. How one would attempt
this lunacy is highly compiler specific and unspecified.
 
W

WJ

bolega said:
This braggart admits that he had to put this code in TWO books and
visit it twice to be explained. I am puting the excerpt from pp2-4 of
this book and the C code. The C code will become indented and syntax
highlighted once you paste in emacs etc. It is my belief and
observation on a lot of problems by these so called "demi gods" that
they are actually all average and no more intelligent. Its all that
they got some opportunities to study some things at opportune time and
facilities and also spent a lot of time in a productive environment
and team.

I know that lisp eval is written more clear than this recursion below
because I am able to read it easily. and that code is almost self
explanatory. C is more quirky. When you really mean recursively call
another function, you are using return so you can have tail
recursion ???? .

Anyway, its your chance to show how clear C/C++/lisp/Scheme code you
can write that is clearer. Also, i dont exclude pseudocode but it
should be clear enough to be instantly translatable to a programming
language. The real goal is to learn how to write or DERIVE recursion,
how to enumerate cases, order them, and build recursion. You may even
put some introductory tuturial and dont have to reply here in ascii
but can upload a pdf at some link in rapidshare etc. Look how he put
the code after problem statement and then has the need to explain it.
Ideally, the discussion should be such that the student or reader
himself jumps to the solution. That is why I give these unix school of
pike/thomson/kernighan low grade in almost all their expositions
except that they wrote earliest books to make millions of dollars in
royalties and since now they are nobody due to linux, they are poorly
regurgitating old material.

Enjoy .............

============================

The Practice of Programming

In 1998, Rob Pike and I were writing The Practice of Programming
(Addison-Wesley). The
last chapter of the book, “Notation,” collected a number of examples
where good notation
led to better programs and better programming. This included the use
of simple data specifications
(printf, for instance), and the generation of code from tables.
Because of our Unix backgrounds and nearly 30 years of experience with
tools based on
regular expression notation, we naturally wanted to include a
discussion of regular
expressions, and it seemed mandatory to include an implementation as
well. Given our
emphasis on tools, it also seemed best to focus on the class of
regular expressions found in
grep—rather than, say, those from shell wildcards—since we could also
then talk about the
design of grep itself.
The problem was that any existing regular expression package was far
too big. The local
grep was over 500 lines long (about 10 book pages) and encrusted with
barnacles. Open
source regular expression packages tended to be huge—roughly the size
of the entire
book—because they were engineered for generality, flexibility, and
speed; none were
remotely suitable for pedagogy.
I suggested to Rob that we find the smallest regular expression
package that would illustrate
the basic ideas while still recognizing a useful and nontrivial class
of patterns. Ideally,
the code would fit on a single page.
Rob disappeared into his office. As I remember it now, he emerged in
no more than an
hour or two with the 30 lines of C code that subsequently appeared in
Chapter 9 of The
Practice of Programming. That code implements a regular expression
matcher that handles
the following constructs.

Character Meaning
c Matches any literal character c.
. (period) Matches any single character.
^ Matches the beginning of the input string.
$ Matches the end of the input string.
* Matches zero or more occurrences of the previous character.

This is quite a useful class; in my own experience of using regular
expressions on a day-today
basis, it easily accounts for 95 percent of all instances. In many
situations, solving the
right problem is a big step toward creating a beautiful program. Rob
deserves great credit
for choosing a very small yet important, well-defined, and extensible
set of features from
among a wide set of options.
Rob’s implementation itself is a superb example of beautiful code:
compact, elegant,
efficient, and useful. It’s one of the best examples of recursion that
I have ever seen, and it
shows the power of C pointers. Although at the time we were most
interested in conveying
the important role of good notation in making a program easier to use
(and perhaps
easier to write as well), the regular expression code has also been an
excellent way to
illustrate algorithms, data structures, testing, performance
enhancement, and other
important topics.

Implementation
In The Practice of Programming, the regular expression matcher is part
of a standalone program
that mimics grep, but the regular expression code is completely
separable from its
surroundings. The main program is not interesting here; like many Unix
tools, it reads
either its standard input or a sequence of files, and prints those
lines that contain a match
of the regular expression.
This is the matching code:
/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
if (regexp[0] == '^')
return matchhere(regexp+1, text);
do { /* must look even if string is empty */
if (matchhere(regexp, text))
return 1;
} while (*text++ != '\0');
return 0;
}
/* matchhere: search for regexp at beginning of text */
int matchhere(char *regexp, char *text)
{
if (regexp[0] == '\0')
return 1;
if (regexp[1] == '*')
return matchstar(regexp[0], regexp+2, text);
Character Meaning
c Matches any literal character c.
. (period) Matches any single character.
^ Matches the beginning of the input string.
$ Matches the end of the input string.
* Matches zero or more occurrences of the previous character.
4 C H A P T E R O N E
if (regexp[0] == '$' && regexp[1] == '\0')
return *text == '\0';
if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
return matchhere(regexp+1, text+1);
return 0;
}
/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
do { /* a * matches zero or more instances */
if (matchhere(regexp, text))
return 1;
} while (*text != '\0' && (*text++ == c || c == '.'));
return 0;
}

Arc:

(def match (regex-str text-str)
(with (regex (coerce regex-str 'cons)
text (coerce text-str 'cons))
(if (is #\^ (car regex))
(match-here (cdr regex) text)
(or
(reclist (fn (xs) (match-here regex xs)) text)
(match-here regex '())))))

(def match-here (regex text)
(or
(empty regex)
(and (is #\* (cadr regex))
(match-star (car regex) (cddr regex) text))
(and (is #\$ (car regex)) (single regex) (empty text))
(and (or (and (acons text) (is #\. (car regex)))
(is (car regex) (car text)))
(match-here (cdr regex) (cdr text)))))

(def match-star (chr regex text)
(or
(match-here regex text)
(if (and (acons text) (or (is chr (pop text)) (is chr #\.)))
(match-star chr regex text)
nil)))


Jarc> (match "ob" "foobar")
t
Jarc> (match "ab" "foobar")
nil
Jarc> (match "fob" "foobar")
nil
Jarc> (match "fo*b" "foobar")
t
Jarc> (match "f.*b" "foobar")
t
Jarc> (match "^f.*b" "foobar")
t
Jarc> (match "f.*b$" "foobar")
nil
Jarc> (match "b.r$" "foobar")
t
Jarc> (match "^foobar$" "foobar")
t
Jarc> (match "" "foobar")
t
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,091
Messages
2,570,605
Members
47,225
Latest member
DarrinWhit

Latest Threads

Top