FindFirstIn

J

jacob navia

Task:

given a text, and a set of characters, find the first occurrence of any
of the characters in the set.

Interface

There are two possibilities:
1)
size_t strFindFirst(char *text,char *set);

The function would return the index of the first character in the string.

2)
char *strFindFirst(char *text,char *set);

The function would return a pointer to a character in the middle of the
string.

Discussion
----------
The first solution is OK if you are programing in C++ where counted
strings are the norm. Lcc-win also provides you with counted strings
using operator overloading and the first solution is choosen in that
context.

For a pure "standard C" solution however, the second solution could be
more interesting: you have immediately the wanted character with just
dereferencing a pointer.

Actually of course both solutions are completely equivalent since you
can add to the start of the string the index returned by the first
solution and obtain a pointer that is equivalent to the second solution.

A first approach:

1 #include <string.h>
2 char *strFindFirstIn(char *str,char *set)
3 {
4 char *pSet;
5 int c = *str;
6 while (*str) {
7 pSet = set;
8 while (*pSet) {
9 if (c == *pSet++) return str;
10 }
11 c = *++str;
12 }
13 return str;
14 }

For each character in the text we search if it matches any of the
characters in the string. If we have N characters in the text and M
characters in the set, we make N x M character comparisons.

A more sophisticated approach:

1 char *strFindFirstIn1(char *str,char *set)
2 {
3 int c = *str;
4 unsigned char tab[256];
5 unsigned char *p=str;
6
7 if (*p==0) return p;
8 memset(tab,0,sizeof(tab));
9 while (*set) {
10 tab[*(unsigned char *)set++]=1;
11 }
12 while (*p) {
13 if (tab[*p]) return p;
14 p++;
15 }
16 return p;
17 }

We make a table of 256 positions filled initially with zeroes. For each
character in the set we fill the table with a "1" value. Then, we go
through the text and see for each character if the position contans a
zero (no match) or a one (match).

Note that for repeated characters in the set, the first solution will
again test for the character... but not in the second solution.

Of course the second solution has a drawback if the text is very short
and the set very small. It needs a zeroing of 256 positions, then the
filling of the occupied positions.

But in the case of a very long text and a longuish set, the second
soution wins since it does only one comparison per character!

How about you?

Can you give a better solution?

Comments welcome.

Here is a test program.
----------------------

1 #include <stdio.h>
2 int main(void)
3 {
4 char *s=
"This sentence will be changed to replace vowels by asterisks";
5 char *p = strFindFirstIn1(s,"aeiou");
6 while (*p) {
7 *p++ = '*';
8 p = strFindFirstIn(p,"aeiou");
9 }
10 printf("%s\n",s);
11 }

jacob
 
B

BartC

given a text, and a set of characters, find the first occurrence of any of
the characters in the set.
A first approach:
For each character in the text we search if it matches any of the
characters in the string. If we have N characters in the text and M
characters in the set, we make N x M character comparisons.

A more sophisticated approach:

1 char *strFindFirstIn1(char *str,char *set)
2 {
3 int c = *str;
4 unsigned char tab[256];
5 unsigned char *p=str;
6
7 if (*p==0) return p;
8 memset(tab,0,sizeof(tab));
9 while (*set) {
10 tab[*(unsigned char *)set++]=1;
11 }
We make a table of 256 positions filled initially with zeroes. For each
character in the set we fill the table with a "1" value. Then, we go
through the text and see for each character if the position contans a zero
(no match) or a one (match).

Note that for repeated characters in the set, the first solution will
again test for the character... but not in the second solution.

Of course the second solution has a drawback if the text is very short and
the set very small. It needs a zeroing of 256 positions, then the filling
of the occupied positions.

If the same set if being used to match against millions of short strings,
then possibly creating the set can be a bottleneck. In this case, perhaps
allow for the caller to supply the set (maybe using an extra function to
create it separately). Then that overhead can be largely eliminated.

Also, when searching the same (long) string many times, for a different set
each time, it might be worth creating an index (table of char*) of the 256
characters with the initial position in the string (with NULL if a character
does not occur).

Then search time for subsequent finds depends only on the length of the set
(maximum 256). For a single character set, that would take no time at all.
 
K

Kaz Kylheku

Task:

given a text, and a set of characters, find the first occurrence of any
of the characters in the set.

#include <string.h>

char *first_occurrence = strpbrk(str, set);

Obscure trivia: the "break" in strpbrk and "span" in strspn come from Snobol.
 
A

Anand Hariharan

#include <string.h>

char *first_occurrence = strpbrk(str, set);

Obscure trivia: the "break" in strpbrk and "span" in strspn come from Snobol.

Or

size_t idx = strcspn(str, set);


As far as complexity analysis goes, the first solution has a worst case of M x N operations, whereas the second one has 256 * M operations.

- Anand
 
J

jacob navia

Le 22/05/2014 20:21, Kaz Kylheku a écrit :
#include <string.h>

char *first_occurrence = strpbrk(str, set);

Obscure trivia: the "break" in strpbrk and "span" in strspn come from Snobol.

Well, who cares about that?

How would you write that?

That was my aim at posting the message. Something for people that start
programming to see different ways of doing stuff.

Thanks for your input.
 
J

jacob navia

Le 22/05/2014 20:41, Anand Hariharan a écrit :
Or

size_t idx = strcspn(str, set);


As far as complexity analysis goes, the first solution has a worst case of M x N operations, whereas the second one has 256 * M operations.

- Anand
The standard says about that, that
"... computes the length of the maximum initial segment of the string
pointed to by s1 which consists entirely of characters not from the
string pointed to by s2."

How clear. Instead of saying "returns the index of the first character
of a set in a text"

But anyway the goal was to show how that could be implemented.
 
B

Ben Bacarisse

jacob navia said:
Le 22/05/2014 20:41, Anand Hariharan a écrit :
The standard says about that, that
"... computes the length of the maximum initial segment of the string
pointed to by s1 which consists entirely of characters not from the
string pointed to by s2."

How clear. Instead of saying "returns the index of the first character
of a set in a text"

They say different things so the authors could not have used your words
instead. To compare two definitions for clarity they must say the same
thing.
 
J

jacob navia

Le 22/05/2014 22:44, Ben Bacarisse a écrit :
They say different things so the authors could not have used your words
instead. To compare two definitions for clarity they must say the same
thing.

Probably

That standard is so well written that anybody understands what it says
but me.
 
B

Ben Bacarisse

jacob navia said:
Here is a test program.
----------------------

1 #include <stdio.h>
2 int main(void)
3 {
4 char *s=
"This sentence will be changed to replace vowels by asterisks";

UB. You want char s[] = ...;
5 char *p = strFindFirstIn1(s,"aeiou");
6 while (*p) {
7 *p++ = '*';
8 p = strFindFirstIn(p,"aeiou");

Are you deliberately calling two separate versions?

To avoid this (and to avoid duplicating the set) I'd write one call:

p = s;
while (*(p = strFindFirstIn(p, "aeiou"))) *p++ = '*';
 
B

Ben Bacarisse

jacob navia said:
Le 22/05/2014 22:44, Ben Bacarisse a écrit :

Probably

That standard is so well written that anybody understands what it says
but me.

I don't think it's hard. What is the index of the first character in
the set "x" in the string "y"? I don't know. What is the length of the
maximum initial segment of the string "y" which consists entirely of
characters not in the string "x"? Clearly 1.
 
J

jacob navia

Le 22/05/2014 23:49, Ben Bacarisse a écrit :
I don't think it's hard. What is the index of the first character in
the set "x" in the string "y"? I don't know. What is the length of the
maximum initial segment of the string "y" which consists entirely of
characters not in the string "x"? Clearly 1.


Sure. To say that

"If no character is found it returns the length of the string."

that's way too complicated. I understand now. They had to choose a
simpler way.

GREAT!
 
J

jacob navia

Le 22/05/2014 23:44, Ben Bacarisse a écrit :
jacob navia said:
Here is a test program.
----------------------

1 #include <stdio.h>
2 int main(void)
3 {
4 char *s=
"This sentence will be changed to replace vowels by asterisks";

UB. You want char s[] = ...;


yes, thanks
Are you deliberately calling two separate versions?

No, it is a typo
To avoid this (and to avoid duplicating the set) I'd write one call:

p = s;
while (*(p = strFindFirstIn(p, "aeiou"))) *p++ = '*';

Looks ugly. I prefer two calls...
 
J

jacob navia

Le 22/05/2014 18:42, jacob navia a écrit :
I rewrote the second function in assembler. Only 77 bytes!

gcc (with -Os) is 350.
 
B

Ben Bacarisse

jacob navia said:
Le 22/05/2014 23:44, Ben Bacarisse a écrit :

No, it is a typo

Looks ugly. I prefer two calls...

I can't argue with that, except to say I prefer the version where the
error I pointed out is impossible! Anyway, I don't expect to persuade,
just to put it out there. It's helpful, I think, to see options.
 
B

Ben Bacarisse

jacob navia said:
Le 22/05/2014 23:49, Ben Bacarisse a écrit :

Sure. To say that

"If no character is found it returns the length of the string."

that's way too complicated. I understand now. They had to choose a
simpler way.

GREAT!

The sarcasm makes me *want* to disagree! But you get my point, I think.
Had you presented specifications that covered the same cases in the same
detail, an assessment could be made if one was indeed overly complex.
 
I

Ike Naar

Le 22/05/2014 18:42, jacob navia a ?crit :
I rewrote the second function in assembler. Only 77 bytes!
gcc (with -Os) is 350.

Another interesting metric would be the number of bugs.
 
I

Ian Collins

jacob said:
A more sophisticated approach:

1 char *strFindFirstIn1(char *str,char *set)
2 {
3 int c = *str;
4 unsigned char tab[256];
5 unsigned char *p=str;
6
7 if (*p==0) return p;
8 memset(tab,0,sizeof(tab));
9 while (*set) {
10 tab[*(unsigned char *)set++]=1;
11 }
12 while (*p) {
13 if (tab[*p]) return p;
14 p++;
15 }
16 return p;
17 }

There are rather a lot of type mismatches there! Not least the return
type and the variable returned...
 
J

jacob navia

Le 23/05/2014 08:13, Ian Collins a écrit :
jacob said:
A more sophisticated approach:

1 char *strFindFirstIn1(char *str,char *set)
2 {
3 int c = *str;
4 unsigned char tab[256];
5 unsigned char *p=str;
6
7 if (*p==0) return p;
8 memset(tab,0,sizeof(tab));
9 while (*set) {
10 tab[*(unsigned char *)set++]=1;
11 }
12 while (*p) {
13 if (tab[*p]) return p;
14 p++;
15 }
16 return p;
17 }

There are rather a lot of type mismatches there! Not least the return
type and the variable returned...
Since I am using the characters as an index into a table I have to avoid
indexing using negative numbers. I am forced to use *unsigned* chars.

Since in the interface I do not want to use unsigned char I am stuck
with implicit conversions.

I do not see any other way to get out of this problem. Do you?
 
J

jacob navia

Le 23/05/2014 07:45, Ike Naar a écrit :
Another interesting metric would be the number of bugs.

Go ahead. If your remark is not just bad faith tell me why this program
would be wrong
1 .text 64
2 ; char *strFindFirst(char *str,char *set)
3 strFindFirst:
4 ; Make space for a table of 256 positions
5 subq $256,%rsp
6 ; if (*str==0) return p;
7 cmpb $0,(%rcx)
8 je SetResultAndExit
9 ; memset(tab,0,sizeof(tab));
10 ; Save the registers rdi and rcx into r10 and r11
11 movq %rdi,%r10
12 movq %rcx,%r11
13 ; Prepare for the stosq loop: Count is 32 (256/8),rax is zero
14 ; and destination is rsp, stored in rdi.
15 movl $32,%ecx
16 xorl %eax,%eax
17 movq %rsp,%rdi
18 rep
19 stosq
20 movq %r10,%rdi
21 movq %r11,%rcx
22 jmp TestFillTableLoopCondition
23 FillTable:
24 ; tab[*(unsigned char *)set++]=1;
25 incb (%rsp,%rax)
26 incq %rdx
27 TestFillTableLoopCondition:
28 movb (%rdx),%al
29 testb %al,%al
30 jne FillTable
31 SearchCharLoop:
32 ; if (tab[*p]) return p;
33 movb (%rcx),%al ; move char to al
34 cmpb $0,(%rsp,%rax) ; index table with it
35 jne SetResultAndExit ; test if nonzero
36 incq %rcx ; next char
37 cmpb $0,(%rcx) ; test if zero
38 jne SearchCharLoop
39 SetResultAndExit:
40 movq %rcx,%rax ; set result
41 addq $256,%rsp ; restore stack
42 ret
 
B

Ben Bacarisse

jacob navia said:
Le 23/05/2014 08:13, Ian Collins a écrit :
jacob said:
A more sophisticated approach:

1 char *strFindFirstIn1(char *str,char *set)
2 {
3 int c = *str;
4 unsigned char tab[256];
5 unsigned char *p=str;
6
7 if (*p==0) return p;
8 memset(tab,0,sizeof(tab));
9 while (*set) {
10 tab[*(unsigned char *)set++]=1;
11 }
12 while (*p) {
13 if (tab[*p]) return p;
14 p++;
15 }
16 return p;
17 }

There are rather a lot of type mismatches there! Not least the return
type and the variable returned...
Since I am using the characters as an index into a table I have to
avoid indexing using negative numbers. I am forced to use *unsigned*
chars.

Since in the interface I do not want to use unsigned char I am stuck
with implicit conversions.

I do not see any other way to get out of this problem. Do you?

Yes. You could write

bool table[256];
bool *tab = table - CHAR_MIN;

(I used bool because that's logically the right type and you are a fan
of C99, but there might be a reason you chose char.)

You should certainly swap which pointer you do *(unsigned char *) on.
If you make a local unsigned char *pset, change p to be char * and use
the cast on tab[*p] instead, you get far fewer type mismatches with
pretty much the same code complexity. (And you can remove c, it's not
used.)

But, since you return a pointer to the null byte when none of the
non-null characters is found, logically the null is in the set of
searched-for characters. The code gets a lot simpler if you treat the
null just like the others:

bool table[256] = {0};
bool *tab = table - CHAR_MIN;
do tab[*set] = true; while (*set++);

while (!tab[*str]) ++str;
return str;

gcc compiles this to 20/21 instructions (depending on optimising for
speed or size). That's fewer than your hand-coded assembler version,
though it may, of course, be either more bytes, or slower, or both.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,226
Members
46,815
Latest member
treekmostly22

Latest Threads

Top