Finding a word inside a string

A

August Karlstrom

Tabs are a no no in any serious development.

This is not an argument
The look must be consistent across machines and editors.

Consistent, yes, but not necessarily the same.
Tabs Are Evil.

Not an argument.
http://www.emacswiki.org/emacs/TabsAreEvil


of course in a perfect world it would matter - set the tab and bang
.... but its not a real perfect world. And tabs and spaces get mixed.

http://www.pluralsight-training.net/community/blogs/craig/archive/2004/12/08/3775.aspx

As long as you keep it simple and stay away from "fancy formatting" I
see no problems with tabs. With fancy formatting I mean lining up lines
at a column which is not a multiple of the basic indentation step.

To me using spaces for indentation is analogous to using consecutive BR
elements to separate paragraphs instead of using the P element in HTML.

One drawback of this fancy formatting is also that it often requires
lots of editing in terms of reindenting surrounding lines e.g. when an
identifier is renamed. Compare

int a[] = {
3,
5,
...
};

to

int a[] = {3,
5,
...};


On the other hand, if you *have to* do fancy formatting I would also
recommend spaces for indentation.

/August
 
C

Chad

On one of my systems, ERANGE is 34.

Wouldn't it suck to have a program blow up claiming I hadn't
provided a valid number every time I tried to enter "34"?


I'm still not *getting it* like what I should. I suppose I would ask
for clarification, but I'm not too sure what I'm confused about. All I
know for sure is that I'm still somewhat confused.

Chad
 
I

Ike Naar

IMHO it's safest to use `<` and follow the loop body with

assert(i == N);

if the code actually depends on that. That might not be the case if `N`
is allowed to be negative (in that case, `i != N` as the loop condition
will obviously screw up).


Or it might not; `i` may eventually reach `N`, possibly after a
wraparound. If the width of `i` is small (say, 16 bits), you may not
even notice the delay.

Indeed, might or might not. But by making the loop condition as weak
as possible you maximize the potential for nontermination in case
of a screwed-up loop control.
Note that if the loop terminates with the weaker condition, it will
certainly terminate with the stronger condition.
 
K

Keith Thompson

Ike Naar said:
Indeed, might or might not. But by making the loop condition as weak
as possible you maximize the potential for nontermination in case
of a screwed-up loop control.

And is that necessarily a bad thing?

If you screw up the loop control, the program is wrong and needs
to be fixed. If the symptom is an infinite loop (or, say, a crash
when you attempt to access past the end of an array), you're more
likely to be aware of the problem.
Note that if the loop terminates with the weaker condition, it will
certainly terminate with the stronger condition.

If it matters (if the weaker and stronger conditions are not
equivalent), then you have a buggy program.

Having said that, I tend to write things like

for (i = 0; i < N; i ++) ...

myself.

<OT>C++ iterators tend to encourage equality tests for loop
termination.</OT>
 
S

Seebs

I'm still not *getting it* like what I should. I suppose I would ask
for clarification, but I'm not too sure what I'm confused about. All I
know for sure is that I'm still somewhat confused.

Walk through what the program does in two cases:

1. The conversion fails, and errno is set to erange.
-> v is 0
-> errno is ERANGE
-> we return errno, which is ERANGE (which is 34 on my system)
2. The conversion succeeds, and the string converted was "34"
-> v is 34
-> we return v, which is 34

Now. How is the caller supposed to tell which of these two things
happened? How can they tell whether the program succeeded or failed?

For any function like this, you really need to separate out the value
returned from information about whether or not a value was returned,
by ensuring that any error return is an IMPOSSIBLE value for the
"successful" returns.

Consider getchar(). On typical systems (where int has a larger range
than unsigned char), EVERY legitimate non-error value is non-negative, and
the error indicator is negative. There is NO character which can occur
in a stream which can be mistaken for EOF, because characters (when
converted to unsigned char) are never negative values, and thus, the
conversion to int is never negative*, so EOF can ALWAYS be distinguished.

If you don't have a way to guarantee that the range of meaningful values
is definitely smaller than the range of int, though, you need to find
some way to distinguish between those two pieces of information; you
can't just store them in the same int object.

-s
[*] There have been machines on which unsigned char had as many values as
int, and some of them did not convert to non-negative values, but they're
very unusual and don't do much standard I/O stream work, usually.
 
B

Barry Schwarz

Walk through what the program does in two cases:

1. The conversion fails, and errno is set to erange.
-> v is 0
-> errno is ERANGE
-> we return errno, which is ERANGE (which is 34 on my system)

The only time strtol returns 0 as an error indicator is if no
conversion is performed. In that case errno is not set to ERANGE.

The only time errno is set to ERANGE is if the converted value is out
of range for a long. In that case, strtol returns either LONG_MAX or
LONG_MIN. The assignment of either of these values to an int will
either result in a non-zero value (when int and long have the same
range of values), undefined behavior (C89), or an implementation
defined result or an implementation defined signal (C99). It would
seem strange to me for the implementation defined result to be 0 but
that is the only way for your postulated result to occur.
2. The conversion succeeds, and the string converted was "34"
-> v is 34
-> we return v, which is 34

Now. How is the caller supposed to tell which of these two things
happened? How can they tell whether the program succeeded or failed?

For any function like this, you really need to separate out the value
returned from information about whether or not a value was returned,
by ensuring that any error return is an IMPOSSIBLE value for the
"successful" returns.

Consider getchar(). On typical systems (where int has a larger range
than unsigned char), EVERY legitimate non-error value is non-negative, and
the error indicator is negative. There is NO character which can occur
in a stream which can be mistaken for EOF, because characters (when
converted to unsigned char) are never negative values, and thus, the
conversion to int is never negative*, so EOF can ALWAYS be distinguished.

If you don't have a way to guarantee that the range of meaningful values
is definitely smaller than the range of int, though, you need to find
some way to distinguish between those two pieces of information; you
can't just store them in the same int object.

-s
[*] There have been machines on which unsigned char had as many values as
int, and some of them did not convert to non-negative values, but they're
very unusual and don't do much standard I/O stream work, usually.
 
I

Ike Naar

And is that necessarily a bad thing?

No it's a good thing! Use the weakest loop condition that will do the job.
If you screw up the loop control, the program is wrong and needs
to be fixed. If the symptom is an infinite loop (or, say, a crash
when you attempt to access past the end of an array), you're more
likely to be aware of the problem.


If it matters (if the weaker and stronger conditions are not
equivalent), then you have a buggy program.

Yes, but that's what the argument is about: which loop condition
works best in case of a program bug.
Having said that, I tend to write things like

for (i = 0; i < N; i ++) ...

myself.

And, for the reasons stated above, I'd write i!=N.
Of course, if i<=N is an invariant of the loop, loop conditions
i<N and i!=N are equivalent.
But if the loop has a bug (e.g. if the body of the loop does not
preserve the invariant), loop condition i<N may cause the loop to
terminate in a state where i>N (which violates the invariant).
In my opinion, it's better not to terminate than to terminate
in an invalid state. Marcin Grzegorczyk suggested to use i<N
as the loop condition, in combination with an assert(i<=N) after
the loop. That's okay. But I'd rather use i!=N as the loop
condition, and place the assert(i<=N) in the loop body.
 
N

Nick

Ike Naar said:
No it's a good thing! Use the weakest loop condition that will do the
job.

I'm not actually convinced that these arguments really stack up. If
your code is written such that there's a likelihood that the loop index
will suddenly leap past the terminating condition then lots of other
things are probably wrong, and you might well want to find them out.
And I think you said that below in the bit I've snipped.

OTOH, it's certainly the familiar idiom in C, and anyone seeing
"for(...;i != 10; ...) will wonder why you've done it that way. That
seems to me to be a good enough reason in its own right to stick to the
traditional < form.
 
N

namekuseijin

WANTED: want to find whether a words exists inside a string or not and
convert it into int (its a number).

download:
http://www.pcre.org/

use:
#include <pcre.h>
....
pcre *regex;
const char *error;
int erroffset;
char *pattern = "([0-9]+)";
char *string = "The answer to the Life, The Universe and Everything
is... 42";
regex = pcre_compile(pattern,0,&error,&erroffset,NULL);
if (!regex) exit(1);
....
int result;

int sz = 30;
int v[sz];
result = pcre_exec(regexp,NULL,string,strlen(string),0,0,v,sz);
//0 = found, -1 = not found
if (result == 0)
{
int start = v[???]; //start of match
int end = (v[????]);//end of match
}
....
pcre_free(regexp);

read this for a better example:
http://www.haifux.org/lectures/156/PCRE-Perl_Compatible_Regular_Expression_Library.pdf

I would rather just use a line of perl, though... :)
 
A

arnuld

With this I get:

[arnuld@dune programs]$ ./a.out
v_str = ""
v_int = 0

Okay, let's be more specific.

When using ``get_value_using_key'', do you get what you want if you
search for "ant" in "Size of elephant: huge\nSize of ant: tiny" ?

The software I am working on does not have these kind of matches. Its has
only unique keywords, means if there is an ant: tiny, then it is sure
there will be nothing like elephant: huge as ant: is unique

But, even after all this I want to learn how to write code which makes
sure if I want ant: then ant: tiny is searched not the elephant: huge. So
I want to take this discussion forward to code :)


Alternatively, when using ``get_int_value_using_char_key'', do you get
what you want if you search for "ant" in "Size of elephant:
9999999\nSize of ant: 1" ?
The point is that the method of using ``strstr'' to find the key in the
string table is rather crude, and you may get unexpected matches if the
key appears as part of another word.

right. I need a solution for that bug.
 
L

luserXtrog

With this I get:
[arnuld@dune programs]$ ./a.out
v_str = ""
v_int = 0
Okay, let's be more specific.
When using ``get_value_using_key'', do you get what you want if you
search for "ant" in "Size of elephant: huge\nSize of ant: tiny" ?

The software I am working on does not have these kind of matches. Its has
only unique keywords, means if there is an ant: tiny, then it is sure
there will be nothing like elephant: huge as ant: is unique

But, even after all this I want to learn how to write  code which makes
sure if I want ant: then ant: tiny is searched not the elephant: huge. So
I want to take this discussion forward to code :)
Alternatively, when using ``get_int_value_using_char_key'', do you get
what you want if you search for "ant" in "Size of elephant:
9999999\nSize of ant: 1" ?
The point is that the method of using ``strstr'' to find the key in the
string table is rather crude, and you may get unexpected matches if the
key appears as part of another word.

right. I need a solution for that bug.

Add context.

strstr("Size of elephant: 9999999\nSize of ant: 1",
"Size of elephant: ");
strstr("Size of elephant: 9999999\nSize of ant: 1",
"Size of ant: ");
 
A

arnuld

-Eliminate the undefined behavior.
done.


-Delete the useless cast.

Will see it later. Right now focusing on 1 part of program only, See the
code down.
-Eliminate the wasted vertical space. -Don't use tabs for indenting.
-Indent consistently.

I don't get it. I use Emacs and using tab whatever default indenting is
there I use it. Normal indentation is 2 space when I press tab for
writing function definition and from there indentation increases at the
rate of 2 spaces per indentation. I think 2 spaces are fine. You have
some other/better way ?

-Align your braces. Some of your } are under the corresponding {;
others are not.
-Delete unused enum values.

done. See if this is fine, only then we will go to next "char to int"
function commented here.


#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>


enum
{
VAL_ERR = -1,
VAL_TRUE = 1,
SIZE_VALUE = 10,
ERR_CONVERSION_CHAR_TO_INT = -10
};


/* int get_int_value_using_char_key(const char* t, const char* k); */
void get_value_using_key(const char* t, const char* k, char* f);


int main(void)
{
int v_int = VAL_ERR;
char v_str[SIZE_VALUE+1] = {0};

/* const char* desc = "This is the value of Content: 123\n and this is
Size: 98"; */
const char* desc = "Size of elephant: huge\nSize of ant: tiny";
char* k1 = "ant";
char* k2 = "Size";

get_value_using_key(desc, k1, v_str);
printf("v_str = \"%s\"\n", v_str);
/*
v_int = get_int_value_using_char_key(desc, k2);
printf("v_int = %d\n", v_int);
*/

return 0;
}



/* returns the integer value corresponding to the key mentioned as const
char* k
int get_int_value_using_char_key(const char* t, const char* k)
{
int v = 0;
char value_str[SIZE_VALUE+1] = {0};

get_value_using_key(t, k, value_str);

if(value_str[0])
{
v = strtol(value_str, (char**) NULL, 10);
if(0 == v && ERANGE == errno) return ERR_CONVERSION_CHAR_TO_INT;
}

return v;
}
*/


void get_value_using_key(const char* t, const char* k, char* fill_value)
{
int f_idx;
const int fill_size = SIZE_VALUE; /* can not use strlen(f) here as
fill_value is full of null characters */
char* p = NULL;
size_t len_of_key = strlen(k);

while(VAL_TRUE)
{
p = strstr(t, k);
if(NULL == p) return;

if(*(p-1) == ' ')
{
p += len_of_key;

if( *p == ':' && *(p+1) == ' ' )
{
p += 2; /* Removing the colon and next space after colon */
for(f_idx = 0; (f_idx != fill_size) && (*p != '\n'); ++p, +
+f_idx)
{
fill_value[f_idx] = *p;
}
break;
}
}
else
{
t = p + len_of_key;
}
}
}
======================= OUTPUT =====================
[arnuld@dune programs]$ gcc -ansi -pedantic -Wall -Wextra get-value-from-
key.c
get-value-from-key.c: In function ‘main’:
get-value-from-key.c:28: warning: unused variable ‘k2’
get-value-from-key.c:22: warning: unused variable ‘v_int’
[arnuld@dune programs]$ ./a.out
v_str = "tiny"
[arnuld@dune programs]$
 
B

Ben Bacarisse

arnuld said:
I don't get it. I use Emacs and using tab whatever default indenting is
there I use it. Normal indentation is 2 space when I press tab for
writing function definition and from there indentation increases at the
rate of 2 spaces per indentation. I think 2 spaces are fine. You have
some other/better way ?

You have tabs and spaces making up the indent on some lines. I must
have the same tab setting in my news reader as you do in Emacs because I
see your indents as you intend them to, but that won't be true for
everyone.

[You can set indent-tabs-mode to nil in Emacs and it will only use
spaces when indenting.]

Your use of a "double indent" for block contents (one for the { and
another for the content) is uncommon. Creativity in programming is very
important, but individual flair in coding style is not usually and
advantage.
done. See if this is fine, only then we will go to next "char to int"
function commented here.

Looks to have gone wrong in some way:

int main(void)
{
}

/* returns the integer value corresponding to the key mentioned as const
char* k
int get_int_value_using_char_key(const char* t, const char* k)
{

This whole function seems to indented.
int v = 0;
char value_str[SIZE_VALUE+1] = {0};

get_value_using_key(t, k, value_str);

if(value_str[0])
{
v = strtol(value_str, (char**) NULL, 10);

But the body of this if is not.
if(0 == v && ERANGE == errno) return ERR_CONVERSION_CHAR_TO_INT;
}

return v;
}

<snip>
 
I

Ike Naar

void get_value_using_key(const char* t, const char* k, char* fill_value)
{
int f_idx;
const int fill_size = SIZE_VALUE; /* can not use strlen(f) here as
fill_value is full of null characters */
char* p = NULL;
size_t len_of_key = strlen(k);

while(VAL_TRUE)
{
p = strstr(t, k);
if(NULL == p) return;

if(*(p-1) == ' ')

If t starts with k (which is not impossible), then strstr(t, k) equals p, and
the behaviour of evaluating p[-1] will be undefined. You probably want here:

if (p == t || p[-1] == ' ')
{
p += len_of_key;

if( *p == ':' && *(p+1) == ' ' )
{
p += 2; /* Removing the colon and next space after colon */
for(f_idx = 0; (f_idx != fill_size) && (*p != '\n'); ++p, +
+f_idx)
{
fill_value[f_idx] = *p;
}

At this point, if you add:

fill_value[f_idx] = '\0';

then the string in fill_value will be nicely zero-terminated, even when the
caller of this function fails to provide a zero-filled fill_value array.
 
B

Barry Schwarz


Not really.

snip
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>


enum
{
VAL_ERR = -1,
VAL_TRUE = 1,
SIZE_VALUE = 10,
ERR_CONVERSION_CHAR_TO_INT = -10
};


/* int get_int_value_using_char_key(const char* t, const char* k); */

There is no need to remove a prototype when you are not ready to test
the function. This just adds one more place you need to remember to
uncomment when you are ready to test.
void get_value_using_key(const char* t, const char* k, char* f);


int main(void)
{
int v_int = VAL_ERR;
char v_str[SIZE_VALUE+1] = {0};

/* const char* desc = "This is the value of Content: 123\n and this is
Size: 98"; */
const char* desc = "Size of elephant: huge\nSize of ant: tiny";
char* k1 = "ant";
char* k2 = "Size";

get_value_using_key(desc, k1, v_str);
printf("v_str = \"%s\"\n", v_str);
/*
v_int = get_int_value_using_char_key(desc, k2);
printf("v_int = %d\n", v_int);
*/

Since it is entirely possible for the block of code you wish to
suppress to contain comments, using /* and */ brackets is of limited
utility. You are probably better off using #if 0 and #endif brackets.
return 0;
}



/* returns the integer value corresponding to the key mentioned as const
char* k
int get_int_value_using_char_key(const char* t, const char* k)
{
int v = 0;
char value_str[SIZE_VALUE+1] = {0};

get_value_using_key(t, k, value_str);

if(value_str[0])
{
v = strtol(value_str, (char**) NULL, 10);
if(0 == v && ERANGE == errno) return ERR_CONVERSION_CHAR_TO_INT;
}

return v;
}
*/


void get_value_using_key(const char* t, const char* k, char* fill_value)
{
int f_idx;
const int fill_size = SIZE_VALUE; /* can not use strlen(f) here as
fill_value is full of null characters */

This comment is true only the first time you call the function with a
particular output array. The code prevents you from calling the
function with different output arrays of different sizes. The common
convention is to pass the size of the array to the function as an
argument in the calling statement.
char* p = NULL;
size_t len_of_key = strlen(k);

while(VAL_TRUE)
{
p = strstr(t, k);
if(NULL == p) return;

This will prevent you from detecting the condition where the word is
not present inside the string if you call the function a second time.
You might want to consider setting fill_value[0] to '\0' at the start
of the function.

However, that will fail to distinguish the case where the word is not
found and the case where the word is at the end of the line
(as in "x ant: \n"). One solution to this problem is to return a
status value telling the user if the function was successful or not.
if(*(p-1) == ' ')

Ike has already identified the potential problem here.
{
p += len_of_key;

if( *p == ':' && *(p+1) == ' ' )
{
p += 2; /* Removing the colon and next space after colon */
for(f_idx = 0; (f_idx != fill_size) && (*p != '\n'); ++p, +
+f_idx)

There is no '\n' after the word "tiny" in your string literal. How
did this loop end without stepping of the end of your string literal
and invoking undefined behavior.
{
fill_value[f_idx] = *p;
}
break;
}
}
else
{
t = p + len_of_key;
}
}
}
======================= OUTPUT =====================
[arnuld@dune programs]$ gcc -ansi -pedantic -Wall -Wextra get-value-from-
key.c
get-value-from-key.c: In function ‘main’:
get-value-from-key.c:28: warning: unused variable ‘k2’
get-value-from-key.c:22: warning: unused variable ‘v_int’
[arnuld@dune programs]$ ./a.out
v_str = "tiny"

You only received this illusion of correct output because your string
literal had a terminating '\0' which was also copied and that caused
printf to stop printing. However, the loop that filled fill_value did
invoke undefined behavior which unfortunately had no obvious
manifestation. Run the loop step by step under a debugger and see for
yourself.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

Forum statistics

Threads
473,954
Messages
2,570,114
Members
46,702
Latest member
VernitaGow

Latest Threads

Top