strtok()

Mark

Hi

I'm trying to write a simple parser for my application, the purpose is to
allow application understand the command line arguments in the form:

my_app 1-3,5,9
or
my_app 1,4,8-24
....

so it should support both ranges and enumerators. But my function doesn't
print what I expect:

int parseLine(char *buf)
{
    char *token, *subtoken;
    char buftmp[20];

    for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ",")) {
        printf("%s: ", token);
        strcpy(buftmp, token); /* strtok modifies buffer, so we save a copy */
        for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
             subtoken = strtok(NULL, "-")) {
            printf("%s ", buf,subtoken);
        }
        putchar('\n');
    }

    return 0;
}

For example, buf="1-3,5,8", and I'd expect to have such output:
1-3: 1 3
5: 5
8: 8

Where is my mistake?
Thanks!
 
Malcolm McLean

Hi

I'm trying to write a simple parser for my application, the purpose is to
allow application understand the command line arguments in the form:

my_app 1-3,5,9
or
my_app 1,4,8-24
...

so it should support both ranges and enumerators. But my function doesn't
print what I expect:

int parseLine(char *buf)
{
    char *token, *subtoken;
    char buftmp[20];

    for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ","))
{
            for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
             subtoken = strtok(NULL, "-")) {
            printf("%s ", buf,subtoken);
        }
        putchar('\n');
    }
Where is my mistake?

Nesting strtok() calls. The function uses a static variable to store the
current pointer position, which you then overwrite with the nested call.
strtok is basically a bad function. Write your own strsplit() instead,
returning a list of strings in allocated memory.
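
A minimal sketch of what such a strsplit() might look like. The name, signature, and the companion strsplit_free() are assumptions made for illustration; nothing like this exists in the standard library. Like strtok(), this version skips empty fields, but it leaves the input untouched and keeps no hidden state, so calls can be nested freely.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split s on any character in delims.
 * Returns a NULL-terminated array of newly allocated strings and
 * stores the number of fields in *count (if count is not NULL).
 * Returns NULL on allocation failure. The input is not modified. */
char **strsplit(const char *s, const char *delims, size_t *count)
{
    size_t n = 0, cap = 4;
    char **out = malloc(cap * sizeof *out);

    if (out == NULL)
        return NULL;

    while (*s != '\0') {
        size_t len;

        s += strspn(s, delims);          /* skip leading delimiters */
        len = strcspn(s, delims);        /* length of the next field */
        if (len == 0)
            break;
        if (n + 1 >= cap) {              /* keep a slot for the NULL terminator */
            char **tmp = realloc(out, 2 * cap * sizeof *tmp);
            if (tmp == NULL)
                goto fail;
            out = tmp;
            cap *= 2;
        }
        out[n] = malloc(len + 1);
        if (out[n] == NULL)
            goto fail;
        memcpy(out[n], s, len);
        out[n][len] = '\0';
        n++;
        s += len;
    }
    out[n] = NULL;
    if (count != NULL)
        *count = n;
    return out;

fail:
    while (n > 0)
        free(out[--n]);
    free(out);
    return NULL;
}

/* Release everything returned by strsplit(). */
void strsplit_free(char **parts)
{
    size_t i;

    if (parts == NULL)
        return;
    for (i = 0; parts[i] != NULL; i++)
        free(parts[i]);
    free(parts);
}
```

With this, the outer loop would call strsplit(buf, ",", &n) and the inner loop strsplit(parts[i], "-", &m), freeing each result with strsplit_free() when done.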
 
Ben Bacarisse

Mark said:
I'm trying to write a simple parser for my application, the purpose is
to allow application understand the command line arguments in the
form:

my_app 1-3,5,9
or
my_app 1,4,8-24
...

so it should support both ranges and enumerators. But my function
doesn't print what I expect:

int parseLine(char *buf)
{
char *token, *subtoken;
char buftmp[20];

for (token = strtok(buf, ","); token != NULL; token = strtok(NULL,
",")) {
printf("%s: ", token);
strcpy(buftmp, token); /* strtok modifies buffer, so we save
a copy */
for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
subtoken = strtok(NULL, "-")) {
printf("%s ", buf,subtoken);

The problem with strtok has been pointed out, but you can continue to
use it for the outer loop because you don't really need a second, nested
strtok here. You expect only one pair or maybe a lone number, and you can
parse that using sscanf:

sscanf(token, "%d-%d", &low, &high)

will return 1 for lone numbers, 2 for a pair like 1-3 and anything else
is an error and needs to be reported.

If you need to check that there are no other characters in the token you
could do something like this:

sscanf(token, "%d%n-%d%n", &low, &len1, &high, &len2)

Now, you need a return of 1 and strlen(token) == len1 or a return of 2
and strlen(token) == len2. Again, anything else is an error.
}
putchar('\n');
}

return 0;
}

<snip>
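
Putting the sscanf/%n check described above into a function might look like the sketch below. The helper name parse_range() and its return convention are assumptions for illustration only. Note that %d also accepts a sign and leading white space, so "-3" would be taken as a lone negative number; reject that separately if it matters.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse "N" or "N-M" from token.
 * Returns 1 for a lone number (and sets *high = *low),
 * 2 for a range, 0 for anything else. */
int parse_range(const char *token, int *low, int *high)
{
    int len1 = -1, len2 = -1;   /* stay -1 if the matching %n is never reached */
    int n = sscanf(token, "%d%n-%d%n", low, &len1, high, &len2);

    if (n == 1 && len1 >= 0 && (size_t)len1 == strlen(token)) {
        *high = *low;
        return 1;
    }
    if (n == 2 && len2 >= 0 && (size_t)len2 == strlen(token))
        return 2;
    return 0;                   /* trailing junk, missing number, or no match */
}
```

The outer strtok(buf, ",") loop can then call parse_range(token, &low, &high) on each token and report an error whenever it returns 0.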
 
Eric Sosman

Hi

I'm trying to write a simple parser for my application, the purpose is
to allow application understand the command line arguments in the form:

my_app 1-3,5,9
or
my_app 1,4,8-24
...

so it should support both ranges and enumerators. But my function
doesn't print what I expect:

int parseLine(char *buf)
{
char *token, *subtoken;
char buftmp[20];

for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ",")) {
printf("%s: ", token);
strcpy(buftmp, token); /* strtok modifies buffer, so we save a copy */
for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
subtoken = strtok(NULL, "-")) {
printf("%s ", buf,subtoken);
}
putchar('\n');
}

return 0;
}

For example, buf="1-3,5,8", and I'd expect to have such output:
1-3: 1 3
5: 5
8: 8

Where is my mistake?

strtok() doesn't "nest:" It can be working on only one source
string at a time. When you call strtok(buftmp,...), it forgets
about the "outer" string.

If your system has the (non-Standard) strtok_r() function, you
might be able to use that instead of strtok().
 
Nick Keighley

Hi

I'm trying to write a simple parser for my application, the purpose is to
allow application understand the command line arguments in the form:

my_app 1-3,5,9
or
my_app 1,4,8-24
...

so it should support both ranges and enumerators. But my function doesn't
print what I expect:

int parseLine(char *buf)
{
    char *token, *subtoken;
    char buftmp[20];

    for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ","))
{
        printf("%s: ", token);
        strcpy(buftmp, token);    /* strtok modifies buffer, so we save a
copy */
        for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
             subtoken = strtok(NULL, "-")) {
            printf("%s ", buf,subtoken);
        }
        putchar('\n');
    }

    return 0;

}

For example, buf="1-3,5,8", and I'd expect to have such output:
1-3: 1 3
5: 5
8: 8

It would be nice if you told us what it did instead...
Other posters have pointed out the nesting problem.
Also note that strtok() modifies the string it's parsing, so beware:

parseLine ("1-3,5,6");

might give a problem (it's actually undefined behaviour to modify a
string literal)
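
One way to stay safe is to tokenize a private, writable copy, so that callers may pass a literal. The sketch below is only an illustration; count_fields_copy() is a made-up name standing in for whatever the real parsing function does.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: counts comma-separated fields in line.
 * strtok() writes '\0' bytes into its argument, so we hand it a
 * heap copy rather than the caller's (possibly read-only) string.
 * Returns the field count, or -1 if the copy cannot be allocated. */
int count_fields_copy(const char *line)
{
    size_t len = strlen(line);
    char *copy = malloc(len + 1);
    char *tok;
    int n = 0;

    if (copy == NULL)
        return -1;
    memcpy(copy, line, len + 1);         /* includes the terminating '\0' */

    for (tok = strtok(copy, ","); tok != NULL; tok = strtok(NULL, ","))
        n++;

    free(copy);
    return n;
}
```

Alternatively, the caller can declare a modifiable array, as in char line[] = "1-3,5,6"; and pass that to the original function.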
 
Keith Thompson

Ben Bacarisse said:
The problem with strtok has been pointed out, but you can continue to
use it because you don't really need it here. You expect only one pair
or maybe a lone number and you can parse that using sscanf:

sscanf(token, "%d-%d", &low, &high)

will return 1 for lone numbers, 2 for a pair like 1-3 and anything else
is an error and needs to be reported.
[...]

Keep in mind that sscanf's behavior is undefined if you scan a number
outside the range of the specified type. For example,
if INT_MAX==32767, then this:

sscanf("40000-50000", "%d-%d", &low, &high);

has undefined behavior. Which is a great pity; it makes the *scanf()
functions very difficult to use safely for numeric input.

With a bit of extra work, you can use the strto*() functions instead;
they're sane enough to tell you if the value is out of range (by
returning an extreme value and setting errno to ERANGE).
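
A rough sketch of the strtol() approach follows. The helper name parse_int() and its calling convention are assumptions for illustration. It clears errno before the call, checks that at least one digit was consumed, and rejects values that do not fit in an int.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal int starting at *p.
 * On success stores the value in *out, advances *p past the
 * number, and returns 0. Returns -1 if there are no digits or
 * the value is outside the range of int; *p is then unchanged. */
int parse_int(const char **p, int *out)
{
    char *end;
    long v;

    errno = 0;                       /* strtol only sets errno on error */
    v = strtol(*p, &end, 10);

    if (end == *p)
        return -1;                   /* nothing was converted */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -1;                   /* out of range for long or for int */

    *out = (int)v;
    *p = end;
    return 0;
}
```

After a successful call, the caller can look at **p to see whether a '-' (range) or the end of the token follows. Like %d, strtol() accepts leading white space and a sign, so stricter input needs an extra check.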
 
Mark

Keith Thompson wrote:
[skip]
With a bit of extra work, you can use the strto*() functions instead;
they're sane enough to tell you if the value is out of range (by
returning an extreme value and setting errno to ERANGE).

My system's strtok man page (Fedora Core 6) doesn't say anything about
returning extreme value or setting errno to ERANGE.
 
Mark

Vincenzo said:
I've written a scratch I hope will serve. Beware that maybe I am
missing some error checkings, also you couldn't write white spaces
between the separators "," , "-" and numbers. I didn't add any checks

Thanks, I'll give it a try.
 
Mark

Eric Sosman wrote:
[skip]
strtok() doesn't "nest:" It can be working on only one source
string at a time. When you call strtok(buftmp,...), it forgets
about the "outer" string.

If your system has the (non-Standard) strtok_r() function, you
might be able to use that instead of strtok().

So for strtok_r() it's safe to pass the same buffer pointer? Like this:

for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ",")) {
printf("%s: ", token);
/* no need to keep a copy of 'buf' */
for (subtoken = strtok(buftmp, "-"); subtoken != NULL; subtoken =
strtok(NULL, "-")) {
printf("%s ", buf,subtoken);
}
}
 
Mark

One more question; when I compile code featuring strtok_r() with
"gcc -ansi -pedantic -W -Wall" it naturally complains:

warning: implicit declaration of function 'strtok_r'
warning: assignment makes pointer from integer without a cast

First warning is clear, the second refers to strtok_r() call:

char *token;
char *saveptr1 = NULL, *saveptr2 = NULL;
token = strtok_r(buf, ",", &saveptr1);

I wonder, what is the compiler's logic here: if in ANSI mode a function is
not prototyped, then the compiler considers that such functions return
'int', but it actually return 'char *', is that correct?

These warnings are gone, when compiled with "-posix -W -Wall"
 
Keith Thompson

Mark said:
One more question; when I compile code featuring strtok_r() with
"gcc -ansi -pedantic -W -Wall" it naturally complains:

warning: implicit declaration of function 'strtok_r'
warning: assignment makes pointer from integer without a cast

First warning is clear, the second refers to strtok_r() call:

char *token;
char *saveptr1 = NULL, *saveptr2 = NULL;
token = strtok_r(buf, ",", &saveptr1);

I wonder, what is the compiler's logic here: if in ANSI mode a function is
not prototyped, then the compiler considers that such functions return
'int', but it actually return 'char *', is that correct?

That's correct. In C90, a reference to an undeclared function
effectively creates an implicit declaration for the function assuming it
returns int and takes a fixed but unspecified number and type of
arguments. So writing
token = strtok_r(buf, ",", &saveptr1);
implicitly declares
int strtok_r();

In C99, a reference to an undeclared function is a constraint violation.
Even in C90, it's poor style to depend on it; functions should be
declared, preferably by #include'ing the appropriate header.

These warnings are gone, when compiled with "-posix -W -Wall"

Probably "-posix" causes the declaration of strtok_r to become visible.
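
On glibc-based systems, a common way to make the declaration visible even in a strict ISO mode is to define a POSIX feature test macro before the first #include. The sketch below assumes such a system; count_fields() is a made-up function used only to exercise strtok_r(), and it is written in C90 style so that it suits "-ansi -pedantic".

```c
/* Must come before any #include. Assumption: a POSIX system whose
 * <string.h> honours this feature test macro (glibc does). */
#define _POSIX_C_SOURCE 200809L
#include <string.h>

/* Hypothetical example: count comma-separated fields with strtok_r().
 * buf must be modifiable, because strtok_r() writes '\0' into it. */
int count_fields(char *buf)
{
    char *save = NULL;
    char *tok;
    int n = 0;

    for (tok = strtok_r(buf, ",", &save); tok != NULL;
         tok = strtok_r(NULL, ",", &save))
        n++;
    return n;
}
```

With the macro in place, the compiler sees the real prototype, char *strtok_r(char *, const char *, char **), so neither the implicit-declaration warning nor the pointer-from-integer warning should appear.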
 
Keith Thompson

Ian Collins said:
Keith Thompson wrote:
[skip]
With a bit of extra work, you can use the strto*() functions instead;
they're sane enough to tell you if the value is out of range (by
returning an extreme value and setting errno to ERANGE).
My system's strtok man page (Fedora Core 6) doesn't say anything about
returning extreme value or setting errno to ERANGE.

I'm sure Keith was referring to strtol() and strtoll()

Yes, along with strtoul(), strtoull(), strtod(), strtof(), and
strtold(). I didn't notice that "strtok" matches the same pattern
(because the "to" in "strtok" is part of "tok", an abbreviation of
"token", not the word "to").
 
Eric Sosman

Eric Sosman wrote:
[skip]
strtok() doesn't "nest:" It can be working on only one source
string at a time. When you call strtok(buftmp,...), it forgets
about the "outer" string.

If your system has the (non-Standard) strtok_r() function, you
might be able to use that instead of strtok().

So for strtok_r() it's safe to pass the same buffer pointer? Like this:

for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ",")) {
printf("%s: ", token);
/* no need to keep a copy of 'buf' */
for (subtoken = strtok(buftmp, "-"); subtoken != NULL; subtoken = strtok(NULL, "-")) {
printf("%s ", buf,subtoken);
}
}

I don't see *any* strtok_r() calls here ...

Ordinary strtok() returns a pointer to the start of a token,
and remembers where it ends so it knows where to start the next
search. This is why it doesn't nest: It can only remember one
restart point in its internal variable.

The non-Standard strtok_r() function behaves similarly, but
uses a caller-provided variable to store the restart point. If
the caller uses one variable for the "outer" calls and another
for the "inners," the two scanning sequences won't interfere.

As for the copy, it's perfectly all right to do anything you
want to a substring located by strtok() or strtok_r(): Once it's
been located and divided from the surrounding string, they're done
with it and don't need it any more. (Well, "almost anything:" it
would be a bad idea to strcat() "Hello" onto its end, because that
would disrupt the still-unscanned part of the original string. But
as long as you stay within the bounds of the token string itself,
you can do whatever you like there.)
 
Gene

Hi

I'm trying to write a simple parser for my application, the purpose is to
allow application understand the command line arguments in the form:

my_app 1-3,5,9
or
my_app 1,4,8-24
...

so it should support both ranges and enumerators. But my function doesn't
print what I expect:

int parseLine(char *buf)
{
    char *token, *subtoken;
    char buftmp[20];

    for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ","))
{
        printf("%s: ", token);
        strcpy(buftmp, token);    /* strtok modifies buffer, so we save a
copy */
        for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
             subtoken = strtok(NULL, "-")) {
            printf("%s ", buf,subtoken);
        }
        putchar('\n');
    }

    return 0;

}

For example, buf="1-3,5,8", and I'd expect to have such output:
1-3: 1 3
5: 5
8: 8

Where is my mistake?
Thanks!

I have been through this so many times: hacking up a little parser
with strtok() and sscanf()/atoi(), then throwing it away when the
input language gets just a bit more sophisticated. These days I
always go ahead and implement a traditional scanner and simple EBNF
parser. Once you have the framework, it's very quick to adapt it to
new problems, and it's liberating to know this extra power can be
tapped with no code rewriting. Here's what I'm talking about:

#include <stdio.h>
#include <ctype.h>

// Tokens our scanner can discover.
typedef enum token_e {
    T_NULL,
    T_ERROR,
    T_END_OF_INPUT,
    T_INT,
    T_COMMA,
    T_DASH,
} TOKEN;

// Encapsulated state of an input token scanner.
typedef struct scanner_state_s {
    char *text;   // Input to scan
    TOKEN token;  // Last token found.
    int p0, p1;   // Last token string is text[p0..p1).
} SCANNER_STATE;

// Initialize a scanner's state.
void init_scanner_state(SCANNER_STATE *ss, char *text)
{
    ss->text = text;
    ss->token = T_NULL;
    ss->p0 = ss->p1 = 0;
}

// Return current character, as an unsigned char value so that it
// is safe to pass to the ctype.h predicates.
static int current_char(SCANNER_STATE *ss)
{
    return (unsigned char)ss->text[ss->p1];
}

// Advance the scanner to the next character.
static void advance(SCANNER_STATE *ss)
{
    if (current_char(ss) != '\0')
        ++ss->p1;
}

// Return the current token.
TOKEN current_token(SCANNER_STATE *ss)
{
    return ss->token;
}

// Fetch the integer value of an INT token.
// Returns 0 on success, 1 if the current token is not an INT.
int get_int_value(SCANNER_STATE *ss, int *value) {
    if (ss->token == T_INT) {
        sscanf(&ss->text[ss->p0], "%d", value);
        return 0;
    }
    return 1;
}

// Mark the beginning of a token.
static void start_token(SCANNER_STATE *ss, TOKEN token)
{
    ss->p0 = ss->p1;
    ss->token = token;
}

// Action on discovering the end of a token.
static void end_token(SCANNER_STATE *ss)
{
    // Do nothing in this scanner.
    (void)ss;
}

// Scan a token without advancing the input.
static void scan_zero_char_token(SCANNER_STATE *ss, TOKEN token)
{
    start_token(ss, token);
    end_token(ss);
}

// Scan a single character token from the input.
static void scan_one_char_token(SCANNER_STATE *ss, TOKEN token)
{
    start_token(ss, token);
    advance(ss);
    end_token(ss);
}

// Scan the next token from the input.
void scan(SCANNER_STATE *ss)
{
    // Skip whitespace.
    while (isspace(current_char(ss))) advance(ss);

    // Use a switch() here if speed is necessary.
    // The if's let us use ctype.h predicates.
    if (isdigit(current_char(ss))) {
        start_token(ss, T_INT);
        do {
            advance(ss);
        } while (isdigit(current_char(ss)));
        end_token(ss);
    }
    else if (current_char(ss) == ',')
        scan_one_char_token(ss, T_COMMA);
    else if (current_char(ss) == '-')
        scan_one_char_token(ss, T_DASH);
    else if (current_char(ss) == '\0')
        scan_zero_char_token(ss, T_END_OF_INPUT);
    else
        scan_zero_char_token(ss, T_ERROR);
}

// Match a given token and scan past it to the next
// or else raise a syntax error if it's not there.
// It's usually best to longjmp out of the parser on error.
void match(SCANNER_STATE *ss, TOKEN token)
{
    if (current_token(ss) == token)
        scan(ss);
    else {
        fprintf(stderr, "syntax error (%d) at end of '%.*s'\n",
                ss->token, ss->p1 + 1, ss->text);
        ss->token = T_ERROR;
    }
}

// Parse the EBNF form: <range> ::= INT [ '-' INT ]
static void range(SCANNER_STATE *ss)
{
    int lo = 0, hi = 0;  // defined values even after a syntax error

    get_int_value(ss, &lo);
    match(ss, T_INT);

    if (current_token(ss) == T_DASH) {
        scan(ss);
        get_int_value(ss, &hi);
        match(ss, T_INT);
    }
    else
        hi = lo;

    // Action code.
    printf(lo == hi ? "%d\n" : "[%d-%d]\n", lo, hi);
}

// Parse the EBNF form:
// <line> ::= [ <range> { ',' <range> } ] END_OF_INPUT
void parse_line(char *text)
{
    SCANNER_STATE ss[1];

    init_scanner_state(ss, text);
    scan(ss); // scan the initial token

    if (current_token(ss) == T_INT) {

        range(ss);

        while (current_token(ss) == T_COMMA) {
            scan(ss);
            range(ss);
        }
    }
    match(ss, T_END_OF_INPUT);
}

// Simple test.
int main(int argc, char *argv[])
{
    if (argc == 2)
        parse_line(argv[1]);
    return 0;
}
 
