strtok()

Mark · Aug 3, 2010

Hi

I'm trying to write a simple parser for my application. The purpose is to
allow the application to understand command-line arguments of the form:

my_app 1-3,5,9
or
my_app 1,4,8-24
....

so it should support both ranges and enumerators. But my function doesn't
print what I expect:

int parseLine(char *buf)
{
    char *token, *subtoken;
    char buftmp[20];

    for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ","))
    {
        printf("%s: ", token);
        strcpy(buftmp, token); /* strtok modifies buffer, so we save a copy */
        for (subtoken = strtok(buftmp, "-"); subtoken != NULL;
             subtoken = strtok(NULL, "-")) {
            printf("%s ", buf,subtoken);
        }
        putchar('\n');
    }

    return 0;
}

For example, buf="1-3,5,8", and I'd expect to have such output:
1-3: 1 3
5: 5
8: 8

Where is my mistake?
Thanks!

Malcolm McLean · Aug 3, 2010

The mistake is nesting strtok() calls. The function uses a static variable
to store the current pointer position, which you then overwrite with the
nested call.
strtok is basically a bad function. Write your own strsplit() instead,
returning a list of strings in allocated memory.
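
A minimal sketch of what such a strsplit() might look like. The name comes from the post above; the signature, the single-character delimiter, and the strsplit_free() helper are assumptions made for illustration. Unlike strtok(), this version leaves the input untouched and keeps empty fields.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: frees a list returned by strsplit(). */
void strsplit_free(char **list)
{
    size_t i;

    if (list == NULL)
        return;
    for (i = 0; list[i] != NULL; i++)
        free(list[i]);
    free(list);
}

/* Hypothetical strsplit(): splits str at each occurrence of delim.
 * Returns a NULL-terminated array of newly allocated strings, or NULL
 * if an allocation fails. The input string is not modified. */
char **strsplit(const char *str, char delim)
{
    size_t count = 1, i;
    const char *p, *start;
    char **out;

    for (p = str; *p != '\0'; p++)      /* one field per delimiter, plus one */
        if (*p == delim)
            count++;

    out = malloc((count + 1) * sizeof *out);
    if (out == NULL)
        return NULL;
    for (i = 0; i <= count; i++)        /* keeps the list NULL-terminated */
        out[i] = NULL;

    i = 0;
    start = str;
    for (p = str; ; p++) {
        if (*p == delim || *p == '\0') {
            size_t len = (size_t)(p - start);

            out[i] = malloc(len + 1);
            if (out[i] == NULL) {
                strsplit_free(out);
                return NULL;
            }
            memcpy(out[i], start, len);
            out[i][len] = '\0';
            i++;
            if (*p == '\0')
                break;
            start = p + 1;
        }
    }
    return out;
}
```

Because each level of splitting gets its own list, calling strsplit(line, ',') and then strsplit(field, '-') on each element nests without any shared state.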

Ben Bacarisse · Aug 3, 2010

The problem with strtok has been pointed out, but you can continue to
use it for the outer, comma-separated split because you don't really need
it for the inner one. You expect only one pair or maybe a lone number, and
you can parse that using sscanf:

sscanf(token, "%d-%d", &low, &high)

will return 1 for lone numbers, 2 for a pair like 1-3 and anything else
is an error and needs to be reported.

If you need to check that there are no other characters in the token you
could do something like this:

sscanf(token, "%d%n-%d%n", &low, &len1, &high, &len2)

Now, you need a return of 1 and strlen(token) == len1 or a return of 2
and strlen(token) == len2. Again, anything else is an error.
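
Put together, the check described above might be sketched like this. The function name parse_range and its 0/-1 return convention are made up for illustration; only the sscanf format comes from the post.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: accepts "N" or "N-M" with nothing after it.
 * Returns 0 on success, -1 on a malformed token. */
int parse_range(const char *token, int *low, int *high)
{
    int len1 = -1, len2 = -1;
    int n = sscanf(token, "%d%n-%d%n", low, &len1, high, &len2);

    if (n == 1 && len1 == (int)strlen(token)) {
        *high = *low;   /* lone number: treat it as a one-element range */
        return 0;
    }
    if (n == 2 && len2 == (int)strlen(token))
        return 0;
    return -1;          /* empty token, missing number, or trailing junk */
}
```

Note that %d also accepts leading whitespace and a sign, so a token such as "-5" would pass this check as a lone negative number; reject negative values separately if they are not wanted.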

Eric Sosman · Aug 3, 2010

strtok() doesn't "nest": it can be working on only one source
string at a time. When you call strtok(buftmp,...), it forgets
about the "outer" string.

If your system has the (non-Standard) strtok_r() function, you
might be able to use that instead of strtok().

Nick Keighley · Aug 3, 2010

Mark said:
For example, buf="1-3,5,8", and I'd expect to have such output:
1-3: 1 3
5: 5
8: 8

It would be nice if you told us what it did instead...
Other posters have pointed out the nesting problem.
Also note that strtok() modifies the string it's parsing, so beware:

parseLine ("1-3,5,6");

might give a problem (it's actually undefined behaviour to modify a
string literal).

Keith Thompson · Aug 3, 2010

Ben Bacarisse said:
The problem with strtok has been pointed out, but you can continue to
use it because you don't really need it here. You expect only one pair
or maybe a lone number and you can parse that using sscanf:

sscanf(token, "%d-%d", &low, &high)

will return 1 for lone numbers, 2 for a pair like 1-3 and anything else
is an error and needs to be reported.

[...]

Keep in mind that sscanf's behavior is undefined if you scan a number
outside the range of the specified type. For example,
if INT_MAX==32767, then this:

sscanf("40000-50000", "%d-%d", &low, &high);

has undefined behavior. Which is a great pity; it makes the *scanf()
functions very difficult to use safely for numeric input.

With a bit of extra work, you can use the strto*() functions instead;
they're sane enough to tell you if the value is out of range (by
returning an extreme value and setting errno to ERANGE).
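
A minimal sketch of that extra work using strtol(). The helper name parse_int and its return convention are assumptions for illustration, not something from the thread.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: converts the whole string s to an int.
 * Returns 0 on success, -1 if s is not a number, has trailing
 * characters, or does not fit in an int. */
int parse_int(const char *s, int *out)
{
    char *end;
    long v;

    errno = 0;                      /* strtol() only sets errno on failure */
    v = strtol(s, &end, 10);
    if (end == s || *end != '\0')
        return -1;                  /* no digits, or junk after the number */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -1;                  /* out of range for long or for int */
    *out = (int)v;
    return 0;
}
```

The end pointer also makes it possible to parse a range by hand: after the first number, check for '-' at *end and call strtol() again from end + 1.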

Mark · Aug 4, 2010

Keith Thompson wrote:
[skip]

With a bit of extra work, you can use the strto*() functions instead;
they're sane enough to tell you if the value is out of range (by
returning an extreme value and setting errno to ERANGE).

My system's strtok man page (Fedora Core 6) doesn't say anything about
returning an extreme value or setting errno to ERANGE.

Mark · Aug 4, 2010

Vincenzo said:
I've written a scratch I hope will serve. Beware that maybe I am
missing some error checkings, also you couldn't write white spaces
between the separators "," , "-" and numbers. I didn't add any checks

Thanks, I'll give it a try.

Mark · Aug 4, 2010

Eric Sosman wrote:
[skip]

strtok() doesn't "nest:" It can be working on only one source
string at a time. When you call strtok(buftmp,...), it forgets
about the "outer" string.

If your system has the (non-Standard) strtok_r() function, you
might be able to use that instead of strtok().

So for strtok_r() it's safe to pass the same buffer pointer? Like this:

for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ",")) {
printf("%s: ", token);
/* no need to keep a copy of 'buf' */
for (subtoken = strtok(buftmp, "-"); subtoken != NULL; subtoken =
strtok(NULL, "-")) {
printf("%s ", buf,subtoken);
}
}

Mark · Aug 4, 2010

One more question; when I compile code featuring strtok_r() with
"gcc -ansi -pedantic -W -Wall" it naturally complains:

warning: implicit declaration of function 'strtok_r'
warning: assignment makes pointer from integer without a cast

First warning is clear, the second refers to strtok_r() call:

char *token;
char *saveptr1 = NULL, *saveptr2 = NULL;
token = strtok_r(buf, ",", &saveptr1);

I wonder, what is the compiler's logic here: if in ANSI mode a function is
not prototyped, then the compiler assumes that the function returns
'int', but it actually returns 'char *', is that correct?

These warnings are gone, when compiled with "-posix -W -Wall"

Keith Thompson · Aug 4, 2010

Mark said:
I wonder, what is the compiler's logic here: if in ANSI mode a function is
not prototyped, then the compiler considers that such functions return
'int', but it actually return 'char *', is that correct?

That's correct. In C90, a reference to an undeclared function
effectively creates an implicit declaration for the function assuming it
returns int and takes a fixed but unspecified number and type of
arguments. So writing

token = strtok_r(buf, ",", &saveptr1);

implicitly declares

int strtok_r();

In C99, a reference to an undeclared function is a constraint violation.
Even in C90, it's poor style to depend on it; functions should be
declared, preferably by #include'ing the appropriate header.

These warnings are gone, when compiled with "-posix -W -Wall"

Probably "-posix" causes the declaration of strtok_r to become visible.
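
One way to get a real declaration while keeping a strict ISO mode is a feature test macro defined before the first #include. This is a sketch that assumes a POSIX system whose <string.h> honours _POSIX_C_SOURCE; the function name count_fields is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* must come before any #include */
#include <string.h>

/* Hypothetical helper: counts the comma-separated fields in buf.
 * buf must be writable, because strtok_r() overwrites delimiters. */
int count_fields(char *buf)
{
    char *save = NULL;
    char *tok;
    int n = 0;

    /* strtok_r() is now properly declared as returning char *, so the
     * assignments below no longer convert an implicit int to a pointer. */
    for (tok = strtok_r(buf, ",", &save); tok != NULL;
         tok = strtok_r(NULL, ",", &save))
        n++;
    return n;
}
```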

Keith Thompson · Aug 4, 2010

Ian Collins said:
Keith Thompson wrote:
[skip]

With a bit of extra work, you can use the strto*() functions instead;
they're sane enough to tell you if the value is out of range (by
returning an extreme value and setting errno to ERANGE).

My system's strtok man page (Fedora Core 6) doesn't say anything about
returning an extreme value or setting errno to ERANGE.

I'm sure Keith was referring to strtol() and strtoll()

Yes, along with strtoul(), strtoull(), strtod(), strtof(), and
strtold(). I didn't notice that "strtok" matches the same pattern
(because the "to" in "strtok" is part of "tok", an abbreviation of
"token", not the word "to").

Eric Sosman · Aug 4, 2010

Mark said:
So for strtok_r() it's safe to pass the same buffer pointer? Like this:

for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ",")) {
printf("%s: ", token);
/* no need to keep a copy of 'buf' */
for (subtoken = strtok(buftmp, "-"); subtoken != NULL; subtoken = strtok(NULL, "-")) {
printf("%s ", buf,subtoken);
}
}

I don't see *any* strtok_r() calls here ...

Ordinary strtok() returns a pointer to the start of a token,
and remembers where it ends so it knows where to start the next
search. This is why it doesn't nest: It can only remember one
restart point in its internal variable.

The non-Standard strtok_r() function behaves similarly, but
uses a caller-provided variable to store the restart point. If
the caller uses one variable for the "outer" calls and another
for the "inner" ones, the two scanning sequences won't interfere.

As for the copy, it's perfectly all right to do anything you
want to a substring located by strtok() or strtok_r(): Once it's
been located and divided from the surrounding string, they're done
with it and don't need it any more. (Well, "almost anything:" it
would be a bad idea to strcat() "Hello" onto its end, because that
would disrupt the still-unscanned part of the original string. But
as long as you stay within the bounds of the token string itself,
you can do whatever you like there.)
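
As a sketch of the two-variable approach, assuming a POSIX system that provides strtok_r() (the function name parse_line_r and its return value are made up for illustration):

```c
#define _POSIX_C_SOURCE 200809L  /* asks POSIX systems to declare strtok_r() */
#include <stdio.h>
#include <string.h>

/* Prints "token: part part" for each comma-separated token and returns
 * the total number of parts. buf must be writable. */
int parse_line_r(char *buf)
{
    char *token, *subtoken;
    char *outer = NULL, *inner = NULL;  /* one restart point per scan */
    int parts = 0;

    for (token = strtok_r(buf, ",", &outer); token != NULL;
         token = strtok_r(NULL, ",", &outer)) {
        printf("%s: ", token);
        /* Splitting the token in place is fine: the outer scan is
         * already past it, so no copy is needed. */
        for (subtoken = strtok_r(token, "-", &inner); subtoken != NULL;
             subtoken = strtok_r(NULL, "-", &inner)) {
            printf("%s ", subtoken);
            parts++;
        }
        putchar('\n');
    }
    return parts;
}
```

Called on a writable array such as char line[] = "1-3,5,8"; this should print the three lines from the original question, each with a trailing space.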

Gene · Aug 4, 2010

I have been through this so many times: hacking up a little parser
with strtok() and sscanf()/atoi(), then throwing it away when the
input language gets just a bit more sophisticated. These days I
always go ahead and implement a traditional scanner and simple EBNF
parser. Once you have the framework, it's very quick to adapt it to
new problems, and it's liberating to know this extra power can be
tapped with no code rewriting. Here's what I'm talking about:

#include <stdio.h>
#include <ctype.h>

// Tokens our scanner can discover.
typedef enum token_e {
    T_NULL,
    T_ERROR,
    T_END_OF_INPUT,
    T_INT,
    T_COMMA,
    T_DASH,
} TOKEN;

// Encapsulated state of an input token scanner.
typedef struct scanner_state_s {
    char *text;  // Input to scan.
    TOKEN token; // Last token found.
    int p0, p1;  // Last token string is text[p0..p1).
} SCANNER_STATE;

// Initialize a scanner's state.
void init_scanner_state(SCANNER_STATE *ss, char *text)
{
    ss->text = text;
    ss->token = T_NULL;
    ss->p0 = ss->p1 = 0;
}

// Return current character as an unsigned char value, so it is
// always safe to pass to the ctype.h predicates.
static int current_char(SCANNER_STATE *ss)
{
    return (unsigned char)ss->text[ss->p1];
}

// Advance the scanner to the next character (never past the final '\0').
static void advance(SCANNER_STATE *ss)
{
    if (current_char(ss) != '\0')
        ++ss->p1;
}

// Return the current token.
TOKEN current_token(SCANNER_STATE *ss)
{
    return ss->token;
}

// Get the integer value of an INT token.
// Returns 0 on success, 1 if the current token is not an INT.
int get_int_value(SCANNER_STATE *ss, int *value) {
    if (ss->token == T_INT) {
        sscanf(&ss->text[ss->p0], "%d", value);
        return 0;
    }
    return 1;
}

// Mark the beginning of a token.
static void start_token(SCANNER_STATE *ss, TOKEN token)
{
    ss->p0 = ss->p1;
    ss->token = token;
}

// Action on discovering the end of a token.
static void end_token(SCANNER_STATE *ss)
{
    (void)ss; // Do nothing in this scanner.
}

// Scan a token without advancing the input.
static void scan_zero_char_token(SCANNER_STATE *ss, TOKEN token)
{
    start_token(ss, token);
    end_token(ss);
}

// Scan a single character token from the input.
static void scan_one_char_token(SCANNER_STATE *ss, TOKEN token)
{
    start_token(ss, token);
    advance(ss);
    end_token(ss);
}

// Scan the next token from the input.
void scan(SCANNER_STATE *ss)
{
    // Skip whitespace.
    while (isspace(current_char(ss))) advance(ss);

    // Use a switch() here if speed is necessary.
    // The if's let us use ctype.h predicates.
    if (isdigit(current_char(ss))) {
        start_token(ss, T_INT);
        do {
            advance(ss);
        } while (isdigit(current_char(ss)));
        end_token(ss);
    }
    else if (current_char(ss) == ',')
        scan_one_char_token(ss, T_COMMA);
    else if (current_char(ss) == '-')
        scan_one_char_token(ss, T_DASH);
    else if (current_char(ss) == '\0')
        scan_zero_char_token(ss, T_END_OF_INPUT);
    else
        scan_zero_char_token(ss, T_ERROR);
}

// Match a given token and scan past it to the next
// or else raise a syntax error if it's not there.
// It's usually best to longjmp out of the parser on error.
void match(SCANNER_STATE *ss, TOKEN token)
{
    if (current_token(ss) == token)
        scan(ss);
    else {
        fprintf(stderr, "syntax error (%d) at end of '%.*s'\n",
                ss->token, ss->p1 + 1, ss->text);
        ss->token = T_ERROR;
    }
}

// Parse the EBNF form: <range> ::= INT [ '-' INT ]
static void range(SCANNER_STATE *ss)
{
    int lo = 0, hi = 0; // Defined values even if an INT is missing.

    get_int_value(ss, &lo);
    match(ss, T_INT);

    if (current_token(ss) == T_DASH) {
        scan(ss);
        get_int_value(ss, &hi);
        match(ss, T_INT);
    }
    else
        hi = lo;

    // Action code.
    printf(lo == hi ? "%d\n" : "[%d-%d]\n", lo, hi);
}

// Parse the EBNF form:
// <line> ::= [ <range> { ',' <range> } ] END_OF_INPUT
void parse_line(char *text)
{
    SCANNER_STATE ss[1];

    init_scanner_state(ss, text);
    scan(ss); // scan the initial token

    if (current_token(ss) == T_INT) {

        range(ss);

        while (current_token(ss) == T_COMMA) {
            scan(ss);
            range(ss);
        }
    }
    match(ss, T_END_OF_INPUT);
}

// Simple test.
int main(int argc, char *argv[])
{
    if (argc == 2)
        parse_line(argv[1]);
    return 0;
}
