strings to/from enumerations mapping

I

InuY4sha

Hi list,
I'm writing a math parser and I'm looking for the most refined and
convenient way to map operators in form of strings (e.g. "sin",
"exp"), into numbers to use for faster comparing.

I was about to follow this way:

#define SIN_OP "sin"
#define SIN 1
#define TAN_OP "tan"
#define TAN 2
....
///Mapping from OP to string
#define IN_OP2STR(x) x
#define OP2STR(x) IN_OP2STR(x ## _OP)

The problem is that I need the exact opposite: I start from a string
(like "sin") and need to map that into a number (that would be 1 in
the above example).

Another solution could be something like the following

#define SIN 1
#define TAN 2
struct operator
{
char *op;
int value;
};

struct command commands[] =
{
{ "sin", SIN },
{ "tan", TAN },
...
};

This way the mapping is done, but I don't know how to implement a fast
way of comparing strings to determine the operator value... I'd need
some kind of macro. Anyway I'm not sure if macro is the way, as I
actually need to check operator's strings in runtime, therefore
decision is taken on the operator value only after that and
preprocessor cannot help...

Do you have a cleaner vision to share about this?
Thanks in advance
RM
 
I

InuY4sha

Sounds like you're looking for a hash function?
(Or did I misunderstand the problem statement.)

http://en.wikipedia.org/wiki/Hash_function

That's close to my needs. I need direct mapping and I know in advance
my operators and values.
I mean... consider this: an hash function would do several
computations to compare the string with a set of well known strings to
determine a fixed value. Or it could generate a value from the string
with a certain mathematical function. Say that this mapping is
univocal, still you need some time to perform it.

If we could do something like getting the string from the user and
"cast" it to a define that maps into a value that would be O(1) in
terms of computations...

I'm wondering if this is feasible..
 
B

Ben Bacarisse

InuY4sha said:
That's close to my needs. I need direct mapping and I know in advance
my operators and values.
I mean... consider this: an hash function would do several
computations to compare the string with a set of well known strings to
determine a fixed value. Or it could generate a value from the string
with a certain mathematical function. Say that this mapping is
univocal, still you need some time to perform it.

If we could do something like getting the string from the user and
"cast" it to a define that maps into a value that would be O(1) in
terms of computations...

A hash table is usually considered to be O(1) (the precise big-O for
any reasonably complicated algorithm is fiddly and off-topic here
anyway).

I'll take a guess, and say that "perfect hashing" (sometimes called
minimal hashing) is the closest thing to what you are after. It
requires and extra build-stage operation but it is quite easy to use
(I've only ever used gpref but there are others).
 
L

ld

That's close to my needs. I need direct mapping and I know in advance
my operators and values.
I mean... consider this: an hash function would do several
computations to compare the string with a set of well known strings to
determine a fixed value. Or it could generate a value from the string
with a certain mathematical function. Say that this mapping is
univocal, still you need some time to perform it.

If we could do something like getting the string from the user and
"cast" it to a define that maps into a value that would be O(1) in
terms of computations...

I'm wondering if this is feasible..

If the list of operators is fixed at compile time, the best you can
expect is a perfect hash which garantee O(1) for searching but you
still have to hash the user input which is O(n) where n is the length
of the input.

http://en.wikipedia.org/wiki/Perfect_hash_function

But before going into complicated hash functions, you should check if
bsearch + strcmp on a sorted list of operator strings is too slow for
your application. The following code (untested) should do the work:

#include <stdlib.h>
#include <string.h>
#include <assert.h>

enum { COS, SIN, TAN };

struct cmd {
const char *op;
int id;
};

static const struct cmd cmd[] = {
{ "cos", COS }, // ascending order of op
{ "sin", SIN },
{ "tan", TAN }
};

static const size_t num_cmd = sizeof cmd / sizeof *cmd;

static int find_op(const void *o, const void *c)
{
const char *op = o;
const struct cmd *cmd = c;

return strcmp(o, cmd->op);
}

int op2id(const char *op)
{
const struct cmd *found =
bsearch(op, cmd, num_cmd, sizeof *cmd, find_op);

return found ? found->id : -1;
}

const char* id2op(int id)
{
assert(id >= 0 && id < num_cmd);

return cmd[id]->op;
}

Cheers,

ld.
 
B

Bart

Hi list,
I'm writing a math parser and I'm looking for the most refined and
convenient way to map operators in form of strings (e.g. "sin",
"exp"), into numbers to use for faster comparing.

I was about to follow this way:

#define SIN_OP "sin"
#define SIN 1
#define TAN_OP "tan"
#define TAN 2
...
///Mapping from OP to string
#define IN_OP2STR(x) x
#define OP2STR(x) IN_OP2STR(x ## _OP)

The problem is that I need the exact opposite: I start from a string
(like "sin") and need to map that into a number (that would be 1 in
the above example).

Another solution could be something like the following

     #define SIN 1
     #define TAN 2
     struct operator
     {
       char *op;
       int value;
     };

     struct command commands[] =
     {
       { "sin", SIN },
       { "tan", TAN },
       ...
     };

This way the mapping is done, but I don't know how to implement a fast
way of comparing strings to determine the operator value... I'd need
some kind of macro. Anyway I'm not sure if macro is the way, as I
actually need to check operator's strings in runtime, therefore
decision is taken on the operator value only after that and
preprocessor cannot help...

Do you have a cleaner vision to share about this?

Maths operators tend to be short.

You could just interpret the first 4 (or perhaps 8; if depends on your
machine a little) characters of each string as an integer, then use
that instead:

char op[] = "sin ";
int opn;

opn = *(int*)op;

On my machine, opn is now 0x206E6973. You can use this value directly,
or do a search (faster now because it's all integers) for a compact
code.

Note the string must be at least 4 (or 8) characters, here I've filled
with a space. In practice, just concatenate 4 to 8 spaces on any
operator strings that are input. To display the operator, you may need
to trim the spaces, or you can use the compact code (sin=1, cos=2,
etc) to index of table of these names.

The downside: it's more fiddly to detect extraneous characters, if the
input is "sinhabcd" you will only see "sinh", apparenly a valid
operator. So some extra checking is needed.

Of course if speed is not an issue (how many expressions a second will
be processed: 10, or 10 million)?), then a linear search of an array
of strings will work fine, no need to bother with hashes and other
things that will introduce bugs as well as unnecessary speed.
 
B

Barry Schwarz

Hi list,
I'm writing a math parser and I'm looking for the most refined and
convenient way to map operators in form of strings (e.g. "sin",
"exp"), into numbers to use for faster comparing.

I was about to follow this way:

#define SIN_OP "sin"
#define SIN 1
#define TAN_OP "tan"
#define TAN 2
...
///Mapping from OP to string
#define IN_OP2STR(x) x
#define OP2STR(x) IN_OP2STR(x ## _OP)

The problem is that I need the exact opposite: I start from a string
(like "sin") and need to map that into a number (that would be 1 in
the above example).

Another solution could be something like the following

     #define SIN 1
     #define TAN 2
     struct operator
     {
       char *op;
       int value;
     };

     struct command commands[] =
     {
       { "sin", SIN },
       { "tan", TAN },
       ...
     };

This way the mapping is done, but I don't know how to implement a fast
way of comparing strings to determine the operator value... I'd need
some kind of macro. Anyway I'm not sure if macro is the way, as I
actually need to check operator's strings in runtime, therefore
decision is taken on the operator value only after that and
preprocessor cannot help...

Do you have a cleaner vision to share about this?

Maths operators tend to be short.

You could just interpret the first 4 (or perhaps 8; if depends on your
machine a little) characters of each string as an integer, then use
that instead:

char op[] = "sin ";
int opn;

opn = *(int*)op;

And what happens if the array op is not aligned properly for an int?
On my machine, opn is now 0x206E6973. You can use this value directly,

And what happens when he tries to run on a big endian machine or one
that uses EBCDIC or one where sizeof(int) is not 4 or one where
CHAR_BITS is not 8?
 
B

Bart

I'm writing a math parser and I'm looking for the most refined and
convenient way to map operators in form of strings (e.g. "sin",
"exp"), into numbers to use for faster comparing.
#define SIN_OP "sin"
#define SIN 1
#define TAN_OP "tan"
#define TAN 2
...
///Mapping from OP to string
#define IN_OP2STR(x) x
#define OP2STR(x) IN_OP2STR(x ## _OP)
The problem is that I need the exact opposite: I start from a string
(like "sin") and need to map that into a number (that would be 1 in
the above example).
Another solution could be something like the following
     #define SIN 1
     #define TAN 2
     struct operator
     {
       char *op;
       int value;
     };
     struct command commands[] =
     {
       { "sin", SIN },
       { "tan", TAN },
       ...
     };
This way the mapping is done, but I don't know how to implement a fast
way of comparing strings to determine the operator value... I'd need
some kind of macro. Anyway I'm not sure if macro is the way, as I
actually need to check operator's strings in runtime, therefore
decision is taken on the operator value only after that and
preprocessor cannot help...
Do you have a cleaner vision to share about this?

Maths operators tend to be short.

You could just interpret the first 4 (or perhaps 8; if depends on your
machine a little) characters of each string as an integer, then use
that instead:

char op[] = "sin ";
int opn;

opn = *(int*)op;

On my machine, opn is now 0x206E6973. You can use this value directly,
or do a search (faster now because it's all integers) for a compact
code.

I'd forgotten C also has multi-byte char constants, so you can say:

int n = 'sin';

and have an array or list of these int/char codes to search through.
Then a simple conversion from a input operator string can be done,
without worrying about padding the string (these constants will be
padded with 0).

(It also addresses the alignment issue someone else mentioned,
although I would only worry about that at this stage, if it's an
actual problem. Complete arrays of char (but not substrings of them)
are quite likely to be int-aligned anyway. And it's simple enough to
force an alignment, with some extra clutter in the code:

typedef union opc{long long int opn; char ops[8];} operator;
operator op;

op.opn=0;
strncpy(op.ops,"power",sizeof op.ops);

printf("OPN= %I64x \n",op.opn); /* can't remember proper long long
code */
printf("OPS= %s \n",op.ops);
)
 
J

jameskuyper

Bart wrote:
....
I'd forgotten C also has multi-byte char constants, so you can say:

int n = 'sin';

and have an array or list of these int/char codes to search through.
Then a simple conversion from a input operator string can be done,
without worrying about padding the string (these constants will be
padded with 0).

While the C standard does allow such constants, the only thing it says
about the value of such constants is that it's implementation defined.
In particular, the standard says nothing to prohibit 'sin' == 'cos'.
More plausibly, such character constants might have a value that
depends upon only the first sizeof(int) characters, with the remaining
characters being ignored. That could be problematic for an application
such as this, if any two of the strings are indentical in the first
sizeof(int) characters.
 
B

Bart

Bart wrote:
While the C standard does allow such constants, the only thing it says
about the value of such constants is that it's implementation defined.
In particular, the standard says nothing to prohibit 'sin' == 'cos'.
More plausibly, such character constants might have a value that
depends upon only the first sizeof(int) characters, with the remaining
characters being ignored.

Might have? I think it's a certainty than only the first int's worth
of characters will be significant. Otherwise we'd all be using ints
instead of strings.

For this application, it's reasonable to limit these constants to 8
characters, perhaps even 4, with longer strings being an error, unless
the OP wants the extra work of recognising 'arctangent' as a valid
operator, or using multiple int representations.
 
L

ld

char op[] = "sin ";
int opn;
opn = *(int*)op;

Yes it's surely a smart way

It's not so sure. You may experiment memory alignment problem (e.g.
SIGBUS), and nothing says that a full and safe implementation using
this trick will be faster than my (not so) naive proposal. If it's a
valid trick for your implementation, there is a good chance that
decent compilers use it. Most of the time, bsearch and strcmp will be
optimized and inlined (e.g. builtin). Measure first, hack only if it's
too slow. Some time ago, I was using a compile-time hash function (cpp
macro) to finally discover (with gcc 4.x) that inlined functions was
doing the job equally well. And since the macro revealed a bug in gcc
3.x leading to exponential compile time, I stopped to use it.

cheers,

ld.
 
B

Bart

char op[] = "sin ";
int opn;
opn = *(int*)op;
Yes it's surely a smart way

It's not so sure. You may experiment memory alignment problem (e.g.
SIGBUS), and nothing says that a full and safe implementation using
this trick will be faster than my (not so) naive proposal. If it's a
valid trick for your implementation, there is a good chance that
decent compilers use it.

You guys have more faith in the optimising cabilities of compilers
than I have.

A search for the sixth element of a ten-op list was 5-10 x faster
using int encodings compared to a string, both using linear search
(and limited to 4-char int constants as any more generated an error --
WHY?)

Maybe one of those clever searches they teach in CS courses would be
better, but we are only talking about a dozen or two elements to
search for, and the method can anyway also be applied to the int-const
version! Int processing is generally faster than string processing.
Most of the time, bsearch and strcmp will be
optimized and inlined (e.g. builtin). Measure first, hack only if it's
too slow.

Sure.
 
J

jameskuyper

Bart said:
Might have? I think it's a certainty than only the first int's worth
of characters will be significant. Otherwise we'd all be using ints
instead of strings.

No, it might depend only upon the last sizeof(int) characters, or be
determined in some other manner entirely.
For this application, it's reasonable to limit these constants to 8
characters, ...

Which wouldn't be a sufficiently low limit on most implementations I'm
familiar with.
 
A

Alan Curry

I'll take a guess, and say that "perfect hashing" (sometimes called
minimal hashing) is the closest thing to what you are after. It
requires and extra build-stage operation but it is quite easy to use
(I've only ever used gpref but there are others).

That's gperf, for anyone struggling to find "gpref"
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,152
Members
46,697
Latest member
AugustNabo

Latest Threads

Top