Best way to search through STL maps?

S

Steve

I have the following STL containers:

(using namespace std;)

typedef vector<string> TokenVector; // e.g [the] [cat] [sat] [on]
[the] [mat]

typedef map<TokenVector, long > NGramMap; // a list of unique n-grams
(as above), and a freq count.

My goal is to search for n-grams that match a pattern with wildcards,
for instance:

[the] [cat] [sat] [?] [the] [mat]

and return a list of all matches with all the possibilities for the
wildcard [?], along with the freq count

Assuming the map is sorted alphabetically, is there some way to search
them - like a regular expression - that return pattern matches?

The only way I can think of at the moment is to search for "[the]
[cat] [sat]", then sift through those matches.
This gets messy, though, when the pattern starts with a wildcard, or
contains 2 or 3 wildcards.

This seems like it might be a common task, so I thought I'd ask here
before re-inventing the wheel.
Does STL have ready-made solutions for this kind of search?

Does it make the problem any easier if it's constrained to _always_
being 6-grams, and that there are never more than 3 wildcards?

Thanks for any clues, even if it's just what topics I need to study in
order to crack it.

Steve
 
E

er

Does the question boil down to finding whether

[the] [cat] [sat] [on] [the] [mat]

matches

[the] [cat] [sat] [?] [the] [mat]


where each of the above n-grams are represented by a collection of
strings and symbol ? is not a question mark but a wildcard?
 
S

Steve

Does the question boil down to finding whether

[the] [cat] [sat] [on] [the] [mat]

matches

[the] [cat] [sat] [?] [the] [mat]

where each of the above n-grams are represented by a collection of
strings and symbol ? is not a question mark but a wildcard?


Yes, succinctly put.
 
E

er

Does the question boil down to finding whether
[the] [cat] [sat] [on] [the] [mat]

[the] [cat] [sat] [?] [the] [mat]
where each of the above n-grams are represented by a collection of
strings and symbol ? is not a question mark but a wildcard?

Yes, succinctly put.

I have a feeling (yes, nothing more reliable than that as I don't deal
text processing) that working with strings might be easier than
breaking them into tokens that are held by a STL container. And do so
using C++ I would look into Boost.Regex. But probably you already
thought of this and have a good reason to use tokens.
 
S

Steve

Does the question boil down to finding whether
[the] [cat] [sat] [on] [the] [mat]
matches
[the] [cat] [sat] [?] [the] [mat]
where each of the above n-grams are represented by a collection of
strings and symbol ? is not a question mark but a wildcard?
Yes, succinctly put.

I have a feeling (yes, nothing more reliable than that as I don't deal
text processing) that working with strings might be easier than
breaking them into tokens that are held by a STL container. And do so
using C++ I would look into Boost.Regex. But probably you already
thought of this and have a good reason to use tokens.

Yes, but having still found no shortcuts, I'm beginning to think that
keeping composite string copies for searching is the way to go.
(I really should investigate Boost, but a quick search reveals people
having nightmares building it for OSX due to the transition to 64-bit,
so I'll have to hold off for a while.)

Thanks

Steve
 
E

er

(I really should investigate Boost, but a quick search  reveals people
having nightmares building it for OSX due to the transition to 64-bit,
so I'll have to hold off for a while.)

FYI I was able to build a couple of the boost libraries (the bulk of
boost does not need installing) on Snow Leopard (64-bit). Some of the
searches you found probably pertain to advanced features that you
probably don't need.
 
J

Jeff Flinn

Steve said:
Does the question boil down to finding whether
[the] [cat] [sat] [on] [the] [mat]
matches
[the] [cat] [sat] [?] [the] [mat]
where each of the above n-grams are represented by a collection of
strings and symbol ? is not a question mark but a wildcard?
Yes, succinctly put.
I have a feeling (yes, nothing more reliable than that as I don't deal
text processing) that working with strings might be easier than
breaking them into tokens that are held by a STL container. And do so
using C++ I would look into Boost.Regex. But probably you already
thought of this and have a good reason to use tokens.

Yes, but having still found no shortcuts, I'm beginning to think that
keeping composite string copies for searching is the way to go.
(I really should investigate Boost, but a quick search reveals people
having nightmares building it for OSX due to the transition to 64-bit,
so I'll have to hold off for a while.)

Did you mean this:
> it was reported that in Snow Leopard, Apple has removed support for
> ppc64, and folks trying to use that receive fairly obscure errors.
> Some of Boost users were already confused by this. Now, this is
> of course more like Apple's problem, but here's a patch, against
> current trunk, to do a nice diagnostic of this in Boost.Build.
> The intention is that if you try to build just ppc64, or a two-way
> fat binary including ppc64 you get an error message, and if you
> try tobuild 4-way fat binary, ppc64 is skipped and you get 3-way
> binary.

Jeff
 
S

Steve

Steve said:
Does the question boil down to finding whether
[the] [cat] [sat] [on] [the] [mat]
matches
[the] [cat] [sat] [?] [the] [mat]
where each of the above n-grams are represented by a collection of
strings and symbol ? is not a question mark but a wildcard?
Yes, succinctly put.
I have a feeling (yes, nothing more reliable than that as I don't deal
text processing) that working with strings might be easier than
breaking them into tokens that are held by a STL container. And do so
using C++ I would look into Boost.Regex. But probably you already
thought of this and have a good reason to use tokens.
Yes, but having still found no shortcuts, I'm beginning to think that
keeping composite string copies for searching is the way to go.
(I really should investigate Boost, but a quick search  reveals people
having nightmares building it for OSX due to the transition to 64-bit,
so I'll have to hold off for a while.)

Did you mean this:

 > it was reported that in Snow Leopard, Apple has removed support for
 > ppc64, and folks trying to use that receive fairly obscure errors.
 > Some of Boost users were already confused by this. Now, this is
 > of course more like Apple's problem, but here's a patch, against
 > current trunk, to do a nice diagnostic of this in Boost.Build.
 > The intention is that if you try to build just ppc64, or a two-way
 > fat binary including ppc64 you get an error message, and if you
 > try tobuild 4-way fat binary, ppc64 is skipped and you get 3-way
 > binary.

Jeff

Yes, not that exact message, but the gist of it. Well, it looks like
it's worth persevering then.
 
S

Steve

Does ppc64 concern you? Not me, so I did not need a patch.

No it doesn't, but I thought specifying *just* that omission was
tricky for a non-unix man like myself. I'm just being chicken ;) I'm
sure we've all suffered the scenario when, in the middle of an idea,
you get sidetracked installing something, and then 24 hours disappear,
along with clumps of hair.
I think I read somewhere that aspects of boost will be appearing C+
+0x, so it's a no-brainer that I need to start making good use of it.I
may be gone some time ;)

PS Great to be talking to someone here using OSX, I almost thought it
wasn't worth mentioning.
 
E

er

No it doesn't, but I thought specifying *just* that omission was
tricky for a non-unix man like myself.  I'm just being chicken ;) I'm
sure we've all suffered the scenario when, in the middle of an idea,
you get sidetracked installing something, and then 24 hours disappear,
along with clumps of hair.

You might be able to spare that by reading/posting to the boost.build
mailing list. As far as building a library (again most don't require
building), my own shameful tribulations are reported in posts 10/20
and 11/2. In each case, it turns out I was overlooking simplicity.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,994
Messages
2,570,223
Members
46,812
Latest member
GracielaWa

Latest Threads

Top