Remove certain words from a C++ string

prasanna.hariharan

Hi guys,

I want to remove certain words from a C++ string. The list of words is
in a file, with each word on a new line. I tried using
std::transform, but it didn't work.

Does anybody have a clue as to how I should go about this?

thanks a lot,
Hp
 
Andrej Hristoliubov

Try using string::find and string::erase:

example:

string str = "Hello world is the only assignment I can do",
       remword = "world";
size_t pos;

// erase the first occurrence of remword, if there is one
if ((pos = str.find(remword)) != string::npos)
{
    str.erase(pos, remword.length());
}





ps. I rule!
 
Dave Rahardja

transform() doesn't remove entries in a container, it only modifies them.

Use string::find() to find the substring, then use string::erase() to remove
the substring from the string.

-dr
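
A minimal sketch of the find/erase approach described above, extended to remove every occurrence rather than just the first. The helper name remove_all is hypothetical, and note that it matches substrings, not whole words:

```cpp
#include <string>

// Hypothetical helper: erase every occurrence of `word` from `text`.
// This matches substrings, so removing "the" would also alter "other".
std::string remove_all(std::string text, const std::string& word)
{
    if (word.empty())
        return text;                      // nothing to search for

    std::string::size_type pos = 0;
    while ((pos = text.find(word, pos)) != std::string::npos)
        text.erase(pos, word.length());   // keep searching from the same spot
    return text;
}
```

Calling remove_all(str, "world") on the example string from earlier in the thread would leave the surrounding spaces in place, so a double space remains where the word used to be.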
 
Hp

Hi, thanks a lot for your replies. But I figured out that before the
word removal from the string, I need to convert the C++ string from
upper to lower case. I in fact used transform() to perform this
operation, which didn't work.
Also, after the upper-to-lower case conversion, I need to read the
file containing all the stopwords, one on each line, to be removed from
the transformed string.

Thanks a lot in advance,
Hp
 
Jonathan Mcdougall

Hp said:
Hi, thanks a lot for your replies. But I figured out that before the
word removal from the string, I need to convert the C++ string from
upper to lower case. I in fact used transform() to perform this
operation, which didn't work.

How did you do it? How didn't it work?

Also, after the upper-to-lower case conversion, I need to read the
file containing all the stopwords, one on each line, to be removed from
the transformed string.

Show us some code! Reading a line from a file is a basic operation
described in any textbook (good or bad).

1) get the string
2) convert it to lower case
3) read the lines from the file
4) search the string for the words you just read and remove each one
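
The four steps above can be sketched roughly as follows. This is only one possible reading: the function names are hypothetical, the stop words are assumed to arrive one per line on any input stream, and the text is rebuilt word by word instead of being searched in place:

```cpp
#include <algorithm>
#include <cctype>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Step 2: return a lower-case copy of the text.
std::string to_lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Steps 3 and 4: read one stop word per line, then rebuild the text
// from the whitespace-separated words that are not in the stop list.
std::string strip_stop_words(const std::string& text, std::istream& stop_file)
{
    std::set<std::string> stop_words;
    std::string line;
    while (std::getline(stop_file, line))
        if (!line.empty())
            stop_words.insert(to_lower(line));

    std::istringstream words(to_lower(text));
    std::string word, result;
    while (words >> word) {
        if (stop_words.count(word) != 0)
            continue;                 // drop stop words
        if (!result.empty())
            result += ' ';
        result += word;
    }
    return result;
}
```

Working on whole words avoids accidentally cutting a stop word out of the middle of a longer word, at the cost of collapsing the original whitespace to single spaces.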


Jonathan
 
puzzlecracker

Stopwords (like the, an, a)? Sounds like a problem from data mining.
What course are you taking?

I think I've dealt with it (long ago in my academic career!)
 
Hp

Hey Puzzlecracker, it's exactly a problem from data mining. Yes, the
stopwords are in a file, with one stop word per line.

Hi Jonathan, thanks for your replies. I used the following code to
convert the string from upper to lower case:
std::transform(file.begin(),file.end(),file.begin(),(int(*)(int))std::tolower);

file is the string from which the stopwords need to be removed.
Thanks a lot,
Hp
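
For context on the cast in that call: std::tolower is overloaded (there is a one-argument version in <cctype> and a two-argument version in <locale>), so std::transform cannot deduce which one is meant without help. The cast picks the <cctype> version. A sketch of an alternative that avoids the cast by wrapping the call; lower_char and lower_in_place are hypothetical names:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical wrapper: gives std::transform a single, unambiguous function.
// The value goes through unsigned char because passing a negative char
// to std::tolower is undefined behaviour.
char lower_char(char c)
{
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Lower-case the string in place.
void lower_in_place(std::string& s)
{
    std::transform(s.begin(), s.end(), s.begin(), lower_char);
}
```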
 
puzzlecracker

Can you explain the "(int(*)(int))std::tolower" argument to transform? I'm
not quite sure what that cast is all about.

thanks

ps. I assume you didn't just blindly copy the code.
 
Hp

Hi, I figured out how to do the case conversion. It was a casting
error, which I took care of. Thanks for the hint, Puzzlecracker.
I tried using Andrej's piece of code to remove the stop words, but could
not get it to work. Any hint on stopword removal would be greatly
appreciated, as I am a novice at C++.
Thanks, Hp
 
puzzlecracker

easy:

1. populate all stop words into a set
2. read all words from the file and, as you read, check whether each
word is a stop word (use lexicographical_compare with a case-insensitive
predicate to avoid case issues). If it is, discard it; otherwise put it
into a vector.

I will start:
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <vector>

using namespace std;

// fills the set with the stop words; left for you to write
void initialize(set<string>& stopWset);

int main(int argc, char *argv[])
{
    set<string> stopWset;
    vector<string> wordvec;
    ifstream in("input.txt");

    if (!in)
    {
        cerr << "cannot open input.txt\n";
        return 1;
    }

    initialize(stopWset);

    string word;
    while (in >> word)
        if (stopWset.find(word) == stopWset.end()) // not a stop word: keep it
            wordvec.push_back(word);

    return 0;
}

You get the idea. Or would you suggest reading the entire file at once?
 
Hp

Hi puzzlecracker, I got the idea, wherein we are putting all the
non-stopwords into a vector of strings.
Here, if i am not wrong, input.txt is the file that has the list of
stopwords. Which one is the string that has the contents with the
stopwords and non-stopwords?
And what does initialize do?
Thanks
 
Karl Heinz Buchegger

Hp said:
Hi puzzlecracker, I got the idea, wherein we are putting all the
non-stopwords into a vector of strings.

No.
In puzzlecracker's code:

stopWset stands for the 'set of stop words'.
wordvec is the vector of words you read from your input and which are
(after the loop has finished) not stop words.
Here, if i am not wrong, input.txt is the file that has the list of
stopwords.

That's why it is called 'input' :)
input is the file you want to check against the stop words
Which one is the string that has the contents with the
stopwords and non-stopwords?
And what does initialize do?

What do you think?
There are 2 file operations going on in the whole program
* one deals with your input
* the second one deals with the file of stop words

So if the loop handles your input file, what do you think
will be the job of initialize(stopWset)? Especially when one
takes into account that it gets passed 'stopWset'.
 
Hp

Hi All,
Thanks a lot for all your replies.

My requirement is as follows:
I need to read a text file, eliminate certain special characters (like
! , - = +), convert it to lower case, and then remove certain stopwords
(like and, a, an, by, the, etc.), which are listed in another text file.
Then I need to run it through a stemmer (a program which converts words
like "running" to "run", i.e. reduces them to their root words).
Then I need to create a term-by-document matrix M, where M(i,j) gives
the number of times term j occurs in document i.

My situation as of now is as follows:
I have read the file contents into a string variable, replaced the
special characters with spaces using the replace function, and then
converted the string completely to lower case using the transform
function.

I would really appreciate any help. Thanks in advance.

Thanks,
Hp
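
For the term-by-document matrix described above, a rough sketch of one way to hold the counts. It assumes each document has already been cleaned, lower-cased, stop-word-filtered and stemmed into a single string of whitespace-separated terms; the type and function names are hypothetical:

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// counts[term][i] is the number of times `term` occurs in document i,
// i.e. M(i, j) for the j-th term in the map's sorted order.
typedef std::map<std::string, std::vector<int> > TermDocumentCounts;

TermDocumentCounts count_terms(const std::vector<std::string>& documents)
{
    TermDocumentCounts counts;
    for (std::size_t i = 0; i < documents.size(); ++i) {
        std::istringstream words(documents[i]);
        std::string term;
        while (words >> term) {
            std::vector<int>& row = counts[term];
            row.resize(documents.size(), 0);  // no-op once the row exists
            ++row[i];
        }
    }
    return counts;
}
```

A std::map keeps the terms sorted, which gives a stable column order if the counts are later copied into a dense matrix.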
 
Greg

I know this may sound sacrilegious in a C++ newsgroup and all, but
does the text processing program have to be written in C++?

There are several dedicated text processing tools such as awk or sed,
or scripting languages (like Perl) that are specifically designed for
text stream editing. While certainly none of these alternatives is
particularly accessible, none has a steep learning curve either.

The power of regular expressions for manipulating text is difficult to
match in a C++ program without such support, at least in my experience.
And since I am not (too much of) a language snob, I recommend choosing
the best language for the job, even if it's not the best language. For
example, lowercasing a file's content with GNU sed is a simple command:

sed -e 's/[A-Z]/\L&/g' inputfile

Writing a C++ program to do the same would be more involved. The good news
is that tr1's regex brings regular expression support to C++. So if a
C++ solution is required, I would look at regex to see whether it can
help solve your problem.

And if you do write the program in a language other than C++, some here
will be able to forgive you. But just don't tell your friends what you
have done.

Greg
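
As a sketch of the regex route mentioned above: since C++11 the TR1 library lives on as std::regex, and a word-boundary pattern can remove a stop word without touching longer words that contain it. The helper name remove_word is hypothetical, and the code assumes the stop word consists only of letters, so it needs no regex escaping:

```cpp
#include <regex>
#include <string>

// Remove every whole-word occurrence of `word`, ignoring case.
// Assumes `word` contains only letters, so it needs no regex escaping.
std::string remove_word(const std::string& text, const std::string& word)
{
    const std::regex pattern("\\b" + word + "\\b",
                             std::regex::ECMAScript | std::regex::icase);
    return std::regex_replace(text, pattern, "");
}
```

Building one regex per stop word is simple but not fast; for a long stop list, a set lookup per word is likely the cheaper option.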
 
Hp

Yeah Greg, I do need to have it coded in C++.
Thanks for your reply though. I still haven't found a solution.
 
Dave Rahardja

I know this may sound sacrilegious in a C++ newsgroup and all, but
does the text processing program have to be written in C++?

There are several dedicated text processing tools such as awk or sed,
or scripting languages (like Perl) that are specifically designed for
text stream editing. While certainly none of these alternatives is
particularly accessible, none has a steep learning curve either.

My thoughts exactly. I use Python for my scripting needs. But this is a C++
forum and I think answers using C++ tools are appropriate.

Maybe the OP would like to take a look at boost's regular expressions library?

-dr
 
Julián Albo

Greg said:
I know this may sound sacrilegious in a C++ newsgroup and all, but
does the text processing program have to be written in C++?

There are several dedicated text processing tools such as awk or sed,
or scripting languages (like Perl) that are specifically designed for
text stream editing. While certainly none of these alternatives is
particularly accessible, none has a steep learning curve either.

I disagree to some extent with that common point of view. Certainly, using
the language best adapted to the work sounds reasonable. But in many
cases a person or organization only has relatively good knowledge of
one language and superficial, possibly outdated knowledge of others. And
using the "main" language even for relatively small things has the advantage
that the code, or parts of it, can be reused in other projects.

Another factor is the coherence of the project. Several projects have a Perl
or Python part that generates C or C++ code. That means that the people
able to collaborate on the project as a whole must know both of the
languages you choose.

And finally, the C++ standard library is powerful enough to do many things
without much effort. For example, std::string makes affordable many things
that were unreasonable to write with C-style strings. "Accelerated C++" can
be seen as a sample of how to use C++ to do "scripting-style" tasks.

Certainly there are, for example, a lot of Perl modules for many tasks that
are not easily available or not so versatile in other languages.
 
Hp

Hi Guys,
I need to use C++, and no other scripting tool.
If anybody could give a solution to the problem, it would be highly
appreciated.
Thanks,
Hp
 
Mike Wahler

It's almost a certainty that nobody here is going to
simply provide a solution (that's not what this group
is for). You've received many ideas and hints, why
not give it a try, then when you get stuck, you can
post your (relevant) code and ask specific questions,
whereupon you'll receive more specific assistance.

If you really do want a completed solution, you need
to find a 'help wanted' group to post your solicitation.

-Mike
 
