replace only full words

cerr · Sep 28, 2013

Hi,

I have a list of sentences and a list of words. Every full word from the word list that appears in a sentence should be wrapped in angle brackets, i.e. "I drink in the house." would become "I <drink> in the <house>." (and not "I <d<rink> in the <house>."). I have attempted it like this:

for sentence in sentences:
    for noun in nouns:
        if " "+noun+" " in sentence or " "+noun+"?" in sentence or " "+noun+"!" in sentence or " "+noun+"." in sentence:
            sentence = sentence.replace(noun, '<' + noun + '>')

    print(sentence)

But what if the word is at the beginning of a sentence? I also don't like the approach of relying on a fixed set of word terminations. Also, is there a way to make it faster?

Thanks

Tim Chase · Sep 28, 2013

cerr wrote:
I have a list of sentences and a list of words. Every full word
that appears within sentence shall be extended by <WORD> i.e. "I
drink in the house." Would become "I <drink> in the <house>." (and
not "I <d<rink> in the <house>.")

This is a good place to reach for regular expressions. They come with
an "ensure there is a word boundary here" token (\b), so you can do
something like the code at the (way) bottom of this email. I've
pushed it off the bottom in case you want to try using regexps
on your own first. Or, if this is homework, to at least make you work a
*little*.

cerr wrote:
Also, is there a way to make it faster?

The code below should do the processing in roughly O(n) time, as it
only makes one pass through the data and does O(1) lookups into your
set of nouns. I included code in the regexp to roughly find
contractions and hyphenated words. Your original code grows slower
as your list of nouns grows and also suffers from
multiple-replacement issues (if you have the noun list ["drink",
"rink"], you'll likely get results that you don't want).

My code doesn't handle case differences, but you should be able to
normalize both the list of nouns and the word you're testing in the
"modify()" function so that it finds "Drink" as well as "drink".

Also, note that some words serve both as nouns and other parts of
speech, e.g. "It's kind of you to house me for the weekend and drink
tea with me."
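
The multiple-replacement issue mentioned above can be reproduced with a short sketch. The function name naive_mark is made up for this illustration; it applies str.replace once per noun, as the original loop does, but without the surrounding membership test:

```python
def naive_mark(sentence, nouns):
    # Apply plain substring replacement once per noun, in list order.
    # Later nouns can match inside text that an earlier pass already wrapped.
    for noun in nouns:
        sentence = sentence.replace(noun, "<" + noun + ">")
    return sentence

print(naive_mark("I drink in the rink.", ["drink", "rink"]))
# prints: I <d<rink>> in the <rink>.
```

The second pass finds "rink" inside the already-wrapped "<drink>", which is why a single pass over word tokens is the safer design.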

-tkc

import re

r = re.compile(r"""
    \b        # assert a word boundary
    \w+       # 1+ word characters
    (?:       # a non-capturing group
      [-']    # a dash or apostrophe
      \w+     # followed by 1+ word characters
    )?        # make the group optional (0 or 1 instances)
    \b        # assert a word boundary here
    """, re.VERBOSE)

nouns = set([
    "drink",
    "house",
    ])

def modify(matchobj):
    word = matchobj.group(0)
    if word in nouns:
        return "<%s>" % word
    else:
        return word

print(r.sub(modify, "I drink in the house"))
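
The case normalization suggested earlier in this message could look roughly like the sketch below. The names WORD, NOUNS and mark are assumptions made for this example; the idea is simply to keep the noun set in lowercase and lowercase each matched word before the lookup, while returning the word in its original case:

```python
import re

# Same word pattern as above, written on one line.
WORD = re.compile(r"\b\w+(?:[-']\w+)?\b")

# Nouns are stored in lowercase so the lookup can ignore case.
NOUNS = {"drink", "house"}

def modify(matchobj):
    word = matchobj.group(0)
    # Compare in lowercase, but keep the original spelling in the output.
    if word.lower() in NOUNS:
        return "<%s>" % word
    return word

def mark(sentence):
    return WORD.sub(modify, sentence)

print(mark("Drink in the House, not the farmhouse."))
# prints: <Drink> in the <House>, not the farmhouse.
```

Because whole tokens are looked up, "farmhouse" is left alone even though it contains "house".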

MRAB · Sep 28, 2013

It sounds like a regex problem to me:

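import re

nouns = ["drink", "house"]

# \b on both sides restricts matches to whole words.
pattern = re.compile(r"\b(" + "|".join(nouns) + r")\b")

# "sentences" is the list of sentences from the original post.
for sentence in sentences:
    sentence = pattern.sub(r"<\g<0>>", sentence)
    print(sentence)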
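
One caveat with building the pattern by joining the words directly: any regex metacharacter inside a word is interpreted as regex syntax. A hedged variant that passes each word through re.escape first is sketched below; the entry "3.5" and the function name mark are made up for illustration:

```python
import re

# "3.5" is a hypothetical entry containing a regex metacharacter (the dot).
nouns = ["drink", "house", "3.5"]

# re.escape makes each entry match literally; (?:...) groups without capturing.
pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in nouns) + r")\b")

def mark(sentence):
    # \g<0> refers to the whole match, so no capturing group is needed.
    return pattern.sub(r"<\g<0>>", sentence)

print(mark("Version 3.5 of the house is not 345."))
# prints: Version <3.5> of the <house> is not 345.
```

Without the escaping, the dot in "3.5" would match any character, so "345" would be wrapped as well.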

Jussi Piitulainen · Sep 28, 2013

Maybe tokenize by a regex and then join the replacements of all
tokens:

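import re

def substitute(token):
    # isfullword() is not defined in this post; it stands for whatever
    # test decides that a token is one of the words to mark.
    if isfullword(token.lower()):
        return '<{}>'.format(token)
    else:
        return token

def tokenize(sentence):
    # The capturing group keeps the separators in the result,
    # so joining the tokens restores the original sentence.
    return re.split(r'(\W)', sentence)

sentence = 'This is, like, a test.'

tokens = map(substitute, tokenize(sentence))
sentence = ''.join(tokens)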

For better results, both tokenization and substitution need to depend
on context. Doing some of that should be an interesting exercise.
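
The tokenizing approach above leaves isfullword undefined. A minimal runnable sketch, assuming it is a plain set lookup, might look like this; FULL_WORDS and mark are hypothetical names introduced only for this example:

```python
import re

# Hypothetical word list; in the thread this would be the list of nouns.
FULL_WORDS = {"this", "test"}

def isfullword(word):
    # Assumed here to be a simple membership test.
    return word in FULL_WORDS

def substitute(token):
    if isfullword(token.lower()):
        return "<{}>".format(token)
    return token

def tokenize(sentence):
    # Splitting on a captured \W keeps separators as their own tokens.
    return re.split(r"(\W)", sentence)

def mark(sentence):
    return "".join(map(substitute, tokenize(sentence)))

print(mark("This is, like, a test."))
# prints: <This> is, like, a <test>.
```

Since the separators are kept as tokens, joining the substituted tokens reproduces the punctuation and spacing of the input unchanged.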

cerr · Sep 28, 2013

Great, only I don't have the re module on my system....

MRAB · Sep 28, 2013

On 28/09/2013 18:43, cerr wrote:
[snip]

Great, only I don't have the re module on my system....

Really? It's part of Python's standard distribution.

Tim Chase · Sep 28, 2013

[mercy, you could have trimmed down that reply]

Great, only I don't have the re module on my system....

Um, it's a standard Python library. You sure about that?

http://docs.python.org/2/library/re.html

-tkc

cerr · Sep 28, 2013

On 28/09/2013 18:43, cerr wrote:

[snip]

Great, only I don't have the re module on my system....

Really? It's part of Python's standard distribution.

Oh no, sorry, misinformation: I DO have the re module available!!! All good!

cerr · Sep 28, 2013

[mercy, you could have trimmed down that reply]

Great, only I don't have the re module on my system....

Um, it's a standard Python library. You sure about that?

http://docs.python.org/2/library/re.html

Oh no, sorry, misinformation: I DO have the re module available!!! All good!
