How can I exclude a word by using re?

could ildg · Aug 14, 2005

In re, the "^" in a character class can exclude a single character, but I
want to exclude a whole word now. For example, I have a string "hi, how are
you. hello" and I want to extract the whole part before the word "hello".
I can't use ".*[^hello]" because "^" only excludes the single characters
"h", "e", "l" or "o". Will somebody tell me how to do it? Thanks.

Christoph Rackwitz · Aug 14, 2005

re.findall('(.*)hello|(.*)', 'hi, how are you. hello')
re.findall('(.*)hello|(.*)', 'hi, how are you. ello')
take a look at the outputs of these.
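
For readers who cannot run the calls right away, here is a small sketch in Python 3 syntax (the thread itself predates Python 3). The first part uses a hypothetical sample string, not one from the thread, to show why the character class in the question does not exclude a word. The second part runs the two findall calls above and prints the first tuple of each result.

```python
import re

# Part 1: [^hello] is a negated character class. It matches ONE character
# that is not 'h', 'e', 'l' or 'o' and knows nothing about the word "hello".
sample = "hi, how are you. hello world"  # hypothetical string for illustration
char_class = re.match(r".*[^hello]", sample).group(0)
# The greedy .* runs to the end and backs off one character; the final 'd'
# is outside the set, so the whole sample is matched, "hello" included.
print(repr(char_class))

# Part 2: the two calls from the post above. findall returns one tuple per
# match, with one entry per capture group.
pattern = "(.*)hello|(.*)"
with_hello = re.findall(pattern, "hi, how are you. hello")
without_hello = re.findall(pattern, "hi, how are you. ello")

# When "hello" is present, the first alternative wins and group 1 is filled.
print(with_hello[0])     # ('hi, how are you. ', '')
# When it is absent, the second alternative takes the whole string.
print(without_hello[0])  # ('', 'hi, how are you. ello')
```

Only the first tuple of each list is shown here; the interesting part is which of the two groups ends up non-empty.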

Jeff Schwab · Aug 14, 2005

import re

def demonstrate(regex, text):
    pattern = re.compile(regex)
    match = pattern.search(text)

    print " ", text
    if match:
        print " Matched '%s'" % match.group(0)
        print " Captured '%s'" % match.group(1)
    else:
        print " Did not match"

# Option 1: Match it all, but capture only the part before "hello." The (.*?)
# matches as few characters as possible, so that this pattern would end before
# the first hello in "hello hello".

pattern = r"(.*?)hello"
print "Option 1:", pattern
demonstrate( pattern, "hi, how are you. hello" )

# Option 2: Don't even match the "hello," but make sure it's there.
# The first of these calls will match, but the second will not. The
# (?=...) construct is using a feature called "forward look-ahead."

pattern = r"(.*)(?=hello)"
print "\nOption 2:", pattern
demonstrate( pattern, "hi, how are you. hello" )
demonstrate( pattern, "hi, how are you. " )
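
The script above uses Python 2 print statements. As a rough equivalent, here is a minimal Python 3 sketch of the same two options with their results written out. The variable names (lazy, ahead, missing) are chosen for illustration only.

```python
import re

text = "hi, how are you. hello"

# Option 1: lazy capture. group(0) includes "hello", group(1) does not.
lazy = re.search(r"(.*?)hello", text)
print(repr(lazy.group(0)))   # 'hi, how are you. hello'
print(repr(lazy.group(1)))   # 'hi, how are you. '

# Option 2: lookahead. "hello" must follow, but it is not part of the match.
ahead = re.search(r"(.*)(?=hello)", text)
print(repr(ahead.group(0)))  # 'hi, how are you. '

# Without "hello" in the text, the lookahead pattern does not match at all.
missing = re.search(r"(.*)(?=hello)", "hi, how are you. ")
print(missing)               # None
```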

could ildg · Aug 14, 2005

Thank you.
But what should I do if there is more than one hello and I only want
to extract what's before the first "hello"? For example, the raw
string is "hi, how are you? hello I'm fine, thank you hello. that's it
hello", and I want to extract all the stuff before the first hello.

Bruno Desthuilliers · Aug 14, 2005

could ildg wrote:

Thank you.
But what should I do if there is more than one hello and I only want
to extract what's before the first "hello"?

Read The Fine Manual ?-)

For example, the raw
string is "hi, how are you? hello I'm fine, thank you hello. that's it
hello", and I want to extract all the stuff before the first hello.

re.findall(r'^(.*)hello', your_string_full_of_hellos)

Peter Otten · Aug 15, 2005

could said:
But what should I do if there is more than one hello and I only want
to extract what's before the first "hello"? For example, the raw
string is "hi, how are you? hello I'm fine, thank you hello. that's it
hello", and I want to extract all the stuff before the first hello.

The simplest solution is to use str.split():

>>> helo = "hi, how are you? HELLO I'm fine, thank you hello. that's it"
>>> helo.split("hello", 1)[0]
"hi, how are you? HELLO I'm fine, thank you "

But regular expressions offer a similar feature:

>>> re.compile("hello", re.IGNORECASE).split(helo, 1)[0]
'hi, how are you? '

Peter

John Machin · Aug 15, 2005

Bruno said:
could ildg wrote:

Read The Fine Manual ?-)

re.findall(r'^(.*)hello', your_string_full_of_hellos)

Nice try, but it needs a little refinement to do what the OP asked for:

>>> import re
>>> h = "hi g'day hello hello hello"
>>> re.findall(r'^(.*)hello', h)
["hi g'day hello hello "]
>>> re.findall(r'^(.*?)hello', h)
["hi g'day "]
>>> re.findall(r'^(.*?)hello', h)[0]
"hi g'day "

John Machin · Aug 15, 2005

(1) Why must you use re? It's often a good idea to use string methods
where they can do the job you want.
(2) What do you want to have happen if "hello" is not in the string?

Example:

C:\junk>type upto.py
def upto(strg, what):
    k = strg.find(what)
    if k > -1:
        return strg[:k]
    return None # or raise an exception

helo = "hi, how are you? HELLO I'm fine, thank you hello hello hello. that's it"

print repr(upto(helo, "HELLO"))
print repr(upto(helo, "hello"))
print repr(upto(helo, "hi"))
print repr(upto(helo, "goodbye"))
print repr(upto("", "goodbye"))
print repr(upto("", ""))

C:\junk>upto.py
'hi, how are you? '
"hi, how are you? HELLO I'm fine, thank you "
''
None
None
''

HTH,
John
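
As a variation on the upto() helper above, here is a sketch in Python 3 syntax that uses str.partition, a string method that was added to Python after this thread was written. It is an alternative, not the code from the post, and it differs in one edge case: partition raises ValueError for an empty separator, whereas the find-based version returns ''.

```python
def upto(text, what):
    """Return the part of text before the first `what`, or None if absent."""
    # partition returns (head, separator, tail); the separator slot is an
    # empty string when `what` does not occur in text.
    head, sep, _tail = text.partition(what)
    return head if sep else None


helo = "hi, how are you? HELLO I'm fine, thank you hello hello hello. that's it"

print(repr(upto(helo, "HELLO")))    # 'hi, how are you? '
print(repr(upto(helo, "hello")))    # "hi, how are you? HELLO I'm fine, thank you "
print(repr(upto(helo, "hi")))       # ''
print(repr(upto(helo, "goodbye")))  # None
```

Returning None keeps "not found" distinct from "found at position 0", which is the question raised in point (2).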

could ildg · Aug 16, 2005

I want to use re because I want to extract something from an HTML page. It
would be very complicated without re. But while using re, I found that I
must exclude a whole word, "</td>", and of course there are a great many
"</td>" in this HTML.

My re is as below:
_____________________________________________
r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)
_____________________________________________
There should be over 30 matches in the HTML, but I find nothing with
re.finditer(html) because the last line of my re is wrong. I can't use
"(?P<name>.+)</td>" because there are a great many "</td>" in the HTML
and I just want the ".+" to match what comes before the first "</td>".
So I think that if there were some way to exclude a word, this would be
done. Assuming a "NOT(WORD)" construct could do it, I would just need to
write the last line of the re as "(?P<name>(NOT(</td>))+)</td>".
But I still have no idea after thinking and trying for a very long time.

In other words, I want the "</td>" of "(?P<name>.+)</td>" to be
exactly the first "</td>" in this match. And there is more than one
match in this HTML, so this must be done with re.

And I can't use any of your ideas because what I want to deal with is
very complicated HTML, not just a single line of words.

I could copy part of the HTML here, but it's rather too lengthy.
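
The "NOT(WORD)" idea described above corresponds closely to a negative lookahead, (?!...), which the re module supports. The sketch below is in Python 3 syntax and runs against a made-up two-row fragment; the real page is not shown in the thread, so the sample HTML and the shortened patterns are assumptions for illustration only.

```python
import re

# Hypothetical fragment, loosely shaped like the rows the pattern above targets.
html = (
    '<td valign=top>1</td><td><a href="a.mp3" target=_blank>First</a></td>'
    '<td valign=top>2</td><td><a href="b.mp3" target=_blank>Second</a></td>'
)

# Greedy: .+ runs on to the LAST </td>, so both rows collapse into one match.
greedy = re.findall(r'target=_blank>(.+)</td>', html)

# Lazy: .+? stops at the FIRST </td> after each link.
lazy = re.findall(r'target=_blank>(.+?)</td>', html)

# "NOT(word)": (?!</td>) is a negative lookahead, so (?:(?!</td>).)+ consumes
# one character at a time, but only where "</td>" does not start.
excluded = re.findall(r'target=_blank>((?:(?!</td>).)+)</td>', html)

print(len(greedy))  # 1
print(lazy)         # ['First</a>', 'Second</a>']
print(excluded)     # ['First</a>', 'Second</a>']
```

In this sample the lazy quantifier and the lookahead form give the same result, and the lazy form is shorter. Note that "." does not match a newline by default, so if a name can span several lines the pattern would also need the re.DOTALL flag.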

Jordan Rastrick · Aug 16, 2005

could ildg said:

I want to use re because I want to extract something from an HTML page. It
would be very complicated without re. But while using re, I found that I
must exclude a whole word, "</td>", and of course there are a great many
"</td>" in this HTML.

Actually, for properly processing HTML, you shouldn't really be using
regular expressions, precisely because the problem is complicated:
regular expressions are too simple and can't properly model a language
like HTML, which is generated by a context-free grammar.

If that's only meaningless technical mumbo-jumbo to you, never mind.
The important point is that you shouldn't really use an re. Trust me.

What you want for a job like this is an HTML parser. There's one in the
standard library; if it doesn't suit, there are plenty of third-party
ones. I like Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

If you insist on using an re, well, I'm sure someone on this group will
figure out a solution to your issue that's as good as you're going to
get...
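
To give an idea of what the standard-library route might look like, here is a minimal sketch using html.parser.HTMLParser, the Python 3 name of the standard-library parser. The class name Mp3LinkCollector and the sample rows are hypothetical; the real page is not shown in the thread, so treat this as an outline under those assumptions.

```python
from html.parser import HTMLParser


class Mp3LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> whose href ends in .mp3."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.lower().endswith(".mp3"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a matching <a> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical rows, loosely modelled on the pattern quoted in the thread.
sample = (
    '<tr><td valign=top>1</td><td><a href="songs/first.mp3" target=_blank>'
    'First song</a></td></tr>'
    '<tr><td valign=top>2</td><td><a href="songs/second.mp3" target=_blank>'
    'Second song</a></td></tr>'
)

collector = Mp3LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
# [('songs/first.mp3', 'First song'), ('songs/second.mp3', 'Second song')]
```

The parser tracks where each element starts and ends, so the question of which "</td>" closes the name never comes up.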

Paul McGuire · Aug 16, 2005

Given the example re that you've been trying to get working, here is a
pyparsing approach that might be more, um, approachable.
Unfortunately, since I don't have the URL of the page you are working
with, I'm unable to test this before posting.

Good luck,
-- Paul

# getMP3s.py
# get pyparsing at http://pyparsing.sourceforge.net
#

from pyparsing import *
import urllib

#~ r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
#~ ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
#~ ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)

tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")

number = Word(nums)
valign = CaselessLiteral("valign=top>")

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + SkipTo(aStart) + aStart + \
           SkipTo(tdEnd) + tdEnd

# get list of mp3's
targetURL = "http://whatever"
targetPage = urllib.urlopen( targetURL )
targetHTML = targetPage.read()
targetPage.close()

for toks,s,e in mp3Entry.scanString(targetHTML):
    print toks.number, toks.starta.href

could ildg · Aug 16, 2005

Thank you,
your code using pyparsing works very well. Now I have the "number" and
the "url", but I still want to get the "name".
I'll turn to pyparsing and see how to get the "name" from the HTML.
But I hope you can enlighten me one more time, since I'm not
familiar with the pyparsing module.

Paul McGuire · Aug 16, 2005

Just as with re you were using "?P<xxx>" to assign the matching text to
the variable "xxx", pyparsing allows you to associate a name with an
element of your grammar using setResultsName.

Here is your original re:
r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)

Here is the pyparsing expression:
valign + number.setResultsName("number") + tdEnd + \
    tdStart + SkipTo(aStart) + aStart + \
    SkipTo(tdEnd) + tdEnd

Here are the re and pyparsing pieces side by side:
re                   => pyparsing
---------------------------------
valign=top>          => valign = CaselessLiteral("valign=top>")
(?P<number>\d{1,2})  => number = Word(nums), number.setResultsName("number")
</td>                => tdEnd
<td[^>]*>            => tdStart
\s{0,2}              => I don't know what this re does, so I just used
                        SkipTo(aStart)
<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>
                     => aStart (which returns a value whose named attributes
                        correspond to the HTML attributes, such as href)
(?P<name>.+)         => SkipTo(tdEnd) *** here is where we'll make our
                        change ***
</td>                => tdEnd

To capture the body of the second <td></td> tag pair, we'll add
setResultsName("name") to the pyparsing expression:
mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + SkipTo(aStart) + aStart + \
           SkipTo(tdEnd)setResultsName("name") + tdEnd

Now you should be able to extract the data using:
for toks,s,e in mp3Entry.scanString(targetHTML):
    print toks.number, toks.starta.href, toks.name

Good luck!
-- Paul

Paul McGuire · Aug 16, 2005

Oof! That should be:

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + SkipTo(aStart) + aStart + \
           SkipTo(tdEnd).setResultsName("name") + tdEnd

Dennis Lee Bieber · Aug 16, 2005

could said:
I want to use re because I want to extract something from an HTML page. It
would be very complicated without re. But while using re, I found that I
must exclude a whole word, "</td>", and of course there are a great many
"</td>" in this HTML.

Yeesh... Wouldn't it be faster to use an HTML parser (I think there
is one in the standard library) that just doesn't emit anything for the
particular tags in question (and, at the simplest, just copies
everything else to the output unchanged)?

Paul McGuire · Aug 16, 2005

I just reviewed what the re "\s" signifies: whitespace. This is easy:
pyparsing ignores all intervening whitespace by default. So mp3Entry
simplifies to:

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + aStart + \
           SkipTo(tdEnd).setResultsName("name") + tdEnd

which leads me to another question: isn't there a closing </a> in
there somewhere, probably at the end of the name? If so, then you
might be better off with:

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + aStart + \
           SkipTo(aEnd).setResultsName("name") + aEnd + tdEnd

-- Paul
