How can I exclude a word by using re?

could ildg

In re, the "^" inside a character class can exclude a single character,
but now I want to exclude a whole word. For example, I have the string
"hi, how are you. hello" and I want to extract everything before the
word "hello". I can't use ".*[^hello]" because "[^hello]" only excludes
the single characters "h", "e", "l", and "o". Will somebody tell me how
to do it? Thanks.
 
Christoph Rackwitz

re.findall('(.*)hello|(.*)', 'hi, how are you. hello')
re.findall('(.*)hello|(.*)', 'hi, how are you. ello')
Take a look at the outputs of these.
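For reference, here is a small sketch of what those two calls return when run under Python 3 (the variable names are only illustrative). Because the pattern has two capture groups, re.findall returns one tuple per match, and a group that did not take part in the match comes back as an empty string. The trailing ('', '') entry is the empty match that the second alternative finds at the very end of the string.

```python
import re

pattern = r'(.*)hello|(.*)'

# "hello" is present: the first alternative wins and group 1 holds the prefix.
with_word = re.findall(pattern, 'hi, how are you. hello')

# "hello" is absent: only the second alternative can match, so group 2 holds everything.
without_word = re.findall(pattern, 'hi, how are you. ello')

print(with_word)
print(without_word)
```

So a non-empty first element means "hello" was found, and a non-empty second element means it was not.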
 
Jeff Schwab

could said:
In re, the "^" inside a character class can exclude a single character,
but now I want to exclude a whole word. For example, I have the string
"hi, how are you. hello" and I want to extract everything before the
word "hello". I can't use ".*[^hello]" because "[^hello]" only excludes
the single characters "h", "e", "l", and "o". Will somebody tell me how
to do it? Thanks.

import re

def demonstrate(regex, text):
    pattern = re.compile(regex)
    match = pattern.search(text)

    print("  %s" % text)
    if match:
        print(" Matched '%s'" % match.group(0))
        print(" Captured '%s'" % match.group(1))
    else:
        print(" Did not match")

# Option 1: Match it all, but capture only the part before "hello." The (.*?)
# matches as few characters as possible, so that this pattern would end before
# the first hello in "hello hello".

pattern = r"(.*?)hello"
print("Option 1: %s" % pattern)
demonstrate(pattern, "hi, how are you. hello")

# Option 2: Don't even match the "hello," but make sure it's there.
# The first of these calls will match, but the second will not. The
# (?=...) construct is using a feature called "forward look-ahead."

pattern = r"(.*)(?=hello)"
print("\nOption 2: %s" % pattern)
demonstrate(pattern, "hi, how are you. hello")
demonstrate(pattern, "hi, how are you. ")
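If the goal is literally to "exclude a word", a negative look-ahead can stand in for the whole-word version of [^...] that re does not have. The sketch below (Python 3; the name before_word is made up for illustration) lets "." consume a character only at positions where "hello" does not start, so the match stops just before the first "hello", or runs to the end of the line when the word is absent.

```python
import re

# (?!hello) is a negative look-ahead: it succeeds only where "hello" does NOT
# begin. Wrapping it with "." and repeating the pair walks forward one
# character at a time until the word is about to start.
before_word = re.compile(r"(?:(?!hello).)*")

found = before_word.match("hi, how are you. hello").group(0)
absent = before_word.match("hi, how are you. ").group(0)
leading = before_word.match("hello hello").group(0)

print(repr(found))
print(repr(absent))
print(repr(leading))
```

Note that "." does not match a newline unless re.DOTALL is passed, so on multi-line text this stops at the end of the first line.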
 
could ildg

Thank you.
But what should I do if there is more than one "hello" and I only want
to extract what's before the first one? For example, the raw string is
"hi, how are you? hello I'm fine, thank you hello. that's it hello",
and I want to extract everything before the first "hello".
 
Bruno Desthuilliers

could ildg wrote:
Thank you.
But what should I do if there are more than one hello and I only want
to extract what's before the first "hello".

Read The Fine Manual ?-)

For example, the raw
string is "hi, how are you? hello I'm fine, thank you hello. that's it
hello", I want to extract all the stuff before the first hello?

re.findall(r'^(.*)hello', your_string_full_of_hellos)
 
Peter Otten

could said:
But what should I do if there are more than one hello and I only want
to extract what's before the first "hello". For example, the raw
string is "hi, how are you? hello I'm fine, thank you hello. that's it
hello", I want to extract all the stuff before the first hello?

The simplest solution is to use str.split():
>>> helo = "hi, how are you? HELLO I'm fine, thank you hello. that's it"
>>> helo.split("hello", 1)[0]
"hi, how are you? HELLO I'm fine, thank you "

But regular expressions offer a similar feature:
>>> import re
>>> re.compile("hello", re.IGNORECASE).split(helo, 1)[0]
'hi, how are you? '

Peter
 
John Machin

Bruno said:
Read The Fine Manual ?-)

re.findall(r'^(.*)hello', your_string_full_of_hellos)

Nice try, but it needs a little refinement to do what the OP asked for:
>>> import re
>>> h = "hi g'day hello hello hello"
>>> re.findall(r'^(.*)hello', h)
["hi g'day hello hello "]
>>> re.findall(r'^(.*?)hello', h)
["hi g'day "]
>>> re.findall(r'^(.*?)hello', h)[0]
"hi g'day "
 
John Machin

could said:
In re, the "^" inside a character class can exclude a single character,
but now I want to exclude a whole word. For example, I have the string
"hi, how are you. hello" and I want to extract everything before the
word "hello". I can't use ".*[^hello]" because "[^hello]" only excludes
the single characters "h", "e", "l", and "o". Will somebody tell me how
to do it? Thanks.

(1) Why must you use re? It's often a good idea to use string methods
where they can do the job you want.
(2) What do you want to have happen if "hello" is not in the string?

Example:

C:\junk>type upto.py
def upto(strg, what):
    k = strg.find(what)
    if k > -1:
        return strg[:k]
    return None # or raise an exception

helo = "hi, how are you? HELLO I'm fine, thank you hello hello hello. that's it"

print(repr(upto(helo, "HELLO")))
print(repr(upto(helo, "hello")))
print(repr(upto(helo, "hi")))
print(repr(upto(helo, "goodbye")))
print(repr(upto("", "goodbye")))
print(repr(upto("", "")))

C:\junk>upto.py
'hi, how are you? '
"hi, how are you? HELLO I'm fine, thank you "
''
None
None
''

HTH,
John
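A compact variant of the same idea, sketched here as an alternative rather than as John's code, uses str.partition, which splits at the first occurrence of the separator and returns an empty separator when nothing was found. The function name upto_partition and the sample string are only illustrative.

```python
def upto_partition(text, what):
    """Return the part of text before the first `what`, or None if absent."""
    head, sep, _tail = text.partition(what)
    # partition() reports "not found" by returning an empty separator.
    return head if sep else None

helo = "hi, how are you? HELLO I'm fine, thank you hello hello hello. that's it"

print(repr(upto_partition(helo, "HELLO")))
print(repr(upto_partition(helo, "hello")))
print(repr(upto_partition(helo, "hi")))
print(repr(upto_partition(helo, "goodbye")))
```

One difference from the find() version: str.partition raises ValueError for an empty separator, so the upto("", "") case above would need separate handling.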
 
could ildg

I want to use re because I want to extract something from an html page.
It would be very complicated without re. But while using re, I found
that I must exclude a whole word, "</td>", and there are certainly many
"</td>" in this html.

My re is as below:
_____________________________________________
r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)
_____________________________________________
There should be over 30 matches in the html. But I find nothing with
re.finditer(html) because the last line of my re is wrong. I can't use
"(?P<name>.+)</td>" because there are many "</td>" in the html and I
just want the ".+" to match what comes before the first "</td>".
So I think if there is some way to exclude a word, this will be done.
Assuming there were a "NOT(WORD)" that could do it, I would just need
to write the last line of the re as "(?P<name>(NOT(</td>))+)</td>".
But I still have no idea after thinking and trying for a very long time.

In other words, I want the "</td>" of "(?P<name>.+)</td>" to be
exactly the first "</td>" in this match. And there is more than one
match in this html, so this must be done by using re.

And I can't use any of your ideas because what I want to deal with is
a very complicated html, not just a single line of words.

I can copy part of the html up to here but it's kinda too lengthy.
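One way to get the effect of the hypothetical "NOT(</td>)" is a non-greedy quantifier: ".+?" stops at the first "</td>" that lets the rest of the pattern match. The sketch below is written for Python 3 (so the ur'' prefixes become r'') and runs on a made-up two-row sample, since the real page is not shown in the thread; the file names and song titles are invented.

```python
import re

# Hypothetical two-row sample on one line; the real markup may differ.
html = (
    '<tr><td valign=top>1</td><td class=x> '
    '<a href="songs/one.mp3" target=_blank>First Song</a></td></tr>'
    '<tr><td valign=top>2</td><td> '
    '<a href="songs/two.mp3" target=_blank>Second Song</a></td></tr>'
)

# Everything up to the name is copied from the pattern in the post.
head = (r'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
        r'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>')

greedy = re.compile(head + r'(?P<name>.+)</td>', re.IGNORECASE)
lazy = re.compile(head + r'(?P<name>.+?)</td>', re.IGNORECASE)

# Greedy ".+" runs on to the LAST </td> it can reach, swallowing later rows.
greedy_names = [m.group('name') for m in greedy.finditer(html)]

# Lazy ".+?" stops at the FIRST </td>, giving one match per row.
lazy_rows = [(m.group('number'), m.group('url'), m.group('name'))
             for m in lazy.finditer(html)]

print(len(greedy_names))
for row in lazy_rows:
    print(row)
```

On this sample the captured name still ends with "</a>", because the closing anchor tag sits before the "</td>"; stopping at "</a>" instead would trim it.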
 
Jordan Rastrick

could ildg said:
I want to use re because I want to extract something from an html page.
It would be very complicated without re. But while using re, I found
that I must exclude a whole word, "</td>", and there are certainly many
"</td>" in this html.

Actually, for properly processing html, you shouldn't really be using
regular expressions, precisely because the problem is complicated:
regular expressions are too simple and can't properly model a language
like HTML, which is generated by a context-free grammar.

If that's only meaningless technical mumbo-jumbo to you, never mind.
The important point is you shouldn't really use an re. Trust me :)

What you want for a job like this is an HTML parser. There's one in the
standard library; if it doesn't suit, there are plenty of third-party
ones. I like Beautiful Soup:

http://www.crummy.com/software/BeautifulSoup/

If you insist on using an re, well, I'm sure someone on this group will
figure out a solution to your issue that's as good as you're going to
get...
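As a rough illustration of the parser route, here is a minimal sketch using the standard library's html.parser module (its Python 3 name). The class name CellText and the sample row are assumptions made up for this example; the real page may need different handling.

```python
from html.parser import HTMLParser

class CellText(HTMLParser):
    """Collect the text of every <td> cell plus any .mp3 links inside it."""

    def __init__(self):
        super().__init__()
        self.cells = []     # one (text, links) tuple per finished <td>
        self._text = None   # text pieces of the cell being read, or None
        self._links = []    # .mp3 hrefs seen inside the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._text, self._links = [], []
        elif tag == "a" and self._text is not None:
            href = dict(attrs).get("href")
            if href and href.lower().endswith(".mp3"):
                self._links.append(href)

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None:
            self.cells.append(("".join(self._text).strip(), self._links))
            self._text = None

# Hypothetical sample row; the real markup is not shown in the thread.
sample = ('<tr><td valign=top>1</td><td> '
          '<a href="songs/one.mp3" target=_blank>First Song</a></td></tr>')

parser = CellText()
parser.feed(sample)
parser.close()
print(parser.cells)
```

The parser finds the end of each cell itself, so the "which </td> is the first one" question never comes up.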

 
Paul McGuire

Given the example re that you've been trying to get working, here is a
pyparsing approach that might be more, um, approachable.
Unfortunately, since I don't have the URL of the page you are working
with, I'm unable to test this before posting.

Good luck,
-- Paul

# getMP3s.py
# get pyparsing at http://pyparsing.sourceforge.net
#

from pyparsing import *
import urllib

#~ r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
#~              ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
#~              ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)

tdStart,tdEnd = makeHTMLTags("td")
aStart,aEnd = makeHTMLTags("a")

number = Word(nums)
valign = CaselessLiteral("valign=top>")

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + SkipTo(aStart) + aStart + \
           SkipTo(tdEnd) + tdEnd

# get list of mp3's
targetURL = "http://whatever"
# urllib.urlopen is the Python 2 API; in Python 3 it is urllib.request.urlopen
targetPage = urllib.urlopen( targetURL )
targetHTML = targetPage.read()
targetPage.close()

for toks,s,e in mp3Entry.scanString(targetHTML):
    print("%s %s" % (toks.number, toks.starta.href))
 
could ildg

Thank you,
your code using pyparsing works very well. Now I have the "number" and
the "url". But I still want to get the "name".
I'll turn to pyparsing and see how to get the "name" from the html.
But I hope you can enlighten me one more time, since I'm not
familiar with the pyparsing module.
 
Paul McGuire

Just as with re you were using "?P<xxx>" to assign the matching text to
the variable "xxx", pyparsing allows you to associate a name with an
element of your grammar using setResultsName.

Here is your original re:
r=re.compile(ur'valign=top>(?P<number>\d{1,2})</td><td[^>]*>\s{0,2}'
             ur'<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank>'
             ur'(?P<name>.+)</td>',re.UNICODE|re.IGNORECASE)

Here is the pyparsing expression:
valign + number.setResultsName("number") + tdEnd + \
    tdStart + SkipTo(aStart) + aStart + \
    SkipTo(tdEnd) + tdEnd

Here are the re and pyparsing pieces side by side:
re => pyparsing
-----------------------
valign=top> => valign = CaselessLiteral("valign=top>")
(?P<number>\d{1,2}) => number = Word(nums), number.setResultsName("number")
</td> => tdEnd
<td[^>]*> => tdStart
\s{0,2} => I don't know what this re does, so I just used SkipTo(aStart)
<a href="(?P<url>[^<>]+\.mp3)"( )target=_blank> => aStart (which returns a value whose named attributes correspond to the HTML attributes, such as href)
(?P<name>.+) => SkipTo(tdEnd) *** here is where we'll make our change ***
</td> => tdEnd

To capture the body of the second <td></td> tag pair, we'll add
setResultsName("name") to the pyparsing expression:
mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + SkipTo(aStart) + aStart + \
           SkipTo(tdEnd)setResultsName("name") + tdEnd

Now you should be able to extract the data using:
for toks,s,e in mp3Entry.scanString(targetHTML):
    print("%s %s %s" % (toks.number, toks.starta.href, toks.name))

Good luck!
-- Paul
 
Paul McGuire

Oof! That should be:

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + SkipTo(aStart) + aStart + \
           SkipTo(tdEnd).setResultsName("name") + tdEnd
 
Dennis Lee Bieber

I want to use re because I want to extract something from an html page.
It would be very complicated without re. But while using re, I found
that I must exclude a whole word, "</td>", and there are certainly many
"</td>" in this html.

Yeesh... Wouldn't it be faster to use an HTML parser (I think there
is one in the standard library) that just doesn't emit anything for the
particular tags in question (and, at the simplest, just copies
everything else to the output unchanged)?
 
Paul McGuire

I just reviewed what the re "\s" signifies: whitespace. This is easy:
pyparsing ignores all intervening whitespace by default. So mp3Entry
simplifies to:

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + aStart + \
           SkipTo(tdEnd).setResultsName("name") + tdEnd

which leads me to another question: isn't there a closing </a> in
there somewhere, probably at the end of the name? If so, then you
might be better off with:

mp3Entry = valign + number.setResultsName("number") + tdEnd + \
           tdStart + aStart + \
           SkipTo(aEnd).setResultsName("name") + aEnd + tdEnd

-- Paul
 
