regex help

David

Hi

I have a few regexes I need to write, but I'm struggling to come up with a
nice way of doing them, and more than anything I'm here to learn some
tricks and some neat code rather than just getting an answer - although
that's obviously what I would like to get to.

Problem 1 -

<span class="chg"
id="ref_678774_cp">(25.47%)</span><br>

I want to extract 25.47 from here - so far I've tried -

xPer = re.search('<span class="chg" id="ref_"'+str(xID.group(1))+'"_cp
\">(.*?)%', content)

and

xPer = re.search('<span class=\"chg\" id=\"ref_"+str(xID.group(1))+"_cp
\">\((\d*)%\)</span><br>', content)

neither of these seem to do what I want - am I not doing this
correctly? (obviously!)

Problem 2 -

<td>&nbsp;</td>

<td width="1%" class=key>Open:
</td>
<td width="1%" class=val>5.50
</td>
<td>&nbsp;</td>
<td width="1%" class=key>Mkt Cap:
</td>
<td width="1%" class=val>6.92M
</td>
<td>&nbsp;</td>
<td width="1%" class=key>P/E:
</td>
<td width="1%" class=val>21.99
</td>


I want to extract the open, mkt cap and P/E values - but apart from
doing loads of individual REs which I think would look messy, I can't
think of a better and neater looking way. Any ideas?

Cheers

David
 
Chris Rebert

David said:
I want to extract the open, mkt cap and P/E values - but apart from
doing loads of individual REs which I think would look messy, I can't
think of a better and neater looking way. Any ideas?

Use an actual HTML parser? Like BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/), for instance.

I will never understand why so many people try to parse/scrape
HTML/XML with regexes...

Cheers,
Chris
 
Tim Harig

You are downloading market data? Yahoo offers its stats in CSV format that
is easier to parse without a dedicated parser.

Chris Rebert said:
Use an actual HTML parser? Like BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/), for instance.

I agree with your sentiment exactly. If the regex he is trying to get is
difficult enough that he has to ask, then yes, he should be using a
parser.

Chris Rebert said:
I will never understand why so many people try to parse/scrape
HTML/XML with regexes...

Why? Because sometimes it is good enough to get the job done easily.
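
To illustrate the "good enough" case for Problem 2, here is a minimal sketch that pairs every key cell with the val cell that follows it using one pattern and re.findall, instead of one regex per field. The names content, PAIR_RE and extract_pairs are made up for this example, and the pattern assumes the markup always looks like the sample in the question (unquoted class attributes, a val cell directly after each key cell).

```python
import re

# Sample markup copied from the question.
content = """<td>&nbsp;</td>

<td width="1%" class=key>Open:
</td>
<td width="1%" class=val>5.50
</td>
<td>&nbsp;</td>
<td width="1%" class=key>Mkt Cap:
</td>
<td width="1%" class=val>6.92M
</td>
<td>&nbsp;</td>
<td width="1%" class=key>P/E:
</td>
<td width="1%" class=val>21.99
</td>
"""

# One key cell followed by one val cell. \s* absorbs the newlines around
# the cell text, and the optional ':' is dropped from the key label.
PAIR_RE = re.compile(
    r'<td[^>]*class=key>\s*([^<]*?):?\s*</td>\s*'
    r'<td[^>]*class=val>\s*([^<]*?)\s*</td>'
)


def extract_pairs(html):
    """Return a dict mapping each key label to its value string."""
    return dict(PAIR_RE.findall(html))


values = extract_pairs(content)
print(values)
```

This only holds up while the page layout stays exactly as shown; any change to attribute order or nesting would need a new pattern, which is the usual argument for a real parser.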
 
Rhodri James

David said:
xPer = re.search('<span class="chg" id="ref_"'+str(xID.group(1))+'"_cp
\">(.*?)%', content)


Supposing that str(xID.group(1)) == "678774", let's see how that string
concatenation turns out:

<span class="chg" id="ref_"678774"_cp">(.*?)%

The obvious problems here are the spurious double quotes, the spurious
(but harmless) escaping of a double quote, and the lack of (escaped)
backslash and (escaped) open parenthesis. The latter you can always
strip off later, but the first sinks the match rather thoroughly.
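
The concatenation can be checked directly by building the pattern and printing it before handing it to re.search. This is only a sketch: ref_id is a hypothetical stand-in for str(xID.group(1)), the line break inside the posted pattern is assumed to be mail wrapping, and target is the sample span from the question joined onto one line so that only the quoting problem is being tested.

```python
import re

# Hypothetical stand-in for str(xID.group(1)).
ref_id = "678774"

# The first attempt from the question, with the wrapped line rejoined.
pattern = '<span class="chg" id="ref_"' + ref_id + '"_cp\">(.*?)%'
print(pattern)

# The sample span from the question, on a single line.
target = '<span class="chg" id="ref_678774_cp">(25.47%)</span><br>'

# The stray quotes around the id mean the pattern cannot match.
result = re.search(pattern, target)
print(result)
```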

David said:

xPer = re.search('<span class=\"chg\" id=\"ref_"+str(xID.group(1))+"_cp
\">\((\d*)%\)</span><br>', content)

With only two single quotes present, the biggest problem should be obvious.

Unfortunately if you just fix the obvious in either of the two regular
expressions, you're setting yourself up for a fall later on. As The Fine
Manual says right at the top of the page on the re module
(http://docs.python.org/library/re.html), you want to be using raw string
literals when you're dealing with regular expressions, because you want
the backslashes getting through without being interpreted specially by
Python's own parser. As it happens you get away with it in this case,
since neither '\d' nor '\(' have a special meaning to Python, so aren't
changed, and '\"' is interpreted as '"', which happens to be the right
thing anyway.
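
Putting those points together, one possible repaired version of Problem 1 looks like the sketch below. It uses raw string literals throughout, escapes the literal parentheses, and replaces \d* (which cannot match the decimal point in 25.47) with a pattern for a decimal number. The names ref_id and content are assumptions for the example, and \s+ between the attributes is there because the sample markup has a line break between class and id.

```python
import re

# Sample markup from the question, including the line break between
# the class and id attributes.
content = '<span class="chg"\nid="ref_678774_cp">(25.47%)</span><br>'

# Hypothetical stand-in for str(xID.group(1)).
ref_id = "678774"

# Raw strings keep every backslash intact for the regex engine.
# re.escape guards against regex metacharacters in the id.
pattern = (
    r'<span class="chg"\s+id="ref_' + re.escape(ref_id) + r'_cp">'
    r'\(([-+]?\d+(?:\.\d+)?)%\)'
)

match = re.search(pattern, content)
percent = float(match.group(1)) if match else None
print(percent)

# Why raw strings matter: '\b' is one character (a backspace) to Python,
# while r'\b' is the two characters the regex engine needs.
print(len('\b'), len(r'\b'))
```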

David said:
I want to extract the open, mkt cap and P/E values - but apart from
doing loads of individual REs which I think would look messy, I can't
think of a better and neater looking way. Any ideas?

What you're trying to do is inherently messy. You might want to use
something like BeautifulSoup to hide the mess, but never having had
cause to use it myself I couldn't say for sure.
 
Peter Otten

David said:
I want to extract the open, mkt cap and P/E values - but apart from
doing loads of individual REs which I think would look messy, I can't
think of a better and neater looking way. Any ideas?

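>>> # Lines typed at the >>> prompt were lost from the archived post; the
>>> # import, the name "soup" and the loop header are reconstructed,
>>> # assuming BeautifulSoup 3 on Python 2.
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("""<td>&nbsp;</td>
...
... <td width="1%" class=key>Open:
... </td>
... <td width="1%" class=val>5.50
... </td>
... <td>&nbsp;</td>
... <td width="1%" class=key>Mkt Cap:
... </td>
... <td width="1%" class=val>6.92M
... </td>
... <td>&nbsp;</td>
... <td width="1%" class=key>P/E:
... </td>
... <td width="1%" class=val>21.99
... </td>""")
>>> for key in soup.findAll(attrs={"class": "key"}):
...     value = key.findNext(attrs={"class": "val"})
...     print key.string.strip(), "-->", value.string.strip()
...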
Open: --> 5.50
Mkt Cap: --> 6.92M
P/E: --> 21.99
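
For readers without BeautifulSoup installed, roughly the same key/value pairing can be sketched with html.parser from the Python 3 standard library. This swaps in a different parser from the one used above, and the class name KeyValParser and its attributes are invented for this example. It assumes every val cell comes after the key cell it belongs to, as in the sample markup.

```python
from html.parser import HTMLParser

SAMPLE = """<td>&nbsp;</td>

<td width="1%" class=key>Open:
</td>
<td width="1%" class=val>5.50
</td>
<td>&nbsp;</td>
<td width="1%" class=key>Mkt Cap:
</td>
<td width="1%" class=val>6.92M
</td>
<td>&nbsp;</td>
<td width="1%" class=key>P/E:
</td>
<td width="1%" class=val>21.99
</td>
"""


class KeyValParser(HTMLParser):
    """Collect (key, value) text pairs from <td class=key>/<td class=val>."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._cls = None    # class of the <td> currently open, if any
        self._text = []     # text chunks seen inside that <td>
        self._key = None    # last key label waiting for its value

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._cls = dict(attrs).get("class")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a key or val cell.
        if self._cls in ("key", "val"):
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "td":
            return
        text = "".join(self._text).strip()
        if self._cls == "key":
            self._key = text
        elif self._cls == "val" and self._key is not None:
            self.pairs.append((self._key, text))
            self._key = None
        self._cls = None


parser = KeyValParser()
parser.feed(SAMPLE)
parser.close()

for key, value in parser.pairs:
    print(key, "-->", value)
```

The event-driven style is more verbose than the findNext loop, but it avoids the third-party dependency.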
 
