Tips Re Pattern Matching / REGEX


egonslokar

Hello Python Community,

I have a large text file (1GB or so) with structure similar to the
html example below.

I have to extract content (the text between div and tr tags) from this
file and put it into a spreadsheet or a database. Given my limited
Python knowledge, I was going to try to do this with regex pattern
matching.

Would someone be able to provide pointers on how to approach this?
Any code samples would be greatly appreciated.

Thanks.

Sam



<html>

\\ there are hundreds of thousands of items

\\Item1

<div class="ItemHead">123</div>
.....
<div class="special">Text1: What do I do with these lines
That span several rows? </div>
....
<tr tag="ItemFoot">Foot</tr>

\\Item2

<div class="ItemHead">First Line Can go here
But the second line can go here</div>
....
<tr tag="ItemFoot">Foot
Can span
Over several <b>pages</b></tr>


\\Item3

<div class="ItemHead">First Line Can go here
But the second line can go here</div>
....
<div class="special">This can
Span several rows</div>

</html>
 

Miki

Hello,
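The ultimate tool for handling HTML is BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/), where you can do stuff like:

    from bs4 import BeautifulSoup  # package name for BeautifulSoup 4

    soup = BeautifulSoup(html, "html.parser")  # html holds the document text
    for div in soup("div", {"class": "special"}):
        ...  # handle each matching div here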

Not sure how fast it is though.
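
If speed does turn out to be a problem, the regex route from the question can work for markup as regular as the sample. Here is a minimal sketch, assuming the wanted blocks are never nested inside each other and that the attribute values are exactly the ones shown in the sample (ItemHead, special, ItemFoot). The names BLOCK_RE, TAG_RE and extract_blocks are made up for illustration.

```python
import re

# Matches <div class="ItemHead">, <div class="special"> and <tr tag="ItemFoot">
# blocks. DOTALL lets "." cross newlines, so multi-line content is captured.
BLOCK_RE = re.compile(
    r'<(div|tr)\s+(?:class|tag)="(ItemHead|special|ItemFoot)"\s*>(.*?)</\1>',
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")  # removes inline tags such as <b>


def extract_blocks(text):
    """Yield (label, content) pairs in document order."""
    for match in BLOCK_RE.finditer(text):
        label, body = match.group(2), match.group(3)
        body = TAG_RE.sub("", body)
        # Collapse newlines and runs of whitespace into single spaces.
        yield label, " ".join(body.split())


sample = '''<div class="ItemHead">123</div>
.....
<div class="special">Text1: What do I do with these lines
That span several rows? </div>
....
<tr tag="ItemFoot">Foot
Can span
Over several <b>pages</b></tr>'''

rows = list(extract_blocks(sample))
for label, content in rows:
    print(label, "|", content)
```

Note that this needs the text in memory as one string. For a 1GB file you would probably have to process it in pieces, or try mmap with a bytes pattern, so treat it as a starting point only.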

There is also the htmllib module that comes with Python 2 (replaced by
html.parser in Python 3); it might do the work as well, and maybe a bit
faster. If the file is well-formed XML (valid HTML alone is not enough)
and you need some speed, have a look at xml.sax.
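
A rough sketch of the event-driven approach, using html.parser (the HTML parser in the Python 3 standard library) rather than xml.sax, because it tolerates markup that is not well-formed XML. The file is fed to the parser in chunks, so it never has to fit in memory at once. The class ItemExtractor, the function extract, the WANTED set and the chunk size are all my own assumptions for illustration, and the sketch assumes the wanted elements are not nested inside each other.

```python
import csv
import io
from html.parser import HTMLParser

# (tag, attribute name, attribute value) triples taken from the sample.
WANTED = {("div", "class", "ItemHead"), ("div", "class", "special"),
          ("tr", "tag", "ItemFoot")}


class ItemExtractor(HTMLParser):
    """Collect the text inside the wanted div/tr elements."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished (label, text) pairs
        self._open = None   # (tag, label) of the element being captured
        self._parts = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if self._open is not None:
            return  # an inline tag such as <b> inside a captured element
        for name, value in attrs:
            if (tag, name, value) in WANTED:
                self._open = (tag, value)
                self._parts = []
                break

    def handle_data(self, data):
        if self._open is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            # Join the fragments and collapse whitespace and newlines.
            text = " ".join("".join(self._parts).split())
            self.rows.append((self._open[1], text))
            self._open = None


def extract(stream, chunk_size=64 * 1024):
    """Feed the parser chunk by chunk and return the collected rows."""
    parser = ItemExtractor()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()
    return parser.rows


sample = """<html>
<div class="ItemHead">123</div>
.....
<div class="special">Text1: What do I do with these lines
That span several rows? </div>
....
<tr tag="ItemFoot">Foot
Can span
Over several <b>pages</b></tr>
</html>"""

# Tiny chunk size here only to exercise the chunked feeding.
rows = extract(io.StringIO(sample), chunk_size=16)

out = io.StringIO()  # stands in for a real file opened with newline=""
csv.writer(out).writerows(rows)
```

For the real file you would pass an open file object instead of the StringIO, and write to a real CSV file that a spreadsheet can open. With hundreds of thousands of items it may be better to write each row as soon as it is finished instead of keeping them all in a list.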

HTH,
 
