egonslokar
Hello Python Community,
I have a large text file (1GB or so) with a structure similar to the
HTML example below.
I have to extract the content (the text between div and tr tags) from this
file and put it into a spreadsheet or a database. Given my limited
Python knowledge, I was going to try to do this with regex pattern
matching.
Would someone be able to provide pointers on how to approach
this? Any code samples would be greatly appreciated.
Thanks.
Sam
<html>
\\ there are hundreds of thousands of items
\\Item1
<div class="ItemHead">123</div>
.....
<div class="special">Text1: What do I do with these lines
That span several rows? </div>
....
<tr tag="ItemFoot">Foot</tr>
\\Item2
<div class="ItemHead">First Line Can go here
But the second line can go here</div>
....
<tr tag="ItemFoot">Foot
Can span
Over several <b>pages</b></tr>
\\Item3
<div class="ItemHead">First Line Can go here
But the second line can go here</div>
....
<div class="special">This can
Span several rows</div>
</html>
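
For reference, the regex idea described above could be sketched roughly as follows. This is only a minimal sketch under several assumptions: the div and tr blocks are never nested inside a block of the same kind, every block carries a class or tag attribute as in the sample, and the file is UTF-8. The names extract_blocks, read_chunks and write_csv, and the CSV column names, are made up for illustration. The file is read in chunks so the whole 1GB never has to sit in memory at once.

```python
import csv
import re

# Matches <div class="X">...</div> or <tr tag="X">...</tr>, including
# content that spans several lines (re.DOTALL lets "." match newlines).
BLOCK_RE = re.compile(
    r'<(div|tr)\s+(?:class|tag)="([^"]*)"\s*>(.*?)</\1>',
    re.DOTALL,
)
# Matches any inner markup such as <b> or </b>.
INNER_TAG_RE = re.compile(r'<[^>]+>')


def clean(text):
    """Drop inner tags and collapse line breaks and runs of spaces."""
    return ' '.join(INNER_TAG_RE.sub('', text).split())


def extract_blocks(chunks):
    """Yield (tag name, attribute value, text) for each complete block.

    `chunks` is any iterable of strings. Text after the last complete
    match is kept in a buffer, so a block split across two chunks is
    still found once its closing tag arrives.
    """
    buffer = ''
    for chunk in chunks:
        buffer += chunk
        last_end = 0
        for match in BLOCK_RE.finditer(buffer):
            yield match.group(1), match.group(2), clean(match.group(3))
            last_end = match.end()
        buffer = buffer[last_end:]


def read_chunks(path, size=1 << 20):
    """Read a text file piece by piece (about 1M characters at a time)."""
    with open(path, encoding='utf-8', errors='replace') as handle:
        while True:
            chunk = handle.read(size)
            if not chunk:
                break
            yield chunk


def write_csv(path_in, path_out):
    """Write one CSV row per extracted block."""
    with open(path_out, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(['tag', 'label', 'text'])
        writer.writerows(extract_blocks(read_chunks(path_in)))
```

Calling write_csv('items.html', 'items.csv') (both file names hypothetical) would produce a file a spreadsheet can open. If the real markup is messier than the sample, for instance with nested divs, a proper parser such as the standard library's html.parser is likely a safer choice than a regex.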