What do I do to read html files on my pc?

mikcec82

Hallo,

I have an HTML file on my PC and I want to read it to extract some text.
Can you tell me which libraries I should use and how to do it?

Thank you so much.

Michele
 
Chris Angelico

Try BeautifulSoup. You can find it at the opposite end of a web search.

Not trying to be unhelpful, but without more description of the
problem, there's not a lot more to say :)

ChrisA
 
Mark Lawrence

Type something like "python html parsing" into the box of your favourite
search engine, hit return and follow the links it comes back with.
Write some code. If you have problems give us the smallest code snippet
that reproduces the issue together with the complete traceback and we'll
help.
 
mikcec82

On Monday, 27 August 2012 at 12:59:02 UTC+2, mikcec82 wrote:
Hallo,

I have an html file on my pc and I want to read it to extract some text.
Can you help on which libs I have to use and how can I do it?

thank you so much.

Michele

Hi ChrisA, Hi Mark.
Thanks a lot.

I have this HTML data and I want to check whether it contains the string "XXXX" and/or the string "NOT PASSED":

</th>
<td>
<samp>
&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;
</samp>
XXXX
</td>
</tr>
<tr>
..
..
..
<th/>
<th/>
</tr>
<tr align="left" style="color: red">
<th/>
<th>
CODE CHECK
</th>
<th>
: NOT PASSED
</th>
</tr>
<tr>
<th/>

Depending on this check, I have to fill a cell in an Excel file with the answer: NOK (if "NOT PASSED" or "XXXX" is present), or OK (if neither "NOT PASSED" nor "XXXX" is present).

Thanks again for your help (and sorry for my English)
 
Joel Goldstick

From your example, there doesn't seem to be enough information to know
where in the HTML your strings will be.

If you just read the whole file into a string you can do this:
.... print 'yes'
....
yes
Of course, you will be testing for 'XXXX' or 'NOT PASSED'.
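
The substring test described here can be sketched as follows. This is a minimal illustration in Python 3 syntax, not the code from the original reply; the function name `verdict` and the sample strings are made up for the example.

```python
def verdict(html_text):
    """Return 'NOK' if either marker string occurs anywhere in the text."""
    # The `in` operator on strings tests for a substring.
    if "XXXX" in html_text or "NOT PASSED" in html_text:
        return "NOK"
    return "OK"

# Hypothetical fragments standing in for the contents of the file.
sample = "<th>CODE CHECK</th><th>: NOT PASSED</th>"
print(verdict(sample))
print(verdict("<th>: PASSED</th>"))
```

This treats the file as plain text, so a marker that appears inside a tag or attribute would also count as a hit.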
 
Chris Angelico

mikcec82 said:
I have this html data and I want to check if it is present a string "XXXX" or/and a string "NOT PASSED":

Start by scribbling down some notes in your native language (that is,
don't bother trying to write code yet), defining exactly what you're
looking for. What constitutes a hit? What would be a false positive
that you need to avoid? For instance:

* The string XXXX must occur outside of any HTML tag.
or:
* The string XXXX must occur inside a <td> but not inside <samp>.
or:
* The string XXXX must be in the first <td> inside of a <tr> in the
<table> that immediately follows the text "abcdefg".

Make sure it's clear enough that anybody could follow it, even without
knowing everything you know about your files. Once you have that
algorithmic description, it's simply a matter of translating it into a
language the computer can handle; and that's fairly straightforward.
An hour or two with language/library documentation and you'll quite
possibly have working code, or if you don't, you'll at least have
something that you can show to the list and ask for help with.

But until you have that, advice from this list is going to be fairly
vague, and may turn out to be quite misleading. We can't solve your
problem until we know what it is, and you can't tell us what the
problem is until you know yourself.

ChrisA
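
As an illustration of translating one such rule into code, here is a rough sketch of the second example rule ("XXXX must occur inside a <td> but not inside <samp>") using the standard library's `html.parser` in Python 3. The class name `CellTextFinder`, the helper `xxxx_in_cell`, and the sample markup are assumptions made for this example; it also assumes the cell text is exactly "XXXX" once surrounding whitespace is stripped.

```python
from html.parser import HTMLParser

class CellTextFinder(HTMLParser):
    """Collect text that sits inside <td> but outside <samp>."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0
        self.samp_depth = 0
        self.cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.td_depth += 1
        elif tag == "samp":
            self.samp_depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.td_depth:
            self.td_depth -= 1
        elif tag == "samp" and self.samp_depth:
            self.samp_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found in a <td> and not in a <samp>.
        if self.td_depth and not self.samp_depth and data.strip():
            self.cell_text.append(data.strip())

def xxxx_in_cell(html_text):
    """True if some <td> holds the text XXXX outside of any <samp>."""
    finder = CellTextFinder()
    finder.feed(html_text)
    finder.close()
    return "XXXX" in finder.cell_text

# Hypothetical fragment modelled on the markup posted in this thread.
sample = """
<tr><th>DTC CODE Read:</th>
<td>
<samp>
&nbsp;
&nbsp;
</samp>
XXXX
</td>
</tr>
"""
print(xxxx_in_cell(sample))
```

A different rule from the list would need different bookkeeping, which is why the written description has to come first.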
 
Jean-Michel Pichavant

mikcec82 said:
[snip]
<th/>
<th/>
</tr>
<tr align="left" style="color: red">
<th/>
<th>
CODE CHECK
</th>
<th>
: NOT PASSED
</th>
</tr>
<tr>
<th/>

Depending on this check I have to fill a cell in an excel file with answer: NOK (if Not passed or XXXX is present), or OK (if Not passed and XXXX are not present).

Thanks again for your help (and sorry for my english)

HTML is not a format you want to extract data from, mainly because it is
the endpoint of content AND display: what parses properly today may not
parse tomorrow because someone changed the background color.
You should change your server so that it can feed a client with data (XML,
for instance, is quite close to the HTML syntax; it is based on tags and is
suitable for data).

JM
 
mikcec82

Thank you to all.

Hi Chris, thank you for your hint. I'll try to do as you said and be clear:

I have to work on an HTML file. This file is not a website file, nor does it come from the internet.
It is a file created by local software (where "local" means "on my PC").

On this file, I need to perform these operations:

1) Open the file
2) Check for occurrences of the strings:
2a) XXXX, in this case I have this code:

<tr style="font-size: 10" align="left">
<th>
</th><th>
DTC CODE Read:
</th>
<td>
<samp>
&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;
</samp>
XXXX
</td>
</tr>

2b) NOT PASSED, in this case I have this code:

<tr style="color: red" align="left">
<th>
</th><th>
CODE CHECK
</th>
<th>
: NOT PASSED
</th>
</tr>
Note: color in "<tr style="color: red" align="left">" can be "red" or "orange"

2c) OK or PASSED

3) Then, I need to fill an Excel file following these rules:
3a) If 2a or 2b occurs in the HTML file, I'll write NOK in the Excel file
3b) If 2c occurs in the HTML file, I'll write OK in the Excel file

Note:
1) In this example, in case 2b, I have "CODE CHECK" in the code, but I could also have "TEXT CHECK" or "CHAR CHECK".
2) The search for occurrences can be done either by tag ("<tr style="color: red" align="left">") or by text (NOT PASSED, PASSED). I would prefer to use the first method.
==================================================

In my script I used the second approach, i.e.:

**
fileorig = r"C:\Users\Mike\Desktop\2012_05_16_1___p0201_13.html"

f = open(fileorig, 'r')
nomefile = f.read()

for x in nomefile:
    if 'XXXX' in nomefile:
        print 'NOK'
    else :
        print 'OK'
**
But this one works on characters and not on strings (i.e., this way I searched not string by string, but character by character).

===============================================

I hope I was clear.

Thanks for your help
Michele
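
Steps 1 to 3 above could be sketched roughly as follows, in Python 3 syntax. The function name `check_report`, the UTF-8 encoding, and the temporary demonstration files are assumptions for this example. Only the text markers are checked, and rule 2c is reduced to the word "PASSED" to keep the sketch short. The `print(list("NOK"))` line shows why a `for` loop over the file contents runs once per character.

```python
import os
import tempfile

def check_report(path):
    """Return 'NOK', 'OK', or None for a report file (rules 3a and 3b)."""
    # Assumption: the report is UTF-8 text; adjust the encoding if needed.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    # Test the failure markers first: "PASSED" is a substring of
    # "NOT PASSED", so the order of these checks matters.
    if "XXXX" in text or "NOT PASSED" in text:
        return "NOK"
    if "PASSED" in text:
        return "OK"
    return None  # neither kind of marker was found

# Iterating over a string yields one character per step.
print(list("NOK"))

# Hypothetical reports written to a temporary folder for the demonstration.
with tempfile.TemporaryDirectory() as folder:
    bad = os.path.join(folder, "bad.html")
    good = os.path.join(folder, "good.html")
    with open(bad, "w", encoding="utf-8") as f:
        f.write("<th>CODE CHECK</th><th>: NOT PASSED</th>")
    with open(good, "w", encoding="utf-8") as f:
        f.write("<th>CODE CHECK</th><th>: PASSED</th>")
    bad_result = check_report(bad)
    good_result = check_report(good)

print(bad_result, good_result)
```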
 
Oscar Benjamin

mikcec82 said:
f = open(fileorig, 'r')
nomefile = f.read()

for x in nomefile:
    if 'XXXX' in nomefile:
        print 'NOK'
    else :
        print 'OK'

You don't need the for loop. Just do:

nomefile = f.read()
if 'XXXX' in nomefile:
    print('NOK')

Oscar
 
mikcec82

Hi Oscar,
I tried what you suggested and developed the code shown below.
But I have a situation in an HTML file in which a string is repeated (XX in this case):
CODE Target: 0201
CODE Read: XXXX
CODE CHECK : NOT PASSED
TEXT Target: 13
TEXT Read: XX
TEXT CHECK : NOT PASSED
CHAR Target: AA
CHAR Read: XX
CHAR CHECK : NOT PASSED

With this code (created starting from yours)

index = nomefile.find('XXXX')
print 'XXXX_ found at location', index

index2 = nomefile.find('XX')
print 'XX_ found at location', index2

found = nomefile.find('XX')
while found > -1:
    print "XX found at location", found
    found = nomefile.find('XX', found+1)

I have an answer like this:

XXXX_ found at location 51315
XX_ found at location 51315
XX found at location 51315
XX found at location 51316
XX found at location 51317
XX found at location 52321
XX found at location 53328

I did this to find all occurrences of the 'XXXX' and 'XX' strings. But, as you can see, the script also finds occurrences of XX at locations 51315, 51316 and 51317, which belong to the string XXXX.

Is there a way to search for all occurrences of XX while skipping the XXXX locations?

Thank you.
Michele
 
Peter Otten

mikcec82 said:

Is there a way to search for all occurrences of XX while skipping the XXXX locations?

Remove the false positives afterwards:

start = nomefile.find("XX")
while start != -1:
    if nomefile[start:start+4] == "XXXX":
        start += 4
    else:
        print "XX found at location", start
        start += 3
    start = nomefile.find("XX", start)

By the way, what do you want to do if there are runs of "X" with repeats
other than 2 or 4?
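
As an alternative to the loop above (a different technique, not the one proposed in this reply), a regular expression with lookarounds can match "XX" only when it is not part of a longer run of X. This is a minimal sketch in Python 3; the name `EXACTLY_TWO` and the sample text are made up for the example.

```python
import re

# "XX" that is neither preceded nor followed by another "X".
EXACTLY_TWO = re.compile(r"(?<!X)XX(?!X)")

# Hypothetical text modelled on the report lines quoted in this thread.
text = "CODE Read: XXXX\nTEXT Read: XX\nCHAR Read: XX\n"

positions = [m.start() for m in EXACTLY_TWO.finditer(text)]
print(positions)  # [27, 41]
```

With this pattern, runs of three, four or more X are skipped entirely, so it also gives one possible answer to the question about other run lengths.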
 
mikcec82

Hi Peter, and thanks for your precious help.
Fortunately, there aren't runs of "X" with repeats other than 2 or 4.
Starting from your code, I wrote this code (I'm posting it in case it is helpful for other people):
f = open(fileorig, 'r')
nomefile = f.read()

start = nomefile.find("XX")
start2 = nomefile.find("NOT PASSED")
c0 = 0
c1 = 0
c2 = 0

while (start != -1) or (start2 != -1):

    if nomefile[start:start+4] == "XXXX":
        print "XXXX found at location", start
        start += 4
        c0 += 1
    elif nomefile[start:start+2] == "XX":
        print "XX found at location", start
        start += 2
        c1 += 1

    if nomefile[start2:start2+10] == "NOT PASSED":
        print "NOT PASSED found at location", start2
        start2 += 10
        c2 += 1

    start = nomefile.find("XX", start)
    start2 = nomefile.find("NOT PASSED", start2)

print "XXXX %s found" % c0, "\nXX %s found" % c1, "\nNOT PASSED %s found" % c2

Now I'm able to find all occurrences of the strings "XXXX", "XX" and "NOT PASSED".


Thank you so much.
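
For comparison, the same three counts can be obtained more compactly with a regular expression and `collections.Counter`. This is only a sketch in Python 3 under the assumption stated above (runs of X are always 2 or 4 long); the names `MARKERS` and `count_markers` and the sample text are made up for the example.

```python
import re
from collections import Counter

# "XXXX" is listed before "XX" so that a run of four is matched as a
# whole instead of being split into two "XX" hits.
MARKERS = re.compile(r"XXXX|XX|NOT PASSED")

def count_markers(text):
    """Count non-overlapping occurrences of each marker string."""
    return Counter(MARKERS.findall(text))

# Hypothetical text modelled on the report lines quoted in this thread.
sample = (
    "CODE Read: XXXX\nCODE CHECK : NOT PASSED\n"
    "TEXT Read: XX\nTEXT CHECK : NOT PASSED\n"
    "CHAR Read: XX\nCHAR CHECK : NOT PASSED\n"
)

counts = count_markers(sample)
for marker in ("XXXX", "XX", "NOT PASSED"):
    print(marker, counts[marker], "found")
```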
 
Umesh Sharma

You can use the httplib library to download the HTML, and then to extract the text from it you can either use a library (search the web for one) or use regular expressions.
 
