What do I do to read html files on my pc?

mikcec82 · Aug 27, 2012

Hello,

I have an HTML file on my PC and I want to read it to extract some text.
Can you tell me which libraries I should use and how to do it?

Thank you so much.

Michele

Chris Angelico · Aug 27, 2012

Try BeautifulSoup. You can find it at the opposite end of a web search.

Not trying to be unhelpful, but without more description of the
problem, there's not a lot more to say.

ChrisA

Mark Lawrence · Aug 27, 2012

Type something like "python html parsing" into the box of your favourite
search engine, hit return and follow the links it comes back with.
Write some code. If you have problems give us the smallest code snippet
that reproduces the issue together with the complete traceback and we'll
help.

mikcec82 · Aug 27, 2012

Hi ChrisA, Hi Mark.
Thanks a lot.

I have this HTML data and I want to check whether it contains the string "XXXX" and/or the string "NOT PASSED":

</th>
<td>
<samp>
 
 
 
 
 
</samp>
XXXX
</td>
</tr>
<tr>
..
..
..
<th/>
<th/>
</tr>
<tr align="left" style="color: red">
<th/>
<th>
CODE CHECK
</th>
<th>
: NOT PASSED
</th>
</tr>
<tr>
<th/>

Depending on this check I have to fill a cell in an Excel file with the answer: NOK (if NOT PASSED or XXXX is present), or OK (if neither NOT PASSED nor XXXX is present).

Thanks again for your help (and sorry for my English).

Joel Goldstick · Aug 27, 2012

From your example it doesn't seem there is enough information to know
where in the HTML your strings will be.

If you just read the whole file into a string you can do this:
.... print 'yes'
....
yes
Of course you will be testing for 'XXXX' or 'NOT PASSED'
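
A minimal sketch of that substring test, written for Python 3 (the snippets in this thread use Python 2 `print` statements). The function name `verdict` and the sample strings are made up for illustration; the NOK/OK rule is the one described in the question.

```python
def verdict(html_text):
    """Return 'NOK' if either marker string occurs anywhere in the text."""
    # The `in` operator on strings is a plain substring search.
    if 'XXXX' in html_text or 'NOT PASSED' in html_text:
        return 'NOK'
    return 'OK'


print(verdict('<th>: NOT PASSED</th>'))  # NOK
print(verdict('<th>: PASSED</th>'))      # OK
```

This looks at the raw text only, so it would also report a hit if one of the strings happened to appear inside a tag or attribute.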

Chris Angelico · Aug 27, 2012

I have this html data and I want to check if it is present a string "XXXX" or/and a string "NOT PASSED":

Start by scribbling down some notes in your native language (that is,
don't bother trying to write code yet), defining exactly what you're
looking for. What constitutes a hit? What would be a false positive
that you need to avoid? For instance:

* The string XXXX must occur outside of any HTML tag.
or:
* The string XXXX must occur inside a <td> but not inside <samp>.
or:
* The string XXXX must be in the first <td> inside of a <tr> in the
<table> that immediately follows the text "abcdefg".

Make sure it's clear enough that anybody could follow it, even without
knowing everything you know about your files. Once you have that
algorithmic description, it's simply a matter of translating it into a
language the computer can handle; and that's fairly straightforward.
An hour or two with language/library documentation and you'll quite
possibly have working code, or if you don't, you'll at least have
something that you can show to the list and ask for help with.

But until you have that, advice from this list is going to be fairly
vague, and may turn out to be quite misleading. We can't solve your
problem until we know what it is, and you can't tell us what the
problem is until you know yourself.
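
To illustrate how one of the example rules above might translate into code, here is a rough Python 3 sketch of "XXXX must occur inside a <td> but not inside <samp>", using the standard library's `html.parser`. The class name `TdTextFinder` and the helper `xxxx_in_td` are invented for this example, and the rule itself is only one of the candidate definitions listed above, not a confirmed requirement.

```python
from html.parser import HTMLParser


class TdTextFinder(HTMLParser):
    """Collect text that sits inside <td> but outside any <samp>."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0
        self.samp_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.td_depth += 1
        elif tag == 'samp':
            self.samp_depth += 1

    def handle_endtag(self, tag):
        if tag == 'td' and self.td_depth:
            self.td_depth -= 1
        elif tag == 'samp' and self.samp_depth:
            self.samp_depth -= 1

    def handle_data(self, data):
        # Keep only text that is inside a <td> and not inside a <samp>.
        if self.td_depth and not self.samp_depth:
            self.chunks.append(data)


def xxxx_in_td(html_text):
    """True if 'XXXX' appears in <td> text outside of <samp>."""
    finder = TdTextFinder()
    finder.feed(html_text)
    finder.close()
    return any('XXXX' in chunk for chunk in finder.chunks)


print(xxxx_in_td('<tr><td><samp>&nbsp;</samp>\nXXXX\n</td></tr>'))
```

A different rule would need different bookkeeping in the handlers, which is why pinning the rule down first matters.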

ChrisA

Jean-Michel Pichavant · Aug 27, 2012

HTML is not a format you want to extract data from, mainly because it
is the endpoint of both content AND display: what parses properly
today may not parse tomorrow because someone changed the
background color.
You should change your server so that it can feed a client with data (XML, for
instance, is quite close to the HTML syntax: it's based on tags and is
suitable for data).

JM

mikcec82 · Aug 28, 2012

Thank you to all.

Hi Chris, thank you for your hint. I'll try to do as you said and to be clear:

I have to work on an HTML file. This file is not a website file, nor does it come from the internet.
It is a file created by local software (where "local" means "on my PC").

On this file, I need to do the following:

1) Open the file
2) Check for occurrences of these strings:
2a) XXXX, in this case I have this code:

<tr style="font-size: 10" align="left">
<th>
</th><th>
DTC CODE Read:
</th>
<td>
<samp>
 
 
 
 
 
</samp>
XXXX
</td>
</tr>

2b) NOT PASSED, in this case I have this code:

<tr style="color: red" align="left">
<th>
</th><th>
CODE CHECK
</th>
<th>
: NOT PASSED
</th>
</tr>
Note: the color in <tr style="color: red" align="left"> can be "red" or "orange".

2c) OK or PASSED

3) Then, I need to fill an Excel file following these rules:
3a) If 2a or 2b occurs in the HTML file, I'll write NOK in the Excel file
3b) If 2c occurs in the HTML file, I'll write OK in the Excel file

Notes:
1) In this example, in case 2b, I have "CODE CHECK" in the code, but I could also have "TEXT CHECK" or "CHAR CHECK".
2) The search for occurrences can be done either by tag (<tr style="color: red" align="left">) or by text (NOT PASSED, PASSED). I would prefer to use the first method.
==================================================
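
The tag-based method mentioned in note 2 could look roughly like the following Python 3 sketch, which uses the standard library's `html.parser` to collect the text of every <tr> whose style sets the color to red or orange. The names `FailedRowFinder` and `flagged_rows` are hypothetical, and the sketch assumes the style attribute contains nothing but the color, as in the sample above. It covers case 2b only; the XXXX row in 2a has a different style and would need its own check.

```python
from html.parser import HTMLParser


class FailedRowFinder(HTMLParser):
    """Collect the text of each <tr> styled with color red or orange."""

    def __init__(self):
        super().__init__()
        self.in_flagged_row = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            style = dict(attrs).get('style') or ''
            # Normalise "color: red", "color:red;" and similar spellings.
            compact = style.replace(' ', '').rstrip(';').lower()
            self.in_flagged_row = compact in ('color:red', 'color:orange')
            if self.in_flagged_row:
                self.rows.append('')

    def handle_endtag(self, tag):
        if tag == 'tr':
            self.in_flagged_row = False

    def handle_data(self, data):
        if self.in_flagged_row:
            self.rows[-1] += data


def flagged_rows(html_text):
    """Return the whitespace-collapsed text of every flagged row."""
    finder = FailedRowFinder()
    finder.feed(html_text)
    finder.close()
    return [' '.join(row.split()) for row in finder.rows]


sample = ('<tr style="color: red" align="left">\n<th>\n</th><th>\n'
          'CODE CHECK\n</th>\n<th>\n: NOT PASSED\n</th>\n</tr>')
print(flagged_rows(sample))  # ['CODE CHECK : NOT PASSED']
```

An empty result from `flagged_rows` would then mean that no check row was marked as failed.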

In my script I have used the second way of searching, i.e.:

**
fileorig = "C:\Users\Mike\Desktop\\2012_05_16_1___p0201_13.html"

f = open(fileorig, 'r')
nomefile = f.read()

for x in nomefile:
    if 'XXXX' in nomefile:
        print 'NOK'
    else :
        print 'OK'
**
But this one works on characters and not on strings (i.e. this way I have searched character by character, not string by string).

===============================================

I hope I was clear.

Thanks for your help
Michele

Oscar Benjamin · Aug 28, 2012

f = open(fileorig, 'r')
nomefile = f.read()

for x in nomefile:
    if 'XXXX' in nomefile:
        print 'NOK'
    else :
        print 'OK'

You don't need the for loop. Just do:

nomefile = f.read()
if 'XXXX' in nomefile:
    print('NOK')

Oscar

mikcec82 · Aug 28, 2012

Hi Oscar,
I tried as you said and I've developed the code as you will see.
But I have a situation like this in an HTML file, in which a string (XX in this case) is repeated:
CODE Target: 0201
CODE Read: XXXX
CODE CHECK : NOT PASSED
TEXT Target: 13
TEXT Read: XX
TEXT CHECK : NOT PASSED
CHAR Target: AA
CHAR Read: XX
CHAR CHECK : NOT PASSED

With this code (built starting from yours):

index = nomefile.find('XXXX')
print 'XXXX_ found at location', index

index2 = nomefile.find('XX')
print 'XX_ found at location', index2

found = nomefile.find('XX')
while found > -1:
    print "XX found at location", found
    found = nomefile.find('XX', found+1)

I have an answer like this:

XXXX_ found at location 51315
XX_ found at location 51315
XX found at location 51315
XX found at location 51316
XX found at location 51317
XX found at location 52321
XX found at location 53328

I did this to find all occurrences of the 'XXXX' and 'XX' strings. But, as you can see, the script also finds occurrences of XX at locations 51315, 51316 and 51317, which belong to the string XXXX.

Is there a way to search for all occurrences of XX while skipping the XXXX locations?

Thank you.
Michele

Peter Otten · Aug 28, 2012

Remove the false positives afterwards:

start = nomefile.find("XX")
while start != -1:
    if nomefile[start:start+4] == "XXXX":
        start += 4
    else:
        print "XX found at location", start
        start += 3
    start = nomefile.find("XX", start)

By the way, what do you want to do if there are runs of "X" with repeats
other than 2 or 4?
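
One way to sidestep the run-length question is a regular expression with lookarounds, which matches "XX" only when no further "X" touches it on either side. This is an alternative sketch in Python 3, not the approach used in the code above; the names `EXACTLY_TWO_X` and `find_isolated_xx` and the sample string are made up for illustration.

```python
import re

# "XX" that is neither preceded nor followed by another "X".
EXACTLY_TWO_X = re.compile(r'(?<!X)XX(?!X)')


def find_isolated_xx(text):
    """Return the start index of every run of exactly two X characters."""
    return [match.start() for match in EXACTLY_TWO_X.finditer(text)]


sample = 'CODE Read: XXXX TEXT Read: XX CHAR Read: XX'
print(find_isolated_xx(sample))  # [27, 41]
```

With this pattern, runs of three, four or more "X" are skipped entirely rather than being split into pairs.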

mikcec82 · Aug 29, 2012

Hi Peter, and thanks for your valuable help.
Fortunately, there aren't runs of "X" with repeats other than 2 or 4.
Starting from your code, I wrote this code (I'm posting it in case it helps other people):
f = open(fileorig, 'r')
nomefile = f.read()

start = nomefile.find("XX")
start2 = nomefile.find("NOT PASSED")
c0 = 0
c1 = 0
c2 = 0

while (start != -1) | (start2 != -1):

    if nomefile[start:start+4] == "XXXX":
        print "XXXX found at location", start
        start += 4
        c0 +=1
    elif nomefile[start:start+2] == "XX":
        print "XX found at location", start
        start += 2
        c1 +=1

    if nomefile[start2:start2+10] == "NOT PASSED":
        print "NOT PASSED found at location", start2
        start2 += 10
        c2 +=1

    start = nomefile.find("XX", start)
    start2 = nomefile.find("NOT PASSED", start2)

print "XXXX %s founded" % c0, "\nXX %s founded" % c1, "\nNOT PASSED %s founded" % c2

Now I'm able to find all occurrences of the strings "XXXX", "XX" and "NOT PASSED".
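
For comparison, the same three counts can probably be obtained in a single pass with `re` and `collections.Counter`. This is a Python 3 sketch under the same assumption as above (runs of "X" are only ever 2 or 4 long); `MARKERS` and `count_markers` are names invented for this example.

```python
import re
from collections import Counter

# Alternatives are tried left to right, so "XXXX" wins over "XX" at the
# same position and a run of four is not counted as two pairs.
MARKERS = re.compile(r'XXXX|XX|NOT PASSED')


def count_markers(text):
    """Count non-overlapping occurrences of each marker string."""
    return Counter(MARKERS.findall(text))


counts = count_markers('CODE Read: XXXX CODE CHECK : NOT PASSED TEXT Read: XX')
print(counts['XXXX'], counts['XX'], counts['NOT PASSED'])  # 1 1 1
```

A `Counter` returns 0 for markers that never occur, so the result can be tested directly when deciding between NOK and OK.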

Thank you so much.

Umesh Sharma · Aug 29, 2012

You can use the httplib library to download the HTML, and then to extract the text from it you can either use a library (search for one) or a regular expression.
