Splitting on a word

qwweeeit · Jul 13, 2005

Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code, I found that an essential step is:
splitting on a word (in this case 'href').

I am asking if there is some alternative (more pythonic...):

# SplitMultichar.py

import re

# string s simulating an html file
s='ffy: ytrty <a href="www.python.org">python</a> fyt <A HREF="wwwx">wx</A> dtrtf'
p=re.compile(r'\bhref\b',re.I)

lHref=p.findall(s) # lHref=['href','HREF']
# for normal html files the lHref list has more elements
# (more web references)

c='~' # char to be used as delimiter
# c=chr(127) # char to be used as delimiter
for i in lHref:
    s=s.replace(i,c)

# s='ffy: ytrty <a ~="www.python.org">python</a> fyt <A ~="wwwx">wx</A> dtrtf'

list=s.split(c)
# list=['ffy: ytrty <a ', '="www.python.org">python</a> fyt <A ', '="wwwx">wx</A> dtrtf']
#=-----------------------------------------------------

If you save the original s string to xxx.html, any browser
can visualize it.
To be safe, I chose chr(127) as the delimiter,
which surely is not present in the html file.
Bye.

Robert Kern · Jul 13, 2005

Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code, I found that an essential step is:
splitting on a word (in this case 'href').

I am asking if there is some alternative (more pythonic...):

For *this* particular task, certainly. It begins with

import BeautifulSoup

The rest is left as a (brief) exercise for the reader.

As for the more general task of splitting strings using regular
expressions, see re.split().
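
A minimal sketch of that general case, assuming Python 3; the sample string and the '<sep>' delimiter are made up for illustration. It contrasts plain str.split (enough for a fixed multi-character delimiter) with re.split (needed once the delimiter is a pattern):

```python
import re

text = 'one<sep>two<SEP>three'

# A fixed multi-character delimiter needs no regex at all.
print(text.split('<sep>'))      # ['one', 'two<SEP>three']

# re.split handles patterns, here a case-insensitive delimiter.
pattern = re.compile(r'<sep>', re.IGNORECASE)
print(pattern.split(text))      # ['one', 'two', 'three']

# A capturing group around the pattern keeps the delimiters in the result.
keep = re.compile(r'(<sep>)', re.IGNORECASE)
print(keep.split(text))         # ['one', '<sep>', 'two', '<SEP>', 'three']
```

The capturing-group form is handy when you need to know which spelling of the delimiter was actually matched.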

--
Robert Kern
(e-mail address removed)

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter

Steven D'Aprano · Jul 13, 2005

Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code,

[red rag to bull]
Because it was too slow? Or just to prove what a macho programmer you are?

Is your code even working yet? If it isn't working, you shouldn't be
trying to optimize buggy code.

I found that an essential step is:
splitting on a word (in this case 'href').

Then just do it:

py> '<a href="web reference"> underlined reference</a>'.split('href')
['<a ', '="web reference"> underlined reference</a>']

If you are concerned about case issues, you can either convert the
entire HTML file to lowercase, or you might write a case-insensitive
regular expression to replace any "href" regardless of case with the
lowercase version.
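
The second option might look roughly like this (a sketch assuming Python 3; split_on_word is a hypothetical helper name, not something from the original post). Unlike lowercasing the whole file, it leaves the URLs and link text untouched:

```python
import re

def split_on_word(text, word):
    """Split text on word, treating any upper/lower-case spelling alike."""
    pattern = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    # Rewrite every whole-word case variant to the lowercase form ...
    normalised = pattern.sub(word.lower(), text)
    # ... then split plainly. Note that this plain split would also cut on a
    # lowercase occurrence embedded in a longer word.
    return normalised.split(word.lower())

s = 'ffy: ytrty <a href="www.python.org">python</a> fyt <A HREF="wwwx">wx</A> dtrtf'
parts = split_on_word(s, 'href')
print(parts)
# ['ffy: ytrty <a ', '="www.python.org">python</a> fyt <A ', '="wwwx">wx</A> dtrtf']
```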

[snip]

To be safe, I chose chr(127) as the delimiter,
which surely is not present in the html file.

I wouldn't bet my life on that. I've found some weird characters in HTML
files.
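
If the replace-then-split approach is kept anyway, one way to hedge that bet is to check for the delimiter before using it. This is only a sketch, assuming Python 3; split_with_sentinel is a hypothetical helper. Splitting with the regular expression directly avoids the sentinel altogether:

```python
import re

def split_with_sentinel(text, pattern, sentinel=chr(127)):
    """Replace-then-split, but refuse to run if the sentinel already occurs."""
    if sentinel in text:
        raise ValueError('sentinel character already present in input')
    # A function replacement avoids any backslash interpretation of sentinel.
    return pattern.sub(lambda match: sentinel, text).split(sentinel)

href = re.compile(r'\bhref\b', re.IGNORECASE)

clean = '<a href="x">x</a>'
dirty = '<a href="x">x' + chr(127) + '</a>'   # input that contains chr(127)

print(split_with_sentinel(clean, href))       # ['<a ', '="x">x</a>']

try:
    split_with_sentinel(dirty, href)
except ValueError as exc:
    print('refused:', exc)

# Splitting on the pattern itself needs no sentinel and cannot collide.
print(href.split(dirty))
```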

Joe · Jul 13, 2005

# string s simulating an html file
s='ffy: ytrty <a href="www.python.org">python</a> fyt <A HREF="wwwx">wx</A> dtrtf'
p=re.compile(r'\bhref\b',re.I)

list=p.split(s) #<<<<<<<<<<<<<<<<< gets you your final list.

good luck,

Joe

Bernhard Holzmayer · Jul 14, 2005

Hi all,
I am writing a script to visualize (and print)
the web references hidden in the html files as:
'<a href="web reference"> underlined reference</a>'
Optimizing my code, I found that an essential step is:
splitting on a word (in this case 'href').

I am asking if there is some alternative (more pythonic...):

Sure. The htmllib module provides HTMLParser.
Here's an example; run it with your HTML file as argument
and you'll see a list of all hrefs in the document.

#------------------------------------------------
#!/usr/bin/python
import htmllib

def test():
    import sys, formatter

    file = sys.argv[1]
    f = open(file, 'r')
    data = f.read()
    f.close()

    f = formatter.NullFormatter()
    p = htmllib.HTMLParser(f)
    p.feed(data)

    for a_link in p.anchorlist:
        print a_link

    p.close()

test()
#------------------------------------------------

I'm sure that this is far more Pythonic!
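
For readers on Python 3, where htmllib no longer exists, a rough equivalent can be sketched with the standard html.parser module. AnchorCollector and extract_links are hypothetical names chosen for this illustration; the anchorlist attribute only mimics the one used above. The sample string deliberately uses an upper-case tag and an unquoted attribute value:

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.anchorlist.append(value)

def extract_links(html):
    """Return the list of href values found in the given HTML text."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchorlist

s = 'ffy: ytrty <a href="www.python.org">python</a> fyt <A HREF=wwwx>wx</A> dtrtf'
print(extract_links(s))   # ['www.python.org', 'wwwx']
```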

Bernhard

qwweeeit · Jul 14, 2005

Hi all,
thanks for your contributions. To Robert Kern I can reply that I know
BeautifulSoup, but mine was meant to be a "generalization" (only
incidentally used in a web parsing application). The fact is that,
being a "macho newbie" programmer (the "macho" is from Steven
D'Aprano), I wanted to show what beautiful solutions I can find...
Luckily there is Joe, who shows me that most of my "beautiful" code
(working, of course!) can be replaced by:
list=p.split(s)
Bernard... don't get angry, but I prefer the solution of Joe. It is
more general, and, besides that, for me "pythonic" means simple and
short (I may be wrong...).
By the way, I have found an alternative solution to the problem of
making lists "unique" without sorting, though it is not "macho" enough...
Bye.

Bernhard Holzmayer · Jul 14, 2005

Bernard... don't get angry, but I prefer the solution of Joe.

Oh. If I got angry in such a case, I would have stopped responding to such
posts long ago.
You know the background... and you'll have to bear the consequences. ;-)

...
for me "pythonic" means simple and short (I may be wrong...).

It's your definition, isn't it?
One of the most important advantages of Python (for me!) besides its
readability is that it comes with "Batteries included", which means that I
can benefit from the work others did before, and that I can rely on its
quality.

The solution which I proposed is nothing but the test code from htmllib,
stripped down to the absolute minimum, enriched with the print command
to show the anchor list.

If I had to write production-level code of your sort, I'd take such an
off-the-shelf solution, because it minimizes the risk of failures.

Think only of issues like these:
- does your code find a tag like <A HREF= (capital letters)?
- does your code correctly handle incomplete tags like
<a href="linkadr"></a> or references with/without " ...?
- does it survive ill-coded html after all?

My experience is that it's usually better to rely on such
"library" code than to reinvent the wheel.

There's often a reason to take another approach.
I'd agree that a simple and short solution is fascinating.
However, every simple and short solution should be readable.
As a terrific example, here's a very tiny piece of code,
which does nothing but calculate the prime numbers up to 1000:

print filter(None,map(lambda y:y*reduce(lambda x,y:x*y!=0,
map(lambda x,y=y:y%x,range(2,int(pow(y,0.5)+1))),1),
range(2,1000)))

- simple (depends on your familiarity with stuff like map and lambda)
- short (compared with different solutions)
- and veeerrrryyy pythonic!
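
For contrast, the same computation can be written readably in a few more lines. This is a sketch in Python 3; primes_below is a hypothetical name, and it uses the same trial division up to the square root as the one-liner:

```python
def primes_below(limit):
    """Return all primes p with 2 <= p < limit, using trial division."""
    primes = []
    for candidate in range(2, limit):
        # Only divisors up to the square root need checking.
        is_prime = all(candidate % d != 0
                       for d in range(2, int(candidate ** 0.5) + 1))
        if is_prime:
            primes.append(candidate)
    return primes

print(primes_below(30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```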

Bernhard

qwweeeit · Jul 14, 2005

Hi Bernhard,
first, please excuse my English ("angry" is a little ...strong, but
my vocabulary is limited). I hope that the experts keep on helping us
newbies.
Even though I am a newbie (in Python), I disagree with you: my solution
(with the help of Joe) answers the problem of splitting a string
using a delimiter of more than one character (sometimes a word as
delimiter, but that is not required).
The code I supplied can be misleading because it is centered on web
parsing, but my request is more general (next time I will just ask the
question without examples!)
If I were a professional programmer I could agree with you on the
"Batteries included" concept and all the other considerations
("off-the-shelf solutions" and ...not reinventing the wheel).
Also, the terrific example you supply in order to caution me not to
follow dully (found in the dictionary) the "simple & short" concept
doesn't apply to me (too complicated!).
I am so far from being a real programmer that when an error occurs, I use
try/except (if it solves the problem) without caring about the source of
the mistake (...EAFP!).
So I don't care too much about possible future mistakes (even though the
code does take capital letters into account).
For the specific case I mentioned, if the closing ">" of a tag is
missing perhaps I will get wrong results... I will worry about that when
necessary (even if Murphy's law...).
Bye.
