Regular Expression problem

John Blogger

(I don't know if this is the right place, so if I am wrong, please point
me in the right direction.
If you masters read this post, I'm honoured. If I get even a mere
response, I'm blessed!)

Hi,

I'm a newbie regular expression user, and I use regex in my Python
programs. I have a strange (or maybe not so strange, but please bear in
mind that I'm a newbie ;) problem: I want to extract a particular
attribute value from a tag in one of my HTML files.

That is, I want only the value after 'href=' in this tag:

'<link href="mystylesheet.css" rel="stylesheet" type="text/css">'

Here it would be 'mystylesheet.css'. I used the following regex to get
this value (I don't know if it is any good):

"<link\s+href=["]?(.*?)["]?\s+rel=["]?stylesheet["]?\s+type=["]?text/css["]?>"

I thought I was doing fine until I got stuck on this tag:

<link rel="stylesheet" href="mystylesheet.css" type="text/css">

It is the same tag, but with the 'href=' part in a different place. I
think you get the point!

So what should I do to get the exact value (here, the value after
'href=') in every case, even if the tags look like any of these?

<link rel="stylesheet" href="mystylesheet.css" type="text/css">
-OR-
<link href="mystylesheet.css" rel="stylesheet" type="text/css">
-OR-
<link type="text/css" href="mystylesheet.css" rel="stylesheet">
 
cdecarlo

Hey,

I'm new to regexes as well, but here is my idea. Since you don't know
which attribute will come first, why not structure your regex like
this:

(first off, I'll assume that \s == ' ', actually now that I think of
it, isn't \s any whitespace character? anyways \s == ' ' for now)

'<link\s*((\s*attribute1\s*)|(\s*attribute2\s*)|(\s*attribute3\s*))+>'

I think that should just about do it.

Hope this helped,

Colin
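
One way to make this idea concrete is sketched below, assuming Python's re module and the three attributes from the original post. The attribute patterns and the group name href are illustrative fill-ins for attribute1 to attribute3, not something given in this reply. (For the record, \s does match any whitespace character, not just a space.)

```python
import re

# Hypothetical fill-in for attribute1..attribute3: each alternative matches
# one attribute followed by optional whitespace, so they may come in any order.
LINK_RE = re.compile(
    r'''<link\s+
        (?:
            href="(?P<href>[^"]*)"\s*   # the value we want
          | rel="stylesheet"\s*
          | type="text/css"\s*
        )+
        >''',
    re.VERBOSE,
)


def find_href(tag):
    """Return the href value of a matching <link> tag, or None."""
    match = LINK_RE.search(tag)
    return match.group("href") if match else None


tags = [
    '<link rel="stylesheet" href="mystylesheet.css" type="text/css">',
    '<link href="mystylesheet.css" rel="stylesheet" type="text/css">',
    '<link type="text/css" href="mystylesheet.css" rel="stylesheet">',
]
for tag in tags:
    print(find_href(tag))
```

This sketch only matches tags that use double quotes and contain nothing but these three attributes, so it is easy to break with real-world HTML.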
 
Justin Azoff

John said:
That I want a particular tag value of one of my HTML files.

ie: I want only the value after 'href=' in the tag >>

'<link href="mystylesheet.css" rel="stylesheet" type="text/css">'

here it would be 'mystylesheet.css'. I used the following regex to get
this value(I dont know if it is good).

No matter how good it is, you should still use something that
understands HTML:
'mystylesheet.css'
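
The reply does not show which parser was used. A minimal sketch of the same idea, assuming Python 3 and the standard-library html.parser module (a swapped-in choice, not necessarily what the poster had in mind), might look like this. The class name LinkHrefCollector and the helper link_hrefs are made up for illustration.

```python
from html.parser import HTMLParser


class LinkHrefCollector(HTMLParser):
    """Collect the href attribute of every <link> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, already lowercased and
        # unquoted by the parser, so attribute order no longer matters.
        if tag == "link":
            attr_map = dict(attrs)
            if "href" in attr_map:
                self.hrefs.append(attr_map["href"])


def link_hrefs(html):
    """Hypothetical helper: return all <link> href values found in html."""
    collector = LinkHrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


print(link_hrefs('<link rel="stylesheet" href="mystylesheet.css" type="text/css">'))
# ['mystylesheet.css']
```

Because the parser hands over attributes as name/value pairs, the same code handles all three orderings from the original post, as well as single-quoted values.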
 
Ant

So What should I do to get the exact value (here the value after
'href=') in any case even if the tags are like these? >>

<link rel="stylesheet" href="mystylesheet.css" type="text/css">
-OR-
<link href="mystylesheet.css" rel="stylesheet" type="text/css">
-OR-
<link type="text/css" href="mystylesheet.css" rel="stylesheet">

The following should do it:

expr = r'<link .*?href="(.*?)"'

or if single quotes might have been used:

expr = r'''<link .*?href=["'](.*?)['"]'''

But as the others have said, Beautiful Soup is very good for things
like this.
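
As a quick check, the second expression above can be run against the three orderings from the original post. The sketch below assumes Python 3 and the standard re module. The name strict_expr is a hypothetical tightened variant added for comparison, since .*? is free to run past the end of the <link> tag when other tags follow on the same line.

```python
import re

# Pattern from the post above; accepts single or double quotes.
expr = r'''<link .*?href=["'](.*?)['"]'''

# Hypothetical stricter variant: [^>]* keeps the search inside one tag.
strict_expr = r'''<link\b[^>]*?\bhref=["']([^"']*)["']'''

tags = [
    '<link rel="stylesheet" href="mystylesheet.css" type="text/css">',
    '<link href="mystylesheet.css" rel="stylesheet" type="text/css">',
    '<link type="text/css" href="mystylesheet.css" rel="stylesheet">',
]
hrefs = [re.search(expr, tag).group(1) for tag in tags]
print(hrefs)  # ['mystylesheet.css', 'mystylesheet.css', 'mystylesheet.css']

# A <link> without href, followed by another tag on the same line.
tricky = '<link rel="stylesheet"> <a href="other.html">'
print(re.search(expr, tricky).group(1))  # other.html (wrong tag)
print(re.search(strict_expr, tricky))    # None
```

Both patterns remain approximations of HTML syntax, which is one reason a real parser tends to be the more robust choice.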
 
Paul McGuire

Pyparsing is also good for recognizing basic HTML tags and their
attributes, regardless of the order of the attributes.

-- Paul

testText = """sldkjflsa;faj

<link href="mystylesheet.css" rel="stylesheet" type="text/css">

here it would be 'mystylesheet.css'. I used the following regex to get
this value(I dont know if it

I thought I was doing fine until I got stuck by this tag >>

<link rel="stylesheet" href="mystylesheet.css" type="text/css"> : same

tag but with 'href=' part

tags are like these? >>

<link rel="stylesheet" href="mystylesheet.css" type="text/css">
-OR-
<link href="mystylesheet.css" rel="stylesheet" type="text/css">
-OR-
<link type="text/css" href="mystylesheet.css" rel="stylesheet">

"""
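from pyparsing import makeHTMLTags, line

# makeHTMLTags returns (openTag, closeTag); only the opening tag is needed
linkTag = makeHTMLTags("link")[0]
for toks, s, e in linkTag.scanString(testText):
    print(toks.href)          # attribute values are available by name
    print(line(s, testText))  # the source line where the match starts
    print()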

Prints out:

mystylesheet.css
<link href="mystylesheet.css" rel="stylesheet" type="text/css">

mystylesheet.css
<link rel="stylesheet" href="mystylesheet.css" type="text/css"> : same


mystylesheet.css
<link rel="stylesheet" href="mystylesheet.css" type="text/css">

mystylesheet.css
<link href="mystylesheet.css" rel="stylesheet" type="text/css">

mystylesheet.css
<link type="text/css" href="mystylesheet.css" rel="stylesheet">
 
