String Regex problem


Fazer

Hello,

I have a string that contains a URL (beginning with http://) somewhere in
it. I want to detect that URL and print just the URL. Since I
am very poor at regex, can someone show me how to do it using a few
examples?

Thanks a lot!
 

djw

Fazer said:
Hello,

I have a string that contains a URL (beginning with http://) somewhere in
it. I want to detect that URL and print just the URL. Since I
am very poor at regex, can someone show me how to do it using a few
examples?

Thanks a lot!

I would look here to improve your regex skills:

http://www.amk.ca/python/howto/regex/

Also, I find Kodos to be invaluable in developing and debugging regexes.
Highly recommended.

http://kodos.sourceforge.net

Of course, you could just use urlparse in the standard library...
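For readers who want to try the urlparse route, here is a minimal sketch. It assumes Python 3, where the module lives at urllib.parse, and it assumes the URL is separated from the surrounding text by whitespace. The function name extract_http_urls is made up for illustration. Note that urlparse does not search a string for URLs; it only splits a candidate into parts, so the sketch checks each whitespace-separated token.

```python
from urllib.parse import urlparse


def extract_http_urls(text):
    """Return whitespace-separated tokens that parse as http:// URLs."""
    urls = []
    for token in text.split():
        try:
            parts = urlparse(token)
        except ValueError:
            # urlparse can reject malformed input such as an unbalanced '['
            continue
        # Require the http scheme and a non-empty host part
        if parts.scheme == "http" and parts.netloc:
            urls.append(token)
    return urls


print(extract_http_urls("docs at http://example.com/a?b=1 and ftp://x.org/f"))
```

Because the token is returned as-is, punctuation glued to the end of a URL (a trailing period, for instance) stays attached.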

Good luck,

Don
 

Andrei

Skip Montanaro wrote on Mon, 24 Nov 2003 21:35:48 -0600:
Don> http://kodos.sourceforge.net

If you're a Mac Python person there's also Dinu Gherman's excellent
RegexPlor:

http://starship.python.net/crew/gherman/RegexPlor.html
<snip>

I'm biased here, but Kiki (http://project5.freezope.org/kiki) is
cross-platform and depends on wxPython rather than Qt, which is much easier
for Windows users.

Anyway, here's a regex I ripped out of my own code - you might want to
simplify it though:

"""Regex for finding URLs:
URLs start with http(s)/ftp/news ((http)|(ftp)|(news))
followed by ://
then any number of non-whitespace characters including
numbers, dots, forward slashes, commas, question marks,
ampersands, equality signs, dashes, underscores and plusses,
but ending in a non-dot and non-plus!

Result:

(?:http|https|ftp|news)://(?:[@a-zA-Z0-9,/%:\&+#\?=\-_~;]+\.*)+[a-zA-Z0-9,/%:\&#\?=\-_]

Tests:
Plain old link: http://www.mail.yahoo.com.
Containing numbers: ftp://bla.com/di~ng/co.rt,39,%93 or other
Go to news://bl_a.com/?ha-h+a&query=tb for more info.
A real link: <a href="http://x.com">http://x.com</a>.
ftp://verylong.org/url/must/be/chopped/to/pieces/oritwontfit.html
(long one)
<IMG src="http://b.com/image.gif" /> (a plain image tag)
<a href=http://fixedlink.com/orginialinvalid.html>fixed</a> (original
invalid HTML)
Link containing an anchor
<b>"http://myhomepage.com/index.html#01"</b>.
"""
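To see the pattern above in action, here is a small sketch that compiles it unchanged and runs it over a few of the test strings with re.findall. The names URL_RE and find_urls are made up for illustration, and the sketch assumes Python 3. Since the pattern contains only non-capturing groups, findall returns the full matched URLs.

```python
import re

# The pattern from the post, copied verbatim and split over three lines
URL_RE = re.compile(
    r"(?:http|https|ftp|news)://"
    r"(?:[@a-zA-Z0-9,/%:\&+#\?=\-_~;]+\.*)+"
    r"[a-zA-Z0-9,/%:\&#\?=\-_]"
)


def find_urls(text):
    """Return every substring of text matched by URL_RE."""
    return URL_RE.findall(text)


tests = [
    "Plain old link: http://www.mail.yahoo.com.",
    "Go to news://bl_a.com/?ha-h+a&query=tb for more info.",
    'A real link: <a href="http://x.com">http://x.com</a>.',
]
for line in tests:
    print(find_urls(line))
```

The trailing dot of the first test is left out of the match, as the description promises. One caveat: the nested repetition in the middle group can backtrack heavily on long input that starts like a URL but never reaches a valid final character, so it may be worth simplifying before using it on untrusted text.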

--
Yours,

Andrei

 

Fazer

djw said:
I would look here to improve your regex skills:

http://www.amk.ca/python/howto/regex/

Also, I find Kodos to be invaluable in developing and debugging regexes.
Highly recommended.

http://kodos.sourceforge.net

Of course, you could just use urlparse in the standard library...

Good luck,

Don

Wow, awesome! Thanks a lot for Kodos. I hope I find it useful. I
have actually found a better solution than using a regex at all.

Here's my solution and I think it works well:
[x for x in moo.split(' ') if x.startswith('http://')]
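A runnable variation on the one-liner above, as a sketch only: it assumes Python 3, the function name find_http_tokens is made up for illustration, and it differs from the original in two labeled ways. It splits on any whitespace (split() with no argument) rather than on single spaces, and it strips common trailing punctuation so a URL at the end of a sentence does not keep its period.

```python
def find_http_tokens(text):
    """Return whitespace-separated tokens that start with http://."""
    return [
        token.rstrip(".,;:!?")  # assumption: trailing punctuation is not part of the URL
        for token in text.split()
        if token.startswith("http://")
    ]


moo = "Plain old link: http://www.mail.yahoo.com. See also\nhttp://kodos.sourceforge.net"
print(find_http_tokens(moo))
```

This approach only finds URLs that begin a token, so a link wrapped in quotes or HTML such as href="http://x.com" is missed; that case is where a regex still helps.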
 
