Use Regular Expressions to extract URLs

Jimbo

Hello

I am using regular expressions to grab URLs from a string (of HTML
code). It is going well and I seem to be grabbing the full URL, but
I also get a '"' character at the end of it. Do you know how I can get
rid of the '"' character at the end of my URL?

Example of problem:
I get this when I extract a URL from a string:
http://google.com"

I want to get this:
http://google.com

My regular expression:
Code:
import re

def find_urls(string):
    """ Extract all URLs from a string & return as a list """

    url_list = re.findall(r'(?:http://|www\.).*?["]', string)
    return url_list
 
Steven D'Aprano

Live dangerously and just drop the last character from string s no matter
what it is:

s = s[:-1]


Or be a little more cautious and test first:

if s.endswith('"'):
    s = s[:-1]
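
Since find_urls returns a list, the cautious check can be applied to every entry at once. This is only a sketch: the helper name strip_trailing_quote and the sample list are made up for illustration.

```python
def strip_trailing_quote(urls):
    """Return a new list with one trailing '"' removed from each entry, if present."""
    return [u[:-1] if u.endswith('"') else u for u in urls]


# Hypothetical sample of what find_urls might return.
raw = ['http://google.com"', 'www.python.org"', 'http://example.com']
cleaned = strip_trailing_quote(raw)
print(cleaned)
```

Entries that do not end in a quote are passed through unchanged, so it should be safe to run on mixed results.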


Or fix the problem at the source. Using regexes to parse HTML is always
problematic. You should consider using a proper HTML parser. Otherwise,
try this regex:

r'"((?:http://|www\.).*?)"'
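
The idea behind a regex like this is that the quotes are matched but sit outside the parentheses, and when a pattern has exactly one group, re.findall returns only what the group captured. A minimal sketch, assuming the URLs are always wrapped in double quotes; the pattern below is an illustrative variant and the sample HTML is made up:

```python
import re

# Illustrative variant: the group captures the URL, the quotes stay outside it.
# [^"]* stops at the closing quote without needing a non-greedy match.
URL_IN_QUOTES = re.compile(r'"((?:http://|www\.)[^"]*)"')


def find_urls(html):
    """Return every double-quoted value that starts with http:// or www."""
    return URL_IN_QUOTES.findall(html)


sample = '<a href="http://google.com">Google</a> <a href="www.python.org/doc">Docs</a>'
urls = find_urls(sample)
print(urls)
```

URLs in single quotes or without quotes would be missed by this pattern, which is one reason a parser may be the safer route.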
 
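For the HTML parser route, a rough sketch using html.parser from the Python 3 standard library could look like this. The class name LinkCollector, the choice of the href and src attributes, and the sample HTML are assumptions for illustration, not something from the original post.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href and src attribute values that look like URLs."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name in ("href", "src") and value and value.startswith(("http://", "www.")):
                self.urls.append(value)


def find_urls(html):
    """Feed the HTML to the parser and return the collected URLs."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


sample = '<a href="http://google.com">Google</a><img src="www.example.com/logo.png">'
found = find_urls(sample)
print(found)
```

The parser hands over attribute values with the surrounding quotes already removed, so there is nothing to strip afterwards.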
