How to strip the domain name in Python?

Marko.Cain.23

Hi,

I have a list of url names like this, and I am trying to strip out the
domain name using the following code:

http://www.cnn.com
www.yahoo.com
http://www.ebay.co.uk

pattern = re.compile("http:\\\\(.*)\.(.*)", re.S)
match = re.findall(pattern, line)

if (match):
    s1, s2 = match[0]
    print s2

But none of the sites matched. Can you please tell me what I am
missing?

Thank you.
 
Alex Martelli

To start with, you're using backslashes in your RE pattern, while
the URLs contain forward slashes (or no slashes at all, in the case
of the second one).

Anyway, forget REs and use the standard library module urlparse,
specifically its urlparse.urlsplit function.


Alex
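
For anyone following along on Python 3, where the urlparse module was folded into urllib.parse, here is a minimal sketch of what urlsplit returns for the three sample URLs. Only the function name comes from the post above; the loop, the urls list and the print format are just for illustration. Note that a URL without a scheme, like www.yahoo.com, ends up in the path field rather than in the network location.

```python
# Python 3 spelling; on Python 2 this was: from urlparse import urlsplit
from urllib.parse import urlsplit

urls = ["http://www.cnn.com", "www.yahoo.com", "http://www.ebay.co.uk"]

for url in urls:
    parts = urlsplit(url)
    # netloc is only filled in when the URL contains "//"
    print("%-22s netloc=%r path=%r" % (url, parts.netloc, parts.path))
```

So any helper built on urlsplit probably needs a fallback for scheme-less input.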
 
Michael Bentley

Change

re.compile("http:\\\\(.*)\.(.*)", re.S)

to

re.compile("http:\/\/(.*)\.(.*)", re.S)
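
A quick way to see why the original pattern never matched, sketched in Python 3 syntax. The names original and fixed are made up for this illustration. In a normal string literal, "\\\\" produces two backslash characters, which the regex engine reads as one literal backslash, so the pattern looks for http:\ rather than http://. Forward slashes have no special meaning in Python regular expressions, so they need no escaping, and a raw string keeps the remaining backslash readable.

```python
import re

line = "http://www.cnn.com"

# The regex seen by the engine is  http:\\(.*)\.(.*)  which wants
# "http:" followed by one literal backslash, so a normal URL never matches.
original = re.compile("http:\\\\(.*)\\.(.*)", re.S)
print(original.findall(line))                  # []
print(original.findall("http:\\www.cnn.com"))  # [('www.cnn', 'com')]

# Same pattern with plain forward slashes, written as a raw string.
fixed = re.compile(r"http://(.*)\.(.*)", re.S)
print(fixed.findall(line))                     # [('www.cnn', 'com')]
```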
 
Marko.Cain.23

Thanks. I tried this:

pattern = re.compile("http:\/\/(.*)\.(.*)", re.S)
match = re.findall(pattern, line)

if (match):
    s1, s2 = match[0]
    print s2

But when 'line' is http://www.cnn.com, s2 is 'com'. I want 'cnn.com'
(everything after the first '.'). How can I do that?
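
One likely explanation for the 'com' result, with a hedged sketch in Python 3 syntax (the names greedy and lazy are only for illustration): the first (.*) is greedy, so it takes as much as it can and leaves only the text after the last dot for the second group. Adding ? makes that group stop at the first dot instead.

```python
import re

line = "http://www.cnn.com"

# Greedy first group: runs up to the LAST dot.
greedy = re.findall(r"http://(.*)\.(.*)", line)
print(greedy)   # [('www.cnn', 'com')]

# Non-greedy first group: stops at the FIRST dot.
lazy = re.findall(r"http://(.*?)\.(.*)", line)
print(lazy)     # [('www', 'cnn.com')]
```

This pattern still requires the http:// prefix, so a bare www.yahoo.com would not match it.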
 
Marko.Cain.23

Can anyone please help me with my problem? I still can't solve it.

Basically, I want to keep only the text after the first '.' in the
URL:

http://www.cnn.com -> cnn.com
 
Marc 'BlackJack' Rintsch

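from urlparse import urlsplit

def get_domain(url):
    # Index 1 of the split result is the network location, e.g. 'www.cnn.com'
    net_location = urlsplit(url)[1]
    # Keep only the last two dot-separated labels
    return '.'.join(net_location.rsplit('.', 2)[-2:])

def main():
    print get_domain('http://www.cnn.com')

if __name__ == '__main__':
    main()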

Ciao,
Marc 'BlackJack' Rintsch
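
For readers on Python 3, a rough port of the function above might look like the following. It assumes the same logic and only swaps in urllib.parse and the .netloc attribute (equivalent to index 1 of the result). The second call is an extra input added here to show what "keep the last two labels" does with a longer host name.

```python
from urllib.parse import urlsplit

def get_domain(url):
    # Network location, e.g. 'www.cnn.com'
    net_location = urlsplit(url).netloc
    # Split at most twice from the right and keep the last two labels
    return ".".join(net_location.rsplit(".", 2)[-2:])

print(get_domain("http://www.cnn.com"))      # cnn.com
print(get_domain("http://www.ebay.co.uk"))   # co.uk
```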
 
Marko.Cain.23

Thanks for your help.

But if the input string is "http://www.ebay.co.uk/", I only get
"co.uk".

How can I change it so that it works for both www.ebay.co.uk and www.cnn.com?
 
Steve Holden

>>> from urlparse import urlsplit
>>> def get_domain(url):
...     net_location = urlsplit(url)[1]
...     return net_location.split(".", 1)[1]
...

regards
Steve
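
A small sketch of how the split(".", 1) approach behaves, in Python 3 syntax. The function name drop_first_label is hypothetical and chosen to describe what the code does: it discards everything up to the first dot of the host, whatever that first label happens to be. The last call is an extra input added here to show the consequence for hosts without a www prefix.

```python
from urllib.parse import urlsplit

def drop_first_label(url):
    # Hypothetical name; the logic is the one-line split shown above.
    net_location = urlsplit(url).netloc
    # Split once at the first dot and keep the remainder
    return net_location.split(".", 1)[1]

print(drop_first_label("http://www.cnn.com"))       # cnn.com
print(drop_first_label("http://www.ebay.co.uk/"))   # ebay.co.uk
print(drop_first_label("http://cnn.com"))           # com
```

A scheme-less input such as www.yahoo.com has an empty netloc, so this version would raise IndexError for it.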
 
Michael Bentley

from urlparse import urlsplit

def get_domain(url):
    net_location = (
        urlsplit(url)[1]
        and urlsplit(url)[1].split('.')
        or urlsplit(url)[2].split('.')
    ) # tricksy way to get long line into email
    if net_location[0].lower() == 'www':
        net_location = net_location[1:]
    return '.'.join(net_location)

def main():
    testItems = ['http://www.cnn.com',
                 'www.yahoo.com',
                 'http://www.ebay.co.uk']

    for testItem in testItems:
        print get_domain(testItem)

if __name__ == '__main__':
    main()
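
A possible Python 3 rendering of the same idea, offered as a sketch rather than a drop-in replacement. The name strip_www is hypothetical. It differs from the code above in one assumed detail: when there is no network location it uses only the part of the path before the first "/", so that an input like www.yahoo.com/news would not drag the path along.

```python
from urllib.parse import urlsplit

def strip_www(url):
    parts = urlsplit(url)
    # Fall back to the start of the path for scheme-less input
    host = parts.netloc or parts.path.split("/", 1)[0]
    labels = host.split(".")
    # Drop a leading 'www' label only, case-insensitively
    if labels[0].lower() == "www":
        labels = labels[1:]
    return ".".join(labels)

for item in ["http://www.cnn.com", "www.yahoo.com", "http://www.ebay.co.uk"]:
    print(strip_www(item))
```

For the three sample URLs this should print cnn.com, yahoo.com and ebay.co.uk. Ports and user:password prefixes in the host are not handled.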
 
