Splitting URLs

Steven D'Aprano · Oct 21, 2007

I'm trying to split a URL into components. For example:

URL = 'http://steve:secret@www.domain.com.au:82/dir' + \
'ectory/file.html;params?query#fragment'

(joining the strings above with plus has no significance, it's just to
avoid word-wrapping)

If I split the URL, I would like to get the following components:

scheme = 'http'
netloc = 'steve:secret@www.domain.com.au:82'
username = 'steve'
password = 'secret'
hostname = 'www.domain.com.au'
port = 82
path = '/directory/file.html'
parameters = 'params'
query = 'query'
fragment = 'fragment'

I can get *most* of the way with urlparse.urlparse: it will split the URL
into a tuple:

('http', 'steve:secret@www.domain.com.au:82', '/directory/file.html',
'params', 'query', 'fragment')
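
For reference, a minimal sketch of the call that yields that tuple. The import fallback is an assumption added here so the snippet runs both with the old urlparse module and with its current home, urllib.parse; the URL is the example from above.

```python
try:
    from urllib.parse import urlparse   # Python 3 location
except ImportError:
    from urlparse import urlparse       # Python 2 location

URL = ('http://steve:secret@www.domain.com.au:82/dir'
       'ectory/file.html;params?query#fragment')

# urlparse returns six fields:
# (scheme, netloc, path, params, query, fragment)
parts = tuple(urlparse(URL))
print(parts)
```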

If I'm using Python 2.5, I can split the netloc field further with named
attributes. Unfortunately, I can't rely on Python 2.5 (for my sins I have
to support 2.4). Before I write code to split the netloc field by hand (a
nuisance, but doable) I thought I'd ask if there was a function somewhere
in the standard library I had missed.
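
Splitting the netloc by hand might look roughly like the sketch below. The function name split_netloc is hypothetical, not something from the standard library, and the sketch assumes the usual user:password@host:port layout with no special handling beyond that.

```python
def split_netloc(netloc):
    """Split 'user:password@host:port' into its four parts.

    Hypothetical helper: missing parts come back as None.
    """
    username = password = port = None
    hostport = netloc

    # Everything before the last '@' is the userinfo part.
    at = netloc.rfind('@')
    if at != -1:
        userinfo, hostport = netloc[:at], netloc[at + 1:]
        if ':' in userinfo:
            username, password = userinfo.split(':', 1)
        else:
            username = userinfo

    # A trailing ':digits' is taken to be the port.
    hostname = hostport
    colon = hostport.rfind(':')
    if colon != -1 and hostport[colon + 1:].isdigit():
        hostname = hostport[:colon]
        port = int(hostport[colon + 1:])

    return username, password, hostname, port


print(split_netloc('steve:secret@www.domain.com.au:82'))
```

Returning None for absent parts keeps "no password" distinct from "empty password"; whether that distinction matters depends on the application.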

This second question isn't specifically Python related, but I'm asking it
anyway...

I'd also like to split the domain part of an HTTP netloc into top level
domain (.au), second level (.com), etc. I don't need to validate the TLD,
I just need to split it. Is splitting on dots sufficient, or will that
miss some odd corner case of the HTTP specification?

(If it does, I might decide to live with the lack... it depends on how
odd the corner is, and how much work it takes to fix.)

Tim Chase · Oct 21, 2007

URL = 'http://steve:secret@www.domain.com.au:82/dir' + \
'ectory/file.html;params?query#fragment'

If I split the URL, I would like to get the following components:

scheme = 'http'
netloc = 'steve:secret@www.domain.com.au:82'
username = 'steve'
password = 'secret'
hostname = 'www.domain.com.au'
port = 82
path = '/directory/file.html'
parameters = 'params'
query = 'query'
fragment = 'fragment'

I can get *most* of the way with urlparse.urlparse: it will split the URL
into a tuple:

('http', 'steve:secret@www.domain.com.au:82', '/directory/file.html',
'params', 'query', 'fragment')

If I'm using Python 2.5, I can split the netloc field further with named
attributes. Unfortunately, I can't rely on Python 2.5 (for my sins I have
to support 2.4). Before I write code to split the netloc field by hand (a
nuisance, but doable) I thought I'd ask if there was a function somewhere
in the standard library I had missed.

There are some goodies in urllib for doing some of this
splitting. Example code is at the bottom of my reply (though it
seems to choke on certain protocols such as "mailto:" and "ssh:"
because urlparse doesn't return the netloc properly).

This second question isn't specifically Python related, but I'm asking it
anyway...

I'd also like to split the domain part of an HTTP netloc into top level
domain (.au), second level (.com), etc. I don't need to validate the TLD,
I just need to split it. Is splitting on dots sufficient, or will that
miss some odd corner case of the HTTP specification?

I believe that dots are the sanctioned separator, HOWEVER, you
can have a non-qualified machine-name with local scope, so you
can easily have NO TLD, such as

http://user:password@localhost:8000/path/to/thing

There's also the ambiguity of what "TLD" means if you use IP
addresses:

http://user:[email protected]:8000/path/to/thing

Does that make the TLD "1"? Other odd edge-cases that are
usually allowable (but frowned upon, mostly used by
spammers/phishers) include using a long-int as the domain-name,
such as

http://user:password@2130706433:8000/path/to/thing
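
To make those edge cases concrete, here is a rough sketch of a dot-splitter that refuses to invent a TLD for numeric hosts. The helper names (looks_like_ipv4, int_host_to_dotted_quad, domain_labels) are hypothetical, and the rules are assumptions: a host counts as numeric if it is all digits or a dotted quad, and a trailing dot (as in a fully qualified DNS name) is dropped before splitting.

```python
import socket
import struct


def looks_like_ipv4(host):
    """True for a dotted quad such as '192.168.1.2'."""
    parts = host.split('.')
    return (len(parts) == 4 and
            all(p.isdigit() and int(p) <= 255 for p in parts))


def int_host_to_dotted_quad(host):
    """Convert a long-int host such as '2130706433' to dotted-quad form."""
    return socket.inet_ntoa(struct.pack('!I', int(host)))


def domain_labels(host):
    """Return the labels of a host name, top level first.

    Numeric hosts have no domain labels, so they give an empty list.
    """
    host = host.rstrip('.')          # 'www.example.com.' is a legal FQDN
    if host.isdigit() or looks_like_ipv4(host):
        return []
    labels = host.split('.')
    labels.reverse()
    return labels


print(domain_labels('www.domain.com.au'))
print(domain_labels('localhost'))
print(int_host_to_dotted_quad('2130706433'))
```

With this, 'localhost' yields a single label and no separate TLD, while the long-int host turns out to be just another spelling of 127.0.0.1.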

In an attempt to play with these functions, I present the code below.

-tkc

import urlparse, urllib

tests = (
    'http://steve:secret@www.domain.com.au:82/'
    'directory/file.html;params?query#fragment',
    'http://user:[email protected]/path/to/thing/',
    'http://192.168.1.2/path/to/thing/',
    'http://2130706433/path/to/thing/',
    'http://localhost/path/to/thing/',
    'http://user:password@localhost/path/to/thing/',
    'telnet://[email protected]',
    'ssh://[email protected]',
    'gopher://wais.example.edu',
    'svn+ssh://user[email protected]/svn/here/there/',
    'mailto:[email protected]',
    )

def is_ip_address(s):
    # True only for a dotted quad: exactly four parts, each in 0..255
    parts = s.split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        try:
            if not 0 <= int(part) <= 255:
                return False
        except ValueError:
            # a non-numeric part means this is a host name, not an IP
            return False
    return True

def steve_parse(url):
    (scheme, netloc, path,
     params, query, fragment) = urlparse.urlparse(url)
    # peel the netloc apart: user:password@host:port
    creds, host = urllib.splituser(netloc)
    username, password = urllib.splitpasswd(creds or '')
    host, port = urllib.splitport(host)
    # only split off a TLD for dotted names that are not IP addresses
    if '.' in host and not is_ip_address(host):
        domain, tld = host.rsplit('.', 1)
    else:
        domain = host
        tld = ''
    return (
        scheme, username, password,
        domain, tld, port,
        path, params, query,
        fragment)

if __name__ == '__main__':
    for test in tests:
        print test
        (scheme, username, password,
         domain, tld, port,
         path, params, query,
         fragment) = steve_parse(test)
        print '\tScheme: ', scheme
        print '\tUsername: ', username
        print '\tPassword: ', password
        print '\tDomain: ', domain
        print '\tTLD: ', tld
        print '\tPort: ', port
        print '\tPath: ', path
        print '\tParams: ', params
        print '\tQuery: ', query
        print '\tFragment: ', fragment
        print '='*50

Steven D'Aprano · Oct 22, 2007

there are some goodies in urllib for doing some of this splitting.
Example code at the bottom of my reply (though it seems to choke on
certain protocols such as "mailto:" and "ssh:" because urlparse doesn't
return the netloc properly)

It doesn't? That's... bad. But for my application, probably not
important: I only care about HTTP.

Thanks for the reply and sample code.

Tim Chase · Oct 22, 2007

there are some goodies in urllib for doing some of this splitting.

It doesn't? That's... bad. But for my application, probably not
important: I only care about HTTP.

This seems to be intentional, rather than a bug. In my
python2.4/urlparse.py file, there's a uses_netloc list which
clearly does not have 'mailto' in it. I can't give an
explanation/justification for it, but it seems to me (IMHO) that
there is a netloc involved in a mail address.

Or maybe I have a semantic misunderstanding of what the netloc
field means when returned from urlparse.urlparse. However, since
this is where the hostname appears in "http", it makes me think
that the hostname from a mailto URL should also appear in this
result field.
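
A small sketch of the difference being described, run against whichever urlparse is available. The example URLs are made up for illustration. On a current Python 3 interpreter the netloc is only filled in when the scheme is followed by '//', so the mailto address lands in the path; Python 2.4 consults the uses_netloc list instead, so results for other schemes such as "ssh:" may differ there.

```python
try:
    from urllib.parse import urlparse   # Python 3 location
except ImportError:
    from urlparse import urlparse       # Python 2 location

# Illustrative URLs, not taken from the thread
for url in ('http://www.example.com/index.html',
            'mailto:someone@example.com'):
    # first three fields of the result: scheme, netloc, path
    scheme, netloc, path = urlparse(url)[:3]
    print('%-8s netloc=%r path=%r' % (scheme, netloc, path))
```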

-tkc

Paul Boddie · Oct 22, 2007

This seems to be intentional, rather than a bug. In my
python2.4/urlparse.py file, there's a uses_netloc list which
clearly does not have 'mailto' in it. I can't give an
explanation/justification for it, but it seems to me (IMHO) that
there is a netloc involved in a mail address.

As is often the case with the standard library, there are various open
issues around the functionality:

http://bugs.python.org/issue?@filter=status&status=-1,1,3&@search_text=RFC+3986

This proposed module (in the above search results) attempts to
implement RFC 3986:

http://bugs.python.org/issue1500504

I'm not sure whether itools.uri goes as far as you might like:

http://download.ikaaro.org/doc/itools/chapter--uri.html

Either way, after listening to Ron Stephens' most recent Python411
podcast, where he mentions that it's apparently up to the community to
fix the standard library (according to GvR and the core developers),
perhaps there's some demand for a "Python 300" which just cleans up
the standard library in a potentially (but not necessarily) backwards-
incompatible fashion.

Paul
