Steven D'Aprano
I'm trying to split a URL into components. For example:
URL = 'http://steve:secret@www.domain.com.au:82/dir' + \
'ectory/file.html;params?query#fragment'
(joining the strings above with plus has no significance, it's just to
avoid word-wrapping)
If I split the URL, I would like to get the following components:
scheme = 'http'
netloc = 'steve:secret@www.domain.com.au:82'
username = 'steve'
password = 'secret'
hostname = 'www.domain.com.au'
port = 82
path = '/directory/file.html'
parameters = 'params'
query = 'query'
fragment = 'fragment'
I can get *most* of the way with urlparse.urlparse: it will split the URL
into a tuple:
('http', 'steve:secret@www.domain.com.au:82', '/directory/file.html',
'params', 'query', 'fragment')
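For reference, here is a minimal sketch of that call. The try/except import is an assumption added so the snippet also runs on Python 3, where the same function lives in urllib.parse; on 2.4 only the first import is needed.

```python
try:
    from urlparse import urlparse      # Python 2.x
except ImportError:
    from urllib.parse import urlparse  # same function on Python 3

URL = ('http://steve:secret@www.domain.com.au:82/dir'
       'ectory/file.html;params?query#fragment')

# urlparse returns six fields; tuple() flattens the result on every version.
parts = tuple(urlparse(URL))
scheme, netloc, path, parameters, query, fragment = parts
print(parts)
```

The netloc field comes back as one undivided string, which is the part that still needs splitting.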
If I'm using Python 2.5, I can split the netloc field further with named
attributes. Unfortunately, I can't rely on Python 2.5 (for my sins I have
to support 2.4). Before I write code to split the netloc field by hand (a
nuisance, but doable) I thought I'd ask if there was a function somewhere
in the standard library I had missed.
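To make the "by hand" option concrete, this is roughly what I have in mind. It is only a sketch: split_netloc is a hypothetical helper name, not anything in the standard library, and it assumes the netloc has the form user:password@host:port with every part except the host optional. It sticks to string methods that exist in 2.4 (no str.partition).

```python
def split_netloc(netloc):
    """Split a netloc into (username, password, hostname, port).

    Hypothetical helper, not a stdlib function. Missing parts are None.
    """
    username = password = port = None
    hostinfo = netloc
    if '@' in netloc:
        # Treat everything before the last '@' as the user information.
        at = netloc.rfind('@')
        userinfo, hostinfo = netloc[:at], netloc[at + 1:]
        if ':' in userinfo:
            username, password = userinfo.split(':', 1)
        else:
            username = userinfo
    hostname = hostinfo
    # A bracketed IPv6 literal contains colons, so a port separator only
    # counts if it comes after the closing ']' (rfind gives -1 if absent).
    colon = hostinfo.rfind(':')
    if colon > hostinfo.rfind(']'):
        hostname, port_text = hostinfo[:colon], hostinfo[colon + 1:]
        if port_text:
            port = int(port_text)  # raises ValueError on a non-numeric port
    return username, password, hostname, port

print(split_netloc('steve:secret@www.domain.com.au:82'))
```

Unlike the 2.5 attributes, this sketch leaves the hostname's case untouched.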
This second question isn't specifically Python related, but I'm asking it
anyway...
I'd also like to split the domain part of a HTTP netloc into top level
domain (.au), second level (.com), etc. I don't need to validate the TLD,
I just need to split it. Is splitting on dots sufficient, or will that
miss some odd corner case of the HTTP specification?
(If it does, I might decide to live with the lack... it depends on how
odd the corner is, and how much work it takes to fix.)
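The naive approach I'm considering looks like the sketch below. split_domain is a hypothetical name, and the code assumes the host is an ordinary dotted domain name. The corner cases I already know it ignores are hosts given as IP addresses (dotted IPv4 or bracketed IPv6), which are not domain names at all; the one thing it does handle is the trailing dot of a fully qualified name.

```python
def split_domain(hostname):
    """Return the labels of hostname ordered from the top level down.

    Hypothetical helper; assumes a plain domain name, not an IP address.
    """
    # A fully qualified name may end in a dot; drop it so that the split
    # does not produce an empty final label.
    if hostname.endswith('.'):
        hostname = hostname[:-1]
    labels = hostname.split('.')
    labels.reverse()
    return labels

print(split_domain('www.domain.com.au'))
```

With the example host, the first label is the top level domain ('au') and the second is 'com'.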