Extract domain name

Charles Calvert

All,

I have the same basic issue as discussed in this thread last year:
<http://groups.google.com/group/comp.lang.ruby/browse_thread/thread/77cfd3250a633c7e/0430f50818a01ddd>.

Justin Collins points out the greatest difficulty with the situation,
i.e. that when dealing with a country code TLD, one may well have a
different number of parts (e.g. example.co.uk) than when dealing with
a gTLD (example.com).

The only solution that has occurred to me is to have a list of known
TLDs and second level domains (e.g. co.uk) that are insufficiently
specific, requiring a subdomain for additional specificity. The
problem is that this requires maintenance as well as initial research.

Does anyone have any suggestions for an alternative method to solve
this problem? I'm currently using Addressable::URI
(http://addressable.rubyforge.org/api/classes/Addressable/URI.html) to
parse the URLs and extract the host names.
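
The list-based approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Ruby's standard `URI` library as a stand-in for Addressable, and both `KNOWN_SUFFIXES` and `registrable_domain` are hypothetical names with a deliberately tiny, hand-written suffix list.

```ruby
require "uri"
require "set"

# Hypothetical, hand-maintained list; a real one needs far more entries.
KNOWN_SUFFIXES = Set.new(%w[com org net uk co.uk org.uk]).freeze

# Returns the suffix plus one extra label (e.g. "example.co.uk"), or nil
# if the host has no known suffix or is itself a bare suffix.
def registrable_domain(url)
  host = URI.parse(url).host
  return nil if host.nil?

  labels = host.downcase.split(".")
  # Walk from the longest candidate suffix to the shortest, so "co.uk"
  # is found before "uk".
  labels.each_index do |i|
    suffix = labels[i..-1].join(".")
    next unless KNOWN_SUFFIXES.include?(suffix)
    return i.zero? ? nil : labels[(i - 1)..-1].join(".")
  end
  nil
end

puts registrable_domain("http://www.example.co.uk/page")
puts registrable_domain("http://www.example.com/")
```

The maintenance burden mentioned above shows up directly here: any suffix missing from the set silently yields `nil` or a wrong split.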
 
Mr zengr

I think the best way will be to actually match against a list of TLDs
and gTLDs.

Mozilla has a list of domains:
http://mxr.mozilla.org/mozilla/source/netwerk/dns/src/effective_tld_names.dat

A stackoverflow question on the same topic:
http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url

Their solution is regex.
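
For anyone who wants to read a list like Mozilla's into Ruby, here is a minimal sketch. It assumes the list format as I understand it: `//` starts a comment, each rule ends at the first whitespace, `*.` marks a wildcard rule and `!` marks an exception to a wildcard. The sample text and the name `parse_suffix_rules` are made up for illustration; the real file is much larger.

```ruby
# A few lines in the style of the list, embedded so the sketch runs offline.
SAMPLE_RULES = <<~DAT
  // comment lines start with two slashes
  com
  uk
  co.uk

  *.ck
  !www.ck
DAT

# Splits the list text into plain, wildcard and exception rules.
def parse_suffix_rules(text)
  rules = { plain: [], wildcard: [], exception: [] }
  text.each_line do |line|
    rule = line.split(/\s/, 2).first.to_s # a rule ends at the first whitespace
    next if rule.empty? || rule.start_with?("//")

    if rule.start_with?("!")
      rules[:exception] << rule[1..-1]   # "!www.ck" -> "www.ck"
    elsif rule.start_with?("*.")
      rules[:wildcard] << rule[2..-1]    # "*.ck" -> "ck"
    else
      rules[:plain] << rule
    end
  end
  rules
end

p parse_suffix_rules(SAMPLE_RULES)
```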
 
Brian Candler

And remember that some things which look like domains have set
themselves up as registries, e.g. uk.com.
 
Charles Calvert

[snip my question about extracting domain name (e.g. "example.com"
from "www.example.com")]

> I think the best way will be to actually match against a list of TLDs
> and gTLDs.
>
> Mozilla has a list of domains:
> http://mxr.mozilla.org/mozilla/source/netwerk/dns/src/effective_tld_names.dat

Wow. That's a big help. Thanks.

> A stackoverflow question on the same topic:
> http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url

Interesting.

> Their solution is regex.

As the poster pointed out, matching everything leads to a huge regex,
which is likely to cause maintenance problems (though he indicated
that they started generating the regex from other data to address
that) and would make me concerned about resource allocation, though I
couldn't find anything in the core Ruby Doc about a max length for a
regex.

On the other hand it might be more performant than looping through a
bunch of substring matches or matching against database records. I
sense some testing in my future.

Thanks,
 
Michael Fellinger

http://github.com/pauldix/domainatrix

--
Michael Fellinger
CTO, The Rubyists, LLC
I check email a couple times daily; to reach me sooner, use:
http://awayfind.com/manveru
 
Charles Calvert

[snip my question about extracting domain name (e.g. "example.com"
from "www.example.com")]

> I think the best way will be to actually match against a list of TLDs
> and gTLDs.

[snip]

> http://github.com/pauldix/domainatrix

For a minute, I thought your reply was generated by a porn spam bot
until I saw github in the URL. :)

For those reading the thread, this is a gem that uses
http://publicsuffix.org/ to parse domain names and identify the suffix
(e.g. "com" or "co.uk"), domain, subdomains, etc. It extends
Addressable::URI.

That was very helpful. Thanks.
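
To illustrate the kind of split described above (suffix, domain, subdomain), here is a plain-Ruby sketch. It is not the gem's API: `split_host`, `HostParts` and `SUFFIX_SET` are hypothetical names, and the suffix set is a tiny hand-written sample that includes "uk.com" to cover the private-registry case raised earlier in the thread.

```ruby
require "set"

# Illustrative subset only; "uk.com" stands in for privately run registries.
SUFFIX_SET = Set.new(%w[com uk co.uk uk.com]).freeze

HostParts = Struct.new(:subdomain, :domain, :public_suffix)

# Splits a host name into subdomain, domain and public suffix by finding
# the longest suffix present in SUFFIX_SET. Returns nil if no suffix
# matches or if the host is itself a bare suffix.
def split_host(host)
  labels = host.downcase.split(".")
  labels.each_index do |i|
    next unless SUFFIX_SET.include?(labels[i..-1].join("."))
    return nil if i.zero?

    return HostParts.new(labels[0...(i - 1)].join("."), # may be empty
                         labels[i - 1],
                         labels[i..-1].join("."))
  end
  nil
end

parts = split_host("www.example.co.uk")
puts "subdomain=#{parts.subdomain} domain=#{parts.domain} suffix=#{parts.public_suffix}"
```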
 
